Cross-Linguistic Sarcasm Detection in Tamil and Malayalam: A Multilingual Approach

Dhanya Krishnan, Krithika Dharanikota and B Bharathi
Department of CSE, Sri Siva Subramaniya Nadar College of Engineering, Rajiv Gandhi Salai, Chennai, Tamil Nadu, India

Forum for Information Retrieval Evaluation, December 15-18, 2023, India. These authors contributed equally.
dhanya2010402@ssn.edu.in (D. Krishnan); krithika2010087@ssn.edu.in (K. Dharanikota); bharathib@ssn.edu.in (B. Bharathi)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
This paper introduces the methodologies developed for the Dravidian-CodeMix - FIRE 2023 task, which focuses on classifying code-mixed comments in Tamil and Malayalam from YouTube videos as either Non-Sarcastic or Sarcastic. The work encompasses essential data preprocessing steps, experimentation with diverse machine learning algorithms and feature vectors, and a comprehensive assessment of model performance. The Tamil model secured the second rank and the Malayalam model secured the top position in the competition. Several methods, including Count Vectorizer with MLP Classifier and with Logistic Regression, TF-IDF Vectorizer with MLP Classifier, and TF-IDF Vectorizer with Random Forest Classifier, exhibited strong performance. Validation accuracy for the Tamil dataset ranged from 0.72 to 0.78, with macro-averaged F1-scores spanning 0.73 to 0.77, while the Malayalam dataset showed validation accuracy between 0.72 and 0.85 and F1-scores between 0.60 and 0.77. These findings underscore the potential of the employed techniques for the intricate task of sarcasm detection and highlight their relevance in mitigating the challenges of linguistic diversity, code-mixing, and class imbalance observed in online content across prominent languages. This research contributes to the domain of cross-linguistic sarcasm detection and bears implications for multilingual sentiment analysis and the identification of cyberbullying within digital ecosystems.

Keywords: Sarcasm Detection, Sentiment Analysis, Dravidian Languages, Tamil, Malayalam

1. Introduction
Sarcasm refers to remarks, typically delivered in a humorous manner, that mock, show irritation, or convey contempt; it is a form of irony. With the recent surge in social media usage, sarcasm has played a vital role in conveying people's frustrations, and it is found across a variety of platforms: tweets, Instagram captions, and YouTube comments. However, because the sentiment behind sarcasm is concealed, identifying and detecting it is a difficult task. Sarcasm identification aids varied applications, from sentiment analysis to the detection of cyberbullying, making it crucial to analyse sarcasm in text. Tamil and Malayalam are Dravidian languages. Tamil is an official language of Tamil Nadu, Puducherry, Singapore, and Sri Lanka, while Malayalam is the official language of Kerala and the union territory of Lakshadweep. Malayalam is thought to be the closest major relative of Tamil, with the divergence of the two languages beginning in the 9th century AD.
Owing to the inherently code-mixed nature of the usage patterns of these languages, sentiment analysis becomes a notably challenging undertaking. Traditional monolingual models often perform poorly on code-mixed datasets, which is attributed to the additional complexity introduced by the code-switching present in such data. The Dravidian-CodeMix - FIRE 2023 shared task [1] was formulated with the intent of encouraging collaborative research to devise suitable solutions for this issue. The organisers presented the task of categorising code-mixed Tamil and Malayalam comments on YouTube videos featuring movie trailers as Non-Sarcastic or Sarcastic, with the aim of analysing the sentiment polarities among viewers.

This work presents the methodologies proposed for the Dravidian-CodeMix - FIRE 2023 task. It encompasses the utilisation of multiple models, with thorough analyses of their performances. The employed machine learning algorithms include Linear SVC, MLP Classifier, Naive Bayes, Random Forest, ULMFiT, and KNN, with feature vectors such as TF-IDF and Count Vectorizer.

The remainder of the paper is organised as follows: Section 2 discusses the literature survey of related works. The descriptions of the data and the proposed methods are detailed in Sections 3 and 4, respectively. Section 5 underscores the achieved results and presents a thorough analysis of the performance of each model. Section 6 concludes the paper and discusses future research.

2. Related Work
The study on sarcasm detection using fuzzy logic [2] suggests a unique ensemble approach based on text embedding that includes fuzzy evolutionary logic at the top layer. The model uses word and phrase embeddings based on Word2Vec, GloVe, and BERT architectures. The model was validated mainly on social media datasets, namely the Headlines dataset, the Self-Annotated Reddit Corpus (SARC), and a Twitter dataset, with the Headlines dataset yielding the highest accuracy. The study on the identification of sarcasm in textual data [3] is a comprehensive review that covered articles on sarcasm detection published between 2008 and 2019, elaborated on their classification approaches, and reviewed their performance metrics, in an attempt to provide researchers with insight into the research domain and to further improve sarcasm identification systems using textual data. The survey also critically analysed various data preparation (pre-processing) techniques and recent classification algorithms for sarcasm identification. In addition, there is a study [4] that covers audio as well as textual data. This study uses a hybrid method that takes as input a combined vector of audio and text features extracted by the models used in their study; the combined features compensate for the shortcomings of text-only features and vice versa. A total of three models were developed, one for textual data, one for audio features, and a hybrid model, and the hybrid model outperforms the results achieved when text and audio were handled individually. Following this, we came across a study on sarcasm identification using machine learning algorithms on Twitter [5], which aimed to improve the performance of the already accurate SVM and CNN-SVM algorithms for sarcasm detection.
It showed that lexical, pragmatic, frequency, and part-of-speech tagging features can contribute to the performance of SVM, whereas both lexical and personal features can enhance the performance of CNN-SVM. That study also recommends the use of two target labels when detecting sarcastic tweets, which was adopted in our study. Another study proposed a simplified method to identify sarcasm within comments on a Bahasa Indonesia YouTube channel [6], enabling the channel or video owner to identify and remove comments that may stir up hatred among the audience. Their approach is claimed to be capable of identifying two types of sarcasm, propositional sarcasm and lexical sarcasm, and the performance of Naive Bayes is reported to be particularly prominent in detecting sarcasm in comments that criticised the government's plan to extend its power in their dataset. We also came across work [7] that aims at developing a system that groups posts based on emotion and sentiment and finds sarcastic posts where present, using data acquired from the lexical databases WordNet and SentiWordNet. The proposed prototype analyses the emotions of posts, namely anger, surprise, happiness, fear, sorrow, trust, anticipation, and disgust, with three levels of severity. It also uses sarcasm detection algorithms such as emoticon sarcasm detection, hybrid sarcasm detection, hashtag processing, and Interjection Word Start (IWT), in the end combining approaches such as emotion detection, the use of emoticons, and patterns to identify whether a social media comment is sarcastic or not. On the other hand, [8] uses two ensemble-based approaches, a voted ensemble classifier and a random forest classifier, in contrast to approaches to sarcasm detection that rely on an existing dataset of positive and negative sentiments for training the classifiers. They used a seeding algorithm to generate the training dataset, and their proposed model also uses a pragmatic classifier to detect emoticon-based sarcasm.

3. Experiment Data
The following section provides a detailed description of the data used in this study as well as the preprocessing techniques employed. The undertaken task is also discussed comprehensively.

3.1. Data Description
The organisers presented participants with the task of analysing code-mixed Tamil and Malayalam text to identify and detect sarcasm. The dataset [9, 10] comprised user comments sourced from YouTube videos showcasing movie trailers. It was divided into two parts on the basis of language: Tamil and Malayalam. The textual content was code-mixed, with the Roman script used alongside the native scripts. Code-mixing, in this context, refers to the simultaneous use of more than one language in a single text; the comments seamlessly integrate English with the respective Dravidian language. The organisers specified that the comments in the corpora contain one sentence on average. Table 1 presents labelled example comments from the Tamil and Malayalam datasets. The Tamil dataset comprises a total of 27,036 training samples, with 7,170 instances labelled as sarcastic and 19,866 categorised as non-sarcastic. Similarly, the Malayalam dataset encompasses 9,798 non-sarcastic comments and 2,259 sarcastic comments, out of a total of 12,057 training samples. There is an evident class imbalance in the data, mirroring real-world scenarios where such imbalances are commonly encountered. Table 2 depicts these statistics.
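As a quick illustration of how the label statistics summarised in Table 2 can be inspected, the following is a minimal sketch assuming pandas and tab-separated training files; the file names and the "text"/"label" column names are hypothetical and only serve to show the idea.

```python
# Minimal sketch: print the class distribution of each training split.
# Assumes pandas; file names and column names are hypothetical.
import pandas as pd

for language, path in [("Tamil", "tamil_train.tsv"), ("Malayalam", "malayalam_train.tsv")]:
    df = pd.read_csv(path, sep="\t")            # one YouTube comment per row
    df = df.dropna(subset=["text", "label"])    # drop entries with missing text or labels
    print(language, "training samples:", len(df))
    print(df["label"].value_counts(), "\n")     # Sarcastic vs Non-Sarcastic counts
```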
3.2. Task Description
The shared task of FIRE 2023 is to identify sarcasm in text. It requires participants to develop a binary classification model that categorises the code-mixed Tamil and Malayalam datasets as "Sarcastic" or "Non-Sarcastic".
Sarcastic: comments which contain sarcastic phrases or instances, i.e., comments whose true meaning diverges from what is written.
Non-Sarcastic: comments which are direct and do not contain hidden intentions.

3.3. Data Pre-processing
The comments in the dataset consist of English letters as well as Tamil and Malayalam script, and special characters and punctuation occur frequently. The preprocessing step loops through a list of special characters and uses the str.replace() method to replace each special character with a space in the text data of the training, validation, and test datasets, as these characters were found to add no useful information to the text. Further, entries with missing values or labels were removed from the dataset. The code first identifies columns in the datasets that are unnamed; the dataset.columns.str.match("Unnamed") call performs this check. Then, the loc accessor is used to select only those columns whose names are not unnamed, thus removing unnecessary columns from the dataset.
Before pre-processing: Ithu 96' !! Anti aging icon Rajani Sir
After pre-processing: Ithu 96 Anti aging icon Rajani Sir

4. Proposed Methodology
This section provides in-depth explanations of each of the experiments conducted for the shared task. Figure 1 depicts the general flow of the proposed methodologies.

4.1. Experiment 1 - Tamil: Count Vectorizer and MLP Classifier
For the first run, this work used a Count Vectorizer to extract character-level n-gram features from the Tamil dataset, with the n-gram length declared as the range (1, 3). To classify the extracted features as sarcastic or non-sarcastic, an MLP Classifier is employed. The model has one hidden layer with 128 neurons. A maximum of 10 iterations was allowed for training on the training set, and to avoid overfitting, the model was instructed to stop training if there was no improvement after 5 iterations. The trained classifier is then evaluated on the validation set with 5-fold cross-validation, and the training and validation scores (Accuracy, Sensitivity, F1-score, Precision) are computed and recorded. This approach resulted in a validation accuracy of 0.78 and a macro-averaged F1-score of 0.73. The model was then used to predict the labels of the Tamil test set; the results released by the organisers show an F1-score of 0.73 on the test set, ranking second among all participants.
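The following is a minimal sketch of the Experiment 1 pipeline described above, assuming scikit-learn. The dataframe variables and the "text"/"label" column names are hypothetical, and the 5-fold cross-validation reported above is omitted for brevity; the hyperparameters mirror the description (character n-grams of length 1 to 3, one hidden layer of 128 neurons, at most 10 iterations, early stopping after 5 iterations without improvement).

```python
# Minimal sketch of Experiment 1 (Tamil): character n-gram counts + MLP classifier.
# Assumes scikit-learn; train_df / val_df and their columns are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score

vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))  # character-level 1- to 3-grams
X_train = vectorizer.fit_transform(train_df["text"])
X_val = vectorizer.transform(val_df["text"])

clf = MLPClassifier(
    hidden_layer_sizes=(128,),  # one hidden layer with 128 neurons
    max_iter=10,                # at most 10 training iterations
    early_stopping=True,        # stop early when the internal validation score
    n_iter_no_change=5,         # fails to improve for 5 consecutive iterations
    random_state=42,
)
clf.fit(X_train, train_df["label"])

val_pred = clf.predict(X_val)
print("Validation accuracy:", accuracy_score(val_df["label"], val_pred))
print("Macro-averaged F1:", f1_score(val_df["label"], val_pred, average="macro"))
```

The same skeleton covers the later experiments that differ only in the feature extractor or classifier, for instance by swapping CountVectorizer for TfidfVectorizer or the MLP for LogisticRegression.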
4.2. Experiment 2 - Tamil: TF-IDF Vectorizer and MLP Classifier
In this experiment, features were extracted from the training and validation sets using a TF-IDF vectorizer based on character-level n-grams with lengths in the range (1, 3). Following feature extraction, an MLP classifier with a single hidden layer of 128 neurons is trained on the training set. As in the first experiment, the model is limited to 10 training iterations and stops after 5 iterations without improvement; this is enforced to avoid overfitting. After training, the classifier is cross-validated with 5 folds on the validation set, and the evaluation metrics are computed for both the training and validation sets. It displayed a validation accuracy of 0.72 and a macro-averaged F1-score of 0.77. The trained classifier was then used to predict the labels of the comments in the test set.

Figure 1: Process flow of the proposed methodologies.

4.3. Experiment 3 - Tamil: TF-IDF Vectorizer and Random Forest Classifier
To combat the evident class imbalance, this approach uses oversampling. First, the labels are converted to their numeric equivalents, with 1 representing "Sarcastic" and 0 representing "Non-Sarcastic". The features are then extracted using a TF-IDF vectorizer based on character-level n-grams. To balance the class distribution, RandomOverSampler is used to oversample the "Sarcastic" class, which has only 7,170 samples compared to 19,866 samples in the "Non-Sarcastic" class. Following this, the Random Forest Classifier is initialised with 100 decision trees, and the class weights are set to {0: 1, 1: 3} so as to assign more weight to the minority class. The classifier is then trained on the resampled training data. After the Random Forest classifier has been trained, the model is evaluated on the validation set and the metrics are calculated, yielding a macro-averaged validation F1-score of 0.72.
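A minimal sketch of this oversampling and class-weighting setup follows, assuming scikit-learn and imbalanced-learn; the dataframe variables and column names are hypothetical.

```python
# Minimal sketch of Experiment 3 (Tamil): TF-IDF character n-grams, random oversampling
# of the minority class, and a class-weighted Random Forest.
# Assumes scikit-learn and imbalanced-learn; train_df / val_df are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler

# Map labels to numeric form: 1 = Sarcastic, 0 = Non-Sarcastic.
y_train = (train_df["label"] == "Sarcastic").astype(int)
y_val = (val_df["label"] == "Sarcastic").astype(int)

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
X_train = vectorizer.fit_transform(train_df["text"])
X_val = vectorizer.transform(val_df["text"])

# Oversample the minority ("Sarcastic") class to balance the training distribution.
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# 100 trees; extra weight on the minority class, as described above.
clf = RandomForestClassifier(n_estimators=100, class_weight={0: 1, 1: 3}, random_state=42)
clf.fit(X_res, y_res)

print("Macro-averaged F1:", f1_score(y_val, clf.predict(X_val), average="macro"))
```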
4.4. Experiment 4 - Malayalam: Count Vectorizer and MLP Classifier
To discriminate between sarcastic and non-sarcastic instances in the Malayalam dataset, the Count Vectorizer with MLP Classifier approach used for the Tamil dataset was applied. The Count Vectorizer was used during feature extraction to collect character-level n-grams, with n-gram lengths ranging from 1 to 3. The dataset statistics showed the distribution, with 2,259 instances in the training set classified as sarcastic and 9,798 as non-sarcastic; the validation set contained 588 sarcastic and 2,427 non-sarcastic instances. The classification model was an MLP Classifier with a single hidden layer of 128 neurons. A maximum of 10 iterations was specified to enable optimal model training, and a monitoring mechanism stopped training if no performance improvement was seen after 5 iterations, thereby reducing the danger of overfitting. The trained classifier was assessed on the validation set using a 5-fold cross-validation approach, and performance measures such as Accuracy, Sensitivity, F1-score, and Precision were recorded. This run produced strong results, with a validation accuracy of 0.85 and a weighted F1-score of 0.64. With an F1-score of 0.64 on the Malayalam test set, this model secured the first position in the competition standings.

4.5. Experiment 5 - Malayalam: Count Vectorizer with Logistic Regression
An alternate approach was used in a parallel analysis of the Malayalam dataset to distinguish between sarcastic and non-sarcastic instances. The Count Vectorizer technique, capturing character-level n-grams with lengths ranging from 1 to 3, again served as the basis for feature extraction. The training set contained 9,798 non-sarcastic and 2,259 sarcastic instances, and the validation set contained 588 sarcastic and 2,427 non-sarcastic instances. A Logistic Regression model, recognised for its clarity and simplicity, was used to build the classifier. The model was trained and then rigorously evaluated through 5-fold cross-validation on the validation set, and metrics such as Accuracy, Sensitivity, F1-score, and Precision were monitored. The result demonstrated the model's proficiency, with a validation accuracy of 0.84 and a weighted F1-score of 0.618.

4.6. Experiment 6 - Malayalam: TF-IDF and MLP Classifier
A different feature-extraction approach was used in a separate investigation on the Malayalam dataset to distinguish between sarcastic and non-sarcastic comments. The Term Frequency-Inverse Document Frequency (TF-IDF) technique served as the cornerstone of this strategy, capturing character-level n-grams with lengths ranging from 1 to 3. The training set had 2,259 sarcastic and 9,798 non-sarcastic instances, and the validation set had 588 sarcastic and 2,427 non-sarcastic instances. The Multi-Layer Perceptron (MLP) Classifier, recognised for its ability to capture complex correlations within data, formed the classification model. After careful training and fine-tuning, the model was rigorously evaluated using 5-fold cross-validation on the validation set, and key performance measures including Accuracy, Sensitivity, F1-score, and Precision were recorded. The results were strong, with a validation accuracy of 0.85 and a weighted F1-score of 0.628. Table 3 summarises the evaluation metrics of the above experiments on the validation dataset.

5. Results and Discussion
Results from the cross-linguistic sarcasm detection experiments in Tamil and Malayalam were revealing. Our team (SSNCSE1) secured high rankings, as displayed in Table 5 and Table 6. Among all the models we experimented with, which are displayed in Table 4, the best-selected models demonstrated excellent performance using a variety of approaches, including Count Vectorizer with MLP Classifier and with Logistic Regression, TF-IDF Vectorizer with MLP Classifier, and TF-IDF Vectorizer with Random Forest Classifier. The validation accuracy for the Tamil dataset ranged from 0.72 to 0.78, while the macro-averaged F1-score varied across the experiments from 0.73 to 0.77. The validation accuracy for the Malayalam dataset was similar, with values between 0.72 and 0.85 and F1-scores between 0.60 and 0.77. These results show how well the models are able to interpret the intricacies of sarcasm in code-mixed comments sourced from YouTube videos in both Tamil and Malayalam. This work evaluates the performance of the models using the F1-score, the harmonic mean of precision and recall, which is well suited for evaluating sarcasm detection models because it balances both false positives and false negatives. In the context of sarcasm, where subtle linguistic nuances are crucial, the F1-score provides a comprehensive measure, ensuring that the model's ability to capture both sarcastic instances and non-sarcastic statements is effectively assessed and offering a reliable performance evaluation.
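For reference, the per-class and averaged scores used above follow the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN) for a given class:

```latex
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]
\[
\text{Macro-}F_1 = \tfrac{1}{2}\left(F_1^{\text{Sarcastic}} + F_1^{\text{Non-Sarcastic}}\right)
\]
```

The weighted F1-scores reported for the Malayalam runs differ from the macro average only in that each class's F1 is weighted by its support rather than averaged equally.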
The results show the complexity of the task of identifying sarcasm across languages and how significant it is for sentiment analysis. In spite of the major hurdles brought on by linguistic variety, code-mixing, and class imbalance, the proposed approaches showed encouraging potential. This study highlights the significance of handling language variation in order to obtain correct insights from content on the internet. It also sheds light on the multilingual complexities of sarcasm detection and provides the foundation for future developments in sentiment analysis, such as restricting the posting of sarcastic comments under videos posted online. The same tendencies seen in the experiments conducted in Tamil and Malayalam illustrate the difficulties that code-mixing and linguistic traits present in both languages, pointing to the necessity of multilingual techniques to handle the same tasks in several languages.

Table 4. Weighted F1-scores of the machine learning models on the evaluation dataset.
Table 5. Sarcasm identification task rank list for the Tamil language.
Table 6. Sarcasm identification task rank list for the Malayalam language.

6. Conclusion
In this work, we undertook a thorough investigation of cross-linguistic sarcasm detection in Tamil and Malayalam, two Dravidian languages. Count Vectorizer with MLP Classifier and with Logistic Regression, TF-IDF Vectorizer with MLP Classifier, and TF-IDF Vectorizer with Random Forest Classifier are just a few of the methodologies we used to tackle the challenging task of sarcasm detection and classification in code-mixed text comments from YouTube videos. Our findings not only demonstrated the ability of the proposed models to detect the hidden sentiment underlying sarcastic comments, but also brought to light the difficulties that multilingual sentiment analysis presents. The macro-averaged F1-scores and validation accuracies obtained across the experiments were favourable, demonstrating the promise of our techniques in handling the complexity of sarcasm detection. The study also emphasised the need to tackle issues such as linguistic diversity, code-mixing, and class imbalance, because they are widespread in online content across major languages. Given that both Tamil and Malayalam showed analogous tendencies in our models, we underline the universality of these issues and the need for multilingual techniques for reliable sentiment analysis. As a result, our study contributes to the understanding of cross-linguistic sarcasm detection and establishes the groundwork for future studies in sentiment analysis. The results obtained have implications for practical applications such as sentiment-driven content analysis and the identification of cyberbullying in multilingual environments, in addition to contributing to the field of natural language processing. Our findings emphasise the necessity for flexible and nuanced models capable of understanding these complexities as language nuances continue to shape online interactions.
References
[1] Chakravarthi, Bharathi Raja, N Sripriya, B Bharathi, K Nandhini, Subalalitha Chinnaudayar Navaneethakrishnan, Thenmozhi Durairaj, Rahul Ponnusamy, Prasanna Kumar Kumaresan, Kishore Kumar Ponnusamy, and Charmathi Rajkumar. "Overview of the Shared Task on Sarcasm Identification of Dravidian Languages (Malayalam and Tamil) in DravidianCodeMix." In: Forum for Information Retrieval Evaluation (FIRE) 2023, 2023.
[2] Sharma, D.K.; Singh, B.; Agarwal, S.; Pachauri, N.; Alhussan, A.A.; Abdallah, H.A. Sarcasm Detection over Social Media Platforms Using Hybrid Ensemble Model with Fuzzy Logic. Electronics 2023, 12, 937. URL: https://doi.org/10.3390/electronics12040937
[3] Eke, C.I., Norman, A.A., Liyana Shuib, et al. Sarcasm identification in textual data: systematic review, research challenges and open directions. Artif Intell Rev 53, 4215-4258 (2020). URL: https://doi.org/10.1007/s10462-019-09791-8
[4] Santosh Kumar Bharti, Rajeev Kumar Gupta, Prashant Kumar Shukla, Wesam Atef Hatamleh, Hussam Tarazi, Stephen Jeswinde Nuagah, "Multimodal Sarcasm Detection: A Deep Learning Approach", Wireless Communications and Mobile Computing, vol. 2022, Article ID 1653696, 10 pages, 2022. URL: https://doi.org/10.1155/2022/1653696
[5] Sarsam, S. M., Al-Samarraie, H., Alzahrani, A. I., Wright, B. (2020). Sarcasm detection using machine learning algorithms in Twitter: A systematic review. International Journal of Market Research, 62(5), 578-598. URL: https://doi.org/10.1177/1470785320921779
[6] W. Wijaya, I. M. Murwantara and A. R. Mitra, "A Simplified Method to Identify the Sarcastic Elements of Bahasa Indonesia in Youtube Comments," 2020 8th International Conference on Information and Communication Technology (ICoICT), Yogyakarta, Indonesia, 2020, pp. 1-6, doi: 10.1109/ICoICT49345.2020.9166269. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9166269&isnumber=9166148
[7] S. Rendalkar and C. Chandankhede, "Sarcasm Detection of Online Comments Using Emotion Detection," 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 2018, pp. 1244-1249, doi: 10.1109/ICIRCA.2018.8597368. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8597368&isnumber=8596764
[8] T. Jain, N. Agrawal, G. Goyal and N. Aggrawal, "Sarcasm detection of tweets: A comparative study," 2017 Tenth International Conference on Contemporary Computing (IC3), Noida, India, 2017, pp. 1-6, doi: 10.1109/IC3.2017.8284317. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8284317&isnumber=8284279
[9] Chakravarthi, B.R. Hope speech detection in YouTube comments. Soc. Netw. Anal. Min. 12, 75 (2022). URL: https://doi.org/10.1007/s13278-022-00901-z
[10] Chakravarthi, B., Hande, A., Ponnusamy, R., Kumaresan, P., & Asoka Chakravarthi, R. (2022). How can we detect Homophobia and Transphobia? Experiments in a multilingual code-mixed setting for social media governance. International Journal of Information Management Data Insights, 2, 100119. URL: https://doi.org/10.1016/j.jjimei.2022.100119