<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Sarcasm Identification of Dravidian Languages (Malayalam and Tamil)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>V Indirakanth</string-name>
          <email>indirakanth2010681@ssn.edu.in</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dharunkumar Udayakumar</string-name>
          <email>dharunkumar2010504@ssn.edu.in</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thenmozhi Durairaj</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>B. Bharathi</string-name>
          <email>bharathib@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <fpage>5</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>The rapid growth of social media has led to an increase in sarcastic comments, and detecting sarcasm in multilingual posts has become a critical aspect of language processing. Our work in the FIRE 2023 shared task 'Sarcasm Identification of Dravidian Languages (Tamil and Malayalam)' centred on identifying sarcasm in social media texts, a research area of growing importance. We employed various models, including BERT, DistilBERT, XLM-RoBERTa, and an SVM with TF-IDF features, to categorize text as either sarcastic or not. Our team, SSN_FeaturesAlpha, achieved notable results, with the highest F1 scores of 0.68 for Tamil using DistilBERT and 0.63 for Malayalam using BERT. Our submission ranked 7th for Tamil and 5th for Malayalam, underscoring the effectiveness of our approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Dravidian language</kwd>
        <kwd>Text classification</kwd>
        <kwd>Transfer learning</kwd>
        <kwd>Sarcasm Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The paper titled "Overview of the Shared Task on Sarcasm Identification of Dravidian Languages (Malayalam and Tamil) in DravidianCodeMix" [9] gives an overview of the shared task. The FIRE 2023 competition, Sarcasm Identification of Dravidian Languages (Tamil and Malayalam), is designed to develop systems capable of discerning the polarity of sentiments in code-mixed Tamil and Malayalam text from social media forums. The organizers of FIRE 2022 had earlier introduced the task of identifying sentiments in Tamil and Malayalam code-mixed text, which required categorizing posts as positive, negative, neutral, mixed emotions, or unclassifiable emotions in the intended language. This year's FIRE 2023 competition provides datasets for two languages, Tamil and Malayalam, both code-mixed with English.</p>
      <p>In the context of the FIRE 2023 competition, which centres on detecting sarcasm in Tamil and Malayalam comments code-mixed with English, this article outlines our systematic approach. Our primary task was to classify each comment as sarcastic or non-sarcastic in the respective language. To achieve this, we followed a structured approach. First, we applied language-specific transliteration and translation techniques to better handle the code-mixed content. Second, we preprocessed the training and test data for both Tamil and Malayalam using the NLTK library. Finally, we selected and fine-tuned a range of machine learning models, including BERT, DistilBERT, XLM-RoBERTa, and an SVM with TF-IDF features, on the Tamil and Malayalam data together with the embedded English. This article provides a comprehensive overview of our methodology and contributions to the FIRE 2023 competition, where our primary focus lies in sarcasm detection.</p>
      <p>The structure of this paper is as follows: Section 2 reviews related work on sarcasm detection, Section 3 provides a detailed description of the data and our model methodology, Section 4 presents our experimental results and analysis, and finally, Section 5 offers conclusions drawn from our work and discusses potential avenues for further improvement in sarcasm detection.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        In recent years, the field of Natural Language Processing (NLP) has witnessed a surge of
interest in the detection of sarcasm, a form of figurative language that presents a formidable
challenge due to its often subtle and context-dependent nature. While the majority of research
in this domain has focused on widely spoken languages such as English, there is a growing
recognition of the need to extend this investigation to underrepresented languages like Tamil and
Malayalam, which belong to the Dravidian language family. Previous work on sarcasm detection
in NLP has predominantly leveraged various machine learning techniques, including supervised,
unsupervised, and deep learning approaches. In the paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the authors created a multilingual dataset to recognize and encourage positivity in comments and proposed a novel custom deep network architecture that uses a concatenation of embeddings from T5-Sentence. They experimented with multiple machine learning models, including SVM, logistic regression, K-nearest neighbours, and decision trees. The paper [4] aims to develop a system that groups posts by emotion and sentiment and flags sarcastic posts, if present: a prototype that infers the emotions of posts, namely anger, surprise, happiness, fear, sorrow, trust, anticipation, and disgust, with three sentic levels in each. The task [5] presents the findings of the shared task on Multimodal Sentiment Analysis and Troll Meme Classification in Dravidian languages, which assumes that analysing both textual and image features leads to better predictions. The paper [6] investigates negative sentiment tweets containing hyperbole for sarcasm detection; 6,600 pre-processed negative sentiment tweets mentioning Chinesevirus, Kungflu, COVID19, Hantavirus, and Coronavirus were gathered. In paper [7], tasks included code-mixing at the intra-token and inter-token levels, and Malayalam and Kannada were introduced in addition to Tamil. The quality and quantity of the submissions show that there is great interest in Dravidian languages in a code-mixed setting and that the state of the art in this domain still needs improvement. Task [8] intends to improve offensive language identification by generating pseudo-labels on the dataset: all code-mixed texts are transliterated into the respective Dravidian language, either Kannada, Malayalam, or Tamil, pseudo-labels are generated for the transliterated dataset, and the two datasets are combined using the generated pseudo-labels into a custom dataset called CMTRA. As Dravidian languages are under-resourced, this approach increases the amount of training data available to the language models.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Dataset Description and Proposed Methodology</title>
      <p>This section provides information about the mixed-language data in Tamil and Malayalam,
including details about the dataset and how we prepared it. In our research, we explored
various techniques commonly used in Natural Language Processing (NLP). We used methods
like Support Vector Machines (SVM), DistilBERT, XLM-RoBERTa, and transfer learning to
improve our results.</p>
      <sec id="sec-4-1">
        <title>3.1. Data Description</title>
        <p>
          For the sarcasm identification task, the organizers offered datasets that were code-mixed in Tamil and Malayalam. The Malayalam dataset has 12,057 posts for training and 3,768 posts for testing, whereas the Tamil dataset has 27,036 posts for training and 8,449 posts for testing. The objective of this work is to divide the posts in the Tamil and Malayalam datasets into two categories: Sarcastic and Non-sarcastic (Fig. 1 &amp; 2). The training data was taken from comments on YouTube [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Data preprocessing</title>
        <p>In the context of our research, data preprocessing was conducted on both the Tamil and Malayalam datasets to ensure their suitability for sarcasm detection tasks in Dravidian languages. This preprocessing, executed using NLTK, aimed to enhance the quality and uniformity of the text data. Initially, duplicate entries were removed to mitigate their potential influence on model performance. Subsequently, text strings beginning with "@" symbols, typically representing author names or user IDs, were eliminated. Hashtags, punctuation, URLs, and numerals devoid of semantic significance were also stripped from the text, and emojis were removed to maintain textual clarity. Additionally, all uppercase English text and native-language text in the Roman script were converted to lowercase. We further downsampled the training data and eliminated common stop words to enhance data quality. These preprocessing measures were applied consistently to both the Tamil and Malayalam datasets. Finally, the labels "Sarcastic" and "Non-sarcastic" were mapped to "1" (sarcasm) and "0" (non-sarcasm), aligning the data with our binary classification task.</p>
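The cleaning steps above can be sketched with standard-library regular expressions. This is a minimal illustration, not our exact pipeline; the function name and patterns are our own:

```python
import re

def clean_comment(text: str) -> str:
    """Apply the preprocessing steps described above to one comment."""
    text = re.sub(r"@\w+", "", text)            # drop @user mentions / author IDs
    text = re.sub(r"https?://\S+", "", text)    # drop URLs
    text = re.sub(r"#\w+", "", text)            # drop hashtags
    text = re.sub(r"\d+", "", text)             # drop numerals
    text = re.sub(r"[^\w\s]", "", text)         # drop punctuation and emoji-like symbols
    text = text.lower()                          # lowercase Roman-script text
    return re.sub(r"\s+", " ", text).strip()     # normalise whitespace

print(clean_comment("@user1 Semma movie!! 100% vera level :) https://t.co/xyz #trending"))
# → semma movie vera level
```

Each substitution mirrors one step in the paragraph above; the order matters (URLs must be removed before punctuation stripping would break them apart).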
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Proposed Methodology</title>
        <p>In this section, we describe the methodology employed for sarcasm detection in Dravidian languages, breaking the process into its stages and explaining how each step contributes to the overall goal. Sarcasm, known for its subtlety and context-dependency, presents a formidable challenge in Natural Language Processing (NLP). Our approach combines NLP techniques with a range of machine learning models to tackle this linguistic puzzle.</p>
        <p>Tokenization:
For the initial data preprocessing, we used tokenization to break the text into individual tokens or subwords. Depending on the model, different tokenizers were employed: the BertTokenizer for BERT and DistilBERT, the XLMRobertaTokenizer for XLM-RoBERTa, and the TfidfVectorizer for the Support Vector Machine (SVM) approach. Tokenization is a crucial step in preparing the text data for further processing.</p>
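To illustrate what subword tokenizers such as BertTokenizer do internally, here is a minimal greedy longest-match-first (WordPiece-style) split over a toy vocabulary. The vocabulary and function are invented for the example; real tokenizers use vocabularies of tens of thousands of entries:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, in the style of WordPiece."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece     # continuation pieces are prefixed
            if piece in vocab:
                match = piece
                break
            end -= 1                     # shrink the candidate from the right
        if match is None:
            return ["[UNK]"]             # no piece matched: unknown token
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_tokenize("sarcasm", {"sar", "##cas", "##m"}))
# → ['sar', '##cas', '##m']
```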
        <p>Model Training:
Next, we train the models. Our ensemble, comprising BERT, DistilBERT, and XLM-RoBERTa, comes pre-trained on large text corpora, but to excel at sarcasm detection in Dravidian languages the models are fine-tuned on the task data. Fine-tuning runs for 4 epochs, with each epoch representing a complete pass through the training data. Reprocessing of the input data is enabled so that the models adapt to the characteristics of sarcasm detection, and the models are configured for binary classification, distinguishing between sarcastic and non-sarcastic text. This fine-tuning process ensures that our models capture the subtleties of sarcasm, enhancing their performance on the sarcasm detection task in Dravidian languages.</p>
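The actual fine-tuning updates all transformer weights and cannot be reproduced compactly here; the sketch below shows only the shape of such a loop (4 epochs, binary cross-entropy) on a toy logistic model with invented two-dimensional features:

```python
import math

# Toy binary classifier trained for 4 epochs, mirroring the loop structure of
# our fine-tuning runs. The features, labels, and learning rate are invented.
data = [([1.0, 0.2], 1), ([0.9, 0.1], 1), ([0.1, 0.9], 0), ([0.2, 1.0], 0)]
w = [0.0, 0.0]
b = 0.0
lr = 0.5

for epoch in range(4):                        # one complete pass per epoch
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))        # sigmoid for binary classification
        g = p - y                             # gradient of binary cross-entropy
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

print([predict(x) for x, _ in data])
# → [1, 1, 0, 0]
```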
        <p>SVM (Support Vector Machine):
Our methodology also extends beyond neural network-based models: we include the Support Vector Machine (SVM), a classic machine learning algorithm, as a benchmark for comparison with the neural models. Using TF-IDF vectors as features and a linear kernel for classification, the SVM offers a different perspective on sarcasm detection, enriching our research toolkit.</p>
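A minimal version of this SVM baseline can be sketched with scikit-learn. The toy comments and labels below are invented stand-ins for the code-mixed shared-task data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy examples standing in for code-mixed YouTube comments.
texts = [
    "semma comedy da vera level",
    "super acting mass scene",
    "oh great another flop padam",     # sarcastic
    "wow what a waste of time",        # sarcastic
    "nalla movie family ku pudikkum",
    "sure best movie ever not",        # sarcastic
]
labels = [0, 0, 1, 1, 0, 1]            # 1 = sarcastic, 0 = non-sarcastic

# TF-IDF features feeding a linear-kernel SVM, as in our baseline.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
```

On the real data, the same pipeline is fit on the training split and applied to the test split; only the vectorizer's vocabulary and the SVM weights change.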
        <p>Model Evaluation:
To gauge the effectiveness of our sarcasm detection models, we carry out a comprehensive evaluation based on standard metrics that provide insight into each model's performance.</p>
        <p>Classification Report: The classification report covers accuracy, F1-score, precision, and recall, giving a detailed assessment of how well each model distinguishes sarcasm from non-sarcasm.</p>
        <p>Confusion Matrix: In addition to the classification report, we construct confusion matrices showing true positives, true negatives, false positives, and false negatives, giving a clear picture of each model's strengths and areas for improvement.</p>
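The confusion-matrix cells and the derived metrics can be computed directly; a minimal standard-library sketch, with invented toy label lists:

```python
# Confusion-matrix counts and derived metrics for binary sarcasm labels
# (1 = sarcastic, 0 = non-sarcastic). The gold/pred lists are invented toys.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(gold)

print(tp, fp, fn, tn, round(f1, 2))
# → 3 1 1 3 0.75
```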
        <p>Prediction and Labeling:
Finally, we apply the trained models to the test data, stored in a CSV file. To facilitate interpretation, we add a new column named "Labels" to the data, in which a value of 0 denotes "Non-sarcastic" and a value of 1 denotes "Sarcastic", based on the model's predictions.</p>
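Writing the predictions back out with a "Labels" column can be done with the standard-library csv module. The rows, predictions, and column names below are illustrative, not the real shared-task files:

```python
import csv
import io

# Toy test rows standing in for the real test CSV; predictions are invented.
rows = [{"text": "semma movie"}, {"text": "oh great another flop"}]
preds = [0, 1]  # 0 = Non-sarcastic, 1 = Sarcastic

out = io.StringIO()  # in a real run this would be open("predictions.csv", "w")
writer = csv.DictWriter(out, fieldnames=["text", "Labels"])
writer.writeheader()
for row, p in zip(rows, preds):
    writer.writerow({"text": row["text"], "Labels": p})

csv_text = out.getvalue()
print(csv_text)
```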
        <p>In summary, our methodology combines NLP techniques with a diverse range of models. Tokenization, model training, and careful evaluation together allow us to study sarcasm detection in Dravidian languages, gain insight into the effectiveness of each approach, and draw conclusions about the task at hand.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Results</title>
      <p>In this section, we present the evaluation of our models and the submitted results for the Tamil and Malayalam code-mixed languages.</p>
      <sec id="sec-5-1">
        <title>4.1. Experimental Results</title>
        <p>The results below are the F1 scores for each of the models used. The highest score recorded for Tamil is 0.75 with the DistilBERT model, and the highest score recorded for Malayalam is 0.72 with the BERT model. The confusion matrices provided are for the validation/development data.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Submitted Results</title>
        <p>We applied the transfer learning technique to enrich the training data for the two languages. Our approach utilized a combination of BERT, DistilBERT, XLM-RoBERTa, SVM, and TF-IDF models. The collaborative efforts of our team, FeaturesAlpha_tam, yielded impressive results, with maximum F1-scores of 0.68 in Tamil and 0.63 in Malayalam across the various models.</p>
        <p>In our submissions, we presented the results of DistilBERT for Tamil and BERT for Malayalam, achieving rankings of 7th for Tamil and 5th for Malayalam in the competition. Table 5 presents the Tamil ranking and Table 6 the Malayalam ranking. Detailed performance metrics, including accuracy and F1-scores, are available in Table 3 for Tamil and Table 2 for Malayalam, providing a comprehensive view of our models' effectiveness.</p>
        <p>Furthermore, we’ve presented the prediction values in confusion matrices for both languages,
enhancing the interpretability of our results. These visualizations can be found in Figure 3
for Tamil and Figure 4 for Malayalam, allowing for a deeper understanding of the model’s
performance.</p>
        <sec id="sec-5-2-1">
          <title>Malayalam ranking (team names; scores not recovered from the source)</title>
          <p>SSNCSE1_Malayalaml, hatealert_Malayalam, ABC_malayalam, IRLabIITBHU_mal, MUCS_mal, SSN_FeaturesAlpha_mal, TechWhiz_mal, YenCS_mal, Hydrangea_malayalamrun1, ENDEAVOUR_malayalam, ramyasiva_malayalam</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>This paper outlines our methodology for identifying sarcasm in text from the Dravidian languages Tamil and Malayalam. Our approach focused on preprocessing techniques and the use of pre-trained models such as BERT, DistilBERT, and XLM-RoBERTa, as well as traditional methods such as SVM with TF-IDF features, with various input variations for the shared task across both languages. Our evaluation indicates that fine-tuning the BERT and DistilBERT architectures yields notable performance improvements, and our team achieved higher F1 scores than the baseline scores, showing the effectiveness of our approach. We also harnessed transfer learning to maximize results. While our current research shows promising results, there remains room for further advancement: future work can explore different deep learning algorithms to push the boundaries of sarcasm detection in Dravidian languages, and extending our work to other languages would broaden the scope and applicability of our methodology. In conclusion, our study represents a step forward in sarcasm detection within Dravidian languages. By comparing our methodology and outcomes with existing research, we hope to contribute to the ongoing dialogue and innovation in this field, ultimately paving the way for more accurate and robust sarcasm detection systems.</p>
    </sec>
    <sec id="sec-7">
      <title>6. References</title>
      <p>[4] S. Rendalkar and C. Chandankhede, "Sarcasm Detection of Online Comments Using Emotion Detection," 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 2018, pp. 1244-1249. doi: 10.1109/ICIRCA.2018.8597368. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;arnumber=8597368&amp;isnumber=8596764</p>
      <p>[5] Premjith B, Chakravarthi BR, Subramanian M, et al. Findings of the Shared Task on Multimodal Sentiment Analysis and Troll Meme Classification in Dravidian Languages. Association for Computational Linguistics, January 2022. doi: 10.18653/v1/2022.dravidianlangtech-1.39</p>
      <p>[6] "A Machine Learning Approach in Analyzing the Effect of Hyperboles Using Negative Sentiment Tweets for Sarcasm Detection." ScienceDirect, 22 Jan. 2022. doi: 10.1016/j.jksuci.2022.01.008</p>
      <p>[7] Priyadharshini R, Chakravarthi BR, Thavareesan S, Chinnappa D, Thenmozhi D, Ponnusamy R. Overview of the DravidianCodeMix 2021 Shared Task on Sentiment Detection in Tamil, Malayalam, and Kannada. Forum for Information Retrieval Evaluation, December 2021. doi: 10.1145/3503162.3503177</p>
      <p>[8] Hande, Adeep, et al. "Offensive Language Identification in Low-resourced Code-mixed Dravidian Languages Using Pseudo-labeling." arXiv, 27 Aug. 2021, arxiv.org/abs/2108.12177v1</p>
      <p>[9] B. R. Chakravarthi, N. Sripriya, B. Bharathi, K. Nandhini, S. Chinnaudayar Navaneethakrishnan, T. Durairaj, R. Ponnusamy, P. K. Kumaresan, K. K. Ponnusamy, C. Rajkumar. Overview of the Shared Task on Sarcasm Identification of Dravidian Languages (Malayalam and Tamil) in DravidianCodeMix. In: Forum for Information Retrieval Evaluation, FIRE 2023, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Chakravarthi</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          <year>2022</year>
          .
          <article-title>Hope speech detection in YouTube comments</article-title>
          .
          <source>Social Network Analysis and Mining</source>
          ,
          <volume>12</volume>
          (
          <issue>1</issue>
          ),
          <fpage>75</fpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Chakravarthi</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hande</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponnusamy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumaresan</surname>
            ,
            <given-names>P.K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Priyadharshini</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2022</year>
          .
          <article-title>How can we detect Homophobia and Transphobia? Experiments in a multilingual code-mixed setting for social media governance</article-title>
          .
          <source>International Journal of Information Management Data Insights</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <fpage>100119</fpage>
          . Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Chakravarthi</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          <article-title>Hope speech detection in YouTube comments</article-title>
          .
          <source>Soc. Netw. Anal. Min</source>
          .
          <volume>12</volume>
          ,
          <issue>75</issue>
          (
          <year>2022</year>
          ). https://doi.org/10.1007/s13278-022-00901-z
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>