<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Sarcasm Identification in Codemix Dravidian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Prabhu Ram. N</string-name>
          <email>prabhuramnphd@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meera Devi. T</string-name>
          <email>tmeeradevi@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kanisha. V</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meharnath. S</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manoji. B</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electronics and Communication Engineering, Kongu Engineering College</institution>
          ,
          <addr-line>Erode, TamilNadu</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social media enables communication through text, audio, and video. Sarcastic text can affect an individual's well-being, so it is crucial to determine whether content is sarcastic. Social media posts often mix languages and address issues related to real-life situations. Identifying sarcasm in the languages spoken in India, which has 22 official languages, is especially challenging due to extensive borrowing of vocabulary. The dataset used for this study covers the code-mixed language pairs Tamil-English and Malayalam-English. The training, validation, and test datasets are proportionally divided, with labels indicating whether each text is sarcastic. We performed experiments using transfer learning models and observed that the BERT model gave the best result.</p>
      </abstract>
      <kwd-group>
        <kwd>BERT</kwd>
        <kwd>Codemix</kwd>
        <kwd>Dravidian Language</kwd>
        <kwd>NLP</kwd>
        <kwd>Sarcasm</kwd>
        <kwd>Transformer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        In the 21st century, social media has become a platform with over 4.9 billion users [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It enables
people to share their thoughts and opinions easily, making communication faster and more
convenient. However, social media content can have harmful effects, including anxiety, depression, and
even suicidal thoughts. Unfortunately, some individuals use it to criticize
or insult others, employing sarcasm that can be difficult to detect because of its implicitness.
People now frequently use irony owing to the increasing availability of internet connections and
numerous new applications. Consequently, many countries have implemented social media
surveillance measures to monitor citizens [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, social media posts often involve a mix
of languages, while Dravidian languages like Tamil and Malayalam add complexity due to their
nuances. People utilize Natural Language Processing (NLP), a branch of artificial intelligence
(AI) that analyses human language patterns, to identify sarcasm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This paper explores the
challenges and opportunities in sarcasm identification in code-mixed Dravidian languages. The
subsequent sections cover the historical context of methods used to solve related
problems, the methodology covering data pre-processing and modelling, the experimental
setup for fine-tuning the pre-trained model, and the statistical analysis of our results with
discussion [24].
      </p>
      <p>FIRE 2023: 15th meeting of the Forum for Information Retrieval Evaluation, December 15&#8211;18, 2023, India. &#8224;These authors contributed equally.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <p>
        Natural language processing researchers use various techniques to tackle the challenging task
of sarcasm detection. Although hate speech and sarcasm exhibit indirect connections, the
pre-trained multilingual BERT model’s contextual understanding shows a more significant
similarity due to its training on multilingual Wikipedia language sources [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Hate speech
detection for the English, German, and Hindi languages with the multilingual BERT model uses both
Machine Learning (ML) and Deep Learning (DL) approaches. We employed various network-based classification
architectures, including models such as subword-level LSTM, hierarchical LSTM,
BERT, XLM-RoBERTa, LSTM, GRU, and XLNet [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. Additionally, we utilized machine
learning-based classification models such as Support Vector Machine (SVM), Logistic Regression
(LR), and Random Forest Classifier (RFC) [
        <xref ref-type="bibr" rid="ref8">8, 9, 10, 11, 12</xref>
        ] and K-Nearest Neighbour (KNN) [13].
Among the models applied to the code-mixed Tamil dataset classification task, the SVM model
showed strong performance compared to the other machine learning models. We also employed RNN
and MLP deep learning models to enhance classification. TF-IDF
(Term Frequency-Inverse Document Frequency) [14] serves as the text preprocessor, and SVM (Support Vector Machines)
[15] functions as the classifier. A two-phase approach for sarcasm detection using machine
learning algorithms involves feature extraction, feature selection, and classification using
support vector machines [16]. Sarcasm detection approaches based on a hierarchical attention network that
incorporates both word- and sentence-level attention mechanisms, on a graph convolutional
network that combines syntactic and semantic information, and on a bootstrapping approach
that iteratively learns new sarcastic patterns from labelled data have achieved state-of-the-art
performance on several benchmark datasets. In the context of emotion recognition, a study [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
focused on enhancing the effectiveness of BERT word embeddings through knowledge-based
fine-tuning techniques. This research underscores the ongoing sentiment analysis and emotion
recognition efforts within natural language processing. The model is connected to a fully
connected network with a softmax activation function, which is used to classify the emotions of
a given sentence. A hybrid model of bidirectional LSTM with a softmax attention layer and a
convolutional neural network for real-time sarcasm detection in code-switched tweets achieved a
superior classification accuracy of 92.71% and an F1-measure of 89.05% [17]. Models trained
on larger datasets achieved higher efficiency, and the inclusion of Dravidian languages in the
dataset will be helpful for this task [
        <xref ref-type="bibr" rid="ref7">7, 13, 18</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>The step-by-step process for conducting the experiment is explained and
outlined in Figure ??. These sections describe how the
experiment was carried out, ensuring a clear and detailed understanding of the experimental
flow. The process commences with the selection and initialization of a pre-trained model.
Subsequently, the dataset is collected and pre-processed. Rigorous evaluation is performed
on a designated validation and testing dataset, and fine-tuning strategies, including transfer
learning, are employed to tailor the model for the specific task. Furthermore, hyperparameter
optimization techniques are applied to enhance the model’s performance. The ensuing section
delves into a comprehensive presentation of results and constructive discussions, highlighting
the efficacy of fine-tuning and the implications of hyperparameter adjustments.</p>
      <sec id="sec-4-1">
        <title>3.1. Dataset collection</title>
        <p>The datasets for Tamil-English and Malayalam-English included in this work are
collections of comments on YouTube videos. The dataset includes all three types of code-mixed
sentences: inter-sentential switching, intra-sentential switching, and tag switching.
Most comments were written in native script or Roman script, with either Tamil /
Malayalam grammar and an English lexicon or English grammar and a Tamil / Malayalam lexicon.
Some comments were written in Tamil or Malayalam with English translations
between them.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Data preprocessing</title>
        <p>The dataset consists of text samples categorized as either &#8216;Sarcastic&#8217; or &#8216;Non-Sarcastic&#8217;; the labels are
encoded as 0 for &#8216;Sarcastic&#8217; and 1 for &#8216;Non-Sarcastic&#8217;.
The maximum sequence length is fixed at 128 tokens [19, 20].</p>
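<p>The encoding and truncation step can be sketched as follows. This is a minimal illustration: the label map follows the encoding stated above, but the whitespace split is only a stand-in for the BERT subword tokenizer actually used, and the function name is ours.</p>

```python
# Minimal sketch of the label encoding and 128-token cap described above.
# The whitespace split is illustrative only; the real pipeline uses BERT's
# WordPiece tokenizer.
LABEL_MAP = {"Sarcastic": 0, "Non-Sarcastic": 1}
MAX_TOKENS = 128

def preprocess(samples):
    """samples: list of (text, label) pairs -> list of (tokens, label_id)."""
    processed = []
    for text, label in samples:
        tokens = text.split()[:MAX_TOKENS]   # truncate long comments
        processed.append((tokens, LABEL_MAP[label]))
    return processed

data = [("super movie bro", "Non-Sarcastic"),
        ("enna oru padam, semma 'logic'", "Sarcastic")]
encoded = preprocess(data)
print(encoded[1][1])  # → 0 (Sarcastic)
```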
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Model training</title>
        <p>Simple Transformers is employed for training the model. The bert-base-uncased
model architecture is used because of its ability to capture information from text data
effectively, and it is well suited to binary classification. The model is trained to
classify each text as either &#8216;Sarcastic&#8217; or &#8216;Non-Sarcastic&#8217;.</p>
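<p>A minimal training sketch with Simple Transformers is given below, assuming the simpletransformers package and its pandas DataFrame convention of "text" and "labels" columns; the epoch count shown is illustrative and not reported here.</p>

```python
def train_sarcasm_classifier(train_df):
    """Fine-tune bert-base-uncased for binary sarcasm classification.

    train_df is a pandas DataFrame with columns "text" and "labels"
    (0 = Sarcastic, 1 = Non-Sarcastic, following the encoding in Section 3.2).
    """
    # Imported lazily so the sketch can be read without the package installed.
    from simpletransformers.classification import ClassificationModel

    model = ClassificationModel(
        "bert", "bert-base-uncased",
        num_labels=2,
        args={"max_seq_length": 128,         # matches the preprocessing cap
              "num_train_epochs": 3,         # illustrative hyperparameter
              "overwrite_output_dir": True},
        use_cuda=False,                      # set True when a GPU is available
    )
    model.train_model(train_df)
    return model
```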
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Model evaluation and testing</title>
        <p>The performance of the trained model is validated and assessed on the test dataset. From
the model's predictions, a classification report is generated that includes accuracy,
F1-score, precision, and recall. The model may be biased towards a particular
class because of class imbalance, as shown in Table 1. Therefore, the tokens that are highly frequent in
utterances of the Sarcastic class relative to the Non-Sarcastic class are identified in the
training dataset. These high-frequency tokens are removed from the test dataset, and a classification
report is generated to check whether the model is influenced by them.</p>
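<p>The high-frequency token analysis can be sketched as below. The exact selection rule is not specified here, so the relative-frequency threshold, smoothing, and function names are illustrative assumptions.</p>

```python
from collections import Counter

def tokens_overrepresented_in_sarcastic(sarcastic_texts, non_sarcastic_texts,
                                        ratio=2.0, min_count=5):
    """Tokens much more frequent in the Sarcastic class than the other class.

    A simple relative-frequency test: a token is flagged when its relative
    frequency in the Sarcastic class is at least `ratio` times its (smoothed)
    relative frequency in the Non-Sarcastic class.
    """
    sar = Counter(tok for t in sarcastic_texts for tok in t.lower().split())
    non = Counter(tok for t in non_sarcastic_texts for tok in t.lower().split())
    n_sar = max(sum(sar.values()), 1)
    n_non = max(sum(non.values()), 1)
    flagged = []
    for tok, count in sar.items():
        if count >= min_count:
            rel_sar = count / n_sar
            rel_non = (non[tok] + 1) / n_non   # +1 smoothing avoids div-by-zero
            if rel_sar / rel_non >= ratio:
                flagged.append(tok)
    return set(flagged)

def strip_tokens(texts, flagged):
    """Remove the flagged tokens from each test utterance."""
    return [" ".join(t for t in text.split() if t.lower() not in flagged)
            for text in texts]

sarcastic = ["semma twist bro"] * 5
non_sarcastic = ["semma movie"] * 5
print(sorted(tokens_overrepresented_in_sarcastic(sarcastic, non_sarcastic)))
# → ['bro', 'twist']
```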
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental setup</title>
      <p>Three datasets (training, validation, and test) are utilized for the sarcasm identification task, for each of the English-Tamil and
English-Malayalam language pairs.</p>
      <p>The training dataset contains far more &#8216;Non-Sarcastic&#8217; labels than &#8216;Sarcastic&#8217;
labels, introducing class imbalance, as shown in Table 1. This large difference in the occurrence
of the labels makes the model more biased towards the &#8216;Non-Sarcastic&#8217; texts.</p>
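<p>The imbalance in Table 1 can be quantified from the label counts, and one common mitigation, inverse-frequency class weighting, computed directly from them. The 80/20 split below is illustrative, not the actual distribution in Table 1.</p>

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: weight_c = N / (num_classes * count_c).

    With a skewed label list these weights up-weight the rare 'Sarcastic'
    class during training, one common mitigation for the bias noted above.
    """
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

labels = ["Non-Sarcastic"] * 80 + ["Sarcastic"] * 20   # illustrative 80/20 split
print(class_weights(labels))
# → {'Non-Sarcastic': 0.625, 'Sarcastic': 2.5}
```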
    </sec>
    <sec id="sec-6">
      <title>5. Results and discussion</title>
      <p>Precision, recall, and F1-scores for the bert-base-uncased sarcasm detection model were
competitive. However, there was a clear discrepancy between the performance on the majority class
(Non-Sarcastic) and the minority class (Sarcastic). Like many other machine learning algorithms, the
bert-base-uncased model tends to favour the majority class. As a result, the minority class
(Sarcastic) had high precision but low recall. A high F1-score indicates that the model can
accurately detect sarcasm while minimising false positives (non-sarcastic text misclassified as
sarcastic) and false negatives (sarcastic text misclassified as non-sarcastic). The Tamil-English and
Malayalam-English datasets shown in Table 2 have respective F1 values of 0.79 and 0.81. The
removal of highly frequent words from the Sarcastic class with respect to the Non-Sarcastic
class did not influence the model, as shown in Table 3. The differences in accuracy between the
English-Tamil and English-Malayalam datasets before and after high-frequency token removal
provide important information on how token removal affects model performance. The accuracy
in the English-Tamil dataset is 0.79, meaning that 79% of instances are properly classified by
the model before high-frequency tokens are eliminated. But it’s clear that the accuracy has
slightly decreased to 0.78 after the high-frequency tokens were eliminated. A possible cause
of this slight decrease in accuracy is the loss of important information that the high-frequency tokens were
carrying. It implies that these tokens may indeed contribute positively to the model's capacity
to accurately identify instances in the English-Tamil dataset.</p>
      <p>On the other hand, prior to the removal of high-frequency tokens, the English-Malayalam
dataset shows an accuracy of 0.81, indicating a high degree of classification accuracy. Upon
eliminating high-frequency tokens, the accuracy noticeably increases to 0.82. This improved
accuracy raises the possibility that high-frequency tokens might contaminate the data with
noise or ambiguity. Eliminating these tokens improves model accuracy by streamlining the
dataset. (Fragment of the per-class results table: macro average 0.73, weighted average 0.77.)</p>
      <p>The differences in accuracy between the two datasets demonstrate how the influence of
high-frequency tokens on model performance varies with context. Their elimination
improved the accuracy of the English-Malayalam dataset but slightly decreased the accuracy
of the English-Tamil dataset. These results highlight the need for a considered strategy
in NLP tasks when deciding which high-frequency tokens to keep or discard. Making
informed judgments requires a thorough analysis of the linguistic properties of the dataset and
the possibility of noise introduction from high-frequency tokens. Enhancing model accuracy
in code-mixed Dravidian languages requires striking the right balance between information
retention and data refinement, and this approach should be dictated by the unique linguistic
characteristics of each dataset. A confusion matrix is used to analyse the performance of the
model, as shown in Figure 3. The task involves binary classification with the &#8216;Non-Sarcastic&#8217;
and &#8216;Sarcastic&#8217; classes. This matrix comprises four categories: True Positives (TP),
True Negatives (TN), False Positives (FP), and False Negatives (FN) [21]. Figure 3
reveals the model's strengths and areas for improvement in the binary classification task, offering
valuable insights for refining its performance and guiding future research efforts.</p>
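<p>The metrics in the classification report can be recomputed directly from the four confusion-matrix counts; the sketch below uses illustrative counts, not the results reported above.</p>

```python
def report_from_confusion(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy for the positive (Sarcastic) class.

    tp/fp/fn/tn follow the binary confusion-matrix convention discussed
    above; the counts in the example call are illustrative only.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

print(report_from_confusion(tp=30, fp=10, fn=20, tn=140))
```

Note how an imbalanced test set produces the pattern described above: the positive-class precision (0.75) and recall (0.6) are far below the overall accuracy (0.85), which is inflated by the majority class.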
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion and future work</title>
      <p>Sarcasm identification using the BERT model provides a major improvement in natural
language processing. BERT can capture intricate language nuances, leveraging
contextual awareness and pre-trained knowledge, which makes it well suited to the complex task of
sarcasm detection. The accuracy of the model is substantially improved by its capacity to take
into account the larger context of a statement and comprehend how specific words or phrases fit
into that context. BERT can help provide more balanced predictions, reducing the tendency to
misclassify sarcastic statements as non-sarcastic. However, class imbalance remains a challenge:
because of it, the model is still biased towards the Non-Sarcastic texts, which reduces its accuracy.
Further research should focus on strategies such as oversampling, undersampling, or alternative
loss functions to fine-tune BERT models effectively. By addressing the class imbalance problem,
we can continue to enhance the accuracy and reliability of sarcasm detection with BERT-based
models, making them more valuable in real-world applications. In addressing the class imbalance,
the potential integration of the Synthetic Minority Over-sampling Technique (SMOTE) algorithm
holds significant promise [22]. By integrating SMOTE into the BERT pipeline, we can create
synthetic samples for the minority class. This helps in creating a balanced dataset during training
and greatly improves the accuracy of the model.
</p>
      <p>[9] A. Muneer, S. M. Fati, A comparative analysis of machine learning techniques for cyberbullying detection on Twitter, Future Internet 12 (2020) 187.
[10] V. Pathak, M. Joshi, P. Joshi, M. Mundada, T. Joshi, KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using machine learning for detection of hate speech and offensive code-mixed social media text, arXiv preprint arXiv:2102.09866 (2021).
[11] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset for code-mixed Malayalam-English, arXiv preprint arXiv:2006.00210 (2020).
[12] P. R. Nagarajan, V. Mammen, V. Mekala, M. Megalai, A fast and energy efficient path planning algorithm for offline navigation using SVM classifier, Int. J. Sci. Technol. Res. 9 (2020) 2082&#8211;2086.
[13] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, arXiv preprint arXiv:2006.00206 (2020).
[14] B. Subba, P. Gupta, A TfidfVectorizer and singular value decomposition based host intrusion detection system framework for detecting anomalous system processes, Computers &amp; Security 100 (2021) 102084.
[15] V. Cherkassky, Y. Ma, Practical selection of SVM parameters and noise estimation for SVM regression, Neural Networks 17 (2004) 113&#8211;126.
[16] S. M. Sarsam, H. Al-Samarraie, A. I. Alzahrani, B. Wright, Sarcasm detection using machine learning algorithms in Twitter: A systematic review, International Journal of Market Research 62 (2020) 578&#8211;598.
[17] R. A. Potamias, G. Siolas, A.-G. Stafylopatis, A transformer-based approach to irony and sarcasm detection, Neural Computing and Applications 32 (2020) 17309&#8211;17320.
[18] P. K. Roy, S. Bhawal, C. N. Subalalitha, Hate speech and offensive language detection in Dravidian languages using deep ensemble framework, Computer Speech &amp; Language 75 (2022) 101386.
[19] B. R. Chakravarthi, Hope speech detection in YouTube comments, Social Network Analysis and Mining 12 (2022) 75.
[20] B. R. Chakravarthi, A. Hande, R. Ponnusamy, P. K. Kumaresan, R. Priyadharshini, How can we detect homophobia and transphobia? Experiments in a multilingual code-mixed setting for social media governance, International Journal of Information Management Data Insights 2 (2022) 100119.
[21] O. Caelen, A Bayesian interpretation of the confusion matrix, Annals of Mathematics and Artificial Intelligence 81 (2017) 429&#8211;450.
[22] B. R. Chakravarthi, D. Chinnappa, R. Priyadharshini, A. K. Madasamy, S. Sivanesan, S. C. Navaneethakrishnan, S. Thavareesan, D. Vadivel, R. Ponnusamy, P. K. Kumaresan, Developing successful shared tasks on offensive language identification for Dravidian languages, arXiv preprint arXiv:2111.03375 (2021).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <source>Top social media statistics and trends of</source>
          <year>2023</year>
          , Online,
          <year>2023</year>
          . URL: https://www.forbes.com/advisor/business/social-media-statistics/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Duffy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Chan</surname>
          </string-name>
          , “
          <article-title>you never really know who's looking”: Imagined surveillance across social media platforms</article-title>
          ,
          <source>New Media &amp; Society</source>
          <volume>21</volume>
          (
          <year>2019</year>
          )
          <fpage>119</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Harish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Rangan</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on indian regional language processing</article-title>
          ,
          <source>SN Applied Sciences</source>
          <volume>2</volume>
          (
          <year>2020</year>
          )
          <fpage>1204</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Bert-lstm model for sarcasm detection in code-mixed social media post</article-title>
          ,
          <source>Journal of Intelligent Information Systems</source>
          <volume>60</volume>
          (
          <year>2023</year>
          )
          <fpage>235</fpage>
          -
          <lpage>254</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Santosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Aravind</surname>
          </string-name>
          ,
          <article-title>Hate speech detection in hindi-english code-mixed social media text</article-title>
          ,
          <source>in: Proceedings of the ACM India joint international conference on data science and management of data</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>310</fpage>
          -
          <lpage>313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jayapal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          , NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020:
          <article-title>Sentiment analysis of code-mixed Dravidian text using XLNet</article-title>
          , arXiv preprint arXiv:2010.07773 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          , et al.,
          <article-title>Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada</article-title>
          ,
          <source>in: Proceedings of the first workshop on speech and language technologies for Dravidian languages</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>