<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>R. Wang, AdaBoost for feature selection, classification and its relation with SVM, a
review, Physics Procedia</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/ISS1.2019.8908018</article-id>
      <title-group>
        <article-title>Sarcasm Identification in Dravidian Languages Tamil and Malayalam</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Poorvi Shetty</string-name>
          <email>poorvishetty1202@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>JSS Science and Technology University</institution>
          ,
          <addr-line>Mysore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>25</volume>
      <issue>2012</issue>
      <fpage>24</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>Sarcasm poses a formidable challenge to sentiment analysis systems as it conveys opinions indirectly, often diverging from their literal interpretation. The escalating demand for sarcasm and sentiment detection in social media, particularly within code-mixed content in Dravidian languages, underscores the significance of this research. Code-mixing is widespread within multilingual communities, and code-mixed texts frequently incorporate non-native scripts. The primary objective of this study is to discern sarcasm within a dataset comprising comments and posts in Tamil-English and Malayalam-English, sourced from social media platforms. Our research investigates combinations of various embeddings and models, yielding promising results. Notably, the top-performing system, a TF-IDF Vectorizer coupled with a stacking classifier (composed of a Linear Support Vector Classifier, a Random Forest model, and a K-Nearest Neighbors model, with Logistic Regression serving as the meta classifier), achieved a weighted average F1 score of 0.79 for Tamil and 0.78 for Malayalam, showcasing its effectiveness in sarcasm and sentiment analysis within code-mixed content. The proposed system ranked 2nd for Tamil and 3rd for Malayalam in the shared task.</p>
      </abstract>
      <kwd-group>
        <kwd>malayalam</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Detecting sarcasm poses a significant challenge for sentiment analysis systems because it
involves conveying an opinion indirectly, often with a meaning that diverges from the literal
interpretation. As the demand for sarcasm and sentiment detection on social media texts,
particularly in Dravidian languages, continues to rise, addressing this challenge becomes crucial.</p>
      <p>
        Tamil is a Dravidian language spoken as the native tongue by more than 78 million people
worldwide, with a presence in countries including India, Sri Lanka, Malaysia, Singapore, and
Mauritius [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Meanwhile, Malayalam serves as the official language in Kerala and is spoken by
over 37 million people globally. The written tradition of Malayalam has a rich historical legacy
spanning centuries [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Despite their significant speaker populations and extensive historical
backgrounds, these languages face a shortage of resources in the realm of Natural Language
Processing (NLP).
      </p>
      <p>
        Sarcasm detection in Natural Language Processing (NLP) is crucial for improving
machine-human communication. It involves identifying contradictions within sarcastic statements,
where the intended meaning contradicts the literal interpretation. Accurate sarcasm detection
enhances various applications, including virtual assistants, sentiment analysis, and social media
monitoring. It enables machines to better understand the subtleties of human language and
emotions, making interactions more natural and insightful [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Code-mixing is a prevalent linguistic phenomenon in multilingual communities, and
code-mixed texts are frequently composed using non-native scripts. In this context, our research
aims to discern sarcasm and determine the sentiment polarity within a dataset of comments
and posts that are code-mixed in Tamil-English and Malayalam-English, sourced from various
social media platforms. This was part of the Sarcasm Identification of Dravidian Languages
(Malayalam and Tamil) in DravidianCodeMix 2023 (DravidianCodeMix) shared task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>Several advancements have been made in the field of sarcasm detection by various researchers.
Joshi et al. [5] explored three key developments in their research, including the use of
semi-supervised pattern extraction to unveil implicit sentiment, the incorporation of hashtag-based
supervision, and the integration of contextual information beyond the target text. Their
comprehensive investigation encompassed datasets, methodologies, emerging trends, and challenges
in the domain of sarcasm detection. Elgabry et al. [6] introduced an innovative approach to
enhance Arabic sarcasm detection, which involved data augmentation, contextual word
embeddings, and a random forests model, achieving superior performance in identifying sarcasm.</p>
      <p>Apon et al. [7] found that the Random Forest classifier produced the most favorable results
when applied to their original Bengali sarcasm detection dataset. Ravikiran et al. [8]
established an experimental benchmark using state-of-the-art multilingual language models like
BERT, DistilBERT, and XLM-RoBERTa for identifying offensive language spans in Dravidian
languages. Kumar et al. [9] presented a hybrid deep learning model trained with both word and
emoji embeddings to identify sarcasm, emphasizing the significance of incorporating emojis
in sarcasm detection. Eke et al. [10] conducted an investigation into the GloVe word vector
model, revealing that it not only captures semantics and grammar but also retains contextual
information and global corpus statistics. In a comprehensive experimental analysis, Onan [11]
explored six subsets of Twitter messages, varying in size from 5,000 to 30,000, and found that
employing topic-enriched word embedding schemes alongside conventional feature sets holds
promise for sarcasm identification.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Dataset Description</title>
      <p>The dataset [12][13] used comprises YouTube video comments in both Tamil-English and
Malayalam-English. This dataset encompasses a wide range of code-mixed sentences. The
comments predominantly appear in two forms: written in the native script, and written in the
Roman script featuring either Tamil/Malayalam grammar with English vocabulary or, conversely,
English grammar with Tamil/Malayalam vocabulary. Additionally, some comments are composed in
Tamil/Malayalam script, interspersed with English expressions. Table 1 has the statistics for both
languages. Each entry is labelled as 'Non-Sarcastic' or 'Sarcastic'. We notice that
'Non-Sarcastic' entries significantly outnumber 'Sarcastic' entries, indicating class imbalance.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Data Preprocessing</title>
      <p>In the preprocessing stage, several steps were applied to prepare the data for machine learning,
namely punctuation removal, numeric characters removal, whitespace cleanup and label
encoding of target series. By implementing these preprocessing steps, the text data was cleaned,
standardized, and made ready for further feature extraction and model training.</p>
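      <p>The cleaning steps above can be sketched as follows; the exact regular expressions and the label mapping are illustrative assumptions, not taken from the paper.</p>

```python
import re

# Hypothetical helper implementing the preprocessing steps listed above:
# punctuation removal, numeric character removal, and whitespace cleanup.
def preprocess(text: str) -> str:
    text = re.sub(r"[^\w\s]", " ", text)      # strip punctuation
    text = re.sub(r"\d+", " ", text)          # strip numeric characters
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

# Label encoding of the target series: the two classes become integers.
LABELS = {"Non-Sarcastic": 0, "Sarcastic": 1}

print(preprocess("Enna da, 100% 'super'   movie!!"))  # Enna da super movie
```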
      <p>In the process of text data preprocessing and feature extraction, four different methods were
considered: TF-IDF Vectorizer, CountVectorizer, Word2Vec, and FastText.</p>
      <sec id="sec-5-1">
        <title>4.1. TF-IDF Vectorizer</title>
        <p>This method calculates the Term Frequency-Inverse Document Frequency (TF-IDF) scores for
each word in the corpus, reflecting the importance of words within individual documents and
across the entire dataset. It is particularly useful for capturing the uniqueness and significance
of words in a document [14]. Implementation was done using the Scikit-Learn library.</p>
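        <p>As a minimal sketch on a toy corpus (invented here for illustration), Scikit-Learn's TfidfVectorizer turns raw text into a sparse document-term matrix of TF-IDF weights.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy code-mixed-style corpus, invented for illustration.
corpus = [
    "semma movie da",
    "semma waste da",
    "waste of time",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse (n_docs, n_terms) matrix

print(X.shape)                         # (3, 6)
print(sorted(vectorizer.vocabulary_))  # ['da', 'movie', 'of', 'semma', 'time', 'waste']
```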
      </sec>
      <sec id="sec-5-2">
        <title>4.2. CountVectorizer</title>
        <p>Unlike TF-IDF, CountVectorizer simply counts the frequency of each word in the text. This
method is straightforward and efficient, making it a good choice when you want to consider
the raw word counts as features [15]. Implementation was done using the Scikit-Learn library.</p>
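        <p>A small sketch of the contrast, on invented examples: CountVectorizer stores raw counts rather than weighted scores.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["semma movie da da", "waste movie"]  # invented examples

cv = CountVectorizer()
X = cv.fit_transform(corpus)

# Raw frequency: 'da' occurs twice in the first comment.
print(X[0, cv.vocabulary_["da"]])  # 2
```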
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Word2Vec</title>
        <p>Word2Vec is a deep learning-based technique that learns word embeddings by predicting the
context of words in a large corpus. It captures semantic relationships between words and
represents them as dense vectors in a continuous vector space. This method is effective at
capturing semantic meanings and word associations [16]. Implementation was done using the
Gensim library.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.4. FastText</title>
        <p>FastText is an extension of Word2Vec that also considers sub-word information. It breaks words
into smaller sub-word units (n-grams) and learns embeddings for these units as well as for
full words. This approach is beneficial for handling out-of-vocabulary words and capturing
morphological information [17]. Implementation was done using the fastText library.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Models Used</title>
      <p>Diverse machine learning models were employed for the task, aiming for a thorough comparison
and selection of the most effective approaches. The Scikit-Learn library was employed for
model implementation. Default hyperparameters from the library were used for all models.
TF-IDF vectorizer was used as it gave the best results.</p>
      <sec id="sec-6-1">
        <title>5.1. Random Forest</title>
        <p>Random Forest is an ensemble learning method based on decision trees [18]. It builds multiple
decision trees and combines their predictions to improve accuracy and reduce overfitting. Gini
criterion was used, with n_estimators as 100 and min_samples_split value as 2. Max_depth
parameter was set to None.</p>
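        <p>The stated settings match Scikit-Learn's defaults; spelled out explicitly below on toy data (the data and the random_state are sketch-only additions, not from the paper).</p>

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,     # 100 decision trees
    criterion="gini",     # Gini impurity for splits
    min_samples_split=2,
    max_depth=None,       # grow trees until leaves are pure
    random_state=0,       # not in the paper; fixed here for reproducibility
)

# Tiny separable toy set: the label equals the first feature.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]
clf.fit(X, y)
print(clf.score(X, y))
```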
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Logistic Regression</title>
        <p>Logistic Regression is a simple yet effective linear classification algorithm used for binary
and multi-class classification [19]. It models the relationship between the dependent variable
and one or more independent variables using the logistic function, which transforms linear
combinations into probabilities. The L2 penalty term was utilised, the C value was set to the
default of 1.0, and the lbfgs solver was used.</p>
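        <p>A brief sketch, on one synthetic feature, of the logistic function turning a linear score into a probability, using the stated settings:</p>

```python
from sklearn.linear_model import LogisticRegression

# One synthetic feature; labels flip from 0 to 1 between x=1 and x=2.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs").fit(X, y)

# By symmetry the midpoint x=1.5 lies on the decision boundary,
# so the predicted probability of class 1 is close to 0.5 there.
print(clf.predict_proba([[1.5]])[0, 1])
print(clf.predict([[3.0]])[0])  # 1
```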
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Linear Support Vector Classifier (LinearSVC)</title>
        <p>LinearSVC is a linear classification algorithm that aims to find the hyperplane that best separates
classes in high-dimensional space [20]. It’s particularly useful when the data is linearly separable
and can handle large datasets efficiently. Here too, the L2 penalty term was used, the loss function
used was squared_hinge and C value was 1.0.</p>
      </sec>
      <sec id="sec-6-4">
        <title>5.4. Decision Tree</title>
        <p>Decision Trees are non-linear models that partition the data into subsets based on feature
values [21]. They are used for both classification and regression tasks and are interpretable.
However, they can be prone to overfitting. As with the Random Forest, the Gini criterion was
used, with a min_samples_split value of 2; the max_depth parameter was set to None.</p>
      </sec>
      <sec id="sec-6-5">
        <title>5.5. K-Nearest Neighbors (KNN)</title>
        <p>KNN is a simple instance-based learning algorithm used for classification and regression [22].
It assigns a class label to a data point based on the majority class among its k-nearest neighbors
in the feature space. The n_neighbors value was set to 5 with uniform weights, the algorithm
parameter was set to auto, and the Minkowski metric was used.</p>
      </sec>
      <sec id="sec-6-6">
        <title>5.6. AdaBoost</title>
        <p>AdaBoost is an ensemble learning method that combines the predictions of multiple weak
learners (often decision trees) to create a strong classifier [23]. It assigns different weights to
data points and focuses on those that are misclassified in previous iterations. The n_estimators
value is set to 50.</p>
      </sec>
      <sec id="sec-6-7">
        <title>5.7. One-Versus-Rest (OneVsRest)</title>
        <p>This is a technique for handling multi-class classification problems by training multiple binary
classifiers, one for each class, and then combining their results [24]. Logistic Regression is often
used as the base binary classifier. The parameters for the LR model within are the same as
mentioned above.</p>
      </sec>
      <sec id="sec-6-8">
        <title>5.8. Gradient Boosting</title>
        <p>Gradient Boosting is an ensemble learning method that builds an additive model by iteratively
training weak learners and adjusting their weights based on the errors of the previous iterations
[25]. It’s known for its high predictive accuracy. The min_samples_split value is set to 2, and
log_loss is used as the loss function.</p>
      </sec>
      <sec id="sec-6-9">
        <title>5.9. Stacking Classifier</title>
        <p>Stacking is an ensemble technique that combines multiple base classifiers with a meta-classifier
to improve predictive performance [26]. In this case, LinearSVC and RandomForest were used
as base classifiers, and logistic regression as the meta-classifier. The parameters for the models
within are the same as mentioned above.</p>
      </sec>
      <sec id="sec-6-10">
        <title>5.10. Voting Classifier</title>
        <p>Voting is another ensemble technique that combines the predictions of multiple classifiers by
taking a majority vote (hard voting) or weighted vote (soft voting) [27]. In this study, Logistic
Regression, Random Forest, and Support Vector Classifier (SVC) were combined. The parameters
for the models within are the same as mentioned above.</p>
      </sec>
      <sec id="sec-6-11">
        <title>5.11. Bagging Classifier</title>
        <p>Bagging is an ensemble technique that trains multiple instances of the same base classifier on
different subsets of the data and aggregates their predictions [28]. Here, a KNN model was used
as the base classifier, and ten instances were created. The parameters for the KNN within are
the same as mentioned above.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Experiments and Results</title>
      <p>Our research methodology involved a data preprocessing phase as outlined in the earlier section.
Subsequently, we conducted an exhaustive exploration of various word embedding techniques
(TF-IDF, Count, Word2Vec, FastText), pairing each with various models to evaluate their
performance. We aimed to assess the effectiveness of these combinations, and we carefully recorded
the weighted average F1 scores as a key performance metric.</p>
      <p>We use weighted average F1 scores (refer Tables 2 and 3) to evaluate the performance of the models.
It is a good metric for evaluating models in a class-imbalanced binary classification problem
because it considers both precision and recall, making it robust to situations where one class
is much smaller than the other. It provides a balanced assessment of a model’s performance
by penalizing false positives and false negatives, making it more informative than accuracy in
such scenarios.</p>
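      <p>A small illustration on invented labels: Scikit-Learn's f1_score with average='weighted' computes each class's F1 and weights it by the class's support, which is why it remains informative when 'Non-Sarcastic' dominates.</p>

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: 'Non-Sarcastic' (0) far outnumbers 'Sarcastic' (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Per-class F1 is 0.875 for class 0 and 0.5 for class 1; weighting by
# support (8 and 2) gives (8*0.875 + 2*0.5) / 10 = 0.8.
print(f1_score(y_true, y_pred, average="weighted"))  # 0.8
```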
      <p>The best result for both Tamil and Malayalam was achieved using TFIDFVectorizer to convert
text data into numerical form and a stacking classifier combining LinearSVC, RandomForest,
and KNN as base models, with logistic regression as the meta classifier. TFIDFVectorizer is
used to preserve semantic information in the model, enabling the extraction of meaningful
text features for better discrimination between classes. Additionally, the choice of a stacking
classifier leverages the strengths of diverse base models, effectively combining their predictions
and enhancing overall predictive power. Logistic Regression, though simple, proves effective in
aggregating base model predictions, promoting balanced and well-regulated final predictions,
which helps mitigate overfitting and bias issues, resulting in improved classification accuracy.
The weighted average F1 score for this configuration was 0.79 for Tamil and 0.78 for Malayalam.</p>
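      <p>The winning configuration can be sketched end-to-end as follows. The comments and labels are invented stand-ins for the DravidianCodeMix data, and the small cv value and random_state are sketch-only choices so the toy example runs; they are not from the paper.</p>

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented code-mixed-style comments; 0 = Non-Sarcastic, 1 = Sarcastic.
texts = [
    "semma movie da", "great acting", "nalla padam", "enjoyed a lot",
    "super direction", "waste of time",
    "sure sure best movie ever", "oh great another flop", "wow such acting",
    "super twist it seems", "semma acting alle", "best comedy ever pinne",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

stack = StackingClassifier(
    estimators=[
        ("svc", LinearSVC()),                            # base model 1
        ("rf", RandomForestClassifier(random_state=0)),  # base model 2
        ("knn", KNeighborsClassifier()),                 # base model 3
    ],
    final_estimator=LogisticRegression(),  # meta classifier
    cv=2,  # small cv only so this toy example runs; not from the paper
)

# TF-IDF features feed the stacking ensemble, mirroring the best system.
model = make_pipeline(TfidfVectorizer(), stack)
model.fit(texts, labels)
print(model.predict(["semma waste movie it seems"]))
```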
    </sec>
    <sec id="sec-8">
      <title>7. Conclusion</title>
      <p>In this research study, we addressed the task of sarcasm identification in Dravidian languages,
specifically Tamil and Malayalam. To achieve this, we explored various combinations of word
embeddings and classifiers, systematically analyzing their effectiveness in the context of the
problem. Notably, the best-performing model emerged as a result of our experimentation,
featuring the TFIDFVectorizer for text representation and a powerful ensemble stacking classifier
composed of LinearSVC, RandomForest, and K-Nearest Neighbors, with Logistic Regression
serving as the meta classifier.</p>
      <p>[5] A. Joshi, P. Bhattacharyya, M. J. Carman, Automatic sarcasm detection: A survey, CoRR abs/1602.03426 (2016). URL: http://arxiv.org/abs/1602.03426. arXiv:1602.03426.
[6] H. Elgabry, S. Attia, A. Abdel-Rahman, A. Abdel-Ate, S. Girgis, A contextual word embedding for Arabic sarcasm detection with random forests, in: Proceedings of the Sixth Arabic Natural Language Processing Workshop, Association for Computational Linguistics, Kyiv, Ukraine (Virtual), 2021, pp. 340–344. URL: https://aclanthology.org/2021.wanlp-1.43.
[7] T. S. Apon, R. Anan, E. A. Modhu, A. Suter, I. J. Sneha, M. G. R. Alam, Banglasarc: A dataset for sarcasm detection, 2022. arXiv:2209.13461.
[8] M. Ravikiran, S. Annamalai, DOSA: Dravidian code-mixed offensive span identification dataset, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 10–17.</p>
      <p>URL: https://aclanthology.org/2021.dravidianlangtech-1.2.
[9] A. Kumar, S. R. Sangwan, A. K. Singh, G. Wadhwa, Hybrid deep learning model for sarcasm detection in indian indigenous language using word-emoji embeddings, ACM Trans. Asian Low-Resour. Lang. Inf. Process. 22 (2023). URL: https://doi.org/10.1145/3519299. doi:10.1145/3519299.
[10] C. I. Eke, A. Norman, L. Shuib, F. B. Fatokun, I. Omame, The significance of global vectors representation in sarcasm analysis, in: 2020 International Conference in Mathematics, Computer Engineering and Computer Science (ICMCECS), 2020, pp. 1–7. doi:10.1109/ICMCECS47690.2020.246997.
[11] A. Onan, Topic-enriched word embeddings for sarcasm identification, in: R. Silhavy (Ed.), Software Engineering Methods in Intelligent Algorithms, Springer International Publishing, Cham, 2019, pp. 293–304.
[12] B. R. Chakravarthi, Hope speech detection in youtube comments, Social Network Analysis and Mining 12 (2022) 75.
[13] B. R. Chakravarthi, A. Hande, R. Ponnusamy, P. K. Kumaresan, R. Priyadharshini, How can we detect homophobia and transphobia? experiments in a multilingual code-mixed setting for social media governance, International Journal of Information Management Data Insights 2 (2022) 100119.
[14] J. E. Ramos, Using tf-idf to determine word relevance in document queries, 2003. URL: https://api.semanticscholar.org/CorpusID:14638345.
[15] O. Shahmirzadi, A. Lugowski, K. Younge, Text similarity in vector space models: A comparative study, 2018. arXiv:1810.00664.
[16] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.
[17] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, 2017. arXiv:1607.04606.
[18] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32. URL: https://doi.org/10.1023/a:1010933404324. doi:10.1023/a:1010933404324.
[19] A. A. T. Fernandes, D. B. F. Filho, E. C. da Rocha, W. da Silva Nascimento, Read this paper if you want to learn logistic regression, Revista de Sociologia e Política 28 (2020). URL: https://doi.org/10.1590/1678-987320287406en. doi:10.1590/1678-987320287406en.
[20] S. Ghosh, A. Dasgupta, A. Swetapadma, A study on support vector machine based linear and non-linear pattern classification, in: 2019 International Conference on Intelligent</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] P. Krishnamurthy, K. Sarveswaran, Towards building a modern written Tamil treebank, in: Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021), Association for Computational Linguistics, Sofia, Bulgaria, 2021, pp. 61–68. URL: https://aclanthology.org/2021.tlt-1.6.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Rojan, E. Alias, G. M. Rajan, J. Mathew, D. Sudarsan, Natural language processing based text imputation for malayalam corpora, in: 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC), 2020, pp. 161–165. doi:10.1109/ICESC48915.2020.9156036.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. A., S. G., S. H. R., M. Upadhyaya, A. P. Ray, M. T. C., Sarcasm detection in natural language processing, Materials Today: Proceedings 37 (2021) 3324–3331. URL: https://doi.org/10.1016/j.matpr.2020.09.124. doi:10.1016/j.matpr.2020.09.124.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. R. Chakravarthi, N. Sripriya, B. Bharathi, K. Nandhini, S. Chinnaudayar Navaneethakrishnan, T. Durairaj, R. Ponnusamy, P. K. Kumaresan, K. K. Ponnusamy, C. Rajkumar, Overview of the shared task on sarcasm identification of Dravidian languages (Malayalam and Tamil) in DravidianCodeMix, in: Forum of Information Retrieval and Evaluation FIRE - 2023, 2023.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>