<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>DravidianCodeMix 2025: Comparative Study of Transformer-based Models for Offensive Content Detection in Tamil, Malayalam, Kannada and Tulu Code-Mixed Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Santhiya P</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akshitha A V</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arul Murugan S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chandran T</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kongu Engineering College</institution>
          ,
          <addr-line>Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>20</lpage>
      <abstract>
<p>Detecting offensive language in Dravidian code-mixed text is essential for safe digital interaction, especially on social media platforms where multilingual exchanges are common. This work builds classification models for four major Dravidian languages: Tamil, Malayalam, Kannada, and Tulu. We explore transformer-based approaches and evaluate their effectiveness across the datasets. Our study shows that IndicBERTv2-m2m was more effective for Tulu and Malayalam, whereas TwHIN-BERT yielded better outcomes for Tamil and Kannada. These observations emphasize that model suitability varies by language, highlighting the necessity of language-specific strategies for offensive content detection. Our work also provides a foundation for handling low-resource and code-mixed text challenges, and the datasets facilitate research into cross-lingual and multilingual learning approaches. Ultimately, these efforts contribute to safer online environments and more inclusive digital communication.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Survey</title>
      <p>
Research on offensive language identification in Dravidian code-mixed text has grown rapidly, largely
driven by shared tasks and dataset creation. The FIRE 2020 and 2021 shared tasks provided early
evaluations on Tamil, Malayalam, and Kannada, highlighting the challenges of detecting offensive
expressions in code-mixed settings and motivating the use of both classical machine learning and
multilingual transformer methods [4]. Later FIRE tasks expanded coverage to more languages and
refined annotation schemes, enabling broader evaluation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
A key resource is the DravidianCodeMix dataset, which offers annotated corpora for Tamil, Malayalam,
and Kannada [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. It includes multiple offensive categories and reflects natural variation in social media
text, such as Romanization and spelling inconsistencies, making it a realistic benchmark. More recently,
a low-resource corpus for Tulu extended research to another Dravidian language, addressing severe data
scarcity and enabling cross-lingual transfer studies [3]. Together, these datasets provide a foundation
for multilingual modeling and sociolinguistic analysis of abusive discourse.
      </p>
<p>Beyond resources, model development has advanced performance. While early multilingual
transformers achieved strong baselines, results varied by language due to dataset size and code-mixing
complexity. To improve on this, Chakravarthi et al. [5] introduced a multilingual MPNet and CNN fusion
model for Tamil, Malayalam, and Kannada. Their hybrid architecture handled code-mixing effectively
and outperformed both traditional machine learning and single-model transformers.</p>
<p>Overall, shared tasks and datasets such as DravidianCodeMix and the Tulu corpus have standardized evaluation
and expanded coverage, while hybrid deep learning models have established robust baselines [5]. These
developments provide a pathway for improving low-resource adaptation and advancing offensive
language detection in multilingual social media.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <sec id="sec-3-1">
<title>3.1. Dataset Description</title>
<p>This work analyzes code-mixed datasets for four Dravidian languages—Tamil, Malayalam, Kannada, and
Tulu—collected from social media platforms where English is often blended with regional languages.
Each dataset is annotated as offensive or non-offensive, with subcategories distinguishing insults
targeted at individuals, groups, or untargeted abuse. The Tamil dataset contains categories such as
Not Offensive, Insult Individual, Insult Group, and Untargeted, while the Malayalam dataset adds an
Other Language class for cross-lingual entries. Kannada follows a similar structure but introduces a
Not-Kannada class to capture comments written in English, Hindi, or Romanized forms. The smaller
Tulu dataset uses Not Offensive, Targeted, and Untargeted categories, providing insights into
low-resource Tulu-English interactions. Related initiatives include the DOSA dataset for offensive span
identification in Dravidian languages [6] and a recent corpus for Tulu offensive language detection
[3], both underscoring the need to address low-resource challenges. These datasets not only enable
systematic evaluation of offensive language detection methods but also reflect real-world issues like
code-switching, inconsistent spellings, and dialectal variation. Collectively, they form a solid basis for
advancing abusive content detection in Dravidian code-mixed contexts. Additionally, they support
experimentation with multilingual and transformer-based models to handle mixed-language inputs
effectively, and they provide opportunities to analyze sociolinguistic patterns and user behavior in
online interactions. Future research can leverage these resources to improve cross-lingual transfer and
low-resource learning strategies.</p>
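<p>The per-language label inventories described above can be summarized in a small mapping. This is a sketch: the exact label strings in the released dataset files may be spelled differently, and the dictionary and function names are ours.</p>

```python
# Per-language label inventories for the four code-mixed datasets,
# following the category names given in Section 3.1. Spellings in the
# released files may differ; these names are illustrative.
LABEL_SCHEMES = {
    "tamil": ["Not Offensive", "Insult Individual", "Insult Group", "Untargeted"],
    "malayalam": ["Not Offensive", "Insult Individual", "Insult Group",
                  "Untargeted", "Other Language"],
    "kannada": ["Not Offensive", "Insult Individual", "Insult Group",
                "Untargeted", "Not-Kannada"],
    "tulu": ["Not Offensive", "Targeted", "Untargeted"],
}

def num_classes(language: str) -> int:
    """Return the number of output classes a classifier for this language needs."""
    return len(LABEL_SCHEMES[language])
```

<p>A classifier head is then sized per language, e.g. three output units for Tulu but five for Malayalam.</p>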
<p>Example comments and labels from the Tamil dataset:
“14.12.2018 epo trailer pathutu irken ... Semaya iruku” → Not_offensive;
“Paka thano poro movie la Enna irukunu” → Not_offensive;
“U kena tunggu lebih lama lagi untuk tahu saya – chiyaan recognized” → not-Tamil;
“Suriya anna vera level anna mass” → Not_offensive;
“suma katththaatha da sound over a pooda kudaathu pa s3 1 month oda stop aakidum then bairavaa da aadchi than katthti katthti thondaiya kilikatha pa” → Offensive_Untargeted.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Collection</title>
        <p>The Data Collection module plays a key role in compiling code-mixed text in Tamil, Malayalam, Kannada,
and Tulu from varied sources such as social media platforms, discussion forums, and publicly available
repositories. This step is designed to capture a broad spectrum of linguistic features, including regional
dialects, colloquial slang, and examples of both offensive and non-offensive language. Creating a diverse
and balanced dataset is crucial, as it forms the foundation for training machine learning models capable
of reliably identifying harmful content within multilingual and code-mixed contexts. In addition, proper
sampling ensures fair representation of all classes, reducing bias in model predictions. The quality of
data collected at this stage directly influences the accuracy and generalizability of the final system.</p>
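<p>The class-balanced sampling mentioned above can be enforced with a stratified split. The sketch below, with hypothetical comment strings and a function name of our choosing, groups examples by label before partitioning so each class keeps its proportion in both splits:</p>

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_fraction=0.2, seed=13):
    """Split (text, label) pairs so each class keeps roughly the same
    proportion in the train and test partitions, reducing sampling bias."""
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    rng = random.Random(seed)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = int(round(len(items) * test_fraction))
        test += [(t, label) for t in items[:cut]]
        train += [(t, label) for t in items[cut:]]
    return train, test

# Toy corpus: an 80/20 class imbalance typical of offensive-language data.
texts = [f"comment {i}" for i in range(100)]
labels = ["Not Offensive"] * 80 + ["Offensive"] * 20
train, test = stratified_split(texts, labels)
```

<p>With this split the minority class appears in the test set at its corpus-level rate, so reported scores are not dominated by the majority class.</p>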
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Data Preprocessing and Feature Extraction</title>
        <p>Before being processed by machine learning models, the text must be standardized and transformed
into a suitable format. The preprocessing stage cleans raw code-mixed text by removing noise such as
URLs, special characters, emoticons, and extra whitespace, while also performing language-specific
tasks like tokenization and stemming for both Dravidian and English components. This improves data
consistency and prepares it for feature extraction.</p>
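<p>A minimal version of the cleaning step described above (URL, mention/hashtag, symbol, and whitespace removal) might look like the following sketch; the patterns and function name are ours, and Unicode word characters are kept so native-script Tamil, Malayalam, and Kannada text survives:</p>

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"[@#]\w+")
# Keep letters in any script, digits, and basic punctuation;
# drop emoticons and other symbol noise.
NOISE_RE = re.compile(r"[^\w\s.,!?'\"-]", re.UNICODE)

def clean(text: str) -> str:
    """Strip URLs, mentions/hashtags, and symbol noise, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

<p>Tokenization and stemming would follow on the cleaned string; transformer tokenizers apply their own subword segmentation afterwards.</p>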
        <p>After preprocessing, the text is converted into dense embeddings using transformer-based models
such as IndicBERTv2-m2m and TwHIN-BERT. Unlike traditional approaches like TF-IDF or n-grams,
transformers capture semantic and syntactic patterns across multiple languages and scripts, which is
crucial in code-mixed contexts where context often shifts between English and Dravidian. Leveraging
these pretrained multilingual embeddings provides rich representations that strengthen the model’s
ability to distinguish offensive from non-offensive content.</p>
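<p>A sentence-level embedding is typically pooled from the token-level hidden states the transformer returns. The sketch below shows masked mean pooling in NumPy; a random array stands in for real IndicBERTv2-m2m or TwHIN-BERT outputs, and the 768 dimension is the usual BERT-base size, assumed here for illustration:</p>

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    hidden_states: (batch, seq_len, dim) token-level transformer outputs.
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1.0)  # avoid division by zero
    return summed / counts

# Stand-in for a real model forward pass.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 6, 768))
mask = np.array([[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]])
sentence_vecs = mean_pool(hidden, mask)  # shape (2, 768)
```

<p>The pooled vectors then feed the classification head, so padding tokens added for batching never influence the prediction.</p>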
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Model Training and Selection</title>
<p>The Model Training and Selection phase focuses on fine-tuning transformer-based models on
code-mixed datasets. In this work, IndicBERTv2-m2m is utilized for Tulu and Malayalam, while TwHIN-BERT
is applied to Tamil and Kannada. Each model is trained with language-specific annotated corpora so
that it can adapt to the unique traits of the respective code-mixed languages. The evaluation of model
performance is carried out using common metrics such as accuracy, precision, recall, and F1-score.
Based on these results, the most suitable transformer model is chosen for each language, enabling
reliable identification of offensive and abusive expressions in Dravidian social media content. This
process also helps reveal cross-lingual differences, showing how certain architectures generalize better
to low-resource contexts. Moreover, careful model selection ensures that downstream applications,
such as moderation systems, remain both efficient and scalable.</p>
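<p>The selection metrics named above can be computed directly from predictions. A self-contained sketch of macro-averaged F1 with toy labels (label strings and function name are illustrative):</p>

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 computed independently, then averaged,
    so minority offensive classes weigh as much as the majority class."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

y_true = ["NOT", "NOT", "NOT", "OFF", "OFF"]
y_pred = ["NOT", "NOT", "OFF", "OFF", "NOT"]
score = macro_f1(y_true, y_pred)  # (F1_NOT + F1_OFF) / 2
```

<p>Macro averaging is the conventional choice for imbalanced offensive-language data, since plain accuracy can look high while the offensive classes are missed entirely.</p>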
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>The experimental analysis of the proposed system indicates that abusive language detection yields
varied outcomes across Dravidian languages, highlighting the importance of adopting models tailored
to each language. For Tulu [3] and Malayalam, IndicBERTv2-m2m produced stronger results, whereas
TwHIN-BERT achieved the highest accuracy on Tamil and Kannada text [7]. These differences arise
from the unique linguistic structures, levels of code-mixing, and dataset characteristics associated with
each language, demonstrating that no single algorithm consistently delivers the best performance
across all cases. The results further reveal that thorough preprocessing combined with effective feature
extraction substantially enhances model accuracy, enabling reliable identification of offensive content
within multilingual and code-mixed social media. In addition, the findings emphasize the role of dataset
balance, as skewed distributions tend to reduce performance on minority classes. These insights provide
valuable guidance for building future models that can handle both linguistic diversity and resource
scarcity more effectively.</p>
      <sec id="sec-4-1">
        <title>4.1. Performance Analysis</title>
<p>The evaluation highlights persistent challenges in detecting offensive content across Dravidian
code-mixed languages, largely due to the linguistic variability present in social media communication. Slang,
acronyms, emojis, and region-specific informal expressions frequently disrupt semantic clarity, leading
to misclassifications. Contextual ambiguity further complicates detection, as meanings shift based on
discourse, speaker intent, or cultural nuance, causing both false positives and false negatives. Malayalam
and Tulu suffer from small annotated datasets, limiting generalization and reducing robustness on
unseen inputs. Tamil exhibits heavy code-switching between English and native scripts, producing
multiple orthographic variants for the same term and making contextual understanding more difficult.
Kannada faces additional issues such as spelling inconsistencies, mixed-script usage, and borrowing
from neighboring languages, all of which introduce noise into classification.</p>
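<p>One cheap normalization for the orthographic variation noted above is to collapse elongated character runs in Romanized comments before tokenization, so stylistic variants of the same word map to one form. A small sketch (the collapse-to-two threshold is our choice, not a method from the paper):</p>

```python
import re

# Collapse any run of three or more identical characters to two, so
# elongated variants ("massss", "semmaaaa") share a canonical form while
# legitimate doubled letters ("anna") are left untouched.
ELONGATION_RE = re.compile(r"(.)\1{2,}", re.DOTALL)

def collapse_elongation(text: str) -> str:
    return ELONGATION_RE.sub(r"\1\1", text)
```

<p>Applied before subword tokenization, this reduces the number of distinct spellings the embedding layer must cover, which matters most for the small Tulu and Malayalam datasets.</p>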
        <p>These findings underscore the need for larger and more representative datasets, along with advanced
context-aware architectures capable of modeling fine-grained linguistic cues. Incorporating linguistic
tools such as morphological analyzers, character-level models, or subword embeddings may improve the
handling of complex structures like agglutination, dialectal variations, and transliteration. Approaches
such as cross-lingual transfer learning, data augmentation, and pretraining on region-specific corpora
can further enhance robustness. Despite these limitations, the proposed system establishes a strong
foundation for future research and offers practical insights for building effective moderation tools to
manage offensive content in multilingual online platforms.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Error Analysis</title>
        <p>The performance evaluation highlights several challenges in detecting objectionable content across
code-mixed Dravidian languages. Slang, acronyms, and region-specific informal expressions often
cause misclassifications, while contextual ambiguity leads to false positives and negatives when words
shift meaning across situations. For Malayalam and Tulu, the limited size of annotated datasets
reduces generalization and weakens performance on unseen data. In Tamil, frequent code-switching
between English and native scripts complicates context detection, while Kannada suffers from spelling
inconsistencies, mixed-script usage, and borrowed terms from neighboring languages, all of which add
noise. These issues emphasize the need for larger, more representative datasets and advanced
context-aware models to improve detection accuracy. Nevertheless, the system provides a strong starting point
for future research and practical control of offensive content in multilingual online platforms.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
<p>The proposed offensive language detection method performs well but still faces limitations. It depends
on small annotated datasets that fail to capture regional dialects, slang, and code-mixed expressions in
Tamil, Malayalam, and Tulu. Ambiguous sentences can lead to misclassification, and rapidly changing,
informal social media language may further reduce accuracy. Larger, more diverse datasets and advanced
context-aware methods are needed to improve robustness and generalization.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        Evaluation outcomes demonstrate that the performance of abusive language detection varies across
Dravidian languages, emphasizing the importance of selecting models tailored to individual linguistic
contexts [
        <xref ref-type="bibr" rid="ref1">1</xref>
]. IndicBERTv2-m2m achieved higher accuracy for Tulu and Malayalam [3], whereas
TwHIN-BERT performed better for Tamil and Kannada [7]. The proposed system offers a solid basis for
automatic content filtering, even though some errors still occur because of unclear context, slang, and
small datasets. Overall, this work promotes safer online communication and lays
the groundwork for future advancements, such as context-aware algorithms and larger datasets for more
reliable offensive language identification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Project Repository</title>
<p>The full source code for this project is available on GitHub (repository: Chandrant-chan).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
<p>In the course of preparing this manuscript, the author(s) employed the generative AI tool ChatGPT. Its
use was limited to performing checks for grammar and spelling. Following this, the author(s) conducted
a thorough review and revision of the text and assume full responsibility for the final published content.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[3] A. M. D, D. Vikram, B. R. Chakravarthi, P. R. Hegde, Overcoming low-resource barriers in Tulu:
Neural models and corpus creation for offensive language identification, 2025. URL:
https://arxiv.org/abs/2508.11166. arXiv:2508.11166.</p>
      <p>[4] B. R. Chakravarthi, R. Priyadharshini, N. Jose, T. Mandl, P. K. Kumaresan, R. Ponnusamy, J. P.
McCrae, E. Sherly, et al., Findings of the shared task on offensive language identification in Tamil,
Malayalam, and Kannada, in: Proceedings of the First Workshop on Speech and Language Technologies
for Dravidian Languages, 2021, pp. 133–145.</p>
      <p>[5] B. R. Chakravarthi, M. B. Jagadeeshan, V. Palanikumar, R. Priyadharshini, Offensive language
identification in Dravidian languages using MPNet and CNN, International Journal of Information
Management Data Insights 3 (2023) 100151. URL:
https://www.sciencedirect.com/science/article/pii/S2667096822000945. doi:10.1016/j.jjimei.2022.100151.</p>
      <p>[6] B. R. Chakravarthi, et al., DOSA: Dravidian code-mixed offensive span identification dataset, in:
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages,
2021.</p>
      <p>[7] S. Roy, et al., HOTTEST: Hate and offensive content identification in Tamil using deep learning, in:
Proceedings of the DravidianLangTech Workshop, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sripriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Subalalitha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Anusha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vikram</surname>
          </string-name>
          ,
<article-title>Overview of the shared task on offensive language identification in Dravidian code-mixed languages</article-title>
          , in: Forum for Information Retrieval Evaluation (FIRE)
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
<article-title>DravidianCodeMix: Sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>56</volume>
          (
          <year>2022</year>
          )
          <fpage>765</fpage>
          -
          <lpage>806</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>