<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>DravidianCodeMix 2025: Comparative Study of Transformer-based Models for Offensive Content Detection in Tamil, Malayalam, Kannada and Tulu Code-Mixed Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Santhiya P</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akshitha A V</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arul Murugan S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chandran T</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kongu Engineering College</institution>
          ,
          <addr-line>Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>20</lpage>
      <abstract>
<p>Detecting offensive language in Dravidian code-mixed text is essential for safe digital interaction, especially on social media platforms where multilingual exchanges are common. This work builds classification models for four major Dravidian languages: Tamil, Malayalam, Kannada, and Tulu. We explore transformer-based approaches and evaluate their effectiveness across the datasets. Our study shows that IndicBERTv2-m2m was more effective for Tulu and Malayalam, whereas TwHIN-BERT yielded better outcomes for Tamil and Kannada. These observations emphasize that model suitability varies by language, highlighting the necessity of language-specific strategies for offensive content detection. Our work also provides a foundation for handling low-resource and code-mixed text challenges, and the datasets facilitate research into cross-lingual and multilingual learning approaches. Ultimately, these efforts contribute to safer online environments and more inclusive digital communication.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Survey</title>
      <p>
Research on offensive language identification in Dravidian code-mixed text has grown rapidly, largely
driven by shared tasks and dataset creation. The FIRE 2020 and 2021 shared tasks provided early
evaluations on Tamil, Malayalam, and Kannada, highlighting the challenges of detecting offensive
expressions in code-mixed settings and motivating the use of both classical machine learning and
multilingual transformer methods [4]. Later FIRE tasks expanded coverage to more languages and
refined annotation schemes, enabling broader evaluation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
A key resource is the DravidianCodeMix dataset, which offers annotated corpora for Tamil, Malayalam,
and Kannada [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. It includes multiple offensive categories and reflects natural variation in social media
text, such as Romanization and spelling inconsistencies, making it a realistic benchmark. More recently,
a low-resource corpus for Tulu extended research to another Dravidian language, addressing severe data
scarcity and enabling cross-lingual transfer studies [3]. Together, these datasets provide a foundation
for multilingual modeling and sociolinguistic analysis of abusive discourse.
      </p>
<p>Beyond resources, model development has advanced performance. While early multilingual
transformers achieved strong baselines, results varied by language due to dataset size and code-mixing
complexity. To improve on this, Chakravarthi et al. [5] introduced a multilingual MPNet and CNN fusion
model for Tamil, Malayalam, and Kannada. Their hybrid architecture handled code-mixing effectively
and outperformed both traditional machine learning and single-model transformers.</p>
<p>Overall, shared tasks and datasets such as DravidianCodeMix and the Tulu corpus have standardized evaluation
and expanded coverage, while hybrid deep learning models have established robust baselines [5]. These
developments provide a pathway for improving low-resource adaptation and advancing offensive
language detection in multilingual social media.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <sec id="sec-3-1">
<title>3.1. Dataset Description</title>
<p>This work analyzes code-mixed datasets for four Dravidian languages—Tamil, Malayalam, Kannada, and
Tulu—collected from social media platforms where English is often blended with regional languages.
Each dataset is annotated as offensive or non-offensive, with subcategories distinguishing insults
targeted at individuals, groups, or untargeted abuse. The Tamil dataset contains categories such as
Not Offensive, Insult Individual, Insult Group, and Untargeted, while the Malayalam dataset adds an
Other Language class for cross-lingual entries. Kannada follows a similar structure but introduces a
Not-Kannada class to capture comments written in English, Hindi, or Romanized forms. The smaller
Tulu dataset uses Not Offensive, Targeted, and Untargeted categories, providing insights into
low-resource Tulu-English interactions. Related initiatives include the DOSA dataset for offensive span
identification in Dravidian languages [6] and a recent corpus for Tulu offensive language detection
[3], both underscoring the need to address low-resource challenges. These datasets not only enable
systematic evaluation of offensive language detection methods but also reflect real-world issues like
code-switching, inconsistent spellings, and dialectal variation. Collectively, they form a solid basis for
advancing abusive content detection in Dravidian code-mixed contexts. Additionally, they support
experimentation with multilingual and transformer-based models to handle mixed-language inputs
effectively, and they provide opportunities to analyze sociolinguistic patterns and user behavior in
online interactions. Future research can leverage these resources to improve cross-lingual transfer and
low-resource learning strategies.</p>
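<p>The per-language label inventories described above can be summarized in a small mapping. This is a sketch: the exact label strings in the released dataset files may be spelled differently, and the dictionary and function names are ours.</p>

```python
# Per-language label inventories for the four code-mixed datasets,
# following the category names given in Section 3.1. Spellings in the
# released files may differ; these names are illustrative.
LABEL_SCHEMES = {
    "tamil": ["Not Offensive", "Insult Individual", "Insult Group", "Untargeted"],
    "malayalam": ["Not Offensive", "Insult Individual", "Insult Group",
                  "Untargeted", "Other Language"],
    "kannada": ["Not Offensive", "Insult Individual", "Insult Group",
                "Untargeted", "Not-Kannada"],
    "tulu": ["Not Offensive", "Targeted", "Untargeted"],
}

def num_classes(language: str) -> int:
    """Return the number of output classes a classifier for this language needs."""
    return len(LABEL_SCHEMES[language])
```

<p>A classifier head is then sized per language, e.g. three output units for Tulu but five for Malayalam.</p>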
<p>Example comments and labels from the Tamil dataset:
“14.12.2018 epo trailer pathutu irken ... Semaya iruku” → Not_offensive;
“Paka thano poro movie la Enna irukunu” → Not_offensive;
“U kena tunggu lebih lama lagi untuk tahu saya – chiyaan recognized” → not-Tamil;
“Suriya anna vera level anna mass” → Not_offensive;
“suma katththaatha da sound over a pooda kudaathu pa s3 1 month oda stop aakidum then bairavaa da aadchi than katthti katthti thondaiya kilikatha pa” → Offensive_Untargeted.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Collection</title>
        <p>The Data Collection module plays a key role in compiling code-mixed text in Tamil, Malayalam, Kannada,
and Tulu from varied sources such as social media platforms, discussion forums, and publicly available
repositories. This step is designed to capture a broad spectrum of linguistic features, including regional
dialects, colloquial slang, and examples of both offensive and non-offensive language. Creating a diverse
and balanced dataset is crucial, as it forms the foundation for training machine learning models capable
of reliably identifying harmful content within multilingual and code-mixed contexts. In addition, proper
sampling ensures fair representation of all classes, reducing bias in model predictions. The quality of
data collected at this stage directly influences the accuracy and generalizability of the final system.</p>
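<p>The class-balanced sampling mentioned above can be enforced with a stratified split. The sketch below, with hypothetical comment strings and a function name of our choosing, groups examples by label before partitioning so each class keeps its proportion in both splits:</p>

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_fraction=0.2, seed=13):
    """Split (text, label) pairs so each class keeps roughly the same
    proportion in the train and test partitions, reducing sampling bias."""
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    rng = random.Random(seed)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = int(round(len(items) * test_fraction))
        test += [(t, label) for t in items[:cut]]
        train += [(t, label) for t in items[cut:]]
    return train, test

# Toy corpus: an 80/20 class imbalance typical of offensive-language data.
texts = [f"comment {i}" for i in range(100)]
labels = ["Not Offensive"] * 80 + ["Offensive"] * 20
train, test = stratified_split(texts, labels)
```

<p>With this split the minority class appears in the test set at its corpus-level rate, so reported scores are not dominated by the majority class.</p>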
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Data Preprocessing and Feature Extraction</title>
        <p>Before being processed by machine learning models, the text must be standardized and transformed
into a suitable format. The preprocessing stage cleans raw code-mixed text by removing noise such as
URLs, special characters, emoticons, and extra whitespace, while also performing language-specific
tasks like tokenization and stemming for both Dravidian and English components. This improves data
consistency and prepares it for feature extraction.</p>
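<p>A minimal version of the cleaning step described above (URL, mention/hashtag, symbol, and whitespace removal) might look like the following sketch; the patterns and function name are ours, and Unicode word characters are kept so native-script Tamil, Malayalam, and Kannada text survives:</p>

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"[@#]\w+")
# Keep letters in any script, digits, and basic punctuation;
# drop emoticons and other symbol noise.
NOISE_RE = re.compile(r"[^\w\s.,!?'\"-]", re.UNICODE)

def clean(text: str) -> str:
    """Strip URLs, mentions/hashtags, and symbol noise, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

<p>Tokenization and stemming would follow on the cleaned string; transformer tokenizers apply their own subword segmentation afterwards.</p>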
        <p>After preprocessing, the text is converted into dense embeddings using transformer-based models
such as IndicBERTv2-m2m and TwHIN-BERT. Unlike traditional approaches like TF-IDF or n-grams,
transformers capture semantic and syntactic patterns across multiple languages and scripts, which is
crucial in code-mixed contexts where context often shifts between English and Dravidian. Leveraging
these pretrained multilingual embeddings provides rich representations that strengthen the model’s
ability to distinguish offensive from non-offensive content.</p>
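<p>A sentence-level embedding is typically pooled from the token-level hidden states the transformer returns. The sketch below shows masked mean pooling in NumPy; a random array stands in for real IndicBERTv2-m2m or TwHIN-BERT outputs, and the 768 dimension is the usual BERT-base size, assumed here for illustration:</p>

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    hidden_states: (batch, seq_len, dim) token-level transformer outputs.
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1.0)  # avoid division by zero
    return summed / counts

# Stand-in for a real model forward pass.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 6, 768))
mask = np.array([[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]])
sentence_vecs = mean_pool(hidden, mask)  # shape (2, 768)
```

<p>The pooled vectors then feed the classification head, so padding tokens added for batching never influence the prediction.</p>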
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Model Training and Selection</title>
<p>The Model Training and Selection phase focuses on fine-tuning transformer-based models on
code-mixed datasets. In this work, IndicBERTv2-m2m is utilized for Tulu and Malayalam, while TwHIN-BERT
is applied to Tamil and Kannada. Each model is trained with language-specific annotated corpora so
that it can adapt to the unique traits of the respective code-mixed languages. The evaluation of model
performance is carried out using common metrics such as accuracy, precision, recall, and F1-score.
Based on these results, the most suitable transformer model is chosen for each language, enabling
reliable identification of offensive and abusive expressions in Dravidian social media content. This
process also helps reveal cross-lingual differences, showing how certain architectures generalize better
to low-resource contexts. Moreover, careful model selection ensures that downstream applications,
such as moderation systems, remain both efficient and scalable.</p>
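<p>The selection metrics named above can be computed directly from predictions. A self-contained sketch of macro-averaged F1 with toy labels (label strings and function name are illustrative):</p>

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 computed independently, then averaged,
    so minority offensive classes weigh as much as the majority class."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

y_true = ["NOT", "NOT", "NOT", "OFF", "OFF"]
y_pred = ["NOT", "NOT", "OFF", "OFF", "NOT"]
score = macro_f1(y_true, y_pred)  # (F1_NOT + F1_OFF) / 2
```

<p>Macro averaging is the conventional choice for imbalanced offensive-language data, since plain accuracy can look high while the offensive classes are missed entirely.</p>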
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>The experimental analysis of the proposed system indicates that abusive language detection yields
varied outcomes across Dravidian languages, highlighting the importance of adopting models tailored
to each language. For Tulu [3] and Malayalam, IndicBERTv2-m2m produced stronger results, whereas
TwHIN-BERT achieved the highest accuracy on Tamil and Kannada text [7]. These differences arise
from the unique linguistic structures, levels of code-mixing, and dataset characteristics associated with
each language, demonstrating that no single algorithm consistently delivers the best performance
across all cases. The results further reveal that thorough preprocessing combined with effective feature
extraction substantially enhances model accuracy, enabling reliable identification of offensive content
within multilingual and code-mixed social media. In addition, the findings emphasize the role of dataset
balance, as skewed distributions tend to reduce performance on minority classes. These insights provide
valuable guidance for building future models that can handle both linguistic diversity and resource
scarcity more effectively.</p>
      <sec id="sec-4-1">
        <title>4.1. Performance Analysis</title>
<p>The evaluation highlights persistent challenges in detecting offensive content across Dravidian
code-mixed languages, largely due to the linguistic variability present in social media communication. Slang,
acronyms, emojis, and region-specific informal expressions frequently disrupt semantic clarity, leading
to misclassifications. Contextual ambiguity further complicates detection, as meanings shift based on
discourse, speaker intent, or cultural nuance, causing both false positives and false negatives. Malayalam
and Tulu suffer from small annotated datasets, limiting generalization and reducing robustness on
unseen inputs. Tamil exhibits heavy code-switching between English and native scripts, producing
multiple orthographic variants for the same term and making contextual understanding more difficult.
Kannada faces additional issues such as spelling inconsistencies, mixed-script usage, and borrowing
from neighboring languages, all of which introduce noise into classification.</p>
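<p>One cheap normalization for the orthographic variation noted above is to collapse elongated character runs in Romanized comments before tokenization, so stylistic variants of the same word map to one form. A small sketch (the collapse-to-two threshold is our choice, not a method from the paper):</p>

```python
import re

# Collapse any run of three or more identical characters to two, so
# elongated variants ("massss", "semmaaaa") share a canonical form while
# legitimate doubled letters ("anna") are left untouched.
ELONGATION_RE = re.compile(r"(.)\1{2,}", re.DOTALL)

def collapse_elongation(text: str) -> str:
    return ELONGATION_RE.sub(r"\1\1", text)
```

<p>Applied before subword tokenization, this reduces the number of distinct spellings the embedding layer must cover, which matters most for the small Tulu and Malayalam datasets.</p>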
        <p>These findings underscore the need for larger and more representative datasets, along with advanced
context-aware architectures capable of modeling fine-grained linguistic cues. Incorporating linguistic
tools such as morphological analyzers, character-level models, or subword embeddings may improve the
handling of complex structures like agglutination, dialectal variations, and transliteration. Approaches
such as cross-lingual transfer learning, data augmentation, and pretraining on region-specific corpora
can further enhance robustness. Despite these limitations, the proposed system establishes a strong
foundation for future research and offers practical insights for building effective moderation tools to
manage offensive content in multilingual online platforms.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Error Analysis</title>
        <p>The performance evaluation highlights several challenges in detecting objectionable content across
code-mixed Dravidian languages. Slang, acronyms, and region-specific informal expressions often
cause misclassifications, while contextual ambiguity leads to false positives and negatives when words
shift meaning across situations. For Malayalam and Tulu, the limited size of annotated datasets
reduces generalization and weakens performance on unseen data. In Tamil, frequent code-switching
between English and native scripts complicates context detection, while Kannada suffers from spelling
inconsistencies, mixed-script usage, and borrowed terms from neighboring languages, all of which add
noise. These issues emphasize the need for larger, more representative datasets and advanced
context-aware models to improve detection accuracy. Nevertheless, the system provides a strong starting point
for future research and practical control of offensive content in multilingual online platforms.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
<p>The proposed offensive language detection method performs well but still faces limitations. It depends
on small annotated datasets that fail to capture regional dialects, slang, and code-mixed expressions in
Tamil, Malayalam, and Tulu. Ambiguous sentences can lead to misclassification, and rapidly changing,
informal social media language may further reduce accuracy. Larger, more diverse datasets and advanced
context-aware methods are needed to improve robustness and generalization.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        Evaluation outcomes demonstrate that the performance of abusive language detection varies across
Dravidian languages, emphasizing the importance of selecting models tailored to individual linguistic
contexts [
        <xref ref-type="bibr" rid="ref1">1</xref>
]. IndicBERTv2-m2m achieved higher accuracy for Tulu and Malayalam [3], whereas
TwHIN-BERT performed better for Tamil and Kannada [7]. The proposed system offers a solid basis for
automatic content filtering, even though some errors still occur because of unclear context, slang, and
small datasets. Overall, this work promotes safer online communication and lays
the groundwork for future advancements, such as context-aware algorithms and larger datasets for more
reliable offensive language identification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Project Repository</title>
<p>The full source code for this project is available on GitHub (repository: Chandrant-chan).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
<p>In the course of preparing this manuscript, the author(s) employed the generative AI tool ChatGPT. Its
use was limited to performing checks for grammar and spelling. Following this, the author(s) conducted
a thorough review and revision of the text and assume full responsibility for the final published content.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[3] A. M. D, D. Vikram, B. R. Chakravarthi, P. R. Hegde, Overcoming low-resource barriers in Tulu:
Neural models and corpus creation for offensive language identification, 2025. URL:
https://arxiv.org/abs/2508.11166. arXiv:2508.11166.</p>
      <p>[4] B. R. Chakravarthi, R. Priyadharshini, N. Jose, T. Mandl, P. K. Kumaresan, R. Ponnusamy, J. P.
McCrae, E. Sherly, et al., Findings of the shared task on offensive language identification in Tamil,
Malayalam, and Kannada, in: Proceedings of the First Workshop on Speech and Language Technologies
for Dravidian Languages, 2021, pp. 133–145.</p>
      <p>[5] B. R. Chakravarthi, M. B. Jagadeeshan, V. Palanikumar, R. Priyadharshini, Offensive language
identification in Dravidian languages using MPNet and CNN, International Journal of Information
Management Data Insights 3 (2023) 100151. URL:
https://www.sciencedirect.com/science/article/pii/S2667096822000945. doi:10.1016/j.jjimei.2022.100151.</p>
      <p>[6] B. R. Chakravarthi, et al., DOSA: Dravidian code-mixed offensive span identification dataset, in:
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages,
2021.</p>
      <p>[7] S. Roy, et al., HOTTEST: Hate and offensive content identification in Tamil using deep learning, in:
Proceedings of the DravidianLangTech Workshop, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sripriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Subalalitha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Anusha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vikram</surname>
          </string-name>
          ,
<article-title>Overview of the shared task on offensive language identification in Dravidian code-mixed languages</article-title>
          , in: Forum for Information Retrieval Evaluation (FIRE)
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
<article-title>DravidianCodeMix: Sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>56</volume>
          (
          <year>2022</year>
          )
          <fpage>765</fpage>
          -
          <lpage>806</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>