<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>DravidianCodeMix 2025: Empirical Analysis of Classical Machine Learning Approaches in Tamil, Malayalam, and Tulu Code-Mixed Offensive Content Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shudapreyaa R S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Surya U S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Swetha M</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandeep P S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kongu Engineering College</institution>
          ,
          <addr-line>Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Identifying offensive language in Dravidian code-mixed texts is a critical responsibility for maintaining healthy online communication, particularly on social media platforms where multilingual interactions are prevalent. The main goal of this project is to build machine learning models that classify offensive content in three Dravidian languages: Tamil, Malayalam, and Tulu. We implemented and tested a variety of algorithms to find the best method for each language. According to the experimental results, Linear SVC performed best for Tulu, Random Forest produced better results for Malayalam, and Logistic Regression outperformed the other models for Tamil. These results show that no single algorithm is uniformly dominant across all three languages, underscoring the importance of language-specific algorithm selection in offensive language identification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Survey</title>
      <p>
        Every day, millions of comments are left on uploaded posts, driven by the rise of netizen culture and
social media, and the use of derogatory language in user comments has increased dramatically. Online
comments that contain abusive language initiate cyberbullying [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which targets both groups of people
(of a certain nation, age, or religion) and individuals (a politician, celebrity, or product). Automated
detection and analysis of abusive language in online comments are therefore crucial. In the literature,
there have been multiple attempts to identify abusive language in English. One such study applied five
machine learning models (NB, SVM, IBK, Logistic, and JRip) and four deep learning models (CNN,
LSTM, BLSTM, and CLSTM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) to recognize abusive language in Urdu and Roman Urdu comments.
      </p>
      <p>
        Tanjim Mahmud et al.[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have proposed a system that creates sophisticated machine learning and
deep learning models for identifying child abusive texts in the Bengali language on online platforms;
this study tackles the pressing problem of child abuse in digital communications. Its main objective is
to develop a practical tool for precisely recognizing abusive content to aid in the prevention of child
abuse. The model differentiates between abusive and non-abusive material by combining deep learning
methods with natural language processing (NLP) approaches.
      </p>
      <p>
        Dhanyashree G et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] highlighted that while social networks serve as major platforms for
engagement and communication, they are also increasingly misused for gender-based abuse, particularly
targeting women with demeaning and harassing remarks. Their study, focusing on Malayalam and Tamil
YouTube comments, aimed to identify explicit abuse, implicit bias, stereotypes, and coded language.
To address this, they evaluated multiple machine learning models, including Support Vector Machines
(SVM), Logistic Regression (LR), and Naive Bayes classifiers, for categorizing comments into abusive
and non-abusive groups.
      </p>
      <p>Anwar Hossain et al. [7] address the spread of hate speech on social media, one of the major
problems significantly impacting society, as it contributes to increased violence, discrimination, and
societal disintegration. Because of adversarial manipulations and cultural, linguistic, and contextual
complexities, identifying hate speech is inherently difficult. Their work methodically examines how
well LLMs perform at identifying hate speech across geographical contexts and multilingual datasets,
and offers a novel assessment approach that considers three factors: robustness to adversarially created
text, geography-aware contextual detection, and binary classification of hate speech.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Taskset Description</title>
        <p>The study explores training datasets from three Dravidian code-mixed languages—Tamil, Malayalam,
and Tulu—made up of real social media comments that freely mix English and native scripts. While
all three datasets contain offensive and non-offensive labels, the amount of detail in these annotations
differs across languages. The Tamil dataset is the most detailed, with separate categories for targeted
insults aimed at individuals, groups, or other entities, along with untargeted insults and comments
written in other languages [8]. The Malayalam dataset is balanced but more challenging to work with
because of its rich morphology, shifting dialects, and frequent code-switching, which can make offensive
expressions harder to detect [9]. The Tulu dataset is smaller in size but still meaningful, offering useful
insights for low-resource language research by capturing typical patterns of neutral, targeted, and
untargeted offensive speech found online. When comparing models across these datasets, each language
shows a different best performer: Logistic Regression works well for Tamil due to its strength with
sparse text features, Random Forest handles Malayalam effectively by modeling its non-linear linguistic
patterns, and Linear SVC suits Tulu because of its stability with limited training data. Overall, the
results emphasize how linguistic characteristics, dataset size, and annotation depth play a crucial role
in determining which machine learning approach is most effective for detecting offensive content in
Dravidian code-mixed text.</p>
        <p>Tamil Dataset: Labels include Not Offensive, Targeted Insult (Individual, Group, Other), Untargeted,
and Other Language.</p>
        <p>Tulu Dataset: Classified into Not Offensive, Offensive Targeted, and Offensive Untargeted.</p>
        <p>Malayalam Dataset: Similar to Tamil, with categories for Not Offensive, Targeted Insults, Untargeted,
and Other Language.</p>
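        <p>The label sets above can be inspected for class balance before training. The sketch below is illustrative: the label strings mirror the Tulu tag set, but the counts are invented, not the actual dataset statistics.</p>

```python
# Sketch of inspecting class balance in a labelled dataset.
# The counts below are illustrative, not the real Tulu statistics.
from collections import Counter


def label_distribution(labels):
    """Return each label with its count and share of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, round(n / total, 3)) for label, n in counts.items()}


# Example with the Tulu tag set described above (invented counts).
tulu_labels = (["Not offensive"] * 6
               + ["Offensive_Targeted"] * 3
               + ["Offensive_Untargeted"] * 1)
print(label_distribution(tulu_labels))
```

        <p>A skew such as the one sketched here is exactly what motivates the per-class error analysis in Section 5.</p>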
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Collection</title>
        <p>The Data Collection module is in charge of compiling code-mixed textual data in Tamil, Malayalam,
and Tulu from a variety of sources, including forums, social networking sites, and publicly accessible
datasets. This module guarantees that the dataset covers a variety of content categories, including slang,
dialects, and samples that are both offensive and non-offensive. Training machine learning models that
can reliably detect harmful content in multilingual and code-mixed environments requires a carefully
curated dataset.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Data Pre-processing</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Lowercasing</title>
          <p>Preprocessing was essential for preparing the text for classification and involved the following steps.
First, all characters were converted to lowercase to maintain consistency and avoid treating words like
’Good’ and ’good’ as different tokens.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Noise Removal</title>
          <p>Unwanted elements such as URLs, mentions (@username), hashtags, numbers, punctuation, and special
characters were removed to reduce irrelevant noise.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Language Filtering</title>
          <p>Only Tamil and English characters were retained by restricting the text to the corresponding Unicode
ranges, ensuring that irrelevant scripts and symbols were excluded.</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.4. Whitespace Normalization</title>
          <p>Multiple spaces were collapsed into a single space, and leading/trailing whitespace was removed, making
the text cleaner and more uniform.</p>
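          <p>Taken together, the four preprocessing steps above can be sketched as a single cleaning function. This is an illustrative reconstruction, not the authors’ exact code; the Unicode range shown covers the Tamil block, and the Malayalam and Tulu scripts would substitute their own blocks.</p>

```python
import re

# Illustrative sketch of the preprocessing pipeline described in Section 3.3.
# The range \u0B80-\u0BFF is the Tamil Unicode block; the Malayalam and Tulu
# datasets would use their corresponding script blocks instead.
def clean_text(text):
    text = text.lower()                                  # 3.3.1 lowercasing
    text = re.sub(r"http\S+|www\.\S+", " ", text)        # 3.3.2 URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # 3.3.2 mentions, hashtags
    text = re.sub(r"\d+", " ", text)                     # 3.3.2 numbers
    text = re.sub(r"[^a-z\u0B80-\u0BFF\s]", " ", text)   # 3.3.3 keep Tamil + English
    text = re.sub(r"\s+", " ", text).strip()             # 3.3.4 whitespace
    return text

print(clean_text("Check http://t.co/x @user #tag 123 Good!!"))  # -> "check good"
```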
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Feature Extraction</title>
        <p>In order to make machine learning models understand preprocessed text, the Feature Extraction module
converts it into numerical representations. Word embeddings, term frequency-inverse document
frequency (TF-IDF) vectors, and n-gram generation are examples of common methods. In order to
capture contextual nuances, this module may also include linguistic or semantic elements unique to
code-mixed Dravidian texts. Effective feature extraction improves the model’s capacity to reliably
distinguish between offensive and non-offensive content.</p>
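        <p>The TF-IDF option can be sketched with scikit-learn’s TfidfVectorizer. The analyzer and n-gram settings below are assumptions rather than the paper’s reported configuration; character n-grams are often favoured for code-mixed text because they tolerate spelling variation.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative TF-IDF features with character n-grams over word boundaries.
# The corpus and all parameter values here are assumptions for demonstration.
corpus = [
    "super video bro",
    "very worst comment",
    "nalla irukku",  # romanized Tamil: "it is good"
]
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), min_df=1)
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (3 documents, vocabulary-size features)
```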
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Model Training and Selection</title>
        <p>To process the feature-rich dataset, the Model Training and Selection module uses a variety of machine
learning algorithms. We train and evaluate algorithms such as Linear SVC, Random Forest, and Logistic
Regression on language-specific data [10]. Each model is evaluated using performance indicators
such as F1-score, recall, accuracy, and precision. To ensure optimal performance in offensive language
identification, the best-performing algorithm is chosen for each language based on these evaluations.</p>
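        <p>The selection procedure can be sketched as follows. The candidate models match those named in the text, while the toy data, vectorizer settings, and cross-validation setup are illustrative assumptions.</p>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, perfectly separable data standing in for one language's dataset.
texts = ["good video", "nice song", "worst video", "stupid comment"] * 5
labels = ["Not_offensive", "Not_offensive", "Offensive", "Offensive"] * 5

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=100),
}

# Score each candidate with cross-validated macro-F1 and keep the best.
scores = {}
for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores[name] = cross_val_score(pipe, texts, labels,
                                   cv=5, scoring="f1_macro").mean()

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

        <p>On real data the winner differs per language, which is exactly the pattern reported in Section 4.</p>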
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>The results show that abusive language detection behaves differently across Tamil, Malayalam, and
Tulu, proving that a single model cannot work equally well for all. Logistic Regression gave the best
results for Tamil because the dataset was relatively balanced, which made it easier for a linear model to
separate offensive and non-offensive text while still capturing small differences between categories.
For Malayalam, Random Forest performed better as its ensemble of decision trees could manage the
imbalance across classes and deal with the greater variation in offensive expressions. It also showed
more robustness to spelling changes and noisy code-mixed text, which are common in Malayalam. In
the case of Tulu, the dataset was much smaller, and complex models tended to overfit. Linear SVC was
more effective here since its margin-based classification and ability to handle sparse TF–IDF features
helped it generalize better in low-resource conditions. Overall, these findings highlight that the choice
of model depends strongly on the size, balance, and linguistic characteristics of each dataset, and that
tailoring algorithms to language-specific needs leads to more reliable offensive language detection.</p>
      <sec id="sec-4-1">
        <title>4.1. Performance Metrics</title>
        <p>To give a thorough evaluation of the efficacy of the offensive language detection models, their
performance was assessed using common classification measures: accuracy, precision, recall, and
F1-score. Accuracy gauges how correct predictions are overall, while precision shows the percentage of
texts correctly identified as offensive out of all those predicted to be offensive. Recall evaluates the
model’s capacity to recognize every instance of actually offensive content, and the F1-score offers a
balanced metric that combines precision and recall. The experiments showed that Linear SVC was best
for Tulu, Random Forest was best for Malayalam, and Logistic Regression had the best performance
metrics for Tamil. These measurements show that the system can consistently differentiate objectionable
content from non-offensive text in code-mixed Dravidian datasets, underscoring the significance of
choosing language-specific methods.</p>
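        <p>The four measures above can be computed with scikit-learn as follows; the gold and predicted labels are invented for illustration.</p>

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Invented gold and predicted labels for a two-class illustration.
y_true = ["Off", "Off", "Not", "Not", "Not", "Off"]
y_pred = ["Off", "Not", "Not", "Not", "Off", "Off"]

# Accuracy over all predictions; precision/recall/F1 macro-averaged
# across the two classes so each class counts equally.
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```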
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Error Analysis</title>
      <p>Looking more closely at the results, we found that all three models struggled with classes that had
very few training examples. In Tamil, Logistic Regression performed well for major categories but
completely failed to detect the “Offensive Targeted Insult Other” class, where recall dropped to 0.0 due
to insufficient data for the model to learn meaningful patterns. Similarly, the Random Forest classifier
for Malayalam showed good overall accuracy but performed poorly on group insult categories because
of dataset imbalance. For Tulu, Linear SVC handled the small dataset better than other models but
still struggled with subtle targeted insults. Many errors also arose from slang, spelling variations, and
context-dependent words. Overall, these issues highlight that data scarcity and ambiguous expressions
remain key challenges, emphasizing the need for richer and more balanced datasets in the future.</p>
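      <p>The zero-recall failure noted above for rare classes can be surfaced programmatically from a per-class report. The sketch below uses invented labels standing in for the Tamil tag set.</p>

```python
from sklearn.metrics import classification_report

# Invented gold and predicted labels; "Targeted_Other" stands in for a
# rare class the model never recovers.
y_true = ["Not_off", "Not_off", "Targeted", "Targeted_Other"]
y_pred = ["Not_off", "Not_off", "Targeted", "Not_off"]

report = classification_report(y_true, y_pred, output_dict=True,
                               zero_division=0)

# Collect the class labels whose recall is exactly zero, skipping the
# aggregate entries that classification_report also returns.
zero_recall = [label for label, stats in report.items()
               if isinstance(stats, dict)
               and label not in ("macro avg", "weighted avg")
               and stats["recall"] == 0.0]
print(zero_recall)  # classes with no correctly recovered examples
```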
    </sec>
    <sec id="sec-6">
      <title>6. Limitations</title>
      <p>The suggested offensive language detection method performs admirably, but there are still a number of
drawbacks. The models mostly rely on annotated datasets, which are small and do not fully represent
the variety of regional dialects, slang, and code-mixed expressions in Tamil, Malayalam, and Tulu.
Sentences with ambiguous context may cause misclassification since the models may find it difficult to
discern between offensive and non-offensive word usage. Furthermore, real-time social media feeds
with extremely informal or changing language patterns may cause the algorithm to perform worse.
These drawbacks emphasize the necessity of using bigger, more varied datasets, as well as sophisticated
context-aware methods to increase generality and accuracy.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>A machine learning-based method was created in this study to identify abusive language in Dravidian
code-mixed texts, specifically in Tamil, Malayalam, and Tulu. With Linear SVC for Tulu, Random
Forest for Malayalam, and Logistic Regression for Tamil, the studies showed how important it is to
choose algorithms that are specific to a given language. The outcomes demonstrate how well feature
extraction, meticulous preprocessing, and model training can address the difficulties posed by code-mixed
and multilingual data. The suggested system offers a solid basis for automatic content filtering, even
though some mistakes still occur because of unclear context, slang, and small datasets. All things
considered, this work promotes safer online communication and establishes the framework for future
advancements, such as context-aware algorithms and larger datasets for more reliable offensive
language identification.</p>
    </sec>
    <sec id="sec-8">
      <title>Project Repository</title>
      <p>The full source code for this project is available on GitHub (repository: SURYAULAGANATHAN).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>In the course of preparing this manuscript, the author(s) employed the generative AI tool ChatGPT. Its
use was limited to performing checks for grammar and spelling. Following this, the author(s) conducted
a thorough review and revision of the text and assume full responsibility for the final published content.</p>
      <p>[7] A. Hossain Zahid, M. K. Roy, S. Das, Evaluation of hate speech detection using large language
models and geographical contextualization, arXiv preprint arXiv:2502.19612 (2025).</p>
      <p>[8] B. R. Chakravarthi, R. Priyadharshini, N. Jose, T. Mandl, P. K. Kumaresan, R. Ponnusamy, J. P.
McCrae, E. Sherly, et al., Findings of the shared task on offensive language identification in
Tamil, Malayalam, and Kannada, in: Proceedings of the First Workshop on Speech and Language
Technologies for Dravidian Languages, 2021, pp. 133–145.</p>
      <p>[9] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, N. Jose, S. Suryawanshi, E. Sherly, J. P.
McCrae, DravidianCodeMix: Sentiment analysis and offensive language identification dataset for
Dravidian languages in code-mixed text, Language Resources and Evaluation 56 (2022) 765–806.</p>
      <p>[10] A. M. D, D. Vikram, B. R. Chakravarthi, P. R. Hegde, Overcoming low-resource barriers in Tulu:
Neural models and corpus creation for offensive language identification, 2025. URL: https://arxiv.org/abs/2508.11166.
arXiv:2508.11166.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Jagadeeshan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Palanikumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <article-title>Ofensive language identification in dravidian languages using mpnet and cnn</article-title>
          ,
          <source>International Journal of Information Management Data Insights</source>
          <volume>3</volume>
          (
          <year>2023</year>
          )
          100151. URL: https://www.sciencedirect.com/science/article/pii/S2667096822000945. doi:10.1016/j.jjimei.2022.100151.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>S. N</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Navaneethakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. M D</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vikram</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on offensive language identification in Dravidian code-mixed languages</article-title>
          , in: Forum for Information Retrieval and Evaluation (FIRE 2025),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <article-title>A survey on automatic detection of hate speech in text</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>51</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          ,
          <article-title>Arabic offensive language on Twitter: Analysis and experiments with classifiers</article-title>
          ,
          <source>in: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT), Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mahmud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Akter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Uddin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Aziz</surname>
          </string-name>
          , et al.,
          <article-title>Machine learning techniques for identifying child abusive texts in online platforms</article-title>
          ,
          <source>in: 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT)</source>
          , IEEE,
          <year>2024</year>
          . doi:10.1109/ICCCNT61001.2024.10724830.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dhanyashree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kalpana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lekhashree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Arivuchudar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Arthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sahitya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavithra</surname>
          </string-name>
          , S. Johnson, Linguaists@dravidianlangtech 2025:
          <article-title>Abusive Tamil and Malayalam text targeting women on social media</article-title>
          ,
          <source>in: Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages</source>
          , Association for Computational Linguistics, 2025, pp. 682–687. URL: https://aclanthology.org/2025.dravidianlangtech-1.116.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>