<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>DravidianCodeMix 2025: Empirical Analysis of Classical Machine Learning Approaches in Tamil, Malayalam, and Tulu Code-Mixed Offensive Content Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shudapreyaa R S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Surya U S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Swetha M</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandeep P S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kongu Engineering College</institution>
          ,
          <addr-line>Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Identifying offensive language in Dravidian code-mixed texts is a critical responsibility for maintaining healthy online communication, particularly on social media platforms where multilingual interactions are prevalent. The main goal of this project is to build machine learning models that classify offensive content in three Dravidian languages: Tamil, Malayalam, and Tulu. We implemented and tested a variety of algorithms to find the best method for each language. According to the experimental results, Linear SVC performed best for Tulu, Random Forest produced better results for Malayalam, and Logistic Regression outperformed the other models for Tamil. These results show that no single algorithm is uniformly dominant across all three languages, underscoring the importance of language-specific algorithm selection in offensive language identification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Survey</title>
      <p>
        Every day, millions of comments are left on uploaded posts, driven by the rise of netizen culture and
social media, and the use of derogatory language in user comments has increased dramatically. Online
comments that contain abusive language initiate cyberbullying [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which targets both groups of people
(of a certain nation, age, or religion) and individuals (a politician, celebrity, or product). Automated
detection and analysis of abusive language in online comments are therefore crucial. In the literature,
there have been multiple attempts to identify abusive language in English. One such study applied five
machine learning models (NB, SVM, IBK, Logistic, and JRip) and four deep learning models (CNN,
LSTM, BLSTM, and CLSTM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) to recognize abusive language in Urdu and Roman Urdu comments.
      </p>
      <p>
        Tanjim Mahmud et al.[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have proposed a system that creates sophisticated machine learning and
deep learning models for identifying child abusive texts in the Bengali language on online platforms;
this study tackles the pressing problem of child abuse in digital communications. Its main objective is
to develop a practical tool for precisely recognizing abusive content to aid in the prevention of child
abuse. The model differentiates between abusive and non-abusive material by combining deep learning
methods with natural language processing (NLP) approaches.
      </p>
      <p>
        Dhanyashree G et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] highlighted that while social networks serve as major platforms for
engagement and communication, they are also increasingly misused for gender-based abuse, particularly
targeting women with demeaning and harassing remarks. Their study, focusing on Malayalam and Tamil
YouTube comments, aimed to identify explicit abuse, implicit bias, stereotypes, and coded language.
To address this, they evaluated multiple machine learning models, including Support Vector Machines
(SVM), Logistic Regression (LR), and Naive Bayes classifiers, for categorizing comments into abusive
and non-abusive groups.
      </p>
      <p>Anwar Hossain et al. [7] address the spread of hate speech on social media, one of the major
problems significantly impacting society, as it contributes to increased violence, discrimination, and
societal disintegration. Because of adversarial manipulations and cultural, linguistic, and contextual
complexities, identifying hate speech is inherently difficult. Their work methodically examines how
well LLMs perform at identifying hate speech across geographical contexts and multilingual datasets,
and offers a novel assessment approach that considers three factors: robustness to adversarially created
text, geography-aware contextual detection, and binary classification of hate speech.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Taskset Description</title>
        <p>The study explores training datasets from three Dravidian code-mixed languages—Tamil, Malayalam,
and Tulu—made up of real social media comments that freely mix English and native scripts. While
all three datasets contain offensive and non-offensive labels, the amount of detail in these annotations
differs across languages. The Tamil dataset is the most detailed, with separate categories for targeted
insults aimed at individuals, groups, or other entities, along with untargeted insults and comments
written in other languages [8]. The Malayalam dataset is balanced but more challenging to work with
because of its rich morphology, shifting dialects, and frequent code-switching, which can make offensive
expressions harder to detect [9]. The Tulu dataset is smaller in size but still meaningful, offering useful
insights for low-resource language research by capturing typical patterns of neutral, targeted, and
untargeted offensive speech found online. When comparing models across these datasets, each language
shows a different best performer: Logistic Regression works well for Tamil due to its strength with
sparse text features, Random Forest handles Malayalam effectively by modeling its non-linear linguistic
patterns, and Linear SVC suits Tulu because of its stability with limited training data. Overall, the
results emphasize how linguistic characteristics, dataset size, and annotation depth play a crucial role
in determining which machine learning approach is most effective for detecting offensive content in
Dravidian code-mixed text.</p>
        <p>Tamil Dataset: Labels include Not Offensive, Targeted Insult (Individual, Group, Other), Untargeted,
and Other Language.</p>
        <p>Tulu Dataset: Classified into Not Offensive, Offensive Targeted, and Offensive Untargeted.</p>
        <p>Malayalam Dataset: Similar to Tamil, with categories for Not Offensive, Targeted Insults, Untargeted,
and Other Language.</p>
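        <p>The label sets above can be inspected for class balance before training. The sketch below is illustrative: the label strings mirror the Tulu tag set, but the counts are invented, not the actual dataset statistics.</p>

```python
# Sketch of inspecting class balance in a labelled dataset.
# The counts below are illustrative, not the real Tulu statistics.
from collections import Counter


def label_distribution(labels):
    """Return each label with its count and share of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, round(n / total, 3)) for label, n in counts.items()}


# Example with the Tulu tag set described above (invented counts).
tulu_labels = (["Not offensive"] * 6
               + ["Offensive_Targeted"] * 3
               + ["Offensive_Untargeted"] * 1)
print(label_distribution(tulu_labels))
```

        <p>A skew such as the one sketched here is exactly what motivates the per-class error analysis in Section 5.</p>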
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Collection</title>
        <p>The Data Collection module is in charge of compiling code-mixed textual data in Tamil, Malayalam,
and Tulu from a variety of sources, including forums, social networking sites, and publicly accessible
datasets. This module guarantees that the dataset covers a variety of content categories, including slang,
dialects, and samples that are both offensive and non-offensive. Training machine learning models that
can reliably detect harmful content in multilingual and code-mixed environments requires a carefully
curated dataset.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Data Pre-processing</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Lowercasing</title>
          <p>Preprocessing was essential for preparing the text for classification and involved the following steps.
First, all characters were converted to lowercase to maintain consistency and avoid treating words like
’Good’ and ’good’ as different tokens.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Noise Removal</title>
          <p>Unwanted elements such as URLs, mentions (@username), hashtags, numbers, punctuation, and special
characters were removed to reduce irrelevant noise.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Language Filtering</title>
          <p>Only Tamil and English characters were retained by restricting the text to the corresponding Unicode
ranges, ensuring that irrelevant scripts and symbols were excluded.</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.4. Whitespace Normalization</title>
          <p>Multiple spaces were collapsed into a single space, and leading/trailing whitespace was removed, making
the text cleaner and more uniform.</p>
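          <p>Taken together, the four preprocessing steps above can be sketched as a single cleaning function. This is an illustrative reconstruction, not the authors’ exact code; the Unicode range shown covers the Tamil block, and the Malayalam and Tulu scripts would substitute their own blocks.</p>

```python
import re

# Illustrative sketch of the preprocessing pipeline described in Section 3.3.
# The range \u0B80-\u0BFF is the Tamil Unicode block; the Malayalam and Tulu
# datasets would use their corresponding script blocks instead.
def clean_text(text):
    text = text.lower()                                  # 3.3.1 lowercasing
    text = re.sub(r"http\S+|www\.\S+", " ", text)        # 3.3.2 URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # 3.3.2 mentions, hashtags
    text = re.sub(r"\d+", " ", text)                     # 3.3.2 numbers
    text = re.sub(r"[^a-z\u0B80-\u0BFF\s]", " ", text)   # 3.3.3 keep Tamil + English
    text = re.sub(r"\s+", " ", text).strip()             # 3.3.4 whitespace
    return text

print(clean_text("Check http://t.co/x @user #tag 123 Good!!"))  # -> "check good"
```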
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Feature Extraction</title>
        <p>In order to make machine learning models understand preprocessed text, the Feature Extraction module
converts it into numerical representations. Word embeddings, term frequency-inverse document
frequency (TF-IDF) vectors, and n-gram generation are examples of common methods. In order to
capture contextual nuances, this module may also include linguistic or semantic elements unique to
code-mixed Dravidian texts. Effective feature extraction improves the model’s capacity to reliably
distinguish between offensive and non-offensive content.</p>
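        <p>The TF-IDF option can be sketched with scikit-learn’s TfidfVectorizer. The analyzer and n-gram settings below are assumptions rather than the paper’s reported configuration; character n-grams are often favoured for code-mixed text because they tolerate spelling variation.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative TF-IDF features with character n-grams over word boundaries.
# The corpus and all parameter values here are assumptions for demonstration.
corpus = [
    "super video bro",
    "very worst comment",
    "nalla irukku",  # romanized Tamil: "it is good"
]
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), min_df=1)
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (3 documents, vocabulary-size features)
```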
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Model Training and Selection</title>
        <p>To process the feature-rich dataset, the Model Training and Selection module uses a variety of machine
learning algorithms. We train and evaluate algorithms such as Linear SVC, Random Forest, and Logistic
Regression on language-specific data [10]. Each model is evaluated using performance indicators
such as F1-score, recall, accuracy, and precision. To ensure optimal performance in offensive language
identification, the best-performing algorithm is chosen for each language based on these evaluations.</p>
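        <p>The selection procedure can be sketched as follows. The candidate models match those named in the text, while the toy data, vectorizer settings, and cross-validation setup are illustrative assumptions.</p>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, perfectly separable data standing in for one language's dataset.
texts = ["good video", "nice song", "worst video", "stupid comment"] * 5
labels = ["Not_offensive", "Not_offensive", "Offensive", "Offensive"] * 5

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=100),
}

# Score each candidate with cross-validated macro-F1 and keep the best.
scores = {}
for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores[name] = cross_val_score(pipe, texts, labels,
                                   cv=5, scoring="f1_macro").mean()

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

        <p>On real data the winner differs per language, which is exactly the pattern reported in Section 4.</p>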
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>The results show that abusive language detection behaves differently across Tamil, Malayalam, and
Tulu, proving that a single model cannot work equally well for all. Logistic Regression gave the best
results for Tamil because the dataset was relatively balanced, which made it easier for a linear model to
separate offensive and non-offensive text while still capturing small differences between categories.
For Malayalam, Random Forest performed better as its ensemble of decision trees could manage the
imbalance across classes and deal with the greater variation in offensive expressions. It also showed
more robustness to spelling changes and noisy code-mixed text, which are common in Malayalam. In
the case of Tulu, the dataset was much smaller, and complex models tended to overfit. Linear SVC was
more effective here since its margin-based classification and ability to handle sparse TF–IDF features
helped it generalize better in low-resource conditions. Overall, these findings highlight that the choice
of model depends strongly on the size, balance, and linguistic characteristics of each dataset, and that
tailoring algorithms to language-specific needs leads to more reliable offensive language detection.</p>
      <sec id="sec-4-1">
        <title>4.1. Performance Metrics</title>
        <p>To give a thorough evaluation of the efficacy of the offensive language detection models, their
performance was assessed using common classification measures: accuracy, precision, recall, and
F1-score. Accuracy gauges how correct predictions are overall, while precision shows the percentage of
texts correctly identified as offensive out of all those predicted to be offensive. Recall evaluates the
model’s capacity to recognize every instance of actually offensive content, and the F1-score offers a
balanced metric that combines precision and recall. The experiments showed that Linear SVC was best
for Tulu, Random Forest was best for Malayalam, and Logistic Regression had the best performance
metrics for Tamil. These measurements show that the system can consistently differentiate objectionable
content from non-offensive text in code-mixed Dravidian datasets, underscoring the significance of
choosing language-specific methods.</p>
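        <p>The four measures above can be computed with scikit-learn as follows; the gold and predicted labels are invented for illustration.</p>

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Invented gold and predicted labels for a two-class illustration.
y_true = ["Off", "Off", "Not", "Not", "Not", "Off"]
y_pred = ["Off", "Not", "Not", "Not", "Off", "Off"]

# Accuracy over all predictions; precision/recall/F1 macro-averaged
# across the two classes so each class counts equally.
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```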
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Error Analysis</title>
      <p>Looking more closely at the results, we found that all three models struggled with classes that had
very few training examples. In Tamil, Logistic Regression performed well for major categories but
completely failed to detect the “Offensive Targeted Insult Other” class, where recall dropped to 0.0 due
to insufficient data for the model to learn meaningful patterns. Similarly, the Random Forest classifier
for Malayalam showed good overall accuracy but performed poorly on group insult categories because
of dataset imbalance. For Tulu, Linear SVC handled the small dataset better than other models but
still struggled with subtle targeted insults. Many errors also arose from slang, spelling variations, and
context-dependent words. Overall, these issues highlight that data scarcity and ambiguous expressions
remain key challenges, emphasizing the need for richer and more balanced datasets in the future.</p>
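      <p>The zero-recall failure noted above for rare classes can be surfaced programmatically from a per-class report. The sketch below uses invented labels standing in for the Tamil tag set.</p>

```python
from sklearn.metrics import classification_report

# Invented gold and predicted labels; "Targeted_Other" stands in for a
# rare class the model never recovers.
y_true = ["Not_off", "Not_off", "Targeted", "Targeted_Other"]
y_pred = ["Not_off", "Not_off", "Targeted", "Not_off"]

report = classification_report(y_true, y_pred, output_dict=True,
                               zero_division=0)

# Collect the class labels whose recall is exactly zero, skipping the
# aggregate entries that classification_report also returns.
zero_recall = [label for label, stats in report.items()
               if isinstance(stats, dict)
               and label not in ("macro avg", "weighted avg")
               and stats["recall"] == 0.0]
print(zero_recall)  # classes with no correctly recovered examples
```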
    </sec>
    <sec id="sec-6">
      <title>6. Limitations</title>
      <p>The suggested offensive language detection method performs admirably, but there are still a number of
drawbacks. The models mostly rely on annotated datasets, which are small and do not fully represent
the variety of regional dialects, slang, and code-mixed expressions in Tamil, Malayalam, and Tulu.
Sentences with ambiguous context may cause misclassification since the models may find it difficult to
discern between offensive and non-offensive word usage. Furthermore, real-time social media feeds
with extremely informal or changing language patterns may cause the algorithm to perform worse.
These drawbacks emphasize the necessity of using bigger, more varied datasets, as well as sophisticated
context-aware methods to increase generality and accuracy.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>A machine learning-based method was created in this study to identify abusive language in Dravidian
code-mixed texts, specifically in Tamil, Malayalam, and Tulu. With Linear SVC for Tulu, Random
Forest for Malayalam, and Logistic Regression for Tamil, the studies showed how important it is to
choose algorithms that are specific to a given language. The outcomes demonstrate how well feature
extraction, meticulous preprocessing, and model training can address the difficulties posed by code-mixed
and multilingual data. The suggested system offers a solid basis for automatic content filtering, even
though some mistakes still occur because of unclear context, slang, and small datasets. All things
considered, this work promotes safer online communication and establishes the framework for future
advancements, such as context-aware algorithms and larger datasets for more reliable offensive
language identification.</p>
    </sec>
    <sec id="sec-8">
      <title>Project Repository</title>
      <p>The full source code for this project is available on GitHub (repository: SURYAULAGANATHAN).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>In the course of preparing this manuscript, the author(s) employed the generative AI tool ChatGPT. Its
use was limited to performing checks for grammar and spelling. Following this, the author(s) conducted
a thorough review and revision of the text and assume full responsibility for the final published content.</p>
      <p>[7] A. Hossain Zahid, M. K. Roy, S. Das, Evaluation of hate speech detection using large language
models and geographical contextualization, arXiv preprint arXiv:2502.19612 (2025).</p>
      <p>[8] B. R. Chakravarthi, R. Priyadharshini, N. Jose, T. Mandl, P. K. Kumaresan, R. Ponnusamy, J. P.
McCrae, E. Sherly, et al., Findings of the shared task on offensive language identification in
Tamil, Malayalam, and Kannada, in: Proceedings of the First Workshop on Speech and Language
Technologies for Dravidian Languages, 2021, pp. 133–145.</p>
      <p>[9] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, N. Jose, S. Suryawanshi, E. Sherly, J. P.
McCrae, DravidianCodeMix: Sentiment analysis and offensive language identification dataset for
Dravidian languages in code-mixed text, Language Resources and Evaluation 56 (2022) 765–806.</p>
      <p>[10] A. M. D, D. Vikram, B. R. Chakravarthi, P. R. Hegde, Overcoming low-resource barriers in Tulu:
Neural models and corpus creation for offensive language identification, 2025. URL: https://arxiv.org/abs/2508.11166.
arXiv:2508.11166.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Jagadeeshan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Palanikumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <article-title>Ofensive language identification in dravidian languages using mpnet and cnn</article-title>
          ,
          <source>International Journal of Information Management Data Insights</source>
          <volume>3</volume>
          (
          <year>2023</year>
          )
          100151. URL: https://www.sciencedirect.com/science/article/pii/S2667096822000945. doi:10.1016/j.jjimei.2022.100151.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>S. N</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Navaneethakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. M D</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vikram</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on offensive language identification in Dravidian code-mixed languages</article-title>
          , in: Forum for Information Retrieval and Evaluation (FIRE 2025),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <article-title>A survey on automatic detection of hate speech in text</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>51</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          ,
          <article-title>Arabic offensive language on Twitter: Analysis and experiments with classifiers</article-title>
          ,
          <source>in: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT), Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mahmud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Akter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Uddin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Aziz</surname>
          </string-name>
          , et al.,
          <article-title>Machine learning techniques for identifying child abusive texts in online platforms</article-title>
          ,
          <source>in: 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT)</source>
          , IEEE,
          <year>2024</year>
          . doi:10.1109/ICCCNT61001.2024.10724830.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dhanyashree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kalpana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lekhashree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Arivuchudar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Arthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sahitya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavithra</surname>
          </string-name>
          , S. Johnson, Linguaists@dravidianlangtech 2025:
          <article-title>Abusive Tamil and Malayalam text targeting women on social media</article-title>
          ,
          <source>in: Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages</source>
          , Association for Computational Linguistics, 2025, pp. 682–687. URL: https://aclanthology.org/2025.dravidianlangtech-1.116.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>