<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A TriFusion Model for Ofensive Language Detection in Dravidian Code-Mixed Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Raksha Adyanthaya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rathnakara Shetty P</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Yenepoya Institute Of Arts</institution>
          ,
          <addr-line>Science</addr-line>
          ,
          <institution>Commerce, and Management. Yenepoya (Deemed to be University)</institution>
          ,
          <addr-line>Mangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>The social media boom in multilingual communities has resulted in the proliferation of ofending and harmful content, particularly in low-resourced Dravidian languages, where the overall phenomenon of code-mixing with English is widespread. TriFusion is a multi-model pipeline to detect ofensive language in Kannada, Tulu, Tamil, and Malayalam. The TriFusion system combines TFIDF with Ensemble classifiers, IndicBERT with MLP, and XLM-RoBERTa with XGBoost, with further improvement of majority-vote fusion. In the DravidianCodeMix 2025 shared task, the proposed system ranked Rank 3rd in Malayalam (macro F1-Score = 0.750), 5th in Tulu (macro F1-Score = 0.730), 7th in Kannada (macro F1-Score = 0.375), and 7th in Tamil (macro F1-Score = 0.369). On the whole, the findings indicate that transformers can be generalized well across languages, and Ensemble models are more efective under low-resource settings.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Ofensive Identification</kwd>
        <kwd>Ofensive Language</kwd>
        <kwd>Indian Languages</kwd>
        <kwd>Code-Mixed</kwd>
        <kwd>Classification</kwd>
        <kwd>Dravidian Language</kwd>
        <kwd>Low-Resource Language</kwd>
        <kwd>Multilingual</kwd>
        <kwd>XLM-Roberta</kwd>
        <kwd>Ensemble</kwd>
        <kwd>TriFusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Automated moderation systems have been required by the propagation of abusive and hateful speech
on social media platforms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Numerous studies have suggested various computational techniques
to combat the growing prevalence of hostile language, from sophisticated deep learning and transfer
learning models[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to conventional machine learning. In the study of Adyanthaya and Shetty [6],
minimal preprocessing of code-mixed data was incorporated to maintain the natural structure of
multilingual inputs, following feature extraction approaches such as TFIDF and Word2Vec, ensemble
classification techniques including SVM, RF, BiLSTM and mBERT were used. These techniques are
highly relevant for identifying inappropriate language in noisy, code-mixed situations, even though
they were mainly used for sentiment analysis. Lexical characteristics like TFIDF representations and
n-grams were the mainstay of early methods, which also used classifiers like SVM, Naive Bayes, and
LR. Davidson et al. [7] showed that n-gram-based TFIDF features successfully classified tweets into
neutral, ofensive, and hateful categories. However, these conventional approaches struggle when it
came to low-resource and code-mixed languages like Kannada, where word order, spelling variants,
and transliteration add complexity. Kumar et al. [8] and Talat and Hovy [9] tried to address these issues
by utilizing ensemble classifiers and feature fusion like sentiment features, character- and word-level
n-grams. As social media content in Indian languages, particularly code-mixed forms like Kanglish,
Tanglish, and Manglish [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], has become more and more popular, several initiatives have been devoted
to identifying multilingual ofensive language, including HASOC [ 10] and DravidianCodeMix [11]. The
unique linguistic dificulties are demonstrated by benchmark datasets created by Chakravarthi et al.
[11] in Tamil, Malayalam, and Kannada. To improve comprehension of semantics, an increasing number
of researchers are adopting transformer-based models like XLM-RoBERTa [12], IndicBERT [13], and
BERT [14]. Because these models were pretrained on various multilingual datasets, they are especially
well-suited to managing the intricacies of Indian languages. In contrast to Tamil and Malayalam, Tulu
and Kannada are still comparatively understudied despite these developments.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets description</title>
      <p>
        The DravidianCodeMix@FIRE 2025 shared task on Ofensive Language Detection, which aims to detect
ofensive content in code-mixed YouTube comments, is the source of the datasets used in this study
[15][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Tamil, Malayalam, Kannada, and Tulu are the four Dravidian languages included in the datasets;
the distribution of these instances is shown in Table 1.
      </p>
      <p>Language
Tamil
Malayalam
Kannada
Tulu</p>
      <p>Train</p>
    </sec>
    <sec id="sec-4">
      <title>4. System Description</title>
      <p>The proposed ofensive language detection system has been illustrated in Figure 1, which incorporates
preprocessing, feature extraction, classification, and majority voting.</p>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing</title>
        <p>To prepare the code-mixed datasets for classification of ofensive language, a thorough preprocessing
pipeline was implemented to reduce noise and normalize inputs.
The following are the stages:
• Noise reduction: Unnecessary elements that are commonly included in social media comments,
including links, mentions, hashtags, and irregular symbols, are eliminated. We kept only English
and language-specific characters and eliminated URLs, user mentions, hashtags, digits, and special
characters.
• Emoji Management: The function demojize() converted emojis into their descriptive text
equivalents. To preserve sentiment cues, this semantic information was retained. Emoji descriptors
were then surrounded by colons, which were removed to simplify the token structure.
• Unicode normalization: With an emphasis on language-specific characters, we performed</p>
        <p>Unicode normalization using the IndicNormalizerFactory from the indic-nlp-library.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Baseline Ensemble Model with Traditional Classifier</title>
        <p>The proposed system integrates a soft-voting ensemble of six traditional machine learning classifiers
[16], including LR, XGB [17], MLP, RF, Support Vector Classifier(SVC), and KNN. The score of each
model is a probability distribution over the target classes, which is averaged to select the most probable
class based on the mean probability, therefore the ensemble can take advantage of individual model
confidence and generalize across models better than through hard voting. Lowercasing, removing
punctuations as well as normalizing code-mixing content are done on the input text, which then
get TFIDF vectorization [6] and the limit of features was set to 50000 with the n-gram range of (1,
2). The system training and testing use each language’s data separately, class labels are encoded via
LabelEncoder, and the results are assessed by using Accuracy, macro F1-score ( mF1-score), and weighted
F1-score (wF1-score). The resulting sparse representations are the inputs to the ensemble, which makes
the final prediction.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. IndicBERT -based Deep Representation with MLP Classifier</title>
        <p>The IndicBERT , a transformer multilingual model, which is optimized on Indian languages, to create
dense contextual embedding of the input text. we fine-tune the representations of the [CLS] token of
the last hidden layer based on the maximum sequence length of 128. These embeddings are fed into a
pre-trained MLP classifier, which consists of two hidden layers having 256 and 128 nodes, respectively,
an activation function of the form ReLU, and an optimizer of the form Adam, trained with up to 300
iterations. The system is trained on each language dataset respectively using LabelEncoder to encode
labels and scored against the development set using Precision, Recall and macro F1-score. Extending the
capacity of IndicBERT [18] to support rich semantic and syntactic features and that of MLP to estimate
non-linear decision boundaries, multi-lingual classification can be performed robustly using this fusion
technique.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. XLM-RoBERTa Embeddings with XGBoost and Class Balancing</title>
        <p>The XLM-RoBERTa [19] transformer model that our system uses is pre-trained on more than 100
languages, including some Indian languages, which is favorable to multilingual and code-mixed scenarios.
Each input sentence has its embedding as [CLS] token (768-dimensional) extracted at the final hidden
layer, and with maximum length up to 128 tokens. Then, these embeddings get passed to an XGBoost
classifier that is set to multi-class classification and that has a maximum depth of 6,200 estimators, and
a learning rate of 0.1. To overcome the impact of imbalance between classes, we calculate weights of
classes inversely proportional to the frequencies and use them in training. The system is optimized on
development set based on the usual classification accuracy indices and the final predictions on test set
are produced through mapping between the model output and the original label set. The strategy would
blend the strong contextual ability of XLM-R with the gradient boosting functionality of XGBoost to be
able to deal efectively with both high-dimensional embeddings and imbalanced data.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. TriFusion: Majority Voting Ensemble of Three Systems</title>
        <p>To take advantage of the complementary skill sets of our individual models, this study chose a majority
voting ensemble approach. Namely, the predictions of three diferent systems were combined:
(i) an ensemble of a classical set of classifiers trained on the basis of TFIDF.
(ii) an IndicBERT model with a MLP classifier.
(iii) an XLM-RoBERTa model with an XGBoost classifier.</p>
        <p>The three systems were queried in each test cases and the predicted label of the three systems per
test case were used to obtain the final output label by adopting hard (majority) voting. Where no label
had a majority then the mode function within the Pandas library was used to pick the most common
label. The purpose of this ensemble technique is to reduce the bias of deep contextual embeddings
and increase the overall robustness of approximations of the true model and achieve better overall
performance in classification.The result is shown in Table 3</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Error Analysis</title>
      <sec id="sec-5-1">
        <title>5.1. Results</title>
        <p>The proposed models were tested on four Dravidian languages i.e. Tulu, Malayalam, Kannada and
Tamil using three important evaluation measures such as Accuracy (2), mF1-score, and wF1-score. Such
parameters combine two aspects of the prediction correctness overall and the capability of the model
to manage the aspect of the imbalance of classes, which plays a significant role in ofensive language
detection. The findings show that transformer-based architectures are always superior to conventional
ensemble models in most languages, especially in Malayalam and Tamil. These two languages had the
best scores in the XLM-RoBERTa model with an mF1-score of 0.76 and 0.73 in this order, indicating that
the model has a good grasp of the context of the code-mixed text. In the case of Tulu, though, TFIDF
+ Ensemble model made a competitive mark of 0.73 on mF1-score, meaning that classical statistical
features have not yet been discarded in small or less diverse datasets. However, Kannada was more
challenging with high data imbalance and morphological complexity, with TFIDF a little more successful
than deep models on all three metrics. TriFusion ensemble, a sergeant that is composed of the results of
TFIDF, IndicBERT, and XLM-RoBERTa based on majority voting, exhibited better language-independent
generalization. It scored the best in Accuracy in Malayalam (0.92), and all mF1 and wF1-scores of all
languages were constant, which narrowed the performance diference between the resource-rich and
low-resource cases. In general, the comparison between these three parameters depicts that
transformerbased models perform better in the case of well-represented datasets, whereas hybrid ensemble fusion
ofers a more robust and balanced result in detecting multilingual and code-mixed ofensive language.</p>
        <p>TF-IDF + Ensemble</p>
        <p>IndicBERT + MLP</p>
        <p>XLM-R + XGBoost
Tulu
Malayalam
Kannada
Tamil</p>
        <p>Accuracy</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Error Analysis</title>
        <p>The error analysis was used to analyze the diferences in performance between models, with the main
emphasis on misclassifications in the minority classes. The comparison of Accuracy, mF1-score, and
wF1-Score was found to show that the Accuracy was good across the board, but low mF1-Scores were
a manifestation of the way the models treated underrepresented classes like Ofensive Targeted and
Ofensive Non-Targeted. As an example, Tamil and Kannada got a decent Accuracy (0.77 and 0.70) but
significantly lower mF1-Score (0.37 and 0.39), indicating the issues of class imbalance.</p>
        <p>The XLM-RoBERTa model was found to be able to capture contextual nuances and classify
nonofensive categories, as well as generally overfit to the most common non-ofensive classification types,
but IndicBERT was unable to scale to fine-grained diferences because it did not support code-mixed text
representation. TriFusion ensemble alleviated such problems by taking advantage of model diversity,
providing a higher balance between classes, especially in Malayalam (mF1-score 0.41). The remaining
errors were due to the ambiguous nature of cases on sarcasm, transliterated slang, and mixed sentiment
polarity where context on interpretation was very important. On balance, the findings indicate that
although hybrid models such as TriFusion can raise the level of robustness, more necessary steps must
be implemented to reduce bias via data augmentation strategies, context-enriched embeddings, and
contrastive learning to improve the detection of minority classes.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Data Availability</title>
      <p>The scripts of the proposed methodology, including the preprocessing scripts, feature extraction, and
model training modules, are publicly available on Github Repository.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>This paper introduced TriFusion, a multi-model ensemble that used ofensive language detection in
code-mixed Dravidian languages, namely, Kannada, Tulu, Tamil, and Malayalam. The system showed
that in low-resource and linguistically diverse conditions, the combination of lexical and contextual
features can help to boost the robustness of the traditional machine learning classifiers with the aid of
transformer-based deep representations. The experiments also showed that the contextual cues are
well captured by the transformer architectures like XLM-RoBERTa, and that the imbalance limitation
typical of underrepresented languages is addressed by the hybrid fusion. TriFusion performed well on
all languages, especially on Malayalam, and this proved that ensemble fusion is less biased compared to
single models.</p>
      <p>To overcome the drawbacks associated with minority class detection as noted in the present study, in
the next work, suggested to apply the method of data augmentation and context-enriched embedding.
Contrastive learning and cross-lingual alignment may also be used together to increase the resilience
of diferentiating subtle ofensive material in underrepresented groups. These guidelines will assist in
moving the field of ofensive language detection on low-resource, code-mixed Dravidian languages to a
more fair and generalized state.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>In the course of preparing this manuscript, the author(s) employed the generative AI tool ChatGPT. Its
use was limited to performing checks for grammar and spelling. Following this, the author(s) conducted
a thorough review and revision of the text and assume full responsibility for the final published content.
in dravidian code-mixed languages, in: Forum of Information Retrieval and Evaluation FIRE-2025,
2025.
[6] R. Adyanthaya, R. Shetty, Yenlp_cs@ dravidianlangtech 2025: Sentiment analysis on code-mixed
tamil-tulu data using machine learning and deep learning models, in: Proceedings of the Fifth
Workshop on Speech, Vision, and Language Technologies for Dravidian Languages, 2025, pp.
288–292.
[7] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem
of ofensive language, in: Proceedings of the international AAAI conference on web and social
media, volume 11, 2017, pp. 512–515.
[8] R. Kumar, G. Bhanodai, R. Pamula, M. R. Chennuru, Trac-1 shared task on aggression identification:
Iit (ism)@ coling’18, in: Proceedings of the first workshop on trolling, aggression and cyberbullying
(TRAC-2018), 2018, pp. 58–65.
[9] Z. Talat, D. Hovy, Hateful symbols or hateful people? predictive features for hate speech detection
on twitter, in: Proceedings of the NAACL student research workshop, 2016, pp. 88–93.
[10] T. Mandl, S. Modha, A. Kumar M, B. R. Chakravarthi, Overview of the hasoc track at fire 2020:
Hate speech and ofensive language identification in tamil, malayalam, hindi, english and german,
in: Proceedings of the 12th annual meeting of the forum for information retrieval evaluation, 2020,
pp. 29–32.
[11] B. R. Chakravarthi, R. Priyadharshini, N. Jose, T. Mandl, P. K. Kumaresan, R. Ponnusamy, J. P.</p>
      <p>McCrae, E. Sherly, et al., Findings of the shared task on ofensive language identification in
tamil, malayalam, and kannada, in: Proceedings of the first workshop on speech and language
technologies for Dravidian languages, 2021, pp. 133–145.
[12] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv
preprint arXiv:1911.02116 (2019).
[13] D. Kakwani, A. Kunchukuttan, S. Golla, G. NC, A. Bhattacharyya, M. M. Khapra, P. Kumar,
Indicnlpsuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language
models for indian languages, in: Findings of the association for computational linguistics: EMNLP
2020, 2020, pp. 4948–4961.
[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
for language understanding, in: Proceedings of the 2019 conference of the North American chapter
of the association for computational linguistics: human language technologies, volume 1 (long
and short papers), 2019, pp. 4171–4186.
[15] B. R. Chakravarthi, M. B. Jagadeeshan, V. Palanikumar, R. Priyadharshini, Ofensive language
identification in dravidian languages using mpnet and cnn, International Journal of Information
Management Data Insights 3 (2023) 100151. URL: https://www.sciencedirect.com/science/article/
pii/S2667096822000945. doi:https://doi.org/10.1016/j.jjimei.2022.100151.
[16] J. Preetham, J. Anitha, Ofensive language detection in social media using ensemble techniques,
in: 2023 International Conference on Circuit Power and Computing Technologies (ICCPCT), IEEE,
2023, pp. 805–808.
[17] R. Razzak, S. T. O. Taki, M. R. Mim, M. S. H. Patwary, M. I. Pavel, Vulgar comments classification:</p>
      <p>Comparison between cnn, xgboost and svm (2023).
[18] A. Garain, A. Mandal, S. K. Naskar, Junlp@ dravidianlangtech-eacl2021: Ofensive language
identification in dravidian langauges, in: Proceedings of the First Workshop on Speech and
Language Technologies for Dravidian Languages, 2021, pp. 319–322.
[19] M. Xu, S. Liu, Rb_bg_mha: A roberta-based model with bi-gru and multi-head attention for chinese
ofensive language detection in social media, Applied Sciences 13 (2023) 11000.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Towards ofensive language identification for dravidian languages</article-title>
          ,
          <source>in: Proceedings of the first workshop on speech and language technologies for Dravidian languages</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sripriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Navaneethakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. M D</surname>
            , P. R. Hegde,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Vikram</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on ofensive language identification in dravidian code-mixed languages, in: Forum of Information Retrieval and Evaluation FIRE-</article-title>
          <year>2025</year>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>A. M. D</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. Vikram</surname>
            ,
            <given-names>B. R.</given-names>
          </string-name>
          <string-name>
            <surname>Chakravarthi</surname>
            ,
            <given-names>P. R.</given-names>
          </string-name>
          <string-name>
            <surname>Hegde</surname>
          </string-name>
          ,
          <article-title>Overcoming low-resource barriers in tulu: Neural models and corpus creation for ofensive language identification</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv. org/abs/2508.11166. arXiv:arXiv:
          <fpage>2508</fpage>
          .
          <fpage>11166</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          , N. Jose,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Dravidiancodemix: Sentiment analysis and ofensive language identification dataset for dravidian languages in code-mixed text</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>56</volume>
          (
          <year>2022</year>
          )
          <fpage>765</fpage>
          -
          <lpage>806</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sripriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Subalalitha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Anusha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vikram</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on ofensive language identification</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>