<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3717611</article-id>
      <title-group>
        <article-title>Multi-Modal Ensemble Approach for Hate Speech and Offensive Content Detection in Indic Memes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Radhika Bohra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashvardhan Sharma</string-name>
          <email>yash@pilani.bits-pilani.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of CSIS, Birla Institute of Technology and Science</institution>
          ,
          <addr-line>Pilani, 333031, Rajasthan</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Forum for Information Retrieval Evaluation</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8748</fpage>
      <lpage>8763</lpage>
      <abstract>
        <p>The proliferation of multi-modal content, such as memes, on social media has created significant challenges for automated hate speech and offensive content detection, particularly in under-represented Indic languages. This paper presents a robust pipeline to address this challenge across four languages: Bangla, Bodo, Hindi, and Gujarati. We evaluate multiple dual-encoder architectures, combining the CLIP Vision Transformer with language-specific text models including MuRIL, XLM-RoBERTa, and mBERT. Training is conducted using a 5-fold cross-validation strategy with a weighted loss function to counteract class imbalance, and performance is validated on the test set. Our results establish trustworthy performance benchmarks, with macro F1 scores consistently ranging from 0.56 to 0.61 across the different languages. A task-level analysis reveals that the models are effective at classifying sentiment, while more nuanced and context-dependent tasks like sarcasm remain challenging for current architectures. By systematically evaluating these strong multimodal models, this work provides a foundational benchmark that will guide and accelerate future research in content moderation for Indic languages. The research is conducted by the team of CSIS BITS Pilani. The team achieved the following ranks for the HASOC tasks on the 4 language datasets: Bangla - Rank 2, Bodo - Rank 5, Gujarati - Rank 6, and Hindi - Rank</p>
        <p>∗Corresponding author. †These authors contributed equally.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-modal</kwd>
        <kwd>Hate speech and offensive content detection</kwd>
        <kwd>Vision Language Models</kwd>
        <kwd>Indic memes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In little more than a decade, social media has transformed from a novelty into the world’s digital
gathering place. It’s the digital equivalent of a nation-wide “chai stall”, a bustling space where news
breaks, opinions are forged, and an ever-growing volume of user-generated content is shared every
second. Central to this digital culture is the rise of the meme. More than just simple jokes, memes are
potent capsules of cultural shorthand, capable of conveying complex ideas, emotions, and commentary
almost instantly, all through just an image and some text in it.</p>
      <p>
        But this digital gathering place has a dark side. The very features that make memes effective for
communication also make them an ideal vehicle for spreading hate speech and offensive content [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Hateful ideologies can be laundered through the use of irony or humour, packaged in a shareable format
that makes them more palatable and viral. An otherwise innocent image can be combined with a
subtle line of text to create a deeply abusive or derogatory message, which can negatively influence the
public’s opinion based on gender, religion, individual, politics, cultural identity, etc., while flying under
the radar of traditional content moderation.
      </p>
      <p>
        The rapid, unchecked spread of such material poses a direct threat to the safety of online communities
[
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. This has created an urgent need for effective detection, but identifying this content at scale
is a formidable challenge. The core difficulty in detecting offensive memes lies in their inherently
multi-modal nature. The true intent of a meme is rarely found in the text or the image alone; instead,
it emerges from the complex interplay between the two. The text might be benign, and the image
harmless, but their combination can produce a potent and unambiguously hateful message. This
semantic gap is often filled by implicit cultural, social, or political context that a machine must learn to
infer. Furthermore, bad actors deliberately exploit this complexity, using sarcasm, irony, and coded
language to evade traditional, text-based content moderation filters. This means any effective detection
system must not only process both modalities but also understand the nuanced, often non-literal,
relationship between them. Moreover, the sheer scale and speed at which memes are created make
manual moderation impossible. To foster healthier digital spaces for all users, we need automated
systems sophisticated enough to understand this new, complex, and multilingual language.</p>
      <p>CEUR Workshop Proceedings, ISSN 1613-0073</p>
      <p>To address this complex fusion of visual and textual data, the research community has increasingly
turned to Vision-Language Models (VLMs). However, current research focuses largely on detecting
hate and offensive content in English memes, while the same challenge in other widely spoken
languages across the globe remains overlooked.</p>
      <p>This challenge is particularly acute for Indic languages, which are often critically under-represented
in digital safety research despite their massive global presence. For example, Hindi is the third most
spoken language in the world with more than 600 million speakers. Also, Bengali is the sixth most
spoken native language and the seventh most spoken language in the world, with well over 270 million
speakers. Yet, the development of robust moderation tools for Hindi, Bengali, and other languages of
the subcontinent lags significantly behind that of English. This disparity creates a gap where harmful
content can flourish, affecting millions of users.</p>
      <p>
        Therefore, our research attempts to address this challenge. This paper tackles the task of multi-modal,
multi-lingual, and multi-task classification of potential hate speech and offensive content present
in memes in Indic languages. The goal of this research is to develop a robust pipeline to analyze
Indic memes and classify them across four distinct categories: Sentiment, Sarcasm, Vulgarity, and
Abuse. Our focus is on four Indic languages - Bodo, Bangla [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Gujarati, and Hindi - to address the
critical need for better tools in non-English contexts, accounting for challenges like code-mixed text
(“Hinglish”). Furthermore, we address the difficulty of distinguishing between related but separate
concepts, particularly Sarcasm, a nuanced form of expression that often complicates the detection of
genuine hate speech.
      </p>
      <p>
        The data for conducting this research was obtained from HASOC-meme 2025 [
        <xref ref-type="bibr" rid="ref10 ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9, 10</xref>
        ], a
part of the Forum for Information Retrieval Evaluation (FIRE) 2025. The data provided was a mix of images and their
corresponding labels and text recognized by an OCR model. The presented research is an investigation
into the limits of current state-of-the-art models on this complex real-world data. Our aim is to establish
a trustworthy performance benchmark through a rigorous evaluation methodology and to identify aspects
for future research, whether in the model’s architecture or in the data.
      </p>
      <p>The primary contributions of this research are as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>A VLM pipeline is proposed that uses the Vision Transformer (ViT) of the CLIP VLM [11] for image processing and compares three models - MuRIL [12], XLM-RoBERTa [13], and mBERT [14] - for text processing. The outputs of both encoders are combined using a fusion layer, and multi-task classification is performed for each image.</p>
        </list-item>
        <list-item>
          <p>Extensive experimental evaluation is conducted using the macro F1 score as the primary performance metric for each of the models, providing a trustworthy performance benchmark.</p>
        </list-item>
      </list>
      <p>The subsequent sections of this paper are organized as follows. The literature relevant to this
research is reviewed in Section 2. The proposed methodology of this research is presented in Section
3. The results obtained and the inferences drawn are discussed in Section 4. Section 5 provides the
conclusion of this work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <p>The proliferation of hateful content on social media platforms, particularly in the form of multimodal
memes, has become a significant societal challenge. The nuanced and contextual nature of memes, which
blend text and imagery, makes automated detection a complex task. This has spurred a considerable body
of research focused on developing robust models and comprehensive datasets to identify, understand,
and counteract this form of online hate. This survey reviews recent literature, categorizing contributions
into three key areas: advancements in multimodal detection architectures, efforts to address data and
language specificity, and the extension of Vision-Language Model (VLM) capabilities beyond simple
classification.</p>
      <sec id="sec-2-1">
        <title>2.1. Advances in Multimodal Detection Architectures</title>
        <p>Early and ongoing research has focused on architecting effective multimodal frameworks that can
synergistically analyze both visual and textual components. A foundational approach involves creating
hybrid models, such as the Multi-modal Hate Speech Detection Framework (MHSDF), which combines
Convolutional Neural Networks (CNNs) for spatial feature extraction with Long Short-Term Memory
(LSTM) networks for sequential data analysis across modalities like text, images, and even video. This
framework utilizes an attention mechanism to fuse inputs and enhance model interpretability [15].
Other studies have explored pipeline-based systems, for instance by first using Optical Character
Recognition (OCR) to extract text, then applying a lexicon-based tool like VADER for sentiment analysis,
and finally using a pre-trained CNN to analyze the visual components [16].</p>
        <p>More sophisticated deep learning techniques have sought to improve performance by leveraging
complex model integrations. One such study proposes a three-stage framework that first generates a
textual caption of the meme’s image, then processes the multimodal data through an ensemble of three
separate transformer-based models to derive a final classification [17]. Another advanced approach
utilizes a multi-task learning (MTL) framework, integrating powerful pre-trained models like CLIP,
UNITER, and BERT, to concurrently train on four distinct datasets. This method of sharing knowledge
across datasets demonstrated state-of-the-art performance over existing unimodal and multimodal
techniques [18]. The evolution of this domain is further highlighted by comparative studies evaluating
the latest end-to-end Vision-Language Models (VLMs). For example, research fine-tuning the IDEFICS
model with a QLoRA strategy has shown its superior accuracy compared to both single-modality models
and earlier multimodal methods that rely on separate feature fusion techniques like co-attention [19].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Specialized Datasets and Methodologies</title>
        <p>A significant challenge in hateful meme detection is the scarcity of high-quality, diverse, and
context-rich data. Several researchers have focused on creating novel datasets to address these gaps. To tackle
data imbalance and move beyond simple binary classification, the Meme-Merge dataset was created to
help estimate the severity of offensiveness [20]. Similarly, the GuardHarMem dataset was introduced
to provide more nuanced, fine-grained labels for various harm categories like racism and mockery,
accompanied by a baseline model, HarMDetect, which integrates auto-generated captions to improve
performance [21].</p>
        <p>The global nature of social media necessitates models that can function across different languages
and cultural contexts, which are often low-resource environments. Research in this area includes the
creation of BHM, a novel dataset for Bengali hateful memes annotated not only for hate but also for the
specific social entities being targeted, alongside the proposal of the DORA dual co-attention framework
[22]. Work has also been done for the Hindi language, involving the creation of a new dataset (with a
balanced subset generated via undersampling) and the application of a multimodal Logistic Regression
classifier [23]. Similarly, the detection of hateful memes has been introduced as a new research problem
for Thai-NLP, with a proposed solution pipeline that links scene text localization, an improved Thai-OCR
model, and a multi-task language model trained to handle common misspellings [24]. Beyond language,
cultural specificity is also critical, as shown in a study focusing on the Singaporean context, which
curated a large-scale dataset labeled by GPT-4V and used it to fine-tune a VLM pipeline for classifying
locally nuanced offensive content [25].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Extending VLM Capabilities: Explainability, Mitigation, and Critique</title>
        <p>With the advent of powerful VLMs, research has begun to explore applications beyond simple detection,
focusing on explainability, content mitigation, and critical evaluation of these models’ capabilities and
safety. One line of work leverages VLMs in a zero-shot setting, using extensive prompt engineering
to detect hateful memes without task-specific annotated data, and contributes a typology of common
error classes to guide future improvements [26]. To address the “black box” nature of many models, the
MemHateCaptioning framework was developed to generate clear, human-like explanations of why a
meme is classified as hateful, using a combination of models and Chain-of-Thought (CoT) prompting to
improve interpretability [27].</p>
        <p>Moving from passive detection to active intervention, the UnHateMeme framework leverages VLMs
like GPT-4o to actively mitigate hateful content by replacing toxic visual or textual elements,
transforming the meme into a non-hateful version [28]. However, as these models become more capable,
their potential for misuse becomes a critical concern. A recent evaluative study of seven different VLMs
revealed a significant gap between capability and safety. While the models could often understand
the complex cultural and emotional context of hateful memes, they lacked robust safeguards,
frequently failing to reject hateful prompts and proving vulnerable to misuse for generating new harmful
content [29]. This highlights an urgent need for stronger ethical guidelines and safety measures as a
key direction for future research.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The proposed methodology is designed to address the challenges inherent in multi-modal,
multilingual, and multi-task hate speech classification. Our pipeline is centered around a robust training
and evaluation protocol using K-fold cross-validation and a hybrid model architecture that combines a
pre-trained vision transformer with one of three powerful, language-specific text encoders.</p>
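      <p>The evaluation protocol named above (a 90/10 hold-out split followed by 5-fold stratified cross-validation, detailed in Section 3.3.1) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: stratifying the hold-out split, stratifying on a single label column, and the helper names <monospace>stratified_holdout</monospace> and <monospace>stratified_kfold</monospace> are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import random
from collections import defaultdict


def stratified_holdout(labels, holdout_frac=0.10, seed=0):
    """Split sample indices into (train, holdout), class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, holdout = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_hold = round(len(idxs) * holdout_frac)
        holdout.extend(idxs[:n_hold])
        train.extend(idxs[n_hold:])
    return sorted(train), sorted(holdout)


def stratified_kfold(indices, labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each class is dealt round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, val


# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, holdout_idx = stratified_holdout(labels)
splits = list(stratified_kfold(train_idx, labels))
print(len(train_idx), len(holdout_idx), [len(val) for _, val in splits])
```
]]></preformat>
      <p>In this sketch every validation fold keeps a share of the minority class, which is the property that makes per-fold scores comparable on imbalanced data. A multi-task setup would additionally need to decide which task label (or combination of labels) drives the stratification; the source does not specify this.</p>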
      <sec id="sec-3-1">
        <title>3.1. Datasets and Pre-Processing</title>
        <p>For this study, we utilized four distinct datasets, provided by HASOC-meme 2025, representing four
Indic languages: Bangla, Bodo, Gujarati, and Hindi. Example meme images from the four datasets are
presented in Figure 1. Each dataset consists of train and test subsets, each containing the images
and a corresponding CSV file. For the train subsets, the CSV file contains the labels and OCR text for
each image. For the test subsets, the CSV file contains only the OCR text for each image.
The datasets are summarized as follows:</p>
        <list list-type="bullet">
          <list-item>
            <p>Bangla - 4514 samples in total: 2693 for training and 1821 for testing.</p>
          </list-item>
          <list-item>
            <p>Bodo - 632 samples in total: 378 for training and 254 for testing.</p>
          </list-item>
          <list-item>
            <p>Gujarati - 1493 samples in total: 889 for training and 604 for testing.</p>
          </list-item>
          <list-item>
            <p>Hindi - 1910 samples in total: 1141 for training and 769 for testing.</p>
          </list-item>
        </list>
        <p>The train-test split for all four languages is shown in Figure 2.</p>
        <p>The model is trained for multi-task classification on each dataset to detect sentiment, abuse, vulgarity,
and sarcasm in each meme. However, severe class imbalance is observed across all four
tasks in all four datasets and needs to be addressed. Figure 3 details the class imbalance present in
each dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model Architecture</title>
        <p>Our model consists of three primary components: a vision encoder, a text encoder, and a fusion
mechanism that combines their outputs before feeding them to task-specific classification heads. The
model architecture is detailed in Figure 4.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Vision Encoder</title>
          <p>For the visual modality, the Vision Transformer (ViT-B/32) backbone from the pre-trained CLIP model
was utilized. This choice was deliberate and motivated by the unique nature of its training. Unlike
traditional vision models pre-trained on fixed-category classification tasks like ImageNet, the CLIP
vision encoder was trained via a contrastive objective on 400 million image-text pairs sourced from the
internet.</p>
          <p>This process compels the model to learn image representations that are deeply aligned with natural
language semantics. Consequently, the resulting image features are not merely representations of
objects, but also capture the abstract concepts, actions, and sentiments described in the associated
text. This semantic richness is particularly advantageous for our task, as the interpretation of memes
often depends on the nuanced interplay between visual context and textual content, making CLIP’s
language-aware features a more powerful starting point than those from standard image classifiers.</p>
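          <p>The model takes an input image <inline-formula><tex-math>$I \in \mathbb{R}^{H \times W \times C}$</tex-math></inline-formula> and produces a 768-dimensional embedding, <inline-formula><tex-math>$E_v$</tex-math></inline-formula>,
from its final layer’s pooled output.</p>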
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Text Encoder</title>
          <p>The textual modality, derived from Optical Character Recognition (OCR), presents a unique set of
challenges, including noise, non-standard grammar, and the frequent use of code-mixed language
(e.g., “Hinglish”). To effectively process the nuances of these Indic languages, our approach is to move
beyond generic text models and experiment with and compare three powerful, pre-trained multilingual
transformers: MuRIL, XLM-RoBERTa, and mBERT. The goal is to find an encoder that best captures
the specific semantic and contextual information required for our classification tasks.</p>
          <p>For a given OCR text input, the corresponding tokenizer for the chosen model first converts the text
into a sequence of tokens. This sequence is then fed to the text model to generate a 768-dimensional
embedding, <inline-formula><tex-math>$E_t$</tex-math></inline-formula>, from its pooled output, which serves as a rich semantic representation of the text.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Multi-Modal Fusion and Classification</title>
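          <p>The feature vectors from the two encoders are first projected to a harmonized dimension and then
fused. The 768-dimensional vision embedding <inline-formula><tex-math>$E_v$</tex-math></inline-formula> and text embedding <inline-formula><tex-math>$E_t$</tex-math></inline-formula> are projected into a 512-dimensional space
using separate linear layers:
<disp-formula><label>(1)</label><tex-math>$$P_v = W_v \times E_v + b_v$$</tex-math></disp-formula>
<disp-formula><label>(2)</label><tex-math>$$P_t = W_t \times E_t + b_t$$</tex-math></disp-formula>
where <inline-formula><tex-math>$W_v$</tex-math></inline-formula> and <inline-formula><tex-math>$W_t$</tex-math></inline-formula> are the weight matrices and <inline-formula><tex-math>$b_v$</tex-math></inline-formula> and <inline-formula><tex-math>$b_t$</tex-math></inline-formula> are the biases for their respective
linear projection layers. These harmonized 512-dimensional vectors are then concatenated to form a
single 1024-dimensional multi-modal feature vector, <inline-formula><tex-math>$F_{mm}$</tex-math></inline-formula>:
<disp-formula><label>(3)</label><tex-math>$$F_{mm} = P_v \oplus P_t$$</tex-math></disp-formula></p>
          <p>This combined vector is passed through a final fusion block, consisting of a linear layer, a ReLU
activation, and a Dropout layer, before being passed to four independent linear classification heads for
the final predictions.</p>
          <p>To address the severe class imbalance observed in the datasets, a Weighted Cross-Entropy Loss is used
during training. The weight for each class is calculated as the inverse of its frequency in the training
fold; the per-sample loss is defined in Section 3.3.2. At inference time, the logits of the models trained on
the individual folds are averaged (Section 3.3.3). This averaging process improves generalization and produces
a more stable final prediction. The final prediction of the ensemble models is made over the hold-out test set.</p>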
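          <p>The projection, concatenation, fusion block, and per-task heads described in this subsection can be sketched as follows. This is a dependency-free illustration of the data flow rather than the authors' PyTorch implementation: the class name <monospace>FusionClassifier</monospace>, the hidden size of the fusion block, the random stand-in weights, and the number of classes per task are all assumptions, since the source does not state them.</p>
          <preformat><![CDATA[
```python
import random


def linear(x, weight, bias):
    """Compute y = W x + b for a list vector x and a row-major matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, xi) for xi in x]


def init_layer(out_dim, in_dim, rng):
    """Small random weights and zero biases (stand-ins for learned parameters)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class FusionClassifier:
    """Project both embeddings, concatenate, fuse, and classify per task."""

    def __init__(self, enc_dim=768, proj_dim=512, hidden_dim=512,
                 task_classes=None, seed=0):
        rng = random.Random(seed)
        # Assumed class counts per task; the source does not list them.
        self.task_classes = task_classes or {
            "sentiment": 3, "sarcasm": 2, "vulgarity": 2, "abuse": 2,
        }
        self.vision_proj = init_layer(proj_dim, enc_dim, rng)
        self.text_proj = init_layer(proj_dim, enc_dim, rng)
        self.fusion = init_layer(hidden_dim, 2 * proj_dim, rng)
        self.heads = {task: init_layer(n, hidden_dim, rng)
                      for task, n in self.task_classes.items()}

    def forward(self, e_v, e_t):
        p_v = linear(e_v, *self.vision_proj)   # vision projection
        p_t = linear(e_t, *self.text_proj)     # text projection
        f_mm = p_v + p_t                       # list concatenation of the two vectors
        # Dropout is the identity at inference time, so it is omitted here.
        h = relu(linear(f_mm, *self.fusion))
        return {task: linear(h, *head) for task, head in self.heads.items()}


# Tiny dimensions keep the demo fast; the defaults mirror the 768/512 sizes in the text.
model = FusionClassifier(enc_dim=8, proj_dim=4, hidden_dim=6, seed=1)
logits = model.forward([0.1] * 8, [0.2] * 8)
print({task: len(values) for task, values in logits.items()})
```
]]></preformat>
          <p>Keeping one shared fused representation and four small independent heads means the tasks share most parameters while each task still produces its own logit vector.</p>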
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Training and Evaluation</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Hold-Out Set and K-Fold Cross Validation</title>
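          <p>Since no labels were provided for the test set, each dataset’s training data is first split into a
training set (90%) and a hold-out test set (10%) to ensure robust and unbiased training and to
monitor the models’ training. We then employ a 5-Fold Stratified Cross-Validation strategy on the 90% training portion.
This mitigates the risk of performance variance due to a single, arbitrary data split and allows us to
train a robust ensemble of models.</p>
        </sec>
        <sec>
          <title>3.3.2. Weighted Loss Function</title>
          <p>For a given task with <inline-formula><tex-math>$C$</tex-math></inline-formula> classes, the weighted cross-entropy loss <inline-formula><tex-math>$L$</tex-math></inline-formula> for a single sample is defined as:
<disp-formula><label>(4)</label><tex-math>$$L = -w_c \log(p_c)$$</tex-math></disp-formula>
where <inline-formula><tex-math>$c$</tex-math></inline-formula> is the true class index, <inline-formula><tex-math>$p_c$</tex-math></inline-formula> is the predicted probability for that class, and <inline-formula><tex-math>$w_c$</tex-math></inline-formula> is the pre-computed
weight for class <inline-formula><tex-math>$c$</tex-math></inline-formula>. This increases the penalty for misclassifying a minority class sample, forcing the
model to pay more attention to it.</p>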
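          <p>As a worked illustration of the weighted loss, the sketch below computes inverse-frequency class weights from the labels of a fold and evaluates the per-sample loss as the negative weighted log-probability of the true class. The exact normalization of the weights is not given in the source; using the total sample count divided by the class count is an assumption, as are the function names.</p>
          <preformat><![CDATA[
```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by the inverse of its relative frequency in the fold."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / n for cls, n in counts.items()}


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def weighted_cross_entropy(logits, true_class, weights):
    """Per-sample loss: -w_c * log(p_c), with p = softmax(logits)."""
    probs = softmax(logits)
    return -weights[true_class] * math.log(probs[true_class])


# Hypothetical imbalanced fold: 90 samples of class 0 and 10 of class 1.
fold_labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(fold_labels)

# The same uninformative prediction costs far more on a minority-class sample.
loss_majority = weighted_cross_entropy([0.0, 0.0], 0, weights)
loss_minority = weighted_cross_entropy([0.0, 0.0], 1, weights)
print(weights, loss_majority, loss_minority)
```
]]></preformat>
          <p>In PyTorch, the same effect is typically obtained by passing a per-class weight tensor to <monospace>torch.nn.CrossEntropyLoss</monospace> through its <monospace>weight</monospace> argument; whether the authors did so is not stated in the source.</p>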
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.3. Ensemble Strategy</title>
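          <p>The final reported performance is calculated by combining the 5 models trained during cross-validation
into an ensemble. For a given sample, the output logits <inline-formula><tex-math>$z_i$</tex-math></inline-formula> from each of the <inline-formula><tex-math>$N = 5$</tex-math></inline-formula>
models are averaged:
<disp-formula><label>(5)</label><tex-math>$$\bar{z} = \frac{1}{N} \sum_{i=1}^{N} z_i$$</tex-math></disp-formula></p>
          <p>The final prediction is the class corresponding to the highest average logit value:
<disp-formula><tex-math>$$\hat{y} = \arg\max_{k} \bar{z}_k$$</tex-math></disp-formula></p>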
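          <p>The ensembling step can be illustrated with a short sketch that averages the per-class logits of the fold models for one sample and then takes the class with the highest average. The function name and the logit values are hypothetical and serve only to show the mechanics.</p>
          <preformat><![CDATA[
```python
def ensemble_predict(fold_logits):
    """Average per-class logits over the fold models, then take the argmax.

    fold_logits: list of N logit vectors (one per fold model) for one sample.
    """
    n_models = len(fold_logits)
    n_classes = len(fold_logits[0])
    mean_logits = [sum(model[k] for model in fold_logits) / n_models
                   for k in range(n_classes)]
    predicted = max(range(n_classes), key=lambda k: mean_logits[k])
    return predicted, mean_logits


# Hypothetical logits from N = 5 fold models for a 3-class task.
fold_logits = [
    [1.2, 0.3, -0.5],
    [0.1, 0.9, -0.2],
    [0.8, 0.7, 0.0],
    [1.0, 0.2, -0.1],
    [0.4, 1.1, -0.3],
]
predicted, mean_logits = ensemble_predict(fold_logits)
print(predicted, mean_logits)
```
]]></preformat>
          <p>In this toy input, two of the five models individually prefer class 1, but the averaged logits favor class 0, which shows how averaging smooths out disagreement between folds.</p>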
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussions</title>
      <p>The implementation of the proposed pipeline and the results obtained are detailed in this
section. The performance of the models is evaluated using the predictions made over the hold-out
test set and the actual test set. The metrics used for evaluation are the Accuracy and
Macro F1 Score achieved on the hold-out test set and the Macro F1 Score on the actual test set. The
metrics for the test set were obtained directly by submitting the test classification file, generated
by the models post-training, to the HASOC 2025 portal on the Kaggle platform.</p>
      <sec id="sec-4-1">
        <title>4.1. Implementation</title>
        <p>The model’s vision and text backbones are initialized using pre-trained weights from CLIP’s ViT and the
respective language models from Hugging Face, while the custom projection and classification layers
are initialized randomly. All input images are resized to a 224×224 dimension by the CLIP processor. For
data augmentation during training, several transformations are applied including random horizontal
flips, random rotations up to 10 degrees, and color jittering.</p>
        <p>The feature vectors from the vision and text encoders are projected and fused into a 1024-dimensional
vector. To mitigate overfitting, a dropout rate of 0.5 is applied to this fused vector before it is passed to
the final classification heads. The optimization is handled by the Adam optimizer, utilizing a differential
learning rate strategy. The pre-trained CLIP vision backbone is fine-tuned with a learning rate of
1 × 10<sup>−6</sup>, the language model backbone with 2 × 10<sup>−5</sup>, and the custom, randomly initialized layers with
1 × 10<sup>−4</sup>. The model is trained for a maximum of 25 epochs with a mini-batch size of 32, using a
weighted cross-entropy loss function to address class imbalance.</p>
        <p>The entire framework is developed in Python using the PyTorch and Transformers libraries. All
experiments are conducted on a standard computing system equipped with an NVIDIA A100 40GB
GPU.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Overall Performance</title>
        <p>The proposed strategy is evaluated quantitatively, with the macro F1 score achieved on
the test set as the primary performance metric. Because the datasets are highly imbalanced for all
four tasks of Abuse, Sarcasm, Vulgarity, and Sentiment, the macro F1 score is the key performance metric,
as it weights the performance of the models on all classes equally. Table 1 details the results of
the experiments with CLIP’s ViT as the vision encoder and MuRIL, XLM-RoBERTa, and
mBERT as the text encoders, respectively. While the mBERT pipeline showed the
best results on the hold-out test set, XLM-RoBERTa outperformed the other models on the actual
test set, achieving the best Macro F1 Score on all four datasets, with the highest score achieved on
the Bangla dataset. This suggests that, among the three text encoders, XLM-RoBERTa generalizes best on these
tasks.</p>
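        <p>To make the choice of metric concrete, the following sketch computes the macro F1 score as the unweighted mean of per-class F1 scores and contrasts it with accuracy on a hypothetical imbalanced example. The data and the function name are illustrative assumptions and are not taken from the experiments.</p>
        <preformat><![CDATA[
```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical case: nine majority samples, one minority sample,
# and a classifier that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(accuracy, score)
```
]]></preformat>
        <p>Here accuracy is 0.9 while the macro F1 score is 9/19 (about 0.47), because the minority class contributes an F1 of zero with the same weight as the majority class.</p>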
        <p>To examine each model pipeline further, its performance on each task on
the hold-out test set is evaluated. These findings are outlined in Tables 2-5.</p>
        <p>[Tables: results per Language, with Accuracy and F1 Score columns for each of the mBERT, MuRIL, and XLM-RoBERTa pipelines; table bodies not available.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we conducted a comprehensive study on multi-modal hate speech detection in Indic memes
by systematically evaluating dual-encoder architectures combining CLIP with mBERT, XLM-RoBERTa, and
MuRIL across four languages. Our experiments, grounded in a robust 5-fold cross-validation protocol,
establish crucial performance benchmarks. A consistent pattern emerged: while models adeptly classify
sentiment, their performance is substantially lower on nuanced, context-dependent tasks like sarcasm,
with macro F1 scores on the test set plateauing in the 0.56-0.61 range. This suggests that the features
required to discern complex social phenomena are not easily captured by current state-of-the-art models
on this type of real-world social media data, thereby highlighting the inherent difficulty of the task.</p>
      <p>Potential avenues for future work could therefore shift from model-centric adjustments toward
data-centric approaches. Exploring targeted data enrichment, curation, and balancing strategies presents a
promising path for surpassing the current performance ceiling in this challenging domain.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <p>[28] M.-H. Van, X. Wu, Detecting and mitigating hateful content in multimodal memes with
vision-language models, arXiv preprint arXiv:2505.00150 (2025).</p>
      <p>[29] Y. Ma, X. Shen, Y. Qu, N. Yu, M. Backes, S. Zannettou, Y. Zhang, From meme to threat: On
the hateful meme understanding and induced hateful content generation in open-source vision
language models, in: USENIX Security Symposium (USENIX Security). USENIX, 2025.</p>
    </sec>
  </body>
  <back>
  </back>
</article>