<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Safer Social Media: Multimodal Hate Speech Detection in Memes across Diverse Indian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rachana Nagaraju</string-name>
          <email>rachananagaraju20@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hosahalli Lakshmaiah Shashirekha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Mangalore University</institution>
          ,
          <addr-line>Mangalore, Karnataka</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>The proliferation of hateful and offensive content on social media raises significant societal concerns, particularly when such messages are conveyed through multimodal memes that combine text and images. Unlike purely textual posts, memes often exploit the interplay between modalities, making the detection of toxic content more challenging. The HASOC-Meme 2025 shared task at FIRE 2025 introduces benchmark datasets in four low-resource languages: Bangla, Hindi, Gujarati, and Bodo, with the objective of identifying hate speech and offensive content by jointly analyzing textual and visual signals embedded in the memes. In this paper, we, team MUCS, describe the models we submitted to the HASOC-Meme 2025 shared task. To tackle the challenges, we have developed multimodal models that integrate transformer-based text encoders (Indic-BERT, MuRIL, XLM-RoBERTa) with convolutional and transformer-based vision models (ResNet, EfficientNet, ViT) using two fusion mechanisms, concatenation and attention-based fusion, to effectively capture the complementary cues from both modalities. The shared task is formulated as a multi-task learning problem with three binary classification problems: i) detecting abuse, ii) assessing vulgarity, and iii) assessing sarcasm, and two multi-class classification problems: i) assigning one of three sentiment labels and ii) identifying one of many targeted communities. This multi-task setup reflects the heterogeneous nature of offensive content in memes: while sentiment spans multiple levels of polarity, the other categories naturally align with binary distinctions. By jointly optimizing these complementary objectives within a unified architecture, the model is able to leverage shared multimodal representations while also specializing for each subtask, thereby improving overall robustness and generalization across languages. 
Our models achieve competitive performance across languages, ranking 11th in Bangla (macro F1 score 0.5379), 14th in Hindi (macro F1 score 0.5250), 3rd in Gujarati (macro F1 score 0.6185), and 12th in Bodo (macro F1 score 0.5522). These results highlight the effectiveness of multimodal architectures for offensive content identification in memes and demonstrate their adaptability across linguistically diverse and resource-scarce settings.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal learning</kwd>
        <kwd>Hate speech detection</kwd>
        <kwd>Offensive content identification</kwd>
        <kwd>Memes</kwd>
        <kwd>Low-resource languages</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The exponential growth of user-generated content on social media platforms such as Twitter, Facebook,
and Instagram enables millions of users to express their opinions, share experiences, and engage in
public discourse. However, this digital democratization also provides fertile ground for the
dissemination of harmful material, including hate speech, abusive language, and offensive content. Such
toxic communication not only marginalizes vulnerable groups but also undermines the quality of
online discourse and, by extension, democratic processes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While automated hate speech detection is
extensively studied in textual data [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], detecting hateful and offensive memes is still in its infancy.
Memes are multimodal artifacts that combine images with overlaid or accompanying text and represent
a growing trend on social media. They often rely on the interplay between image and text modalities to
convey humor, sarcasm, or offensive undertones. For instance, textual content may appear benign in
isolation, but when paired with a culturally or politically loaded image, it can communicate targeted
hate. Some sample memes from the HASOC-Meme 2025 dataset are illustrated in Figure 1 ((a) Bangla,
(b) Hindi, (c) Gujarati, and (d) Bodo), highlighting the diversity of languages. This multimodal nature
makes hate speech detection substantially more
challenging than unimodal tasks, as it requires models to jointly interpret both textual and visual signals
[
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        The HASOC-Meme 2025 [
        <xref ref-type="bibr" rid="ref6">6, 7</xref>
        ] shared task at FIRE 2025 is organized with the goal of benchmarking
multimodal approaches for hate speech and offensive content identification in memes in four Indian
languages: Bangla, Hindi, Gujarati, and Bodo. These languages are underrepresented due to a lack
of linguistic resources and computational tools, making the shared task particularly important for
advancing research in low-resource settings. The shared task involves analyzing multimodal data
(image and text) to detect abuse, identify targeted communities, assess vulgarity and sarcasm, and
assign sentiment labels for a given meme.
      </p>
      <p>In this paper, we, team MUCS, describe the models submitted to the HASOC-Meme 2025 shared task. To
address the challenges of multimodal hate speech detection, we designed models that leverage
state-of-the-art transformer-based text encoders (Indic-BERT [8], MuRIL [9], and XLM-RoBERTa [10]) along
with deep vision backbones (ResNet [11], EfficientNet [12], and Vision Transformers (ViT) [13]) to
represent text and images, respectively. The shared task is formulated as a multi-task learning problem
with three binary classification problems: i) detecting abuse, ii) assessing vulgarity, and iii) assessing
sarcasm, and two multi-class classification problems: i) assigning one of three sentiment labels and
ii) identifying one of many targeted communities. We explore two fusion strategies,
concatenation-based and attention-based fusion, to integrate features from the image and text modalities effectively. The
proposed models deliver competitive performance, ranking 11th in Bangla with a macro F1 score of
0.5379, 14th in Hindi with 0.5250, 3rd in Gujarati with 0.6185, and 12th in Bodo with 0.5522. These
results demonstrate both the promise and the challenges of multimodal learning for hate speech meme
detection in linguistically diverse and resource-constrained environments. By contributing a systematic
exploration of multimodal architectures and fusion strategies, this work advances the development of
robust content moderation systems for low-resource languages.</p>
      <p>The remaining sections of this paper cover related work (Section 2), methodology (Section 3),
experiments, results, and implications of our approach (Section 4), followed by the conclusion and future
work (Section 5).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>The growing presence of multimodal hateful and offensive content in Indian languages poses unique
challenges compared to English and other high-resource languages. While earlier works demonstrate
the promise of multimodal fusion techniques, the combination of code-switching, script diversity, and
cultural nuances in memes makes this task particularly demanding.</p>
      <p>Dubey et al. [14] focus on detecting offensive content in Hindi memes, highlighting the need for
automated solutions in low-resource languages. Their approach combines textual and visual cues
through a Logistic Regression classifier, achieving an accuracy of 81%. Although their work demonstrates
that multimodal fusion can significantly outperform unimodal methods in Hindi, the reliance on
shallow classifiers restricts scalability and generalization compared to transformer-based approaches.
Karim et al. [15] address the challenge of hate speech detection in Bengali by systematically exploring
multimodal architectures. They use recurrent neural networks and pretrained language models such as
BanglaBERT and XLM-RoBERTa, alongside deep visual encoders including ResNet-152 and
DenseNet-161. Their best multimodal fusion model reaches an F1 score of 0.83, outperforming text-only and
image-only baselines. The study highlights the complementary value of vision and text, though fusion
gains are relatively modest, pointing to the difficulty of balancing modalities. Hossain et al. [16]
investigate meme classification in Bengali and code-mixed contexts by integrating CNN-based visual
models with transformer-based text encoders. They show that multimodal pipelines improve F1 scores
by about 3% compared to unimodal setups. Their work emphasizes that visual cues often provide
disambiguating context when textual content alone is insufficient, though the improvements come at
the expense of computational complexity.</p>
      <p>Debnath et al. [17] extend the scope of hateful meme detection by examining advanced multimodal
architectures to handle context-rich Bengali memes. They evaluate BanglaBERT with visual backbones
such as ResNet, Inception, and Vision Transformer, and find that multimodal models consistently
outperform unimodal ones, with accuracies around 64%. Despite these improvements, the study underscores
challenges in aligning visual and textual features, which often limits overall performance gains. Rajput
et al. [18] concentrate on politically motivated and code-switched Indian memes, a domain that presents
unique linguistic and cultural challenges. They develop a CNN + LSTM hybrid architecture, where
CNNs extract visual features and LSTMs model text sequences. Their approach achieves state-of-the-art
results on their benchmark dataset, demonstrating the strength of hybrid neural designs. However, the
system is less adaptable to broader meme domains outside political discourse.</p>
      <p>Manukonda and Kodali [19] examine misogyny detection in Tamil and Malayalam memes,
emphasizing the underrepresentation of Dravidian languages in multimodal hate speech research. They propose a
transliteration-aware XLM-RoBERTa encoder for text, fused with ResNet-50 image embeddings through
an attention-BiLSTM module. The system delivers strong results with macro-F1 scores of 0.8805 for
Malayalam and 0.8081 for Tamil. Despite its effectiveness, the study points out difficulties with class
imbalance and limited generalization across visual domains. Wong and Durward [20] explore targeted
offensive content detection in Gujarati and Hindi as part of the LT-EDI-2024 shared task. Their system
leverages transformer-based classifiers with explicit handling of code-mixing and script-switching,
ranking second in Gujarati and Telugu subtasks. The work confirms transformers as effective baselines
for under-resourced Indian languages, though noise from OCR and inconsistent transliteration practices
remain limiting factors.</p>
      <p>Overall, existing studies on hateful and offensive meme detection in Indian languages clearly establish the
importance of multimodal approaches and also reveal persistent challenges
in modality alignment, code-mixing, and domain adaptation. These insights directly motivate our work,
where we design and evaluate multimodal fusion strategies to advance hate speech and offensive
content detection in Bangla, Hindi, Gujarati, and Bodo memes. By leveraging recent transformer-based
language models and modern visual encoders, combined with popular fusion strategies, we aim to
contribute to more robust hate speech meme detection in multilingual social media contexts.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Hate speech meme detection is formulated as a multi-task problem with three binary classification
problems: i) detecting abuse, ii) assessing vulgarity, and iii) assessing sarcasm, and two multi-class
classification problems: i) assigning one of three sentiment labels and ii) identifying one of many
targeted communities. Each input is a pair consisting of text (OCRed content or caption) and an image. The
overall workflow consists of text pre-processing, image pre-processing, feature representation and
fusion, hyperparameter configuration, multi-task formulation, and evaluation. The proposed multi-task
learning framework for HASOC-Meme detection is illustrated in Figure 2, and the steps involved are detailed
below:</p>
      <sec id="sec-3-1">
        <title>3.1. Text Pre-processing</title>
        <p>The following steps are applied to prepare textual inputs:
• Text normalization: text is converted to lowercase, and the id/image_id fields are unified into
a single ids field.
• Schema validation: an exception is raised if no OCR/text field exists, ensuring consistency
across datasets.
• Tokenization: the tokenizer of each pretrained text model is applied with a maximum sequence
length of 128 tokens, and all input sequences are padded to this fixed length.</p>
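        <p>The three steps above can be sketched in plain Python. This is a minimal illustration and not the authors' code: the field names in TEXT_FIELDS, the padding id PAD_ID, and the helper names prepare_record and pad_or_truncate are assumptions introduced here, and a real pipeline would obtain the token ids from the tokenizer of the chosen pretrained model.</p>
        <preformat><![CDATA[
```python
MAX_LEN = 128  # maximum sequence length stated in the text
PAD_ID = 0     # assumption: padding token id; real tokenizers define their own
TEXT_FIELDS = ("ocr", "text")  # assumption: candidate names of the OCR/text field


def prepare_record(record):
    """Lowercase the text and unify id/image_id into a single 'ids' key."""
    text_key = next((k for k in TEXT_FIELDS if k in record), None)
    if text_key is None:
        # Schema validation: fail early when no OCR/text field exists.
        raise KeyError("record has no OCR/text field")
    if "id" in record:
        ids = record["id"]
    elif "image_id" in record:
        ids = record["image_id"]
    else:
        raise KeyError("record has neither 'id' nor 'image_id'")
    return {"ids": str(ids), "text": str(record[text_key]).lower()}


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate to max_len, then right-pad so every sequence has the same length."""
    clipped = list(token_ids[:max_len])
    n_pad = max_len - len(clipped)
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return clipped + [pad_id] * n_pad, attention_mask


record = prepare_record({"image_id": 17, "ocr": "Hello WORLD"})
padded_ids, mask = pad_or_truncate([5, 6, 7])
```
]]></preformat>
        <p>Lowercasing only affects the romanized portion of the text, since the Bengali, Devanagari, and Gujarati scripts have no letter case.</p>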
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Image Pre-processing</title>
        <p>The following steps are applied to prepare image inputs:
• Decoding: images are loaded in RGB mode using the PIL library (https://pillow.readthedocs.io/en/stable/reference/index.html).
• Normalization: all images are resized to 224 × 224 pixels and pixel values are normalized with
a mean and standard deviation of 0.5 for each channel.
• Augmentation: during training, random horizontal flips (p = 0.5) and random
brightness/contrast adjustments (0.2) are applied.
• Path resolution: image identifiers are matched against both “.jpg” and “.png” extensions; if neither file exists,
a zero tensor of size 3 × 224 × 224 is substituted to maintain batch consistency.</p>
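        <p>The normalization and path-resolution steps can be illustrated with the standard library alone. The sketch below is an assumption-laden illustration rather than the authors' implementation: it assumes pixel values have already been scaled to [0, 1] before normalization, and the helper names normalize, resolve_image_path, and zero_image are hypothetical.</p>
        <preformat><![CDATA[
```python
import tempfile
from pathlib import Path

SIZE = 224
MEAN = STD = 0.5  # per-channel statistics stated in the text


def normalize(value):
    """Map a pixel already scaled to [0, 1] into [-1, 1] using mean 0.5 and std 0.5."""
    return (value - MEAN) / STD


def resolve_image_path(image_dir, image_id):
    """Return the first existing '<id>.jpg' or '<id>.png' path, or None if neither exists."""
    for ext in (".jpg", ".png"):
        candidate = Path(image_dir) / f"{image_id}{ext}"
        if candidate.exists():
            return candidate
    return None


def zero_image(channels=3, size=SIZE):
    """All-zero placeholder of shape 3 x 224 x 224 for memes whose image file is missing."""
    return [[[0.0] * size for _ in range(size)] for _ in range(channels)]


# Usage: one image exists as .png, the other has no file at all.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "meme_1.png").touch()
    found = resolve_image_path(tmp, "meme_1")
    missing = resolve_image_path(tmp, "meme_2")

placeholder = zero_image() if missing is None else None
```
]]></preformat>
        <p>With a mean and standard deviation of 0.5, the normalized range is symmetric around zero, so an all-zero placeholder corresponds to a uniform mid-gray image under this scaling assumption.</p>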
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Feature Representation and Fusion</title>
        <p>Text and image features are extracted independently and then fused to obtain a joint
representation for each image–text pair, as described below:</p>
        <p>• Text feature representation:
– Text embeddings are obtained from the following pretrained transformer-based encoders:
∗ Indic-BERT - is a multilingual transformer model pre-trained on 12 major Indian
languages and English using a Masked Language Modeling (MLM) objective. It captures
rich linguistic and semantic features across related Indic languages, making it suitable
for cross-lingual and multilingual NLP tasks in the Indian context.
∗ Multilingual BERT (mBERT) - is a multilingual version of BERT trained on Wikipedia
text from 104 languages. It uses a shared WordPiece vocabulary and a single transformer
network to model multiple languages simultaneously and demonstrates impressive
zero-shot cross-lingual transfer capabilities, enabling it to perform well on languages it
was not fine-tuned on.
∗ Multilingual Representations for Indian Languages (MuRIL) - is developed by
Google to improve multilingual understanding of Indian languages. Unlike mBERT,
MuRIL is trained on both monolingual and transliterated text, as well as parallel corpora.
This helps it to better capture code-mixing and translation nuances common in Indian
language text.
∗ Distil-mBERT - is a distilled version of mBERT that retains 95% of mBERT’s
performance while being 40% smaller and 60% faster. It is trained using knowledge distillation,
where the smaller model learns to mimic the behavior of mBERT, making it efficient
for resource-constrained environments.
∗ ELECTRA-small - is a lightweight transformer-based language model trained using
the Replaced Token Detection (RTD) objective, which makes it more sample-efficient
than traditional MLM models. Instead of masking and predicting tokens, ELECTRA
trains a discriminator to detect replaced words, resulting in better performance with
fewer computational resources.
∗ XLM-RoBERTa - is a multilingual variant of RoBERTa trained on 2.5 TB of filtered
CommonCrawl data across 100 languages. It improves upon mBERT by leveraging
more data, longer training, and dynamic masking, achieving state-of-the-art results on
multiple cross-lingual benchmarks.
– The [CLS] token embedding (last_hidden_state[:,0,:]) is extracted as the global
sentence-level representation.
– A linear adapter layer projects the text embedding into a 512-dimensional space, followed
by ReLU activation and dropout (p = 0.1).</p>
        <p>• Image feature representation:
– Visual embeddings are obtained from the following pretrained encoders:
∗ ResNet50 - is a 50-layer deep Convolutional Neural Network (CNN) that introduced
residual learning through skip connections. These residual blocks help train very deep
networks efficiently by addressing the vanishing gradient problem, making ResNet50 a
standard backbone for many computer vision tasks.
∗ EfficientNet-B3 - belongs to the EfficientNet family, which scales network depth, width,
and resolution using a compound coefficient. It achieves high accuracy with optimized
computational efficiency, outperforming many larger models with significantly fewer
parameters.
∗ EfficientNet-B4 - is a deeper and wider version of EfficientNet-B3, offering improved
representational power while maintaining strong efficiency. It balances accuracy and
computational cost, making it suitable for applications requiring higher performance
with moderate resource constraints.
∗ MobileNetV3 - is an efficient CNN architecture optimized for mobile and edge devices.
It combines depthwise separable convolutions, squeeze-and-excitation modules, and a
lightweight neural architecture search design and achieves a strong trade-off between
speed and accuracy.
∗ Vision Transformer (ViT-B/32) - is a transformer-based architecture for image
understanding. It divides an image into fixed-size patches, linearly embeds them, and
processes them using standard transformer layers. ViT-B/32 uses a patch size of 32 × 32
and achieves competitive results compared to convolutional models, especially when
trained on large datasets.
– Features are extracted from the penultimate layer by setting num_classes=0, which returns
globally pooled features.
– The resulting feature vector dimensionality depends on the backbone.
– A linear adapter layer projects the image embedding into a 512-dimensional space, followed
by ReLU activation and dropout (p = 0.1).</p>
        <p>• Fusion strategies:
– Concatenation: text and image embeddings are concatenated into a single vector.
– Attention: text and image embeddings are projected into a shared 512-dimensional space.
Multi-head attention (4–8 heads) is applied in both directions, enhancing text with image
context and image with text context. The resulting enhanced embeddings are concatenated.
– Text and vision encoder combinations: the following text–image encoder combinations
are explored:
1. Indic-BERT + EfficientNet-B4
2. MuRIL + EfficientNet-B3
3. XLM-RoBERTa + ResNet50
4. mBERT + ResNet50
5. Distil-mBERT + EfficientNet-B3
6. ELECTRA-small + MobileNetV3
7. XLM-RoBERTa + ViT-B/32</p>
        <p>The fused image and text representation is used to train a multi-task model for the HASOC-Meme detection
task.</p>
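        <p>To make the two fusion strategies concrete, the following plain-Python sketch contrasts concatenation with a bidirectional cross-attention fusion. It is a deliberately simplified illustration under several assumptions that are not taken from the paper: a single attention head, no learned projections or residual connections, mean pooling after attention, and token-level inputs for both modalities. All function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Single-head scaled dot-product attention: each query becomes a weighted mean of values."""
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        attended.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return attended


def mean_pool(tokens):
    """Average a sequence of vectors into one vector."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def concat_fusion(text_vec, image_vec):
    """Concatenation fusion: join the two embeddings into a single vector."""
    return list(text_vec) + list(image_vec)


def attention_fusion(text_tokens, image_tokens):
    """Text attends to image, image attends to text, and the pooled results are concatenated."""
    text_enhanced = cross_attend(text_tokens, image_tokens, image_tokens)
    image_enhanced = cross_attend(image_tokens, text_tokens, text_tokens)
    return concat_fusion(mean_pool(text_enhanced), mean_pool(image_enhanced))


# Two 512-dimensional embeddings concatenate into a 1024-dimensional vector.
joint = concat_fusion([0.0] * 512, [0.0] * 512)

# Tiny 2-dimensional example of the attention variant.
fused = attention_fusion([[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]])
```
]]></preformat>
        <p>In the tiny example, both image tokens are identical, so every text query receives uniform attention weights and the text-side output is simply the mean of the image tokens.</p>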
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Task Formulation</title>
        <p>The HASOC-Meme detection problem is formulated as a multi-task learning problem with three binary
classification problems and two multi-class classification problems, as described below:
• Multi-class Classification
i) Sentiment Analysis - has three categories, {Negative, Neutral, Positive}, mapped to the
numeric encodings {0, 1, 2}, respectively. The predictions for the test data are generated
using a sigmoid activation over one output logit, where a value above 0.5 indicates a positive
sentiment.
ii) Target Community Identification - has many class labels, which are encoded numerically; the
number of labels varies from one language to another. Some of the labels
common to all the languages are given below:
– Gender - Any reference to male, female, non-binary, or transgender identities.
– Religion - Mentions or imagery related to any religious belief, deity, or practice.
– Individual - Specifically mentions or portrays a particular person.
– Political - Targets political ideologies, parties, politicians, or policies.
– National Origin - Targets people based on their country or ethnicity.
– Social Sub-groups - Groups based on socio-economic status, occupation, cultural
identity, or other affiliations.
– Others - Any target that does not fall into the above categories.
– None - If the meme does not target any specific community, no target label is assigned.
Predictions for the test data are generated using a sigmoid activation over one output logit,
where a value above 0.5 indicates a positive sentiment.
• Binary Classification
i) Sarcasm vs. Non-Sarcasm
ii) Vulgar vs. Non-Vulgar
iii) Abusive vs. Non-Abusive
Labels are encoded as 1 if the corresponding field contains indicators such as Sarcastic, Vulgar,
or Abusive, and 0 otherwise. Predictions for test data are generated through sigmoid activations
over three independent logits, one for each binary task.</p>
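        <p>The binary label encoding and the thresholded sigmoid predictions can be sketched as follows. This is an illustrative assumption rather than the authors' code: the task keys in BINARY_TASKS are hypothetical, and the sketch uses an exact match on the normalized label instead of a substring test, because a negative label such as "Non-Sarcastic" also contains the string "Sarcastic".</p>
        <preformat><![CDATA[
```python
import math

# Assumption: hypothetical task keys mapped to their positive label strings.
BINARY_TASKS = {"sarcasm": "sarcastic", "vulgar": "vulgar", "abuse": "abusive"}


def encode_binary(task, raw_label):
    """Return 1 only when the normalized label equals the positive indicator of the task."""
    return int(str(raw_label).strip().lower() == BINARY_TASKS[task])


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_binary(logits, threshold=0.5):
    """Apply an independent sigmoid to each task logit and threshold the probability."""
    return {task: int(sigmoid(logit) > threshold) for task, logit in logits.items()}


labels = [encode_binary("sarcasm", "Sarcastic"), encode_binary("sarcasm", "Non-Sarcastic")]
preds = predict_binary({"sarcasm": 2.0, "vulgar": -1.0, "abuse": 0.0})
```
]]></preformat>
        <p>Because each binary task has its own logit, a meme can be predicted as sarcastic, vulgar, and abusive at the same time; the three decisions do not compete as they would under a shared softmax.</p>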
        <p>All models are trained under identical conditions, with a single set of hyperparameters for all experiments,
to ensure a fair comparison across architectures and fusion strategies. The hyperparameter
configuration is shown in Table 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>All experiments are done in PyTorch with HuggingFace Transformers (https://huggingface.co/docs/transformers/index) for text encoders and
timm (https://huggingface.co/timm) for image encoders. The sizes of the Train and Test sets for each language in the HASOC-Meme 2025
shared task are shown in Table 2, and the distributions of labels for each subtask in Hindi, Bodo, Gujarati,
and Bangla are shown in Figures 3, 4, 5, and 6, respectively (the panels of each figure include
(d) Abusive vs. Non-Abusive and (e) Target Labels). The text in these datasets is in native
as well as Roman script.</p>
      <p>The best performances of the proposed models for the four languages in terms of macro F1 scores are
shown in Table 3, and a comparison of the proposed models with the other participants’
models in the HASOC-Meme 2025 shared task for all four languages is shown in Figure 7. The results
indicate that the model performs best on the Gujarati dataset, achieving a macro F1 score of 0.61848
with a rank of 3. This can be attributed to the relatively balanced training data for Gujarati, which
probably enables better learning of both textual and visual features. Performances on the Bangla, Hindi,
and Bodo datasets are moderate, with ranks in the range 11 to 14 and macro F1 scores between 0.52497
and 0.55215. The lower performance on these languages may be due to smaller training datasets,
missing annotations, or linguistic and script complexities that pose challenges for text encoding. Overall,
these results demonstrate that multimodal approaches can effectively handle meme classification across
multiple languages, although dataset size, completeness, and linguistic diversity continue to impact
performance.</p>
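      <p>Since all results are reported as macro F1 scores, a short worked example may help to show why this metric is sensitive to class imbalance. The sketch below computes macro F1 for a single task in plain Python; the function name and the toy labels are illustrative, and how the shared task aggregates scores across the five subtasks is not covered here.</p>
      <preformat><![CDATA[
```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority class.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```
]]></preformat>
      <p>In this toy case the accuracy is 0.8, but the macro F1 is only 4/9 (about 0.44): the majority class reaches an F1 of 8/9 while the ignored minority class contributes 0, and both classes are weighted equally.</p>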
      <p>Beyond the observed class imbalance, several dataset-specific and multimodal challenges contribute
to variations in model performance across languages. In the Bangla dataset, the distribution of Target class
labels exhibits exceptionally high cardinality, which influences the learning dynamics. The presence of
such intricate label patterns makes representation learning more difficult, forcing the model to cope
with imbalanced and overlapping semantic cues. Consequently, the model struggles to maintain
consistent feature alignment across modalities, which reduces its overall macro F1 performance.</p>
      <p>The Bodo dataset faces a unique issue: the presence of a “Non-Vulgar and Vulgar” class in the
Vulgar category. This ambiguous labeling creates inconsistency during training, blurring the model’s
decision boundaries between clearly defined vulgar and non-vulgar content. The presence of such mixed
or mislabeled categories introduces noise, which degrades both convergence stability and classification
accuracy in the Vulgar task.</p>
      <p>A further source of difficulty is the presence of numerous NaN values in the Target
classification task across all languages. Depending on the way such entries are handled (dropped, retained,
or ignored), they can alter the sample distribution, thereby influencing model generalization and stability
during training.</p>
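      <p>The effect of the three handling options on the sample distribution can be illustrated with a small sketch. The interpretation of each option below is an assumption made for illustration only: "drop" removes the whole sample, "ignore" keeps the sample for the other tasks but masks it in the Target loss, and "retain" maps the missing entry to an explicit "None" class. The function names and the toy labels are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def is_missing(label):
    """True for None or a float NaN, two common encodings of an empty annotation."""
    return label is None or (isinstance(label, float) and math.isnan(label))


def apply_strategy(labels, strategy):
    """Return (samples kept for the other tasks, class counts seen by the Target loss)."""
    present = [label for label in labels if not is_missing(label)]
    if strategy == "drop":
        return len(present), Counter(present)
    if strategy == "ignore":
        return len(labels), Counter(present)
    if strategy == "retain":
        filled = ["None" if is_missing(label) else label for label in labels]
        return len(labels), Counter(filled)
    raise ValueError(f"unknown strategy: {strategy}")


targets = ["Religion", float("nan"), "Gender", None, "Religion"]
dropped = apply_strategy(targets, "drop")
ignored = apply_strategy(targets, "ignore")
retained = apply_strategy(targets, "retain")
```
]]></preformat>
      <p>On this toy input, dropping shrinks the training set for every task from five samples to three, ignoring keeps all five samples while leaving the Target class counts unchanged, and retaining introduces a "None" class that is as frequent as the largest real class.</p>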
      <p>Another contributing factor is the potential mismatch between textual and visual representations
across languages due to differences in model architectures and fusion mechanisms. For example,
Indic-BERT and EfficientNet-B4 with attention-based fusion are used for Hindi, whereas XLM-RoBERTa and
ViT-B/32 with concatenation are employed for Gujarati. These combinations may not align multimodal
features equally well for all languages, leading to varying degrees of representational synergy. The
attention mechanism tends to underperform when visual cues are sparse or semantically weak, while
concatenation fusion provides a more robust shared embedding space in certain cases, such as Gujarati.</p>
      <p>Sarcasm detection remains a cross-lingual challenge in this study. Despite moderate balance in the
Sarcasm category, all languages exhibit difficulty in accurately identifying sarcastic expressions. This
stems from the subtle and context-dependent nature of sarcasm, which relies heavily on linguistic
cues and cultural context rather than explicit visual indicators. The visual modality provides limited
information for recognizing sarcasm, making the performance heavily dependent on the model’s ability
to interpret implicit meaning and irony.</p>
      <p>Overall, these analyses reveal that performance disparities are not solely driven by class imbalance.
They also arise from deeper issues such as complex label structures and missing entries in Target,
ambiguous or inconsistent annotations in Vulgar, suboptimal alignment between text and image
encoders, and the inherent linguistic difficulty of sarcasm detection. Addressing these factors through
refined data pre-processing, improved annotation quality, and adaptive multimodal fusion strategies
can lead to more robust and equitable multilingual meme classification performance.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this study, we, team MUCS, presented a multimodal approach for meme classification
across four Indic languages: Bangla, Bodo, Gujarati, and Hindi, as part of the HASOC-Meme 2025
shared task at FIRE 2025. We formulated meme classification as a multi-task learning problem with
two multi-class classification objectives and three binary classification objectives. We further explored
two fusion strategies, concatenation and attention-based fusion, to integrate the features from both
modalities. Our system leveraged both textual and visual modalities to capture the nuanced
semantics of memes. The proposed system achieved competitive results: Rank 3 for Gujarati with a
macro F1-score of 0.61848, Rank 11 for Bangla with a macro F1-score of 0.53785, Rank 12 for Bodo
with a macro F1-score of 0.55215, and Rank 14 for Hindi with a macro F1-score of 0.52497. These
results highlight the model’s adaptability and effectiveness across diverse linguistic and visual
contexts, even in low-resource settings. For future work, we aim to enhance performance through more
sophisticated multimodal fusion strategies and domain-adaptive pretraining. Expanding the training
data with larger multilingual and cross-domain meme datasets, as well as incorporating contrastive
learning and prompt-based fine-tuning techniques, may further improve generalization. Additionally,
we plan to explore multimodal transformers explicitly optimized for low-resource Indic languages to
achieve deeper semantic alignment between text and image modalities.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>We acknowledge the use of generative AI tools in supporting certain aspects of this work, such as
drafting text, formatting code snippets, and organizing content. However, all experimental design, data
processing, model training, and result analysis are conducted by the team. Generative AI is used solely
as an assistive tool, and all scientific conclusions, interpretations, and discussions presented in this
report are our own.
— Shadows Behind the Laughter, in: K. Ghosh, T. Mandl, S. Pal, S. Majumdar, A. Chakraborty
(Eds.), Forum for Information Retrieval Evaluation (Working Notes) (FIRE 2025), December 17–20,
Varanasi, India, CEUR-WS.org, 2025, p. N/A.
[7] K. Ghosh, M. Das, S. Patel, N. Bhandary, A. Das, A. Mukherjee, S. Modha, D. Ganguly, U. Garain,
S. Jaki, T. Mandl, Overview of the HASOC Track at FIRE 2025: Abusive Meme Identification —
Shadows Behind the Laughter, in: FIRE ’25: Proceedings of the 17th Annual Meeting of the Forum
for Information Retrieval Evaluation, December 17–20, Varanasi, India, Association for Computing
Machinery (ACM), New York, NY, USA, 2025, p. N/A.
[8] D. Kakwani, A. Kunchukuttan, S. M. Golla, et al., IndicNLPSuite: Monolingual Corpora, Evaluation
Benchmarks and Pre-trained Multilingual Language Models for Indian Languages, in: Proceedings
of the 12th Language Resources and Evaluation Conference (LREC), 2020, pp. 4940–4951. URL:
https://aclanthology.org/2020.lrec-1.609.
[9] S. Khanuja, S. Dandapat, A. Srinivasan, et al., MuRIL: Multilingual Representations for Indian
Languages, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021,
pp. 2148–2161. doi:10.18653/v1/2021.findings-acl.189.
[10] A. Conneau, K. Khandelwal, N. Goyal, et al., Unsupervised Cross-lingual Representation Learning
at Scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, 2020, pp. 8440–8451. doi:10.18653/v1/2020.acl-main.747.
[11] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
doi:10.1109/CVPR.2016.90.
[12] M. Tan, Q. V. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,
in: Proceedings of the 36th International Conference on Machine Learning (ICML), 2019, pp.
6105–6114. URL: http://proceedings.mlr.press/v97/tan19a.html.
[13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., An Image is Worth 16x16 Words: Transformers for
Image Recognition at Scale, in: International Conference on Learning Representations (ICLR),
2021. URL: https://arxiv.org/abs/2010.11929.
[14] K. Dubey, V. Srivastava, G. Sharma, N. Sharma, D. Sharma, U. Ghosh, O. Alfarraj, A. Tolba,
Multimodal Detection of Offensive Content in Hindi Memes, ACM Transactions on Asian and
Low-Resource Language Information Processing (2025). doi:10.1145/3717611. Dataset of 9,262
Hindi memes; logistic regression multimodal model with 0.81 accuracy.
[15] M. R. Karim, S. K. Dey, T. Islam, B. R. Chakravarthi, Multimodal Hate Speech Detection from
Bengali Memes and Texts, 2022. URL: https://arxiv.org/abs/2204.10196. Bengali multimodal dataset;
best fusion XLM-R + DenseNet-161 F1 = 0.83.
[16] E. Hossain, O. Sharif, M. M. Hoque, MUTE: A Multimodal Dataset for Detecting Hateful Memes,
in: AACL-IJCNLP Student Research Workshop, 2022. 4,158 Bengali and code-mixed memes;
multimodal improves 3%.
[17] R. S. Debnath, N. B. Firuj, A. W. Shakib, S. Sultana, M. S. Islam, ExMute: A Context-Enriched
Multimodal Dataset for Hateful Memes, in: First Workshop on NLP for Indo-Aryan and Dravidian
Languages (IndoNLP), 2025. Context-enriched Bengali hateful meme dataset; multimodal
64% accuracy.
[18] K. Rajput, R. Kapoor, K. Rai, P. Kaur, Hate Me Not: Detecting Hate Inducing Memes in Code
Switched Languages, 2022. URL: https://arxiv.org/abs/2204.11356. iPM dataset of Indian political
memes; CNN + LSTM model.
[19] D. P. Manukonda, R. G. Kodali, Multimodal Misogyny Meme Detection in Low-Resource Dravidian
Languages Using Transliteration-Aware XLM-RoBERTa, ResNet-50, and Attention-BiLSTM, in:
DravidianLangTech at NAACL 2025, 2025. Macro-F: 0.8805 (Malayalam), 0.8081 (Tamil).
[20] S. G.-J. Wong, M. Durward, cantnlp@LT-EDI-2024: Automatic Detection of Anti-LGBTQ+ Hate
Speech in Under-resourced Languages, arXiv preprint (2024). URL: https://arxiv.org/abs/2401.15777.
Transformer system; ranked second in Gujarati and Telugu.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Habermas</surname>
          </string-name>
          ,
          <source>The Theory of Communicative Action: Reason and the Rationalization of Society</source>
          , volume
          <volume>1</volume>
          ,
          <publisher-name>Beacon Press</publisher-name>
          ,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <article-title>A Survey on Hate Speech Detection Using Natural Language Processing</article-title>
          ,
          <source>in: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, Association for Computational Linguistics</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.18653/v1/W17-1101</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <article-title>A Survey on Automatic Detection of Hate Speech in Text</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>51</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3232676</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Firooz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ringshia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Testuggine</surname>
          </string-name>
          ,
          <article-title>The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>9448</fpage>
          -
          <lpage>9459</lpage>
          . URL:
          <uri>https://arxiv.org/abs/2005.04790</uri>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pramanick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <article-title>MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP 2021</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4439</fpage>
          -
          <lpage>4455</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.18653/v1/2021.findings-emnlp.379</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Narzary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Garain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC Track at FIRE 2025: Abusive Meme Identification</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>