<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>K. Ghosh, A. Senapati, Hate speech detection in low-resourced indian languages: An analysis of
transformer-based monolingual and multilingual models with cross-lingual experiments, Natural
Language Processing</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3368567.3368584</article-id>
      <title-group>
        <article-title>Offensive Content Identification in Memes in Bangla, Hindi, and Bodo</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tanmoy Paul</string-name>
          <email>paultanmoy932@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anupam Jamatia</string-name>
          <email>anupamjamatia@nita.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Meme Classification, Offensive Content Identification, Hate Speech Detection, Multimodal Analysis, Multi-task</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Defence Institute of Advanced Technology</institution>
          ,
          <addr-line>Maharashtra</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Forum for Information Retrieval Evaluation</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Learning</institution>
          ,
          <addr-line>Indian Languages, Bangla, Hindi, Bodo</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>National Institute of Technology Agartala</institution>
          ,
          <addr-line>Tripura</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>31</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>This paper presents the system developed by the Golden Ratio team for the HASOC 2025 shared task on identifying hate speech and offensive content in multimodal memes across Bangla, Hindi, and Bodo. The task addresses the complex challenge of content moderation in diverse Indian languages by classifying memes for sentiment, sarcasm, vulgarity, and abuse. To tackle this, we propose a multimodal, multi-task framework that employs a language-specific modeling approach. Acknowledging the linguistic diversity, our system integrates the CLIP vision encoder with tailored Transformer-based language models: MuRIL for Bangla and XLM-RoBERTa for Hindi and Bodo. Textual and visual features are fused using a cross-attention mechanism to enhance classification accuracy. Evaluated on the HASOC 2025 multilingual dataset, our approach achieves macro F1-scores of 0.615 for Bangla, 0.590 for Hindi, and 0.562 for Bodo. These scores earned our model rankings of 3rd for Bangla, 4th for Hindi, and 10th for Bodo among all submitted systems. The results underscore the efficacy of our language-specific strategy and establish benchmarks for multimodal hate speech detection in these under-resourced Indian languages. Our findings highlight the potential of tailored multimodal frameworks to advance content moderation in linguistically diverse contexts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Disclaimer</title>
      <p>This research includes examples of memes containing hate speech, offensive language, sarcasm, or culturally
sensitive content in Hindi, Bangla, and Bodo. These examples are presented solely for academic and analytical
purposes within the context of the HASOC 2025 shared task on harmful content detection in multilingual,
multimodal settings. The inclusion of such material does not reflect the personal beliefs, values, or endorsements
of the authors or their affiliated institutions. All examples are analyzed objectively to support the scientific
objectives of evaluating the proposed framework’s performance in detecting harmful content. Reader discretion
is advised.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>India’s digital landscape has expanded rapidly, with 806 million active internet users as of February
2025, over half of whom actively engage on social media platforms<sup>1</sup>. Predominantly driven by mobile
internet, this dynamic online ecosystem is characterized by a strong preference for regional languages.
A foundational report by KPMG and Google<sup>2</sup> noted that the majority of Indian internet users prefer
content in their native vernacular, with Hindi and Bangla alone accounting for over 300 million users as
early as 2016. Within this context, multimodal content, particularly memes, has emerged as a dominant
medium for communication. While memes are often humorous, they are increasingly exploited to
propagate hate speech and offensive content, posing significant challenges for content moderation.
Conventional moderation tools, primarily designed for English, struggle to capture the cultural and
linguistic nuances embedded in harmful memes in languages such as Hindi, Bangla, and the low-resource
language Bodo.</p>
      <p>https://github.com/Tanmoy12paul (T. Paul). CEUR Workshop Proceedings, ISSN 1613-0073.</p>
      <p>
        Existing research on hate speech detection has largely focused on English-language data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], leaving
critical gaps in addressing non-English, multimodal content. Models trained on English data often fail
to interpret culture-specific metaphors, idioms, and visual cues essential for accurate classification in
languages like Bangla and Hindi. These challenges are amplified for low-resource languages like Bodo,
where the scarcity of annotated datasets hinders model development [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To address these limitations,
this paper presents a language-specific, multimodal framework as part of the HASOC 2025 shared task
on hate speech identification. Our approach integrates Transformer-based language models tailored for
Bangla, Hindi, and Bodo with a CLIP-based vision encoder, employing a cross-attention mechanism to
fuse textual and visual features. Trained in a multi-task learning paradigm, our system concurrently
detects sentiment, sarcasm, vulgarity, and abuse.
      </p>
      <p>
        Advancements in harmful content detection have followed two primary trajectories: a shift from
unimodal (text-only) to multimodal (text and image) architectures and an expansion from English-centric
to multilingual frameworks. Early multimodal systems combined text embeddings from BERT with
image features from CNNs like ResNet via simple concatenation. A notable advancement came with
co-attentional models, such as LXMERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which introduced dual-stream Transformer architectures
enabling dynamic interaction between image and text features. This cross-attention approach directly
informs our fusion mechanism. Recent studies on Indian languages, such as Karim et al. (2022) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on
Bengali memes, demonstrated the efficacy of pairing Bangla-BERT with a CNN-based vision model
(VGG-19). Similarly, Kumari and Bhattacharya (2022) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] advanced Tamil meme classification by
combining IndicBERT with the CLIP vision encoder, highlighting the benefits of language-supervised
vision models for culturally specific tasks. However, low-resource languages like Bodo remain largely
unaddressed, exacerbating issues of algorithmic fairness and online safety [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Our research addresses these gaps by adapting a multimodal, cross-attention-based architecture for
multilingual Indian memes. We aim to conduct a systematic analysis across Hindi and Bangla while
establishing the first empirical benchmark for Bodo, a critically under-researched language.</p>
      <p>The paper is organized as follows: Section 2 describes the HASOC 2025 dataset, detailing its structure
and preprocessing pipeline. Section 3 presents the proposed framework, including the language-specific
text encoders, vision encoder, cross-attention fusion mechanism, and multi-task loss function. Section 4
outlines the experimental design, covering implementation details, hyperparameters, and evaluation
protocols. Section 5 provides a detailed analysis of the results, including task-level performance,
model selection, and competitive rankings. Section 6 examines the framework’s limitations through
a qualitative error analysis, highlighting challenges in detecting nuanced and culturally contextual
content. Section 7 summarizes the contributions and proposes directions for future research.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Dataset</title>
      <p>
        The experiments in this study use the official dataset from the HASOC 2025 shared task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], organized
by the Forum for Information Retrieval Evaluation (FIRE). This multimodal dataset comprises memes
sourced from social media, with parallel data provided for three Indian languages: Hindi, Bangla, and
Bodo. Each data instance pairs an image with its corresponding text, annotated for four classification
tasks: a multi-class sentiment analysis (Positive, Negative, Neutral) and three binary classifications
(sarcasm, vulgarity, and abuse). To ensure fair and reproducible comparisons, we adhere strictly to
the official training, validation, and test splits provided by the task organizers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Table 1 summarizes
the number of samples in the training, validation, and test sets for each language. Bangla has the
largest dataset, with 2,154 training, 539 validation, and 1,821 test samples. Hindi has 912 training,
229 validation, and 769 test samples. Bodo, a low-resource language, has far fewer: 302 training,
76 validation, and 254 test samples, which highlights the challenge of data scarcity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The splits are
consistent across languages (approximately 48% training, 12% validation, and 40% testing in each case).
The proposed framework addresses these varying resource constraints through language-specific text
encoders and a cross-attention mechanism that fuses the text and image modalities.
      </p>
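      <p>The split proportions quoted above can be checked directly from the reported sample counts. The short sketch below is only a sanity check; the dictionary and function names are illustrative and are not part of the shared-task tooling.</p>
      <preformat><![CDATA[
```python
# Sample counts per language as reported in the text: (train, validation, test).
SPLITS = {
    "Bangla": (2154, 539, 1821),
    "Hindi": (912, 229, 769),
    "Bodo": (302, 76, 254),
}


def split_percentages(counts):
    """Return each split's share of the language total, in percent."""
    total = sum(counts)
    return tuple(round(100.0 * n / total, 1) for n in counts)


for language, counts in SPLITS.items():
    train, val, test = split_percentages(counts)
    print(f"{language}: train {train}%, val {val}%, test {test}%")
```
]]></preformat>
      <p>For all three languages the shares come out within a fraction of a percentage point of the 48/12/40 figures stated in the text.</p>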
      <p>Table 2 presents three representative examples from the HASOC 2025 dataset, illustrating the
multimodal nature of the memes and their annotations for sentiment, sarcasm, vulgarity, abuse, and target
categories, alongside the extracted OCR text. Each entry includes the meme’s identifier, a corresponding
image, and the associated labels, providing insights into the dataset’s diversity and the challenges
of harmful content detection across Hindi, Bangla, and Bodo. The first example, a Bengali meme
(image_ben_9.jpg), conveys a negative sentiment with sarcastic undertones, targeting a political
context without vulgarity or abuse. Its OCR text, which includes a rhetorical question and a dismissive tone,
highlights the challenge of detecting sarcasm through nuanced text-image interactions. The second
example, a Hindi meme (Hindi_image_1.jpg), exhibits a positive sentiment but is labeled as sarcastic
and abusive, targeting gender. The OCR text, featuring a provocative dialogue with an abusive term
(‘kamina’), underscores the complexity of identifying harmful content in seemingly positive contexts.
The third example, a Bodo meme (image_bodo_143.jpg), carries a negative sentiment with sarcasm,
targeting gender without vulgarity or abuse. The informal and culturally specific Bodo text requires
deep contextual and linguistic knowledge to interpret correctly.</p>
      <p>To prepare the dataset for our neural architecture, we implement a standardized preprocessing
pipeline for both text and image modalities. For text preprocessing, we clean the noisy text obtained
from the ‘OCR’ column of the training dataset. We address this through a three-step process: (1)
normalizing Unicode characters and script-specific punctuation; (2) removing excessive repeating
characters and irrelevant symbols; and (3) replacing missing text with a [NO_TEXT] token. The cleaned
text is then tokenized using the language-specific tokenizer corresponding to the selected language
model.</p>
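      <p>A minimal sketch of the three cleaning steps is given below. It is an illustration under stated assumptions rather than the exact pipeline: the use of NFC normalization, the mapping of the danda to a full stop, the cap of two repeated characters, and the set of retained punctuation are all hypothetical choices, as are the function and constant names.</p>
      <preformat><![CDATA[
```python
import re
import unicodedata

NO_TEXT_TOKEN = "[NO_TEXT]"
MAX_REPEAT = 2  # assumed cap on consecutive identical characters
KEEP_PUNCT = set(".,!?'\"-")  # assumed set of punctuation worth keeping
JOINERS = {"\u200c", "\u200d"}  # ZWNJ/ZWJ, which matter for Indic conjuncts

# Matches any character repeated more than MAX_REPEAT times in a row.
REPEAT_RE = re.compile(r"(.)\1{%d,}" % MAX_REPEAT)


def _is_kept(ch):
    """Keep whitespace, letters, combining marks, digits, and basic punctuation."""
    if ch.isspace() or ch in KEEP_PUNCT or ch in JOINERS:
        return True
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def clean_ocr_text(text):
    """Clean one OCR string following the three steps described above."""
    # (3) Missing or empty OCR output is replaced with a placeholder token.
    if text is None or not str(text).strip():
        return NO_TEXT_TOKEN
    # (1) Normalize Unicode and script-specific punctuation (danda -> full stop).
    text = unicodedata.normalize("NFC", str(text))
    text = text.replace("\u0964", ".")
    # (2) Collapse long character runs, then drop irrelevant symbols.
    text = REPEAT_RE.sub(lambda m: m.group(1) * MAX_REPEAT, text)
    text = "".join(ch for ch in text if _is_kept(ch))
    text = re.sub(r"\s+", " ", text).strip()
    return text or NO_TEXT_TOKEN


print(clean_ocr_text("sooooo   funny!!!! ###"))
```
]]></preformat>
      <p>The cleaned string would then be passed to the tokenizer of the selected language model.</p>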
      <p>For image preprocessing, all images are resized to a uniform resolution of 224×224 pixels.
Subsequently, we apply normalization by subtracting the channel-wise mean and dividing by the standard
deviation of the ImageNet dataset. This ensures consistent scale and distribution of image data, which
is essential for stabilizing the vision model’s performance.</p>
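      <p>The normalization step can be sketched with NumPy as follows. The mean and standard deviation are the commonly used ImageNet channel statistics; the function name, the scaling of pixel values to the range 0 to 1, and the channels-first output layout are assumptions made for illustration, and the image is assumed to have been resized to 224×224 beforehand.</p>
      <preformat><![CDATA[
```python
import numpy as np

# Commonly used ImageNet channel statistics (RGB order, pixel values in [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_image(pixels):
    """Normalize an already-resized RGB image given as a uint8 array (224, 224, 3)."""
    if pixels.shape != (224, 224, 3):
        raise ValueError("expected an image resized to 224x224 with 3 channels")
    scaled = pixels.astype(np.float32) / 255.0
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD
    # Return channels-first (3, 224, 224), the layout most vision encoders expect.
    return np.transpose(normalized, (2, 0, 1))


white = np.full((224, 224, 3), 255, dtype=np.uint8)
print(normalize_image(white).shape)
```
]]></preformat>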
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>This section describes our proposed framework for multimodal, multi-task classification of harmful
memes in the HASOC 2025 shared task. The framework processes text and image modalities in parallel,
integrating them through a cross-attention mechanism to capture their interactions, which are critical
for identifying hate speech, sentiment, sarcasm, vulgarity, and abuse in Hindi, Bangla, and Bodo memes.
The architecture, depicted in Figure 1, employs language-specific text encoders, a vision encoder, a
fusion mechanism, and task-specific classification heads, optimized using a weighted multi-task loss
function to address class imbalance.</p>
      <p>
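        To account for the linguistic diversity of the target languages, we use pre-trained Transformer models
tailored to each language. For an input text <inline-formula><tex-math>T</tex-math></inline-formula>, the encoding process is:
<disp-formula id="eq-1"><label>(1)</label><tex-math>h_{\text{text}} = \mathrm{Transformer}_{\text{lang}}(\mathrm{Tokenizer}(T))</tex-math></disp-formula>
where <inline-formula><tex-math>h_{\text{text}} \in \mathbb{R}^{d_{\text{text}}}</tex-math></inline-formula> is the [CLS] token’s hidden state, capturing the text’s semantic content. The selected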
models are: for Bangla, google/muril-base-cased [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], xlm-roberta-base [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and
sagorsarker/bangla-bert-base [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; for Hindi, ai4bharat/indic-bert [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], google/muril-base-cased [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and xlm-roberta-base [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; and
for Bodo, xlm-roberta-base [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This approach ensures that culture-specific linguistic nuances, such as
idioms or sarcasm, are effectively captured, which is essential for detecting harmful content in memes.
      </p>
      <p>
        For the visual modality, we employ the pre-trained CLIP Vision Transformer (ViT) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. A normalized
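input image <inline-formula><tex-math>I_{\text{norm}}</tex-math></inline-formula> is transformed into:
<disp-formula id="eq-2"><label>(2)</label><tex-math>h_{\text{image}} = \mathrm{CLIP\text{-}ViT}(I_{\text{norm}})</tex-math></disp-formula>
where <inline-formula><tex-math>h_{\text{image}} \in \mathbb{R}^{d_{\text{image}}}</tex-math></inline-formula> is the pooler_output, encoding the semantic content of the image. This is
critical for memes, where visual elements (e.g., symbols or imagery) often convey contextual meaning
tied to harmful content.
      </p>
      <p>
        To integrate text and image features, we use a cross-attention mechanism, as simple concatenation
may overlook nuanced interactions. Following the scaled dot-product attention framework [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we compute:
<disp-formula id="eq-3"><label>(3)</label><tex-math>\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V</tex-math></disp-formula>
Here, <inline-formula><tex-math>h_{\text{text}}</tex-math></inline-formula> serves as the Query (<inline-formula><tex-math>Q</tex-math></inline-formula>), and <inline-formula><tex-math>h_{\text{image}}</tex-math></inline-formula> provides the Key (<inline-formula><tex-math>K</tex-math></inline-formula>) and Value (<inline-formula><tex-math>V</tex-math></inline-formula>), with <inline-formula><tex-math>d_k</tex-math></inline-formula> the key dimension. This allows the text
representation to be dynamically re-weighted based on visual context, capturing interactions critical
for meme classification (e.g., sarcastic text paired with specific imagery). The attention-infused text
vector is concatenated with <inline-formula><tex-math>h_{\text{image}}</tex-math></inline-formula> and passed through a feed-forward network to produce a fused
representation, <inline-formula><tex-math>h_{\text{fused}}</tex-math></inline-formula>, integrating both modalities.</p>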
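      <p>The scaled dot-product attention described above can be illustrated with a small NumPy sketch. It is not the trained fusion module: the learned projection matrices are omitted, the feature dimension of 8 is arbitrary, and the five image token vectors used as keys and values are a hypothetical input chosen so that the attention weights are non-trivial.</p>
      <preformat><![CDATA[
```python
import numpy as np


def scaled_dot_product_attention(query, key, value):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return (output, weights).

    Shapes: query (n_q, d_k), key (n_kv, d_k), value (n_kv, d_v).
    """
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights


rng = np.random.default_rng(0)
h_text = rng.normal(size=(1, 8))    # one text vector acting as the query
h_image = rng.normal(size=(5, 8))   # hypothetical image tokens (keys and values)

attended, weights = scaled_dot_product_attention(h_text, h_image, h_image)
print(attended.shape, weights.shape)
```
]]></preformat>
      <p>Each row of the weight matrix sums to one, so the output is a convex combination of the value vectors, weighted by how strongly the text query matches each key.</p>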
      <p>The fused representation  fused is fed into four independent classification heads for sentiment, sarcasm,
vulgarity, and abuse. Each head, a feed-forward network, outputs logits projected into the target label
space (3 dimensions for sentiment; 2 for binary tasks). A Softmax function is applied to the sentiment
head to yield probabilities over Positive, Negative, and Neutral classes, while a Sigmoid function is
used for binary tasks to produce probabilities for the positive class.</p>
      <p>The model is trained using a composite multi-task loss function to address class imbalance. The total
loss is:</p>
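      <p><disp-formula id="eq-4"><label>(4)</label><tex-math>\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{sent}} + \lambda_2 \mathcal{L}_{\text{sarc}} + \lambda_3 \mathcal{L}_{\text{vulg}} + \lambda_4 \mathcal{L}_{\text{abus}}</tex-math></disp-formula>
where <inline-formula><tex-math>\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = 1</tex-math></inline-formula>. For the sentiment task, a weighted Cross-Entropy Loss is used:
<disp-formula id="eq-5"><label>(5)</label><tex-math>\mathcal{L}_{\text{sent}} = -\sum_{c=1}^{C} w_c \cdot y_c \cdot \log(\hat{y}_c)</tex-math></disp-formula>
with class weights:
<disp-formula id="eq-6"><label>(6)</label><tex-math>w_c = \frac{N}{C \times N_c}</tex-math></disp-formula>
where <inline-formula><tex-math>N</tex-math></inline-formula> is the total number of training samples, <inline-formula><tex-math>C = 3</tex-math></inline-formula>, and <inline-formula><tex-math>N_c</tex-math></inline-formula> is the number of samples for class <inline-formula><tex-math>c</tex-math></inline-formula>;
<inline-formula><tex-math>y_c</tex-math></inline-formula> is the one-hot ground-truth label and <inline-formula><tex-math>\hat{y}_c</tex-math></inline-formula> the predicted probability for class <inline-formula><tex-math>c</tex-math></inline-formula>.
This weights minority classes (e.g., Negative sentiment) higher, helping the model detect critical
sentiments in harmful memes.</p>
      <p>For binary tasks, we use Binary Cross-Entropy with Logits Loss, incorporating a positive weight:
<disp-formula id="eq-7"><label>(7)</label><tex-math>\mathrm{pos\_weight} = \frac{N_{\text{neg}}}{N_{\text{pos}}}</tex-math></disp-formula>
The loss for a binary task is:
<disp-formula id="eq-8"><label>(8)</label><tex-math>\mathcal{L}_{\text{binary}} = -\left[\mathrm{pos\_weight} \cdot y \cdot \log(\sigma(z)) + (1 - y) \cdot \log(1 - \sigma(z))\right]</tex-math></disp-formula>
where <inline-formula><tex-math>z</tex-math></inline-formula> is the logit, <inline-formula><tex-math>y \in \{0, 1\}</tex-math></inline-formula>, and <inline-formula><tex-math>\sigma(\cdot)</tex-math></inline-formula> is the sigmoid function. The pos_weight increases penalties for
misclassifying positive instances (e.g., abusive content), enhancing the model’s ability to detect rare but
critical harmful content in multilingual memes.</p>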
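      <p>The weighting scheme can be made concrete with a small pure-Python sketch. The class counts below are hypothetical and chosen only to show the effect of the weights; the function names are illustrative, and the binary loss is written in a numerically stable form that is algebraically equivalent to the expression given above.</p>
      <preformat><![CDATA[
```python
import math


def class_weights(class_counts):
    """Balanced class weights: w_c = N / (C * N_c)."""
    total = sum(class_counts)
    n_classes = len(class_counts)
    return [total / (n_classes * n) for n in class_counts]


def pos_weight(n_neg, n_pos):
    """Positive-class weight for a binary task: N_neg / N_pos."""
    return n_neg / n_pos


def weighted_cross_entropy(probs, target, weights):
    """Weighted cross-entropy for one sample with a one-hot target class."""
    return -weights[target] * math.log(probs[target])


def _softplus(x):
    # Stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit, label, pw):
    """-[pw * y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]."""
    log_sig = -_softplus(-logit)       # log(sigmoid(z))
    log_one_minus = -_softplus(logit)  # log(1 - sigmoid(z))
    return -(pw * label * log_sig + (1 - label) * log_one_minus)


# Hypothetical counts: 600 / 300 / 100 samples across three sentiment classes.
weights = class_weights([600, 300, 100])
pw = pos_weight(n_neg=900, n_pos=100)
print([round(w, 3) for w in weights], pw)
print(round(weighted_bce_with_logits(0.0, 1, pw), 4))
```
]]></preformat>
      <p>With these counts the rarest class receives a weight of about 3.33 and the most frequent one about 0.56, and a missed positive in the binary task costs nine times as much as a missed negative at the same logit.</p>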
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>This section details the experimental design for evaluating our proposed framework for multimodal,
multi-task classification of harmful memes in the HASOC 2025 shared task. The description covers
implementation details, hyperparameter settings, training procedures, and evaluation protocols, ensuring
reproducibility and transparency of our methodology.</p>
      <p>
        All experiments were conducted in the Google Colab cloud environment, utilizing the PyTorch
framework [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and the Hugging Face Transformers library [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for implementing pre-trained models.
Depending on availability, training and experimentation leveraged hardware accelerators, specifically
NVIDIA A100 GPUs or Google Tensor Processing Units (TPU v4). This setup provided the computational
resources necessary for efficient model training and evaluation across the Hindi, Bangla, and Bodo
datasets.
      </p>
      <p>
        For training, we configured each language-specific model to run for a maximum of 20 epochs,
incorporating an early stopping mechanism to halt training if the validation Macro F1-score did not improve for
five consecutive epochs, preserving the best-performing model checkpoint. The AdamW optimizer [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
was employed with a weight decay of 0.01 to regularize the model. To mitigate catastrophic forgetting
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the first four layers of both text and vision encoders were frozen during fine-tuning. A linear
learning rate warm-up was applied for the first 100 steps, followed by a ReduceLROnPlateau scheduler
that adjusted the learning rate based on validation performance. To accommodate computational
constraints, gradient accumulation was used over two steps, achieving an effective batch size of 24. The
key hyperparameters are summarized in Table 3.
      </p>
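      <p>The early-stopping rule and the batch-size arithmetic can be sketched as follows. The class name, the validation scores, and the per-step batch size of 12 (inferred from an effective batch size of 24 over two accumulation steps) are assumptions made for illustration; checkpoint saving is indicated only by recording the best epoch.</p>
      <preformat><![CDATA[
```python
class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, score):
        """Record one validation score; return True when training should stop."""
        if score > self.best_score:
            # A real training loop would save the model checkpoint here.
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


MAX_EPOCHS = 20
ACCUMULATION_STEPS = 2
PER_STEP_BATCH = 12  # assumed, so that 12 * 2 gives the effective batch size of 24
EFFECTIVE_BATCH = PER_STEP_BATCH * ACCUMULATION_STEPS

# Hypothetical validation Macro F1-scores, one per epoch.
val_scores = [0.50, 0.55, 0.58, 0.57, 0.58, 0.56, 0.57, 0.55, 0.54, 0.60]

stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, score in enumerate(val_scores[:MAX_EPOCHS], start=1):
    if stopper.update(epoch, score):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_score)
```
]]></preformat>
      <p>In this example training halts after epoch 8, five epochs after the last improvement at epoch 3, so the later score of 0.60 is never reached. This is the usual trade-off of a fixed patience window.</p>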
      <p>
        To ensure robust and unbiased evaluation, we adopted a 5-fold stratified cross-validation strategy on
the training data [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Stratification preserved the original class distribution across folds, addressing
the inherent class imbalance in the dataset. The reported results represent the average Macro F1-scores
across the five folds on the official held-out test set, as this metric is the primary evaluation criterion
for the shared task. The Macro F1-score was chosen for its robustness to class imbalance, offering a
reliable measure of model performance compared to standard accuracy [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Additional metrics were
computed where relevant to provide a comprehensive analysis of the model’s efectiveness across the
sentiment, sarcasm, vulgarity, and abuse classification tasks.
      </p>
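      <p>To illustrate why Macro F1 is preferred over accuracy under class imbalance, the following pure-Python sketch computes the metric from scratch. The labels are a hypothetical toy example and the function name is illustrative.</p>
      <preformat><![CDATA[
```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy imbalanced task: 8 negatives, 2 positives, and a model that always says 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(macro_f1(y_true, y_pred), 4))
```
]]></preformat>
      <p>The majority-class predictor reaches 80% accuracy but a Macro F1 of only about 0.44, because the F1 of the ignored minority class is zero and counts equally in the average.</p>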
    </sec>
    <sec id="sec-6">
      <title>5. Results and Analysis</title>
      <p>This section provides a detailed analysis of the empirical results obtained from evaluating the proposed
framework on the official HASOC 2025 shared task test set, focusing on multimodal classification
of harmful memes in Hindi, Bangla, and Bodo. The analysis encompasses a task-level performance
breakdown across sentiment, sarcasm, vulgarity, and abuse, a comparative study to justify the selection
of text encoders, and a comparison of the proposed framework’s performance against top-performing
systems in the shared task, highlighting both strengths and areas for improvement.</p>
      <p>The per-task performance, derived from 5-fold cross-validation on the validation set, is presented in
Table 4, reporting Accuracy and Macro F1-scores for each classification task, with the highest scores
per language highlighted in bold. For Hindi, the xlm-roberta-base model excelled in sarcasm (0.536
F1), vulgarity (0.732 F1), and abuse (0.683 F1), while google/muril-base-cased led in sentiment (0.473
F1). In Bangla, sagorsarker/bangla-bert-base achieved the highest scores across all tasks, with
0.555 F1 for sentiment, 0.670 F1 for sarcasm, 0.654 F1 for vulgarity, and 0.664 F1 for abuse. For Bodo,
xlm-roberta-base, the only encoder evaluated, obtained F1-scores ranging from 0.636 (sarcasm) to 0.707
(vulgarity). A consistent pattern across all languages is the stronger performance on explicit tasks
(vulgarity and abuse) compared to implicit tasks (sentiment and sarcasm). For example, in Bangla, the
google/muril-base-cased model scored 0.638 F1 on vulgarity but only 0.542 F1 on sentiment, and in
Hindi, it achieved 0.730 F1 on vulgarity versus 0.473 F1 on sentiment. Similarly, in Bodo,
xlm-roberta-base scored 0.701 F1 on abuse but 0.636 F1 on sarcasm. This performance gap underscores the challenge
of capturing nuanced, context-dependent content, such as sarcasm, which requires deeper semantic
and cultural understanding.</p>
      <p>A comparative study, summarized in Table 5, was conducted to select the optimal text encoder for
each language based on the Macro F1-score on the test set. For Bangla, google/muril-base-cased
achieved the highest score of 0.615, outperforming sagorsarker/bangla-bert-base (0.607) and
xlm-roberta-base (0.593). For Hindi, xlm-roberta-base led with a score of 0.590, slightly surpassing
ai4bharat/indic-bert (0.587) and google/muril-base-cased (0.583). For Bodo, xlm-roberta-base
was the sole model evaluated, yielding a score of 0.562. These results highlight the efficacy of a
language-specific approach, as no single text encoder consistently outperformed others across all languages,
reflecting the diverse linguistic and cultural characteristics of Hindi, Bangla, and Bodo.</p>
      <p>The final performance of the proposed framework on the official test set is presented in Table 6,
comparing its Macro F1-scores against the top-performing system in the HASOC 2025 shared task.
For Bangla, the proposed framework, combining clip-ViT-B-32 with google/muril-base-cased,
achieved a Macro F1-score of 0.615, securing 3rd place, closely trailing the top score of 0.627. In Hindi,
the framework, utilizing clip-ViT-B-32 with xlm-roberta-base, scored 0.590, ranking 4th against the
top score of 0.657. For Bodo, the framework, also based on clip-ViT-B-32 with xlm-roberta-base,
established a baseline score of 0.562, placing 10th compared to the top score of 0.631. These results
demonstrate the proposed framework’s competitive performance, particularly in Bangla, where it closely
approached the top system, and its contribution to setting an initial benchmark for the low-resource
Bodo language.</p>
      <p>Analysis of the results reveals key insights into the proposed framework’s performance. The
consistent performance gap between explicit tasks (vulgarity and abuse) and implicit tasks (sentiment
and sarcasm) indicates that the framework struggles with content requiring nuanced semantic and
contextual interpretation. For instance, sarcasm detection in Bangla and Hindi often failed when memes
relied on culture-specific references or subtle text-image interactions, suggesting that advanced
cross-modal fusion techniques or external knowledge integration could enhance performance. The lower
performance in Bodo, despite using a robust model like xlm-roberta-base, is likely attributable to the
limited training data, which constrained the framework’s ability to generalize effectively. These findings
highlight the need for future work to focus on improving cross-modal interactions and addressing data
scarcity, particularly for low-resource languages like Bodo, to enhance the framework’s ability to detect
harmful content across diverse linguistic contexts.</p>
      <p>
        To validate the significance of performance differences, we conducted Wilcoxon Signed-Rank Tests on
the Macro F1-scores from 5-fold cross-validation [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. For Bangla, the null hypothesis of no difference
between google/muril-base-cased (0.615) and sagorsarker/bangla-bert-base (0.607) was tested,
yielding a p-value of 0.03, which rejects the null hypothesis at α = 0.05 in favor of the former. For Hindi, the
difference between xlm-roberta-base (0.590) and ai4bharat/indic-bert (0.587) was not significant
(p = 0.12). A bootstrap test comparing the proposed framework’s Bangla score (0.615) against the
top-performing system’s score (0.627) resulted in a p-value of 0.08, suggesting no significant difference
at α = 0.05. These tests reinforce the efficacy of the language-specific approach while highlighting the
competitiveness of the proposed framework.
      </p>
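      <p>For readers unfamiliar with the test, the sketch below computes an exact one-sided Wilcoxon signed-rank p-value by enumerating all sign patterns. The per-fold scores are hypothetical, since fold-level values are not listed here, and the function name is illustrative. With five paired folds there are only 2^5 = 32 sign patterns, so the smallest attainable exact one-sided p-value is 1/32, or about 0.031.</p>
      <preformat><![CDATA[
```python
from itertools import product


def wilcoxon_exact_one_sided(scores_a, scores_b):
    """Exact one-sided signed-rank test of H1: A tends to score higher than B.

    Returns (W_plus, p_value). Zero differences are dropped and tied absolute
    differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 every sign pattern is equally likely: count those at least as extreme.
    extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, extreme / 2 ** n


# Hypothetical per-fold Macro F1-scores for two encoders (five folds each).
model_a = [0.62, 0.61, 0.63, 0.60, 0.615]
model_b = [0.61, 0.59, 0.60, 0.56, 0.610]

w_plus, p_value = wilcoxon_exact_one_sided(model_a, model_b)
print(w_plus, p_value)
```
]]></preformat>
      <p>Here model A wins on every fold, which gives the maximum statistic of 15 and the minimum one-sided p-value of 1/32. The corresponding two-sided value would be 1/16.</p>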
    </sec>
    <sec id="sec-7">
      <title>6. Error Analysis and Discussion</title>
      <p>This section examines the limitations of the proposed framework for multimodal harmful content
detection in Hindi, Bangla, and Bodo memes, drawing on a qualitative error analysis to elucidate its
failure modes. By analyzing misclassified samples, we identify key challenges in detecting nuanced
and culturally contextual content, offering insights into the framework’s performance and potential
avenues for improvement. The discussion avoids reiterating quantitative results, focusing instead on
the qualitative factors underlying errors and their implications for future research.</p>
      <p>
        The proposed framework’s performance highlights the challenges of detecting sarcasm and irony,
particularly when memes rely on contradictory text-image interactions. For instance, a Hindi meme
with the text “Therapist ne dukh sun kar fees wapis kar di” (translated: “The therapist heard my sorrow
and returned the fees”) paired with an image of a chihuahua in a light blue hoodie, appearing sad and
despondent, was frequently misclassified. The exaggerated scenario, where a therapist refunds fees due
to overwhelming sadness, represents dark internet humor. The framework struggled to interpret this
sarcasm, as the interplay between the negative text and the exaggeratedly sad image requires nuanced
understanding beyond literal content. This difficulty aligns with prior research indicating that sarcasm
remains a significant challenge for computational models [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], underscoring the need for enhanced
cross-modal reasoning to capture such subtleties.
      </p>
      <p>
        Another critical limitation is the framework’s difficulty in interpreting memes requiring deep cultural
or contextual knowledge. A Bodo meme with the text “Boba bobi jaflananwi ma dithniw nagirdwmg
bswr.....Angha ese sondeh dglwi bswrkhou” (translated: “What are they looking for while wandering
around? I’m a little suspicious of them”) and an image of two individuals in traditional Bodo attire was
often misclassified. The informal language, combined with the cultural significance of the attire and
the subtle, potentially judgmental tone, posed a multifaceted challenge. Without extensive cultural
training, the framework failed to infer the underlying social implications, a limitation consistent with
findings that models often lack the localized grounding needed for culturally nuanced content [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
This issue is particularly pronounced in low-resource languages like Bodo, where limited training data
exacerbates the challenge.
      </p>
      <p>
        These error patterns suggest that while the proposed framework effectively handles explicit content,
its performance on implicit and culturally dependent tasks is constrained by the complexity of
text-image interactions and the lack of cultural context. Future improvements could focus on integrating
external knowledge sources, such as cultural or linguistic knowledge graphs [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], to enhance contextual
understanding. Additionally, advanced cross-modal fusion techniques could better capture the interplay
between text and visuals, particularly for sarcasm detection. For low-resource languages like Bodo, data
augmentation strategies [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] or cross-lingual transfer learning [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] could help address data scarcity,
improving generalization and robustness in detecting harmful content across diverse linguistic and
cultural contexts.
      </p>
    </sec>
    <sec id="sec-8">
      <title>7. Conclusion and Future Work</title>
      <p>
        This paper presents our system for the HASOC 2025 shared task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], addressing the multimodal
classification of harmful content in memes across Hindi, Bangla, and Bodo. Our framework integrates a CLIP
vision encoder with language-specific Transformer models, employing a cross-attention mechanism to
fuse text and image modalities for detecting sentiment, sarcasm, vulgarity, and abuse. The experiments
highlight the efficacy of tailoring text encoders to each language, demonstrating that optimal model
selection varies across linguistic contexts. A significant contribution of this work is establishing a
performance benchmark for the low-resource Bodo language in a multimodal setting, addressing a
critical gap in harmful content detection for under-resourced Indian languages.
      </p>
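      <p>
        The fusion step summarized above can be illustrated with a minimal, dependency-free sketch of single-head scaled dot-product cross-attention, in which text-token embeddings act as queries over image-patch embeddings. This is not the submitted implementation: the function names (cross_attention, mean_pool), the toy embeddings, the absence of learned projection matrices, and the mean pooling before the task heads are all assumptions made for illustration.
      </p>
      <preformat>
```python
import math

Matrix = list[list[float]]


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: list[float], v: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attention(text_tokens: Matrix, image_patches: Matrix) -> tuple[Matrix, Matrix]:
    """Let every text token (query) attend over all image patches (keys and values).

    Returns one image-informed vector per text token and the attention weights.
    Learned query/key/value projections are omitted (assumption, for brevity).
    """
    dim = len(image_patches[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in text_tokens:
        scores = [dot(query, patch) / scale for patch in image_patches]
        attn = softmax(scores)
        # Weighted sum of the patch vectors, one output dimension at a time.
        context = [
            sum(w * patch[d] for w, patch in zip(attn, image_patches))
            for d in range(dim)
        ]
        fused.append(context)
        weights.append(attn)
    return fused, weights


def mean_pool(vectors: Matrix) -> list[float]:
    """Average the fused token vectors into a single meme-level representation."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


# Toy inputs (hypothetical): 2 text-token and 3 image-patch embeddings of size 4.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_patches = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
]

fused, weights = cross_attention(text_tokens, image_patches)
pooled = mean_pool(fused)  # would be passed to the per-task classification heads
print(weights)
print(pooled)
```
      </preformat>
      <p>
        In a full system, the queries, keys, and values would presumably come from the language-specific text encoder and the CLIP vision encoder through learned projections, and the pooled vector would feed one classification head per task (sentiment, sarcasm, vulgarity, and abuse).
      </p>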
      <p>
        Several avenues exist for extending this research. To enhance performance on Bodo, data
augmentation techniques, such as back-translation [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] or generative models [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], could mitigate data scarcity.
Additionally, evaluating zero-shot cross-lingual transfer by applying models trained on larger Hindi and
Bangla datasets to Bodo [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] offers a promising direction. To better handle implicit and sarcastic content
across all languages, integrating external knowledge graphs [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] could provide essential contextual
insights. Exploring advanced fusion techniques, such as those in recent foundation models [24], may
further improve modality integration. Moreover, investigating diverse Transformer models, including
newly released Indian language-specific models, and optimizing hyperparameters could yield
performance gains while balancing computational efficiency. These efforts aim to advance robust, culturally
sensitive content moderation in multilingual, multimodal settings.
      </p>
    </sec>
    <sec id="sec-9">
      <title>8. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Writefull and Grammarly to check grammar
and spelling. After using these tools, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Budhiraja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <article-title>The state and fate of linguistic diversity and inclusion in the NLP world</article-title>
          , in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>6282</fpage>
          -
          <lpage>6293</lpage>
          . URL: https://aclanthology.org/2020.acl-main.560/. doi:10.18653/v1/2020.acl-main.560.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <article-title>LXMERT: Learning cross-modality encoder representations from transformers</article-title>
          , in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>5100</fpage>
          -
          <lpage>5111</lpage>
          . URL: https://aclanthology.org/D19-1514/. doi:10.18653/v1/D19-1514.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Karim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shajalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Multimodal hate speech detection from bengali memes and texts</article-title>
          ,
          <source>arXiv preprint arXiv:2204.10196</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2204.10196. arXiv:2204.10196.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <article-title>Team-IITP@DravidianLangTech-2022: A Multimodal System for Troll Meme Classification in Tamil</article-title>
          , in:
          <source>Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
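          <string-name><given-names>K.</given-names> <surname>Ghosh</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Das</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Narzary</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Saha</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Barman</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Mukherjee</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Modha</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Ganguly</surname></string-name>,
          <string-name><given-names>U.</given-names> <surname>Garain</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Jaki</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Mandl</surname></string-name>,
          <article-title>Overview of the HASOC Track at FIRE 2025: Abusive Meme Identification - Shadows Behind the Laughter</article-title>
          , in: K. Ghosh, T. Mandl, S. Pal (Eds.),
          <source>Forum for Information Retrieval Evaluation (Working Notes) (FIRE 2025)</source>
          , December 17-20, Varanasi, India, CEUR-WS.org,
          <year>2025</year>
          .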
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khanuja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kunchukuttan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>MuRIL: Multilingual Representations for Indian Languages</article-title>
          , in:
          <source>Findings of the Association for Computational Linguistics: EMNLP 2021</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3355</fpage>
          -
          <lpage>3365</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          , in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          . URL: https://aclanthology.org/2020.acl-main.747/. doi:10.18653/v1/2020.acl-main.747.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sarker</surname>
          </string-name>
          ,
          <article-title>BanglaBERT: A Denoising Autoencoder based Pre-trained Language Model for Bangla</article-title>
          ,
          <source>arXiv preprint arXiv:2111.05601</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kakwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kunchukuttan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Golla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Gokul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , A. R,
          <article-title>IndicBERT: A Pre-trained Language Model for 12 Indian Languages</article-title>
          ,
          <source>arXiv preprint arXiv:2012.05418</source>
          (
          <year>2020</year>
          ). URL: http://dx.doi.org/10.18653/v1/2022.findings-acl.145. doi:10.18653/v1/2022.findings-acl.145.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning Transferable Visual Models From Natural Language Supervision</article-title>
          , in: M. Meila, T. Zhang (Eds.),
          <source>Proceedings of the 38th International Conference on Machine Learning (ICML)</source>
          , PMLR,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          . URL: https://arxiv.org/abs/2103.00020.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention Is All You Need</article-title>
          , in:
          <source>Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          . URL: https://arxiv.org/abs/1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , G. Chanan,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Köpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          ,
          <article-title>PyTorch: An imperative style, high-performance deep learning library</article-title>
          , in:
          <source>Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>8026</fpage>
          -
          <lpage>8037</lpage>
          . URL: https://arxiv.org/abs/1912.01703. arXiv:1912.01703.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          , et al.,
          <article-title>Transformers: State-of-the-Art Natural Language Processing</article-title>
          ,
          in:
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . URL: https://aclanthology.org/2020.emnlp-demos.6/. doi:10.18653/v1/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled Weight Decay Regularization</article-title>
          , in:
          <source>International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1711.05101.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kirkpatrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pascanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rabinowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Desjardins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Milan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ramalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grabska-Barwinska</surname>
          </string-name>
          , et al.,
          <article-title>Overcoming catastrophic forgetting in neural networks</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences (PNAS)</source>
          <volume>114</volume>
          (
          <year>2017</year>
          )
          <fpage>3521</fpage>
          -
          <lpage>3526</lpage>
          . URL: http://dx.doi.org/10.1073/pnas.1611835114. doi:10.1073/pnas.1611835114.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Arlot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Celisse</surname>
          </string-name>
          ,
          <article-title>A survey of cross-validation procedures for model evaluation</article-title>
          ,
          <source>Statistics Surveys</source>
          <volume>4</volume>
          (
          <year>2010</year>
          )
          <fpage>40</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grandini</surname>
          </string-name>
          , E. Bagli, G. Visani,
          <article-title>Metrics for multi-class classification: an overview</article-title>
          ,
          <source>arXiv preprint arXiv:2008.05756</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2008.05756. arXiv:2008.05756.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lefever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hoste</surname>
          </string-name>
          ,
          <article-title>SemEval-2018 Task 3: Irony Detection in English Tweets</article-title>
          , in:
          <source>Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          . URL: https://aclanthology.org/S18-1005/. doi:10.18653/v1/S18-1005.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Language Models as Knowledge Bases?</article-title>
          ,
          in:
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>2463</fpage>
          -
          <lpage>2473</lpage>
          . URL: https://aclanthology.org/D19-1250/. doi:10.18653/v1/D19-1250.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <article-title>Improving neural machine translation models with monolingual data</article-title>
          , in: K. Erk, N. A. Smith (Eds.),
          <source>Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Berlin, Germany,
          <year>2016</year>
          , pp.
          <fpage>86</fpage>
          -
          <lpage>96</lpage>
          . URL: https://aclanthology.org/P16-1009/. doi:10.18653/v1/P16-1009.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddhant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Firat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <article-title>XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation</article-title>
          , in:
          <source>Proceedings of the 37th International Conference on Machine Learning (ICML)</source>
          , PMLR,
          <year>2020</year>
          , pp.
          <fpage>4411</fpage>
          -
          <lpage>4421</lpage>
          . URL: https://arxiv.org/abs/2003.11080.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gangal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chandar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>A survey of data augmentation approaches for NLP</article-title>
          (
          <year>2021</year>
          )
          <fpage>968</fpage>
          -
          <lpage>988</lpage>
          . URL: https://aclanthology.org/2021.findings-acl.84/. doi:10.18653/v1/2021.findings-acl.84.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation</article-title>
          , in:
          <source>Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI)</source>
          , volume
          <volume>35</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>13988</fpage>
          -
          <lpage>13996</lpage>
          . URL: https://aclanthology.org/2021.tacl-1.11/. doi:10.1162/tacl_a_00360.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>