<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Sexism Identification Using Annotator Ranking in Memes: A Multimodal Approach Using Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Deobrat Kumar Jha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mitesh Kumar Mandal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar Madasamy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Technology, National Institute of Technology Karnataka Surathkal</institution>
          ,
          <addr-line>Mangalore 575025</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Memes are a popular medium for sharing information on social media, often embedding humor and interactive content. However, they can also propagate sexism, frequently targeting women. This paper presents a multimodal approach to three tasks: detecting sexism in memes, classifying the intention behind sexist memes, and categorizing the type of sexism. We leverage BERT for textual analysis, BLIP for multimodal processing, and Vision Transformers (ViT) for image feature extraction. Our model achieves approximately 68.49% accuracy in identifying sexist memes, 68.52% accuracy in determining the source intention, and 49.31% accuracy in sexism categorization. This work contributes to creating safer digital spaces by automating the detection of biased content on social media.</p>
      </abstract>
      <kwd-group>
        <kwd>BERT</kwd>
        <kwd>BLIP</kwd>
        <kwd>Memes</kwd>
        <kwd>Social Media</kwd>
        <kwd>Sexism</kwd>
        <kwd>Vision Transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <p>In recent years, memes have gained immense popularity as a means of communication on social
media. They encapsulate sentiments, humor, and opinions, making them an effective tool for spreading
messages, including both positive and negative discourse. The field of automated meme analysis has
been explored extensively through sentiment analysis, multimodal learning, and computer vision
techniques. This literature survey reviews key research contributions relevant to our work on detecting
sexism in memes and identifying the source intention behind them.</p>
      <p>A study was conducted on the sentiment analysis of text memes using various supervised machine
learning models. It highlights that memes often contain sentiments toward specific issues,
individuals, or entities, and that their classification requires effective text-processing techniques. The study
compared Naïve Bayes, Support Vector Machines (SVM), Decision Trees, and Convolutional Neural
Networks (CNN) for analyzing Indonesian text memes. Among these models, Naïve Bayes demonstrated
the highest accuracy of 65.4% in classifying meme sentiment. This study underscores the significance
of textual feature extraction in meme classification and provides a baseline for incorporating machine
learning techniques in sentiment analysis.</p>
      <p>Another work presents a multimodal approach to meme sentiment analysis that integrates both
visual and linguistic components. The study focused on analyzing Hinglish and English memes using a
dataset of 3999 labeled memes. It explored the effectiveness of various models,
including RoBERTa, CLIP, BERT, SVM, Multinomial Naïve Bayes, and VADER, in identifying sentiments
as positive, negative, or neutral. The RoBERTa-CLIP combination yielded the highest accuracy of 82%,
significantly outperforming traditional sentiment analysis models such as BERT (64%), SVM (42%),
and Naïve Bayes (34%). Their findings emphasize the importance of incorporating both textual and
visual features for a comprehensive understanding of meme sentiment, which is highly relevant to our
approach in detecting sexism in memes.</p>
      <p>
        The Vision Transformer (ViT) model and its role in image analysis was investigated in a study [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The
study delves into the four core components of ViT—patch division, token selection, position encoding,
and attention calculation—that enhance its capability in visual processing. A review of ViT applications
across various domains, including medical image processing and object detection, provides insights
into how advanced deep learning architectures can be leveraged for meme classification. Given that
memes contain both text and images, ViT’s powerful feature extraction mechanisms are crucial for
understanding their context and detecting underlying biases.
      </p>
      <p>
        A study [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] explored the detection of extreme sentiments on social networks using BERT. The work
builds on previous studies that classified social media posts as positive, negative, or neutral by refining
sentiment classification using a semi-supervised approach. The study demonstrated that many posts
classified as extremely positive or negative indeed carried heightened sentiments when analyzed with
BERT, proving its effectiveness in fine-tuning sentiment detection. This work is relevant for identifying
extreme opinions that contribute to sexism and hate speech in memes.
      </p>
      <p>
        A scalable harmful meme detection framework using Graph Neural Networks (GNNs) was introduced
in a study [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], incorporating both invariant and specific modality representations. The method enhances
cross-modal interaction by projecting visual and textual data into distinct spaces to address the modality
gap. This approach significantly improves harmful meme classification by dynamically balancing
inter-modal and intra-modal relationships. The study highlights an effective method for detecting
harmful memes, aligning with efforts to detect sexist content in memes.
      </p>
      <p>
        Another study [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] focused on hate speech detection in social media memes using machine learning.
It analyzed the challenges associated with detecting hate speech in visual content, emphasizing the
need for automatic detection mechanisms to prevent hate speech propagation. The researchers utilized
Facebook AI’s hateful meme dataset to evaluate unimodal and multimodal approaches, highlighting the
complexities in meme analysis. The findings underscore the necessity of robust multimodal techniques
for effective hate speech and bias detection in social media content.
      </p>
      <p>Recent advancements in natural language processing and deep learning have enabled significant
progress in understanding and analyzing internet memes, which typically combine text and visual
components to convey humor, opinions, or controversial content.</p>
      <p>
        A deep learning-based approach was proposed that utilizes both textual and visual features to detect
and classify memes. This multi-modal architecture not only outperformed traditional unimodal models
but also tracked the evolution of memes during political events, shedding light on how memes transform
and propagate online [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        To address the complex semantics of memes, a large-scale dataset was introduced that focuses on
metaphor usage within meme content. The dataset includes annotations for sentiment, intent, metaphor
type, and offensiveness. The study demonstrated that incorporating metaphor analysis significantly
enhances the performance of models tasked with meme sentiment classification [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In efforts to detect offensive memes, a multi-step pipeline was developed. It involved extracting
embedded text using OCR, classifying its offensive nature using a GRU-based model, and categorizing
the level of offense. This approach highlights the importance of automated tools in identifying harmful
content at scale [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Focusing on emotional interpretation, transformer-based models like BERT were applied to meme
sentiment classification as part of a benchmark challenge. These models showed superior performance
in detecting emotions such as sarcasm and humor compared to older LSTM-based models [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        A novel framework enhanced meme detection by utilizing a vision transformer that emphasizes
important visual regions. By introducing visual part utilization and attention mechanisms, this model
excelled at distinguishing memes from non-meme images, especially in complex scenarios [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Beyond memes, robust deep learning techniques have improved hate speech detection. One study
categorized tweets into subtypes such as racism and sexism using neural models, outperforming
traditional machine learning methods [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. However, another study revealed the limitations of current
models, showing their vulnerability to adversarial attacks like text obfuscation or insertion of neutral
words [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        HateBERT is a domain-adapted BERT model retrained on 1.4M Reddit posts from communities banned
for offensive content. It effectively captures the linguistic patterns of hate and abuse, outperforming
general BERT models in tasks like offensive, abusive, and hate speech detection. Its robustness and
cross-domain portability make it ideal for social media toxicity analysis in research contexts [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Additionally, distributed comment embeddings were employed to detect abusive language on online
platforms. These embeddings capture semantic context while reducing dimensionality, leading to
efficient and scalable classification of toxic content [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section describes the methodology used for meme classification, leveraging multimodal deep
learning techniques. The approach integrates visual and textual feature extraction using BLIP (Bootstrapping
Language-Image Pretraining), BERT (Bidirectional Encoder Representations from Transformers), and
ViT (Vision Transformer), followed by an attention-based fusion mechanism. A
multi-layer perceptron (MLP) then performs the final classification.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Processing Workflow</title>
        <p>The process begins with loading a JSON file containing descriptions of memes, as outlined in Algorithm 1
and illustrated in Figure 1. This file is checked for correct formatting and converted into a pandas
DataFrame. The dataset is then examined to remove irrelevant columns that do not contribute to
the analysis, streamlining further processing. Next, the data is divided into two categories: Non-Tie
Cases, where a clear consensus among annotators exists, and Tie Cases, where conflicting rankings
are provided. For non-tie cases, the most frequently chosen label is assigned using a majority vote
approach. In tie cases, annotator ranking is employed to resolve conflicts based on annotator reliability
or predefined criteria. The outcomes from both the majority vote and tie resolution processes are
then merged into a single, consistent dataset. To ensure environmental compatibility, image paths are
adjusted according to the directory structure. Finally, the fully cleaned and processed dataset is exported
in CSV format for easy accessibility and compatibility with various machine learning frameworks.</p>
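The workflow above can be sketched with pandas. The record fields used here ("id", "labels", "path") and the output file name are hypothetical stand-ins, not the actual EXIST dataset schema:

```python
import pandas as pd

# Toy records mimicking the meme annotation format described above.
records = [
    {"id": "m1", "labels": ["YES", "YES", "NO"], "path": "img/m1.jpg"},
    {"id": "m2", "labels": ["YES", "NO"], "path": "img/m2.jpg"},       # tie case
    {"id": "m3", "labels": ["NO", "NO", "NO"], "path": "img/m3.jpg"},
]
df = pd.DataFrame(records)

def majority_or_none(labels):
    """Return the majority label, or None when the vote is tied."""
    counts = pd.Series(labels).value_counts()
    top = counts[counts == counts.max()]
    return top.index[0] if len(top) == 1 else None

df["final_label"] = df["labels"].apply(majority_or_none)
non_tie = df[df["final_label"].notna()]   # clear consensus: majority vote
tie = df[df["final_label"].isna()]        # conflicting rankings: Algorithm 1

# After tie resolution, both partitions are merged and exported to CSV.
resolved = pd.concat([non_tie, tie]).sort_index()
resolved.to_csv("processed_memes.csv", index=False)
```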
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Methodology for Detecting Sexism in Memes (Subtask 1)</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Overview</title>
          <p>To detect whether a meme is sexist or not, we use both the image and the text in the meme. First, we
extract features from the image and generate a caption using a model called BLIP. At the same time, the
actual meme text is processed using BERT, which helps us understand the meaning of the text. BLIP
takes care of the image part—it extracts the main details from the image and writes a short caption
describing it. Both the original text and the generated caption are then passed through BERT to get
meaningful text representations. To make these text features stronger, we apply attention pooling,
which helps the system focus on the important words. Finally, we combine everything—the image
features, the meme text, and the caption—into one complete set of features. This combined data goes
into an MLP classifier, which predicts whether the meme is sexist. We train this whole model using
standard methods like cross-entropy loss and the Adam optimizer to get the best performance.</p>
          <p>The flow diagram of the methodology for this subtask is shown in Figure 2. The subsequent sections
elaborate on each stage of the process in greater detail.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Input Data Processing</title>
          <p>The system takes two main inputs: the meme image and the meme text. Feature extraction is performed
on the meme image, followed by caption generation. The texts are processed independently from the
images.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Image Pre-processing using BLIP</title>
          <p>Algorithm 1 Annotator Ranking and Label Assignment
1: Input: Dataset D with columns: annotator IDs, labels, final labels; tie-case dataset D_tie
2: Output: Ranked annotator dictionary, updated D_tie with resolved labels
3: Initialize empty dictionaries correct_counts and total_counts
4: for each record in D do
5:   Get annotators, labels, and final_label
6:   for each (annotator, label) pair do
7:     if label is valid (not ’-’ or null) and final_label is valid then
8:       Increment total_counts[annotator]
9:       if label = final_label then
10:        Increment correct_counts[annotator]
11:      end if
12:    end if
13:  end for
14: end for
15: Initialize empty dictionary annotator_accuracy
16: for each annotator in total_counts do
17:   annotator_accuracy[annotator] ← correct_counts[annotator] / total_counts[annotator]
18: end for
19: Sort annotators by accuracy in descending order
20: Create annotator_rank dictionary: assign rank (1 to N) based on sorted order
21: for each record in D_tie do
22:   Get annotators and labels
23:   Create list of (rank, label) pairs for valid labels (not ’-’ or null), using annotator_rank
24:   if list is not empty then
25:     Sort pairs by rank (ascending)
26:     Assign first label (from lowest rank) to final_label
27:   else
28:     Assign final_label ← null
29:   end if
30: end for
31: Return annotator_rank, updated D_tie</p>
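Algorithm 1 translates directly into Python. The record keys ("annotators", "labels", "final_label") are illustrative assumptions about how each row of the dataset is represented:

```python
from collections import defaultdict

def rank_annotators(records):
    """Rank annotators by how often their label matches the final label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        final = rec["final_label"]
        if final in (None, "-"):          # skip records without a valid final label
            continue
        for annotator, label in zip(rec["annotators"], rec["labels"]):
            if label in (None, "-"):      # skip invalid annotations
                continue
            total[annotator] += 1
            if label == final:
                correct[annotator] += 1
    accuracy = {a: correct[a] / n for a, n in total.items()}
    ranked = sorted(accuracy, key=accuracy.get, reverse=True)
    return {a: i + 1 for i, a in enumerate(ranked)}   # rank 1 = most reliable

def resolve_ties(tie_records, annotator_rank):
    """Give each tie case the label chosen by its highest-ranked annotator."""
    for rec in tie_records:
        pairs = sorted((annotator_rank[a], l)
                       for a, l in zip(rec["annotators"], rec["labels"])
                       if l not in (None, "-") and a in annotator_rank)
        rec["final_label"] = pairs[0][1] if pairs else None
    return tie_records
```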
        </sec>
        <sec id="sec-3-2-4">
          <title>3.2.4. Text Pre-processing using BERT</title>
          <p>BERT handles meme text processing independently. The meme text is tokenized using the BERT
tokenizer for uniform representation. Special tokens like [CLS] and [SEP] are included for better
semantic encoding. These tokens are then converted to tensors along with attention masks to support
variable-length inputs.</p>
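In practice this step is handled by the BERT tokenizer from the transformers library; the toy sketch below, with a made-up vocabulary and IDs, only illustrates the structure it produces: [CLS] and [SEP] markers, fixed-length padding, and the matching attention mask:

```python
# Toy illustration of BERT-style input preparation; vocabulary and IDs are invented.
vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "this": 1, "meme": 2, "is": 3, "funny": 4}

def encode(text, max_len=8):
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    ids = [vocab[t] for t in tokens]
    attention_mask = [1] * len(ids)
    while len(ids) < max_len:            # pad to a fixed length so batches stack
        ids.append(vocab["[PAD]"])
        attention_mask.append(0)         # padded positions are masked out
    return ids, attention_mask

ids, mask = encode("this meme is funny")
print(ids)   # [101, 1, 2, 3, 4, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```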
        </sec>
        <sec id="sec-3-2-5">
          <title>3.2.5. BLIP: Image Feature Extraction and Caption Generation</title>
          <p>The BLIP model consists of two major components: a Vision Encoder (ViT) and a Text Decoder. The
Vision Encoder extracts visual features from the meme image, and the Text Decoder generates a
descriptive caption of the image. For extracting image features, the image is processed by ViT to
produce visual embeddings, and the CLS token output from the final transformer layer is used as the
image representation. For caption generation, the image is passed through BLIP’s decoder to generate a
caption, which represents the semantic meaning of the image and is then passed through BERT for
embedding.</p>
        </sec>
        <sec id="sec-3-2-6">
          <title>3.2.6. BERT: Textual Embedding for Meme Text and Generated Captions</title>
          <p>BERT is applied to two textual inputs: the raw meme text and the generated captions. For meme text
embedding, the raw meme text is tokenized using the BERT tokenizer, encoded using BERT, and the
[CLS] token embedding is extracted. Similarly, for caption text embedding, the BLIP-generated caption
is tokenized and encoded, and its [CLS] token embedding is extracted as the caption’s representation.</p>
        </sec>
        <sec id="sec-3-2-7">
          <title>3.2.7. Attention Pooling for Feature Aggregation</title>
          <p>To improve text feature representation, attention pooling is used. This mechanism assigns different
weights to token embeddings based on their relevance and aggregates these embeddings into a single
vector for each text source. Attention pooling is applied to BERT’s hidden states for both the meme text
and the captions, resulting in one aggregated vector per input source.</p>
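A minimal PyTorch sketch of this pooling step is shown below; the single linear scoring layer is an assumption about how the relevance weights are computed, and the 768 hidden size matches BERT-base:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Aggregate token embeddings into one vector via learned attention weights."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)   # one relevance score per token

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len)
        scores = self.scorer(hidden_states).squeeze(-1)          # (batch, seq_len)
        scores = scores.masked_fill(attention_mask == 0, -1e9)   # ignore padding
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (batch, seq_len, 1)
        return (weights * hidden_states).sum(dim=1)              # (batch, hidden_dim)

pool = AttentionPooling(hidden_dim=768)
h = torch.randn(2, 16, 768)                    # e.g. BERT hidden states for 2 memes
mask = torch.ones(2, 16, dtype=torch.long)
pooled = pool(h, mask)
print(pooled.shape)  # torch.Size([2, 768])
```

The same module is applied once to the meme-text hidden states and once to the caption hidden states, giving one aggregated vector per source.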
        </sec>
        <sec id="sec-3-2-8">
          <title>3.2.8. Feature Fusion: Combining Image and Text Embeddings</title>
          <p>All extracted features are concatenated into a single representation, comprising image features, meme
text embedding, and caption embedding. This final representation captures visual semantics from
BLIP-ViT, text semantics from the raw meme text, and generated caption semantics.</p>
        </sec>
        <sec id="sec-3-2-9">
          <title>3.2.9. MLP Classifier for Final Prediction</title>
          <p>A Multi-Layer Perceptron (MLP) processes the fused feature vector. The MLP consists of one or more
fully connected layers, uses ReLU activation for non-linearity, and includes dropout layers to prevent
overfitting. The final classification is performed via a softmax layer.</p>
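The fusion and classification stages can be sketched together. The three 768-dimensional inputs match BERT/ViT-base outputs; the hidden size (512) and dropout rate (0.3) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MemeClassifier(nn.Module):
    """Concatenate image, meme-text, and caption features, then classify via an MLP."""
    def __init__(self, img_dim=768, txt_dim=768, cap_dim=768, hidden=512, n_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim + cap_dim, hidden),
            nn.ReLU(),              # non-linearity
            nn.Dropout(0.3),        # regularization against overfitting
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_feat, txt_feat, cap_feat):
        fused = torch.cat([img_feat, txt_feat, cap_feat], dim=-1)  # feature fusion
        return torch.softmax(self.mlp(fused), dim=-1)              # class probabilities

model = MemeClassifier()
probs = model(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768))
print(probs.shape)  # torch.Size([4, 2])
```

Note that during training with CrossEntropyLoss one would pass the raw logits (before softmax) to the loss, since PyTorch's CrossEntropyLoss applies log-softmax internally.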
        </sec>
        <sec id="sec-3-2-10">
          <title>3.2.10. Training and Optimization</title>
          <p>To ensure optimal training, several techniques are used. Cross-entropy loss is applied for multi-class
classification. The Adam optimizer is used for efficient parameter updates. Training is conducted in
mini-batches to stabilize learning and improve generalization.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Methodology for Finding the source intention behind the Sexism Memes (Subtask 2)</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Overview</title>
          <p>This part of the project focuses on building a multi-modal model to detect sexism in memes and also
understand the type of intention behind them. We take both the meme text (caption) and the meme
image into account. For text, we use the BERT model, and for image, we use the Vision Transformer
(ViT).</p>
          <p>As shown in Figure 3, we first preprocess the dataset to separate meme texts and images. The text
goes through cleaning, tokenization using BERT, and embedding generation. Similarly, the image is
preprocessed and passed through ViT to get image embeddings. These embeddings are then reduced
using linear layers.</p>
          <p>After that, both text and image embeddings are combined and passed through a simple MLP model
to predict the intention behind the meme. The output is one of the two categories:
• DIRECT
• JUDGEMENTAL</p>
          <p>This combined method helps the model understand both what is written and what is shown in the
meme, which improves accuracy in detecting sexism. The subsequent sections elaborate on each stage
of the process in greater detail.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Input Data Processing</title>
          <p>Dataset preparation is the same as in Subtask 1; we only need to change the label, since both are binary
classification tasks.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Model Architecture</title>
          <p>In our model architecture, we first process the text and image separately using separate tokenizers and
embedders. We then concatenate the resulting features and feed them into a Multi-Layer Perceptron for classification.
BERT (Bidirectional Encoder Representations from Transformers) for text analysis We
apply BERT-base-uncased, a transformer model that utilizes attention mechanisms to process
text data into a contextualized representation.</p>
          <p>• The text input is tokenized and sent through BERT’s embedding layer, mapping each word to a
high-dimensional space.
• BERT’s transformer layers produce contextualized embeddings, which capture semantic
relationships between words.
• We extract the last hidden state corresponding to the [CLS] token, which is a 768-dimensional
vector.
• This 768-dimensional text feature is then passed through a fully connected (FC) layer that reduces
the dimensionality to 512.</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>Vision Transformer (ViT) for Memes image analysis</title>
          <p>We make use of ViT-base-patch16-224-in21k, a model that perceives images as a sequence of patches.</p>
          <p>• Each image is divided into 16 × 16 pixel patches, and each patch is projected into a
high-dimensional embedding space.
• A position embedding is added to account for the spatial information inherent in the patches.
• The transformer layers of ViT process these embeddings in a manner analogous to how BERT
processes words.
• At the end of the model, the hidden state corresponding to the classification token (a
768-dimensional vector) is extracted.
• This 768-dimensional image feature is then processed by a fully connected (FC) layer
that reduces the dimensionality from 768 to 256.</p>
        </sec>
        <sec id="sec-3-3-5">
          <title>Concatenation of features/ embeddings extracted from above methods for text and image</title>
          <p>• The text and image features, of 512 and 256 dimensions respectively, are concatenated into a
768-dimensional vector.
• This fused representation is then run through a Multi-Layer Perceptron for classification.
• The output is a vector of two logits corresponding to the probability scores of the
DIRECT and JUDGEMENTAL classes.
• The predicted class is the one with the highest score.</p>
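The two-branch fusion head described above can be sketched as follows; the MLP hidden size (256) and the ordering of the two output classes are assumptions:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Reduce BERT [CLS] (768 -> 512) and ViT CLS (768 -> 256), concatenate to 768,
    then emit two logits for DIRECT vs JUDGEMENTAL."""
    def __init__(self):
        super().__init__()
        self.text_fc = nn.Linear(768, 512)    # text branch reduction
        self.image_fc = nn.Linear(768, 256)   # image branch reduction
        self.mlp = nn.Sequential(nn.Linear(512 + 256, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, text_cls, image_cls):
        fused = torch.cat([self.text_fc(text_cls), self.image_fc(image_cls)], dim=-1)
        return self.mlp(fused)   # logits; argmax gives the predicted class

head = FusionHead()
logits = head(torch.randn(4, 768), torch.randn(4, 768))
pred = logits.argmax(dim=-1)     # class index 0 or 1 (ordering assumed)
print(logits.shape)  # torch.Size([4, 2])
```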
        </sec>
        <sec id="sec-3-3-6">
          <title>3.3.4. Process of Training</title>
          <p>Loss Function For our study, we use CrossEntropyLoss, a standard loss function for
multi-class classification. It measures the difference between the predicted class probabilities
and the true class labels.</p>
          <p>Optimization We selected the AdamW optimizer, with a learning rate of 2 × 10⁻⁵, as it is typically the
optimizer of choice for fine-tuning pre-trained transformers. In addition, we employ weight
decay as a regularizer to prevent overfitting.</p>
        </sec>
        <sec id="sec-3-3-7">
          <title>Training Loop</title>
          <p>We trained the model for 5 epochs with a batch size of 16. In each iteration:
1. Text and image inputs are fed through their separate models.
2. Features are extracted, transformed, and finally concatenated and fused.
3. The resulting fused representation is classified, and a loss value is computed.
4. The loss value is backpropagated to update the weights according to the loss function and
optimizer.</p>
          <p>We monitored potential improvements in model performance using the average loss per epoch.</p>
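The training loop can be sketched as below. Random tensors stand in for the real fused BERT/ViT features, and the weight-decay value (0.01) and classifier head are illustrative assumptions; the epoch count (5), batch size (16), AdamW optimizer, learning rate (2e-5), and per-epoch average loss monitoring follow the description above:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins for the fused text+image features (512 + 256 = 768 dimensions).
features = torch.randn(64, 768)
labels = torch.randint(0, 2, (64,))          # two classes: DIRECT / JUDGEMENTAL
loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

# Placeholder classifier head; the full model also includes the BERT and ViT branches.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

model.train()
for epoch in range(5):                       # 5 epochs, batch size 16
    running_loss = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)        # CrossEntropyLoss takes raw logits
        loss.backward()                      # backpropagate
        optimizer.step()                     # AdamW weight update
        running_loss += loss.item()
    avg_loss = running_loss / len(loader)    # monitored per-epoch metric
    print(f"epoch {epoch + 1}: average loss {avg_loss:.4f}")
```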
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Methodology for Sexism Categorization in Memes (Subtask 3)</title>
        <sec id="sec-3-4-1">
          <title>3.4.1. Overview</title>
          <p>This subtask deals with classifying sexist memes into five specific categories. Our model uses both the
meme image and its text to understand the meaning better and make the final prediction.</p>
          <p>As shown in Figure 4, we first separate the image and text from the training data. The meme text is
translated to English (if needed), cleaned, and preprocessed. On the image side, we use the BLIP model
to generate captions after doing basic image preprocessing.</p>
          <p>Both the original meme text and the captions generated from the image are then passed through
HateBERT. First, we tokenize the inputs, and then generate embeddings for both the text and the
captions. These embeddings are combined to create one final feature.
This combined feature is then passed through attention pooling and a Multi-Layer Perceptron (MLP)
classifier, which predicts one of the five categories:
• Ideology and Inequality
• Stereotyping and Dominance
• Objectification
• Sexual Violence
• Misogyny and Non-Sexual Violence</p>
          <p>Using both the image and text like this helps the model understand the meme more completely,
especially in cases where just one of them is not enough. The subsequent sections elaborate on each
stage of the process in greater detail.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.2. Input Data Processing</title>
          <p>The system takes two main inputs: the meme image and the meme text. Feature extraction is performed
on the image, followed by caption generation. The texts are processed independently from the images.</p>
        </sec>
        <sec id="sec-3-4-3">
          <title>3.4.3. Image Pre-processing using BLIP</title>
          <p>All input images are preprocessed for compatibility with the BLIP model. The images are converted to
RGB, resized, and normalized to the dimensions required by BLIP’s Vision Transformer (ViT). These
processed images are then converted into tensor representations to serve as input to the BLIP model.</p>
        </sec>
        <sec id="sec-3-4-4">
          <title>3.4.4. Text Pre-processing using HateBERT</title>
          <p>HateBERT handles meme text processing independently. The meme text is tokenized using the HateBERT
tokenizer for uniform representation. Special tokens like [CLS] and [SEP] are included for better
semantic encoding. These tokens are then converted to tensors along with attention masks to support
variable-length inputs.</p>
        </sec>
        <sec id="sec-3-4-5">
          <title>3.4.5. BLIP: Image Feature Extraction and Caption Generation</title>
          <p>The BLIP model has two major components: a Vision Encoder (ViT) that extracts visual features from
the meme image, and a Text Decoder that generates a descriptive caption of the image. For extracting
image features, the image is processed by ViT to produce visual embeddings, and the CLS token output
from the final transformer layer is used as the image representation. For caption generation, the image
is passed through BLIP’s decoder to generate a caption, which represents the semantic meaning of the
image and is then passed through HateBERT for embedding.</p>
        </sec>
        <sec id="sec-3-4-6">
          <title>3.4.6. HateBERT: Textual Embedding for Meme Text and Generated Captions</title>
          <p>HateBERT is applied to two textual inputs. For meme text embedding, the raw meme text is tokenized
using the HateBERT tokenizer, encoded using HateBERT, and the [CLS] token embedding is extracted.
Similarly, for caption text embedding, the BLIP-generated caption is tokenized and encoded, and its
[CLS] token embedding is extracted as the caption’s representation.</p>
        </sec>
        <sec id="sec-3-4-7">
          <title>3.4.7. Attention Pooling for Feature Aggregation</title>
          <p>To improve text feature representation, attention pooling is used. This mechanism assigns different
weights to token embeddings based on their relevance and aggregates these embeddings into a single
vector for each text source. Attention pooling is applied to HateBERT’s hidden states for both the meme
text and the captions, resulting in one aggregated vector per input source.</p>
        </sec>
        <sec id="sec-3-4-8">
          <title>3.4.8. Feature Fusion: Combining Image and Text Embeddings</title>
          <p>All extracted features are concatenated into a single representation comprising image features, meme
text embedding, and caption embedding. This final representation captures visual semantics from
BLIP-ViT, text semantics from the raw meme text, and generated caption semantics.</p>
        </sec>
        <sec id="sec-3-4-9">
          <title>3.4.9. MLP Classifier for Final Prediction</title>
          <p>A Multi-Layer Perceptron (MLP) processes the fused feature vector. The MLP consists of one or more
fully connected layers, uses ReLU activation for non-linearity, and includes dropout layers to prevent
overfitting. The final classification is performed via a softmax layer.</p>
        </sec>
        <sec id="sec-3-4-10">
          <title>3.4.10. Training and Optimization</title>
          <p>To ensure optimal training, several techniques are used. Cross-entropy loss is applied for multi-class
classification. The Adam optimizer is used for efficient parameter updates. Training is conducted in
mini-batches to stabilize learning and improve generalization.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>This section discusses the dataset, experimental setup, and results obtained after evaluating the
fine-tuned model on the training and validation data.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset Description</title>
        <p>
          The dataset we have used for the given tasks is provided by CLEF 2025 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], named EXIST-Datasets. The
EXIST 2025 dataset aims to provide the research community with the first comprehensive multimedia
dataset, encompassing tweets, memes, and videos, for sexism detection and categorization in social
media.
        </p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Meme Distribution</title>
          <p>• Total Memes: 4044
• English Memes (en): 2010
• Spanish Memes (es): 2034</p>
          <p>Subtask 3 (Sexism Categorization in Memes) labels: IDEOLOGICAL AND INEQUALITY, STEREOTYPING
AND DOMINANCE, OBJECTIFICATION, SEXUAL VIOLENCE, MISOGYNY AND NON-SEXUAL VIOLENCE.</p>
          <p>Annotator metadata:
• Gender: Female, Male
• Age distribution: 18–22 years, 23–45 years, 46+ years
• Ethnicity distribution
• Education distribution: Less than high school diploma, High school degree or equivalent,
Bachelor’s degree, Master’s degree, Doctorate, Other</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Label Distributions</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Setup</title>
        <p>We ran our experiments on Kaggle using a P100 GPU (16 GiB), with 29 GiB RAM and 57.6 GiB disk. We
used PyTorch, Transformers, and Torchvision. Full setup details are shared in our GitHub repo.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Results</title>
        <p>We evaluated our model on the hard test set provided in the EXIST 2025 shared task, which includes
challenging memes exhibiting various degrees of subtle, implicit, or overt sexist content, in line with
the annotation taxonomy (e.g., ideological inequality, stereotyping, objectification, etc.). The results
obtained for Subtask 1, Subtask 2, and Subtask 3 are presented in Table 2.</p>
        <p>The significant gap between training and validation accuracy across all subtasks indicates potential
overfitting, warranting the application of further regularization strategies or architectural adjustments.
To ensure robust performance during inference, we saved and reloaded the checkpoint corresponding
to the highest validation accuracy in each subtask for final testing.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Scope</title>
      <p>In this work, we proposed a robust multimodal classification model that efectively integrates textual and
visual modalities using a combination of BERT, BLIP, HateBERT, and ViT architectures. By leveraging
the strengths of both language and vision transformers, our approach demonstrates a comprehensive
understanding of meme content, which is often ambiguous and context-dependent.</p>
      <p>Across the three subtasks, our model consistently achieved high training accuracies, with subtask 1
and subtask 2 reaching 99.36% and 99.32% respectively, and subtask 3 achieving 97.72%. In terms of
validation performance, subtask 2 yielded the highest accuracy at 68.52%, closely followed by subtask 1
at 68.49%, while subtask 3 lagged behind at 49.31%, likely due to increased task complexity or class
imbalance.</p>
      <p>These results highlight the capability of our multimodal framework to learn rich representations from
both textual and visual features. However, the observed gap between training and validation accuracy
suggests potential overfitting, indicating room for improvement through further regularization, data
augmentation, or architectural enhancements. Overall, the promising validation results, especially in
subtask 2, confirm the effectiveness of our integrated model in handling nuanced and multimodal data
like internet memes.</p>
      <sec id="sec-5-1">
        <title>5.1. Future Work</title>
        <p>Regularization to Mitigate Overfitting. While the proposed model achieved excellent training
accuracy—99.36% for Subtask 1 (Sexism Identification), 99.32% for Subtask 2 (Source Intention Classification),
and 97.72% for Subtask 3 (Sexism Categorization)—a significant drop was observed during validation,
with corresponding accuracies of 68.49%, 68.52%, and 49.31%. This performance gap indicates overfitting,
where the model captures patterns specific to the training data but fails to generalize to unseen samples.
Future work will focus on implementing advanced regularization techniques such as optimized dropout,
fine-tuned weight decay, and strategic data augmentation. These measures aim to improve the model’s
generalization capabilities and robustness, especially in handling complex or subtle manifestations of
sexism in meme content.</p>
        <p>Dataset Expansion for Greater Robustness. Despite the diversity of the EXIST 2025 dataset, which
consists of 4044 annotated memes, its size and cultural scope remain limited for developing a model
with broad generalization. The relatively low validation accuracy in Subtask 3 (Sexism Categorization)
particularly highlights this limitation. Future efforts will concentrate on expanding the dataset to include
memes from a broader spectrum of languages, cultures, and social environments. This expansion is
expected to capture a wider variety of sexist expressions and implicit biases, enabling the model to
perform more reliably on real-world data from diverse online platforms.</p>
        <p>Improving Multimodal Fusion Techniques. The current model uses a straightforward fusion
strategy, combining image embeddings from ViT/BLIP and text embeddings from BERT/HateBERT via
concatenation, followed by classification using a multi-layer perceptron (MLP). Although this approach
yielded decent results—68.49% and 68.52% validation accuracy for Subtasks 1 and 2, respectively—it
may not fully exploit the intricate relationships between visual and textual modalities, which are often
essential in meme interpretation. Future research will explore more sophisticated fusion mechanisms,
such as co-attention networks, cross-modal transformers, and gated fusion layers, to enable deeper
interaction and contextual alignment between modalities.</p>
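To make the contrast concrete, the following sketch shows the current concatenation fusion next to one of the alternatives mentioned above, a gated fusion layer. The 768-d embedding size and the GatedFusion module are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

# Current strategy: concatenate pooled text and image embeddings.
# Dimensions are assumptions (768-d for both BERT-base and ViT-base).
text_emb = torch.randn(4, 768)    # e.g. BERT/HateBERT [CLS] output
image_emb = torch.randn(4, 768)   # e.g. ViT/BLIP pooled output
fused = torch.cat([text_emb, image_emb], dim=-1)   # shape (4, 1536)

# One alternative, a gated fusion layer: a learned sigmoid gate
# weighs how much each modality contributes per dimension.
class GatedFusion(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, t, v):
        g = torch.sigmoid(self.gate(torch.cat([t, v], dim=-1)))
        return g * t + (1 - g) * v   # per-dimension convex combination

gated = GatedFusion()(text_emb, image_emb)   # shape (4, 768)
```

Unlike plain concatenation, the gate lets the model suppress one modality when the other carries the sexist cue (e.g. text-only slurs over a neutral image).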
        <p>Web Application for Practical Deployment. To broaden the real-world impact of this research, we
plan to develop a web-based application that embeds the trained model into an accessible interface. This
tool would allow users—including educators, moderators, and researchers—to upload memes and receive
real-time predictions for sexism identification, source intention classification, and categorization. Given
the promising validation accuracies achieved, the system demonstrates strong potential for practical
use. A web application will enhance the visibility and usability of the model while supporting initiatives
aimed at fostering safer and more inclusive digital environments.</p>
        <p>The author(s) have not employed any Generative AI tools.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Scalable harmful meme detection using graph neural networks</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>46</volume>
          (
          <year>2024</year>
          )
          <fpage>789</fpage>
          -
          <lpage>802</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Asmawati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saikhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Siahaan</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis of text memes using supervised machine learning</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>190</volume>
          (
          <year>2021</year>
          )
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Jamil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cordeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dias</surname>
          </string-name>
          ,
          <article-title>Detecting extreme emotions in social networks using bert</article-title>
          ,
          <source>IEEE Access</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>45678</fpage>
          -
          <lpage>45690</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Vision transformers for image analysis: A comprehensive review</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>34</volume>
          (
          <year>2023</year>
          )
          <fpage>5123</fpage>
          -
          <lpage>5140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. K. H.</given-names>
            <surname>Tariq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <article-title>Multi-modal meme classification with image-text joint embedding and transformer-based models</article-title>
          ,
          <source>in: Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations (CONSTRAINT)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Malu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Al-Onaizan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <article-title>MAMI: A multimodal metaphor annotation dataset for internet memes</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3063</fpage>
          -
          <lpage>3074</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Munir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sohail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <article-title>Offensive memes detection using deep learning and OCR</article-title>
          ,
          <source>in: International Conference on Information Technology and Systems (ICITS)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>497</fpage>
          -
          <lpage>506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Multimodal meme emotion classification using deep learning</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Multimodal Artificial Intelligence</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <article-title>Part-aware visual meme understanding via vision transformers</article-title>
          ,
          <source>IEEE Transactions on Image Processing</source>
          <volume>32</volume>
          (
          <year>2023</year>
          )
          <fpage>4107</fpage>
          -
          <lpage>4119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Badjatiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>Deep learning for hate speech detection in tweets</article-title>
          ,
          <source>Proceedings of the 26th International Conference on World Wide Web Companion</source>
          (
          <year>2017</year>
          )
          <fpage>759</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Samory</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <article-title>The unreliability of hate speech detection</article-title>
          ,
          <source>in: Proceedings of the ACM on Human-Computer Interaction</source>
          , volume
          <volume>4</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mitrović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Granitzer</surname>
          </string-name>
          ,
          <article-title>HateBERT: Retraining BERT for abusive language detection in english</article-title>
          ,
          <source>in: Proceedings of the Fifth Workshop on Online Abuse and Harms</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>25</lpage>
          . URL: https://aclanthology.org/2021.woah-1.3.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zafar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Wani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Traore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ghafir</surname>
          </string-name>
          ,
          <article-title>Distributed comment embeddings for detecting toxic content on social media</article-title>
          ,
          <source>Journal of Information Security and Applications</source>
          <volume>63</volume>
          (
          <year>2022</year>
          )
          <fpage>103033</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>EXIST Team</surname>
          </string-name>
          ,
          <article-title>EXIST 2025 dataset for sexism detection in social media</article-title>
          ,
          <source>in: Proceedings of the Conference and Labs of the Evaluation Forum (CLEF)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>