A Contrastive Learning Based Approach to Detect Sexism
                         in Memes
                         Notebook for the EXIST Lab at CLEF 2024

                         Fariha Maqbool, Elisabetta Fersini
                         Dipartimento di informatica, sistemistica e comunicazione
                         University of Milano-Bicocca
                         Viale Sarca 336, 20126 Milan, Italy


                                      Abstract
                                      The widespread use of social media has created a unique challenge in detecting and mitigating sexism in online
                                      content. In this paper, we present our approach for detecting sexism in memes, developed for Task 4 of the EXIST
                                      2024 challenge. The task is a binary classification problem: detecting whether or not a meme is sexist,
                                      within the context of the learning with disagreement paradigm. In our approach, we used ResNet-50 and mBERT
                                      models fine-tuned on the EXIST 2024 dataset to obtain image and text embeddings. These embeddings, along with the
                                      annotators’ data, were subsequently used to train a model using contrastive learning. The results on the test data
                                      demonstrate the effectiveness of contrastive learning techniques in addressing multimodal tasks.

                                      Keywords
                                      Sexism Identification, Learning with disagreement, Contrastive Learning


                         1. Introduction
                         Sexism is a type of bias and prejudice that leads to detrimental sex-based stereotypes and societal
                         expectations. It often involves a combination of gender-based beliefs, attitudes, and actions that result
                         in uneven treatment of men and women. Historically and culturally pervasive, sexism against women is
                         rooted in the notion of male supremacy, affecting various aspects of life such as the workplace, politics,
                         society, and the family [1].
                            In today’s digital age, the widespread use of social media has contributed to the alarming prevalence of
                         sexist content. This content spreads rapidly, fueling more instances of sexism in various forms. However,
                         detecting such content might be challenging due to the diverse ways it is expressed. Internet memes, in
                         particular, have emerged as a notable medium for communicating these concepts in an engaging manner
                         [2]. Detecting sexism and other forms of hateful content in memes poses a considerable challenge.
                         Memes typically consist of an image accompanied by text, and while the visual and textual components
                         are related, they may not convey the same meaning when viewed independently. Therefore, effectively
                         identifying hateful memes requires a careful analysis of both the visual elements and the accompanying
                         text.
                            In addition to the challenges of detecting sexism in memes, another challenge arises from the inherent
                         subjectivity and disagreement among annotators when labeling such content. Different annotators
                         may have varying perspectives on sexism or hate speech, influenced by their individual backgrounds,
                         experiences, and cultural contexts. This disagreement can lead to inconsistent annotations, which must
                         be carefully managed to train machine learning models. EXIST 2024 incorporates this learning with
                         disagreement approach, which leverages these multiple perspectives, increasing dataset richness and
                         improving the ability of models to generalize across different interpretations of harmful content.
                            In this paper, we give an overview of the system we developed for the sEXism Identification in
                         Social neTworks (EXIST 2024) [3][4] shared task at CLEF 2024. Our team participated in Task 4, a
                         binary classification task to detect whether or not a meme is sexist. We proposed a contrastive
                         learning-based approach to predict the hard label of each meme, that is, the label obtained by
                         aggregating the perspectives of the different annotators.

                         CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                         f.maqbool@campus.unimib.it (F. Maqbool); elisabetta.fersini@unimib.it (E. Fersini)
                         ORCID: 0009-0008-2587-9417 (F. Maqbool); 0000-0002-8987-100X (E. Fersini)
                         © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                         CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073


2. Related Work
Over the past few years, numerous academic events and shared tasks have focused on identifying
misogyny [5][6][7] and detecting hate speech against immigrants and women [8]. It is important to note
that sexism and misogyny are not synonymous. Sexism encompasses a broad spectrum of oppression
or prejudice against women that can range from overt hostility, such as misogyny, to more nuanced
forms. Hence, while misogyny is a part of sexism, it does not define its full extent.
   To fill this research gap, the sEXism Identification in Social neTworks (EXIST) shared tasks were
proposed at the IberLEF forum [9][10], with the aim of identifying and classifying sexism in textual
data, from explicit or hostile expressions to subtle or even benevolent ones that involve implicit
sexist behaviours. In 2023, the task was proposed again, adopting the “learning with disagreements”
paradigm for the development of the dataset and, optionally, for the evaluation of the systems [11].
EXIST 2024 [12] is the fourth edition of the sEXism Identification in Social neTworks challenge, which
presents shared tasks on sexism detection in social media. While the three previous editions focused
solely on detecting and classifying sexist textual messages, this edition adds tasks that center
on images, particularly memes. Detecting sexism in memes is challenging because memes are
multimodal. Elisabetta Fersini et al. [13] presented the first attempt to address the
automatic detection of sexist memes. The study examined both unimodal and multimodal
approaches to understand the role of textual and visual cues. The authors also released a benchmark
dataset of 800 memes, including both sexist and non-sexist content. Each meme is labeled based on
its visual and textual elements, and the dataset comprises the images along with their associated texts.
   Contrastive learning techniques have also been used to detect hateful, misogynous, or
sexist content. Jason Angel et al. [14] presented an approach for multilingual sexism identification in
tweets. They fine-tuned a multilingual RoBERTa language model, integrating contrastive learning as an
intermediate step. The competitive results they achieved show the effectiveness of contrastive learning
for sexism identification in textual data. Beyond textual data, contrastive learning has also been
used for vision-language tasks. Charic F. Cuervo and Natalie Parde [15] used CLIP [16], a model trained
with contrastive learning, to detect misogynous memes. They slightly modified the CLIP
approach so that the text content of the meme was used as the training text along
with the correct label. Lei Chen and Hou W. Chou [17] also used the CLIP model for feature extraction and
applied logistic regression to the extracted features to detect misogyny in memes. Collectively, these
studies highlight the advances made by applying contrastive learning techniques
to the detection of hateful, misogynous, and sexist content across different modalities.


3. Task Description and Dataset
The dataset consists of 4044 memes for training and 1053 memes for testing, in both English and Spanish.
We split the training dataset into training, validation, and test sets in the ratio 80:10:10.
The text of the memes had already been compiled by the organizers in a separate file. Table 1
describes the details of the dataset. We worked only on Task 4, identifying whether or not the
memes are sexist, for which binary labels were provided. The dataset follows the learning with
disagreement paradigm, in which each data point is labeled by multiple annotators and disagreements
between their annotations are retained for analysis. Demographic data, including gender, age,
ethnicity, level of education, and country of residence, is documented for every annotator.
The label assigned to each meme by each annotator is also recorded in the dataset, and a hard label was
determined through majority voting. When the votes were tied, the label
’unknown’ was assigned. Samples labeled ’unknown’ were excluded from our
training data.
    Table 1
    Dataset description
               Data Type    Total Size   Spanish   English   Sexist   Non-sexist   Unknown
                 Train        4044        2034      2010     2038       1382         624
                 Test         1053         540       513      –          –            –
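
The hard-label rule described in this section (majority voting, with ties mapped to ’unknown’ and then dropped from training) can be sketched as follows. This is a minimal illustration, not the organizers' code: the function name `hard_label`, the meme identifiers, and the six votes per meme are hypothetical.

```python
from collections import Counter


def hard_label(votes):
    """Return the majority label, or 'unknown' when the two top labels tie."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "unknown"
    return counts[0][0]


# Hypothetical annotations: one list of annotator votes per meme.
annotations = {
    "meme_a": ["YES", "YES", "YES", "YES", "NO", "NO"],
    "meme_b": ["NO", "NO", "NO", "NO", "NO", "YES"],
    "meme_c": ["YES", "YES", "YES", "NO", "NO", "NO"],  # tie
}

labels = {meme: hard_label(votes) for meme, votes in annotations.items()}

# Tied samples are excluded from the training data.
train = {meme: y for meme, y in labels.items() if y != "unknown"}

print(labels)
print(sorted(train))
```

In this toy input, the third meme receives an equal number of votes for each class, so it is labeled ’unknown’ and left out of `train`.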


4. System Overview
We implemented a contrastive learning-based strategy to classify memes as
sexist or non-sexist. Figure 1 shows the workflow of our proposed system.


Figure 1: Workflow of the Proposed System Architecture


4.1. Image and Text Encoders
We used ResNet-50 as the image encoder. ResNet-50 is a widely used deep convolutional neural network
architecture introduced by Kaiming He et al. [18], known for its effectiveness in image recognition and
computer vision tasks. It consists of 50 layers, including convolutional layers, batch normalization,
and ReLU activation functions. We fine-tuned the ResNet-50 model on the EXIST 2024 dataset before
extracting image embeddings. The torchvision library for Python was used to load the pre-trained
ResNet-50 model. We froze its layers to retain their pre-trained weights, except for the last few
layers, which were unfrozen to allow fine-tuning. To adapt the model to this task, we modified its head
by replacing the fully connected layer with additional layers, including a dropout layer for
regularization. This model was fine-tuned for 30 epochs with the Adam optimizer and later used to
extract image features.
   To obtain the text embeddings, we fine-tuned the transformer-based multilingual BERT model introduced by
Jacob Devlin et al. [19]. To fine-tune this model, each text sample is tokenized using the BERT tokenizer,
with sequences padded or truncated to the specified maximum length.
The tokenized text, along with the corresponding attention masks, token type IDs, and labels, is
converted into tensors suitable for model input. This preparation ensures that the text
data are in the format that BERT expects. The core of the
model was initialized with pre-trained weights from multilingual cased BERT, and a linear layer
was added at the model’s head to map the BERT output to the desired number of outputs, which
in this case is one (for binary classification). A sigmoid activation function was used at the output layer
to produce the binary prediction.
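
To make the padding, truncation, and attention-mask step concrete, the following toy sketch shows what happens to a token-ID sequence of arbitrary length. It does not call the BERT tokenizer: the function `pad_or_truncate`, the token IDs, the padding ID of 0, and the maximum length of 6 are illustrative assumptions.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut a sequence to max_len or right-pad it, and build its attention mask."""
    ids = list(token_ids[:max_len])
    mask = [1] * len(ids)          # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# Hypothetical token IDs for a short and a long meme text.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence leaves this step with the same length, which is what allows a batch of texts to be stacked into a single tensor; the attention mask tells the model which positions are padding.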

4.2. Projection Layer
After encoding, the image and text embeddings were passed through a projection layer for dimensionality
reduction and feature transformation. This projection layer consists of a linear
layer, a Gaussian Error Linear Unit (GELU) activation function, and fully connected layers. It
ensures that the embeddings from the two modalities are aligned and normalized, which facilitates
downstream learning and the integration of the two modalities.

4.3. Combining Features
To combine the image and text features, we used the feature interaction matrix (FIM) introduced
by Kumar and Nandakumar [20], which directly models the correlations between the text and image
features. This matrix is obtained by computing the outer product of the text and image feature vectors
of a meme. To reduce the dimensionality of the representation, the authors considered only the diagonal
elements of the FIM, and we followed the same strategy. The resulting vector has dimension n, the
dimension of the projected embeddings.
To include the annotators’ information, the tabular data describing the annotators was
concatenated with these image-text features.
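
The diagonal of the outer product of two n-dimensional vectors is simply their element-wise product, so the full matrix never needs to be built. The sketch below illustrates this and the concatenation with annotator features; the function names, the 3-dimensional vectors, and the two-value annotator encoding are hypothetical.

```python
def outer_product(text_vec, image_vec):
    """Full feature interaction matrix (n x n), shown only for comparison."""
    return [[t * v for v in image_vec] for t in text_vec]


def fim_diagonal(text_vec, image_vec):
    """Diagonal of the FIM, i.e. the element-wise product of the two vectors."""
    if len(text_vec) != len(image_vec):
        raise ValueError("text and image vectors must have the same dimension")
    return [t * v for t, v in zip(text_vec, image_vec)]


def combine(text_vec, image_vec, annotator_features):
    """Concatenate the n-dimensional FIM diagonal with the annotator features."""
    return fim_diagonal(text_vec, image_vec) + list(annotator_features)


text = [1.0, 2.0, 3.0]        # hypothetical projected text embedding (n = 3)
image = [0.5, -1.0, 2.0]      # hypothetical projected image embedding (n = 3)
annotator = [1.0, 0.0]        # hypothetical encoded annotator attributes

fim = outer_product(text, image)
diagonal = fim_diagonal(text, image)
combined = combine(text, image, annotator)

print(diagonal)
print(combined)
```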

4.4. Training and Testing
Our contrastive learning-based model is then trained on these combined features using the InfoNCE loss
function, with the objective of increasing the cosine similarity between memes of the same class
and decreasing it between memes of different classes. To evaluate the trained model, we ran it on the
evaluation dataset. First, we extracted the image and text features of the test samples using
the model encoders. Next, we computed the cosine similarity between each test sample and all the
training samples. For each test sample, we used the K-Nearest Neighbors (KNN) algorithm to select the
10 training embeddings with the highest cosine similarity to that sample. The label for each test sample
was then assigned as the most common label among these 10 training samples. We predicted the
labels for each annotator separately and then applied majority voting to these labels to obtain the final
hard label for each sample.
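
The inference procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our released code: the function names, the 2-dimensional embeddings, and the use of k = 3 in the example (instead of 10) are hypothetical, and ties among neighbours are broken in favour of the label that appears first in the similarity ranking, a choice not specified in the text.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_label(query, train_vecs, train_labels, k=10):
    """Most common label among the k training vectors most similar to query."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    neighbours = [label for _, label in ranked[:k]]
    return Counter(neighbours).most_common(1)[0][0]


def final_hard_label(per_annotator_predictions):
    """Majority vote over the labels predicted for each annotator."""
    return Counter(per_annotator_predictions).most_common(1)[0][0]


# Hypothetical training embeddings and labels.
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["YES", "YES", "YES", "NO", "NO"]

print(knn_label([1.0, 0.1], train_vecs, train_labels, k=3))
print(knn_label([0.0, 1.0], train_vecs, train_labels, k=3))
print(final_hard_label(["YES", "NO", "YES"]))
```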


5. Experimentation and Results
Our implementation uses the PyTorch library in Python. After feature extraction and concatenation,
the model was trained using the contrastive loss for 50 epochs with the Adam optimizer and a batch size
of 32. We used the transformers library to train our model, with the learning rate set to 1e-5 for the
image encoder and 1e-4 for the text encoder. A dropout layer was also added for regularization. We saved
the model with the lowest contrastive loss on the validation set during training and then used it to make
predictions on the unseen test set.
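
For reference, the contrastive loss minimized during training can be illustrated for a single anchor. The sketch below implements a standard InfoNCE form with cosine similarity, one positive (a sample of the same class), and a set of negatives (samples of the other class). The temperature of 0.1, the function names, and the 2-dimensional vectors are assumptions made for illustration; the exact formulation and hyperparameters used in our system may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log( exp(s+/t) / (exp(s+/t) + sum_i exp(s_i-/t)) )."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]

# Positive close to the anchor, negatives far away: low loss.
aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])

# Positive far from the anchor, negative close to it: high loss.
misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1]])

print(aligned, misaligned)
```

The loss falls as the anchor becomes more similar to its positive and less similar to its negatives, which is the behaviour that the validation-based model selection relies on.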
   The challenge uses the ICM metric [21] to evaluate the performance. This metric is a similarity
function that extends the concept of Pointwise Mutual Information (PMI) to measure the similarity
between the model’s predictions and the ground truth categories. The normalized ICM is calculated
by considering the "Minority class" baseline, which assigns all instances to the minority class, as the
lowest score, and the "Gold standard" as the highest score.
   Table 2 shows the results of our system under the Hard-Hard evaluation, in which
the final hard labels predicted for the samples are compared with the gold labels of the test set. The
model achieved its best scores on samples with English text, with a normalized ICM score of 0.2778 and
an F1 score of 0.5816 for the positive class.
   Table 2
   Official results of our system on Hard-Hard Evaluation
                  Language    Model             ICM-Hard    ICM-Hard Norm    F1_YES
                  All         Baseline            0.9832       1.0000        1.0000
                  All         Proposed Model     -0.4986       0.2465        0.5674
                  English     Baseline            0.9848       1.0000        1.0000
                  English     Proposed Model     -0.4377       0.2778        0.5816
                  Spanish     Baseline            0.9815       1.0000        1.0000
                  Spanish     Proposed Model     -0.5591       0.2152        0.5537
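
As a consistency check on Table 2, the sketch below rescales each raw ICM-Hard score to the unit interval. It assumes min-max scaling with the Baseline (gold standard) score G as the upper bound and -G as the lower bound; this lower bound is an assumption inferred from the reported numbers rather than a formula given by the organizers, and the function name `normalize_icm` is hypothetical.

```python
def normalize_icm(icm, gold_icm):
    """Min-max scale icm to [0, 1], assuming bounds of -gold_icm and +gold_icm."""
    lower, upper = -gold_icm, gold_icm
    return (icm - lower) / (upper - lower)


# (raw ICM-Hard, Baseline ICM-Hard, reported ICM-Hard Norm) from Table 2.
reported = {
    "All": (-0.4986, 0.9832, 0.2465),
    "English": (-0.4377, 0.9848, 0.2778),
    "Spanish": (-0.5591, 0.9815, 0.2152),
}

for language, (icm, gold, norm) in reported.items():
    recomputed = normalize_icm(icm, gold)
    print(language, round(recomputed, 4), norm)
```

Under this assumption, the recomputed values agree with the reported normalized scores up to rounding of the four-decimal inputs.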


6. Conclusion
In this paper, we present our approach and the results obtained for Task 4 of the sEXism Identification in
Social neTworks (EXIST 2024) challenge. This task involves a binary classification problem, where the
goal is to distinguish between sexist and non-sexist memes, incorporating a learning with disagreement
paradigm. We employed ResNet-50 and mBERT models to encode the visual and multilingual textual
data of the memes, respectively. After obtaining the embeddings from these models, we combined
them with the annotators’ data and trained a contrastive learning-based model on the result. The
performance of our model was evaluated on the test data using the ICM metric for hard labels, achieving
normalized ICM scores of 0.2778 for English, 0.2152 for Spanish, and 0.2465 for the combined dataset.
These results demonstrate
the effectiveness of our approach in addressing the challenge of sexism detection in a multilingual and
multimodal context.


7. Acknowledgments
We acknowledge the support of the PNRR ICSC National Research Centre for High Performance
Computing, Big Data and Quantum Computing (CN00000013), under the NRRP MUR program funded
by the NextGenerationEU.


References
 [1] A. ElBarazi, How social media affects people’s ideas on sexist behaviours and gender-based
     violence (2023). doi:10.19080/GJIDD.2023.12.555838.
 [2] C. Jennifer, F. Tahmasbi, J. Blackburn, G. Stringhini, S. Zannettou, E. D. Cristofaro, Feels bad man:
     Dissecting automated hateful meme detection through the lens of facebook’s challenge (2022).
     doi:10.36190/2022.65.
 [3] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
     R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism
     Identification and Characterization in Social Networks and Memes, in: Experimental IR Meets
     Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International
     Conference of the CLEF Association (CLEF 2024), 2024.
 [4] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
     R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism
     Identification and Characterization in Social Networks and Memes (Extended Overview), in:
     G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 –
     Conference and Labs of the Evaluation Forum, 2024.
 [5] E. Guest, B. Vidgen, A. Mittos, N. Sastry, G. Tyson, H. Z. Margetts, An expert annotated dataset
     for the detection of online misogyny, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of
     the 16th Conference of the European Chapter of the Association for Computational Linguistics:
     Main Volume, EACL 2021, Online, April 19 - 23, 2021, Association for Computational Linguistics,
     2021, pp. 1336–1350. doi:10.18653/V1/2021.EACL-MAIN.114.
 [6] E. Fersini, P. Rosso, M. Anzovino, Overview of the task on automatic misogyny identification
     at ibereval 2018, in: P. Rosso, J. Gonzalo, R. Martínez, S. Montalvo, J. C. de Albornoz (Eds.),
     Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian
     Languages (IberEval 2018) co-located with 34th Conference of the Spanish Society for Natural
     Language Processing (SEPLN 2018), Sevilla, Spain, September 18th, 2018, volume 2150 of CEUR
     Workshop Proceedings, CEUR-WS.org, 2018, pp. 214–228.
 [7] E. Fersini, F. Gasparini, G. Rizzi, A. Saibene, B. Chulvi, P. Rosso, A. Lees, J. Sorensen, Semeval-2022
     task 5: Multimedia automatic misogyny identification, in: G. Emerson, N. Schluter, G. Stanovsky,
     R. Kumar, A. Palmer, N. Schneider, S. Singh, S. Ratan (Eds.), Proceedings of the 16th International
     Workshop on Semantic Evaluation, SemEval@NAACL 2022, Seattle, Washington, United States,
     July 14-15, 2022, Association for Computational Linguistics, 2022, pp. 533–549.
     doi:10.18653/V1/2022.SEMEVAL-1.74.
 [8] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. R. Pardo, P. Rosso, M. Sanguinetti, Semeval-
     2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter, in:
     J. May, E. Shutova, A. Herbelot, X. Zhu, M. Apidianaki, S. M. Mohammad (Eds.), Proceedings of the
     13th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2019, Minneapolis,
     MN, USA, June 6-7, 2019, Association for Computational Linguistics, 2019, pp. 54–63.
     doi:10.18653/V1/S19-2007.
 [9] F. J. Rodríguez-Sanchez, J. Carrillo-de-Albornoz, L. Plaza, J. Gonzalo, P. Rosso, M. Comet, T. Donoso,
     Overview of EXIST 2021: sexism identification in social networks, Proces. del Leng. Natural 67
     (2021) 195–207.
[10] F. J. Rodríguez-Sanchez, J. Carrillo-de-Albornoz, L. Plaza, A. Mendieta-Aragón, G. M. Remón,
     M. Makeienko, M. Plaza, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2022: sexism
     identification in social networks, Proces. del Leng. Natural 69 (2022) 229–240.
[11] L. Plaza, J. Carrillo-de-Albornoz, R. Morante, E. Amigó, J. Gonzalo, D. Spina, P. Rosso, Overview
     of EXIST 2023: sexism identification in social networks, in: J. Kamps, L. Goeuriot, F. Crestani,
     M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information
     Retrieval - 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April
     2-6, 2023, Proceedings, Part III, volume 13982 of Lecture Notes in Computer Science, Springer, 2023,
     pp. 593–599. doi:10.1007/978-3-031-28241-6_68.
[12] L. Plaza, J. Carrillo-de-Albornoz, E. Amigó, J. Gonzalo, R. Morante, P. Rosso, D. Spina, B. Chulvi,
     A. Maeso, V. Ruiz, EXIST 2024: sexism identification in social networks and memes, in: N. Goharian,
     N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information
     Retrieval - 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March
     24-28, 2024, Proceedings, Part V, volume 14612 of Lecture Notes in Computer Science, Springer,
     2024, pp. 498–504. doi:10.1007/978-3-031-56069-9_68.
[13] E. Fersini, F. Gasparini, S. Corchs, Detecting sexist MEME on the web: A study on textual and
     visual cues, in: 8th International Conference on Affective Computing and Intelligent Interaction
     Workshops and Demos, ACII Workshops 2019, Cambridge, United Kingdom, September 3-6, 2019,
     IEEE, 2019, pp. 226–231. doi:10.1109/ACIIW.2019.8925199.
[14] J. Angel, S. T. Aroyehun, A. F. Gelbukh, Multilingual sexism identification using contrastive
     learning, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the
     Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th
     to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 855–861.
[15] C. F. Cuervo, N. Parde, Exploring contrastive learning for multimodal detection of
     misogynistic memes, in: G. Emerson, N. Schluter, G. Stanovsky, R. Kumar, A. Palmer, N. Schneider,
     S. Singh, S. Ratan (Eds.), Proceedings of the 16th International Workshop on Semantic Evaluation,
     SemEval@NAACL 2022, Seattle, Washington, United States, July 14-15, 2022, Association for
     Computational Linguistics, 2022, pp. 785–792. doi:10.18653/V1/2022.SEMEVAL-1.109.
[16] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
     J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language
     supervision, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference
     on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of
     Machine Learning Research, PMLR, 2021, pp. 8748–8763.
[17] L. Chen, H. W. Chou, RIT boston at semeval-2022 task 5: Multimedia misogyny detection by
     using coherent visual and language features from CLIP model and data-centric AI principle, in:
     G. Emerson, N. Schluter, G. Stanovsky, R. Kumar, A. Palmer, N. Schneider, S. Singh, S. Ratan (Eds.),
     Proceedings of the 16th International Workshop on Semantic Evaluation, SemEval@NAACL 2022,
     Seattle, Washington, United States, July 14-15, 2022, Association for Computational Linguistics,
     2022, pp. 636–641.
[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE
     Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June
     27-30, 2016, IEEE Computer Society, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[19] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers
     for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019
     Conference of the North American Chapter of the Association for Computational Linguistics:
     Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume
     1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186.
     doi:10.18653/V1/N19-1423.
[20] G. K. Kumar, K. Nandakumar, Hate-clipper: Multimodal hateful meme classification based on
     cross-modal interaction of CLIP features, CoRR abs/2210.05916 (2022).
     doi:10.48550/ARXIV.2210.05916.
[21] E. Amigó, A. D. Delgado, Evaluating extreme hierarchical multi-label classification, in: S. Muresan,
     P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association
     for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27,
     2022, Association for Computational Linguistics, 2022, pp. 5809–5819.
     doi:10.18653/V1/2022.ACL-LONG.399.