Team Aditya at EXIST 2024 – Detecting Sexism in
                         Multilingual Tweets using Contrastive Learning Approach
                         Notebook for the EXIST Lab at CLEF 2024

                         Aditya Shah1,*,†, Aditya Gokhale2,†
                         1 Pune Institute Of Computer Technology, Pune, India
                         2 Pune Institute Of Computer Technology, Pune, India


                                        Abstract
                                        Due to the growing impact of social media, the necessity for automated mechanisms that can identify sexism
                                        and other forms of disrespectful and hateful conduct is rising, aiming to create a more inclusive and respectful
                                        digital space. However, it poses significant challenges due to the variety of hate categories and the complexity of
                                        interpreting the author’s intent, particularly under the multilingual learning framework. This paper describes
                                        Team Aditya’s participation in Task 1 of the EXIST (sEXism Identification in Social neTworks) Lab at CLEF 2024. The
                                        proposed system makes use of large language models (i.e., BERTweet, mBERT and XLM-RoBERTa) for identifying
                                        sexism in English and Spanish tweets. Under the hard evaluation, we obtained an F1 score of 0.7691 using the best
                                        epoch of a model trained with XLM-RoBERTa, and ranked 14th in the task.

                                        Keywords
                                        Sexism, Disrespectful and hateful conduct, Large language models


                         1. Introduction
                         Sexism is prejudice or discrimination based on one’s sex or gender, most often directed at women.
                         This harmful mindset causes inequality, limits opportunities, and reinforces oppressive power
                         dynamics, hindering progress toward a fairer society.
                            The rise of social media platforms such as Twitter and Facebook has significantly changed how
                         people communicate. Identifying and reducing hate speech on these platforms is a daunting
                         task because of the large volume of data generated. It requires automated techniques and advanced
                         technologies that can process and classify the content efficiently.
                            The EXIST 2024 [1][2] shared task is focused on detecting sexism, which ranges from blatant misogyny
                         to more subtle, implicit forms of sexist behavior. This task differentiates itself from other related tasks
                         on sexism detection by encompassing not only posts that are explicitly identified as sexist but also posts
                         that document reported incidents of sexism.


                         2. Dataset Details
                         The dataset provided by the EXIST 2024 organisers consists of about 7K tweets, equally split between
                         Spanish and English. To mitigate label bias, the organisers considered two social and demographic
                         parameters when selecting annotators: gender (MALE/FEMALE) and age (18-22 y.o./23-45 y.o./46+ y.o.).
                         The dataset was split into train, dev, and test sets, roughly distributed as 70%, 10%, and 20%,
                         respectively, for both languages.
                           The tweets in Subtask 1 were labelled "YES" or "NO" to indicate whether they
                         conveyed a sexist meaning.


                         CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                         * Corresponding author.
                         † These authors contributed equally.
                         aditya02shah@gmail.com (A. Shah); adityangokhale@gmail.com (A. Gokhale)
                                     © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Table 1
Distribution of majority vote per language combining the Train and Dev datasets

                                   Majority Vote       English       Spanish
                                   Sexist               47.1%         55.7%
                                   Non-Sexist           52.9%         44.3%


3. System Description
To derive the definitive hard label for each tweet, we apply majority voting over the annotations provided
by the six annotators: the label is set to YES only if 3 or more of the 6 annotators chose YES, and to NO
otherwise. Prior to training the model, preprocessing steps were carried out to remove emojis, URLs and
user mentions from the samples, so that these tokens would not introduce undue bias.
  We conducted experiments with various models and found XLM-R to be the most effective, particularly
due to its strong performance on multilingual data, as shown in Table 2.

Table 2
Performance of Different Models on the Dev Set
                                            Model               F1
                                            BERTweet        0.7858
                                            RoBERTa         0.7463
                                            XLM-R-Large     0.7961
                                            UML-T5          0.7683
                                            mBERT           0.7432

   We fine-tune XLM-RoBERTa Large, a multilingual version of RoBERTa trained on 2.5TB of filtered
CommonCrawl data. This allows us to handle both English and Spanish samples with a single
model.
   The model was trained using contrastive learning, enabling it to differentiate between samples
effectively. It learns an embedding space where similar pairs are positioned in close proximity, while
dissimilar pairs are distinctly separated.
   To improve the representation of each example in a batch we created label-aware embeddings by
prefixing the text with its corresponding label [3]. A contrastive loss function was then used to align the
text features closer to the representations of their correct labels, to improve classification capabilities of
the system.
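The idea of pulling text features toward the representation of the correct label can be illustrated with a small, dependency-free sketch. It is a simplified stand-in and not the dual contrastive loss of [3] nor our training code: the label words, the separator token, the temperature `tau` and both function names are assumptions made for the example, and the feature vectors would in practice come from the encoder.

```python
import math

# Hypothetical label words prepended to every input.
LABEL_WORDS = ("sexist", "non-sexist")


def add_label_prefix(text, label_words=LABEL_WORDS, sep=" </s> "):
    """Build a label-aware input by prefixing the text with the label words."""
    return " ".join(label_words) + sep + text


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def label_alignment_loss(text_feats, label_feats, targets, tau=0.1):
    """Mean InfoNCE-style loss: each text feature should be closer to the
    representation of its own label than to the other label representations."""
    total = 0.0
    for z, y in zip(text_feats, targets):
        logits = [_dot(z, e) / tau for e in label_feats]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[y]
    return total / len(text_feats)


label_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = label_alignment_loss(text_feats, label_feats, [0, 1])
misaligned = label_alignment_loss(text_feats, label_feats, [1, 0])
print(aligned < misaligned)
```

In this toy case the loss is close to zero when every text feature coincides with its own label representation, and large when the targets are swapped, which is the behaviour the contrastive objective is meant to encourage.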
   We utilized three datasets during the finetuning process: The training dataset was used to learn
an embedding space using contrastive learning. The validation dataset was used to retain the most
effective checkpoint and the test dataset was used to evaluate performance on unseen data.
   We submitted three systems for evaluation. Each system was trained with the following
hyperparameters:
    • Batch size: 16
    • Learning rate: 1e-6
    • Dropout: 0.35
  These hyperparameters were chosen using the Optuna library: we ran 50 trials with varying
configurations and selected the configuration with the best performance on the validation set.
  The three submitted systems are:
   1. ADITYA_1: The model was trained for 30 epochs on the combined train and dev datasets.
      The checkpoint from the best epoch was saved based on the system’s performance on the test set.
      The saved model was used to make predictions on the unseen test set.
   2. ADITYA_2: The model was trained for 30 epochs on the train set. The checkpoint from the best
      epoch was saved based on the system’s performance on the test set. The saved model was used
      to make predictions on the unseen test set.
   3. ADITYA_3: The model was trained for 12 epochs on the combined train and dev datasets. This
      model was used to make predictions on the unseen test set.
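The hyperparameter search described in this section could be set up roughly as follows. This is a sketch under stated assumptions rather than our actual script: the search ranges, the helper names `make_objective` and `run_search`, and the `train_and_eval` callback (which should fine-tune the model with the given parameters and return the validation F1) are all hypothetical. Only the three tuned hyperparameters and the 50 trials come from the text.

```python
def make_objective(train_and_eval):
    """Wrap a training callback into an Optuna objective.

    `train_and_eval` is a hypothetical user-supplied function that takes a
    dict of hyperparameters and returns the validation score to maximise.
    """
    def objective(trial):
        params = {
            # Assumed search ranges, for illustration only.
            "batch_size": trial.suggest_categorical("batch_size", [8, 16, 32]),
            "learning_rate": trial.suggest_float(
                "learning_rate", 1e-6, 1e-4, log=True
            ),
            "dropout": trial.suggest_float("dropout", 0.1, 0.5),
        }
        return train_and_eval(params)

    return objective


def run_search(train_and_eval, n_trials=50):
    """Run the study and return the best hyperparameters found."""
    import optuna  # third-party dependency, imported only when searching

    study = optuna.create_study(direction="maximize")
    study.optimize(make_objective(train_and_eval), n_trials=n_trials)
    return study.best_params
```

Calling `run_search(train_and_eval)` with a real training callback returns a dictionary with the same three keys as the hyperparameter list above.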


4. Results
For EXIST Task 1 we submitted the three systems described in Section 3. The evaluation metric used for
these systems was the ICM metric.
The ICM metric [4] is a similarity function that generalizes Pointwise Mutual Information (PMI) to
compute the similarity between a model’s output and the ground truth categories. To calculate the
normalized ICM, the "Minority class" baseline (that classifies all instances as the minority class) is
considered the lowest score (i.e., 0) and the "Gold standard" is considered the highest score (i.e., 1).
Additionally, sexism identification models can provide two types of outputs: "hard" labels that
classify samples as sexist or not sexist, and "soft" labels that specify a value between 0 and 1
measuring "the degree of sexism" involved in the sample.
These labels are used to evaluate the models under two schemes, described as follows:
   • Hard-hard evaluation: the ICM similarity between the hard system output and hard ground truth
   • Soft-soft evaluation: the ICM similarity between the soft system output and the soft ground truth
All three of our submitted systems generated hard labels and were therefore evaluated under the
hard-hard scheme. A summary of our results is presented in Table 3.

Table 3
Results for the hard-hard evaluation for Task 1

                Run             Rank      ICM-Hard        ICM-Hard Norm           F1_YES
                Gold               0         0.9948              1.0000            1.0000
                Best Score         1         0.5973              0.8002            0.7944
                ADITYA_1          34         0.4680              0.7320            0.7447
                ADITYA_2          18         0.5246              0.7636            0.7669
                ADITYA_3          14         0.5418              0.7723            0.7691
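The normalisation of ICM onto the [0, 1] range can be written as a simple min-max rescaling between a lower bound and the gold-standard score. The sketch below is illustrative only: the function name is ours, and the lower bound used in the usage lines (the negated gold ICM, -0.9948) is an assumption that is not stated in the text, although it reproduces the normalised values reported in Table 3 for the Best Score and ADITYA_3 rows.

```python
def normalize_icm(icm, upper, lower):
    """Rescale an ICM score to [0, 1] given the best and worst reference scores."""
    if upper <= lower:
        raise ValueError("upper bound must be greater than lower bound")
    scaled = (icm - lower) / (upper - lower)
    return min(1.0, max(0.0, scaled))  # clip scores that fall outside the range


GOLD = 0.9948           # ICM-Hard of the gold standard (Table 3)
ASSUMED_LOWER = -GOLD   # assumption: lower bound taken as the negated gold ICM

print(round(normalize_icm(0.5973, GOLD, ASSUMED_LOWER), 4))  # Best Score row
print(round(normalize_icm(0.5418, GOLD, ASSUMED_LOWER), 4))  # ADITYA_3 row
```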


5. Related Works
The rise of social media has led to an increase in sexist content in society, necessitating the development
of automated systems to detect and counteract sexism. However, the discrepancy in the composition
of the tweets and the multilingual nature of the dataset cause problems. To address these problems,
we used pre-processing to improve the effectiveness of our system. To solve the sexism identification
challenge, contrastive learning with a RoBERTa language model has been used in [5]. Previous research
has shown that deep learning algorithms, such as those used in [6], can outperform machine learning
models for sexism detection in Spanish datasets collected from Twitter. Other studies, such as [7] and
[8], have applied multilingual transformer models, including multilingual BERT and XLM-R, to detect
sexism in multiple languages. Meanwhile, [9] has used pre-trained transformers for sexism detection in
low-resource languages such as Romanian, and [10] has employed ensemble models for multilingual
classification. Despite these efforts, there has been relatively limited research on modelling and
analyzing sexism in datasets that contain both Spanish and English content, highlighting the need for
further research in this area.


6. Conclusion and Future Scope
This paper presents the participation of team Aditya in Task 1 of the EXIST 2024 [2] lab at CLEF,
which focuses on sexism identification. We investigated a contrastive learning based approach for
fine-grained analysis. The use of contrastive learning improved the classification capabilities of our
models.
Throughout our experimentation, we evaluated various models, including BERTweet, mBERT and T5.
However, XLM-RoBERTa Large consistently outperformed its counterparts and gave our best
results.

There is considerable scope for further progress in this field. In future work, we would like to refine
and enhance our approach with different contrastive learning strategies, to improve the model’s ability
to distinguish between different classes. This approach can also be used in multiclass
classification and multilabel classification problems. Large language models trained specifically on
Spanish text could also be utilized to further improve performance.


References
 [1] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
     R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism
     Identification and Characterization in Social Networks and Memes, in: Experimental IR Meets
     Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International
     Conference of the CLEF Association (CLEF 2024), 2024.
 [2] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
     R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism
     Identification and Characterization in Social Networks and Memes (Extended Overview), in:
     G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 –
     Conference and Labs of the Evaluation Forum, 2024.
 [3] Q. Chen, R. Zhang, Y. Zheng, Y. Mao, Dual contrastive learning: Text classification via label-aware
     data augmentation, 2022. URL: https://arxiv.org/abs/2201.08702. arXiv:2201.08702.
 [4] E. Amigo, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: S. Muresan,
     P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for
     Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics,
     Dublin, Ireland, 2022, pp. 5809–5819. URL: https://aclanthology.org/2022.acl-long.399.
     doi:10.18653/v1/2022.acl-long.399.
 [5] J. Angel, S. Aroyehun, A. Gelbukh, Multilingual sexism identification using contrastive learning,
     Working Notes of CLEF (2023).
 [6] F. Rodríguez-Sánchez, J. Carrillo-de-Albornoz, L. Plaza, Automatic classification of sexism in social
     networks: An empirical study on twitter data, IEEE Access 8 (2020) 219563–219576.
     doi:10.1109/ACCESS.2020.3042604.
 [7] M. Schütz, J. Boeck, D. Liakhovets, D. Slijepčević, A. Kirchknopf, M. Hecht, J. Bogensperger,
     S. Schlarb, A. Schindler, M. Zeppelzauer, Automatic sexism detection with multilingual transformer
     models, arXiv preprint arXiv:2106.04908 (2021).
 [8] H. H. Hemati, S. H. Alavian, H. Beigy, H. Sameti, Sutnlp at semeval-2023 task 10: Rlat-transformer
     for explainable online sexism detection, in: Proceedings of the 17th International Workshop on
     Semantic Evaluation (SemEval-2023), 2023, pp. 347–356.
 [9] A. Moldovan, K. Csürös, A.-m. Bucur, L. Bercuci, Users hate blondes: Detecting sexism in
     user comments on online Romanian news, in: K. Narang, A. Mostafazadeh Davani, L. Mathias,
     B. Vidgen, Z. Talat (Eds.), Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH),
     Association for Computational Linguistics, Seattle, Washington (Hybrid), 2022, pp. 230–230. URL:
     https://aclanthology.org/2022.woah-1.21. doi:10.18653/v1/2022.woah-1.21.
[10] A. F. M. de Paula, R. F. da Silva, I. B. Schlicht, Sexism prediction in spanish and english tweets
     using monolingual and multilingual bert and ensemble models, 2021. arXiv:2111.04551.