<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Bilingual Sexism Classification: Fine-Tuned XLM-RoBERTa and GPT-3.5 Few-Shot Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>AmirMohammad Azadi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Baktash Ansari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sina Zamani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sauleh Eetemadi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Iran University of Science and Technology</institution>
          ,
          <addr-line>Tehran</addr-line>
          ,
          <country country="IR">Iran</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Sexism in online content is a pervasive issue that necessitates effective classification techniques to mitigate its harmful impact. Online platforms often have sexist comments and posts that create a hostile environment, especially for women and minority groups. This content not only spreads harmful stereotypes but also causes emotional harm. Reliable methods are essential to find and remove sexist content, making online spaces safer and more welcoming. Therefore, the sEXism Identification in Social neTworks (EXIST) challenge addresses this issue at CLEF 2024. This study aims to improve sexism identification in bilingual contexts (English and Spanish) by leveraging natural language processing models. The tasks are to determine whether a text is sexist and what the source intention behind it is. We fine-tuned the XLM-RoBERTa model and separately used GPT-3.5 with few-shot learning prompts to classify sexist content. The XLM-RoBERTa model exhibited robust performance in handling complex linguistic structures, while GPT-3.5's few-shot learning capability allowed for rapid adaptation to new data with minimal labeled examples. Our approach using XLM-RoBERTa achieved 4th place in the soft-soft evaluation of Task 1 (sexism identification). For Task 2 (source intention), we achieved 2nd place in the soft-soft evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>Sexism Characterization</kwd>
        <kwd>Multilingual Natural Language Processing</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Transformer-based Models</kwd>
        <kwd>Few-Shot Learning</kwd>
        <kwd>Learning with Disagreement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Sexism, defined as unfair treatment or prejudice based on a person’s sex or gender, is a serious global
issue, especially prevalent online where social media can easily spread sexist ideas. Such content harms
society, particularly women, by causing emotional distress and promoting gender inequality. Accurate
detection and classification of sexist content are essential for making online spaces safer and more
inclusive. This research aims to improve the detection and understanding of sexist language online,
helping platforms reduce harmful content and promote respectful digital environments. Effective
sexism classification supports content moderation and studies on gender-based discrimination in digital
communication.</p>
      <p>
        Our study is part of the EXIST 2024 (sEXism Identification in Social Networks) [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] shared task,
which aims to improve automated sexism detection. Our research focuses on two main tasks: sexism
identification and intention detection. Sexism identification involves deciding whether a text contains sexist
content, while intention detection tries to understand the purpose behind a sexist remark, categorizing
it as one of three types: "Direct", "Reported", or "Judgemental". For example, "Judgemental" indicates that
the intention was to judge, since the tweet describes sexist situations or behaviors with the aim of
condemning them.
      </p>
      <p>These tasks are crucial for developing systems that can detect and understand sexist language in
context.</p>
      <p>To address these challenges, we used two natural language processing techniques: fine-tuning XLM-RoBERTa
[3] and few-shot learning with GPT-3.5. XLM-RoBERTa, a multilingual variant of the RoBERTa
[4] model, is fine-tuned on the dataset to better recognize and classify sexist content. This method
builds on the model’s pre-training on a large, diverse multilingual corpus, which makes it well suited to
complex language patterns. We also used GPT-3.5 through few-shot learning, which means giving the
model a few English and Spanish tweets from the dataset in each prompt to help it adapt to the specific
task. This approach takes advantage of GPT-3.5’s large-scale training and its ability to understand
context and generate annotations with little extra data.</p>
      <p>The rest of this paper is organized as follows. Section 2 describes the dataset used in our
study and explains its structure. Section 3 details our methodology, including the specific setups for
XLM-RoBERTa fine-tuning and GPT-3.5 few-shot learning. Section 4 presents the results of our
experiments, compares the performance of both methods using various measures, and analyzes how well
they detect and classify sexist content. Finally, the concluding sections discuss our findings, suggest
potential improvements, and outline directions for future research in automated sexism detection.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>To address label bias in the annotation process, which can arise from socio-demographic differences
among annotators or from subjective labeling, the EXIST campaign records several demographic
parameters of the annotators: gender, age, country, study level, and ethnicity. Each tweet was annotated by six
crowdsourcing annotators selected through Prolific, following guidelines from gender experts.</p>
      <p>The EXIST 2024 dataset incorporates multiple types of sexist expressions, including descriptive or
reported assertions where the sexist message is a description of sexist behavior. In particular, the dataset
is composed of more than 10,000 tweets both in English and Spanish, divided into a test set (2,076
tweets), a development set (1,038 tweets), and a training set (6,920 tweets).</p>
      <p>For each sample, the following attributes are provided in JSON format:
• "id_EXIST": a unique identifier for the tweet
• "lang": the language of the text (“en” or “es”)
• "tweet": the text of the tweet
• "number_annotators": the number of persons that have annotated the tweet
• "annotators": a unique identifier for each of the annotators
• "gender_annotators": the gender of the different annotators. Possible values are: “F” and “M”,
for female and male respectively
• "age_annotators": the age group of the different annotators. Possible values are: 18-22, 23-45,
and 46+
• "ethnicity_annotators": the self-reported ethnicity of the different annotators. Possible values
are: “Black or African America”, “Hispano or Latino”, “White or Caucasian”, “Multiracial”, “Asian”,
“Asian Indian” and “Middle Eastern”
• "study_level_annotators": the self-reported level of study achieved by the different
annotators. Possible values are: “Less than high school diploma”, “High school degree or equivalent”,
“Bachelor’s degree”, “Master’s degree” and “Doctorate”
• "country_annotators": the self-reported country where the different annotators live
• "labels_task1": a set of labels (one for each of the annotators) that indicate whether the tweet contains
sexist expressions or refers to sexist behaviors. Possible values are: “YES” and “NO”
• "labels_task2": a set of labels (one for each of the annotators) recording the intention of the
person who wrote the tweet. Possible labels are: “DIRECT”, “REPORTED”, “JUDGEMENTAL”, “-”,
and “UNKNOWN”
• "labels_task3": a set of arrays of labels (one array for each of the annotators) indicating the type
or types of sexism that are found in the tweet. Possible labels are: “IDEOLOGICAL-INEQUALITY”,
“STEREOTYPING-DOMINANCE”, “OBJECTIFICATION”, “SEXUAL-VIOLENCE”,
“MISOGYNY-NON-SEXUAL-VIOLENCE”, “-”, and “UNKNOWN”
• "split": the subset within the dataset the tweet belongs to (“TRAIN”, “DEV”, “TEST” + “EN”/“ES”)
</p>
      <p>In sexism identification, natural language expressions often do not have a single, clear interpretation.
To address this, the learning with disagreements paradigm allows systems to learn from datasets that
include all annotator opinions rather than a single aggregated label. Following this paradigm, the dataset
provides all annotations per instance from six different annotators, capturing the diversity of views. We
determined the final label using a majority voting method, ensuring that the most commonly assigned
label among annotators represents the classification.</p>
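      <p>As a minimal sketch of this aggregation, the following Python snippet derives a soft label (the share of annotators per class) and a majority-vote hard label from the "labels_task1" field. The record shown is hypothetical, the function names are illustrative, and returning no label on a tie is an assumption, since the tie-breaking rule is not specified here.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def soft_label(votes):
    """Return the fraction of annotators that chose each label."""
    counts = Counter(votes)
    total = len(votes)
    return {label: count / total for label, count in counts.items()}


def hard_label(votes):
    """Return the majority label, or None when the top two labels are tied."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: a tie yields no hard label
    return ranked[0][0]


# Hypothetical record that follows the attribute names listed above.
record = {
    "id_EXIST": "000001",
    "lang": "en",
    "labels_task1": ["YES", "YES", "NO", "YES", "NO", "YES"],
}

soft = soft_label(record["labels_task1"])
hard = hard_label(record["labels_task1"])
tied = hard_label(["YES", "NO", "YES", "NO", "YES", "NO"])
print(hard, soft)
```
]]></preformat>
      <p>Keeping the soft distribution next to the hard label preserves the annotator disagreement that the soft-soft evaluation relies on.</p>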
      <p>It should be noted that for Tasks 2 and 3, hard labels are assigned exclusively to tweets identified as
sexist (label "YES" for Task 1). Tweets not categorized as sexist receive the label "-", and those lacking
a label from annotators are marked as "UNKNOWN". The test set contains only the following
attributes: "id_EXIST", "lang", "tweet", and "split".</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this study, we employed two distinct methodologies to tackle the challenge of characterizing sexism
on social networks. The first approach involved fine-tuning several state-of-the-art transformer models,
namely XLM-RoBERTa, mBERT [5], DeBERTa [6], and BERTIN [7], on the provided dataset. The
second approach leveraged the few-shot learning capabilities of GPT-3.5. Below, we provide detailed
descriptions of each approach.</p>
      <sec id="sec-3-1">
        <title>3.1. Fine-Tuning Pre-trained Transformer Models</title>
        <p>This section describes adapting transformer models like XLM-RoBERTa and mBERT for sexism detection
through hyper-parameter tuning and optimization techniques to improve performance. The models are
evaluated using accuracy, precision, recall, and F1-score.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Model Selection and Fine-Tuning</title>
          <p>We selected several pre-trained transformer models for fine-tuning. The models are as follows:
• XLM-RoBERTa: A multilingual variant of RoBERTa, trained on 100 languages, known for robust
performance across various multilingual benchmarks
• mBERT: Multilingual BERT, trained on Wikipedia pages from 104 languages, capable of
processing multiple languages simultaneously
• DeBERTa: An improved version of BERT with disentangled attention and an enhanced mask
decoder, capturing word dependencies more effectively
• BERTIN: A Spanish language model based on BERT, pre-trained on a large corpus of Spanish
texts, tailored for Spanish NLP tasks</p>
          <p>Table 1: Models compared during fine-tuning: XLM-RoBERTa/raw, XLM-RoBERTa/param tuning, mBERT, DeBERTa, and BERTIN.</p>
          <p>Each model was fine-tuned on the training set using the following steps:
1. Training Setup: The models were initialized with pre-trained weights and adapted to our specific
task.
2. Hyper-Parameter Tuning: The best performing model, XLM-RoBERTa, as shown in Table 1,
underwent extensive hyper-parameter tuning. Tuned parameters included the following:
• Learning rate
• Weight decay
• Number of epochs
3. Optimization and Early Stopping: We used the AdamW optimizer along with early stopping
to prevent overfitting. A learning rate scheduler was employed to adjust the learning rate
dynamically during training.</p>
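          <p>The early-stopping rule mentioned in the last step can be illustrated with a short, framework-independent Python sketch. The monitored quantity (validation loss), the patience value, and the function name are assumptions made for illustration; the settings actually used are not reported here.</p>
          <preformat><![CDATA[
```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs. If that never
    happens, the index of the last epoch is returned.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > min_delta:
            best = loss       # new best validation loss
            bad_epochs = 0
        else:
            bad_epochs += 1   # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses, one value per epoch.
losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.49]
stop = early_stopping_epoch(losses, patience=2)
print(stop)
```
]]></preformat>
          <p>With a patience of 2, the run above stops at epoch index 4, because epochs 3 and 4 both fail to improve on the best loss reached at epoch 2.</p>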
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Model Evaluation and Analysis</title>
          <p>To evaluate the model, we used the validation set to monitor performance metrics such as accuracy,
precision, recall, and F1-score. Additionally, we analyzed mislabeled tweets to understand the sources
of error and identify patterns that could inform further improvements.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Label Extraction</title>
          <p>We generated two types of labels for output including the following:
• Hard Labels: Direct output from the model indicating the predicted class
• Soft Labels: Probabilities for each class obtained by applying the softmax function to the last
layer’s output. This was calculated by extracting the logits from the final layer and normalizing
them</p>
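          <p>The label extraction described above can be sketched in plain Python as follows. The class order in LABELS and the example logits are assumptions chosen for illustration, not values taken from our model.</p>
          <preformat><![CDATA[
```python
import math

LABELS = ["NO", "YES"]  # assumed class order of the classification head


def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def extract_labels(logits):
    """Return (hard label, soft label dictionary) for one example."""
    probs = softmax(logits)
    soft = dict(zip(LABELS, probs))
    hard = max(soft, key=soft.get)  # class with the highest probability
    return hard, soft


# Hypothetical logits from the final layer for a single tweet.
hard, soft = extract_labels([0.2, 1.3])
print(hard, soft)
```
]]></preformat>
          <p>The hard label is simply the class with the highest probability, so both outputs come from the same forward pass.</p>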
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Few-Shot Learning with GPT-3.5</title>
        <p>This section explains how we used GPT-3.5 for sexism classification with few-shot learning, which relies
on a handful of labeled examples in the prompt instead of additional training. It focuses on prompt design
and on evaluation metrics such as accuracy, with attention to handling multilingual input.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Prompt Design</title>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Model Execution</title>
          <p>We employed few-shot learning with GPT-3.5, leveraging its ability to understand context with minimal
training examples. For each prompt, we randomly selected 3 English and 3 Spanish tweets from the
training dataset, including the annotator votes to incorporate the learning with disagreement method.
Given the constraints of GPT-3.5 in providing probability scores, we only extracted hard labels from its
outputs. The prompts were designed to include the following:
• The tweet text
• Annotator votes, highlighting the disagreement and consensus among human annotators
• A clear task description asking GPT-3.5 to classify the tweet</p>
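          <p>A prompt of this shape could be assembled as in the sketch below. The task wording, the helper names, and the placeholder tweets are hypothetical and do not reproduce our exact prompt; the snippet only shows the sampling of 3 English and 3 Spanish examples together with their annotator votes, and it makes no API call.</p>
          <preformat><![CDATA[
```python
import random

# Hypothetical task description; the wording is an assumption.
TASK_DESCRIPTION = "Decide whether the final tweet is sexist. Answer with YES or NO."


def format_example(item):
    """Render one training tweet together with its annotator votes."""
    votes = ", ".join(item["labels_task1"])
    return f'Tweet: {item["tweet"]}\nAnnotator votes: {votes}'


def build_prompt(train, target_tweet, per_lang=3, seed=None):
    """Sample `per_lang` tweets per language and build a few-shot prompt."""
    rng = random.Random(seed)
    shots = []
    for lang in ("en", "es"):
        pool = [t for t in train if t["lang"] == lang]
        shots.extend(rng.sample(pool, per_lang))
    parts = [TASK_DESCRIPTION]
    parts.extend(format_example(s) for s in shots)
    parts.append(f"Tweet: {target_tweet}\nAnswer:")
    return "\n\n".join(parts)


# Placeholder training records that use the dataset's attribute names.
train = [
    {
        "tweet": f"placeholder {lang} tweet {i}",
        "lang": lang,
        "labels_task1": ["YES", "NO", "YES", "YES", "NO", "YES"],
    }
    for lang in ("en", "es")
    for i in range(5)
]

prompt = build_prompt(train, "placeholder target tweet", seed=0)
print(prompt)
```
]]></preformat>
          <p>Fixing the seed makes the sampled examples reproducible, which helps when comparing prompt variants.</p>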
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Evaluation</title>
          <p>We assessed GPT-3.5’s performance using the same metrics as for the transformer models. Given the
nature of few-shot learning, the evaluation was primarily focused on accuracy and the ability of the
model to handle multilingual input with minimal examples.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we present the results of our sexism detection methodologies on social networks. We
evaluated the performance using various metrics, including ICM-Soft, ICM-Soft Norm, Cross Entropy
for soft labels, and ICM-Hard, ICM-Hard Norm, and F1 for hard labels. The tables below summarize the
performance across all data, English tweets, and Spanish tweets. The baselines used for comparison are
as follows:
• EXIST2024-test_gold: since the ICM measure is unbounded, a baseline that perfectly predicts
the ground truth is considered to provide the best possible reference.
• EXIST2024-test_majority-class: non-informative baseline that classifies all instances as the
majority class
• EXIST2024-test_minority-class: non-informative baseline that classifies all instances as the
minority class</p>
      <p>For all tasks and evaluation types (hard-hard and soft-soft), the official metric used is the Information
Contrast Measure (ICM). ICM is a similarity function that generalizes Pointwise Mutual Information
(PMI) and evaluates system outputs in classification problems by computing their similarity to the
ground truth categories [8].
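</p>
      <sec id="sec-4-1">
        <title>4.1. Task 1</title>
        <p>This task involves determining whether a given tweet contains sexist content, evaluated through various
metrics. The tables present performance metrics for different models on overall data, English tweets,
and Spanish tweets. Metrics measure the accuracy and similarity of the models’ predictions to the
ground truth. The following tables are the results for Task 1 in three categories: overall results
in Table 2, English tweets in Table 3, and Spanish tweets in Table 4.</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th>ICM-Soft</th><th>ICM-Soft Norm</th><th>Cross Entropy</th><th>ICM-Hard</th><th>ICM-Hard Norm</th></tr>
            </thead>
            <tbody>
              <tr><td>3.12</td><td>1.00</td><td>0.55</td><td>0.99</td><td>1.00</td></tr>
              <tr><td>-2.36</td><td>0.12</td><td>4.61</td><td>-0.44</td><td>0.28</td></tr>
              <tr><td>-3.07</td><td>0.01</td><td>5.36</td><td>-0.57</td><td>0.21</td></tr>
              <tr><td>0.82</td><td>0.63</td><td>0.98</td><td>0.55</td><td>0.78</td></tr>
              <tr><td></td><td></td><td>-</td><td>0.35</td><td>0.67</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap>
          <table>
            <thead>
              <tr><th>ICM-Soft</th><th>ICM-Soft Norm</th><th>Cross Entropy</th><th>ICM-Hard</th><th>ICM-Hard Norm</th></tr>
            </thead>
            <tbody>
              <tr><td>3.11</td><td>1.00</td><td>0.58</td><td>0.98</td><td>1.00</td></tr>
              <tr><td>-2.20</td><td>0.15</td><td>4.22</td><td>-0.40</td><td>0.30</td></tr>
              <tr><td>-3.82</td><td>0.00</td><td>5.57</td><td>-0.66</td><td>0.16</td></tr>
              <tr><td>0.66</td><td>0.60</td><td>1.02</td><td>0.55</td><td>0.78</td></tr>
              <tr><td></td><td></td><td>-</td><td>0.37</td><td>0.69</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task 2</title>
        <p>This task aims to determine the intention behind sexist remarks in tweets. The tables show the
performance of different models on overall data, English tweets, and Spanish tweets, using the same
metrics as in Task 1. These metrics assess how well the models can categorize tweets based on the
perceived intention, such as whether the remark is direct, reported, or judgemental. The following tables
are the results for Task 2 in three categories: overall results in Table 5, English tweets in Table
6, and Spanish tweets in Table 7.</p>
      </sec>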
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we tackled the challenge of detecting and classifying sexist content in bilingual contexts
(English and Spanish) using advanced natural language processing techniques. We fine-tuned the
XLM-RoBERTa model and leveraged GPT-3.5 for few-shot learning to address the EXIST 2024 shared
tasks. Our results demonstrated the robustness of the XLM-RoBERTa model in handling complex
linguistic structures and the adaptability of GPT-3.5 with minimal labeled examples. Specifically, our
XLM-RoBERTa model achieved 4th place in the soft-soft evaluation of Task 1 (sexism identification) and
2nd place in the soft-soft evaluation of Task 2 (source intention). These results highlight the effectiveness
of transformer-based models and few-shot learning in addressing the nuances of sexist language in
social media content.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>While our approaches yielded promising results, we have identified several areas for future improvement
and research:
• Enhanced Few-Shot Prompting: For few-shot prompting, instead of selecting samples
randomly, we plan to export the embeddings of the samples from our fine-tuned XLM-RoBERTa
model. Using cosine similarity, we will identify the most similar and least similar samples from
the training set for each sample in the test set. Although we had already retrieved the 10 most similar
and 10 least similar samples for each test sample, we did not have sufficient time to run inference. This
method could potentially improve the performance of GPT-3.5 in few-shot learning scenarios.
• Data Augmentation: We aim to gather additional sexist data for future experiments. Data
augmentation techniques, such as replacing certain words with synonyms or translating English tweets
to Spanish and vice versa, can be employed to enhance the dataset’s diversity and robustness.
• Fine-Tuning Stronger Models: In future work, we plan to fine-tune stronger models than
XLM-RoBERTa, such as newer versions of transformer models or other advanced architectures,
to further boost performance in sexism detection and classification tasks.
• Incorporating Annotators’ Demographic Information: We aim to use the demographic
information about annotators that is provided in the dataset. This includes details such as gender,
age, ethnicity, and education level. Incorporating these demographic attributes can provide
several advantages:
– Bias Mitigation: Understanding the demographic background of annotators can help
identify and mitigate biases in the annotations, leading to fairer and more balanced models.
– Enhanced Model Performance: Demographic information can provide additional
context that may improve the model’s understanding of nuanced language use and cultural
differences, thereby enhancing its classification accuracy.
– Richer Insights: Including demographic data allows for a more detailed analysis of how
different groups perceive and annotate sexist content, contributing to more comprehensive
insights into sexism detection.</p>
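      <p>The similarity-based example selection proposed under Enhanced Few-Shot Prompting can be sketched as follows. The two-dimensional vectors stand in for embeddings exported from the fine-tuned model and are purely hypothetical, as are the function names; a real setup would use the model's embeddings and a larger k such as 10.</p>
      <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, candidates, k):
    """Return indices of the k most and the k least similar candidates."""
    order = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    return order[:k], order[-k:]


# Hypothetical training-set embeddings and one test-sample embedding.
train_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
test_embedding = [1.0, 0.0]

most, least = rank_by_similarity(test_embedding, train_embeddings, k=2)
print(most, least)
```
]]></preformat>
      <p>The returned indices point back into the training set, so the selected tweets and their annotator votes can replace the randomly sampled examples in the prompt.</p>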
      <p>By addressing these areas, we aim to further refine our methodologies and contribute to the
development of more effective and robust systems for automated sexism detection in online content.</p>
      <p>[3] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, 2020.
arXiv:1911.02116.</p>
      <p>[4] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.</p>
      <p>[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, 2019. arXiv:1810.04805.</p>
      <p>[6] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, 2021.
arXiv:2006.03654.</p>
      <p>[7] J. de la Rosa, E. G. Ponferrada, P. Villegas, P. G. de Prado Salas, M. Romero, M. Grandury,
BERTIN: Efficient pre-training of a Spanish language model using perplexity sampling, 2022.
arXiv:2207.06814.</p>
      <p>[8] E. Amigó, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: S.
Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational
Linguistics, Dublin, Ireland, 2022, pp. 5809–5819. URL: https://aclanthology.org/2022.acl-long.399.
doi:10.18653/v1/2022.acl-long.399.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <article-title>Overview of EXIST 2024 - Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <article-title>Overview of EXIST 2024 - Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview)</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>