I2C-UHU at EXIST2024: Learning from Divergence and Perspectivism for Sexism Identification and Source Intent Classification

Manuel Guerrero-García*, Manuel Cerrejón-Naranjo, Jacinto Mata-Vázquez and Victoria Pachón-Álvarez
I2C Research Group, University of Huelva, Spain
manuel.guerrero790@alu.uhu.es (M. Guerrero-García); manuel.cerrejon886@alu.uhu.es (M. Cerrejón-Naranjo); mata@uhu.es (J. Mata-Vázquez); vpachon@dti.uhu.es (V. Pachón-Álvarez)

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France

Abstract
In this paper, we present the contributions of the I2C-UHU team to the EXIST2024 Lab at CLEF 2024, focusing on the identification of sexism and the classification of source intent in social media texts. State-of-the-art transformer models are employed to address the complex and nuanced nature of sexist language. We adopt a two-fold approach: first, classifying tweets as sexist or non-sexist, and second, categorizing sexist tweets by intent. Our approach, employing Learning with Disagreement, incorporates diverse perspectives from multiple annotators, enhancing the robustness and accuracy of our models. We detail our data preprocessing, augmentation techniques, and hyperparameter optimization strategies. Our results in the competition demonstrated effectiveness, with our entries achieving positive rankings in the two tasks in which we participated. In Task 1, we secured the 10th position out of 70 participants on the hard labels leaderboard and the 13th position out of 40 for soft labels. In Task 2, we achieved the 11th position out of 46 participants for hard labels and the 17th position out of 35 in the best run for soft labels. Our findings provide a foundation for future research and practical applications in social media moderation and policy-making.

Keywords
Sexism identification, Learning with disagreement, Transformer models, Natural language processing

1. Introduction

In the EXIST2024 Lab at CLEF 2024 [1], the I2C-UHU team addressed sexism on social media platforms through binary classification of tweets and classification based on author intent. The first task distinguishes between sexist and non-sexist content, which is crucial for filtering harmful language, while the second task classifies sexist tweets into direct, reported, and judgemental categories, providing deeper insight into how sexism manifests. Using transformer models and data augmentation, our approach aims for robustness and generalizability. By implementing "Learning with Disagreement" [2], we capture diverse perspectives from human annotators, enhancing model accuracy. The rest of the paper covers related works, the dataset, our methodology, results, and directions for future research.

2. Related Works

In the realm of detecting sexist tweets, researchers use various methodologies to navigate the complexities of language and intent. Binary classification models serve as a foundational tool, offering a clear distinction between sexist and non-sexist content. However, a deeper understanding requires exploring author intent, which means delving into contextual cues and linguistic subtleties.

Task 1 of EXIST 2024 [3] is dedicated to binary categorization, where researchers have explored a spectrum of techniques, from traditional rule-based systems to deep learning architectures, with a consistent goal: to accurately identify instances of sexism in tweets. Notable among these efforts is the work of Burnap and Williams [4], who leveraged automatic classification techniques
to detect hate speech on Twitter. Their approach, which incorporated linguistic and contextual features, showed significant accuracy in pinpointing problematic content.

Task 2, however, takes a deeper dive into author intent, recognizing that the mere presence of sexist language does not always imply malicious intent. To address this, researchers examine the interplay between language, context, and underlying motives. Waseem and Hovy [5] embarked on this line of work by identifying predictive features for hate speech detection, underscoring the importance of contextual and demographic attributes in discerning the author's intent.

In sum, the exploration of related works underscores the multidimensional nature of detecting sexist tweets. While binary classification models provide a solid foundation, a more nuanced understanding requires integrating author intent analysis and modern transformer models. These efforts collectively advance our comprehension of sexism in online discourse and pave the way for more effective mitigation strategies.

3. Tasks and Dataset Description

This section describes the tasks in which we participated and the datasets provided by the organizers.

3.1. Task 1: Sexism Identification in Tweets

Task 1 is a binary classification problem where the objective is to determine whether a given tweet contains sexist expressions or behaviors. The classification is straightforward: each tweet is categorized as either sexist ("YES") or not sexist ("NO"). Examples of sexist tweets include statements that directly express sexist sentiments, describe sexist situations, or criticize sexist behaviors. For instance, tweets that demean women's capabilities, perpetuate stereotypes, or contain derogatory comments fall into the "YES" category. Conversely, tweets that do not exhibit these characteristics are labeled "NO".

3.2. Task 2: Source Intention in Tweets

Task 2 is a multi-class classification task aimed at understanding the intention behind sexist tweets. It only applies to tweets already identified as sexist in Task 1. The intention of the tweet's author is classified into one of three categories:

• DIRECT: The tweet itself is overtly sexist. For example, a tweet stating, "A woman's place is in the home," directly conveys a sexist message.
• REPORTED: The tweet reports or describes a sexist incident or situation. An example is, "Today, I saw a man harass a woman on the subway."
• JUDGEMENTAL: The tweet condemns or criticizes sexist behaviors or situations. For instance, "It's disgraceful how women are still paid less than men for the same work."

Each of these categories provides insight into the various ways sexism can manifest and the different contexts in which it is discussed on social media.

3.3. Dataset Description

The dataset provided by the organizers contains over 8000 labeled tweets in English and Spanish, with a balanced language distribution. The training dataset has 6920 tweets and the development dataset 1038 tweets.
Provided in JSON format, each tweet includes attributes such as "id_EXIST", "lang", "tweet", "number_annotators", and detailed annotator information ("annotators", "gender_annotators", "age_annotators", "ethnicity_annotators", "study_level_annotators", "country_annotators"). The labels are given in "labels_task1" for sexist content and in "labels_task2" for author intent, and the "split" attribute indicates the dataset subset and language. Tables 1 and 2 show examples of instances for Task 1 and Task 2.

Table 1
Examples of instances for Task 1

id_EXIST: 101000; lang: es; tweet: "No sean de esos que consiguen un peso y cambian con la gente. La plata no es culo."; annotators: Annotator_91, Annotator_92, Annotator_93, Annotator_94, Annotator_95, Annotator_96; labels_task1: NO, NO, NO, NO, YES, NO

id_EXIST: 201573; lang: en; tweet: "@Avigeek96 Well men kill women everyday"; annotators: Annotator_549, Annotator_550, Annotator_551, Annotator_552, Annotator_553, Annotator_554; labels_task1: NO, YES, YES, YES, YES, YES

Table 2
Examples of instances for Task 2

id_EXIST: 101000; lang: es; tweet: "No sean de esos que consiguen un peso y cambian con la gente. La plata no es culo."; annotators: Annotator_91, Annotator_92, Annotator_93, Annotator_94, Annotator_95, Annotator_96; labels_task2: -, -, -, -, JUDGEMENTAL, -

id_EXIST: 201573; lang: en; tweet: "@Avigeek96 Well men kill women everyday"; annotators: Annotator_549, Annotator_550, Annotator_551, Annotator_552, Annotator_553, Annotator_554; labels_task2: -, REPORTED, JUDGEMENTAL, JUDGEMENTAL, JUDGEMENTAL, REPORTED

These instances were extracted from the file "training.json", which contains 6920 instances: 3660 in Spanish and 3260 in English. The file "dev.json" contains 1038 instances: 549 in Spanish and 489 in English.

In the training and development datasets, the distribution of ethnicities shows a predominant representation of the "White or Caucasian" group, followed by the "Hispanic or Latino" category. Regarding educational levels, the most common is "Bachelor's degree", while the least represented are "Less than high school diploma" and "Doctorate". The class distributions for both the binary classification task (YES/NO) and the multiclass classification task (DIRECT, REPORTED, JUDGEMENTAL) are substantially consistent between the training and development datasets, as depicted in Figure 1.

Figure 1: Class distribution in training and dev datasets

To effectively apply Learning with Disagreement techniques, it is important to study how the different annotator profiles are distributed across the labeled instances. In the training dataset, male and female annotators contribute equally, with 20760 annotations from each gender out of 41520 in total. The development dataset has the same distribution, 50% male (3114) and 50% female (3114), out of a total of 6228 annotations. The age distribution is also equitable across both datasets, with one third of annotators falling into each of the following age groups: 18-22 years, 23-45 years, and over 46 years.
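As a quick illustration of how these annotation-level attributes can be inspected, the following minimal sketch loads the training file and tallies annotator genders, ages, and Task 1 votes. It assumes the file is a JSON object keyed by "id_EXIST" with the per-instance lists described above; the exact top-level layout and the gender codes ("F"/"M") are assumptions.

```python
import json
from collections import Counter

# Load the EXIST training file (assumed to map instance ids to records
# with the attributes described in the text).
with open("training.json", encoding="utf-8") as f:
    data = json.load(f)

gender_counts, age_counts, vote_counts = Counter(), Counter(), Counter()
for instance in data.values():
    gender_counts.update(instance["gender_annotators"])
    age_counts.update(instance["age_annotators"])
    vote_counts.update(instance["labels_task1"])

print(gender_counts)  # per the text: 20760 annotations per gender in training
print(age_counts)     # roughly one third per group: 18-22, 23-45, 46+
print(vote_counts)    # distribution of YES/NO annotations
```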
4. Methodology

In this section, the methodology used to develop the models submitted to the competition is described. As noted above, our approaches are based on transformer-based language models. Given that the provided data is in both English and Spanish, four pre-trained models were chosen:

• XLM-RoBERTa Base: A pre-trained language model using the RoBERTa architecture and trained on multiple languages. It understands and generates text in various languages efficiently and accurately [6].
• DeBERTa v3 Base: A variant of BERT incorporating improvements in attention and word representation, yielding better performance across a variety of NLP tasks such as text comprehension and language generation [7].
• RoBERTa Base BNE: A Spanish-specific adaptation of the RoBERTa Base model, trained on the corpus of the National Library of Spain (BNE). It offers high performance in Spanish language processing tasks [8].
• BERT Base Multilingual: A version of BERT pre-trained on multiple languages and insensitive to case: it handles text in various languages without distinguishing between uppercase and lowercase [9].

4.1. Baseline

The first step in developing the classification tasks was to establish an initial benchmark, or baseline. The baseline establishes a fundamental methodology that serves as a reference point for comparing more advanced models, setting a performance threshold that other models must exceed. Two baselines, Version A and Version B, were developed to address both Task 1 and Task 2.

4.1.1. Baseline Version A

This approach trains a single multiclass classifier (NO, DIRECT, REPORTED, and JUDGEMENTAL) to address the labels of Task 1 and Task 2 simultaneously. The baseline model uses the competition's datasets without preprocessing and with arbitrary hyperparameter values; both Spanish and English data are included. Unless otherwise specified for hyperparameter tuning, models were trained and validated with the training dataset and tested with the development dataset [10]. The hyperparameter values used were: batch size of 32, learning rate of 2e-5, maximum length of 128, and weight decay of 0.01. The optimizer used was adamw_torch. The maximum number of training epochs was limited to 10, with early stopping set at three epochs. After training the chosen pre-trained models, the results for Baseline Version A are presented in Table 3.

Table 3
Results for Baseline Version A

Model                     F1 for Baseline A
XLM RoBERTa Base          0.4983
DeBERTa v3 Base           0.4910
RoBERTa Base BNE          0.4599
BERT Base Multilingual    0.4388

This classification strategy yields imprecise, low results. For example, the pre-trained XLM RoBERTa Base model achieved an F1 score of 0.8129 for the NO class, but only 0.3419 for the JUDGEMENTAL class. This pattern is consistent across the other models, indicating difficulty in classifying all the labels together.

4.1.2. Baseline Version B

In Version B, the initial step classifies tweets into the two categories of Task 1 (YES and NO). Tweets categorized as YES are then further divided into the three classes of Task 2 (DIRECT, REPORTED, and JUDGEMENTAL). The outcomes achieved with Baseline Version B are detailed in Tables 4 and 5.

Table 4
Results for Baseline Version B, binary classification

Model                     F1 for Baseline B
XLM RoBERTa Base          0.7807
DeBERTa v3 Base           0.7820
RoBERTa Base BNE          0.7584
BERT Base Multilingual    0.7618

Table 5
Results for Baseline Version B, multiclass classification

Model                     F1 for Baseline B
XLM RoBERTa Base          0.568331
DeBERTa v3 Base           0.555636
RoBERTa Base BNE          0.530543
BERT Base Multilingual    0.529283

As can be seen, the results improved significantly when the process was broken down into two classification phases.
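For concreteness, the following minimal sketch shows how the baseline setup described in Section 4.1.1 can be assembled with Hugging Face's Trainer. Only the hyperparameter values come from the text; the checkpoint name, metric wiring, and the pre-tokenized train_ds/valid_ds datasets are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

checkpoint = "xlm-roberta-base"  # one of the four candidate models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    # Maximum length of 128, as stated in Section 4.1.1.
    return tokenizer(batch["tweet"], truncation=True, max_length=128)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="baseline",
    per_device_train_batch_size=32,  # batch size 32
    learning_rate=2e-5,              # learning rate 2e-5
    weight_decay=0.01,               # weight decay 0.01
    num_train_epochs=10,             # at most 10 epochs
    optim="adamw_torch",             # adamw_torch optimizer
    evaluation_strategy="epoch",     # evaluate at the end of each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # hypothetical datasets, already mapped with tokenize()
    eval_dataset=valid_ds,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience of 3 epochs
)
trainer.train()
```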
4.2. Split Description for Training Framework

A schematic overview of the distribution and creation of the datasets used for training the models is shown in Figure 2. These datasets are used in both Task 1 (annotated as v1.x-d) and Task 2 (annotated as v2.x-d). For example:

• The v1.1 model is trained and validated with the v1.1-d dataset:
  – v1.1-d train (training data)
  – v1.1-d valid (validation data)
• The v1.3 model is trained with the v1.3-d dataset:
  – v1.3-d train (training data)
  – v1.3-d valid (validation data)
• The v2.1 model is trained with the v2.1-d dataset:
  – v2.1-d train (training data)
  – v2.1-d valid (validation data)

Figure 2: Datasets subdivisions for model training

4.3. Data Cleaning and Normalization

In NLP, text preprocessing steps such as data cleaning and normalization are critical to ensuring that texts are consistent and noise-free before they are used in machine learning models. For cleaning tweets, the following techniques were employed (a sketch of the resulting pipeline is given after Table 6):

• Lowercase Conversion: Ensures uniform treatment of words, eliminating the distinction between "Cat" and "cat", simplifying the dataset and reducing the number of unique features.
• Removal of Links: Eliminates web links present in tweets, as they add no semantic value and are often irrelevant to sentiment analysis or text meaning.
• Removal of User Mentions: Removes mentions of other users and retweets, which usually provide no relevant information for semantic analysis and can introduce noise.
• Removal of Hashtags: Simplifies the text by removing hashtags, which may not be relevant for semantic analysis, focusing the analysis on words and phrases.
• Removal of Emojis: Although emojis convey emotions or context, their interpretation can be complex in textual analysis. Initial attempts to translate emojis into words did not improve results, so they were removed to reduce noise and simplify the analysis.

An example of the data cleaning carried out is presented in Table 6.

Table 6
Data Cleaning and Normalization

Original tweet: "Collab betweet WeAreEqual X @TaravaNFT ? YOU ALREADY KNOW IT. Join our Discord on how to join our exclusive Giveaway : https://t.co/x3stzfLLmh. #NFT #NFTGiveaway #art"
Cleaned and normalized tweet: "collab betweet weareequal x ? you already know it. join our discord on how to join our exclusive giveaway : ."
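The following minimal sketch implements these cleaning steps with regular expressions. The exact patterns, in particular the emoji ranges, are illustrative assumptions rather than the team's exact implementation.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Illustrative emoji coverage; broader coverage needs more Unicode blocks.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_tweet(text: str) -> str:
    """Lowercase, then strip links, user mentions, hashtags and emojis."""
    text = text.lower()
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Collab @TaravaNFT YOU ALREADY KNOW IT https://t.co/x3stzfLLmh #NFT"))
# -> "collab you already know it"
```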
4.4. Data Augmentation and Hyperparameter Search

Data augmentation is a crucial technique in natural language processing (NLP) for enhancing the performance of machine learning models by artificially expanding the dataset. Various strategies, including back-translation, have been employed to improve model robustness and generalization. Recent studies have demonstrated the effectiveness of data augmentation in text classification tasks, emphasizing its importance in handling diverse linguistic patterns and enhancing model accuracy [11, 12]. Backtranslation, in particular, has been highlighted as a powerful augmentation technique: it translates text into a target language and then back into the source language, generating varied paraphrases while preserving the original meaning [13, 14, 15].

4.4.1. Oversampling with Backtranslation

Oversampling addresses class imbalance [16] by generating syntactic and lexical variations through backtranslation, increasing dataset diversity without altering meaning [17]. Since the datasets are unbalanced, a balancing technique is necessary. In this case, the number of rows for the REPORTED and JUDGEMENTAL classes was increased through backtranslation, while the original number of rows was maintained for the DIRECT class.

Using Helsinki-NLP/opus models from the OPUS project [18], tweets in Spanish are translated to English, then to German, and back to Spanish. An example of data generation through backtranslation for a tweet in Spanish is shown in Table 7.

Table 7
Example of data generation through backtranslation for a tweet in Spanish

Original tweet: "Se supone q me tengo q avergonzar d ser mamá? Jajajajaajajaja"
New tweet generated with backtranslation: "¿Debería avergonzarme de ser madre? naaaa"

Tweets in English were translated from English to German, then from German to Spanish, and finally from Spanish back to English. An example of a newly generated instance is shown in Table 8.

Table 8
Example of data generation through backtranslation for a tweet in English

Original tweet: "Easy to throw rocks and hide behind your gender or sexual identity #onhere"
New tweet generated with backtranslation: "Easy to throw stones and hide behind your sex or sexual identity #onhere"
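As a reference, a minimal sketch of the Spanish backtranslation chain (es to en to de and back to es) using OPUS-MT checkpoints might look as follows. The specific checkpoint names follow the Helsinki-NLP naming convention and are our assumption; the text only specifies that Helsinki-NLP/opus models were used.

```python
from transformers import pipeline

# One translation pipeline per hop of the es -> en -> de -> es chain.
es_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")
en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_es = pipeline("translation", model="Helsinki-NLP/opus-mt-de-es")

def backtranslate_es(tweet: str) -> str:
    """Return a Spanish paraphrase generated via English and German."""
    english = es_en(tweet)[0]["translation_text"]
    german = en_de(english)[0]["translation_text"]
    return de_es(german)[0]["translation_text"]

print(backtranslate_es("Se supone q me tengo q avergonzar d ser mamá?"))
```

The English chain works the same way with en-de, de-es, and es-en checkpoints.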
4.4.2. Hyperparameter Search

Hyperparameter tuning optimizes model performance by selecting suitable values for parameters that are not learned during training. Optuna [19] helps define and iteratively optimize the hyperparameter search space. Exhaustive search (grid search) explores all possible combinations but is computationally expensive; to expedite the experiments, the training and validation datasets were reduced to 80% of their original size. To implement the exhaustive search with Optuna, the hyperparameter search space shown in Table 9 was defined.

Table 9
Hyperparameter Search Space

Hyperparameter    Value Range
Batch Size        [8, 16, 32]
Learning Rate     [3e-5, 5e-5]
Weight Decay      [0.001, 0.01, 0.1]
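A grid search over this space can be expressed in Optuna with a GridSampler, as in the minimal sketch below. Here train_and_evaluate() is a hypothetical helper standing in for a full training run on the reduced datasets that returns the validation F1.

```python
import optuna

# Search space from Table 9.
search_space = {
    "batch_size": [8, 16, 32],
    "learning_rate": [3e-5, 5e-5],
    "weight_decay": [0.001, 0.01, 0.1],
}

def objective(trial: optuna.Trial) -> float:
    params = {name: trial.suggest_categorical(name, choices)
              for name, choices in search_space.items()}
    return train_and_evaluate(**params)  # hypothetical training helper

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.GridSampler(search_space),  # enumerates all combinations
)
study.optimize(objective)  # stops after the 3 * 2 * 3 = 18 grid points
print(study.best_params)
```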
With the hyperparameters optimized and the previously described techniques applied, the resulting metrics are shown in Tables 10 and 11.

Table 10
F1 scores Task 1

Model                     Baseline    Data augmentation + Hyperparameters
XLM RoBERTa Base          0.7807      0.7876
DeBERTa v3 Base           0.7820      0.7871
RoBERTa Base BNE          0.7584      0.7616
BERT Base Multilingual    0.7618      0.7640

Table 11
F1 scores Task 2

Model                     Baseline    Data augmentation + Hyperparameters
XLM RoBERTa Base          0.5945      0.6095
RoBERTa Base BNE          0.4795      0.4905
DeBERTa v3 Base           0.5801      0.5968

4.5. General Training Configuration

Training was conducted using the Trainer class from Hugging Face, incorporating the optimized hyperparameters. The adamw_torch [20] optimizer was employed to update the model weights, with evaluations conducted at the end of each epoch and models saved periodically. The best model, determined by the F1 metric, was loaded, and training was halted via the EarlyStoppingCallback if no improvements were observed. These strategies were then tested on the structured dev dataset. An RTX 4070 graphics card was used for its high performance and capacity for intensive processing, ensuring efficient development and execution of the models.

4.5.1. Identifying Sexism in Tweets - Version: v1.x

To train the final models that generate predictions on the test data provided by the competition for Task 1, we selected the two best-performing models according to their metrics during the training process.

To train the v1.1 model, data from the v1.1-d dataset was used. Each tweet in this dataset is labeled by six annotators in both the training and validation sets. To obtain the majority label, following the competition guidelines for deriving the gold label, the votes were aggregated and the label receiving two or more votes from among the six annotators was selected; in case of a tie, the instance in question was excluded entirely. A multilingual model, XLM-RoBERTa Base, was trained to handle English and Spanish instances simultaneously. Figure 3 shows this process.

Figure 3: Training flow for model v1.1

Subsequently, this model was used to predict the labels of the official competition test set. The results indicate the majority predicted label for each test instance, followed by the score_label, the similarity score assigned by the classifier to the majority predicted label on a scale from 0 to 1. To obtain the hard label, the majority predicted label was selected. For the soft label, since this is a binary classifier (YES or NO), the score_label value was assigned to the majority class and the value of the minority class was computed as 1 minus the score_label. Note that the label values in the soft results must not sum to more than 1.
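In code, this soft-label construction reduces to a few lines. The sketch below assumes score_label is the classifier's score for the majority predicted class, as described above.

```python
def task1_soft_labels(majority_label: str, score_label: float) -> dict:
    """Assign the classifier score to the majority predicted class and the
    remainder to the other class, so the two values sum to 1."""
    minority_label = "NO" if majority_label == "YES" else "YES"
    return {majority_label: score_label, minority_label: 1.0 - score_label}

# Example: an instance predicted YES with score 0.80.
print(task1_soft_labels("YES", 0.80))  # {'YES': 0.8, 'NO': 0.2}
```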
The model's evaluation results are shown in Table 12.

Table 12
Version v1.1 Evaluation models' Results

Model                F1 score
XLM RoBERTa Base     0.850

The training process for model v1.2 is almost identical to the one just described, with some key differences in the models used and the workflow structure. Two datasets were used for training: v1.3-d for English instances and v1.4-d for Spanish instances. As for the models, DeBERTa v3 Base was used for English and RoBERTa Base BNE for Spanish. Figure 4 shows the process.

Figure 4: Training flow for model v1.2

The workflow began with the separate training of the two models: DeBERTa v3 Base on the English instances of dataset v1.3-d, and RoBERTa Base BNE on the Spanish instances of dataset v1.4-d. The models' evaluation results are shown in Tables 13 and 14.

Table 13
Versions v1.3 and v1.4 Evaluation models' Results

Model                       F1 score
XLM RoBERTa Base (v1.3)     0.854
DeBERTa v3 Base (v1.3)      0.859
XLM RoBERTa Base (v1.4)     0.826
RoBERTa Base BNE (v1.4)     0.863
BERT Base (v1.4)            0.818

Table 14
Version v1.2 - Predictions English + Spanish

Model                F1 score
DeBERTa v3 Base      0.8589
RoBERTa Base BNE     0.8630
Final Average        0.8617

4.5.2. Intent Classification in Sexist Tweets - Model Versions

Model version v2.1 was designed to address the second task of the competition (Source Intention in Tweets), which classifies the intentionality of tweets previously categorized as sexist by model version v1.2. This task follows the initial classification of sexist messages and categorizes them according to the author's intent, providing insight into the role of social media in issuing and spreading sexist messages. A classification into three classes, DIRECT, REPORTED, and JUDGEMENTAL, is proposed. The training data comes from dataset version v2.1-d, which contains only instances of the three classes and excludes instances categorized as NO, avoiding noise in the training data and refining the model's accuracy. Only hard labels were generated for the final predictions, since the model does not return score labels for the non-predicted (minority) classes. Figure 5 shows the process, and the obtained results are shown in Table 15.

Figure 5: Training flow for Model Version 2.1

Table 15
Performance of Model Version 2.1

Model                F1 score
XLM RoBERTa Base     0.501

The next model applies Learning with Disagreement: it considers and leverages the differences in opinion among multiple human annotators when labeling the training data. This approach captures a greater diversity of perspectives, which is especially useful in subjective or complex tasks where there may be significant disagreement about the correct labels. Integrating multiple viewpoints yields a more robust and representative training dataset, and the soft labels resulting from this process let the model capture the uncertainty and variability inherent in human annotations, leading to better generalization and performance in real-world situations where data may not be clear or fully defined. Figure 6 shows the process, and the obtained results are shown in Table 16.

Figure 6: Training flow for Model Version 2.2

Table 16
Version 2.2 Evaluation models' Results

Model                   F1 score
XLM RoBERTa [Ann_1]     0.576
XLM RoBERTa [Ann_2]     0.546
XLM RoBERTa [Ann_3]     0.509
XLM RoBERTa [Ann_4]     0.508
XLM RoBERTa [Ann_5]     0.517
XLM RoBERTa [Ann_6]     0.509
Ensembler               0.527

The training flow, focusing on how the disagreement among annotators is handled and how the soft labels are generated, proceeds step by step as follows (a code sketch of step 4 is given after the list):

1. Training data comes from six groups of annotators differentiated by gender and age: ["F 18-22", "F 23-45", "F 46+", "M 46+", "M 23-45", "M 18-22"]. Each group of annotators has provided labels for the training data.
2. Six datasets (v2.2.1-d, v2.2.2-d, v2.2.3-d, v2.2.4-d, v2.2.5-d, and v2.2.6-d) are used to train six instances of the XLM-RoBERTa Base model, each corresponding to the annotations of one of the six groups.
3. The six trained models are combined using an ensemble method, which integrates the outputs of the different models to produce a more robust final prediction. The ensemble computes a weighted average (sum) of the predictions of the six models.
4. To generate the soft labels, the proportion of annotators who voted for each label is taken into account. For example, if 2 out of 6 annotators labeled a data point as "DIRECT", the soft label for "DIRECT" would be 2/6 = 0.33333; the same applies to "REPORTED" and "JUDGEMENTAL". In the previous task (Task 1), the data was classified into "YES" and "NO"; if a data point was classified as "YES" with a probability of 0.80, this value is used to scale the soft labels of Task 2. For example, if the soft label for "DIRECT" is 0.33333, the adjusted value is 0.33333 * 0.80 = 0.26666. This adjustment is applied to all subclasses of "YES" ("DIRECT", "REPORTED", and "JUDGEMENTAL"), both when YES is the majority class in Task 1 and when it is the minority class; in this way, even instances classified as NO by the version 1 models receive (very low) probabilities for the YES subclasses.
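The following sketch spells out the step 4 computation for a single instance. Scaling the Task 2 vote proportions by the Task 1 YES probability follows the description above; assigning the remaining probability mass to NO is our assumption about how the NO component of the soft label is filled in.

```python
from collections import Counter

TASK2_LABELS = ("DIRECT", "REPORTED", "JUDGEMENTAL")

def task2_soft_labels(annotator_votes: list, p_yes: float) -> dict:
    """Vote proportion per Task 2 class, scaled by the Task 1 P(YES)."""
    votes = Counter(v for v in annotator_votes if v in TASK2_LABELS)
    n = len(annotator_votes)  # total annotators, including "-" votes
    soft = {label: p_yes * votes[label] / n for label in TASK2_LABELS}
    soft["NO"] = 1.0 - p_yes  # assumption: remaining mass goes to NO
    return soft

# Example from the text: 2 of 6 votes for DIRECT and P(YES) = 0.80,
# giving DIRECT = 0.80 * 2/6 = 0.26666 (0.33333 * 0.80 in the text).
votes = ["DIRECT", "DIRECT", "REPORTED", "JUDGEMENTAL", "-", "-"]
print(task2_soft_labels(votes, 0.80))
```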
Finally, model version 2.3 follows the same guidelines as version 2.2, but its training data comes from three groups of annotators differentiated by age: ["F 18-22", "F 23-45", "F 46+"]. Only female groups were selected to train the models that compose the ensemble. Figure 7 shows the process, and the obtained results are shown in Table 17.

Figure 7: Training flow for Model Version 2.3

Table 17
Version 2.3 Evaluation models' Results

Model                   F1 score
XLM RoBERTa [Ann_1]     0.5755
XLM RoBERTa [Ann_2]     0.5460
XLM RoBERTa [Ann_3]     0.5086
Ensembler               0.5434

4.6. Error Analysis

4.6.1. Task 1

This section provides a detailed analysis of the errors made by the models in Task 1 (Sexism Identification in Tweets), focusing on classification discrepancies between the YES and NO classes. By scrutinizing misclassifications, we aim to identify patterns and insights into the challenges the models face, and we explore potential strategies to improve classification performance, especially for the minority class (YES). Examples are presented in Table 18.

Table 18
Examples of instances for Task 1

Tweet: "Woman driving beside me a few minutes ago holding her phone to her ear with her shoulder, while holding a mug of coffee. Baby on Board sticker on both rear windows."; label: NO; prediction: YES

Tweet: "Por qué todos los hombres cuando su novia o esposa está embarazada andan más de culeros que de costumbre"; label: YES; prediction: NO

4.6.2. Task 2

This section analyzes the errors encountered by the models in Task 2 (Source Intention in Tweets), focusing on classification accuracy across the DIRECT, REPORTED, and JUDGEMENTAL categories. By examining misclassifications, we aim to understand the factors influencing performance across these categories and discuss refinements to improve the model's ability to discern nuanced intentions in sexist tweets. Examples are provided in Table 19, and the confusion matrices in Figure 8 depict the prediction distributions for the Task 2 models.

Table 19
Examples of instances for Task 2

Tweet: "Lo irónico es que en su mayoría sean hombres quienes apoyan la criminalización de las mujeres frente al aborto. Claro, a las mujeres hay que castigarlas, juzgarlas y señalarlas siempre, como si no fuera suficiente tener que cargar con el peso de una violación."; label: REPORTED; prediction: DIRECT

Tweet: "En total delirio esta tipa quiere legalizar el terrorismo. ¿Y esta escoria quiere definir los destinos de Chile? Permitirlo es de anti chilenos."; label: DIRECT; prediction: NO

Tweet: "If you don't vote, you ARE the problem. #VoteBlueIn2022 #WomensRights #GunControl #bookban #CivilRights #VotingRights"; label: NO; prediction: REPORTED

Figure 8: Confusion Matrices for Task 2 models test predictions

4.6.3. Error Analysis Conclusions

The analysis of errors in Task 1 and Task 2 uncovers several causes of misclassification. Many tweets feature nuanced language or context that is hard for the models to interpret; for example, a tweet warning about sympathetic individuals may discuss predatory behavior broadly yet be misinterpreted by the model as sexist content. Tweets often employ sarcasm, idiomatic expressions, or ambiguous wording: a tweet about a woman multitasking while driving may be misconstrued as a critique of a gender stereotype rather than a comment on unsafe driving practices. Multilingual or culturally referential tweets add further complexity; a Spanish tweet discussing men's behavior could be read in context as commentary on male behavior patterns rather than explicit sexism.

5. Official Results

In Task 1, the best-performing strategy was a combination of models for different languages: RoBERTa Base BNE was used for classifying Spanish tweets, and DeBERTa v3 Base was employed for English tweets. This dual-model approach significantly outperformed the other strategies, emphasizing the effectiveness of leveraging specialized models for each language. The multilingual model XLM RoBERTa Base also showed strong performance, though slightly behind the combined approach. In Task 1, model v1.1 produced the run I2C-UHU_1, while v1.2 produced I2C-UHU_2. The official results for Task 1 are shown in Tables 20 and 21.

Table 20
HARD-HARD Evaluation EXIST 2024 Leaderboard Task 1

Ranking    Run                                  ICM-Hard    ICM-Hard Norm    F1_YES
0          EXIST2024-test_gold.json             0.9948      1.0000           1.0000
...
10         I2C-UHU_2.json                       0.5557      0.7793           0.7733
...
32         I2C-UHU_1.json                       0.4651      0.7338           0.7513
...
68         EXIST2024-test_majority-class.json   -0.4413     0.2782           0.0000
...
70         EXIST2024-test_minority-class.json   -0.5742     0.2114           0.5698

Table 21
SOFT-SOFT Evaluation EXIST 2024 Leaderboard Task 1

Ranking    Run                                  ICM-Soft    ICM-Soft Norm    Cross Entropy
0          EXIST2024-test_gold.json             3.1182      1.0000           0.5472
...
13         I2C-UHU_2.json                       0.6871      0.6102           0.9184
...
18         I2C-UHU_1.json                       0.5175      0.5830           1.0666
...
36         EXIST2024-test_majority-class.json   -2.3585     0.1218           4.6115
...
40         EXIST2024-test_minority-class.json   -3.0717     0.0075           5.3572

In Task 2, the best results were achieved using the Learning with Disagreement method with six groups of annotators (three male and three female). This approach outperformed the run that applied Learning with Disagreement with only three groups of female annotators, suggesting that a more diverse set of annotators can enhance the model's performance by providing a broader range of perspectives, which likely leads to better generalization and robustness in the model's predictions. For Task 2, v2.1 generated the run I2C-UHU_1, v2.2 produced I2C-UHU_2, and v2.3 resulted in I2C-UHU_3. The official results for Task 2 are shown in Tables 22 and 23.

Table 22
HARD-HARD Evaluation EXIST 2024 Leaderboard Task 2

Ranking    Run                                  ICM-Hard    ICM-Hard Norm    F1_YES
0          EXIST2024-test_gold.json             1.5378      1.0000           1.0000
...
11         I2C-UHU_2.json                       0.1815      0.5590           0.4980
...
21         I2C-UHU_1.json                       0.0418      0.5136           0.4708
...
24         I2C-UHU_3.json                       0.0210      0.5068           0.4663
...
39         EXIST2024-test_majority-class.json   -0.9504     0.1910           0.1603
...
46         EXIST2024-test_minority-class.json   -3.1545     0.0000           0.0280

Table 23
SOFT-SOFT Evaluation EXIST 2024 Leaderboard Task 2

Ranking    Run                                  ICM-Soft    ICM-Soft Norm    Cross Entropy
0          EXIST2024-test_gold.json             3.1182      1.0000           0.5472
...
17         I2C-UHU_2.json                       -2.6952     0.2828           2.1440
...
22         I2C-UHU_1.json                       -4.2278     0.1594           2.5245
...
27         EXIST2024-test_majority-class.json   -5.4460     0.0612           4.6233
...
35         EXIST2024-test_minority-class.json   -32.9552    0.0000           8.8517

6. Conclusions and Future Work

In this paper, the effectiveness of advanced transformer models in identifying sexism and classifying source intent in social media texts has been demonstrated. The approach employed, which integrates Learning with Disagreement, incorporates diverse annotator perspectives and thereby enhances the robustness and accuracy of the models. The methodology, consisting of classifying tweets as sexist or non-sexist and subsequently categorizing the intent of sexist tweets, has shown significant improvements in understanding and detecting nuanced sexist content.
The results of the EXIST 2024 leaderboard for Task 1 and Task 2 provide valuable insights into effective strategies for multilingual tweet classification and the impact of annotator diversity. For Task 1, superior performance was observed with the combination of language-specific models (RoBERTa Base BNE for Spanish and DeBERTa v3 Base for English), indicating the benefit of using specialized models tailored to individual languages. Meanwhile, the Task 2 results indicated that Learning with Disagreement with a diverse set of annotators (both male and female) led to better outcomes than using only female annotators. This underscores the importance of diversity in annotation for capturing a wider array of linguistic nuances and biases, thus improving the overall performance of the model.

Future work will focus on refining the models by incorporating additional data sources and exploring more sophisticated ensemble methods. Additionally, efforts will be made to extend the research to other forms of harmful online content, applying the insights gained from this study to broader applications in social media moderation and policy-making. The insights derived from this research provide a valuable foundation for developing more effective strategies to combat online sexism and other forms of digital harm.

Acknowledgments

This paper is part of the I+D+i project "Conspiracy Theories and hate speech online: Comparison of patterns in narratives and social networks about COVID-19, immigrants, refugees and LGBTI people [NON-CONSPIRA-HATE!]", PID2021-123983OB-I00, funded by MCIN/AEI/10.13039/501100011033/ and by "ERDF/EU".

References

[1] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[2] A. N. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, M. Poesio, Learning from disagreement: A survey, J. Artif. Int. Res. 72 (2022) 1385–1470. URL: https://doi.org/10.1613/jair.1.12752. doi:10.1613/jair.1.12752.
[3] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[4] P. Burnap, M. L. Williams, Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making, Policy & Internet 7 (2015) 223–242. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/poi3.85. doi:10.1002/poi3.85.
[5] Z. Waseem, D. Hovy, Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter, in: J. Andreas, E. Choi, A. Lazaridou (Eds.), Proceedings of the NAACL Student Research Workshop, Association for Computational Linguistics, San Diego, California, 2016, pp. 88–93. URL: https://aclanthology.org/N16-2013. doi:10.18653/v1/N16-2013.
[6] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, arXiv preprint arXiv:1906.08237 (2019).
[7] J. He, Z. Gan, X. Liu, J. Li, J. Gao, DeBERTa: Decoding-enhanced BERT with disentangled attention, arXiv preprint arXiv:2006.03654 (2021).
[8] A. Gutiérrez-Fandiño, J. Armengol-Estapé, M. Pàmies, J. Llop-Palao, J. Silveira-Ocampo, C. Carrino, A. Gonzalez-Agirre, C. Armentano-Oller, C. Rodriguez-Penagos, M. Villegas, Spanish language models, 2021.
[9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[10] T. Yu, H. Zhu, Hyper-parameter optimization: A review of algorithms and applications, 2020. arXiv:2003.05689.
[11] Author(s), A survey on data augmentation for text classification, Journal Name (2022).
[12] Author(s), XLNet with data augmentation to profile cryptocurrency influencers, Journal Name (2023).
[13] Author(s), Backtranslate what you are saying and I will tell who you are, Journal Name (2024).
[14] Author(s), Data augmentation using back-translation for context-aware neural machine translation, Journal Name (2019).
[15] Author(s), Back-translation-style data augmentation for end-to-end ASR, Journal Name (2018).
[16] S. Kobayashi, Contextual augmentation: Data augmentation by words with paradigmatic relations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018, pp. 452–457.
[17] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, E. Hovy, A survey of data augmentation approaches for NLP, 2021. arXiv:2105.03075.
[18] M. Aulamo, J. Tiedemann, The OPUS resource repository: An open package for creating parallel corpora and machine translation services, in: M. Hartmann, B. Plank (Eds.), Proceedings of the 22nd Nordic Conference on Computational Linguistics, Linköping University Electronic Press, Turku, Finland, 2019, pp. 389–394. URL: https://aclanthology.org/W19-6146.
[19] M. Junczys-Dowmunt, R. Grundkiewicz, T. Dwojak, H. Hoang, K. Heafield, T. Neckermann, F. Seide, U. Germann, A. Aji, N. Bogoychev, A. Martins, A. Birch, Marian: Fast neural machine translation in C++, 2018, pp. 116–121. doi:10.18653/v1/P18-4020.
[20] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, 2019. arXiv:1711.05101.