RoBEXedda: Sexism Detection in Tweets
Giacomo Aru1,† , Nicola Emmolo1,† , Simone Marzeddu1,† , Andrea Piras1,† , Jacopo Raffi1,† and
Lucia C. Passaro1
1
University of Pisa (Università di Pisa), Largo Bruno Pontecorvo 3, 56127 Pisa PI, Italy
Abstract
Sexism remains a pervasive issue, significantly hindering women’s progress in various aspects of life. This paper
focuses on online misogyny, where women face high levels of abuse and threats. The “EXIST 2024” challenge
aims to detect and classify sexist content on social media. In particular, in this paper, we address the “Sexism
Categorization in Tweets” task, which involves identifying sexist tweets and categorizing them into predefined
categories. A dataset comprising over 10,000 tweets in English and Spanish was exploited to train Transformer-
based systems with “Binary Relevance” and “Classifier Chain” architectures. This report presents an analysis of
the performance of our three candidate models in relation to the EXIST 2024 challenge. It includes a detailed
examination of the results obtained and a comparison with the official ranking of the challenge. As team “Medusa”,
we achieved second place in the competition, with three runs submitted in the soft-soft ranking. The models runs,
designated “RoBEXedda”, attained the fourth, fifth, and sixth positions in the “Task 3 Soft-Soft ALL” ranking.
Keywords
Sexism Characterization, EXIST 2024, CLEF 2024, Transformer, Binary Relevance, Classifier Chain
1. Introduction
Nowadays sexism, characterized by discrimination against women, has become a pervasive issue,
creating substantial obstacles for women in numerous aspects of their lives, including work, family
life, and personal development. This discrimination acts as a significant barrier to their progress [1].
This paper focuses on the growing concern of online misogyny. Research indicates that the online
environment has long been challenging for women, as they experience higher levels of bullying, abuse,
hateful language, and threats compared to men [2].
EXIST [3, 4] is a series of scientific events and shared tasks that aim to capture sexism in a broad
sense, from explicit misogyny to other subtle expressions that involve implicit sexist behaviours [5].
In fact, many facets of a woman’s life may be the focus of sexist attitudes, including domestic and
parenting roles, career opportunities, sexual image, and life expectations, to name a few. In EXIST
2024, the fourth edition of the sEXism Identification in Social neTworks challenge at CLEF 2024 [6], the
proposed tasks were focused on detecting and classifying sexist textual messages and image memes.
Overall, the shared task comprises 5 different sub-tasks. Among them, we focus solely on the third
one, “Sexism Categorization in Tweets” [5]. In this task, each tweet must be categorized into one or
more of the six categories spanning from ideological inequality to sexual violence.
The Sub-task dataset, consisting of more than 10,000 tweets in English and Spanish, was used to train
neural networks based on the Transformer architecture [7]. To face the task, we exploited two different
architectures: “Binary Relevance”[8], which treats each label separately, and “Classifier Chain”[9], which
links classifiers to improve predictions. The rationale behind this choice is twofold. On the one hand,
Binary Relevance allows for a straightforward approach to multi-label classification by handling each
label as an independent binary classification problem. This simplicity can lead to efficient computation
and ease of implementation, making it suitable for scenarios where labels are largely uncorrelated.
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
†
These authors contributed equally.
$ g.aru@studenti.unipi.it (G. Aru); n.emmolo@studenti.unipi.it (N. Emmolo); s.marzeddu@studenti.unipi.it (S. Marzeddu);
a.piras18@studenti.unipi.it (A. Piras); j.raffi@studenti.unipi.it (J. Raffi); lucia.passaro@unipi.it (L. C. Passaro)
https://github.com/GiacomoAru (G. Aru); https://github.com/nicolaemmolo (N. Emmolo);
https://github.com/SimoneMarzeddu (S. Marzeddu); https://github.com/aprs3 (A. Piras); https://github.com/JacopoRaffi/
(J. Raffi); https://github.com/luciacpassaro (L. C. Passaro)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
On the other hand, the Classifier Chain method may enhance predictive performance by consid-
ering label dependencies. By sequentially linking classifiers, each subsequent classifier in the chain
incorporates the predictions of previous classifiers as additional features. This approach captures the
interdependencies among labels, which can significantly improve prediction accuracy, especially in
datasets where labels exhibit strong correlations.
As for the model, we decided to start from the XML-RoBERTa models family [10], with the aim of
leveraging its robust pre-training on a diverse range of languages and textual contexts. Our decision to
use XML-RoBERTa, and not larger models[11], was influenced by constraints in terms of computational
power. Larger models, while potentially offering higher accuracy and better performance due to
their increased capacity and deeper architectures, require significantly more computational resources
for training and inference. This includes the need for more powerful hardware, increased memory,
and longer training times, which were beyond the scope of our available resources. By choosing
XML-RoBERTa, we aimed to balance model complexity and resource efficiency. We named our model
family RoBEXedda, which is derived from RoBERta, adding “EX” for EXIST, and “edda” which is a
suffix in the Sardinian language meaning “tiny”. The model selection process in the development
of RoBEXedda models involved an initial search for the optimal pretrained transformer from the
XML-RoBERTa family and identifying other hyperparameters, guiding the research using Bayesian
optimisation [12].
The remainder of the paper is as follows: Section 2 presents previous works on the topic. in Section
3 are outlined the goals of our task. Section 4 explains the dataset’s structure and the preprocessing
techniques employed during the development process. Section 5 reports on the baselines taken into
account during development. Section 6 summarises the computational resources employed during
the production process. Section 7 presents an in-depth analysis of our system, highlighting the state-
of-the-art approaches considered in the processes of training and model selection. The results of the
development are discussed in Section 8. Finally, Section 9 is left for the conclusion and future expansions
of our work.
2. Related Work
Sexism, defined as prejudice or discrimination based on gender, is a pervasive issue amplified by online
platforms. Researchers have made significant progress in developing automated systems for sexism
detection. These systems employ various techniques, ranging from rule-based approaches to advanced
machine learning. Notably, recent work and competitions have begun exploring visual and multimodal
aspects of sexism detection as well. Since the very first edition of the EXIST challenge[13], several
methods have been proposed to face the task. For instance, the authors of [14] propose a system lever-
aging both multilingual and monolingual BERT models, translating data, and implementing ensemble
strategies for the identification and classification of sexism in English and Spanish. Similarly, [15] em-
ploys a multi-task learning approach that addresses distinct tasks from a unified representation, aiming
to enhance model performance by leveraging information derived from different tasks. Another notable
approach by [16] combines the final four hidden states of XLM-RoBERTa with a TextCNN equipped
with three kernels. This integration is designed to improve sexism detection, further incorporating
abusive word lexicons to demonstrate enhanced effectiveness compared to the use of the transformer’s
final layer.
In the EXIST2022 challenge, the second place team [17] based their system on an ensemble of five
different models for Spanish (XLM-R, RoBERTa, and three BERT models) and another five models for
English (XLM-R, RoBERTa, BERT, hateBERT, and ALBERT). They also translated all English tweets
to Spanish and vice versa, additionally masking randomly selected tokens to augment the data. The
third-place team’s system [18] combined linguistic features with state-of-the-art transformers using
ensemble techniques, their most effective model being a weighted ensemble of transformers. The team
that achieved first place in the EXIST2023 competition [19] employed mBERT and XLM-RoBERTa
along with ensemble techniques, further solidifying that transformers remain the optimal approach for
this task. Our work, contextualizes within the state of the art by utilizing both Binary Relevance and
Classifier Chain architectures alongside the XML-RoBERTa model family with the aim of balancing
computational efficiency and robust performance.
3. Objectives
As previously stated, certain aspects of a woman’s life may be the focus of sexist attitudes, and the
ability to automatically identify which of these aspects of women are being more frequently attacked in
social networks will facilitate the development of policies to combat sexism. This study aims to classify
tweets identified as sexist according to the type of sexism involved. This is a multi-label classification
task. In this manner, each tweet identified by the system as sexist is to be assigned one or more of the
following categories:
• Ideological and Inequality: the text discredits the feminist movement, rejects inequality
between men and women, or presents men as victims of gender-based oppression;
• Stereotyping and Dominance: the text expresses false ideas about women that suggest they
are more suitable to fulfil certain roles (mother, wife, family caregiver, faithful, tender, loving,
submissive, etc.), or inappropriate for certain tasks (driving, hard work, etc.), or claims that men
are somehow superior to women;
• Objectification: the text presents women as objects apart from their dignity and personal aspects
or assumes or describes certain physical qualities that women must have to fulfil traditional
gender roles (compliance with beauty standards, hypersexualization of female attributes, women’s
bodies at the disposal of men, etc.);
• Sexual Violence: the text includes or describes sexual suggestions, requests for sexual favours
or harassment of a sexual nature (rape or sexual assault);
• Misogyny and Non-Sexual Violence: the text expresses hatred and violence towards women,
different to that with sexual connotations.
The objective of this task is to classify tweets, both in English and Spanish, according to whether
they contain sexist expressions or behaviours. Initially, the classification should identify whether a
given tweet contains sexist content, and subsequently, the category of sexism present in the tweets.
4. Dataset
The EXIST 2024 Tweets Dataset comprises over 10,000 labelled tweets. In particular, the challenge
presents a standard splitting of the dataset into three subsets: a training set comprising 6,920 tweets,
a development set comprising 1,038 tweets, and a test set comprising 2,076 tweets. The entirety of
the dataset is bilingual, with a ratio of 0.9 to 1 between English and Spanish tweets (3,749 and 4,209
respectively). This ratio was estimated from the training and development datasets. The aforementioned
splitting has been disregarded during the development phase, in favour of an alternative internal
splitting to maintain an internal test set. The dataset entries have been shuffled, keeping 80% of the size
for our internal development set (training and validation sets), while reserving the remaining 20% for
our internal test set.
Each tweet in the dataset is represented as a JSON object containing various attributes. These
attributes include a unique identifier for the tweet, the language of the text, and the text of the tweet
itself. Additionally, metadata about the annotators is provided, including the number of annotators, and
their unique identifiers, gender, age group, ethnicity, level of education, and country of residence. The
dataset also includes sets of labels for the three tasks about sexism in tweets.
The EXIST 2024 dataset was annotated by collecting the opinions of various annotators regarding
the presence and, if any, nature of sexism in the provided tweets. Six annotators voted on each tweet,
selecting one or more of six categories, with the restriction that selecting “NO” precludes selecting any
other category. An “UNKNOWN” label is used when an annotator does not provide a label, but this is
not a class to be predicted.
The dataset provided for the challenge features both soft and hard gold labels. The soft labels indicate
the proportion of annotators who selected each category as shown in 1, reflecting the multi-label nature
of the problem. Therefore, the sum of the “NO” label and the highest value among other labels cannot
exceed one.
Gold hard labels are derived from soft labels using a probability threshold. If a category is chosen by
more than one annotator, it becomes a hard label. Tweets without a category that exceeds the threshold
are excluded from the evaluation.
(a) Annotators agreement on gold soft label values
(b) Gold hard labels distribution
Figure 1: (a) Violin plot showing the distribution of soft gold labels in the training and development sets. The
y-axis represents the value of the label and thus the level of agreement between the annotators for a given
class. For each class, the width of the violin represents, visually and not in precise scale, the frequency of that
particular value for that class. In addition, the distribution is also shown by the box plot contained in each violin.
(b) Bar chart showing the distribution of gold hard labels in the training and development sets.
The first step of preprocessing involved the removal of “irrelevant” features for our ap-
proach to the task (“labels_task2”, “labels_task1”, “labels_task3”, “annotators”, “number_annotators”,
“gender_annotators”, “age_annotators”, “ethnicities_annotators”, “study_levels_annotators”, “coun-
tries_annotators”). The features in question represent supplementary metadata that is not strictly
necessary nor present when addressing standard cases of classification approaches. In our view, build-
ing a model that relied on these features would have resulted in the creation of a highly customised
system, making it challenging to extend to general cases of online sexism identification in the absence
of datasets annotated in a manner compatible with the one provided in this instance.
Training the model on features related to the personal characteristics of the annotators would also
have introduced new biases into the system, making it more likely to produce predictions driven by
elements such as the ethnicity and gender of the annotators, with potentially discriminatory implications.
Building on [20], we implemented a preprocessing pipeline to improve the classification performance.
Specifically, we applied language-agnostic functions to remove all URLs, user tags, numbers and dates,
useless spaces and inverted exclamation/question marks at the beginning of Spanish phrases, as well
as any syntagmas that did not contain relevant information for categorising the tweet. All instances
of multiple exclamation marks, multiple question marks and mixed question and exclamation marks
were identified and reformatted to reduce the variability introduced by the alternation and repetition of
these two characters, which are often unevenly distributed. We identified and reduced all repetitions of
punctuation and letters, including extended words, to just 2 repetitions so that they retained a different
meaning from the single occurrence of the same character, while also ensuring that all repetitions were
consistent. The final issue we addressed was the omission of a space between the period at the end of a
sentence and the following word. This syntactic error was a common occurrence in the dataset.
Figure 2 illustrates two exemplary tweets in English and Spanish, respectively. These tweets demon-
strate the processing of the source tweet and its appearance following the decoding of the tokenization.
Tweet Processing Example
English tweet:
Source tweet:
@user5 Wow!!! https://example.com insaneee I can’t evennn believe it???!!!
Processed:
Wow !! insanee I can’t evenn believe it ?!
Processed and decoded from the tokenizer:
Wow!! insanee I can’t evenn believe it?!
Spanish tweet:
Source tweet:
@usuario3 ¡Mira esto!!! https://ejemplo.com ¿¿¿Qué??? ¡¡Es increíble!!
Processed:
Mira esto !! Qué ?? Es increíble !!
Processed and decoded from the tokenizer:
Mira esto!! Qué?? Es increíble!!
Figure 2: Example tweets in English and Spanish demonstrating the effect of the processing function.
After preprocessing, we created a new split of the labelled dataset by randomly shuffling its entries,
keeping 80% of the size for our internal development set (training and validation sets), while reserving
the remaining 20% for our internal test set.
5. Baseline models
In addition to the dataset, we were provided with a baseline for each task. This served as an initial
reference point for comparing the performance of various models. This approach enabled us to evaluate
how well or poorly a model performed in comparison to an unsophisticated or simple system. In our
case, considering only the third task, we had two baselines: one for the majority class and one for the
minority class.
The majority class baseline is a non-informative system where all instances are labelled with the
majority class, while the minority class is a non-informative system where all instances are classified as
the minority class. The term “non-informative” is used to describe a system or model that does not
utilise any significant information or features of the data to make predictions. Instead, it simply assigns
all instances to a particular class, regardless of the actual data.
The majority class is the “NO” class (Figure 3a), and the minority class is the “SEXUAL-VIOLENCE”
class (Figure 3b).
1 { 1 {
2 "test_case": "EXIST2024", 2 "test_case": "EXIST2024",
3 "id": "100001", 3 "id": "100001",
4 "value": { 4 "value": {
5 "IDEOLOGICAL-INEQUALITY": 0.0, 5 "IDEOLOGICAL-INEQUALITY": 0.0,
6 "STEREOTYPING-DOMINANCE": 0.0, 6 "STEREOTYPING-DOMINANCE": 0.0,
7 "MISOGYNY-NON-SEXUAL-VIOLENCE": 0.0, 7 "MISOGYNY-NON-SEXUAL-VIOLENCE": 0.0,
8 "SEXUAL-VIOLENCE": 0.0, 8 "SEXUAL-VIOLENCE": 1.0,
9 "OBJECTIFICATION": 0.0, 9 "OBJECTIFICATION": 0.0,
10 "NO": 1.0 10 "NO": 0.0
11 } 11 }
12 } 12 }
(a) Majority Class Instance (b) Minority Class Instance
Figure 3: Labels for all the instances of the majority and minority baselines
6. Resources employed
The development of RoBEXedda models was constrained by a limited number of resources, as the
shared machine assigned to us by the University of Pisa was also exploited by other students at the
same time. The machine was equipped with a NVIDIA V100 with 32 GB of memory. Alternatively,
Google Colab with the free plan was employed. An important mention goes to the Weight & Biases [21]
library, adopted for model selection. This permitted the training of distinct configurations in parallel
across multiple machines, with all results and plots being recorded directly on the library’s website.
This was achieved through the Sweep paradigm.
7. Proposed methodology
A fundamental stage in developing RoBEXedda involved searching for cutting-edge approaches that
fit well with our objectives. In addition to the techniques mentioned above that are used in the
preprocessing phase of the data, the use of AdamW [22] as an optimiser, and the implementation of
two distinct architectures based on the principles of Classifier Chain [23] and Binary Relevance [24],
respectively, deserves a more in-depth mention. These techniques have been combined with original
insights and approaches identified by our team. Both state-of-the-art approaches and integrations of
original techniques are discussed in this chapter.
7.1. Architectures
The architectures that we evaluated differed in the classification head that was placed on top of the
pretrained transformer. To address the multilingual nature of the task while respecting our computation
constraints, we focused our model selection on pretrained models from the XML-RoBERTa family [10].
Our pipeline included, after the preprocessing phase, a tweet tokenization phase. After studying the
dataset, we decided to set the length of the transformer input at 128 tokens, as this was found to be the
optimal length for the average and maximum length of the tokenized tweets shown in Figure 4.
Figure 4: Distribution of tokenized input tweet lengths across the entire dataset. Raw tokenized data is shown
in red with a vertical red line indicating the maximum length, while processed tokenized tweets are shown in
blue with a vertical blue line indicating the maximum length.
Two main architectural archetypes, Classifier Chain and Binary Relevance, were considered during
the model selection phase. We aimed to study the performance of the two architectures in tackling
the task analysed. Among the three model proposals submitted by our team, two of them were indeed
selected by us as the best Classifier Chain model and the best Binary Relevance model according to the
validation metrics considered during model selection.
7.1.1. Binary Relevance Architecture
The first architecture is based on the concept of Binary Relevance (BR) [24], a very simple technique,
often used as a baseline in multi-label classification problems. BR is a problem decomposition technique
that assumes that each label is independent of the others and can therefore be treated separately.
Furthermore, BR is a computationally efficient technique, making it a practical choice for our context.
The BR-based architecture consists of two fully connected hidden feedforward layers with GELU
activation function, placed on top of the pretrained transformer, receiving as input the contextual
embedding of the classification token produced by it. The output of the transformer does not go
through the internal pooling or classification layer of the transformer but is taken from the last block
of multi-head bidirectional attention. The head ends with a linear classification layer, followed by the
application of a sigmoidal function to the 6 computed outputs to obtain the 6 different probabilities,
one for each class. This approach is illustrated in Figure 5.
7.1.2. Classifier Chain Architecture
In Classifier Chain architectures, classifiers are chained together in a directed structure so that pre-
dictions from individual labels become features for other classifiers. Such methods are known in the
literature for their flexibility and effectiveness, achieving state-of-the-art performance on many datasets
and multi-label evaluation metrics [23].
As discussed in the previous chapter, the soft labels that our model should predict do not represent
a probability distribution on mutually exclusive classes (since several soft labels can be predicted
simultaneously), with the exception of the label “NO”, the only one whose value has a relationship with
the values of the others.
In particular, the target values represent the proportion of annotators who have chosen a set of
labels to associate with each specific tweet. In the case of the “NO” class, it is not possible to select this
label in conjunction with any of the remaining five labels. However, multiple categories of sexism can
Figure 5: Classification head architecture using Binary Relevance concept.
be selected without any restrictions. The sum of the “NO” label and the maximum value among the
remaining labels cannot exceed one. This intuition led to the development of an original architectural
idea, which consists of using a Classifier Chain model that can use its prediction of the “NO” label as a
feature for predicting the remaining labels.
The proposed Classifier Chain architecture comprises two modules, both of which constitute the
multi-label classifier head, situated at the top of the pretrained transformer. Both modules receive
contextual embeddings produced by the transformer following the processing of an observed tweet.
The first module comprises three fully connected feedforward layers, with GELU activation functions
in the hidden neurons and a sigmoid activation function in the output layer. The objective of this
module is to output the prediction of the value of the “NO” label associated with the input tweet.
The second module is analogous to the first in structure and its objective is to predict the values of
the remaining five soft labels. In light of the success of Classifier Chain architectures, we hypothesised
that the prediction produced by the first module could be used as input to the second module, thereby
serving as a feature in the prediction of the remaining five soft labels. A noteworthy design choice is
that the prediction of the “NO” label is given as input to the second module at a higher level of the
architecture (the second hidden layer rather than the first). The rationale behind this decision is that
the prediction of the first classifier (the first module) can more effectively represent a high-level feature
of the subsequent classifier, at a higher level of abstraction, than the contextual embedding returned by
the transformer.
During the training phase, the second module was trained using the teacher forcing technique, where
the input from the previous classifier in the chain was replaced by the corresponding gold label. Figure
6 shows the design of the approach.
7.2. Training
The pretrained model is employed in conjunction with the classification heads, which were based on
Classifier Chain and Binary Relevance architectures respectively described in sections 7.1.2 and 7.1.1.
The training parameters include learning rate, dropout, optimiser, hidden layer size, batch size and
epochs, which are optimised during model selection. One of the state-of-the-art techniques explored in
the training process is the AdamW optimiser.
AdamW (Adaptive Moment Estimation with Weight Decay) is an optimisation algorithm that com-
bines the properties of Adam with a weight decay mechanism. Adam is known to adapt individual
learning rates for each parameter using estimates of the first and second moments of the gradients.
AdamW differs from Adam for the weight decay that is applied separately from the updating of the
gradients. This approach allows more precise control of the weight decay and avoids unwanted inter-
ference between the learning rate and the weight decay itself, which complicates the optimal choice
of these hyperparameters and improves convergence efficiency. This facilitates the choice of learning
Figure 6: Classification head architecture using Classifier Chain concept.
parameters and leads to more efficient convergence. Studies have shown that AdamW tends to produce
models with a greater capacity for generalisation than Adam [25].
This finding was confirmed during our preliminary exploration phase. Indeed, we observed that
AdamW performed better than Adam, so we decided to directly employ it. In addition, we noticed that
also compared to Stochastic Gradient Descent (SGD), it significantly reduces the time needed to find an
effective combination of hyperparameters, allowing for more efficient and faster tuning [22].
To train our models, we employed the Binary Cross-Entropy (BCE) loss, which also served as the
primary validation metric. In addition to the BCE loss, we evaluated our models using other validation
metrics described below to ensure a comprehensive assessment of performance.
In particular, we exploited the PyEvALL (The Python library to Evaluate ALL) framework [26] that
offers several assessment metrics including F1 score, ICM (Information Contrast Measure) [27], and a
soft version of ICM (ICM-Soft). All of these additional metrics were observed during the model selection
process. The sole criterion for the selection was the validation metric (BCE), except for one of the three
candidate models, which was selected based on the ICM-Soft.
The ICM-Soft criterion represents an extension of ICM, a measure that has been demonstrated to be
analytically superior to cases where categories have a hierarchical structure and items may belong to
more than one category. However, in contrast to its standard counterpart, the ICM-Soft accepts both
soft system outputs and soft ground truth assignments.
7.3. Model Selection
The initial phase of model selection involved an analysis designed to gain a first understanding of the
influence of hyperparameters on model performance. To facilitate this process, we employed the W&B
(Weights & Biases) library (wandb) to train distinct configurations of parameters in parallel across
multiple machines, with all results and plots being recorded directly on the library’s website. The
utilisation of the sweeps and agents features enables the automation of hyperparameter search by
defining a search space and strategy and running the experiments according to this configuration.
The objective was to minimise the validation loss. To achieve this objective, a preliminary random
search [28] was conducted, during which several pretrained models from the XML-RoBERTa family
were tested. The optimal choice was identified as “sdadas/xlm-roberta-large-twitter” [29].
The initial search was followed by a Bayesian search. The fixed parameters for the model are the
Batch Size fixed at 64, the maximum number of Epochs set to 15 (with early-stopping, patience 2),
AdamW as the optimizer, and the pretrained model “sdadas/xlm-roberta-large-twitter”.
Table 1 shows the ranges of the other hyperparameters that were explored during the Bayesian
search.
Table 1
Hyperparameter Ranges for Model Selection.
Parameter Distribution Values/Range
classifier_type Categorical chain, ff
dropout Quantized Uniform min: 0.2, max: 0.7, q: 0.05
hidden_layer_size Categorical 128, 256, 512, 1024
learning_rate Log Uniform min: -13, max: -10
To prevent overfitting, during Bayesian search we employed early stopping. This ensured that the
model did not continue to train beyond the point where its performance on the validation set started to
degrade. After this fine-grade search, we can see the top 10 runs in Figure 7.
Figure 7: Top 10 runs from the final Bayesian search. Each line in the graph represents a specific model run,
characterised by specific hyperparameter values and the corresponding minimum validation loss result.
From this graph, which highlights the top 10 runs found during model selection, we can understand
that both types of classifiers (Binary Relevance and Classifier Chain) can achieve competitive results.
Dropout tend to be slightly more effective between 0.2 and 0.4, indicating that minimal regularisation
is preferable. The results indicate that smaller hidden layer sizes (equal to or less than 512) are more
common among the best runs. Furthermore, smaller learning rates are associated with a lower minimum
validation loss, which highlights the importance of precise fine-tuning of the learning rate to improve
model convergence. In any case, the most crucial hyperparameters were identified as the values of
the hidden layer size and the learning rate. By operating within the range of interest, adjusting these
parameters to more specific values led to changes in the results. The hidden layer size was identified as
the most influential factor in the search for optimal models, exhibiting a strong negative correlation
(lower values perform better) with respect to the minimum validation loss. The learning rate also
showed significant importance, with a moderate negative correlation. In contrast, the value related
to dropout did not have a significant impact, as it showed minimal importance and a low positive
correlation.
The model selection process led to the identification of the most promising hyperparameter configura-
tions. From these, three RoBEXedda models were selected for submission to the challenge (a maximum
of three candidates per team were allowed). These models have been selected by us for specific features
and are identified as “Best BR”, “Best Chain”, and “Best ICM-Soft”. Best BR and Best Chain represent
respectively the Binary Relevance model and the Classifier Chain model that obtained the best BCE
loss on the validation set. Best ICM-Soft is the model chosen for obtaining the best ICM-Soft on the
validation set (also featuring the Binary Relevance architecture).
All three RoBEXedda models share the parameters of maximum number of ‘epochs’ (15), ‘batch size’
(64), and ‘early stopping patience’ (2). The “Best ICM-Soft” model was trained with ‘learning rate’ of
3.6936026e-5, ‘training epochs’ of 4, ‘dropout’ percentage of 0.4, and ‘hidden layer size’ of 512. The “Best
Chain” model was trained with ‘learning rate’ of 1e-5, ‘training epochs’ of 7, ‘dropout’ percentage of
0.2, and ‘hidden layer size’ of 128. The “Best BR” model was trained with ‘learning rate’ of 1.8176664e-5,
‘training epochs’ of 4, ‘dropout’ percentage of 0.25, and ‘hidden layer size’ of 512. Table 2 shows a brief
summary of the selected models.
Table 2
The core features on which the models have been selected are presented in bold.
Model Description
Best ICM-Soft Model (Binary Relevance) with best ICM Soft on Validation Set
Best Chain Classifier Chain model with best BCE Loss on Validation Set
Best BR Binary Relevance model with best BCE Loss on Validation Set
8. Results
Following the selection of the models, an internal assessment was conducted to evaluate the system
performance. This section will discuss both the internal assessment phase and the scores achieved by
our models in the EXIST 2024 challenge.
8.1. Internal Assessment
Following the retraining on both the training and validation sets, the three RoBEXedda models identified
in model selection phase were evaluated on an internal test set. Tables 4 - 5 show the results of the
internal test set, averaged over five different weight initialisations.
Table 3
Results on internal test set (Soft Metrics).
ICM-Soft ICM-Soft Norm BCE
Gold Label 9.82805 1 0.24241
Best ICM-Soft -2.5544 0.370045 0.35034
Best Chain -2.70407 0.362431 0.35194
Best BR -2.56684 0.369412 0.34715
Majority Class -8.99827 0.042214 17.73328
Minority Class -40.6109 0 33.08923
Table 4
Results on internal test set (Hard Metrics).
ICM ICM Norm F1 score
Gold Label 2.41576 1 1
Best ICM-Soft 0.16134 0.533393 0.612121
Best Chain 0.102222 0.521157 0.610467
Best BR 0.0856231 0.517722 0.606697
Majority Class -1.84935 0.117231 0.110364
Minority Class -3.29148 0 0.0328426
8.2. Challenge Results
After the final assessment, the RoBEXedda models were retrained on the entire dataset, and subsequently
employed to generate the predictions on the official blind test set, submitted for our participation in
the challenge (Task 3 Soft-Soft). Table 6 shows the results of our approaches compared to the official
baselines described in Section 5 and the gold labels.
Table 5
F1 score for each class in the internal test set. The classes I-I, S-D, Obj, S-V, M-NS refer respectively to, Ideological
Inequality, Stereotyping Dominance, Objectification, Sexual Violence, Misogyny Non-Sexual Violence.
NO I-I S-D OBJ S-V M-NS
Gold Label 1 1 1 1 1 1
Best ICM-Soft 0.82795 0.62189 0.52088 0.56732 0.58133 0.55333
Best Chain 0.82510 0.60382 0.52410 0.56695 0.57788 0.56491
Best BR 0.82269 0.61063 0.52307 0.56401 0.57009 0.54966
Majority Class 0.66218 0 0 0 0 0
Minority Class 0 0 0 0 0.19705 0
In the Task 3 Soft-Soft competition, our models achieved the 4th, 5th, and 6th position in the global
ranking, ranking our team (Medusa), just behind the “NYCU-NLP” team (Task Winner), whose three
models took the 1st, 2nd and 3rd positions in the ranking. In particular, the Best ICM-Soft model achieved
the 4th position, the Best Chain model achieved the 5th position, and the Best Binary Relevance model
achieved the 6th position.
Table 6
Task 3 Soft-Soft final results.
English and Spanish Only English Only Spanish
ICM- ICM- ICM-
ICM- ICM- ICM-
Rank Model Soft Soft Soft
Soft Soft Soft
Norm Norm Norm
0 Gold Label 9.4686 1 9.1255 1 9.6071 1
1𝑠𝑡 Task Winner -1.1762 0.4379 -1.2583 0.4311 -1.1280 0.4413
4𝑡ℎ Best ICM-Soft -2.2055 0.3835 -2.0694 0.3866 -2.2859 0.3810
5𝑡ℎ Best Chain -2.4010 0.3732 -2.2945 0.3743 -2.4730 0.3713
6𝑡ℎ Best BR -2.4142 0.3725 -2.3419 0.3717 -2.4397 0.3730
28𝑡ℎ Majority Class -8.7089 0.0401 -8.2105 0.0501 -9.0314 0.0300
33𝑡ℎ Minority Class -46.1080 0 -46.9473 0 -45.4260 0
A first observation is that, although the outcomes are essentially comparable, all RoBEXedda models
demonstrate a slight advantage in English with respect to Spanish. This can be attributed to the
composition of the training data for the pretrained model, which comprised 50.9% English tweets and
14.4% Spanish tweets [30]. A second consideration is that the use of a Classifier Chain did not result
in enhanced efficacy compared to the Binary Relevance approach. One potential explanation for this
finding is that the Binary Relevance architecture is more effective in representing dependencies between
the labels in the analysed task.
In the final ranking, the Best ICMSoft model emerged as the most effective between our choices. It is
noteworthy that it was the sole model selected based on the ICM Soft measure, which is not always
synchronised with the BCE loss.
This emphasises the significance of considering alternative evaluation metrics when selecting models.
Exploring model selection based on this measure could prove to be an intriguing avenue for future
research.
9. Conclusion and future directions
Participation in the EXIST 2024 challenge aimed at categorising sexist content in tweets has provided
valuable insights into the detection and classification of online misogyny. Utilising a robust dataset
of over 10,000 tweets in both English and Spanish, we developed and evaluated three distinct neural
network models based on Binary Relevance and Classifier Chain architectures. The results demonstrate
the potential of advanced machine learning techniques in addressing the pervasive issue of online
sexism and underscore the importance of continued research and development in this critical area.
Although the results obtained are far from perfect, we believe that our analyses have nevertheless
led to interesting insights. One of them is the fact that good results in the task of classifying sexist
behaviour in social networks can be achieved with limited resources. Indeed, as mentioned above, the
development of RoBEXeddA was carried out in particularly narrow time slots within the EXIST 2024
time window, distributed on a few shared machines.
The lack of computational resources is not the sole point of improvement in our process. Indeed,
the project is open to numerous possible future developments. It would be of interest to undertake
a model selection process that could screen larger pretrained transformer models. Among potential
future additions to our project, it might be worthwhile to test other state-of-the-art techniques, such as
data augmentation and Ensemble Learning, which were not included in the challenge preparation in
favour of producing an efficient system in the shortest possible time.
An additional intriguing attempt would be to conduct a separate pre-training of the classifier head,
preceding the entire fine-tuning of the model. This is because, following our tests, training the random
initialised head required a much higher learning rate than what was allowed in the model’s finetuning.
Therefore, we could have obtained more stable training curves and encouraged the learning of an initial
representation of the dataset’s features.
References
[1] L. Plaza, J. C. de Albornoz, R. Morante, E. Amigó, J. Gonzalo, D. Spina, P. Rosso, Overview of
EXIST 2023 - learning with disagreement for sexism identification and characterization (extended
overview), in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the
Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th
to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 813–854. URL:
https://ceur-ws.org/Vol-3497/paper-070.pdf.
[2] J. Bartlett, R. Norrie, S. Patel, R. Rumpel, S. Wibberley, Misogyny on twitter (2014).
[3] L. Plaza, J. C. de Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante,
D. Spina, Overview of exist 2024 – learning with disagreement for sexism identification and
characterization in social networks and memes, in: Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF
Association (CLEF 2024), 2024.
[4] L. Plaza, J. C. de Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante,
D. Spina, Overview of exist 2024 – learning with disagreement for sexism identification and
characterization in social networks and memes (extended overview), in: G. Faggioli, N. Ferro,
P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024- Conference and Labs of
the Evaluation Forum, 2024.
[5] L. Plaza, J. Carrillo-de-Albornoz, E. Amigó, J. Gonzalo, R. Morante, P. Rosso, D. Spina, B. Chulvi,
A. Maeso, V. Ruiz, EXIST 2024: sexism identification in social networks and memes, in:
N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Ad-
vances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR
2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part V, volume 14612 of Lecture Notes in
Computer Science, Springer, 2024, pp. 498–504. URL: https://doi.org/10.1007/978-3-031-56069-9_68.
doi:10.1007/978-3-031-56069-9\_68.
[6] Clef 2024 conference and labs of the evaluation forum, 2024. https://clef2024.clef-initiative.eu/
index.php.
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,
Attention is all you need, in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus,
S. V. N. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems
30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017,
Long Beach, CA, USA, 2017, pp. 5998–6008. URL: https://proceedings.neurips.cc/paper/2017/hash/
3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[8] M. Zhang, Y. Li, X. Liu, X. Geng, Binary relevance for multi-label learning: an overview, Frontiers
Comput. Sci. 12 (2018) 191–202. URL: https://doi.org/10.1007/s11704-017-7031-7. doi:10.1007/
S11704-017-7031-7.
[9] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification,
Mach. Learn. 85 (2011) 333–359. URL: https://doi.org/10.1007/s10994-011-5256-5. doi:10.1007/
S10994-011-5256-5.
[10] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in:
D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for
Computational Linguistics, 2020, pp. 8440–8451. URL: https://doi.org/10.18653/v1/2020.acl-main.
747. doi:10.18653/V1/2020.ACL-MAIN.747.
[11] S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, J. Gao, Large language
models: A survey, CoRR abs/2402.06196 (2024). URL: https://doi.org/10.48550/arXiv.2402.06196.
doi:10.48550/ARXIV.2402.06196. arXiv:2402.06196.
[12] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. de Freitas, Taking the human out of the loop:
A review of bayesian optimization, Proc. IEEE 104 (2016) 148–175. URL: https://doi.org/10.1109/
JPROC.2015.2494218. doi:10.1109/JPROC.2015.2494218.
[13] J. Gonzalo, M. Montes-y-Gómez, P. Rosso, Iberlef 2021 overview: Natural language processing for
iberian languages, in: M. Montes, P. Rosso, J. Gonzalo, M. E. Aragón, R. Agerri, M. Á. Álvarez-
Carmona, E. Á. Mellado, J. Carrillo-de-Albornoz, L. Chiruzzo, L. A. de Freitas, H. Gómez-Adorno,
Y. Gutiérrez, S. M. J. Zafra, S. Lima, F. M. P. del Arco, M. Taulé (Eds.), Proceedings of the Iberian
Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish
Society for Natural Language Processing (SEPLN 2021), XXXVII International Conference of the
Spanish Society for Natural Language Processing., Málaga, Spain, September, 2021, volume 2943
of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 1–15. URL: https://ceur-ws.org/Vol-2943/
Overview_iberLEF_2021.pdf.
[14] A. F. M. de Paula, R. F. da Silva, I. B. Schlicht, Sexism prediction in spanish and english tweets using
monolingual and multilingual BERT and ensemble models, in: M. Montes, P. Rosso, J. Gonzalo, M. E.
Aragón, R. Agerri, M. Á. Álvarez-Carmona, E. Á. Mellado, J. Carrillo-de-Albornoz, L. Chiruzzo,
L. A. de Freitas, H. Gómez-Adorno, Y. Gutiérrez, S. M. J. Zafra, S. Lima, F. M. P. del Arco, M. Taulé
(Eds.), Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with
the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII
International Conference of the Spanish Society for Natural Language Processing., Málaga, Spain,
September, 2021, volume 2943 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 356–373.
URL: https://ceur-ws.org/Vol-2943/exist_paper2.pdf.
[15] F. M. P. del Arco, M. D. Molina-González, L. A. U. López, M. T. Martín-Valdivia, Sexism identification
in social networks using a multi-task learning system, in: M. Montes, P. Rosso, J. Gonzalo, M. E.
Aragón, R. Agerri, M. Á. Álvarez-Carmona, E. Á. Mellado, J. Carrillo-de-Albornoz, L. Chiruzzo,
L. A. de Freitas, H. Gómez-Adorno, Y. Gutiérrez, S. M. J. Zafra, S. Lima, F. M. P. del Arco, M. Taulé
(Eds.), Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with
the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII
International Conference of the Spanish Society for Natural Language Processing., Málaga, Spain,
September, 2021, volume 2943 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 491–499.
URL: https://ceur-ws.org/Vol-2943/exist_paper16.pdf.
[16] A. Jiang, A. Zubiaga, QMUL-SDS at EXIST: leveraging pre-trained semantics and lexical features
for multilingual sexism detection in social networks, in: M. Montes, P. Rosso, J. Gonzalo, M. E.
Aragón, R. Agerri, M. Á. Álvarez-Carmona, E. Á. Mellado, J. Carrillo-de-Albornoz, L. Chiruzzo,
L. A. de Freitas, H. Gómez-Adorno, Y. Gutiérrez, S. M. J. Zafra, S. Lima, F. M. P. del Arco, M. Taulé
(Eds.), Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with
the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII
International Conference of the Spanish Society for Natural Language Processing., Málaga, Spain,
September, 2021, volume 2943 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 469–483.
URL: https://ceur-ws.org/Vol-2943/exist_paper14.pdf.
[17] V. Ahuir, J. González, L. Hurtado, Enhancing sexism identification and categorization in low-data
situations, in: M. Montes-y-Gómez, J. Gonzalo, F. Rangel, M. Casavantes, M. Á. Á. Carmona, G. Bel-
Enguix, H. J. Escalante, L. A. de Freitas, A. Miranda-Escalada, F. J. Rodríguez-Sanchez, A. Rosá,
M. A. S. Cabezudo, M. Taulé, R. Valencia-García (Eds.), Proceedings of the Iberian Languages
Evaluation Forum (IberLEF 2022) co-located with the Conference of the Spanish Society for Natural
Language Processing (SEPLN 2022), A Coruña, Spain, September 20, 2022, volume 3202 of CEUR
Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3202/exist-paper5.pdf.
[18] J. A. García-Díaz, S. M. J. Zafra, R. C. Palacios, R. Valencia-García, Umuteam at EXIST 2022:
Knowledge integration and ensemble learning for multilingual sexism identification and catego-
rization using linguistic features and transformers, in: M. Montes-y-Gómez, J. Gonzalo, F. Rangel,
M. Casavantes, M. Á. Á. Carmona, G. Bel-Enguix, H. J. Escalante, L. A. de Freitas, A. Miranda-
Escalada, F. J. Rodríguez-Sanchez, A. Rosá, M. A. S. Cabezudo, M. Taulé, R. Valencia-García
(Eds.), Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2022) co-located with
the Conference of the Spanish Society for Natural Language Processing (SEPLN 2022), A Coruña,
Spain, September 20, 2022, volume 3202 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL:
https://ceur-ws.org/Vol-3202/exist-paper14.pdf.
[19] A. F. M. de Paula, G. Rizzi, E. Fersini, D. Spina, AI-UPV at EXIST 2023 - sexism character-
ization using large language models under the learning with disagreements regime, CoRR
abs/2307.03385 (2023). URL: https://doi.org/10.48550/arXiv.2307.03385. doi:10.48550/ARXIV.
2307.03385. arXiv:2307.03385.
[20] D. Effrosynidis, S. Symeonidis, A. Arampatzis, A comparison of pre-processing techniques for
twitter sentiment analysis, in: J. Kamps, G. Tsakonas, Y. Manolopoulos, L. S. Iliadis, I. Karydis
(Eds.), Research and Advanced Technology for Digital Libraries - 21st International Conference on
Theory and Practice of Digital Libraries, TPDL 2017, Thessaloniki, Greece, September 18-21, 2017,
Proceedings, volume 10450 of Lecture Notes in Computer Science, Springer, 2017, pp. 394–406. URL:
https://doi.org/10.1007/978-3-319-67008-9_31. doi:10.1007/978-3-319-67008-9\_31.
[21] L. Biewald, Experiment tracking with weights and biases, 2020. URL: https://www.wandb.com/,
software available from wandb.com.
[22] Y. Pan, Y. Li, Toward understanding why adam converges faster than SGD for transformers, CoRR
abs/2306.00204 (2023). URL: https://doi.org/10.48550/arXiv.2306.00204. doi:10.48550/ARXIV.
2306.00204. arXiv:2306.00204.
[23] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains: A review and perspectives, J. Artif.
Intell. Res. 70 (2021) 683–718. URL: https://doi.org/10.1613/jair.1.12376. doi:10.1613/JAIR.1.
12376.
[24] O. Luaces, J. Díez, J. Barranquero, J. J. del Coz, A. Bahamonde, Binary relevance efficacy
for multilabel classification, Prog. Artif. Intell. 1 (2012) 303–313. URL: https://doi.org/10.1007/
s13748-012-0030-x. doi:10.1007/S13748-012-0030-X.
[25] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: 7th International Conference
on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net,
2019. URL: https://openreview.net/forum?id=Bkg6RiCqY7.
[26] Pyevall, 2024. https://github.com/UNEDLENAR/PyEvALL.
[27] E. Amigó, A. D. Delgado, Evaluating extreme hierarchical multi-label classification, in: S. Muresan,
P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022,
Association for Computational Linguistics, 2022, pp. 5809–5819. URL: https://doi.org/10.18653/v1/
2022.acl-long.399. doi:10.18653/V1/2022.ACL-LONG.399.
[28] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, J. Mach. Learn. Res.
13 (2012) 281–305. URL: https://dl.acm.org/doi/10.5555/2503308.2188395. doi:10.5555/2503308.
2188395.
[29] sdadas/xlm-roberta-large-twitter, 2023. https://huggingface.co/sdadas/xlm-roberta-large-twitter.
[30] S. Dadas, OPI at semeval-2023 task 9: A simple but effective approach to multilingual tweet intimacy
analysis, in: A. K. Ojha, A. S. Dogruöz, G. D. S. Martino, H. T. Madabushi, R. Kumar, E. Sartori (Eds.),
Proceedings of the The 17th International Workshop on Semantic Evaluation, SemEval@ACL 2023,
Toronto, Canada, 13-14 July 2023, Association for Computational Linguistics, 2023, pp. 150–154.
URL: https://doi.org/10.18653/v1/2023.semeval-1.21. doi:10.18653/V1/2023.SEMEVAL-1.21.