=Paper=
{{Paper
|id=Vol-3740/paper-105
|storemode=property
|title=NICA at EXIST CLEF Tasks 2024
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-105.pdf
|volume=Vol-3740
|authors=Aylin Naebzadeh,Melika Nobakhtian,Sauleh Eetemadi
|dblpUrl=https://dblp.org/rec/conf/clef/NaebzadehNE24
}}
==NICA at EXIST CLEF Tasks 2024==
NICA at EXIST CLEF Tasks 2024
Notebook for the NICA group at EXIST Lab at CLEF 2024
Aylin Naebzadeh1,* , Melika Nobakhtian2 and Sauleh Eetemadi3
1
Student at School of Computer Engineering, Iran University of Science and Technology, Tehran, Islamic Republic Of Iran.
2
Student at Tehran Institute for Advanced Studies (TeIAS), Khatam University, Tehran, Islamic Republic Of Iran.
3
Assistant Professor of Computer Science, School of Computer Engineering, Iran University of Science and Technology, Tehran,
Islamic Republic Of Iran.
Abstract
In this paper, we introduce the models developed by the NICA group for the sEXism Identification in Social
neTworks (EXIST) Shared Task at CLEF 2024. Our participation spanned across five tasks: Sexism Identification
in Tweets (Task 1), Source Intention in Tweets (Task 2), Sexism Categorization in Tweets (Task 3), Sexism
Identification in Memes (Task 4), and Source Intention in Memes (Task 5). For the first three tasks, we utilized
various multi-lingual transformer models to detect sexism in English and Spanish tweets. For tasks 4 and 5, we
employed the CLIP model, which leverages both image and corresponding text data to identify sexist elements.
Our final model, as demonstrated through a comparative analysis with other transformer-based models, effectively
leverages the multi-lingual transformer models to achieve competitive performance with hard labels. Notably,
our model also yields promising results for the fourth and fifth subtasks, showcasing the efficacy of CLIP as
a multi-modal Vision and Language (V&L) model. The EXIST shared task proposed two evaluation methods:
Hard-Hard, and Soft-Soft, comparing the system’s output with the ground truth. Our team secured the 4𝑡ℎ
position in Task 5 for the Soft-Soft evaluation method, and the 9𝑡ℎ position in Task 4 for the Soft-Soft evaluation
among the participants, achieving 0.3370 and 0.4299 of the ICM metric respectively.
Keywords
Sexism Identification, Transformers, Multi-lingual Transformers, CLIP
1. Introduction
Gender-based discrimination, particularly sexism, remains a significant challenge in digital interactions,
affecting the inclusivity of online spaces. The proliferation of social media has intensified the spread of
sexist content, underscoring the need for automated detection and classification methods.
To address this issue, a series of scientific events called EXIST has been established with the objective
of comprehending sexism in its widest scope. Unlike its predecessors, EXIST 2024 expands its scope to
include image-based content, such as memes, recognizing the diverse manifestations of sexism, from
overt misogyny to subtle, implicit behaviors.
At EXIST 2024, participants are tasked with six challenges: identifying sexism in tweets and memes,
determining the author’s intention behind sexist content, and categorizing the aspects of women
targeted by sexist messages. An important proposal of the task is “The learning with disagreement
paradigm”[1] where the organizers propose to build systems that are able to consider the different
perspectives that people have when identifying sexism. For this reason, task organizers propose two
evaluation methods (Hard-Hard, and Soft-Soft). This paper details the NICA team’s approach to the
first five tasks of EXIST 2024, marking our inaugural participation. This is our first time participating in
the EXIST competition. We utilized multi-lingual transformer models for text-based sexism detection
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
*
Corresponding author.
$ a_naeb@comp.iust.ac.ir (A. Naebzadeh); M.Noubakhtian.5740@Khatam.ac.ir (M. Nobakhtian); sauleh@iust.ac.ir
(S. Eetemadi)
https://github.com/AylinNaebzadeh (A. Naebzadeh); https://github.com/MelikaNobakhtian (M. Nobakhtian);
http://www.sauleh.ir/ (S. Eetemadi)
0000-0001-9788-2764 (M. Nobakhtian)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
and the CLIP model for categorizing sexism in images. This is our first time participating in the EXIST
competition. The source code supporting our findings is available at this Github repository.
This work is structured as follows: Section 2 briefly provides a description of several earlier studies.
Section 3 will then present an explanation of tasks. Following that, Section 4 and 5 will outline the
experimental methodology and evaluation results respectively. Finally, in Section 6, we will present the
key findings and conclusions of our studies, as well as some potential directions for future research.
2. Related Works
Fundamentally, sexism identification is categorised as a subtask of abusive language detection. It
shares a close relationship with a number of abusive language detection, including racism, hate speech,
personal attacks, and others. We consider sexism identification a problem of classification, where the
models will classify which predefined labels a given content belongs to. For example, EVALITA [2] and
AMI [3] focus on the identification of misogyny, while HateEval [4] focuses on the detection of hate
speech directed against women and immigrants. In addition, shared tasks such as EDOS [5] aim to
develop more accurate and explainable systems for sexism detection, and EXIST [6, 7, 8, 9] attempts to
classify sexism according to the different facets of women that are affected. The efforts of increasing
the scope of sexism detection, contribute to a more complete understanding of the different types of
sexism and how they are expressed.
Until the recent past, machine learning techniques, like Recurrent Neural Networks (RNNs) and more
classic algorithms, have been widely adopted for the classification of social media posts as sexist or
non-sexist by capturing sequential dependencies in the text [10, 11].
With the advent of transformers since the seminal work of “Attention is All You Need” [12], the
utilization of transformers for classification tasks has garnered significant attention. Transformers have
demonstrated superior accuracy compared to their predecessors. Notably, XLM models have recently
achieved state-of-the-art performance in various benchmark tasks [13, 14] and they have emerged as a
powerful tool for text classification tasks, particularly in the domain of cross-lingual language modeling.
XLM, which stands for Cross-lingual Language Model, is specifically designed to effectively handle
multiple languages [14]. This cross-lingual capability enables XLM transformers to generalize well to
languages with limited training data, facilitating effective knowledge transfer from high-resource to
low-resource languages [15].
In EXIST 2022, ensembles of different language specific transformer models, including BERTweet-
large [16], RoBERTa [17], DeBERTa v3 [18] for English, and BETO, BERTIN [19], and RoBERTuito [20]
for Spanish, achieved the best results.
In the previous edition of EXIST, the Roh_Neil [21] team was able to win the 1𝑠𝑡 rank for Hard-Hard
evaluation method by applying XLM-RoBERTa-Large-Twitter. There were also other applications of
RoBERTuito and BETO [22] for solving this challenge which have again achieved comparative results
especially for Spanish tweets.
Sexism identification in textual data presents a significant challenge, and the complexity increases
further when dealing with multimodal content. Memes, a prominent source of misogyny and hate
speech, exemplify this challenge as they combine textual and image data. Effectively incorporating
both modalities for sexism detection remains an ongoing effort.
The introduction of SemEval-2022 Task 5: Multimedia Automatic Misogyny Identification (MAMI)
marked a significant step forward, offering one of the first meme datasets for misogyny detection [23].
Analyzing submitted approaches revealed two primary strategies:
• Text-based models: Primarily utilized BERT [24] and RoBERTa [17] for textual analysis.
• Image-based models: Mostly adopted VisualBERT [25] for image analysis.
A smaller number of submissions explored multimodal approaches, leveraging models like CLIP [26]
and ViLBERT [27] to jointly learn from both text and image data.
3. Tasks Description in EXIST 2024
While the three previous editions focused solely on detecting and classifying sexist textual messages,
this new edition incorporates new tasks that center around images, particularly memes. Memes are
images, typically humorous in nature, that are spread rapidly by social networks and Internet users. All
the five tasks in the hierarchy are classification tasks. Tasks 1, 2, 3, 4 and 5 are in the area of classification
(binary and multi-class classification) and the tasks 3 and 6 are multi-label classification (categorization).
Figure 1 shows an overview of tasks in EXIST 2024.
Figure 1: An Overview of Tasks in EXIST 2024
3.1. Task 1 - Binary Classification in Tweets
The first subtask is a binary classification. The systems must decide whether a given tweet contains
sexist expressions or behaviours (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist
behaviour), and classify it according to two categories: YES and NO.
3.2. Task 2 - Multi-class Classification in Tweets
The second subtask is a multi-class classification. For the tweets that have been predicted as sexist, the
second task aims to classify each tweet according to the intention of the person who wrote it. One of
the three following categories must be assigned to each sexist tweet:
• DIRECT
• REPORTED
• JUDGEMENTAL
3.3. Task 3 - Multi-label Classification in Tweets
The third subtask is a multi-label classification. For the tweets that have been predicted as sexist, the
third task aims to categorize them according to the type of sexism. We propose a five-class classification
task: This is a multi-label task, so that more than one of the following labels may be assigned to each
tweet:
• IDEOLOGICAL-INEQUALITY
• STEREOTYPING-DOMINANCE
• OBJECTIFICATION
• SEXUAL-VIOLENCE
• MISOGYNY-NON-SEXUAL-VIOLENCE
3.4. Task 4 - Binary Classification in Tweets in Memes
This is a binary classification task consisting of deciding whether or not a given meme is sexist.
3.5. Task 5 - Multi-class Classification in Memes
As in task 2, this task aims to categorize the meme according to the intention of the author, which
provides insights in the role played by social networks on the emission and dissemination of sexist
messages. Due to the characteristics of the memes, the REPORTED label is virtually null, so in this
task systems should only classify memes with DIRECT or JUDGEMENTAL labels.
3.6. Task 6 - Multi-label Classification in Memes
This subtask is a multi-label classification. This task aims to classify sexist memes according to the
categorization provided for Task 3: (i) IDEOLOGICAL AND INEQUALITY, (ii) STEREOTYPING
AND DOMINANCE, (iii) OBJECTIFICATION, (iv) SEXUAL VIOLENCE, and (v) MISOGYNY AND
NON-SEXUAL VIOLENCE.
4. Methodology
Our approach can be divided into two groups based on the tasks and the datasets used. For Tasks 1,
2, and 3, which involve the detection and categorization of sexism in tweets, we utilized transformer
models. Conversely, Tasks 4 and 5 required handling a completely different dataset comprising memes
the use of a multi-modal model. The methodologies for these two groups of tasks are detailed in the
following sections and illustrated in Figure 2.
4.1. Data
To conduct our experiments, we used the dataset provided by the organizers for EXIST 2024. This dataset
includes comments extracted from Twitter that may contain popular sexist expressions and terms in
both English and Spanish. A total of six annotators with diverse socio-demographic backgrounds,
such as gender (male or female) and age groups (18-22, 23-45, or 46+), labeled the dataset. Instead of
providing a single gold label for each text, the organizers supplied the labels assigned by each annotator
along with their personal information, such as gender and age, for all tasks. This approach aims to
capture the diversity of perspectives in a subjective task like sexism detection. Additionally, this version
of EXIST includes a new dataset for the newly added Tasks 4 and 5, which consist of memes—images
that contain a piece of text. The organizers provided gold labels for the training and development sets,
which we concatenated based on unique ID values to maintain consistency.
4.1.1. Tasks 1-2-3
The data is split into a training and development set with the former containing 6,920 and 1,038
respectively. The test set contains 2,076 tweets in English, Spanish, and a mix of both languages.
4.1.2. Tasks 4-5
Our data was divided into training and development sets for both tasks using a 90/10 split. This ensured
a representative distribution of English and Spanish data in both sets.
4.2. Pre-Processing
4.2.1. Tasks 1-2-3
In our study, we adopt a straightforward pre-processing approach, which aligns with the recommended
practices outlined in the usage guide of the XLM-T-10-L and other transformer based models. The
pre-processing method focuses on handling tags and URLs present in the tweets. Any user handle
encountered in the tweets is replaced with “@USER”, while URLs are replaced with “#HTTPURL”. We
refrain from implementing additional pre-processing steps in order to retain the intricacies and unique
(a) An Overview of Methodology for Tasks 1, 2 and 3.
(b) An Overview of Methodology for Tasks 4 and 5.
Figure 2: An Overview of Methodology with respect to Tasks.
characteristics inherent in the tweet data. By avoiding excessive pre-processing, we aim to preserve the
originality and nuances of the tweet content, allowing the models to capture the genuine nature of the
tweets during the subsequent analysis and classification stages.
4.2.2. Tasks 4-5
For text data, we do not apply any specific pre-processing. However, for image data, we resize the
images to dimensions of 224 x 224 pixels to ensure compatibility with the CLIP model.
Table 1
Best Results Summary
Task No. Version Language Rank # Run
All 4 1
Soft-Soft ES 4 1
EN 4 1
Task 5
All 6 1
Hard-Hard ES 4 1
EN 7 1
All 9 1
Soft-Soft ES 10 1
EN 9 1
Task 4
All 11 1
Hard-Hard ES 13 1
EN 12 1
All 12 2
Soft-Soft ES 13 2
EN 12 2
Task 3
All 13 2
Hard-Hard ES 17 1
EN 16 2
All 28 2
Soft-Soft ES 31 2
EN 28 2
Task 2
All 15 2
Hard-Hard ES 16 2
EN 12 2
All 37 1
Soft-Soft ES 38 1
EN 37 2
Task 1
All 19 1
Hard-Hard ES 18 1
EN 21 1
4.3. Model
4.3.1. Tasks 1-2-3
We decided to exclusively utilize pre-trained Transformer models based on their impressive performance
in previous editions of EXIST. To ensure both robustness and ease of implementation, we selected
HuggingFace’s [28] Trainer API for minimizing the need for custom code development. Our exper-
imentation involved a comprehensive evaluation of existing General Language and Tweet-specific
language (NLP) models. We tried several multi-lingual transformers including sdadas/xlm-roberta-large-
twitter, google-bert/bert-base-multilingual-uncased, distilbert/distilbert-base-multilingual-cased[29],
and FacebookAI/xlm-roberta-base, with common hyperparameters that are considered best practices.
However, hardware limitations posed significant challenges. For example, we were unable to load the
ai-forever/mGPT [30] model, as our code had crashed during the training phase due to its extensive
number of parameters.
Table 2
Hyperparameter Settings for Our Systems in Tasks 1, 2 and 3
Task Run Model Epochs Train Batch size Eval Batch size LR Weight Decay Val F1
1 - DBMLC 5 8 8 2𝑒 − 5 0.01 0.7761
1 3 DBMLC 5 8 8 3𝑒 − 5 0.01 0.7812
1 - DBMLC 5 8 8 4𝑒 − 5 0.01 0.7471
1 - XLM-B 5 8 5 2𝑒 − 5 0.01 0.8268
1 1 XLM-L-T 4 8 16 2𝑒 − 5 0.01 0.8527
1 - BBMUC 5 8 16 2𝑒 − 5 0.01 0.8123
1 2 BBMUC 5 8 8 3𝑒 − 5 0.01 0.8980
1 - BBMUC 5 8 8 4𝑒 − 5 0.01 0.8621
2 1 XLM-L-T 4 8 16 2𝑒 − 5 0.01 0.4894
2 2 BBMUC 4 8 8 1𝑒 − 5 0.01 0.7342
3 1 XLM-L-T 4 8 16 2𝑒 − 5 0.01 0.5449
3 2 BBMUC 4 8 8 3𝑒 − 5 0.01 0.5849
*DBMLC → DistilBert Base Multilingual Cased.
*XLM-L-T → XLM Roberta Large Twitter.
*XLM-B → XLM Roberta Base.
*BBMUC → Bert Base Multilingual Uncased.
For these three tasks, we send the following runs:
• Run 1: The model for this run is sdadas/xlm-roberta-large-twitter. The hyper-parameter setting
for this model is 2𝑒 − 5 for learning rate, 0.01 for weight decay, 8 and 16 for train and validation
batch size, respectively.
• Run 2: The model for this run is google-bert/bert-base-multilingual-uncased. The hyper-parameter
setting for this model is 3𝑒 − 5 for learning rate, 0.01 for weight decay, 8 and 8 for both train and
validation batch sizes.
• Run 3: The model for this run is distilbert/distilbert-base-multilingual-cased. The hyper-parameter
setting for this model is 3𝑒 − 5 for learning rate, 0.01 for weight decay, 8 and 8 for both train and
validation batch sizes.
Table 2 shows the hyperparameter values for each experiment with the computed F1-Score on
validation dataset. One of the key observations from our experiments is that model google-bert/bert-
base-multilingual-uncased consistently outperforms other models, especially in tasks 2 and 3. This
performance discrepancy can largely be attributed to the simplicity and reduced number of parameters
in model google-bert/bert-base-multilingual-uncased. Given the limited amount of input data available
for training, simpler models with fewer parameters tend to generalize better, as they are less prone to
overfitting. Overfitting occurs when a model learns the noise and details in the training data to the
extent that it negatively impacts the model’s performance on new, unseen data.
4.3.2. Tasks 4-5
We employed the CLIP model [26] to achieve joint learning from textual and image modalities. CLIP
encodes both the image and its corresponding text into embedding vectors. These embeddings were
then concatenated to form a combined representation. A subsequent linear layer was applied to this
combined representation to predict the most probable class for the given task.
For Task 4, we adopted the openai/clip-vit-base-patch32 model, while Task 5 employed the openai/clip-
vit-large-patch14 model.
5. Evaluation
The shared task results are evaluated in three different scenarios, we focus on HARD-HARD for the first
three tasks, but for the tasks 4 and 5 we concentrate on Hard-Hard and Soft-Soft. For the HARD-HARD
evaluation, the gold label consists of the class annotated by the majority of annotators. Any items with
no majority class are removed from the evaluation. Here, the model produces a single label, which is
compared to the gold label. In the SOFT-SOFT evaluation, the system provides probabilities for each
class, and this distribution is evaluated against the distribution of the annotator decisions. The official
metric for the shared task is the “Information Contrast Measure” (ICM) [31]. A description of evaluation
metrics can be viewed in Table 3. Our models achieved notable success, ranked 4𝑡ℎ place in Task 5 and
9𝑡ℎ place in Task 4 for the Soft-Soft evaluation, with ICM scores of 0.3370 and 0.4299, respectively. A
brief summary of our best results for each task can be found in Table 1, but a complete overview of the
final results of this study’s submissions with all the computed metrics can be found in tables 4 to 8
which have been provided by organizers in leader boards.
Table 3
Description of Evaluation Metrics
Metric Description
ICM Information Contrast Measure (ICM) is a similarity function used to evaluate the outputs of classification
systems in hierarchical classification tasks. It generalizes Pointwise Mutual Information (PMI) and
measures the resemblance between the system’s output and the ground truth labels. Higher values of
ICM indicate better performance.
ICM-Soft ICM-Soft is an evaluation metric that compares the categories assigned by the system with the probabilities
assigned to each category in the ground truth. It considers the distribution of labels and the number
of annotators assigned to each instance to determine the probability of the classes. Higher values of
ICM-Soft indicate better performance.
ICM-Hard ICM-Hard evaluation involves comparing the system’s "hard" output with the hard ground truth labels.
A probabilistic threshold is employed to extract the hard-labels from the ground truth, considering the
approval of multiple annotators for each task. Only the most popularly labeled classes are included in this
evaluation. Higher values of ICM-Hard indicate better performance.
ICM-Soft Norm ICM-Soft Norm is a normalized version of ICM-Soft that takes into account the number of annotators
assigned to each instance and adjusts the probabilities accordingly. It handles instances labeled as
"UNKNOWN" by reducing the number of annotators considered based on the count of "UNKNOWN"
labels associated with them. Higher values of ICM-Soft Norm indicate better performance.
ICM-Hard Norm ICM-Hard Norm is a normalized version of ICM-Hard that adjusts the hard-labels based on the number
of annotators assigned to each instance. It considers instances labeled as "UNKNOWN" and adjusts the
threshold for label extraction accordingly. Higher values of ICM-Hard Norm indicate better performance.
F1 Score F1 Score is a commonly used evaluation metric in classification tasks. It is the harmonic mean of precision
and recall, weighted by the same values. The F1 score treats false positives and false negatives equally,
assuming that both types of errors have the same consequences. Higher values of F1 Score indicate better
performance.
Cross Entropy Cross Entropy is a metric used to measure the difference between the predicted probabilities and the true
probabilities. It quantifies the average amount of information needed to identify the true class given the
predicted probabilities. Lower values of Cross Entropy indicate better model performance.
6. Conclusion and Future Works
Sexism in digital interactions remains a widespread issue, significantly impacting the creation of
inclusive and respectful online environments. The proliferation of social media has amplified the spread
of sexist content, highlighting the urgent need for automated detection and classification methods. Our
work in the EXIST CLEF 2024 shared task showcases the potential of multi-lingual transformer models
and the CLIP model in identifying and understanding sexism and source intention in social media
content. Despite the challenges and limitations of our hardware resources, our models achieved notable
success. Moving forward, we plan to implement various pre-processing techniques, data augmentation
strategies, and explore newer transformer models, as well as innovative classification approaches like
few-shot [32] and zero-shot learning [33]. Additionally, we will consider using more interpretable
machine learning algorithms, such as XGBoost [34], and ensemble methods [35]. By evaluating and
comparing these diverse methods, we hope to enhance the robustness and accuracy of our models.
Acknowledgments
We would like to express our deep appreciation to the organizers of EXIST CLEF 2024 for providing us
with the opportunity to participate in this esteemed event. It was all a great and valuable experience to
work with the provided data, and gain new knowledge. We are truly grateful for the chance to present
our work and contribute to the progress of our field. We would like to once again acknowledge and
appreciate the support and encouragement received from all individuals involved, as their contributions
have been instrumental in our accomplishments.
References
[1] A. N. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, , M. Poesio, Learning from disagreement: A
survey, Journal of Artificial Intelligence Research 38 (2021) 1385–1470.
[2] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, et al., Evalita 2023: Overview
of the 8th evaluation campaign of natural language processing and speech tools for italian, in:
Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools
for Italian. Final Workshop (EVALITA 2023), CEUR. org, Parma, Italy, 2023.
[3] S. Chen, U. Naseem, I. Razzak, F. Salim, Unveiling misogyny memes: A multimodal analysis of
modality effects on identification, in: Companion Proceedings of the ACM on Web Conference
2024, 2024, pp. 1864–1871.
[4] T. Sen, A. Das, M. Sen, Hatetinyllm: Hate speech detection using tiny large language models,
arXiv preprint arXiv:2405.01577 (2024).
[5] H. R. Kirk, W. Yin, B. Vidgen, P. Röttger, Semeval-2023 task 10: explainable detection of online
sexism, arXiv preprint arXiv:2303.04222 (2023).
[6] L. Plaza, J. Carrillo-de Albornoz, R. Morante, E. Amigó, J. Gonzalo, D. Spina, P. Rosso, Overview
of exist 2023–learning with disagreement for sexism identification and characterization, in:
International Conference of the Cross-Language Evaluation Forum for European Languages,
Springer, 2023, pp. 316–342.
[7] F. Rodríguez-Sánchez, J. Carrillo-de Albornoz, L. Plaza, A. Mendieta-Aragón, G. Marco-Remón,
M. Makeienko, M. Plaza, J. Gonzalo, D. Spina, P. Rosso, Overview of exist 2022: sexism identification
in social networks, Procesamiento del Lenguaje Natural 69 (2022) 229–240.
[8] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identifi-
cation and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilin-
guality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of
the CLEF Association (CLEF 2024), 2024.
[9] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identifi-
cation and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli,
N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 – Conference
and Labs of the Evaluation Forum, 2024.
[10] A. Kalra, A. Zubiaga, Sexism identification in tweets and gabs using deep neural networks, arXiv
preprint arXiv:2111.03612 (2021).
[11] M. Buzzell, J. Dickinson, N. Singh, S. Kübler, Iu-nlp-jedi: investigating sexism detection in english
and spanish, Working Notes of CLEF (2023).
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
Attention is all you need, Advances in neural information processing systems 30 (2017).
[13] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv
preprint arXiv:1911.02116 (2019).
[14] G. Lample, A. Conneau, Cross-lingual language model pretraining, arXiv preprint arXiv:1901.07291
(2019).
[15] M. Artetxe, S. Ruder, D. Yogatama, On the cross-lingual transferability of monolingual representa-
tions, arXiv preprint arXiv:1910.11856 (2019).
[16] D. Q. Nguyen, T. Vu, A. T. Nguyen, Bertweet: A pre-trained language model for english tweets,
arXiv preprint arXiv:2005.10200 (2020).
[17] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[18] P. He, J. Gao, W. Chen, Debertav3: Improving deberta using electra-style pre-training with
gradient-disentangled embedding sharing, arXiv preprint arXiv:2111.09543 (2021).
[19] J. De la Rosa, E. G. Ponferrada, P. Villegas, P. G. d. P. Salas, M. Romero, M. Grandury, Bertin:
Efficient pre-training of a spanish language model using perplexity sampling, arXiv preprint
arXiv:2207.06814 (2022).
[20] J. M. Pérez, D. A. Furman, L. A. Alemany, F. Luque, Robertuito: a pre-trained language model for
social media text in spanish, arXiv preprint arXiv:2111.09453 (2021).
[21] R. Koonireddy, N. Adel, Roh_neil@ exist2023: detecting sexism in tweets using multilingual
language models, Working Notes of CLEF (2023).
[22] H. Asnani, A. Davis, A. Rajanala, S. Kübler, Tlatlamiztli: fine-tuned robertuito for sexism detection,
Working Notes of CLEF (2023).
[23] E. Fersini, F. Gasparini, G. Rizzi, A. Saibene, B. Chulvi, P. Rosso, A. Lees, J. Sorensen, SemEval-2022
task 5: Multimedia automatic misogyny identification, in: G. Emerson, N. Schluter, G. Stanovsky,
R. Kumar, A. Palmer, N. Schneider, S. Singh, S. Ratan (Eds.), Proceedings of the 16th International
Workshop on Semantic Evaluation (SemEval-2022), Association for Computational Linguistics,
Seattle, United States, 2022, pp. 533–549. URL: https://aclanthology.org/2022.semeval-1.74. doi:10.
18653/v1/2022.semeval-1.74.
[24] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
for language understanding, 2019. arXiv:1810.04805.
[25] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, K.-W. Chang, Visualbert: A simple and performant baseline
for vision and language, 2019. arXiv:1908.03557.
[26] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language
supervision, 2021. arXiv:2103.00020.
[27] J. Lu, D. Batra, D. Parikh, S. Lee, Vilbert: Pretraining task-agnostic visiolinguistic representations
for vision-and-language tasks, 2019. arXiv:1908.02265.
[28] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun-
towicz, et al., Huggingface’s transformers: State-of-the-art natural language processing, arXiv
preprint arXiv:1910.03771 (2019).
[29] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter, ArXiv abs/1910.01108 (2019).
[30] O. Shliazhko, A. Fenogenova, M. Tikhonova, V. Mikhailov, A. Kozlova, T. Shavrina, mgpt: Few-shot
learners go multilingual, arXiv preprint arXiv:2204.07580 (2022).
[31] E. Amigo, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: Proceedings
of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), 2022, pp. 5809–5819.
[32] Y. Wang, Q. Yao, J. T. Kwok, L. M. Ni, Generalizing from a few examples: A survey on few-shot
learning, ACM computing surveys (csur) 53 (2020) 1–34.
[33] W. Wang, V. W. Zheng, H. Yu, C. Miao, A survey of zero-shot learning: Settings, methods, and
applications, ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2019) 1–37.
[34] T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm
sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794.
[35] T. G. Dietterich, Ensemble methods in machine learning, in: International workshop on multiple
classifier systems, Springer, 2000, pp. 1–15.
A. Online Resources
The source code and the final submission files can be accessed through the following official GitHub
repository for EXIST2024:
• Github for EXIST2024 NICA
B. Evaluation Results Provided in Leader board
Table 4
Task 1 Evaluation Results
# Run Rank ICM-Soft ICM-Hard ICM-Soft Norm ICM-Hard Norm Cross Entropy F1-Score Lang
EXIST2024-test_gold 0 3.1182 - 1.0000 - 0.5472 - All
3 37 -2.8848 - 0.0374 - 1.5286 - All
2 38 -2.8848 - 0.0374 - 1.3862 - All
1 39 -2.8848 - 0.0374 - 1.2301 - All
EXIST2024-test_gold 0 - 0.9948 - 1.0000 - 1.0000 All
1 19 - 0.5214 - 0.7621 - 0.7642 All
3 37 - 0.4358 - 0.7191 - 0.7429 All
2 43 - 0.3750 - 0.6885 - 0.7263 All
EXIST2024-test_gold 0 3.1177 - 1.0000 - 0.5208 - ES
1 38 -2.8670 - 0.0402 - 1.1916 - ES
3 39 -2.8671 - 0.0402 - 1.5302 - ES
2 40 -2.8671 - 0.0402 - 1.3733 - ES
EXIST2024-test_gold 0 - 0.9999 - 1.0000 - 1.0000 ES
1 18 - 0.5156 - 0.7578 - 0.7852 ES
3 36 - 0.4223 - 0.7112 - 0.7627 ES
2 44 - 0.3517 - 0.6759 - 0.7419 ES
EXIST2024-test_gold 0 3.1141 - 1.0000 - 0.5770 - EN
2 37 -2.9063 - 0.0334 - 1.4007 - EN
3 38 -2.9064 - 0.0333 - 1.5268 - EN
1 39 -2.9064 - 0.0333 - 1.2734 - EN
EXIST2024-test_gold 0 - 0.9798 - 1.0000 - 1.0000 EN
1 21 - 0.5122 - 0.7614 - 0.7362 EN
3 35 - 0.4393 - 0.7242 - 0.7176 EN
2 38 - 0.3871 - 0.6975 - 0.7057 EN
Table 5
Task 2 Evaluation Results
# Run Rank ICM-Soft ICM-Hard ICM-Soft Norm ICM-Hard Norm Cross Entropy F1-Score Lang
EXIST2024-test_gold 0 6.2057 - 1.0000 - 0.9128 - All
2 28 -5.7592 - 0.0360 - 2.7026 - All
EXIST2024-test_gold 0 - 1.5378 - 1.0000 - 1.0000 All
2 15 - 0.1506 - 0.5490 - 0.4738 All
1 40 - -0.9504 - 0.1910 - 0.1603 All
EXIST2024-test_gold 0 6.2431 - 1.0000 - 0.8926 - ES
2 31 -5.7501 - 0.0395 - 2.6715 - ES
EXIST2024-test_gold 0 - 1.6007 - 1.0000 - 1.0000 ES
2 16 - 0.1567 - 0.5490 - 0.4904 ES
1 40 - -1.0391 - 0.1754 - 0.1545 ES
EXIST2024-test_gold 0 6.1178 - 1.0000 - 0.9354 - EN
2 28 -5.7285 - 0.0318 - 2.7374 - EN
EXIST2024-test_gold 0 - 1.4449 - 1.0000 - 1.0000 EN
2 12 - 0.1213 - 0.5420 - 0.4516 EN
1 39 - -0.8529 - 0.2048 - 0.1667 EN
Table 6
Task 3 Evaluation Results
# Run Rank ICM-Soft ICM-Hard ICM-Soft Norm ICM-Hard Norm Cross Entropy F1-Score Lang
EXIST2024-test_gold 0 9.4686 - 1.0000 - - - All
2 12 -4.4324 - 0.2659 - - - All
EXIST2024-test_gold 0 - 2.1533 - 1.0000 - 1.0000 All
2 13 - -0.2383 - 0.4447 - 0.4564 All
1 19 - -0.3258 - 0.4243 - 0.3867 All
EXIST2024-test_gold 0 9.6071 - 1.0000 - - - ES
2 13 -4.5491 - 0.2632 - - - ES
EXIST2024-test_gold 0 - 2.2393 - 1.0000 - 1.0000 ES
1 17 - -0.3007 - 0.4329 - 0.4023 ES
2 18 - -0.3611 - 0.4194 - 0.4262 ES
EXIST2024-test_gold 0 9.1255 - 1.0000 - - - EN
2 12 -4.3081 - 0.2640 - - - EN
EXIST2024-test_gold 0 - 2.0402 - 1.0000 - 1.0000 EN
2 16 - -0.1145 - 0.4719 - 0.4837 EN
1 19 - -0.3659 - 0.4103 - 0.3645 EN
Table 7
Task 4 Evaluation Results
# Run Rank ICM-Soft ICM-Hard ICM-Soft Norm ICM-Hard Norm Cross Entropy F1-Score Lang
EXIST2024-test_gold 0 3.1107 - 1.0000 - 0.5852 - All
1 9 -0.4360 - 0.4299 - 0.9278 - All
EXIST2024-test_gold 0 - 0.9832 - 1.0000 - 1.0000 All
1 11 - 0.0767 - 0.5390 - 0.7248 All
EXIST2024-test_gold 0 3.1360 - 1.0000 - 0.6160 - ES
1 10 -0.5939 - 0.4053 - 0.9610 - ES
EXIST2024-test_gold 0 - 0.9815 - 1.0000 - 1.0000 ES
1 13 - -0.0086 - 0.4956 - 0.7137 ES
EXIST2024-test_gold 0 3.0794 - 1.0000 - 0.5528 - EN
1 9 -0.2959 - 0.4520 - 0.8929 - EN
EXIST2024-test_gold 0 - 0.9848 - 1.0000 - 1.0000 EN
1 12 - 0.1612 - 0.5818 - 0.7379 EN
Table 8
Task 5 Evaluation Results
# Run Rank ICM-Soft ICM-Hard ICM-Soft Norm ICM-Hard Norm Cross Entropy F1-Score Lang
EXIST2024-test_gold 0 4.7018 - 1.0000 - 0.9325 - All
1 4 -1.5329 - 0.3370 - 1.4664 - All
EXIST2024-test_gold 0 - 1.4383 - 1.0000 - 1.0000 All
1 6 - -0.2881 - 0.3999 - 0.3837 All
EXIST2024-test_gold 0 4.8140 - 1.0000 - 0.9365 - ES
1 4 -1.7405 - 0.3192 - 1.4800 - ES
EXIST2024-test_gold 0 - 1.4356 - 1.0000 - 1.0000 ES
1 4 - -0.2668 - 0.4071 - 0.3771 ES
EXIST2024-test_gold 0 4.5834 - 1.0000 - 0.9282 - EN
1 4 -1.3812 - 0.3493 - 1.4521 - EN
EXIST2024-test_gold 0 - 1.4409 - 1.0000 - 1.0000 EN
1 7 - -0.3123 - 0.3916 - 0.3860 EN