=Paper=
{{Paper
|id=Vol-3756/DETESTS-Dis2024_paper1
|storemode=property
|title=I2C-Huelva at IberLEF-2024 DETESTS-Dis: Learning from Divergence to Identify Explicit and Implicit Racial Stereotypes in Spanish Texts
|pdfUrl=https://ceur-ws.org/Vol-3756/DETESTS-Dis2024_paper1.pdf
|volume=Vol-3756
|authors=Manuel Cerrejón-Naranjo,Manuel Guerrero-García,Jacinto Mata-Vázquez,Victoria Pachón-Álvarez
|dblpUrl=https://dblp.org/rec/conf/sepln/Cerrejon-Naranjo24
}}
==I2C-Huelva at IberLEF-2024 DETESTS-Dis: Learning from Divergence to Identify Explicit and Implicit Racial Stereotypes in Spanish Texts==
Manuel Cerrejón-Naranjo1,∗ , Manuel Guerrero-García1 , Jacinto Mata-Vázquez1 and
Victoria Pachón-Álvarez1
1 I2C Research Group, University of Huelva, Spain
Abstract
This paper presents the approaches developed for detecting and identifying racial stereotypes in Spanish
texts using advanced Natural Language Processing (NLP) and Deep Learning techniques, incorporating
Learning with Disagreement for enhanced robustness. The major contribution of this work is the
demonstration of the effectiveness of transformer-based ensemble classifiers in recognizing both explicit
and implicit stereotypes. By leveraging the strengths of multiple models, the proposed method achieves
better performance than any single model alone. Additionally, the results highlighted the importance of
selecting appropriate hyperparameters during model training: through rigorous experimentation and
evaluation, an optimal hyperparameter combination was identified.
In our experiments, we utilized a preprocessed and annotated corpus of Spanish texts and applied data
augmentation techniques, such as back-translation, to balance the dataset. Furthermore, we incorporated
the "Learning with Disagreement" (LeWiDi) approach, which leverages the discrepancies between different
annotators to improve the classification system. The results obtained demonstrate significant improvements
in F1-Score, underscoring the potential application of these methods in moderating content on social
media and other digital platforms. With this strategy, we achieved second place in Task 1 using an
ensemble consisting of three models, one for each annotator, based on RoBERTa. In Task 2, we reached
seventh position using the same approach.
Keywords
Learning with Disagreement, Stereotypes, Natural Language Processing, Deep Learning, Transformers,
Data Augmentation, Hyperparameter Optimization, Ensemble
1. Introduction
Racial stereotypes are oversimplified and generalized beliefs about individuals based on their
perceived social group membership [1]. In contemporary society, social media platforms have
become a prominent venue for expressing opinions, including those related to immigration,
a topic of significant public interest. However, the anonymity and reach of these platforms
have also facilitated the proliferation of toxic and stereotypical comments. The ability to detect
and classify such stereotypes, whether explicit or implicit, is crucial for mitigating bias and
promoting healthier online discourse.
IberLEF 2024, September 2024, Valladolid, Spain
manuel.cerrejon886@alu.uhu.es (M. Cerrejón-Naranjo); manuel.guerrero790@alu.uhu.es (M. Guerrero-García);
mata@uhu.es (J. Mata-Vázquez); vpachon@dti.uhu.es (V. Pachón-Álvarez)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
In this paper, we present our systems developed for the DETESTS-Dis (DETEction and
classification of racial STereotypes in Spanish - Learning with Disagreement) shared task at
IberLEF 2024 [2, 3], held at the SEPLN 2024 conference. The goal of this task is to detect and
classify both explicit and implicit stereotypes in Spanish texts from social media and news
article comments. This edition incorporates the Learning with Disagreement paradigm by
providing both hard and pre-aggregated labels to manage annotator disagreements.
Given the recent advancements in pre-trained language models for text classification, our
approach centers on fine-tuning these models to adapt them to the task of stereotype detec-
tion. Additionally, we address the challenge of class imbalance using various data balancing
techniques. By leveraging ensemble methods, we aim to improve the F1-Score of stereotype
detection in our models.
The paper is structured as follows: Section 2 reviews related works in stereotype detection
and classification. Section 3 describes the dataset provided by the task organizers and outlines
its use in our experiments. Section 4 details our methodological approach and experimental
setup, followed by the results obtained. Section 5 discusses the submitted runs and evaluation
findings. Section 6 provides an error analysis. Finally, Section 7 presents our conclusions and
potential future directions for research.
2. Related works
Numerous studies have targeted stereotype detection and classification within specific social
groups, including women and immigrants. For instance, Fersini et al. [4] introduced the Auto-
matic Misogyny Identification task, which included a subtask on stereotype and objectification
of women. Similarly, Rodríguez-Sánchez et al. [5] developed the EXIST dataset to address sexism
in social networks, exploring machine learning and deep learning techniques for automatic
detection of sexist expressions and attitudes on Twitter.
Chiril et al. [6] and Cryan et al. [7] further investigated the detection of gender stereotypes,
contributing valuable datasets and methodologies. Fokkens et al. [8] studied microportraits
to identify stereotypes in narratives about Muslim individuals, while Sap et al. [9] examined
social bias frames driven by stereotypes. Sanguinetti et al. [10] expanded this work to include
stereotypes about immigrants, Muslims, and Roma in the HaSpeeDe 2 task.
Specifically, Sánchez-Junquera et al. [11] focused on the classification of stereotypes related
to immigration within political debates, highlighting the complexity of implicitly expressed
stereotypes. The first DETESTS task [12] made significant strides in detecting and classifying
stereotypes in Spanish texts, setting a precedent for the current DETESTS-Dis task. This task
aims to further advance the field by incorporating learning with disagreement techniques to
better handle annotator disagreement and enhance detection accuracy.
Moreover, Leonardelli et al. [13] explored the potential of integrating disagreement learning
to improve stereotype detection systems, providing a framework that is particularly relevant for
the DETESTS-Dis task. Uma et al. [14] also contributed to this paradigm, demonstrating how
leveraging annotator disagreement can lead to more robust and reliable classification models.
These previous works underscore the importance and challenge of detecting and classifying
stereotypes in various contexts. The DETESTS-Dis task builds upon these foundations, aiming
to address both explicit and implicit stereotypes in social media and news comments, thereby
contributing to the growing body of research in this critical area.
3. Dataset Description and Task Objectives
The training dataset provided by the organizers contains 9906 comments drawn from two different
sources, Detests and Stereohoax. There are 18 attributes available, namely: source, id, comment_id,
text, level1, level2, level3, level4, stereotype_a1, stereotype_a2, stereotype_a3, stereotype,
stereotype_soft, implicit_a1, implicit_a2, implicit_a3, implicit, and implicit_soft. The dataset was
split into training (70%), validation (20%), and test (10%) sets while maintaining class stratification.
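As an illustration, such a two-step stratified split could be performed as in the following sketch; df and the label column name are our assumptions, not the organizers' code:

```python
from sklearn.model_selection import train_test_split

# df is a hypothetical DataFrame holding the 9906 annotated comments.
# Step 1: split off 70% for training, stratifying on the hard label.
train_df, rest_df = train_test_split(
    df, test_size=0.30, stratify=df["stereotype"], random_state=42
)
# Step 2: split the remaining 30% into validation (20%) and test (10%).
valid_df, test_df = train_test_split(
    rest_df, test_size=1 / 3, stratify=rest_df["stereotype"], random_state=42
)
```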
• The objective of Subtask 1 is to determine if the sentences in a comment contain at least
one stereotype or none, considering the complete distribution of provided labels.
Table 1
Example instances of Subtask 1
id text st_a1 st_a2 st_a3 st
s_77 Seguidamente va a reventar. Apretando mucho las tuercas 0 0 0 0
d_684_01 600000 ilegales y no pasa nada 1 1 1 1
Table 2
Class Distribution of Subtask 1
Class Train Dataset Valid Dataset Test Dataset
0 5110 1461 730
1 1823 521 261
Total 6933 1982 991
For the second task, the dataset contains 2605 comments, with the same labels as the first
task.
• For Subtask 2, the objective is to determine whether the stereotype in a comment
is expressed explicitly or implicitly, considering the complete distribution of provided
labels.
Table 3
Example instances of Subtask 2
id text im_a1 im_a2 im_a3 im
s_3315 Joder... Ahora todo áfrica aquí?? Que asco 1 1 0 1
d_684_01 600000 ilegales y no pasa nada 0 0 0 0
Table 4
Class Distribution of Subtask 2
Class Train Dataset Valid Dataset Test Dataset
0 891 262 126
1 932 259 135
Total 1823 521 261
Figure 1: Methodology Strategy
Tables 1 and 3 show example instances from the Task 1 and Task 2 datasets, respectively.
Tables 2 and 4 show the class distribution for each task after the split into training, validation,
and test sets.
4. Methodology
Addressing the challenges of the competition, we adopted a systematic methodology, comprising
several pivotal steps. Given that the competition dataset is in Spanish, our approach primarily
leveraged pretrained models tailored for the Spanish language. Additionally, the computations
were performed using an NVIDIA RTX 4070 GPU, ensuring efficient processing and model
training. Figure 1 illustrates the strategy followed to solve both tasks.
4.1. Data Splitting By Source
First, the provided dataset was split into two distinct sources: Detests and Stereohoax. This
division allowed the data to be handled and analyzed according to its origin.
4.2. Split Data By Annotators
Since we opted for the Learning with Disagreement strategy, we subdivided each dataset into
subsets based on the annotator labels stereotype_a1, stereotype_a2, and stereotype_a3 for
the first subtask, and implicit_a1, implicit_a2, and implicit_a3 for the second subtask. This
segmentation facilitated a structured approach to addressing annotator discrepancies
across both tasks.
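A minimal pandas sketch of this per-source, per-annotator split (the file name and the source column values are hypothetical; column names follow the attribute list in Section 3):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path to the provided training data

# One training set per (source, annotator) pair: keep the text and that
# annotator's label, renamed to a common "label" column.
annotator_sets = {}
for source in ("detests", "stereohoax"):  # assumed source identifiers
    source_df = df[df["source"] == source]
    for ann in ("a1", "a2", "a3"):
        subset = source_df[["text", f"stereotype_{ann}"]].rename(
            columns={f"stereotype_{ann}": "label"}
        )
        annotator_sets[(source, ann)] = subset
```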
4.3. Models
As the texts are in Spanish, we decided to use only models pre-trained on Spanish. The
pre-trained models selected, obtained from the Huggingface transformers library
(https://huggingface.co/), were:
• dccuchile/bert-base-spanish-wwm-uncased [15]. This model (BETO) is a Spanish version
of BERT.
• PlanTL-GOB-ES/roberta-base-bne [16]. This model is based on the RoBERTa base model
and has been pre-trained on the largest Spanish corpus known to date.
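Both checkpoints can be loaded for binary classification through the transformers Auto classes; a brief sketch:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAMES = [
    "dccuchile/bert-base-spanish-wwm-uncased",  # BETO
    "PlanTL-GOB-ES/roberta-base-bne",           # RoBERTa-BNE
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAMES[1])
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAMES[1], num_labels=2  # binary: stereotype vs. no stereotype
)
```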
4.4. Baseline
To compare the results obtained with the different models and strategies developed, a starting
point was established using selected pretrained models. Since the optimal hyperparameter
values cannot be known in advance, some of the most commonly used values were employed:
a batch size of 16, a learning rate of 5e-5, a maximum length of 128, and a weight decay of
0.1. The training datasets were used without any additional processing, i.e., as provided by the
competition. Tables 5 and 6 show the baseline results on different models for each task.
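As a minimal sketch, this baseline setup can be expressed with the Huggingface Trainer API; model is the classifier loaded above, and train_ds and valid_ds stand for hypothetical tokenized datasets built from the splits described earlier:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="baseline",
    per_device_train_batch_size=16,  # batch size of 16
    learning_rate=5e-5,
    weight_decay=0.1,
)
# The maximum length of 128 is applied at tokenization time, e.g.:
# tokenizer(texts, truncation=True, padding="max_length", max_length=128)
trainer = Trainer(
    model=model, args=args, train_dataset=train_ds, eval_dataset=valid_ds
)
trainer.train()
```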
Table 5
Baseline Results in Subtask 1
F1 Score
Source Model Baseline
DETESTS BETO A1 0.7561
BETO A2 0.7417
BETO A3 0.7710
RoBERTa A1 0.7668
RoBERTa A2 0.7955
RoBERTa A3 0.7919
STEREOHOAX BETO A1 0.8496
BETO A2 0.8331
BETO A3 0.8412
RoBERTa A1 0.8643
RoBERTa A2 0.8389
RoBERTa A3 0.8580
Table 6
Baseline Results in Subtask 2
F1 Score
Source Model Baseline
DETESTS BETO A1 0.5016
BETO A2 0.5137
BETO A3 0.4797
RoBERTa A1 0.5394
RoBERTa A2 0.4942
RoBERTa A3 0.5842
STEREOHOAX BETO A1 0.6537
BETO A2 0.7317
BETO A3 0.7021
RoBERTa A1 0.6532
RoBERTa A2 0.7610
RoBERTa A3 0.6343
4.5. Data Pre-processing
An exhaustive preprocessing of the data was performed to clean and normalize it. The prepro-
cessing stages included:
• Conversion of uppercase to lowercase: To avoid differences caused by capitalization.
• Removal of mentioned users: Elimination of user mentions preceded by ’@’.
• Removal of URLs and links: To clean the text of irrelevant external links.
• Removal of hashtags: Only the ’#’ symbol was removed to preserve relevant words.
• Removal of emoticons: To eliminate graphical elements that do not contribute to the
textual analysis.
Table 7 shows the result of applying this preprocessing.
Table 7
Example of Pre-processing Text
Original Permiso de residencia para extranjeras víctimas de violencia de género. #abogado
#extranjeria #residencia... URL
Pre-processing permiso de residencia para extranjeras víctimas de violencia de género. abogado
extranjeria residencia ... url
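A sketch of these cleaning steps is shown below; the exact regular expressions are our assumptions, not the only possible implementation:

```python
import re

# Rough emoji/emoticon ranges; a dedicated library could be used instead.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(text: str) -> str:
    text = text.lower()                                # uppercase -> lowercase
    text = re.sub(r"@\w+", "", text)                   # remove @user mentions
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs and links
    text = text.replace("#", "")                       # drop '#', keep the word
    text = EMOJI_RE.sub("", text)                      # remove emoticons
    return re.sub(r"\s+", " ", text).strip()           # normalize whitespace
```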
Tables 8 and 9 show the results achieved after preprocessing the texts from the comments. As
can be seen, this preprocessing improved the results obtained with the baselines.
4.6. Data Augmentation
Given the imbalance in the class distribution, we opted to address it by employing the back-
translation method to augment the dataset. For the primary task, the scarcity of class 1 instances,
Table 8
Pre-processing Results in Subtask 1
F1 Score
Source Model Baseline Preprocessing
DETESTS BETO A1 0.7561 0.7621
BETO A2 0.7417 0.7547
BETO A3 0.7710 0.7810
RoBERTa A1 0.7668 0.7698
RoBERTa A2 0.7955 0.8037
RoBERTa A3 0.7919 0.7706
STEREOHOAX BETO A1 0.8496 0.8591
BETO A2 0.8331 0.8401
BETO A3 0.8412 0.8419
RoBERTa A1 0.8643 0.8810
RoBERTa A2 0.8389 0.8667
RoBERTa A3 0.8580 0.8647
Table 9
Pre-processing Results in Subtask 2
F1 Score
Source Model Baseline Preprocessing
DETESTS BETO A1 0.5016 0.5115
BETO A2 0.5137 0.5235
BETO A3 0.4797 0.4816
RoBERTa A1 0.5394 0.5404
RoBERTa A2 0.4942 0.5062
RoBERTa A3 0.5842 0.6042
STEREOHOAX BETO A1 0.6536 0.6736
BETO A2 0.7317 0.7517
BETO A3 0.7021 0.7121
RoBERTa A1 0.6532 0.6732
RoBERTa A2 0.7610 0.7880
RoBERTa A3 0.6343 0.6745
the minority class, was a notable challenge across both datasets. To address this, we undertook
data augmentation specifically for this class. In the case of the second task, class imbalance
was evident in the Detests dataset, where class 0 data was notably sparse. Conversely, the
Stereohoax dataset exhibited balance, except for the third annotator’s class, which lacked any
class 1 instances. To mitigate this issue, we integrated gold label elements containing a 1 in
the implicit class, as provided by the competition, into the third annotator’s dataset, thereby
achieving improved data balance.
This translation process involved an initial translation from Spanish to English, then from
English to German, and finally back to English and then to Spanish. We utilized the
pre-trained model 'Helsinki-NLP/opus-mt-es-en' [17] for the initial translation and 'Helsinki-
NLP/opus-mt-en-es' [18] for the final reverse translation. Table 10 illustrates the application of this
technique to the dataset.
Table 10
Back-Translation Technique
Original (es) Váyase usted a la m13rd4
First Translation (en) Go fuck yourself
Second Translation (de) Fick dich selbst
Third Translation (en) Fuck you
Back-Translation (es) Que te jodan
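A compact sketch of this chain with MarianMT models follows; the cited checkpoints cover the Spanish/English legs, while the English/German checkpoints below are assumed Helsinki-NLP models not cited in the text:

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, checkpoint):
    """Translate a batch of sentences with a MarianMT checkpoint."""
    tokenizer = MarianTokenizer.from_pretrained(checkpoint)
    model = MarianMTModel.from_pretrained(checkpoint)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

def back_translate(texts):
    # es -> en -> de -> en -> es, as described above.
    en = translate(texts, "Helsinki-NLP/opus-mt-es-en")   # [17]
    de = translate(en, "Helsinki-NLP/opus-mt-en-de")      # assumed checkpoint
    en2 = translate(de, "Helsinki-NLP/opus-mt-de-en")     # assumed checkpoint
    return translate(en2, "Helsinki-NLP/opus-mt-en-es")   # [18]
```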
Tables 11 and 12 show the results achieved after applying the back-translation technique to the texts
from the comments.
Table 11
Back-Translation Results in Subtask 1
F1 Score
Source Model Baseline Preprocessing Back-translation
DETESTS BETO A1 0.7561 0.7621 0.7920
BETO A2 0.7417 0.7547 0.7660
BETO A3 0.7710 0.7810 0.7846
RoBERTa A1 0.7668 0.7698 0.7685
RoBERTa A2 0.7955 0.8037 0.8022
RoBERTa A3 0.7919 0.7706 0.7569
STEREOHOAX BETO A1 0.8496 0.8591 0.8454
BETO A2 0.8331 0.8401 0.8269
BETO A3 0.8412 0.8419 0.8073
RoBERTa A1 0.8643 0.8810 0.8462
RoBERTa A2 0.8389 0.8667 0.8437
RoBERTa A3 0.8580 0.8647 0.8297
4.7. Hyperparameter Search
Hyperparameter search is an essential step in fine-tuning models to fit datasets optimally. For
this reason, multiple training and testing iterations were performed, using various combinations
of hyperparameters: batch size, learning rate, maximum length, and weight decay. To minimize
training time, datasets were reduced to 80% of their original size before experimentation began.
Optuna [19] was the framework used for this process. An exhaustive search algorithm was
developed, testing all possible combinations of hyperparameters (grid search). Table 13 shows
the search space and the best values found for each of them. It should be noted that the same
hyperparameter values were obtained for all trained models.
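A sketch of this grid search with Optuna's GridSampler; fine_tune_and_evaluate is a hypothetical helper that trains a model with the sampled values and returns the validation F1-Score:

```python
import optuna

# Search space from Table 13 (2 x 2 x 2 x 2 = 16 combinations).
search_space = {
    "batch_size": [16, 32],
    "learning_rate": [3e-5, 5e-5],
    "max_length": [64, 128],
    "weight_decay": [0.01, 0.1],
}

def objective(trial):
    params = {name: trial.suggest_categorical(name, values)
              for name, values in search_space.items()}
    return fine_tune_and_evaluate(**params)  # hypothetical training helper

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.GridSampler(search_space),
)
study.optimize(objective, n_trials=16)  # exhaustively covers the grid
print(study.best_params)
```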
Table 12
Back-Translation Results in Subtask 2
F1 Score
Source Model Baseline Preprocessing Back-translation
DETESTS BETO A1 0.5016 0.5116 0.5516
BETO A2 0.5137 0.5237 0.5337
BETO A3 0.4797 0.4807 0.4997
RoBERTa A1 0.5394 0.5404 0.5694
RoBERTa A2 0.4942 0.5062 0.5142
RoBERTa A3 0.5842 0.6042 0.6242
STEREOHOAX BETO A1 0.6536 0.6736 0.6936
BETO A2 0.7317 0.7518 0.7718
BETO A3 0.7021 0.7121 0.7321
RoBERTa A1 0.6532 0.6734 0.6932
RoBERTa A2 0.7610 0.7881 0.8000
RoBERTa A3 0.6343 0.6745 0.6843
Table 13
Hyperparameter Space and Best Hyperparameter Values
Hyperparameter Values
Batch size [16, 32]
Learning rate [3e-05, 5e-05]
Max length [64, 128]
Weight decay [0.01, 0.1]
The hyperparameter search was applied to the best-performing models, namely BETO and
RoBERTa trained on the preprocessed datasets. Table 14 displays the outcomes of the
hyperparameter search. The Performance column presents the results of training with the best
hyperparameters found on the preprocessed datasets.
4.8. Ensemble Approach
Finally, a classification model was developed by combining the three models trained for each
annotator using a hard voting approach. This ensemble method improved overall performance
by reducing individual biases from each annotator and providing more robust predictions. Tables
15 and 16 present the chosen models for the ensemble; for both datasets, the RoBERTa models
were selected as they produced the best results. Additionally, a comparison is made with the
results obtained from training the model with the gold label provided by the competition. It
is evident that the learning with disagreement approach yielded superior results. As can be
observed, applying LeWiDi has clearly been a successful approach.
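The hard-voting step itself reduces to a majority vote over the three annotator models' binary predictions; a minimal sketch:

```python
import numpy as np

def hard_vote(preds_a1, preds_a2, preds_a3):
    """Majority vote over binary predictions from the three annotator models."""
    votes = np.stack([preds_a1, preds_a2, preds_a3])  # shape: (3, n_samples)
    return (votes.sum(axis=0) >= 2).astype(int)       # class 1 needs >= 2 votes
```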
Table 14
Hyperparameter Search Results in Subtask 1
F1 Score
Source Model Baseline Preprocessing Back-translation Performance
DETESTS BETO A1 0.7561 0.7621 0.7920 0.7970
BETO A2 0.7418 0.7548 0.7660 0.7670
BETO A3 0.7711 0.7810 0.7847 0.7878
RoBERTa A1 0.7668 0.7698 0.7685 0.7729
RoBERTa A2 0.7955 0.8038 0.8023 0.8080
RoBERTa A3 0.7919 0.7706 0.7569 0.7846
STEREOHOAX BETO A1 0.8496 0.8591 0.8454 0.8687
BETO A2 0.8331 0.8402 0.8269 0.8553
BETO A3 0.8412 0.8419 0.8073 0.8137
RoBERTa A1 0.8643 0.8811 0.8463 0.8924
RoBERTa A2 0.8389 0.8668 0.8437 0.8696
RoBERTa A3 0.8580 0.8647 0.8297 0.8649
Table 15
Comparison of Gold Label vs Ensemble in Subtask 1
F1 Score
Source Model Gold Label Ensemble
DETESTS RoBERTa 0.7899 0.7949
STEREOHOAX RoBERTa 0.8896 0.8973
Table 16
Comparison of Gold Label vs Ensemble in Subtask 2
F1 Score
Source Model Gold Label Ensemble
DETESTS RoBERTa 0.5633 0.5800
STEREOHOAX RoBERTa 0.7096 0.7246
5. Results
To measure our results, the organizers provided an unlabeled test dataset, which we processed
with our models to generate predictions for each instance. These predictions were then used
to calculate our score for the leaderboard. In the first task, we achieved the second position
(I2C-Huelva_1) using an ensemble of three models, one for each annotator, based on
RoBERTa. A second run (I2C-Huelva_2), consisting of an ensemble of three models, one for
each annotator, based on BETO, was submitted and achieved third place with an F1 Score of
0.701. Table 17 shows the ranking summary.
For Task 2, the submitted run consisted of an ensemble of three models, one for each annotator,
Table 17
Ranking Results for the Subtask 1
Ranking User F1-Score
1 Brigada Lenguaje_1 0.724
2 I2C-Huelva_1 0.712
3 I2C-Huelva_2 0.701
4 EUA_2 0.691
- - -
20 BASELINE_fast_text_svc 0.297
based on RoBERTa. This run achieved 7th place with an ICM of -0.328. Table 18 shows the
ranking summary:
Table 18
Ranking Results for the Subtask 2
Ranking User ICM ICM Norm
1 BASELINE_beto 0.126 0.546
2 EUA_2 0.065 0.524
3 EUA_3 0.061 0.522
4 EUA_1 0.045 0.516
5 Brigada Lenguaje_1 -0.240 0.413
6 BASELINE_tfidf_svc -0.275 0.400
7 I2C-Huelva_1 -0.328 0.381
- - - -
14 UC3M-SAS_2 -2.103 0.000
6. Error Analysis
For the error analysis, we used the test dataset built during the development phase. The
confusion matrices of the classifiers for both tasks can be found in Figures 2, 3, 4 and 5. Figure
2 shows the performance of the classifier in predicting classes Gold 0 and Gold 1 for the Detests
task. While the classifier accurately predicts most instances of the Gold 0 class (392 true
negatives), it is less reliable in predicting the Gold 1 class, with only 89 true positives and
47 false negatives. This outcome suggests that the classifier tends to predict the Gold 0 class
more accurately, likely due to an imbalance in the training dataset where the Gold 1 class is
underrepresented.
Figure 3 illustrates the classifier’s performance in the Stereohoax task. In this instance, the
classifier demonstrates a better balance between predicting both classes. It correctly predicts
287 instances of the Gold 0 class and 105 instances of the Gold 1 class. However, some errors
persist, with 16 false positives and 20 false negatives. This more balanced performance may be
attributed to a more even distribution of classes in the training dataset.
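The matrices discussed here can be reproduced with scikit-learn; y_true and y_pred stand for the gold labels and the ensemble predictions on our development test split:

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
ConfusionMatrixDisplay(cm, display_labels=["Gold 0", "Gold 1"]).plot()
```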
Figure 2: Confusion Matrix Task 1 Detests
Figure 3: Confusion Matrix Task 1 Stereohoax
To analyze Task 2, only the cases classified as positive in Task 1 have been used. Therefore,
the analysis was conducted as a binary task to distinguish between the explicit and implicit
classes.
Figure 4 displays the confusion matrix for Task 2 on the Detests dataset. Similar to the previous
results, the classifier shows a strong performance for the Gold 0 class, with 87 true negatives and
15 false positives. However, the performance for the Gold 1 class remains weak, with 15 true
positives and 8 false negatives. This indicates that the class imbalance issue observed in Task 1
continues to affect the classifier’s performance in this task.
Figure 4: Confusion Matrix Task 2 Detests
Figure 5: Confusion Matrix Task 2 Stereohoax
Figure 5 presents the confusion
matrix for Task 2 of Stereohoax. Here, the classifier exhibits high performance in predicting
the Gold 0 class with 87 true negatives but struggles with predicting the Gold 1 class, yielding
only 15 true positives and 8 false negatives. This pattern suggests that the classifier may be
influenced by a class imbalance similar to that observed in Task 1 of Stereohoax.
7. Conclusion
In this competition, we participated in the DETESTS-Dis task at IberLEF 2024, aimed
at detecting and classifying explicit and implicit stereotypes in texts from social media and
comments on news articles, incorporating learning with disagreement techniques. By leveraging
the annotations provided by multiple annotators and employing learning with disagreement,
we successfully mitigated individual biases and uncertainties inherent in the data labeling
process. Furthermore, through the ensemble technique, we combined the predictions from each
annotator’s model and obtained a final prediction using a majority voting approach, which
proved to be highly effective in enhancing the overall predictive performance.
Additionally, in Task 1, we achieved an F1-Score of 0.712, reflecting a robust performance
in stereotype identification. In Task 2, we obtained an ICM of -0.328 and a normalized ICM
of 0.381 in the fine-grained analysis of how stereotypes are expressed, both explicitly and
implicitly. These results highlight our dedication to improving stereotype detection
techniques and tackling the complex challenges posed by social media and news commentary.
Acknowledgments
This paper is part of the I+D+i Project titled “Conspiracy Theories and hate speech on-
line: Comparison of patterns in narratives and social networks about COVID-19, immi-
grants, refugees and LGBTI people [NON-CONSPIRA-HATE!]”, PID2021-123983OB-I00, funded
by MCIN/AEI/10.13039/501100011033/ and by “ERDF/EU”.
References
[1] F. H. Allport, The structuring of events: outline of a general theory with applications to
psychology., Psychological Review 61 (1954) 281.
[2] W. S. Schmeisser-Nieto, P. Pastells, S. Frenda, A. Ariza-Casabona, M. Farrús, P. Rosso,
M. Taulé, Overview of DETESTS-Dis at IberLEF 2024: DETEction and classification of
racial STereotypes in Spanish - Learn with Disagreement, Procesamiento del Lenguaje
Natural 69 (2024).
[3] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language
Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the
Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference
of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org,
2024.
[4] E. Fersini, D. Nozza, P. Rosso, et al., Overview of the evalita 2018 task on automatic
misogyny identification (ami), in: CEUR Workshop Proceedings, volume 2263, CEUR-WS,
2018, pp. 1–9.
[5] F. Rodríguez-Sánchez, J. Carrillo-de Albornoz, L. Plaza, J. Gonzalo, P. Rosso, M. Comet,
T. Donoso, Overview of exist 2021: sexism identification in social networks, Procesamiento
del Lenguaje Natural 67 (2021) 195–207.
[6] P. Chiril, F. Benamara, V. Moriceau, “be nice to your wife! the restaurants are closed”: Can
gender stereotype detection improve sexism classification?, in: Findings of the Association
for Computational Linguistics: EMNLP 2021, 2021, pp. 2833–2844.
[7] J. Cryan, S. Tang, X. Zhang, M. Metzger, H. Zheng, B. Y. Zhao, Detecting gender stereotypes:
Lexicon vs. supervised learning methods, in: Proceedings of the 2020 CHI conference on
human factors in computing systems, 2020, pp. 1–11.
[8] P. Vossen, A. Fokkens, I. Maks, C. van Son, Towards an open dutch framenet lexicon and
corpus, in: Proceedings of the LREC 2018 Workshop International FrameNet Workshop,
2018, pp. 75–80.
[9] M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, Y. Choi, Social bias frames: Reasoning
about social and power implications of language, arXiv preprint arXiv:1911.03891 (2019).
[10] M. Sanguinetti, G. Comandini, E. Di Nuovo, S. Frenda, M. Stranisci, C. Bosco, T. Caselli,
V. Patti, I. Russo, Haspeede 2@ evalita2020: Overview of the evalita 2020 hate speech
detection task, Evaluation Campaign of Natural Language Processing and Speech Tools
for Italian (2020).
[11] J. Sánchez-Junquera, P. Rosso, M. Montes, B. Chulvi, et al., Masking and bert-based models
for stereotype identification, Procesamiento del Lenguaje Natural 67 (2021) 83–94.
[12] A. Ariza-Casabona, W. S. Schmeisser-Nieto, M. Nofre, M. Taulé, E. Amigó, B. Chulvi,
P. Rosso, Overview of detests at iberlef 2022: Detection and classification of racial stereo-
types in spanish, Procesamiento del lenguaje natural 69 (2022) 217–228.
[13] E. Leonardelli, C. Casula, Dh-fbk at semeval-2023 task 10: Multi-task learning with classifier
ensemble agreement for sexism detection, in: Proceedings of the 17th International
Workshop on Semantic Evaluation (SemEval-2023), 2023, pp. 1894–1905.
[14] A. Uma, T. Fornaciari, A. Dumitrache, T. Miller, J. Chamberlain, B. Plank, E. Simpson,
M. Poesio, Semeval-2021 task 12: Learning with disagreements, in: Proceedings of the
15th International Workshop on Semantic Evaluation (SemEval-2021), 2021, pp. 338–347.
[15] J. Cañete, G. Chaperon, R. Fuentes, J. Ho, H. Kang, J. Pérez, Spanish pre-trained bert model
and evaluation data, in: Proceedings of the Workshop on Practical Machine Learning for
Developing Countries (PML4DC) at ICLR 2020, 2020.
[16] A. Gutiérrez-Fandiño, J. Armengol-Estapé, M. Pàmies, J. Llop-Palao, J. Silveira-Ocampo,
C. Carrino, C. Armentano-Oller, C. Rodriguez-Penagos, A. Gonzalez-Agirre, M. Villegas,
Maria: Spanish language models, Procesamiento del Lenguaje Natural 68 (2022).
[17] J. Tiedemann, S. Thottingal, Opus-mt: Building open translation services for the world,
https://huggingface.co/Helsinki-NLP/opus-mt-es-en, 2020. Accessed: 2024-07-08.
[18] J. Tiedemann, S. Thottingal, Opus-mt: Building open translation services for the world,
https://huggingface.co/Helsinki-NLP/opus-mt-en-es, 2020. Accessed: 2024-07-08.
[19] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperpa-
rameter optimization framework, in: Proceedings of the 25th ACM SIGKDD international
conference on knowledge discovery & data mining, 2019, pp. 2623–2631.