NYCU-NLP at EXIST 2024: Leveraging Transformers with Diverse Annotations for Sexism Identification in Social Networks
Notebook for the EXIST Lab at CLEF 2024

Yi-Zeng Fang1, Lung-Hao Lee2,* and Juinn-Dar Huang1
1 Institute of Electronics, National Yang Ming Chiao Tung University, Taiwan
2 Institute of Artificial Intelligence Innovation, National Yang Ming Chiao Tung University, Taiwan
* Corresponding author.
joycefang1213.ee11@nycu.edu.tw (Y. Fang); lhlee@nycu.edu.tw (L. Lee); jdhuang@nycu.edu.tw (J. Huang)

Abstract
This paper presents a robust methodology for identifying sexism in social media texts as part of the EXIST 2024 challenge. First, we incorporate extensive data preprocessing techniques, including removing redundant elements, standardizing text formats, increasing data diversity through back-translation, and augmenting texts using the AEDA approach. We then integrate annotator demographics, such as gender, age, and ethnicity, into our selected transformer-based language models. A rounding technique is used to handle non-continuous annotation values and maintain precise probability distributions. We empirically optimize the layers shared across tasks based on hard parameter sharing to improve generalization and computational efficiency. Rigorous evaluations were conducted using five-fold cross-validation to ensure the reliability of the findings. Finally, our system ranked first out of 40, 35, and 33 submissions for Tasks 1, 2, and 3, respectively, in the Soft-Soft evaluation setting. In the Hard-Hard evaluation setting, our system ranked first out of 70 submissions for Task 1, second out of 46 submissions for Task 2, and third out of 34 submissions for Task 3. This paper reports our findings on classifying sexism within social media textual content, offering substantial insights for the EXIST 2024 challenge.

Keywords
Sexism Identification, Pre-trained Language Models, Text Classification, Transformers

1. Introduction

Social media platforms like Twitter, Instagram, and Facebook have become integral to modern communication and information sharing. However, these platforms also facilitate the spread of discriminatory and prejudiced content, such as sexism. Sexism is a form of discrimination based on gender that undermines the dignity and rights of women and marginalized groups through insults, stereotypes, jokes, threats, and harassment. Identifying and filtering objectionable web content is crucial for fostering a respectful and inclusive online environment [1].

EXIST (sEXism Identification in Social neTworks) is a series of shared tasks designed to capture instances of sexism, ranging from explicit misogyny to subtler expressions that involve implicit sexist behaviors [2, 3, 4]. The EXIST 2024 challenge [5, 6] contains three tasks for classifying sexist textual messages.

Task 1 (Sexism Identification in Tweets): a binary classification task that decides whether a tweet contains sexist expressions or behaviors.

Task 2 (Source Intention in Tweets): a multi-class task that classifies tweets identified as sexist in Task 1 into three categories based on the author's intention: 1) Direct: the tweet itself is a sexist message; 2) Reported: the tweet reports or describes a sexist event or situation; and 3) Judgmental: the tweet condemns sexist situations or behaviors.
Task 3 (Sexism Categorization in Tweets): a multi-label task that further categorizes tweets identified as sexist into defined types: 1) Ideological-Inequality: discrediting feminism or presenting men as victims of gender inequality; 2) Stereotyping-Dominance: promoting traditional gender roles or suggesting male superiority; 3) Objectification: treating women as objects, often focusing on physical appearance or traditional gender roles; 4) Sexual-Violence: including sexual suggestions, harassment, or assault; and 5) Misogyny-Non-Sexual-Violence: expressing hatred or non-sexual violence towards women.

The EXIST 2024 datasets contain tweets in English and Spanish annotated for sexist content. Like the 2023 edition, this edition embraces the Learning With Disagreement paradigm for dataset development and system evaluation. Developed systems can therefore learn from conflicting or diverse annotations, allowing for a fairer learning process that considers the perspectives, biases, and interpretations of multiple annotators. Given the success of transformer models in various NLP tasks, our approach explores transformer-based language models to identify and classify tweets for sexism detection.

This paper describes the NYCU-NLP system for the EXIST 2024 challenge. We use extensive data preprocessing techniques, including removing irrelevant elements, standardizing text formats, back-translation via the Google Translator API, and the AEDA method [7] for text augmentation. We also adapt the Round to Closed Value method [8] to handle non-continuous annotation values. The main system architecture is a transformer-based language model. We integrate annotator information, such as gender, age, and ethnicity, to create a unified vector representation for each tweet. This integration enriches the model's contextual understanding and improves its ability to identify sexist content. We further incorporate Hard Parameter Sharing [9] to optimize shared layers across tasks, enhancing generalization and computational efficiency. We rigorously evaluate model performance through five-fold cross-validation to ensure reliability and minimize over-fitting. Finally, our model obtained outstanding performance in the EXIST 2024 challenge, ranking first out of 40, 35, and 33 submissions for Tasks 1, 2, and 3, respectively, in the Soft-Soft evaluation setting. In the Hard-Hard setting, our system ranked first out of 70 submissions for Task 1, second out of 46 submissions for Task 2, and third out of 34 submissions for Task 3. Our findings reflect ongoing efforts to detect and categorize sexism in social media, offering valuable insights for the EXIST 2024 challenge and beyond.

The rest of this paper is organized as follows. Section 2 reviews related studies on sexism identification. Section 3 describes the NYCU-NLP system for the EXIST 2024 tasks. Section 4 presents results and performance comparisons. Conclusions are finally drawn in Section 5.
2. Related Work

The automated detection of sexism on digital platforms has become increasingly important due to its prevalence and the sheer volume of content needing review, necessitating systems that can quickly and effectively identify and counteract such content. Researchers have explored various methods, initially focusing on rule-based systems but now predominantly using machine learning techniques, particularly pre-trained transformer models like BERT and its derivatives [10, 11, 12, 13]. These advanced models now outperform traditional methods in capturing the contextual and semantic nuances of sexist language.

Despite these advancements, sexism detection remains challenging due to the subjective and culturally variable nature of sexist behavior. Initiatives such as the EXIST 2023 [4] and SemEval 2023 [14] challenges have emphasized the need for detailed classification systems, introducing taxonomies that categorize sexism into distinct types, including ideological sexism, stereotyping, and misogyny. These taxonomies aim to enhance the explainability and comprehensiveness of sexism detection systems.

Bias in detection models is another critical issue. Models can perpetuate biases in their training data, leading to skewed results. Recent studies have addressed this by incorporating perspectivism and analyzing annotator agreement, which can improve the fairness and accuracy of these systems [15, 16, 17]. This consideration is particularly important in multilingual contexts, where expressions of sexism can vary widely. While machine learning and deep learning models have significantly advanced sexism detection, challenges remain in addressing bias, subjectivity, and the diverse forms of sexism across cultures and media types. Further research is needed, leveraging multimodal analysis and incorporating nuanced, context-aware approaches to develop more robust detection systems.

3. The NYCU-NLP System

We use Hard Parameter Sharing [9] to efficiently train our transformer-based model on tasks that exhibit an inherent hierarchical relationship, specifically Tasks 1 through 3. This technique shares hidden layers across all tasks, enabling the network to learn a common representation that leverages the features shared by these related tasks. Given the sequential nature of our tasks, where each task builds upon the preceding one, training them in isolation would be sub-optimal and could result in redundant or conflicting representations. We show the parameter-sharing architecture in Fig. 1 (b). Hard Parameter Sharing ensures that the foundational knowledge acquired in Task 1 is effectively utilized and refined in subsequent tasks, thereby enhancing the model's overall performance and generalization. This approach mitigates the risk of over-fitting through the regularizing effect of shared parameters and improves computational efficiency by reducing the number of required parameters compared to training separate models for each task. Consequently, Hard Parameter Sharing is a suitable and effective method for our multi-task learning scenario.
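To make the hard parameter-sharing setup in Fig. 1 (b) concrete, the following PyTorch sketch pairs one shared encoder with a task-specific head per task (two, four, and six output classes, as in the figure). The model name, pooling scheme, and dimensions are illustrative assumptions rather than a verbatim reproduction of our implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SharedEncoderModel(nn.Module):
    """Hard parameter sharing: one transformer encoder is shared by all
    three EXIST tasks; only the lightweight classifier heads are task-specific."""

    def __init__(self, model_name: str = "microsoft/deberta-v3-large", annot_dim: int = 16):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)          # shared layers
        feat_dim = 2 * self.encoder.config.hidden_size + annot_dim    # mean+max pooling + annotator vector
        self.head_task1 = nn.Linear(feat_dim, 2)   # sexist: yes / no
        self.head_task2 = nn.Linear(feat_dim, 4)   # direct / reported / judgemental / no
        self.head_task3 = nn.Linear(feat_dim, 6)   # five sexism types + no (multi-label)

    def forward(self, input_ids, attention_mask, annot_vec):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state  # (B, T, H)
        mask = attention_mask.unsqueeze(-1).float()
        mean_pool = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        max_pool = hidden.masked_fill(mask == 0, -1e4).max(dim=1).values
        feats = torch.cat([mean_pool, max_pool, annot_vec], dim=-1)
        return (self.head_task1(feats).softmax(-1),   # soft label distribution, Task 1
                self.head_task2(feats).softmax(-1),   # soft label distribution, Task 2
                self.head_task3(feats).sigmoid())     # per-category probabilities, Task 3
```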
To prepare the data for analysis during the pre-processing phase, we first removed usernames, URLs, percentages, times, dates, hashtags, and emojis, as these elements were unlikely to influence the annotators' judgments (see Fig. 2). Subsequently, all characters were converted to lowercase to ensure uniformity and reduce the complexity of the text data; a minimal sketch of this cleaning step appears below.

We also translated the text from English to Spanish and then back to English via the Google Translator API, effectively doubling the amount of data and introducing subtle variations that can improve the robustness of our models, as shown in Fig. 3; a sketch of this round trip follows the cleaning sketch.
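The cleaning step can be sketched with simple regular expressions; the exact patterns below (the emoji ranges in particular) are illustrative assumptions, not our production rules.

```python
import re

def clean_tweet(text: str) -> str:
    """Remove elements unlikely to influence annotators' judgments, then lowercase."""
    text = re.sub(r"@\w+", "", text)                                   # usernames
    text = re.sub(r"https?://\S+", "", text)                           # URLs
    text = re.sub(r"#\w+", "", text)                                   # hashtags
    text = re.sub(r"\d{1,2}[:/.]\d{1,2}([:/.]\d{2,4})?", "", text)     # times and dates
    text = re.sub(r"\d+(\.\d+)?%", "", text)                           # percentages
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)   # common emoji ranges
    return re.sub(r"\s+", " ", text).strip().lower()                   # normalize and lowercase

print(clean_tweet("Feel #blessed that I got this text @UN_Women https://t.co/UJvvloR0iP"))
# -> "feel that i got this text"
```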
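One way to reproduce the back-translation round trip is via the open-source deep-translator package; using it as a stand-in for the Google Translator API is an assumption for this sketch, not necessarily the exact client we used.

```python
from deep_translator import GoogleTranslator  # pip install deep-translator

def back_translate(text: str, src: str = "en", pivot: str = "es") -> str:
    """Round-trip a sentence through a pivot language to obtain a
    paraphrase-like augmented copy of the training example."""
    pivot_text = GoogleTranslator(source=src, target=pivot).translate(text)
    return GoogleTranslator(source=pivot, target=src).translate(pivot_text)

augmented = back_translate("feel that i have raised a caring and loving 13 yo")
```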
Figure 1: Schematic representation of our model architecture using Hard Parameter Sharing across three tasks. This diagram illustrates the shared transformer layers and specific classifiers for each task, highlighting the integration of annotator information and the utilization of softmax and sigmoid outputs. The comparative setup, (a) without sharing versus (b) with sharing, underscores the benefits of shared parameters in reducing redundancy and enhancing task interdependencies.

3.1. Data Augmentation

We use the AEDA technique [7] to augment the text data by randomly segmenting sentences and inserting punctuation marks from the predefined set {".", ";", "?", ":", "!", ","}. AEDA offers advantages including its simplicity and its effectiveness in generating diverse textual variations without significantly altering the semantic content. Unlike traditional text augmentation techniques [18] such as synonym replacement, random insertion, random swap, and random deletion, which may cause unintended biases and distortions, the AEDA technique maintains the integrity of the original data, ensuring that the augmented dataset remains representative of the original distribution and thereby enhancing the robustness and generalizability of our model. A minimal AEDA sketch appears at the end of this subsection.

Figure 2: Data cleaning process, including the removal of usernames, URLs, emojis, hashtags, and other non-essential elements, followed by conversion of all texts to lowercase. This figure showcases the transformation of example tweets, highlighting the streamlined and standardized text that forms the basis for further analysis.

Figure 3: Data augmentation process via back-translation using the Google Translator API. This figure presents original English sentences translated to Spanish and then back to English, illustrating the subtle variations introduced to enhance the dataset's diversity and robustness for model training.
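The AEDA insertion step is small enough to sketch in full; the insertion ratio below is an assumption in the spirit of Karimi et al. [7], who sample the number of insertions rather than fixing it.

```python
import random

PUNCTUATIONS = [".", ";", "?", ":", "!", ","]

def aeda(sentence: str, ratio: float = 0.3) -> str:
    """AEDA: randomly insert punctuation marks between words, leaving
    the original words untouched."""
    words = sentence.split()
    n_insert = max(1, int(ratio * len(words)))
    for _ in range(n_insert):
        pos = random.randint(0, len(words))   # position may also be the end
        words.insert(pos, random.choice(PUNCTUATIONS))
    return " ".join(words)

random.seed(0)
print(aeda("feel that i have raised a caring and loving 13 yo"))
```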
3.2. Incorporating Annotator Information

Each tweet was annotated by up to six annotators, whose demographic information is provided in the EXIST 2024 datasets [5, 6]. We converted each annotator's gender, age, and ethnicity into one-hot encoded vectors, transforming these categorical variables into binary representations suitable for input to our neural network model. Each one-hot encoded vector [19] is passed through an embedding layer to obtain a dense 16-dimensional representation. The embedding layer is trained to map similar categories closer together in the vector space, capturing the underlying relationships between different annotator attributes. For each tweet, we average the 16-dimensional embedding vectors [20] of the six annotators, yielding a single 16-dimensional vector that represents the combined annotator information. A minimal sketch follows.
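The sketch below illustrates this encoding; the category inventories are assumptions (the ethnicity list in particular is abridged), and the bias-free linear layer stands in for the embedding layer, since a linear map applied to a one-hot vector is equivalent to an embedding lookup.

```python
import torch
import torch.nn as nn

GENDERS = ["F", "M"]
AGES = ["18-22", "23-45", "46+"]
ETHNICITIES = ["Black or African American", "Hispano or Latino", "Asian", "White", "other"]
CATEGORIES = GENDERS + AGES + ETHNICITIES  # assumed inventories for the sketch

def one_hot(annotator: dict) -> torch.Tensor:
    """Concatenated one-hot encoding of one annotator's three attributes."""
    vec = torch.zeros(len(CATEGORIES))
    for value in (annotator["gender"], annotator["age"], annotator["ethnicity"]):
        vec[CATEGORIES.index(value)] = 1.0
    return vec

class AnnotatorEncoder(nn.Module):
    """Embed each annotator's one-hot vector into 16 dimensions, then
    average over the (up to six) annotators of a tweet."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.embed = nn.Linear(len(CATEGORIES), dim, bias=False)  # embedding layer

    def forward(self, one_hots: torch.Tensor) -> torch.Tensor:
        # one_hots: (batch, n_annotators, len(CATEGORIES))
        return self.embed(one_hots).mean(dim=1)  # averaged annotator vector, (batch, 16)

enc = AnnotatorEncoder()
annos = torch.stack([one_hot({"gender": "F", "age": "23-45",
                              "ethnicity": "Hispano or Latino"})] * 6).unsqueeze(0)
print(enc(annos).shape)  # torch.Size([1, 16])
```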
3.3. Round to Closed Value

We use the Round to Closed Value method [8] to push the output probabilities toward the valid values implied by the number of annotators. The method is applied uniformly to Tasks 1 and 2, since both involve mono-label classification in which the probabilities must sum to 1. We first generate all possible probability combinations for the given labels; for example, [1/6, 5/6] is a valid combination for Task 1 with two categories and six annotators. We then calculate the cosine similarity between these valid combinations and the model's predicted probabilities, and the combination most similar to the prediction is chosen as the adjusted value. A minimal sketch of this variant follows.

We modify the Round to Closed Value approach [8] for Task 3, a multi-label classification task in which the sum of probabilities may exceed 1. For each category, we use the minimum absolute difference to find the closest valid value. Because the total adjusted probability might still be below 1, we then select the next closest category and adjust its probability accordingly, ensuring that the sum of adjusted probabilities is at least 1. This step ensures that the cumulative probability is valid and meaningful.
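The mono-label variant (Tasks 1 and 2) reduces to enumerating the distributions expressible as annotator vote fractions and snapping the prediction to the most cosine-similar one; the NumPy sketch below follows that description, with six annotators assumed throughout.

```python
import itertools
import numpy as np

def valid_combinations(n_labels: int, n_annotators: int = 6) -> np.ndarray:
    """All label distributions expressible as (votes / n_annotators)."""
    combos = [c for c in itertools.product(range(n_annotators + 1), repeat=n_labels)
              if sum(c) == n_annotators]
    return np.array(combos, dtype=float) / n_annotators

def round_to_closed(pred: np.ndarray, combos: np.ndarray) -> np.ndarray:
    """Snap a predicted distribution to the most cosine-similar valid one."""
    sims = combos @ pred / (np.linalg.norm(combos, axis=1) * np.linalg.norm(pred) + 1e-12)
    return combos[np.argmax(sims)]

combos_t1 = valid_combinations(n_labels=2)                  # includes [1/6, 5/6]
print(round_to_closed(np.array([0.21, 0.79]), combos_t1))   # -> [0.1667, 0.8333]
```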
4. Evaluation

4.1. Datasets

The EXIST 2024 tweet datasets [5, 6] aim to facilitate the identification and analysis of sexism in social media content. Table 1 shows that the datasets comprise over 10,000 annotated tweets in both English and Spanish with a balanced distribution across languages.

Table 1
Distribution of the dataset across different phases of model training, segmented by language. The table details the number of instances in the training, development, and test sets for both Spanish and English, highlighting the balanced allocation to support effective model generalization across both languages.

Language   Training   Development   Test
Spanish    3660       549           1098
English    3260       489           978
Total      6920       1038          2076

Each tweet in the dataset is represented as a JSON object containing the following attributes (an illustrative record is sketched after this list):
1) id_EXIST: unique identifier for the tweet;
2) lang: language of the tweet text ("en" for English or "es" for Spanish);
3) tweet: text content of the tweet;
4) number_annotators: number of annotators who labeled the tweet;
5) annotators: unique identifiers for each annotator;
6) gender_annotators: gender of each annotator (values: "F" for female and "M" for male);
7) age_annotators: age group of each annotator (values: "18-22", "23-45", "46+");
8) ethnicity_annotators: ethnicity of each annotator (e.g., "Black or African American", "Hispano or Latino", etc.);
9) study_level_annotators: educational level of each annotator (e.g., "high school degree or equivalent", "bachelor's degree", etc.);
10) country_annotators: country where each annotator resides;
11) labels_task1: one label per annotator indicating whether the tweet contains sexist content (values: "yes" or "no");
12) labels_task2: one label per annotator categorizing the intention behind the sexist tweet (values: "direct", "reported", "judgemental", "-", "unknown");
13) labels_task3: one label set per annotator indicating the type(s) of sexism present in the tweet, if any (e.g., "ideological-inequality", "stereotyping-dominance", etc.).

The dataset is annotated by a diverse group of individuals in terms of gender, age, ethnicity, education level, and country of residence, enhancing the robustness and fairness of the annotations. This diversity helps ensure the dataset captures various perspectives and reduces potential annotation biases.
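To make the schema concrete, here is a hypothetical record following the attribute list above; every value is invented for illustration and does not come from the corpus.

```python
# A hypothetical record matching the attribute list above; all values are
# invented for illustration, not taken from the EXIST 2024 corpus.
sample = {
    "id_EXIST": "100001",
    "lang": "en",
    "tweet": "some tweet text",
    "number_annotators": 6,
    "annotators": [f"Annotator_{i}" for i in range(1, 7)],
    "gender_annotators": ["F", "F", "F", "M", "M", "M"],
    "age_annotators": ["18-22", "23-45", "46+", "18-22", "23-45", "46+"],
    "ethnicity_annotators": ["Hispano or Latino"] * 6,
    "study_level_annotators": ["bachelor's degree"] * 6,
    "country_annotators": ["Spain"] * 6,
    "labels_task1": ["yes", "yes", "no", "yes", "no", "yes"],
    "labels_task2": ["direct", "judgemental", "-", "direct", "-", "direct"],
    "labels_task3": [["stereotyping-dominance"], ["ideological-inequality"], ["-"],
                     ["stereotyping-dominance"], ["-"], ["objectification"]],
}
```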
4.2. Settings

We use five-fold cross-validation to evaluate model performance during development. This method partitions the combined data, including the training and development sets, into five folds of equal size. During each iteration, one fold is designated as the validation set, while the remaining four folds are used for model training. This process is repeated five times, ensuring that each fold is used exactly once as the validation set. We derive a robust estimate of the model's generalization capability by averaging the performance metrics obtained across iterations. Five-fold cross-validation not only maximizes the utility of our datasets but also provides a reliable means of assessing model performance, reducing the potential for over-fitting and ensuring that the evaluation is not biased toward any single train-test split.

DeBERTaV3-large [12] and XLM-RoBERTa-large [13] were used as the main transformer models across the three tasks. The hyperparameters were configured empirically as follows. We used the AdamW optimizer [21] for training, with a learning rate of 1e-5 and a dropout rate of 0.1. The training process spanned 30 epochs, with a maximum sequence length of 128 tokens, and a batch size of 20 was used across the tasks. The evaluation framework was implemented on a single NVIDIA Tesla V100 GPU with 32 GB of memory. Our training focused on the Soft-Soft evaluation setting only, so no model was specifically trained for the Hard-Hard setting; instead, hard labels were derived from the soft outputs. Tasks 1 and 2 used a direct conversion that selects the label with the maximum value, whereas Task 3 used a conversion threshold of 0.16666 (i.e., 1/6). The ICM-Soft values [22] for each task were measured using the PyEvALL evaluation library, with results averaged over the five folds. Sketches of the cross-validation protocol and the soft-to-hard conversion follow.
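A minimal sketch of the five-fold protocol over the combined 7,958 training and development tweets; the input arrays are toy placeholders, and fine_tune and icm_soft_score are hypothetical helpers standing in for transformer fine-tuning and PyEvALL scoring.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy placeholders for the combined training + development data.
texts = np.array(["tweet one", "tweet two", "tweet three", "tweet four", "tweet five"])
soft_labels = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [5/6, 1/6], [2/6, 4/6]])

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kfold.split(texts):
    # fine_tune / icm_soft_score are hypothetical helpers (see lead-in).
    model = fine_tune(texts[train_idx], soft_labels[train_idx])
    fold_scores.append(icm_soft_score(model, texts[val_idx], soft_labels[val_idx]))
print(f"mean ICM-Soft over folds: {np.mean(fold_scores):.4f}")
```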
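The soft-to-hard conversion admits a short sketch under the reading given above: argmax for Tasks 1 and 2, and a 1/6 threshold for Task 3. Whether the threshold is strict or inclusive is an assumption here.

```python
import numpy as np

def soft_to_hard_monolabel(probs: np.ndarray) -> int:
    """Tasks 1-2: pick the class with the maximum soft probability."""
    return int(np.argmax(probs))

def soft_to_hard_multilabel(probs: np.ndarray, thr: float = 1/6) -> list:
    """Task 3: keep every category whose soft probability exceeds 1/6
    (0.16666, i.e., at least one of six annotators selected it)."""
    return [i for i, p in enumerate(probs) if p > thr]

print(soft_to_hard_monolabel(np.array([0.17, 0.83])))        # -> 1
print(soft_to_hard_multilabel(np.array([0.5, 0.1, 0.34])))   # -> [0, 2]
```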
4.3. Results

Tables 2 and 3 show the five-fold cross-validation results using DeBERTaV3-large [12] and XLM-RoBERTa-large [13], respectively. Various configurations were tested, including data augmentation (denoted DA), annotator information (AI), rounding to closed values (RC) [8], and translation from Spanish to English (Tr.). The configurations denoted V1, V2, and V3 represent the final versions submitted for official evaluation.

The DeBERTaV3-large model consistently surpassed the XLM-RoBERTa-large model across all tasks and configurations. The performance disparity was particularly pronounced between the final submitted versions (V2 for DeBERTaV3-large and V3 for XLM-RoBERTa-large): DeBERTaV3-large achieved a higher score on Task 1 (1.0084 vs. 0.9370) and less negative scores on Task 2 (-0.5208 vs. -0.9049) and Task 3 (-1.8042 vs. -2.3777). These findings demonstrate the effectiveness of integrating data augmentation, annotator information, the rounding to closed values technique, and the Spanish-to-English translation. Overall, version 2 (V2) of the DeBERTaV3-large model emerged as the most effective model for the tasks assessed in this study.

Table 2
Performance results of the DeBERTaV3-large model across three tasks (ICM-Soft, higher is better).

DeBERTaV3-large               Task 1 ↑   Task 2 ↑   Task 3 ↑
baseline                      0.7849     -1.2073    -3.2058
+ DA                          0.8287     -0.9153    -2.9269
+ DA + AI                     0.9410     -0.7256    -2.5230
+ DA + AI + RC (V1)           0.9862     -0.5597    -1.9450
+ DA + AI + RC + Tr. (V2)     1.0084     -0.5208    -1.8042

Table 3
Performance evaluation of the XLM-RoBERTa-large model across three tasks (ICM-Soft, higher is better).

XLM-RoBERTa-large             Task 1 ↑   Task 2 ↑   Task 3 ↑
baseline                      0.7605     -1.5976    -3.5432
+ DA                          0.8063     -1.3867    -3.3798
+ DA + AI                     0.9072     -0.9970    -2.8341
+ DA + AI + RC (V3)           0.9370     -0.9049    -2.3777
+ DA + AI + RC + Tr.          0.9005     -0.9251    -2.5723

4.4. Rankings

Tables 4, 5, and 6 show our final submissions under the Soft-Soft and Hard-Hard evaluation settings on the test set. In the Soft-Soft setting, our model ranked first for Tasks 1, 2, and 3 out of 40, 35, and 33 submissions, respectively. In the Hard-Hard setting, our system ranked first out of 70 submissions for Task 1, second out of 46 submissions for Task 2, and third out of 34 submissions for Task 3. In summary, our findings reflect ongoing efforts to detect and categorize sexism in social media.

Table 4
Final results on the test set for Task 1.

Lang   Version   Rank (Soft)   ICM-Soft   ICM-Soft Norm   Cross Entropy   Rank (Hard)   ICM-Hard   ICM-Hard Norm   Macro F1
All    1         1             1.0944     0.6755          0.9088          1             0.5973     0.8002          0.7944
All    2         2             1.0866     0.6742          0.8826          9             0.5619     0.7824          0.7785
All    3         3             1.0810     0.6733          0.9831          8             0.5749     0.7889          0.7813
ES     1         1             1.1434     0.6834          0.8681          1             0.6215     0.8108          0.8238
ES     2         3             1.1251     0.6804          0.8751          8             0.5805     0.7903          0.8077
ES     3         2             1.1358     0.6822          0.9229          5             0.5995     0.7998          0.8075
EN     1         2             1.0024     0.6609          0.9545          5             0.5564     0.7839          0.7557
EN     2         1             1.0158     0.6631          0.8911          13            0.5298     0.7704          0.7410
EN     3         3             0.9841     0.6580          1.0506          11            0.5362     0.7736          0.7477

Table 5
Final results on the test set for Task 2.

Lang   Version   Rank (Soft)   ICM-Soft   ICM-Soft Norm   Cross Entropy   Rank (Hard)   ICM-Hard   ICM-Hard Norm   Macro F1
All    1         2             -0.4059    0.4673          1.8549          3             0.3383     0.6100          0.5353
All    2         1             -0.2543    0.4795          1.8344          4             0.3073     0.5999          0.5273
All    3         3             -0.5226    0.4579          1.9206          2             0.3522     0.6145          0.5410
ES     1         2             -0.2633    0.4789          1.8228          1             0.4457     0.6392          0.5757
ES     2         1             -0.0756    0.4939          1.8197          4             0.4098     0.6280          0.5723
ES     3         3             -0.3308    0.4735          1.8540          3             0.4113     0.6285          0.5697
EN     1         2             -0.6464    0.4472          1.8909          4             0.1881     0.5651          0.4729
EN     2         1             -0.5041    0.4588          1.8509          6             0.1692     0.5585          0.4625
EN     3         3             -0.8235    0.4327          1.9954          2             0.2672     0.5925          0.4991

Table 6
Final results on the test set for Task 3.

Lang   Version   Rank (Soft)   ICM-Soft   ICM-Soft Norm   Rank (Hard)   ICM-Hard   ICM-Hard Norm   Macro F1
All    1         1             -1.1762    0.4379          4             0.2364     0.5549          0.6066
All    2         2             -1.2169    0.4357          5             0.1725     0.5401          0.5933
All    3         3             -1.4555    0.4231          3             0.3069     0.5713          0.6130
ES     1         1             -1.1280    0.4413          4             0.2986     0.5667          0.6206
ES     2         2             -1.1584    0.4397          5             0.1653     0.5369          0.5968
ES     3         3             -1.2881    0.4330          3             0.3138     0.5701          0.6228
EN     1         1             -1.2583    0.4311          5             0.1448     0.5355          0.5855
EN     2         2             -1.2802    0.4299          4             0.1680     0.5412          0.5874
EN     3         3             -1.7322    0.4051          1             0.2820     0.5691          0.5989

5. Conclusions

This study describes the NYCU-NLP submission for EXIST 2024 Tasks 1, 2, and 3, covering system design, implementation, and evaluation. We remove superfluous elements, standardize text formats, increase data diversity through back-translation, and augment texts using the AEDA approach. We then integrate annotator demographics such as gender, age, and ethnicity into our selected transformer-based language models. Our model architecture, based on the Hard Parameter Sharing technique, optimizes computational efficiency and improves performance by leveraging features shared across related tasks. The results of the EXIST 2024 challenge demonstrate that our methodology significantly improves the detection and categorization of sexism in social media. Our approach yielded excellent performance, underscoring the effectiveness of the advanced techniques and strategies implemented.

Acknowledgments
This work was partially supported by the Ministry of Science and Technology, Taiwan, under grant MOST-111-2218-E-A49-022, and the National Science and Technology Council, Taiwan, under grant NSTC 111-2628-E-A49-029-MY3. We also thank the National Center for High-performance Computing and Taiwan Computing for providing computing resources.

References
[1] L.-H. Lee, Y.-C. Juan, W.-L. Tseng, H.-H. Chen, Y.-H. Tseng, Mining browsing behaviors for objectionable content filtering, Journal of the Association for Information Science and Technology 66 (2015) 930–942.
[2] F. Rodríguez-Sánchez, J. Carrillo-de-Albornoz, L. Plaza, J. Gonzalo, P. Rosso, M. Comet, T. Donoso, Overview of EXIST 2021: Sexism identification in social networks, Procesamiento del Lenguaje Natural 67 (2021) 195–207.
[3] F. Rodríguez-Sánchez, J. Carrillo-de-Albornoz, L. Plaza, A. Mendieta-Aragón, G. Marco-Remón, M. Makeienko, M. Plaza, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2022: Sexism identification in social networks, Procesamiento del Lenguaje Natural 69 (2022) 229–240.
[4] L. Plaza, J. Carrillo-de-Albornoz, R. Morante, E. Amigó, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2023: Sexism identification in social networks, in: European Conference on Information Retrieval, Springer, 2023, pp. 593–599.
[5] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[6] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[7] A. Karimi, L. Rossi, A. Prati, AEDA: An easier data augmentation technique for text classification, arXiv preprint arXiv:2108.13230 (2021).
[8] A. F. M. de Paula, G. Rizzi, E. Fersini, D. Spina, AI-UPV at EXIST 2023 – Sexism characterization using large language models under the learning with disagreements regime, arXiv preprint arXiv:2307.03385 (2023).
[9] S. Ruder, An overview of multi-task learning in deep neural networks, arXiv preprint arXiv:1706.05098 (2017).
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[11] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, arXiv preprint arXiv:1909.11942 (2019).
[12] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, arXiv preprint arXiv:2111.09543 (2021).
[13] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[14] A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. T. Madabushi, R. Kumar, E. Sartori (Eds.), Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), 2023.
[15] S. Raza, O. Bamgbose, V. Chatrath, S. Ghuge, Y. Sidyakin, A. Y. Muaad, Unlocking bias detection: Leveraging transformer-based models for content analysis, arXiv preprint arXiv:2310.00347 (2023).
[16] A. Radwan, L. Zaafarani, J. Abudawood, F. AlZahrani, F. Fourat, Addressing bias through ensemble learning and regularized fine-tuning, arXiv preprint arXiv:2402.00910 (2024).
[17] T. P. Pagano, R. B. Loureiro, F. V. Lisboa, R. M. Peixoto, G. A. Guimarães, G. O. Cruz, M. M. Araujo, L. L. Santos, M. A. Cruz, E. L. Oliveira, et al., Bias and unfairness in machine learning models: A systematic review on datasets, tools, fairness metrics, and identification and mitigation methods, Big Data and Cognitive Computing 7 (2023) 15.
[18] J. Wei, K. Zou, EDA: Easy data augmentation techniques for boosting performance on text classification tasks, arXiv preprint arXiv:1901.11196 (2019).
[19] J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences 79 (1982) 2554–2558.
[20] F. Almeida, G. Xexéo, Word embeddings: A survey, arXiv preprint arXiv:1901.09069 (2019).
[21] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017).
[22] E. Amigó, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5809–5819.