Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Tweets and Memes (Extended Overview)
Notebook for the EXIST Lab at CLEF 2024

Laura Plaza1,*, Jorge Carrillo-de-Albornoz1, Víctor Ruiz1, Alba Maeso2, Berta Chulvi2, Paolo Rosso2,3, Enrique Amigó1, Julio Gonzalo1, Roser Morante1 and Damiano Spina4

1 Universidad Nacional de Educación a Distancia (UNED), 28040 Madrid, Spain
2 Universidad Politécnica de Valencia (UPV), 46022 Valencia, Spain
3 Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI), 46022 Valencia, Spain
4 RMIT University, 3000 Melbourne, Australia

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
lplaza@lsi.uned.es (L. Plaza); jcalbornoz@lsi.uned.es (J. Carrillo-de-Albornoz); victor.ruiz@lsi.uned.es (V. Ruiz); amaeolm@inf.upv.es (A. Maeso); berta.chulvi@upv.es (B. Chulvi); prosso@dsic.upv.es (P. Rosso); enrique@lsi.uned.es (E. Amigó); julio@lsi.uned.es (J. Gonzalo); r.morant@lsi.uned.es (R. Morante); damiano.spina@rmit.edu.au (D. Spina)

Abstract
In recent years, the rapid increase in the dissemination of offensive and discriminatory material aimed at women through social media platforms has emerged as a significant concern. This trend has had adverse effects on women's well-being and on their ability to express themselves freely. The EXIST campaign has been promoting research in online sexism detection and categorization in English and Spanish since 2021. The fourth edition of EXIST, hosted at the CLEF 2024 conference, consists of three groups of tasks, continuing from EXIST 2023: sexism identification, source intention identification, and sexism categorization. However, while EXIST 2023 focused on processing tweets, the novelty of this edition is that the three tasks are also applied to memes, resulting in a total of six tasks. To address disagreements in the labeling process, the "learning with disagreement" paradigm is adopted. This approach promotes the development of equitable systems capable of learning from different perspectives on the sexism phenomenon. The 2024 edition of EXIST has surpassed the success of previous editions, with 57 participating teams submitting 412 runs. This extended lab overview describes the tasks, dataset, evaluation methodology, participant approaches, and results. Additionally, it highlights the advancements made in understanding and tackling online sexism through more diverse data sources and innovative methodologies. Finally, it briefly outlines the work planned for future editions of EXIST.

Keywords
sexism identification, sexism categorization, learning with disagreement, memes, data bias

1. Introduction
EXIST (sEXism Identification in Social neTworks) is a series of scientific events and shared tasks on sexism identification in social networks. The 2021 and 2022 editions [1, 2], held under the umbrella of the IberLEF forum, were the first to propose tasks focused on identifying and classifying online sexism in a broad sense, from explicit and/or hostile expressions to more subtle or even benevolent ones. The 2023 edition [3] took place as a CLEF Lab and added a third task, determining the intention of the authors of sexist messages, with the aim of understanding the purpose behind posting sexist messages on social networks. Additionally, the main novelty of the 2023 edition was the adoption of the "Learning with Disagreements" (LwD) paradigm [4] for the development of the dataset and for the evaluation of the systems. In the LwD paradigm, models are trained to handle and learn from conflicting or diverse annotations, so that different annotators' perspectives, biases, or interpretations are taken into account.
This approach fits the findings of our previous work, which showed that the perception of sexism is strongly dependent on the demographic and cultural background of the individual. Adopting this paradigm was a distinguishing feature in comparison to the SemEval-2023 Shared Task 10: "Explainable Detection of Online Sexism" [5].

EXIST 2024, also organised as a CLEF Lab, aims to continue contributing datasets and tasks that help develop applications to combat online sexism, as a form of online hate. This edition also embraces the LwD paradigm and, as a novelty, incorporates three new tasks centered on memes. Memes are images that are spread rapidly by social networks and Internet users. While memes are humorous by nature, there is a growing tendency to use them for harmful purposes, as a strategy to conceal hate speech by combining it with the stylistic devices of humour [6], since people tolerate humorously communicated prejudices better than explicitly disrespectful remarks [7]. Thus, memes contribute to spreading derogatory humour, strengthening preexisting prejudices, and maintaining hierarchies between social groups [8]. As Gasparini et al. indicate [9], misogyny and sexism against women are widespread attitudes within social media communities, reinforcing age-old patriarchal practices such as baseless name-calling, the objectification of women's appearance, and the stereotyping of gender roles. By including sexist memes in the EXIST 2024 dataset, we aim to encompass a broader spectrum of sexist manifestations in social networks and to contribute to the development of automated multimodal tools capable of detecting harmful content targeting women.

Meme detection has also been the focus of other competitions. SemEval-2022 Task 5: "Multimedia Automatic Misogyny Identification" [10] focused on the detection of misogynous memes on the web in English and proposed two tasks: recognising whether a meme is misogynous or not, and recognising types of misogyny in memes. The shared task on "Multitask Meme Classification - Unraveling Misogynistic and Trolls in Online Memes" [11] consisted of classifying misogynistic content and troll memes, focusing specifically on memes in the Tamil and Malayalam languages. The originality of EXIST lies in the languages addressed, English and Spanish, in the introduction of the source intention recognition task, and in the adoption of the LwD paradigm.

In the following sections, we provide comprehensive information about the tasks, the dataset, the evaluation methodology, the results, and the different approaches of the teams that participated in the EXIST 2024 Lab. The competition features six distinct tasks: sexism identification, source intention classification, and sexism categorization, both in tweets and in memes. A total of 148 teams from 32 different countries registered to participate.
Ultimately, we received 412 runs from 57 teams. Interestingly, a significant number of teams leveraged the diverse labels representing various demographic groups and provided soft labels as the outputs of their systems. Their results showcase the effectiveness and advantages of employing the LwD paradigm in our specific domain: sexism detection and categorization in social networks.

2. Tasks
The 2024 edition of EXIST features six tasks, which are described below. The languages addressed are English and Spanish, and the datasets are collections of tweets and memes. For the tasks on memes, all the partitions of the dataset are new, whereas for the tasks on tweets we employ the EXIST 2023 dataset.

2.1. Task 1: Sexism Identification in Tweets
This is a binary classification task where systems must decide whether or not a given tweet expresses ideas related to sexism in any of three forms: the tweet is itself sexist, it describes a sexist situation in which discrimination towards women occurs, or it criticizes sexist behaviour. The following statements show examples of sexist and not sexist messages, respectively.

• Sexist. The tweet is sexist or describes or criticizes a sexist situation.
(1) Woman driving, be careful!.
(2) It's less of #adaywithoutwomen and more of a day without feminists, which, to be quite honest, sounds lovely.
(3) I'm sorry but women cannot drive, call me sexist or whatever but it is true.
(4) "You look like a whore in those pants" - My brother of 13 when he saw me in a leather pant
• Not sexist. The tweet is not sexist, nor does it describe or criticize a sexist situation.
(5) Just saw a woman wearing a mask outside spank her very tightly leashed dog and I gotta say I love learning absolutely everything about a stranger in a single instant.
(6) Where all the white women at?.
(7) The shocking video of a woman at the wheel who miraculously escapes an assassination attempt.
(8) Don't my arguments convince you? Let's try to debate. Do you use "feminazi"? You stay alone.

2.2. Task 2: Source Intention in Tweets
This task aims to categorize the message according to the intention of the author. This distinction allows us to differentiate sexism that is actually taking place online from sexism that is suffered by women in other situations but is reported in social networks with the aim of complaining about and fighting against it. We propose the following ternary classification of tweets:

• Direct sexist message. The intention is to write a message that is sexist by itself or that incites to be sexist, as in:
(9) A woman needs love, to fill the fridge, if a man can give this to her in return for her services (housework, cooking, etc), I don't see what else she needs.
(10) Women shouldn't code...perhaps be influencer/creator instead...it's their natural strength.
• Reported sexist message. The intention is to report and share a sexist situation suffered by a woman or women in first or third person, as in:
(11) I doze in the subway, I open my eyes feeling something weird: the hand of the man sat next to me on my leg #SquealOnYourPig.
(12) Today, one of my year 1 class pupils could not believe he'd lost a race against a girl.
• Judgemental message. The intention is to condemn sexist situations or behaviours, as in:
(13) As usual, the woman was the one quitting her job for the family's welfare...
(14) 21st century and we are still earning 25% less than men #Idonotrenounce.
2.3. Task 3: Sexism Categorization in Tweets
Many facets of a woman's life may be the focus of sexist attitudes, including domestic and parenting roles, career opportunities, sexual image, and life expectations, to name a few. Accordingly, each sexist tweet must be assigned one or more of the following categories:

• Ideological and inequality. It includes messages that discredit the feminist movement. It also includes messages that reject inequality between men and women, or present men as victims of gender-based oppression. Some examples of this type of sexist discourse can be found in the following tweets:
(15) #Feminism is a war on men, but it's also a war on women. It's a war on female nature, a war on femininity.
(16) Think the whole equality thing is getting out of hand. We are different, that's how were made!
(17) OK. Whew, that's good. Get a real degree and forget this poison of victimhood known as feminism.
• Stereotyping and dominance. It includes messages that express false ideas about women, suggesting that they are more suitable for some tasks and inappropriate for others, and somehow inferior to men. Some examples of this type of sexist discourse can be found in the following tweets:
(18) Most women no longer have the desire or the knowledge to develop a high quality character, even if they wanted to.
(19) I feel like everytime I flirt with a girl they start to imagine all the ways they can utilize me.
(20) Woman driving, be careful!
• Objectification. It includes messages where women are presented as objects, apart from their dignity and personal aspects. We also include messages that assume or describe certain physical qualities that women must have in order to fulfill traditional gender roles. Some examples of this type of sexist discourse can be found in the following tweets:
(21) No offense but I've never seen an attractive african american hooker. Not a single one.
(22) Most of the times I just want women for sex just because everytime I try to make a relationship out of it they always break my heart.
(23) Don't get married than blame all woman for your poor investment. You should of got a hooker but instead you choose to go get a wedding ring.
• Sexual violence. It includes messages where sexual suggestions, requests or harassment of a sexual nature (rape or sexual assault) are made. Some examples of this type of sexist discourse can be found in the following tweets:
(24) I wanna touch your tits..you can't imagine what I can do on your body.
(25) Fuck that cunt, I would with my fist.
(26) Now I understand why woman that get no action are miserable. Semen retention almost drove me insane.
• Misogyny and non-sexual violence. It includes expressions of hatred and violence towards women. Some examples of this type of sexist discourse can be found in the following tweets:
(27) Domestic abuse is never okay. . . Unless your wife is a bitch.
(28) Some woman are so toxic they don't even know they are draining everyone around them in poison.
(29) I hate misogyny more than I hate women.

2.4. Task 4: Sexism Identification in Memes
As in Task 1, this is a binary classification task consisting of deciding whether or not a meme is sexist. Figure 1 shows examples of a sexist and a non-sexist meme, respectively.

2.5. Task 5: Source Intention in Memes
As in Task 2, this task aims to categorize the meme according to the intention of the author. Due to the characteristics of memes, we found hardly any examples within the "reported" category, so this category was not considered.
As a result, in this task systems must classify memes into only two classes, "direct" or "judgemental", as shown in Figure 2.

2.6. Task 6: Sexism Categorization in Memes
This task aims to classify sexist memes according to the categorization provided for Task 3: (i) ideological and inequality, (ii) stereotyping and dominance, (iii) objectification, (iv) sexual violence and (v) misogyny and non-sexual violence. Figure 3 shows one meme of each category.

Figure 1: Examples of (a) a sexist meme and (b) a non-sexist meme.
Figure 2: Examples of (a) a direct meme and (b) a judgemental meme.

3. Dataset
The EXIST 2024 dataset comprises two types of data: the tweets from the EXIST 2023 dataset and a completely new dataset of memes. Plaza et al. [3] provide a detailed description of the EXIST 2023 tweet dataset. Here, we briefly describe the process followed to curate the meme dataset. Since we adopt the LwD paradigm, we provide all labels assigned by the different annotators to allow systems to learn from conflicting and subjective information. This paradigm has not only proved to improve the systems' accuracy, robustness and generalizability, but has also helped to mitigate bias.

3.1. Data Sampling
We first curated a lexicon of terms and expressions leading to sexist memes. The set of seeds encompasses diverse topics and contains 250 terms, 112 in English and 138 in Spanish. The terms were used as search queries on Google Images to obtain the top 100 images. Rigorous manual cleaning procedures were then applied, following a working definition of meme and ensuring the removal of noise such as textless images, text-only images, ads, and duplicates. The resulting set consists of more than 3,000 memes per language. Since the proportion of memes per term was heterogeneous, we discarded the most unbalanced seeds and made sure that all seeds had at least five memes. To avoid introducing selection bias, we randomly selected memes, ensuring an appropriate distribution per seed. As a result, we have 2,000 memes per language for the training set and 500 memes per language for the test set.

Figure 3: Examples of memes from the different sexist categories: (a) ideological & inequality, (b) objectification, (c) stereotyping & dominance, (d) sexual violence, and (e) misogyny & non-sexual violence.

3.2. Dealing with Label Bias
We have considered several sources of "label bias", which may be introduced by the socio-demographic differences of the persons participating in the annotation process, but which may also arise when more than one possible correct label exists or when the decision on the label is highly subjective. In particular, we consider two sociodemographic parameters: gender (MALE/FEMALE) and age (18–22/23–45/+46 y.o.). Each meme was annotated by six annotators selected through the Prolific crowdsourcing platform. No personally identifiable information about the crowd workers was collected. Crowd workers were informed that the content could be offensive and were allowed to withdraw voluntarily at any time. Full consent was obtained. Also, as a new feature in both the 2023 and 2024 datasets, we report three additional demographic characteristics of the annotators: level of education, ethnicity, and country of residence.

3.3. Learning with Disagreement
The assumption that natural language expressions have a single and clearly identifiable interpretation in a given context is a convenient idealization, but it is far from reality, especially in highly subjective tasks such as sexism identification.
The learning with disagreements paradigm aims to deal with this by letting systems learn from datasets in which no single gold annotation is provided, but rather the annotations from all annotators, in an attempt to capture the diversity of views. Following methods proposed for training directly from data with disagreements, instead of using an aggregated label we provide all annotations per instance for the six different strata of annotators.

Table 1
Runs submitted and teams participating in each EXIST 2024 task.

                      Tweets                      Memes
             Task 1   Task 2   Task 3    Task 4   Task 5   Task 6
# Runs         106       77       63        87       36       43
# Teams         46       38       27        41       18       22

4. Lab Setup and Participation
In this section, we provide a concise overview of the approaches presented at EXIST 2024. For a comprehensive description of the systems, please refer to Section 6 and to the participants' papers. Although 148 teams from 32 different countries registered for participation, 57 teams finally submitted results, for a total of 412 runs. Teams were allowed to participate in any of the six tasks and to submit hard and/or soft outputs. Table 1 summarizes the participation in the different tasks and evaluation contexts.

The evaluation campaign started on March 4, 2024 with the release of the training set. The test set was made available on April 15. The participants were provided with the official evaluation script. Runs had to be submitted by May 10. Each team could submit up to three runs per task.

A wide range of approaches and strategies were used by the participants. Nearly all participant systems utilized large language models, both monolingual and multilingual. The most frequently employed LLMs include BERT, DistilBERT, MarIA, mDeBERTa, RoBERTa, DeBERTa, Llama, and GPT-4. For processing memes, popular vision models were employed: CLIP, BEiT and ViT. Some teams employed ensembles of multiple models to enhance the overall performance. A couple of teams made use of knowledge integration to combine different language models with linguistic features. Data augmentation techniques were used by several teams. Prompt engineering was also used to adapt pre-trained models to the sexism detection task. Only two teams utilized deep learning architectures such as BiLSTM and CNN, while four teams opted for traditional machine learning methods, including SVM, Random Forest, and XGBoost, among others. As in EXIST 2023 [12], Twitter-specific models were employed, such as Twitter-RoBERTa and Twitter-XLM-RoBERTa.

While 174 systems took advantage of the multiple annotations available and provided soft outputs, 238 followed the traditional approach of providing only hard labels as outputs. Textual tasks received greater engagement, although participation was also high in the tasks on memes. The binary classification tasks had more participants, followed by the mono-label tasks and, finally, the multi-label tasks, which reflects the increasing difficulty of these tasks.

For each of the six tasks, the organization also provided two non-informative baseline runs, as well as the gold-standard scores:

• EXIST2024 majority, a non-informative baseline that classifies all instances as the majority class.
• EXIST2024 minority, a non-informative baseline that classifies all instances as the minority class.
• EXIST2024 gold, the evaluation scores of the gold standard itself, which set the upper bound for the ICM metrics.

5. Evaluation Methodology and Metrics
As in EXIST 2023, we have carried out a "soft evaluation" and a "hard evaluation".
The soft evaluation relates to the LwD paradigm and is intended to measure the ability of the model to capture disagreements, by considering the probability distribution of labels in the output as a soft label and comparing it with the probability distribution of the annotations. The hard evaluation is the standard paradigm and assumes that a single label is provided by the systems for every instance in the dataset. From the point of view of the evaluation metrics, the tasks can be described as follows:

• Tasks 1 and 4 (sexism identification): binary classification, monolabel.
• Tasks 2 and 5 (source intention): multiclass hierarchical classification, monolabel. The hierarchy of classes has a first level with two categories, sexist/not sexist, and a second level for the sexist category with three mutually exclusive subcategories: direct/reported/judgemental. A suitable evaluation metric must reflect the fact that a confusion between not sexist and a sexist category is more severe than a confusion between two sexist subcategories.
• Tasks 3 and 6 (sexism categorization): multiclass hierarchical classification, multilabel. Again, the first level is a binary distinction between sexist/not sexist, and there is a second level for the sexist category that includes five subcategories: ideological and inequality, stereotyping and dominance, objectification, sexual violence, and misogyny and non-sexual violence. These classes are not mutually exclusive: a tweet may belong to several subcategories at the same time.

The LwD paradigm can be considered on both sides of the evaluation process:

• The ground truth. In a "hard" setting, the variability in the human annotations is reduced by selecting one and only one gold category per instance, the hard label. In a "soft" setting, the gold standard label for an instance is the set of all the human annotations existing for that instance. Therefore, the evaluation metric incorporates the proportion of human annotators that have selected each category (soft labels). Note that in Tasks 1, 2, 4 and 5, which are monolabel problems, the sum of the probabilities of the classes must be one. But in Tasks 3 and 6, which are multilabel, each annotator may select more than one category for a single instance; therefore, the sum of the probabilities of the classes may be larger than one.
• The system output. In a "hard", traditional setting, the system predicts one or more categories for each instance. In a "soft" setting, the system predicts a probability for each category, for each instance. The evaluation score is maximized when the predicted probabilities match the actual probabilities in a soft ground truth.

In EXIST 2024, for each of the tasks, two types of evaluation have been performed:

1. Soft-soft evaluation. For systems that provide probabilities for each category, we perform a soft-soft evaluation that compares the probabilities assigned by the system with the probabilities assigned by the set of human annotators. The probabilities of the classes for each instance are calculated according to the distribution of labels and the number of annotators for that instance. We use a modification of the original ICM metric (Information Contrast Measure [13]), ICM-Soft (see details below), as the official evaluation metric in this variant, and we also provide results for the normalized version of ICM-Soft (ICM-Soft Norm).
2. Hard-hard evaluation. For systems that provide a hard, conventional output, we perform a hard-hard evaluation.
To derive the hard labels in the ground truth from the different annotators' labels, we use a probabilistic threshold computed for each task. As a result, for Tasks 1 and 4, the class annotated by more than 3 annotators is selected; for Tasks 2 and 5, the class annotated by more than 2 annotators is selected; and for Tasks 3 and 6 (multilabel), the classes annotated by more than 1 annotator are selected (an illustrative sketch of this derivation is given at the end of this section). The instances for which there is no majority class (i.e., no class exceeds the threshold) are removed from this evaluation scheme. The official metric for this evaluation is the original ICM, as defined in [13]. We also report a normalized version of ICM (ICM Norm) and F1. In Tasks 1 and 4, we use F1 for the positive class (F1_YES). In Tasks 2, 3, 5 and 6, we use the macro-average of F1 over all classes (Macro F1). Note, however, that F1 is not ideal in our experimental setting: although it can handle multilabel situations, it does not take into account the relationships between classes. In particular, a confusion between not sexist and any of the sexist subclasses, and a confusion between two of the sexist subclasses, are penalized equally.

ICM is a similarity function that generalizes Pointwise Mutual Information (PMI) and can be used to evaluate outputs in classification problems by computing their similarity to the ground truth. The general definition of ICM is:

ICM(A, B) = α₁ IC(A) + α₂ IC(B) − β IC(A ∪ B)

where IC(A) is the Information Content of the instance represented by the set of features A. ICM maps into PMI when all parameters take a value of 1. The general definition of ICM by [13] is applied to cases where categories have a hierarchical structure and instances may belong to more than one category. The resulting evaluation metric has been shown analytically to be superior to the alternatives in the state of the art. The definition of ICM in this context is:

ICM(s(d), g(d)) = 2 IC(s(d)) + 2 IC(g(d)) − 3 IC(s(d) ∪ g(d))

where IC() stands for Information Content, s(d) is the set of categories assigned to document d by system s, and g(d) is the set of categories assigned to document d in the gold standard. The score for a perfect output (s(d) = g(d)) is the gold standard Information Content, IC(g(d)). The score for a zero-information system (no category assignment) is −IC(g(d)). We use these two boundaries for normalisation purposes, truncating to 0 the scores lower than −IC(g(d)).

As there is not, to the best of our knowledge, any current metric that fits hierarchical multilabel classification problems in a LwD scenario, we have defined an extension of ICM (ICM-Soft) that accepts both soft system outputs and soft ground truth assignments. ICM-Soft works as follows: first, we define the Information Content of a single assignment of a category c with an agreement v to a given instance as the probability that instances in the gold standard reach or exceed the agreement level v for the category c:

IC({⟨c, v⟩}) = − log₂ P({d ∈ D : g_c(d) ≥ v})

In order to estimate IC, we compute the mean and standard deviation of the agreement levels for each class across instances, and apply the cumulative probability over the inferred normal distribution. In the case of zero variance, we must consider that the probability for values equal to or below the mean is 1 (zero IC) and that the probability for values above the mean must be smoothed; however, this is not the case for the EXIST datasets.
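To make the derivation of the two kinds of ground truth more concrete, the following minimal sketch shows how the soft and hard gold labels described earlier in this section can be computed from the six annotations of a single instance. It is an illustration only, not the official EXIST evaluation code; the function and variable names are our own.

```python
# Minimal sketch (not the official EXIST evaluation code) of how soft and hard ground
# truths can be derived from the annotations of one instance.
from collections import Counter

def soft_label(annotations, classes, n_annotators=None):
    """Soft gold label: proportion of annotators that selected each class.

    For monolabel tasks (1, 2, 4, 5) `annotations` holds one label per annotator and the
    proportions sum to 1. For multilabel tasks (3, 6) it holds every label chosen by every
    annotator, `n_annotators` must be passed explicitly, and the proportions may sum to
    more than 1, as explained above.
    """
    n = n_annotators or len(annotations)
    counts = Counter(annotations)
    return {c: counts.get(c, 0) / n for c in classes}

def hard_label(annotations, min_votes):
    """Hard gold label(s): the classes chosen by more than `min_votes` annotators.

    EXIST 2024 thresholds: Tasks 1 and 4 -> more than 3 annotators, Tasks 2 and 5 ->
    more than 2, Tasks 3 and 6 (multilabel) -> more than 1. Instances where no class
    exceeds the threshold are removed from the hard-hard evaluation.
    """
    counts = Counter(annotations)
    selected = [c for c, votes in counts.items() if votes > min_votes]
    return selected or None  # None -> instance dropped from the hard evaluation

# Example: six annotators labelling one tweet for Task 1.
votes = ["YES", "YES", "NO", "YES", "YES", "NO"]
print(soft_label(votes, ["YES", "NO"]))  # {'YES': 0.67, 'NO': 0.33} (approximately)
print(hard_label(votes, min_votes=3))    # ['YES']: 4 of 6 annotators, i.e. more than 3
```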
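The estimate of the information content used by ICM-Soft can likewise be sketched in a few lines. This is a simplified illustration under the stated assumption of a fitted normal distribution; it is not the official metric implementation, and the normalisation helper only mirrors the bounds mentioned in the text.

```python
# Simplified sketch of the IC estimate used by ICM-Soft: fit a normal distribution to the
# agreement levels of a class across the gold standard and take -log2 of the probability of
# reaching at least the observed agreement. Not the official metric code.
import numpy as np
from scipy.stats import norm

def information_content(gold_agreements, v):
    """IC({<c, v>}) = -log2 P(g_c(d) >= v), with P estimated from a fitted normal."""
    mu, sigma = np.mean(gold_agreements), np.std(gold_agreements)
    # The zero-variance case would require smoothing (see above); it does not occur in the
    # EXIST datasets, so it is ignored here.
    p = norm.sf(v, loc=mu, scale=sigma)  # survival function: P(X >= v)
    return -np.log2(p)

def normalise(icm, ic_gold):
    """Hypothetical normalisation mirroring the bounds given in the text: a perfect output
    scores IC(g(d)), a zero-information output scores -IC(g(d)), and lower scores are
    truncated to 0."""
    return max(0.0, (icm + ic_gold) / (2 * ic_gold))

# Example: agreement levels (fraction of annotators) observed for class YES in the gold
# standard, and the IC of assigning YES with agreement 0.83 to one instance.
gold_agreements = [0.0, 0.17, 0.33, 0.50, 0.67, 0.83, 1.0]
print(information_content(gold_agreements, 0.83))
```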
Due to the multi-label and hierarchical nature of the classification task, for each classification instance the gold standard, the system output and their union (IC(s(d)), IC(g(d)) and IC(s(d) ∪ g(d))) are sets of category assignments. The union of the assignments (i.e., s(d) ∪ g(d)) is computed as for fuzzy sets, i.e., taking the maximum values. In order to estimate the information content of a set of assignments, we apply a recursive function similar to the one described by Amigó and Delgado [13] for assignment sets, which avoids the redundant information of parent categories:

IC(⋃_{i=1..n} {⟨c_i, v_i⟩}) = IC({⟨c_1, v_1⟩}) + IC(⋃_{i=2..n} {⟨c_i, v_i⟩}) − IC(⋃_{i=2..n} {⟨lca(c_1, c_i), min(v_1, v_i)⟩})    (30)

where lca(a, b) is the lowest common ancestor of categories a and b.

6. Overview of EXIST 2024 Approaches
In this section, we provide a description of the approaches adopted by the participants. More detailed information on each work is provided in the participants' working notes.

Team FraunhoferSIT [14] utilized a stacking ensemble of machine learning models. For predicting hard labels, they used an ensemble of classification models: Multinomial Naive Bayes, Stochastic Gradient Descent, Decision Tree, k-Nearest Neighbors, Logistic Regression and Extra Trees. For predicting soft labels, they used an ensemble of regression models: Random Forest, Gradient Boosting, Stochastic Gradient Descent and AdaBoost. They employed two augmentation methods at the word level: synonym replacement using WordNet for English tweets, and contextual augmentation with three transformer models: BERTIN, ALBERT Base Spanish, and RoBERTuito. They participated in the first three tasks.

Team frms [15] participated in Tasks 1, 2 and 3, for both the hard and soft evaluations. For all three tasks, their first run used multilingual BERT and their second run used the XLM-RoBERTa model. In the third task, they created an ensemble of BERT and XLM-RoBERTa, combining the predictions from both models. For Task 1, the RoBERTa model performed best in the hard evaluation but, for Task 2, BERT performed better in both the hard and soft evaluations. For Task 3, the ensemble obtained the best results in the hard evaluation and tied with RoBERTa in the soft evaluation.

Team RMIT-IR [16] proposed different approaches for Tasks 1-3 and Task 4. For Tasks 1–3 (on tweets), they studied the effectiveness of zero-shot In-Context Learning (ICL) with off-the-shelf pre-trained Large Language Models (LLMs). Their approaches for meme classification (Task 4) utilize CLIP (Contrastive Language-Image Pre-training) to experiment with multi-modal embeddings and zero-shot sexism identification models. They participated in the hard and soft evaluations, obtaining soft labels by generating six answers and calculating the proportions. The annotators' genders and/or study levels were included in some runs for the first three tasks. For Task 4, they used three systems: TI-CLIP (feedforward network), TIMV-CLIP (Transformer encoder), and Prompt-CLIP (zero-shot). TIMV-CLIP stands out for its performance, especially on the English memes, where it achieves the best result in the soft evaluation with an ICM-Soft Norm score of 0.4998.

Team dap-upv [17] submitted hard labels for Tasks 4 and 6. For Task 4, they utilized a two-stage approach, fine-tuning the Contrastive Language-Image Pre-training (CLIP) model followed by a classifier. A different classifier was tested in each run; the best one was a Light Gradient Boosting Machine (LightGBM), reaching an F1 of 0.72.
For Task 6, they fine-tuned a RoBERTa model for text and a Google Vision Transformer (ViT) for images, and trained a LightGBM classifier on the concatenated embeddings. The ensemble notably improved results, obtaining an F1 of 0.49.

Team Aditya [18] only participated in Task 1, submitting hard labels. They preprocessed tweets by removing emojis, URLs and mentions. They used XLM-RoBERTa fine-tuned on the dataset, and their three runs differ in the number of training epochs and in whether the development set was also used for training. Their best model, trained for 12 epochs, ranked 14th with an F1 of 0.7691.

Team BAZI [19] fine-tuned various transformer models, some monolingual in English and Spanish and some multilingual, and chose XLM-RoBERTa as the best performing of them. With this method, they provided both hard and soft labels: hard labels were the direct outputs from the model, and soft labels were obtained by adding a softmax function to the last layer. Their approach using XLM-RoBERTa achieved 4th place in the soft evaluation for Task 1, and 2nd place for Task 2. They also employed few-shot learning with GPT-3.5, using three examples in English and three in Spanish from the training set, and provided hard labels with this method.

Team mc-mistral_2 [20] submitted hard labels for Task 1. Their approach leveraged a Mistral 7B model along with a few-shot learning strategy and prompt engineering to address the task in the hard labelling setup. They translated Spanish instances to English on the fly using the Google Translator interface of the deep_translator library. They then randomly selected 10 samples from the provided labelled training set and formatted the test samples as: Tweet1 // NO, Tweet2 // YES. In the global ranking, they achieved an F1 score of 0.51.

Team CNLP-NITS-PP [21] presented two systems to cover all tasks in EXIST 2024. The model for the textual modality is a Convolutional Neural Network - Bidirectional Long Short-Term Memory (CNN-BiLSTM); it is used in Tasks 1, 2 and 3, and to characterize the textual data in Tasks 4, 5 and 6. Texts are lowercased, tokenized into words, common stopwords are deleted, and words are represented with GloVe embeddings. For memes, a combination of a Residual Network 50 (ResNet-50), used to analyze the images, and text-based analysis is utilized. Images are resized to 224x224 pixels, and pixel values are normalized to the interval [0, 1]. To enhance model robustness, data augmentation techniques such as random rotation, flipping, and colour jitter are applied. Hyperparameter tuning is conducted via grid search and k-fold cross-validation. A single run was submitted for every task, with both hard and soft labels. Their results stand out in Task 5, where they reached the 5th position in the ranking.

Team Awakened [22] studied the most appropriate way of assembling transformer models by comparing the performance of different models and assigning different weights in the ensemble to the best model. They used both English and multilingual models, and both general language models (like DistilBERT or XLM-RoBERTa) and domain-specific models (like twitter-xlm-roberta-base-sentiment or roberta-hate-speech-dynabench-r4). They tested three ways of assembling the models: assigning half of the weight to the most dominant model, assigning it 75% of the weight, and assigning it all the weight. Results showed that, in most of the tasks, the best configuration was the one in which the most dominant model was assigned 75% of the weight.
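Weighted combinations of transformer outputs, like the one Awakened describes, are straightforward to express. The snippet below is a generic sketch of a weighted soft-voting ensemble, assuming each model already returns class probabilities; the model names, probabilities and weights are invented for illustration, and this is not the team's actual implementation.

```python
# Generic weighted soft-voting over the class probabilities of several models.
# Illustrative only; the model names, probabilities and weights below are invented.
import numpy as np

def weighted_ensemble(probs_by_model, weights):
    """probs_by_model: model name -> array of shape (n_instances, n_classes).
    weights: model name -> weight (the weights should sum to 1)."""
    combined = sum(weights[name] * probs for name, probs in probs_by_model.items())
    return combined  # soft output; each row still sums to 1 if the inputs do

probs_by_model = {
    "dominant-model":  np.array([[0.90, 0.10], [0.40, 0.60]]),
    "support-model-1": np.array([[0.60, 0.40], [0.70, 0.30]]),
    "support-model-2": np.array([[0.80, 0.20], [0.20, 0.80]]),
}
# Give the dominant model 75% of the weight and split the remainder equally.
weights = {"dominant-model": 0.75, "support-model-1": 0.125, "support-model-2": 0.125}

soft_output = weighted_ensemble(probs_by_model, weights)  # soft labels (probabilities)
hard_output = soft_output.argmax(axis=1)                  # hard labels (class indices)
```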
The NYCU-NLP team [23] employed extensive data preprocessing techniques, which included removing irrelevant elements, standardizing text formats, back-translation using the Google Translator API, and implementing the AEDA method for text augmentation. Additionally, they adapted the Round to Closest Value approach to handle non-continuous annotation values. The system relies on two transformer-based language models: DeBERTa-v3 and XLM-RoBERTa. The team integrated annotator information such as gender, age, and ethnicity, creating a unified vector representation for each tweet. They further incorporated Hard Parameter Sharing to optimize shared layers across tasks, improving generalization and computational efficiency. Notably, their model achieved outstanding performance in the EXIST 2024 challenge, securing first place in Tasks 1, 2, and 3 in the soft evaluation setting. In the hard setting, their system ranked first in Task 1, second in Task 2, and third in Task 3.

Team shm2024 [24] participated in Tasks 1 and 2 with hard labels. They implemented three different models. The first system was a MultiLayer Perceptron (MLP) classifier with Language Agnostic BERT Sentence Embeddings (LaBSE). The second one was an eXtreme Gradient Boosting (XGBoost) classifier, and the third approach was an ensemble of CNN models. The first model was the best performing on the test set, with an ICM-Hard Norm value of 0.6623 and an F1_YES value of 0.7044 for Task 1 and, for Task 2, an ICM-Hard Norm of 0.2115 and a Macro F1 of 0.1200.

UMUTeam [25] submitted soft predictions for Tasks 1 and 2, and hard labels for the rest of the tasks. For the textual tasks, they created an ensemble of two Spanish LLMs, BETO and MarIA, and two multilingual LLMs, DeBERTa v3 and XLMTwitter. They also extracted linguistic features (LFs) using the UMUTextStats tool and performed hyperparameter tuning on 10 models. They tried two ways of ensembling: Knowledge Integration (KI) and Ensemble Learning (EL). Their three runs were KI, EL and LFs. Their results were better for Spanish than for English, and their best results were obtained in Task 2, where KI reached the 8th position. For the multimodal tasks, they used a CLIP model to encode images and text. Their architecture was formed by an image encoder (the CLIP image encoder), a text encoder (the CLIP text encoder), diagonal multiplication, and a classification head. Their results were near the baselines.

The 3 Musketeers team [26] applied traditional machine learning classification algorithms, such as Logistic Regression (LR), Random Forest (RF) and Support Vector Machine (SVM), and the BERT transformer model to EXIST Task 1. They followed a preprocessing pipeline including lowercasing and removing punctuation, emoticons, links, mentions and stopwords. They lemmatized words, used TF-IDF vectorization, and used GridSearch to optimize hyperparameters. Their best model was an SVM that obtained an F1 score of 0.6299.

Team TextMiner [27] submitted hard labels for Task 1. They preprocessed the text to standardize it, and text representations were created using TF-IDF. They performed feature selection over word n-grams and character n-grams. They explored a diverse range of classifiers: Random Forest, Extra Trees, LightGBM, AdaBoost, Bernoulli Naive Bayes, and Support Vector classifiers.
They experimented with different combinations of hyperparameters and created three ensembles of models: one with the 10 best models, another with the 50 best models, and the last one with the 100 best models. The best ensemble was the one with the top 50 models, which came in 39th place.

The team Victor-UNED [28] participated in every task of EXIST 2024. Their system employed a concatenation of models based on the predicted level of agreement of the instances. They trained various transformer-based models on the textual training dataset to generate predictions for Tasks 1 and 4. Models exposed to more information in their learning phase were used to determine the soft labels of the instances with a lower level of agreement. Models were trained with soft labels, and hard labels were obtained from the resulting soft labels. For Tasks 2 & 5 and Tasks 3 & 6, the results from the previous tasks were incorporated, taking into account the hierarchical nature of the challenge. The first run was mDeBERTa-v3 trained on this year's dataset, the second run used the concatenated models to determine low-agreement cases, and the third run took into account an annotator ensemble to distinguish low-agreement cases. Their approach achieved top rankings in Tasks 4 and 5, and was one of the most consistent models across all tasks.

Team PINK [29] participated in Task 4 on sexism detection in memes. They proposed a unified, multimodal Transformer-based architecture capable of dealing with multiple languages, namely English and Spanish. Their architecture extracts high-level features using large-scale, pre-trained models that are kept frozen during training. These features are then normalized, projected into the same dimensional space, and conditioned on the language of the sample and its modality before being processed by a Transformer encoder backbone. The final classification is predicted through average pooling and a linear projection. With this architecture, they created three types of systems: Single Models, Majority Voting Ensembles (MVE) and Average Probability Ensembles (APE). Each of these systems was used to obtain a set of predictions, both hard and soft labels. Their approach reached the 10th and 20th places in the final ranking for the soft- and hard-label evaluations, respectively.

Team RoJiNG-CL [30] investigated using large language models (LLMs), specifically GPT-4, to extract textual descriptions from the images in Task 4. They obtained these descriptions from GPT-4's zero-shot prompting, with textual prompts alongside the memes. They integrated these descriptions with the related texts to fine-tune both monolingual and multilingual models, enhancing their ability to identify sexist content in memes with hard labels. They experimented with several transformer-based models and optimized their hyperparameters with Optuna. The first run used BERT fine-tuned on the English data and BETO fine-tuned on the Spanish data, the second run used mDeBERTa, and the third run used GPT-4 output results. Their submissions secured the top three positions on the hard-hard evaluation leaderboard, encompassing both English and Spanish instances. The GPT-4 based predictions emerged as the most effective, delivering top results in a zero-shot setting.

Team NICA [31] participated in Tasks 1 to 5. For the textual tasks, they used various multilingual transformer models to detect sexism in English and Spanish tweets. Their runs worked with XLM-RoBERTa-Large-Twitter, multilingual BERT, and multilingual DistilBERT.
First, they preprocessed the texts, focusing on eliminating tags and URLs in the tweets. Their experiments showed that BERT outperformed the other models; however, XLM-RoBERTa-Large-Twitter showed notably better results on the test set. For Tasks 4 and 5, they employed the CLIP model, which leverages both the image and the corresponding text data to identify sexist elements. CLIP yielded promising results, reaching the 9th position in the soft evaluation in Task 4, and the 4th position in both the hard and soft evaluations in Task 5.

Team Mind [32] presented an approach for detecting sexism in memes in Task 4. They used ResNet50 as the image encoder and m-BERT to create the text embeddings, fine-tuned on the EXIST 2024 dataset. Once they had the encodings, they applied a projection layer for dimensionality reduction and feature transformation on the input vectors. In order to combine the image and text features into concatenated data, they used the feature interaction matrix (FIM). Then, they trained a contrastive learning-based model on these embeddings. They computed the cosine similarity between each test sample and all the training samples, and used the K-Nearest Neighbors (KNN) algorithm to select the 10 training embeddings with the highest cosine similarity to each sample. The model achieved ICM scores of 0.2778 for English, 0.2152 for Spanish, and 0.2465 for the combined dataset.

Team DiTana-PV [33] focused on the hard evaluation of Tasks 4 and 6. Their objectives were to evaluate the effect of machine translation on model performance and to explore data augmentation techniques. They automatically translated the Spanish data to English and leveraged a trained version of the BERTweet model, BERTweet-large-sexism-detector, fine-tuned on the dataset of SemEval-2023 Task 10 (EDOS). They then used data augmentation techniques to increase the amount of data and reduce the imbalance, using BERT contextual embeddings to paraphrase the words in the original text. A separate model was trained for each language. For Task 4, the runs included models that added a weighted-loss function, and a weighted-loss function plus data augmentation; the best performing one was the model with the weighted-loss function and no data augmentation. For Task 6, their models tried to predict 5 or 6 labels at a time; results showed that the model predicting 5 labels performed better.

Team Atresa-I2C-UHU [34] participated in Tasks 4 and 5, submitting both hard and soft labels. They focused on working with perspectives and Learning with Disagreement. They trained the multilingual versions of BERT and RoBERTa with different hyperparameters to analyze their effect on each perspective with enough values (gender; age; the Bachelor's and High school levels of study; the White ethnicity group). They chose the versions of BERT and RoBERTa that worked best for every perspective, and combined them. The data was preprocessed and translated to generate supplementary training datasets. The final runs were combinations of BERT models, or of BERT and RoBERTa models, to deal with the hard and soft evaluations in each task. For Task 4, they ranked 4th, with ICM-Hard and ICM-Soft scores of 0.5668 and 0.4476, respectively. For Task 5, they secured 2nd and 10th places with ICM-Hard and ICM-Soft scores of 0.4119 and 0.2023, respectively.
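Several of the systems described in this section (e.g., Atresa-I2C-UHU above, and I2C-UHU_2, MMICI and ABCD below) follow a perspectivist strategy: train or select one classifier per annotator group and then combine the group-level outputs. The sketch below is a generic, hypothetical illustration of that idea, where each group-specific probability plays the role of one annotator stratum; it does not reproduce any team's implementation, and the group names and numbers are invented.

```python
# Hypothetical sketch of combining per-annotator-group classifiers: each group-specific
# model contributes one probability distribution, and the average is used as the soft
# label, mimicking the distribution of annotator perspectives. Numbers are invented.
import numpy as np

# Probability of [sexist, not sexist] for one tweet, one model per annotator stratum.
group_probs = {
    "female_18-22": np.array([0.92, 0.08]),
    "female_23-45": np.array([0.85, 0.15]),
    "female_46+":   np.array([0.74, 0.26]),
    "male_18-22":   np.array([0.55, 0.45]),
    "male_23-45":   np.array([0.48, 0.52]),
    "male_46+":     np.array([0.40, 0.60]),
}

soft = np.mean(list(group_probs.values()), axis=0)  # soft label for the soft evaluation
hard = "YES" if soft[0] > 0.5 else "NO"             # simple majority for the hard evaluation
print(soft, hard)                                   # [0.657 0.343] YES (approximately)
```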
Team CAU&ITU_2 [35] investigated a broad range of models, including traditional machine learning methods, such as ensemble and probability-based models like Random Forest and XGBoost, and deep learning architectures based on multilingual BERT. In order to obtain vector representations of the texts, they implemented BOW and TF-IDF, and the machine learning methods were trained with both alternatives. Hyperparameter fine-tuning was performed with RandomizedSearchCV. These methods were explored for binary classification in Task 1 and multi-class classification in Task 2; multilingual BERT performed better than the rest of the models in both tasks. For Task 3, only multilingual BERT was evaluated. Every experiment was evaluated with hard labels.

Team I2C-UHU_2 [36] participated in Tasks 1 and 2, aiming to employ Learning with Disagreement techniques and explore annotators' perspectives to obtain more robust models. They first preprocessed the data with cleaning and normalization techniques, and then applied data augmentation strategies and hyperparameter optimization with Optuna. They explored different transformer-based models. For Task 1, the first run used XLM-RoBERTa Base to predict all instances, and the second run combined DeBERTa v3 Base for the English dataset and RoBERTa Base BNE for the Spanish dataset. For Task 2, they submitted three runs: the first one used XLM-RoBERTa Base trained with the whole dataset, the second one implemented an ensemble of XLM-RoBERTa Base models, one per annotator group, and the third one was an ensemble of XLM-RoBERTa Base models, one per age group. In Task 1, they secured the 10th position in the hard evaluation and the 13th position in the soft evaluation. In Task 2, they achieved the 11th position in the hard evaluation and the 17th position in the soft evaluation.

Team maven [37] submitted hard labels for the textual Tasks 1, 2 and 3. They focused on creating a stacking classifier composed of an ensemble of four LLMs, built by calculating the highest Complementary Error Correction. The stacking classifier was based on LightGBM, whose parameters were fine-tuned with Optuna, and was fed with the output scores of the four LLMs. They created two datasets from the original EXIST dataset by translating all data from English to Spanish and from Spanish to English. They performed preprocessing: lowercasing and eliminating mentions, hashtags, links, numerals, etc. Of the four LLMs, two were trained with English data (DistilBERT + RoBERTa) and the other two with Spanish data (somosnlp-hackathon-2022/twitter-sexismo-finetuned-robertuito-exist2021 and annahaz/xlm-roberta-base-misogyny-sexism-indomain-mix-bal). For Tasks 2 and 3, their system was based on BERT. The best model obtained an F1-score of 0.7359 in Task 1, 0.4563 in Task 2, and 0.4491 in Task 3.

Team CIMAT-CS-NLP [38] participated in Task 1 with hard and soft labels, and in Task 2 with hard labels. The proposed methods for both tasks are based on unifying the knowledge of two different systems: zero-shot classification with LLMs through prompting, and supervised fine-tuning of multilingual transformer models. The zero-shot classification was performed with the Gemini API (gemini-1.0-pro). Four kinds of results were taken into account, depending on the type of prompt engineering applied. On the other hand, three types of transformer models were fine-tuned on the dataset, namely XLM-RoBERTa, mBERT and Twitter-XLM-RoBERTa.
With seven types of results (one per prompt and per model), hard labels were obtained by three methods: creating a new input for fine-tuning, using the proportion of votes, and selecting the best LLM response or the best fine-tuned model. Soft labels were obtained from the proportion of votes. The best system achieved 3rd place in the hard evaluation for all tweets, with an F1 (positive class) of 0.7899. The highest ranked model for soft labels was in 5th place, and the two best results obtained for Task 2 were ranked 7th and 8th.

Team Medusa [39] addressed Task 3, sexism categorization in tweets, with hard and soft labels. They aimed to study the performance of two architectural archetypes: Classifier Chain and Binary Relevance. The Binary Relevance architecture assumes that each label is independent of the others and can therefore be treated separately. In the Classifier Chain architecture, classifiers are chained together so that the predictions for individual labels become features for the other classifiers. These architectures constitute the multi-label classifier head situated on top of pretrained models from the XLM-RoBERTa family. They trained various models and selected the Best BR, Best Chain, and "Best ICM-Soft" models, that is, the models with the lowest BCE loss on the validation set. Their models achieved the 4th, 5th, and 6th positions in the soft evaluation ranking.

Team Penta ML [40] participated with soft labels in Tasks 4 and 5, and with hard labels in Tasks 4, 5 and 6. They presented a multimodal architecture with five different components: (i) a pretrained Vision-Language (ViLT) model, which employs BERT as the text processor and ViT (Vision Transformer) as the image processor; (ii) semantics from pooled representations; (iii) an attention-enhanced context vector for each modality; (iv) modality fusion; and (v) a classification head based on the cross-entropy loss. They experimented with CLIP and ViLT as their baseline models, concatenating the representations from images and texts and passing them to an MLP for classification. Since ViLT outperformed CLIP, it was used as the model in their approach. Results showed that ViLT alone obtained the best results in the ICM metrics; however, their approach achieved better Macro F1 performance in Task 6.

Team CIMAT-GTO [41] only participated in Task 1, with hard labels. They explored the reasoning capabilities of Llama 3 in a two-step process. In the first stage, they generated "reasoning" texts using an LLM, aiming to understand the tweets' nature. These rationales are added to the tweets and, in the second stage, they are processed further with a pre-trained XLM-RoBERTa model trained on multilingual tweets. They differentiated between various types of reasoning: positive, negative and comparative reasoning. They also included answers from the LLM (Llama 3) to questions about sexism identification, processed these answers with a multi-layer FFN, and then concatenated them with the text. The submitted models corresponded to a RoBERTa model with negative reasoning, an ensemble of models with negative and comparative reasoning, and an ensemble of models with negative, comparative and answering reasoning. The latter was the best performing model, reaching the 4th position in the hard evaluation ranking for Task 1.

Team EquityExplorers [42] submitted two runs of hard labels for Task 1, corresponding to the results of their two approaches: the Dual-Transformer Fusion Network (DTFN) and the Multimodel Fusion Ensemble (MFE).
The DTFN is based on the fusion of two Transformer models, RoBERTa-Large and DeBERTa-V3-Large. This ensemble model leverages the distinctive characteristics of each constituent model to enhance text classification. MFE is a more complex approach based on an ensemble of LLMs (Mistral-7b, RoBERTa-Large and DeBERTa-V3-Large) and the DTFN, using a majority voting mechanism. The evaluation showed that these methodologies significantly outperform existing models, with MFE and DTFN ranking 1st and 2nd, respectively, in the English segment, and 4th and 13th in the combined English and Spanish segment of the official leaderboard.

Umera Wajeed Pasha [43] participated in Task 4 on sexism identification in memes. The study starts by importing and visualizing the meme dataset, then pre-processing the images using techniques including cropping, scaling, and normalization to get them ready for model training. The pre-trained CLIP model is used to extract features, and the dataset is split into training and validation sets for memes in both Spanish and English. The extracted features are used to train and assess a variety of machine learning models, such as Logistic Regression, SVM, XGBoost, Decision Trees, Random Forest, a Neural Network, AdaBoost, and SGD. The Random Forest model performed the best of all of them.

Team VerbaNex AI [44] proposed a method to deal with Task 1 in the hard setting. They implemented a profiling approach based on demographic factors: gender, education level, and age. This allowed them to categorize the profiles into four groups based on their likelihood of labelling messages as sexist or not sexist. They then performed feature extraction and trained four distinct systems based on the grouped profiles and their responses. To address class imbalance, they used K-Fold Stratified Shuffle-Split. They incorporated the Twitter-roBERTa-base model specifically fine-tuned for sentiment analysis. This method was evaluated using the testing profiles, achieving an F1 score of 0.745. In the evaluation phase, their approach yielded an F1 score of 0.63.

Team MMICI [45] participated in every task of the EXIST challenge. For the textual tasks, they used transformer models ("cardiffnlp/twitter-roberta-base-sentiment" for English and "pysentimiento/robertuito-base-uncased" for Spanish). They created two types of ensembles: the first one used a majority vote over the outputs of six different models, one for each annotator, and the second one used a majority vote over the outputs of five different models, focusing on gender and age. For the multimodal tasks, they utilized CLIP embeddings using a Vision Transformer (ViT) model and two types of classifiers: FNNs and Factorization Machines. In runs 1 and 2, demographic information was represented using one-hot encoding, whereas in run 3 a descriptive text was created for the annotator features, from which embeddings were extracted. In runs 2 and 3 the classifier was an FNN, and in run 3 they also proposed a Factorization Machine model. Their best performances include a 10th place in Task 1, a 15th place in Task 2, and a 13th place in Task 3 for Spanish tweets. For memes, they achieved a 3rd place in Task 4 for English.

Team Penta-nlp [46] participated in Tasks 1 to 3. They explored multiple approaches: Machine Learning (ML), Deep Learning (DL) and transformer-based pretrained models. The ML models included Support Vector Machine, Random Forest, XGBoost (Tasks 1 & 2) and Logistic Regression (Task 3).
The DL models leveraged both LSTM and LSTM + Attention models, while the transformer-based approaches explored XLM-RoBERTa, mBERT and BETO. They conducted experiments using various preprocessing methods (removing usernames, URLs, punctuation and emojis) and showed that the models performed best without URLs for Tasks 1 and 3. A single run with hard labels was submitted to each task, reaching the 29th position in Task 1 and the 9th position in Task 2.

The ABCD team [47] participated in Tasks 1, 2 and 3 with both hard and soft labels. In their approaches, they employed both LLMs, like Llama 2 and T5, and smaller models, like XLM-RoBERTa. They divided the datasets into six subsets corresponding to each annotator group. The subsamples are preprocessed, and prompt engineering is applied to the LLMs. Then, smaller transformer models are fine-tuned on each subset and predictions are collected for each model. To incorporate the hierarchical structure of Tasks 2 and 3, they only made predictions for subsamples classified as sexist in Task 1. Their best performing model achieved 2nd place in Task 1 and 1st place in Tasks 2 and 3 in the hard evaluation.

7. Results
In the next subsections, we report the results of the participants and the baseline systems for each task. Disaggregated results for each language, English and Spanish, may be found on the EXIST 2024 website (http://nlp.uned.es/exist2024/).

7.1. Task 1: Sexism Identification in Tweets
We first report and analyze the results for Task 1, which focuses on sexism identification in tweets. This task involves a binary classification. As discussed in Section 5, we report two sets of evaluation results (hard and soft).

7.1.1. Soft Evaluation
Table 2 presents the results of the soft-soft evaluation for Task 1. A total of 37 runs were submitted. Out of these, 34 runs outperformed the non-informative majority class baseline (where all instances are labeled as "NO"), and all runs surpassed the non-informative minority class baseline (where all instances are labeled as "YES"). We observed a significant discrepancy in performance, with ICM-Soft Norm scores ranging from 0.6755 to 0.0374. However, if we analyze the top 5 systems, we see a difference of less than 5 percentage points. Notably, the best run achieved an ICM-Soft Norm score of 68% for this binary classification task, surpassing the top performance of 64% recorded by the best EXIST 2023 participant. This suggests that new models and approaches are becoming more effective at detecting sexism in social networks. However, it also indicates that there is still room for improvement.

Table 2: Results of Task 1 in the soft-soft evaluation.
Table 2: Results of Task 1 in the soft-soft evaluation.
Run  Rank  ICM-Soft  ICM-Soft Norm  Cross Entropy
EXIST2024 gold  0  3.1182  1.0000  0.5472
NYCU-NLP_1 [23]  1  1.0944  0.6755  0.9088
NYCU-NLP_2  2  1.0866  0.6742  0.8826
NYCU-NLP_3  3  1.0810  0.6733  0.9831
ABCD Team_3 [47]  4  0.9291  0.6490  1.2637
CIMAT-CS-NLP_3 [38]  5  0.9285  0.6489  1.2252
CIMAT-CS-NLP_1  6  0.8468  0.6358  1.2538
ABCD Team_1  7  0.8316  0.6333  1.6727
CIMAT-CS-NLP_2  8  0.8213  0.6317  1.2684
BAZI_1 [19]  9  0.8179  0.6311  0.9750
Awakened_2 [22]  10  0.7196  0.6154  0.8106
Victor-UNED_1 [28]  11  0.6952  0.6115  1.0691
Awakened_3  12  0.6909  0.6108  0.8542
I2C-UHU_2 [36]  13  0.6871  0.6102  0.9184
Victor-UNED_2  14  0.6797  0.6090  0.9818
UMUTEAM_1 [25]  15  0.6679  0.6071  0.8708
Awakened_1  16  0.6663  0.6068  0.8037
Victor-UNED_3  17  0.6479  0.6039  1.0930
I2C-UHU_1  18  0.5175  0.5830  1.0666
UMUTEAM_2  19  0.5033  0.5807  0.8357
ABCD Team_2  20  0.4594  0.5737  1.2164
MMICI_3 [45]  21  0.4589  0.5736  2.0316
clac_1  22  0.1431  0.5230  2.9543
RMIT-IR_1 [16]  23  −0.0011  0.4998  2.7892
FraunhoferSIT_1 [14]  24  −0.0658  0.4895  0.8801
CNLP-NITS-PP_1 [21]  25  −0.2086  0.4666  1.0390
RMIT-IR_3  26  −0.3016  0.4516  2.8235
Atresa-I2C-UHU_1 [34]  27  −0.3256  0.4478  3.9518
CNLP-NITS-PP_2  28  −0.3286  0.4473  0.9236
MMICI_1  29  −0.3394  0.4456  0.8375
MMICI_2  30  −0.3622  0.4419  0.8382
RMIT-IR_2  31  −0.3941  0.4368  2.9956
UMUITEAM_3  32  −0.4170  0.4331  1.0933
UniLeon-UniBO_1  33  −1.1882  0.3095  1.2449
UniLeon-UniBO_2  34  −1.3306  0.2866  2.0069
UniLeon-UniBO_3  35  −1.3447  0.2844  2.0239
EXIST2024 majority  36  −2.3585  0.1218  4.6115
NICA_3 [31]  37  −2.8848  0.0374  1.5286
NICA_2  38  −2.8848  0.0374  1.3862
NICA_1  39  −2.8848  0.0374  1.2301
EXIST2024 minority  40  −3.0717  0.0075  5.3572
7.1.2. Hard Evaluation
Table 3 presents the results for the hard-hard evaluation. In this scenario, the annotations from the six annotators are combined into a single label using the majority vote. Out of the 67 systems submitted for this task, 66 ranked above the majority class baseline (all instances labeled as “NO”). All systems surpassed the minority class baseline (all instances labeled as “YES”). Similar to the soft-soft evaluation, the results vary considerably. If we focus on the ICM-Hard normalized metric, we observe that the best run gets 0.8002 while the worst one gets only 0.2665. If we focus on the top 5 systems, we observe that they achieve comparable results.
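Several of the strongest hard-label runs described above (e.g., MFE and the MMICI ensembles) combine the hard outputs of several models by majority voting; the sketch below shows that mechanism, together with a weighted variant similar to the one discussed in Section 8. Model names and votes are illustrative placeholders, not any team's actual outputs.

```python
# Hedged sketch of hard-label voting ensembles: each constituent model casts a
# YES/NO vote and the most frequent label wins; a weighted variant (cf. Section 8)
# favours the strongest model. Model names and votes are illustrative placeholders.
from collections import Counter

def majority_vote(labels):
    """Unweighted majority vote; ties fall to Counter's first most-common label (assumption)."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Weighted vote: each model's label is counted with the model's weight."""
    scores = Counter()
    for model_name, label in predictions.items():
        scores[label] += weights[model_name]
    return scores.most_common(1)[0][0]

preds = {"roberta-large": "YES", "deberta-v3-large": "YES", "llm-zero-shot": "NO"}
print(majority_vote(list(preds.values())))                       # YES
print(weighted_vote(preds, {"roberta-large": 0.3,
                            "deberta-v3-large": 0.5,
                            "llm-zero-shot": 0.2}))              # YES
```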
Table 3: Results of Task 1 in the hard-hard evaluation.
Run  Rank  ICM-Hard  ICM-Hard Norm  F1YES
EXIST2024 gold  0  0.9948  1.0000  1.0000
NYCU-NLP_1  1  0.5973  0.8002  0.7944
ABCD Team_1  2  0.5957  0.7994  0.7826
CIMAT-CS-NLP_2  3  0.5926  0.7978  0.7899
EquityExplorers_2 [42]  4  0.5883  0.7957  0.7775
CIMAT-GTO_3 [41]  5  0.5848  0.7939  0.7903
CIMAT-GTO_2  6  0.5798  0.7914  0.7887
ABCD Team_3  7  0.5766  0.7898  0.7823
NYCU-NLP_3  8  0.5749  0.7889  0.7813
NYCU-NLP_2  9  0.5619  0.7824  0.7785
I2C-UHU_2  10  0.5557  0.7793  0.7733
BAZI_1  11  0.5490  0.7759  0.7755
CIMAT-CS-NLP_1  12  0.5486  0.7757  0.7746
EquityExplorers_1  13  0.5448  0.7738  0.7615
ADITYA_3 [18]  14  0.5418  0.7723  0.7691
CIMAT-GTO_1  15  0.5407  0.7718  0.7694
CIMAT-CS-NLP_3  16  0.5357  0.7692  0.7700
MMICI_3 [45]  17  0.5324  0.7676  0.7637
ADITYA_2  18  0.5246  0.7636  0.7669
NICA_1  19  0.5214  0.7621  0.7642
Awakened_3  20  0.5196  0.7611  0.7652
Awakened_2  21  0.5124  0.7575  0.7620
maven_3  22  0.5015  0.7521  0.7596
Awakened_1  23  0.4984  0.7505  0.7582
Victor-UNED_3  24  0.4934  0.7480  0.7602
Victor-UNED_1  25  0.4914  0.7470  0.7542
Victor-UNED_2  26  0.4863  0.7444  0.7535
RMIT-IR_3  27  0.4802  0.7414  0.7548
MMICI_2  28  0.4780  0.7402  0.7460
penta-nlp_1  29  0.4779  0.7402  0.7508
RMIT-IR_1  30  0.4739  0.7382  0.7526
MMICI_1  31  0.4705  0.7365  0.7455
I2C-UHU_1  32  0.4651  0.7338  0.7513
RMIT-IR_2  33  0.4590  0.7307  0.7448
ADITYA_1  34  0.4580  0.7302  0.7447
fmrs_2 [15]  35  0.4398  0.7211  0.7462
clac_1  36  0.4380  0.7201  0.7376
NICA_3  37  0.4358  0.7191  0.7429
fmrs_1  38  0.3961  0.6991  0.7194
TextMiner_2 [27]  39  0.3926  0.6973  0.7223
ABCD Team_2  40  0.3884  0.6952  0.7292
TextMiner_3  41  0.3876  0.6948  0.7180
maven_1  42  0.3860  0.6940  0.7121
NICA_2  43  0.3750  0.6885  0.7263
BAZI_2  44  0.3472  0.6745  0.7121
CAU&ITU_2 [35]  45  0.3460  0.6739  0.7024
DLRG_1  46  0.3446  0.6732  0.7085
TextMiner_1  47  0.3412  0.6715  0.7048
shm2024_3 [24]  48  0.3230  0.6623  0.7044
maven_2  49  0.3044  0.6530  0.6946
shm2024_1  50  0.2905  0.6460  0.6946
CAU&ITU_1  51  0.2832  0.6423  0.6922
Atresa-I2C-UHU_1  52  0.2782  0.6398  0.6899
FraunhoferSIT_1  53  0.2320  0.6166  0.6823
CNLP-NITS-PP_1  54  0.1977  0.5994  0.6762
CNLP-NITS-PP_2  55  0.1440  0.5724  0.6281
mc-mistral_2 [20]  56  0.0614  0.5309  0.5317
mc-mistral_1  57  −0.0094  0.4953  0.4779
UniLeon-UniBO_2  58  −0.1870  0.4060  0.4963
UniLeon-UniBO_3  59  −0.1959  0.4015  0.4906
NIT-Patna-NLP_1  60  −0.2975  0.3505  0.5272
UniLeon-UniBO_1  61  −0.2980  0.3502  0.5972
shm2024_2  62  −0.3410  0.3286  0.4922
DadJokers_1  63  −0.3611  0.3185  0.4365
VerbaNex_1 [44]  64  −0.4048  0.2965  0.4588
The 3 Musketeers_1 [26]  65  −0.4229  0.2875  0.3371
The 3 Musketeers_2  66  −0.4260  0.2859  0.3719
VerbaNex_2  67  −0.4392  0.2792  0.4560
EXIST2024 majority  68  −0.4413  0.2782  0.0000
The 3Musketeers_3  69  −0.4645  0.2665  0.2999
EXIST2024 minority  70  −0.5742  0.2114  0.5698
7.2. Task 2: Source Intention in Tweets
In this section, we report and analyze the results for Task 2, which focuses on determining the intention of the author when posting a sexist tweet. This task is a multi-class, mono-label classification. We report two sets of evaluation results (hard and soft).
7.2.1. Soft Evaluation
Table 4 presents the results for the soft-soft evaluation of Task 2. The table shows that 32 runs were submitted. Among them, 25 runs achieved better results than the majority class baseline (where all instances are labeled as “NO”). Furthermore, all of the submitted runs outperformed or equaled the minority class baseline (where all instances are labeled as “REPORTED”).
The ICM-Soft Norm scores range from 0.4795 for the best system (“NYCU-NLP_2”) to 0.0000 for “fmrs_2”, indicating significant variability in the effectiveness of the submitted models. It is worth mentioning that the best system outperforms the best run of the second-best team by more than 8 percentage points. Overall, performance is considerably lower compared to Task 1. This can be attributed to the hierarchical and multi-class nature of Task 2. It is also worth noting the correlation between the ICM-Soft and Cross-Entropy measures. The results indicate a strong correlation between the two metrics, but some differences can still be observed because cross entropy does not take into account the specificity of the different classes.
Table 4: Results of Task 2 in the soft-soft evaluation.
Run  Rank  ICM-Soft  ICM-Soft Norm  Cross Entropy
EXIST2024 gold  0  6.2057  1.0000  0.9128
NYCU-NLP_2  1  −0.2543  0.4795  1.8344
NYCU-NLP_1  2  −0.4059  0.4673  1.8549
NYCU-NLP_3  3  −0.5226  0.4579  1.9206
BAZI_1  4  −1.3468  0.3915  1.7812
Victor-UNED_2  5  −1.6440  0.3675  1.7971
Victor-UNED_1  6  −1.6549  0.3667  1.8132
ABCD Team_3  7  −1.8462  0.3513  2.4123
UMUTEAM_1  8  −1.9566  0.3424  1.4726
Awakened_2  9  −2.0091  0.3381  3.0835
ABCD Team_2  10  −2.0149  0.3377  2.3892
Awakened_1  11  −2.0365  0.3359  3.1429
UMUTEAM_2  12  −2.0533  0.3346  1.7890
Awakened_3  13  −2.1502  0.3268  3.0908
fmrs_1  14  −2.1737  0.3249  2.1210
CNLP-NITS-PP_1  15  −2.4732  0.3007  1.6696
Atresa-I2C-UHU_1  16  −2.6802  0.2841  2.1629
I2C-UHU_2  17  −2.6952  0.2828  2.1440
ABCD Team_1  18  −2.9080  0.2657  2.7595
UMUTEAM_3  19  −3.3189  0.2326  4.4490
MMICI_3  20  −3.6350  0.2071  1.7285
FraunhoferSIT_1  21  −4.0856  0.1708  1.7649
I2C-UHU_3  22  −4.2278  0.1594  2.5245
RMIT-IR_1  23  −4.5481  0.1336  3.5776
MMICI_1  24  −4.5753  0.1314  1.6866
MMICI_2  25  −4.6285  0.1271  1.6974
CUET-SSTM_1  26  −5.1320  0.0865  4.8736
EXIST2024 majority  27  −5.4460  0.0612  4.6233
NICA_2  28  −5.7592  0.0360  2.7026
RMIT-IR_3  29  −5.7632  0.0357  3.9903
UniLeon-UniBO_3  30  −5.7633  0.0356  2.1267
UniLeon-UniBO_1  31  −5.9587  0.0199  2.2261
UniLeon-UniBO_2  32  −5.9798  0.0182  2.1542
RMIT-IR_2  33  −6.1535  0.0042  4.0930
fmrs_2  34  −6.9170  0.0000  4.1975
EXIST2024 minority  35  −32.9552  0.0000  8.8517
7.2.2. Hard Evaluation
Table 5 presents the hard-hard evaluation results for Task 2, assessing 43 systems against the hard gold standard. Among these, 37 runs outperform the majority class baseline (where all instances are labeled “NO”), and all systems show equal or better performance compared to the minority class baseline (where all instances are labeled as “REPORTED”). Similar to the soft-soft evaluation, discrepancies between the best and the worst-performing systems are more pronounced in Task 2 than in Task 1. The top-ranking system, “ABCD Team_1”, achieved the highest ICM-Hard normalized score (0.6320). The top 5 systems range between 0.5937 and 0.6320. The lower end of the table includes five systems which score 0 in the ICM-Hard Norm metric. The correlation between ICM-Hard and F1 is generally strong, with slight variations among the top-ranked systems and greater variability towards the lower end of the table. This variability arises because F1 does not account for the hierarchical nature of the task as effectively as ICM-Hard, which more stringently penalizes misclassifications between different hierarchy levels.
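As a reminder of what this hierarchy means in practice, the sketch below shows the two-step prediction scheme adopted by some teams (e.g., ABCD): a Task 1 gate first decides whether the tweet is sexist, and only sexist tweets receive an intention label. The label strings and the dummy stand-in classifiers are assumptions for illustration.

```python
# Hedged sketch of the two-step, hierarchy-aware prediction used by some teams
# (e.g. ABCD): a Task 1 gate decides sexist / not sexist, and only sexist tweets
# receive an intention label. Label strings and dummy classifiers are assumptions.
def predict_intention(tweet, is_sexist_clf, intention_clf):
    """Return a Task 2 label that respects the Task 1 hierarchy."""
    if not is_sexist_clf(tweet):           # Task 1 gate
        return "NO"
    return intention_clf(tweet)            # e.g. "DIRECT", "REPORTED" or "JUDGEMENTAL"

# Dummy stand-ins for fine-tuned models, used only to exercise the control flow.
is_sexist_clf = lambda tweet: False        # replace with a fine-tuned Task 1 model
intention_clf = lambda tweet: "DIRECT"     # replace with a fine-tuned Task 2 model

print(predict_intention("Great match today!", is_sexist_clf, intention_clf))   # NO
```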
Table 5: Results of Task 2 in the hard-hard evaluation.
Run  Rank  ICM-Hard  ICM-Hard Norm  Macro F1
EXIST2024 gold  0  1.5378  1.0000  1.0000
ABCD Team_1  1  0.4059  0.6320  0.5677
NYCU-NLP_3  2  0.3522  0.6145  0.5410
NYCU-NLP_1  3  0.3383  0.6100  0.5353
NYCU-NLP_2  4  0.3073  0.5999  0.5273
CUET-SSTM_1  5  0.2883  0.5937  0.5383
ABCD Team_3  6  0.2847  0.5926  0.5289
CIMAT-CS-NLP_2  7  0.2643  0.5859  0.5171
CIMAT-CS-NLP_1  8  0.2346  0.5763  0.5195
penta-nlp_1  9  0.2089  0.5679  0.4856
BAZI_1  10  0.1883  0.5612  0.4843
I2C-UHU_2  11  0.1815  0.5590  0.4980
Awakened_2  12  0.1812  0.5589  0.4826
CIMAT-CS-NLP_3  13  0.1615  0.5525  0.4885
fmrs_1  14  0.1609  0.5523  0.4978
NICA_2  15  0.1506  0.5490  0.4738
Awakened_1  16  0.1487  0.5483  0.4753
Awakened_3  17  0.1306  0.5425  0.4686
RMIT-IR_1  18  0.0855  0.5278  0.4024
Victor-UNED_1  19  0.0851  0.5277  0.3257
Victor-UNED_2  20  0.0815  0.5265  0.3256
I2C-UHU_1  21  0.0418  0.5136  0.4708
BAZI_2  22  0.0396  0.5129  0.4278
RMIT-IR_3  23  0.0394  0.5128  0.3856
I2C-UHU_3  24  0.0210  0.5068  0.4663
RMIT-IR_2  25  0.0173  0.5056  0.3926
maven_1 [37]  26  −0.0510  0.4834  0.4563
MMICI_1  27  −0.0987  0.4679  0.4548
MMICI_3  28  −0.1076  0.4650  0.4525
DLRG_1  29  −0.1171  0.4619  0.3931
ABCD Team_2  30  −0.1368  0.4555  0.4182
Atresa-I2C-UHU_1  31  −0.1524  0.4504  0.4278
MMICI_2  32  −0.2406  0.4218  0.4383
CNLP-NITS-PP_1  33  −0.2694  0.4124  0.3743
FraunhoferSIT_1  34  −0.4106  0.3665  0.3823
CAU&ITU_1  35  −0.4711  0.3468  0.2998
CAU&ITU_2  36  −0.5024  0.3366  0.3029
shm2024_1  37  −0.8873  0.2115  0.3148
fmrs_2  38  −0.9078  0.2048  0.1899
EXIST2024 majority  39  −0.9504  0.1910  0.1603
NICA_1  40  −0.9504  0.1910  0.1603
UniLeon-UniBO_3  41  −1.2145  0.1051  0.2605
NIT-Patna-NLP_1  42  −1.9410  0.0000  0.1207
shm2024_2  43  −2.0626  0.0000  0.1200
UniLeon-UniBO_1  44  −2.0862  0.0000  0.1736
UniLeon-UniBO_2  45  −2.2986  0.0000  0.1628
EXIST2024 minority  46  −3.1545  0.0000  0.0280
7.3. Task 3: Sexism Categorization in Tweets
The third task is a hierarchical multi-class and multi-label classification problem, where systems must determine if a tweet is sexist or not, and categorize the sexist tweets according to the five categories of sexism defined in Section 2.
7.3.1. Soft Evaluation
Table 6 displays the results of the soft-soft evaluation for Task 3. A total of 30 runs were submitted, with 26 runs surpassing the majority class baseline (all instances labeled as “NO”), and all systems outperforming the minority class baseline (all instances labeled as “SEXUAL-VIOLENCE”). The “NYCU-NLP” team has the top three runs, with “NYCU-NLP_1” ranked first (ICM-Soft: −1.1762, ICM-Soft Norm: 0.4379). The next two runs from the same team, “NYCU-NLP_2” and “NYCU-NLP_3”, follow closely, indicating the consistency and robustness of their approach. The fourth and fifth systems, however, show significantly poorer performance (0.3835 and 0.3732, respectively). The range of ICM-Soft Norm scores (from 0.4379 to 0.0000) underscores a significant variability in system performance. However, despite the complexity of the task, it seems that systems are still able to correctly capture relevant information concerning the different types of sexism.
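Since Task 3 is multi-label, a common design is to threshold per-category probabilities independently for tweets predicted as sexist. The sketch below illustrates only that step; the category names follow the EXIST guidelines summarized in Section 2, and the probabilities, threshold and fallback rule are assumptions.

```python
# Hedged sketch of the multi-label step of Task 3: per-category probabilities are
# thresholded independently for tweets predicted as sexist. Category names follow
# the EXIST guidelines (Section 2); probabilities, threshold and fallback are assumptions.
CATEGORIES = ["IDEOLOGICAL-INEQUALITY", "STEREOTYPING-DOMINANCE", "OBJECTIFICATION",
              "SEXUAL-VIOLENCE", "MISOGYNY-NON-SEXUAL-VIOLENCE"]

def categorize(is_sexist, category_probs, threshold=0.5):
    """Return the (possibly multiple) Task 3 labels for one tweet."""
    if not is_sexist:
        return ["NO"]
    labels = [c for c in CATEGORIES if category_probs.get(c, 0.0) >= threshold]
    # Fall back to the most probable category if nothing clears the threshold (assumption).
    return labels or [max(category_probs, key=category_probs.get)]

probs = {"STEREOTYPING-DOMINANCE": 0.55, "OBJECTIFICATION": 0.72, "SEXUAL-VIOLENCE": 0.10}
print(categorize(True, probs))    # ['STEREOTYPING-DOMINANCE', 'OBJECTIFICATION']
print(categorize(False, probs))   # ['NO']
```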
Table 6: Results of Task 3 in the soft-soft evaluation.
Run  Rank  ICM-Soft  ICM-Soft Norm
EXIST2024 gold  0  9.4686  1.0000
NYCU-NLP_1  1  −1.1762  0.4379
NYCU-NLP_2  2  −1.2169  0.4357
NYCU-NLP_3  3  −1.4555  0.4231
Medusa_1 [39]  4  −2.2055  0.3835
Medusa_2  5  −2.4010  0.3732
Medusa_3  6  −2.4142  0.3725
ABCD Team_3  7  −3.5160  0.3143
ABCD Team_2  8  −3.5438  0.3129
Awakened_2  9  −4.0748  0.2848
Awakened_3  10  −4.0786  0.2846
Awakened_1  11  −4.1845  0.2790
NICA_2  12  −4.4324  0.2659
ABCD Team_1  13  −4.5913  0.2576
FraunhoferSIT_1  14  −5.1905  0.2259
Victor-UNED_1  15  −5.5936  0.2046
Victor-UNED_2  16  −5.6190  0.2033
CNLP-NITS-PP_1  17  −5.7385  0.1970
CNLP-NITS-PP_2  18  −6.0810  0.1789
RMIT-IR_1  19  −7.2098  0.1193
MMICI_3  20  −7.6413  0.0965
RMIT-IR_2  21  −7.8944  0.0831
MMICI_1  22  −7.9356  0.0809
MMICI_2  23  −7.9380  0.0808
fmrs_1  24  −8.2508  0.0643
fmrs_2  25  −8.4277  0.0550
fmrs_3  26  −8.4277  0.0550
RMIT-IR_3  27  −8.5680  0.0476
EXIST2024 majority  28  −8.7089  0.0401
UniLeon-UniBO_1  29  −10.3622  0.0000
UniLeon-UniBO_2  30  −10.3622  0.0000
UniLeon-UniBO_3  31  −10.3622  0.0000
Atresa-I2C-UHU_1  32  −10.4052  0.0000
EXIST2024 minority  33  −46.1080  0.0000
7.3.2. Hard Evaluation
In the hard-hard evaluation context for the third task, 31 systems were submitted. As shown in Table 7, 28 systems outperformed the majority class baseline (all instances labeled as “NO”), while all systems achieved better results than the minority class baseline (all instances labeled as “SEXUAL-VIOLENCE”). The discrepancy between the best system (“ABCD Team_1”, with an ICM-Hard Norm score of 0.5862) and the worst-performing one (“CAU&ITU_1”, with a score of 0.0000) is over 0.5 in ICM-Hard Norm, which is less than in Task 2. Finally, comparing the performance of the three textual tasks in the hard-hard evaluation, the efficiency of the systems in this task, in terms of ICM-Hard Norm, is lower than in the previous tasks. This further highlights the complexity of categorizing sexism.
Table 7: Results of Task 3 in the hard-hard evaluation.
Run  Rank  ICM-Hard  ICM-Hard Norm  Macro F1
EXIST2024 gold  0  2.1533  1.0000  1.0000
ABCD Team_1  1  0.3713  0.5862  0.6004
ABCD Team_3  2  0.3540  0.5822  0.6042
NYCU-NLP_3  3  0.3069  0.5713  0.6130
NYCU-NLP_1  4  0.2364  0.5549  0.6066
NYCU-NLP_2  5  0.1725  0.5401  0.5933
Awakened_2  6  −0.0042  0.4990  0.4833
Awakened_3  7  −0.0115  0.4973  0.4803
RMIT-IR_3  8  −0.0344  0.4920  0.5049
RMIT-IR_2  9  −0.0394  0.4909  0.4986
RMIT-IR_1  10  −0.0396  0.4908  0.5024
Awakened_1  11  −0.0427  0.4901  0.4743
ABCD Team_2  12  −0.1090  0.4747  0.5286
NICA_2  13  −0.2383  0.4447  0.4564
penta-nlp_1 [46]  14  −0.2597  0.4397  0.4379
maven_1  15  −0.2654  0.4384  0.4491
UniLeon-UniBO_1  16  −0.3188  0.4260  0.5032
UniLeon-UniBO_2  17  −0.3188  0.4260  0.5032
UniLeon-UniBO_3  18  −0.3188  0.4260  0.5032
NICA_1  19  −0.3258  0.4243  0.3867
UMUTEAM_1  20  −0.7339  0.3296  0.4942
FraunhoferSIT_1  21  −0.7437  0.3273  0.3724
UMUTEAM_3  22  −0.7901  0.3165  0.4821
MMICI_3  23  −0.8105  0.3118  0.4805
UMUTEAM_2  24  −0.8719  0.2975  0.4738
CNLP-NITS-PP_1  25  −0.9571  0.2778  0.2684
CNLP-NITS-PP_2  26  −0.9684  0.2751  0.2318
MMICI_1  27  −1.4509  0.1631  0.4026
MMICI_2  28  −1.5003  0.1516  0.4017
fmrs_3  29  −1.5952  0.1296  0.1087
EXIST2024 majority  30  −1.5984  0.1289  0.1069
fmrs_2  31  −1.6017  0.1281  0.1069
fmrs_1  32  −1.7482  0.0941  0.1700
CAU&ITU_1  33  −2.3423  0.0000  0.1705
EXIST2024 minority  34  −3.1295  0.0000  0.0288
7.4. Task 4: Sexism Identification in Memes
We next report and analyze the results for Task 4, which focuses on sexism identification in memes.
This task involves a binary classification. Again, we report two sets of evaluation results (hard and soft).
7.4.1. Soft Evaluation
Table 8 presents the results for the classification of memes as sexist or not sexist. The performance results are notably low for a binary classification task: “Victor-UNED_1”, the top-ranked participant, achieved an ICM-Soft Norm score of 0.4530 and a relatively low Cross Entropy of 1.1028. However, the variability between the best and worst-performing systems is reduced compared to that of the tasks described above. When comparing these results to those of Task 1 (classifying tweets as sexist or not), we observe a significant drop in performance for meme classification (0.4530 versus 0.6755 ICM-Soft Norm). It is important to highlight that most approaches relied solely on the text within the meme for classification, without incorporating image processing. This suggests that sexism in memes might often be conveyed through the imagery, even when the accompanying text seems to be neutral.
Table 8: Results of Task 4 in the soft-soft evaluation.
Run  Rank  ICM-Soft  ICM-Soft Norm  Cross Entropy
EXIST2024 gold  0  3.1107  1.0000  0.5852
Victor-UNED_1  1  −0.2925  0.4530  1.1028
Victor-UNED_2  2  −0.3135  0.4496  1.2834
Elias&Sergio_1  3  −0.3225  0.4482  0.9903
I2C-Huelva_3  4  −0.3263  0.4476  1.5189
I2C-Huelva_1  5  −0.3390  0.4455  1.4096
I2C-Huelva_2  6  −0.3446  0.4446  1.4112
Victor-UNED_3  7  −0.3761  0.4395  1.1562
RMIT-IR_2  8  −0.3780  0.4392  0.9852
NICA_1  9  −0.4360  0.4299  0.9278
PINK_2 [29]  10  −0.4396  0.4293  0.9375
PINK_1  11  −0.4537  0.4271  0.9282
ROCurve_3  12  −0.4646  0.4253  0.9609
the gym nerds_2  13  −0.5015  0.4194  0.9201
Elias&Sergio_2  14  −0.5617  0.4097  0.9228
ROCurve_2  15  −0.6097  0.4020  0.9537
MMICI_2  16  −0.6183  0.4006  0.9143
MMICI_1  17  −0.6189  0.4005  0.9151
PINK_3  18  −0.6378  0.3975  0.9318
MMICI_3  19  −0.6410  0.3970  0.9534
ROCurve_1  20  −0.6420  0.3968  0.9431
OppositionalOppotision_1  21  −0.9556  0.3464  3.2025
melialo-vcassan_1  22  −1.0022  0.3389  0.9931
melialo-vcassan_2  23  −1.0239  0.3354  0.9904
RMIT-IR_3  24  −1.0894  0.3249  1.1206
melialo-vcassan_3  25  −1.0957  0.3239  1.0090
the gym nerds_1  26  −1.1035  0.3226  0.9733
CNLP-NITS-PP_2  27  −1.2354  0.3014  1.0918
CHEEXIST_2  28  −1.2710  0.2957  1.1993
RMIT-IR_1  29  −1.2819  0.2940  1.0128
Penta-ML_2 [40]  30  −1.2910  0.2925  2.2277
epistemologos_1  31  −1.3486  0.2832  2.9425
Penta-ML_1  32  −1.5664  0.2482  2.4735
Penta-ML_3  33  −1.7425  0.2199  4.0007
CHEEXIST_3  34  −2.0119  0.1766  0.5017
CHEEXIST_1  35  −2.0388  0.1723  0.5030
EXIST2024 majority  36  −2.3568  0.1212  4.4015
CNLP-NITS-PP_1  37  −2.6987  0.0662  1.3445
EXIST2024 minority  38  −3.5089  0.0000  5.5672
7.4.2. Hard Evaluation
Table 9 presents the results for the hard-hard evaluation of Task 4. Out of the 50 systems submitted for this task, only 37 ranked above the majority class baseline (all instances labeled as “NO”), while 47 systems surpassed the minority class baseline (all instances labeled as “YES”). Similar to the soft-soft evaluation, the results vary considerably, from 0.6618 ICM-Hard Norm for the best-performing system (“RoJiNG-CL_3”) to 0.0876 (“melialo-vcassan_1”). When comparing ICM-Hard Norm results with F1 scores, we observe little correlation between the two metrics, especially in the lower ranks of the table.
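One way of bringing the image back into an otherwise text-only pipeline, in the spirit of RoJiNG-CL's use of GPT-4 image descriptions discussed in Section 8, is to caption the meme automatically and classify the overlaid text together with the caption. The sketch below uses a public BLIP captioning checkpoint as a stand-in for that proprietary description step; the file path and example text are placeholders, and the combined string would then be fed to a text classifier fine-tuned on the EXIST training data.

```python
# Hedged sketch of the caption-then-classify pattern: a public BLIP captioning
# checkpoint stands in for the GPT-4 description step reported for RoJiNG-CL
# (Section 8). The file path and example text are placeholders.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def meme_to_text(image_path, overlaid_text):
    """Concatenate the meme's overlaid text with an automatic description of the image."""
    caption = captioner(image_path)[0]["generated_text"]
    return f"{overlaid_text} [IMAGE] {caption}"

print(meme_to_text("meme_001.jpg", "When she asks to drive"))   # placeholder inputs
```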
Table 9: Results of Task 4 in the hard-hard evaluation.
Run  Rank  ICM-Hard  ICM-Hard Norm  F1YES
EXIST2024 gold  0  0.9832  1.0000  1.0000
RoJiNG-CL_3 [30]  1  0.3182  0.6618  0.7642
RoJiNG-CL_2  2  0.2272  0.6155  0.7437
RoJiNG-CL_1  3  0.1863  0.5947  0.7274
I2C-Huelva_2  4  0.1313  0.5668  0.7241
I2C-Huelva_1  5  0.1166  0.5593  0.7154
DiTana-PV_2 [33]  6  0.1150  0.5585  0.7122
Victor-UNED_2  7  0.1028  0.5523  0.7154
MMICI_2  8  0.1014  0.5515  0.7261
I2C-Huelva_3  9  0.0987  0.5502  0.6933
DiTana-PV_3  10  0.0888  0.5451  0.7082
NICA_1  11  0.0767  0.5390  0.7248
MMICI_1  12  0.0751  0.5382  0.7202
Victor-UNED_1  13  0.0641  0.5326  0.7051
OppositionalOppotision_1  14  0.0494  0.5251  0.7168
Elias&Sergio_1  15  0.0433  0.5220  0.6979
Elias&Sergio_2  16  0.0408  0.5208  0.6962
Victor-UNED_3  17  0.0364  0.5185  0.6991
DiTana-PV_1  18  0.0337  0.5171  0.6908
ROCurve_3  19  0.0088  0.5045  0.6834
PINK_1  20  0.0076  0.5039  0.7044
PINK_3  21  −0.0053  0.4973  0.7006
RMIT-IR_2  22  −0.0123  0.4938  0.6726
PINK_2  23  −0.0346  0.4824  0.7102
MMICI_3  24  −0.0361  0.4816  0.6781
ROCurve_2  25  −0.0956  0.4514  0.6654
Miqarn_1  26  −0.1159  0.4411  0.6632
CNLP-NITS-PP_1  27  −0.1234  0.4372  0.6699
Penta-ML_2  28  −0.1308  0.4335  0.6742
Penta-ML_1  29  −0.1745  0.4113  0.6524
epistemologos_1  30  −0.1823  0.4073  0.5503
TokoAI_1  31  −0.1872  0.4048  0.5639
Penta-ML_3  32  −0.2049  0.3958  0.6101
UMUTEAM_1  33  −0.2422  0.3768  0.6963
RMIT-IR_3  34  −0.2601  0.3677  0.6040
ROCurve_1  35  −0.2640  0.3657  0.6318
Umera Wajeed Pasha_1 [43]  36  −0.3083  0.3432  0.5956
TargaMarhuenda_1  37  −0.3535  0.3202  0.6487
TargaMarhuenda_2  38  −0.3844  0.3045  0.5568
EXIST2024 majority  39  −0.4038  0.2947  0.6821
CNLP-NITS-PP_2  40  −0.4045  0.2943  0.4937
DLRG_1  41  −0.4206  0.2861  0.6469
MIND_1 [32]  42  −0.4986  0.2465  0.5674
ALC-UPV-JD-2_1  43  −0.5446  0.2231  0.4878
dap-upv_1 [17]  44  −0.5737  0.2082  0.4188
AI Fusion_1  45  −0.6416  0.1737  0.4651
EXIST2024 minority  46  −0.6468  0.1711  0.0000
RMIT-IR_1  47  −0.6468  0.1711  0.0000
AI Fusion_2  48  −0.6486  0.1702  0.4656
AI Fusion_3  49  −0.6508  0.1691  0.4079
TheATeam_1  50  −0.6644  0.1621  0.4821
melialo-vcassan_2  51  −0.6644  0.1621  0.0281
melialo-vcassan_3  52  −0.6723  0.1581  0.0347
melialo-vcassan_1  53  −0.8109  0.0876  0.5316
7.5. Task 5: Source Intention in Memes
In this section, we report and analyze the results for Task 5, which focuses on determining the intention of the author when posting a sexist meme. This task is a multi-class, mono-label classification. We report two sets of evaluation results (hard and soft).
7.5.1. Soft Evaluation
Table 10 presents the results for the classification of memes according to the intention of the author, with the outputs provided as the probabilities of the different classes. Only 15 runs were submitted for this task. While all the runs ranked above the minority class baseline (all instances labeled as “JUDGEMENTAL”), only 15 runs surpassed the majority class baseline (all instances labeled as “NO”). The results for this task are notably low, with the best team (“Victor-UNED_2”) achieving only 0.3676 ICM-Soft Norm. This suggests that identifying whether a meme contains direct sexism or is judgmental is more difficult than identifying the intention behind a sexist tweet.
Table 10: Results of Task 5 in the soft-soft evaluation.
Run  Rank  ICM-Soft  ICM-Soft Norm  Cross Entropy
EXIST2024 gold  0  4.7018  1.0000  0.9325
Victor-UNED_2  1  −1.2453  0.3676  1.6235
MMICI_1  2  −1.2660  0.3654  1.4645
MMICI_2  3  −1.3738  0.3539  1.4405
NICA_1  4  −1.5329  0.3370  1.4664
CNLP-NITS-PP_1  5  −1.5907  0.3308  1.5273
melialo-vcassan_2  6  −1.9847  0.2889  1.5211
Victor-UNED_1  7  −2.0053  0.2867  2.0028
melialo-vcassan_3  8  −2.0653  0.2804  1.5295
melialo-vcassan_1  9  −2.6821  0.2148  1.6291
I2C-Huelva_3  10  −2.7996  0.2023  3.9604
I2C-Huelva_2  11  −2.7997  0.2023  3.9857
I2C-Huelva_1  12  −2.8007  0.2022  3.9735
MMICI_3  13  −3.4751  0.1304  3.4504
EXIST2024 majority  14  −5.0745  0.0000  5.5565
Penta-ML_3  15  −5.2668  0.0000  5.1547
Penta-ML_1  16  −5.3096  0.0000  3.2977
Penta-ML_2  17  −5.9832  0.0000  5.4845
EXIST2024 minority  18  −18.9382  0.0000  8.0245
7.5.2. Hard Evaluation
Table 11 presents the results for the hard-hard evaluation of Task 5. Out of the 19 systems submitted for this task, only 15 ranked above the majority class baseline (all instances labeled as “NO”), while 18 systems surpassed the minority class baseline (all instances labeled as “JUDGEMENTAL”). The results range from 0.4167 ICM-Hard Norm for the best-performing system (“Victor-UNED_1”) to 0.0000 for the worst-performing systems, but are quite homogeneous among the top 5 systems. When comparing ICM-Hard Norm results with F1 scores, we again observe little correlation between the two metrics, especially in the lower ranks of the table.
Table 11: Results of Task 5 in the hard-hard evaluation.
Run  Rank  ICM-Hard  ICM-Hard Norm  Macro F1
EXIST2024 gold  0  1.4383  1.0000  1.0000
Victor-UNED_1  1  −0.2397  0.4167  0.3873
I2C-Huelva_2  2  −0.2535  0.4119  0.4761
Victor-UNED_2  3  −0.2668  0.4073  0.3850
I2C-Huelva_3  4  −0.2772  0.4036  0.4714
I2C-Huelva_1  5  −0.2880  0.3999  0.4714
NICA_1  6  −0.2881  0.3999  0.3837
MMICI_1  7  −0.3066  0.3934  0.4179
MMICI_3  8  −0.3297  0.3854  0.3814
CNLP-NITS-PP_1  9  −0.3370  0.3829  0.4101
MMICI_2  10  −0.3868  0.3655  0.3770
Penta-ML_3  11  −0.6123  0.2872  0.3841
Penta-ML_1  12  −0.6546  0.2725  0.3856
Penta-ML_2  13  −0.7089  0.2536  0.3841
TokoAI_1  14  −0.7263  0.2475  0.3716
melialo-vcassan_3  15  −0.7758  0.2303  0.3709
melialo-vcassan_2  16  −0.8585  0.2016  0.3500
EXIST2024 majority  17  −1.0445  0.1369  0.1839
UMUTEAM_1  18  −1.1486  0.1007  0.2098
melialo-vcassan_1  19  −1.1971  0.0838  0.2970
DLRG_1  20  −1.4891  0.0000  0.2530
EXIST2024 minority  21  −2.0637  0.0000  0.0697
epistemologos_1  22  −8.7012  0.0000  0.0557
7.6. Task 6: Sexism Categorization in Memes
The sixth task is a hierarchical multi-class and multi-label classification problem, where systems must determine if a meme is sexist or not, and if so, categorize it according to the five categories of sexism defined in Section 2.
7.6.1. Soft Evaluation
Table 12 presents the results for classifying memes based on the aspects of women being attacked, with outputs provided as class probabilities. Only 19 runs were submitted for this task. While all runs performed better than the minority class baseline (labeling all instances as “MISOGYNY-NON-SEXUAL-VIOLENCE”), only 11 runs exceeded the majority class baseline (labeling all instances as “NO”). The performance for this task was generally low, with the top team (“ROCurve_1”) achieving an ICM-Soft Norm score of only 0.2462, which is significantly lower compared to the results for the same task when applied to tweets (Task 3).
Table 12: Results of Task 6 in the soft-soft evaluation.
Run  Rank  ICM-Soft  ICM-Soft Norm
EXIST2024 gold  0  9.4343  1.0000
ROCurve_1  1  −4.7893  0.2462
the gym nerds_2  2  −4.7942  0.2459
ROCurve_2  3  −5.0030  0.2348
ROCurve_3  4  −5.0675  0.2314
Elias&Sergio_1  5  −5.9160  0.1865
Victor-UNED_1  6  −6.4124  0.1602
Victor-UNED_2  7  −6.4777  0.1567
CNLP-NITS-PP_1  8  −6.6782  0.1461
CNLP-NITS-PP_2  9  −7.2381  0.1164
AI Fusion_1  10  −7.6282  0.0957
AI Fusion_2  11  −7.6363  0.0953
AI Fusion_3  12  −7.7043  0.0917
EXIST2024 majority  13  −9.8173  0.0000
dap-upv_1  14  −10.4213  0.0000
Penta-ML_2  15  −11.2593  0.0000
the gym nerds_1  16  −11.2648  0.0000
Penta-ML_1  17  −11.8047  0.0000
Penta-ML_3  18  −13.2556  0.0000
MMICI_1  19  −16.1248  0.0000
MMICI_2  20  −19.3246  0.0000
MMICI_3  21  −45.0237  0.0000
EXIST2024 minority  22  −50.0353  0.0000
7.6.2. Hard Evaluation
Finally, Table 13 presents the results for classifying memes based on the aspects of women being attacked, with outputs provided as a single class prediction. 22 runs were submitted for this task. Only 17 runs exceeded the majority class baseline (labeling all instances as “NO”), while 21 runs ranked above the minority class baseline (all instances labeled as “MISOGYNY-NON-SEXUAL-VIOLENCE”). The performance for this task was low, with the top team (“DiTana-PV_1”) achieving an ICM-Hard Norm score of 0.3549.
Table 13: Results of Task 6 in the hard-hard evaluation.
Run  Rank  ICM-Hard  ICM-Hard Norm  Macro F1
EXIST2024 gold  0  2.4100  1.0000  1.0000
DiTana-PV_1  1  −0.6996  0.3549  0.4319
DiTana-PV_2  2  −0.8450  0.3247  0.4430
MMICI_1  3  −0.9863  0.2954  0.4342
ROCurve_1  4  −1.0089  0.2907  0.3639
ROCurve_2  5  −1.1075  0.2702  0.3275
ROCurve_3  6  −1.1440  0.2627  0.3085
MMICI_2  7  −1.3446  0.2210  0.4453
Penta-ML_3  8  −1.3631  0.2172  0.3356
DiTana-PV_3  9  −1.3691  0.2160  0.3255
Penta-ML_2  10  −1.4684  0.1954  0.3093
Elias&Sergio_1  11  −1.5276  0.1831  0.4321
Penta-ML_1  12  −1.5499  0.1784  0.3053
Miqarn_1  13  −1.6216  0.1636  0.3211
CNLP-NITS-PP_1  14  −1.7920  0.1282  0.1587
ALC-UPV-JD-2_1  15  −1.8573  0.1147  0.2103
CNLP-NITS-PP_2  16  −1.8813  0.1097  0.1511
dap-upv_1  17  −1.9497  0.0955  0.2227
UMUTEAM_1  18  −1.9511  0.0952  0.3786
EXIST2024 majority  19  −2.0711  0.0703  0.0919
TargaMarhuenda_1  20  −2.0725  0.0700  0.1440
TargaMarhuenda_2  21  −2.2075  0.0420  0.1140
TheATeam_1  22  −2.3159  0.0195  0.1490
EXIST2024 minority  23  −3.3135  0.0000  0.0318
MMICI_3  24  −3.8341  0.0000  0.2347
One-by-zero_1  25  −4.5910  0.0000  0.2304
8. Discussion
After studying the 34 submitted working notes, we have identified patterns relating the systems to their performance that are of great interest for the task. Even though this analysis only considers systems whose working notes were available, most of the best-performing models are covered. We focus on the conclusions drawn from the study of the top ten systems for each of the six tasks, in both the hard and soft evaluations, and for the English and Spanish classification of each one. The submitted approaches can be divided into two types: textual systems for Tasks 1, 2 and 3, and systems for Tasks 4, 5 and 6.
8.1. Tasks 1, 2 and 3: Sexism Detection and Classification in Tweets
In the textual tasks, most of the top-ranking models were encoding-based transformers fine-tuned on the EXIST dataset with an additional component. This additional component could be:
• A meticulous data preprocessing step.
For example, this is the main contribution of the NYCU-NLP team: they removed irrelevant elements of the text, and they increased the size of the dataset by applying data augmentation techniques such as AEDA and by translating from English to Spanish and vice versa.
• Ensembles of encoding-based transformer models. A huge variety of models were employed: most of them include BERT-like systems such as BERT, RoBERTa and DeBERTa, multilingual versions of them trained on Spanish datasets, and some models fine-tuned on tweet, hate-speech or sentiment analysis datasets. Systems following this architecture excel particularly in the soft evaluation. This success appears to be linked to how teams trained their models, as encoding-based models are easily fine-tuned with soft labels. Depending on the models used to obtain results for the ensemble, different types of voting methods are considered:
– If the results differ greatly between models, systems seem to perform better when a higher weight is given to the best model and the influence of the rest of the models is reduced. The Awakened team considered different models in their ensembles, but their best-performing runs gave a higher weight to their best model in the ensemble.
– If the differences in performance between models are small, a proportional vote can take into account aspects detected by every model. This method obtained good results for CIMAT-CS-NLP in the first task and for Medusa in the third task.
• Combining results with LLMs. This method proved to be successful in the hard evaluation, where teams that submitted labels obtained by both LLMs and encoding-based transformer models reached the best results. Due to the increasing number of LLMs published and their constant improvement, there is a whole range of models that the community can try. In the EXIST tasks, participants experimented with Llama-2, Llama-3, Gemini, Mistral and GPT-4. Most of them were used to obtain labels in zero-shot or few-shot settings because of the computational cost of fine-tuning them. Thus, these approaches relied on prompt engineering to make the model understand the task. Because of the differences in models and in ways of formulating prompts, these systems cannot be directly compared; however, we can mention some of the best-performing ones: the CIMAT-CS-NLP system, which uses an ensemble of four zero-shot answers from Gemini; the ABCD team, which used prompt engineering to create one Llama-2 answer per annotator; the EquityExplorers team, which included Mistral outputs in their ensemble; and the CIMAT-GTO team, which studied types of reasoning with Llama-3. A sketch of this zero-shot prompting pattern is given at the end of this subsection.
In general, we can draw some conclusions about the systems for the textual tasks. Models that use encoding-based transformers performed better in the soft evaluation. Even systems that use a single model, or one model trained in different ways, such as BAZI and Victor-UNED, achieved top-10 results in the soft evaluation because they were trained with soft labels. On the other hand, the performance of LLMs stands out in the hard evaluation but drops in the soft evaluation. This can be clearly seen with ABCD, where the runs obtained with Llama-2 got better results than the encoding-based models in the hard evaluation for the three tasks, whereas the encoding-based models outperformed them in the soft evaluation. In general, demographic information was not included in the models, but the teams that did include it obtained improvements in their systems in Task 3.
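The zero-shot prompting pattern referred to above can be as simple as a single instruction asking the model for a YES/NO decision. The sketch below illustrates it with an open-weights instruction-tuned checkpoint; the checkpoint name, the prompt wording and the answer parsing are assumptions, not any team's exact setup.

```python
# Hedged sketch of zero-shot prompting for sexism identification. The checkpoint,
# prompt wording and answer parsing are assumptions, not any team's exact setup.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def zero_shot_sexism(tweet):
    prompt = (
        "You are annotating tweets for sexism, including subtle or benevolent forms.\n"
        f'Tweet: "{tweet}"\n'
        "Answer with exactly one word, YES or NO. Answer:"
    )
    out = generator(prompt, max_new_tokens=3, do_sample=False)[0]["generated_text"]
    answer = out[len(prompt):].strip().upper()
    return "YES" if answer.startswith("YES") else "NO"

print(zero_shot_sexism("Women should stay out of politics."))
```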
8.2. Tasks 4, 5 and 6: Sexism Detection and Classification in Memes
Tasks 4, 5 and 6 are multimodal, that is, teams could use both images and text to detect and categorize sexism in the instances. Although most of the teams approached these tasks as multimodal, the best performances correspond to models which only use text to analyze the memes. Top-ranking positions for most of the tasks, especially Tasks 4 and 5, were obtained by textual models. However, images remain important: RoJiNG-CL used GPT-4 to create a textual description of the image and analyzed it, outperforming the other approaches. The models used to analyze the text were mostly encoding-based transformers: combinations of BERT, DeBERTa and RoBERTa with different weights or different fine-tunings. Systems that only considered text were similar to those presented in the previous section.
Next in the ranking are the multimodal approaches. The influence of textual analysis means that, even among multimodal approaches, those that combine a ViT module with a BERT-like textual component outperform models that do not include one. This can be seen in the RMIT-IR runs, whose best run, which used mBERT, ranked several positions ahead of their next run, which did not include it. Most of the systems utilized CLIP to analyze images. However, the use of CLIP alone led to poor results. The concatenation of text and image representations currently stands as the necessary way of dealing with these tasks, but new ways of obtaining image representations are needed to outperform text-only systems.
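As a concrete illustration of the text+image concatenation strategy just discussed, the sketch below feeds concatenated CLIP text and image embeddings to a small feed-forward head. The checkpoint, embedding dimensions, head architecture and example inputs are assumptions rather than a description of any participant system.

```python
# Hedged sketch of text+image concatenation: CLIP text and image embeddings are
# concatenated and fed to a small feed-forward head. Checkpoint, dimensions, head
# architecture and example inputs are assumptions.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

CKPT = "openai/clip-vit-base-patch32"   # 512-dimensional text and image projections
clip = CLIPModel.from_pretrained(CKPT)
processor = CLIPProcessor.from_pretrained(CKPT)

class MemeClassifier(nn.Module):
    """Feed-forward head over concatenated CLIP text and image embeddings."""
    def __init__(self, dim=512, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(),
                                  nn.Linear(256, n_classes))

    def forward(self, text_emb, image_emb):
        return self.head(torch.cat([text_emb, image_emb], dim=-1))

def encode(image_path, text):
    """Return CLIP text and image embeddings for one meme."""
    inputs = processor(text=[text], images=Image.open(image_path).convert("RGB"),
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        t = clip.get_text_features(input_ids=inputs["input_ids"],
                                   attention_mask=inputs["attention_mask"])
        i = clip.get_image_features(pixel_values=inputs["pixel_values"])
    return t, i

model = MemeClassifier()
t, i = encode("meme_001.jpg", "When she asks to drive")   # placeholder inputs
print(model(t, i).softmax(dim=-1))                        # class probabilities (untrained head)
```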
9. Conclusions
The objective of the EXIST challenge is to encourage research on the automated detection and modeling of sexism in online environments, with a specific focus on social networks. The EXIST 2024 Lab, held as part of CLEF, attracted nearly 60 participating teams and received more than 400 runs. Participants adopted a wide range of approaches, including vision transformer models, data augmentation through automatic translation, data duplication, utilization of data from past EXIST editions, multilingual language models, Twitter-specific language models, and transfer learning techniques from domains like hate speech, toxicity, and sentiment analysis. While many systems opted for the traditional approach of providing only hard labels as outputs, a significant number of systems leveraged the multiple annotations available in the dataset and provided soft outputs, showing that there is an increasing interest in the research community in developing systems able to deal with disagreement and with different perspectives.
Concerning the results, in the textual tasks (Tasks 1, 2 and 3), top-performing models were typically encoding-based Transformers fine-tuned on the EXIST dataset with an additional component, such as meticulous data preprocessing, data augmentation or the use of model ensembles. In the multimodal tasks (Tasks 4, 5 and 6), where both images and text could be used to detect and categorize sexism, top performances were achieved by models focusing solely on text.
For future editions of EXIST, we plan to expand our study in order to include additional communication channels and media formats, such as TikTok videos. By doing so, we aim to address the nuances and unique challenges presented by different formats, enhancing the robustness and applicability of research on automated sexism detection. Additionally, this expansion will allow us to capture a broader spectrum of online interactions and cultural contexts.
Acknowledgments
This work has been financed by the European Union (NextGenerationEU funds) through the “Plan de Recuperación, Transformación y Resiliencia”, by the Ministry of Economic Affairs and Digital Transformation and by the UNED University. It has also been financed by the Spanish Ministry of Science and Innovation (project FairTransNLP, PID2021-124361OB-C31 and PID2021-124361OB-C32) funded by MCIN/AEI/10.13039/501100011033 and by ERDF, EU A way of making Europe, and by the Australian Research Council (DE200100064 and CE200100005).
References
[1] F. Rodríguez-Sánchez, J. Carrillo-de Albornoz, L. Plaza, J. Gonzalo, P. Rosso, M. Comet, T. Donoso, Overview of EXIST 2021: Sexism identification in social networks, Procesamiento del Lenguaje Natural 67 (2021) 195–207.
[2] F. Rodríguez-Sánchez, J. Carrillo-de Albornoz, L. Plaza, A. Mendieta-Aragón, G. Marco-Remón, M. Makeienko, M. Plaza, D. Spina, J. Gonzalo, P. Rosso, Overview of EXIST 2022: Sexism identification in social networks, Procesamiento del Lenguaje Natural 69 (2022) 229–240.
[3] L. Plaza, J. C. de Albornoz, R. Morante, E. Amigó, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2023 – Learning with Disagreement for Sexism Identification and Characterization (Extended Overview), in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), volume 497, CEUR Working Notes, 2023, pp. 813–854.
[4] A. Uma, T. Fornaciari, A. Dumitrache, T. Miller, J. Chamberlain, B. Plank, E. Simpson, M. Poesio, SemEval-2021 task 12: Learning with disagreements, in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Association for Computational Linguistics, Online, 2021, pp. 338–347.
[5] H. R. Kirk, W. Yin, B. Vidgen, P. Röttger, SemEval-2023 Task 10: Explainable Detection of Online Sexism, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval), 2023.
[6] M. Billig, Humour and hatred: the racist jokes of the Ku Klux Klan, Discourse & Society 12 (2014) 267–289.
[7] A. Mendiburo-Seguel, T. E. Ford, The Effect of Disparagement Humor on the Acceptability of Prejudice, Current Psychology: A Journal for Diverse Perspectives on Diverse Psychological Issues (2019). doi:10.1007/s12144-019-00354-2.
[8] G. Hodson, J. Rush, C. C. MacInnis, A Joke Is Just a Joke (except When It Isn’t): Cavalier Humor Beliefs Facilitate the Expression of Group Dominance Motives, Journal of Personality and Social Psychology 99 (2010) 660–682. doi:10.1037/a0019627.
[9] F. Gasparini, G. Rizzi, A. Saibene, E. Fersini, Benchmark Dataset of Memes with Text Transcriptions for Automatic Detection of Multi-modal Misogynistic Content, Data in Brief 44 (2022) 108526.
[10] E. Fersini, F. Gasparini, G. Rizzi, A. Saibene, B. Chulvi, P. Rosso, A. Lees, J. Sorensen, SemEval-2022 Task 5: Multimedia Automatic Misogyny Identification, in: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 2022, pp. 533–549.
[11] B. R. Chakravarthi, S. Rajiakodi, R. Ponnusamy, K. Pannerselvam, A. K. Madasamy, R. Rajalakshmi, H. LekshmiAmmal, A. Kizhakkeparambil, S. S Kumar, B. Sivagnanam, C. Rajkumar, Overview of Shared Task on Multitask Meme Classification - Unraveling Misogynistic and Trolls in Online Memes, in: Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion, 2024, pp. 139–144.
[12] L. Plaza, J. Carrillo-de Albornoz, R. Morante, E. Amigó, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2023 – Learning with Disagreement for Sexism Identification and Characterization (Extended Overview), in: Working Notes of CLEF 2023 – Conference and Labs of the Evaluation Forum, 2023.
[13] E. Amigó, A. Delgado, Evaluating Extreme Hierarchical Multi-label Classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, ACL, Dublin, Ireland, 2022, pp. 5809–5819.
[14] S. Fan, R. A. Frick, M. Steinebach, FraunhoferSIT@EXIST2024: Leveraging Stacking Ensemble Learning for Sexism Detection, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[15] M. Usmani, R. Siddiqui, S. Rizwan, F. Khan, F. Alvi, A. Samad, Sexism Identification in Tweets using BERT and XLM-Roberta, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[16] T. Smith, R. Nie, J. Trippas, D. Spina, RMIT-IR at EXIST Lab at CLEF 2024, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[17] M. Obrador Reina, A. García Cucó, LightGMB for Sexism Identification in Memes, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[18] A. Shah, A. Gokhale, Team Aditya at EXIST 2024 — Detecting Sexism In Multilingual Tweets Using Contrastive Learning Approach, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[19] A. Azadi, B. Ansari, S. Zamani, Bilingual Sexism Classification: Fine-Tuned XLM-RoBERTa and GPT-3.5 Few-Shot Learning, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[20] M. Siino, I. Tinnirello, Prompt Engineering for Identifying Sexism using GPT Mistral 7B, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[21] A. Vetagiri, P. Mogha, P. Pakray, Cracking Down on Digital Misogyny with MULTILATE a MULTImodaL hATE Detection System, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[22] A. Petrescu, C.-O. Truică, E.-S. Apostol, Language-based Mixture of Transformers for EXIST2024, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[23] Y.-Z. Fang, L.-H. Lee, J.-D. Huang, NYCU-NLP at EXIST 2024 – Leveraging Transformers with Diverse Annotations for Sexism Identification in Social Networks, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[24] G. Shimi, J. Mahibha, D. Thenmozhi, Automatic Classification of Gender Stereotypes in Social Media Post, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[25] R. Pan, J. A. García Díaz, T. Bernal Beltrán, R. Valencia-Garcia, UMUTeam at EXIST 2024: Multimodal Identification and Categorization of Sexism by Feature Integration, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[26] M. Sreekumar, S. K, T. Durairaj, S. Gopalakrishnan, K. Swaminathan, Sexism Identification in Tweets using Traditional Machine Learning Approaches, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[27] R. Keinan, Sexism Identification in Social Networks using TF-IDF Embeddings, PreProccessing, Feature Selection, Word/Char N-Grams and Various Machine Learning Models In Spanish and English, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[28] V. Ruiz, J. Carrillo-de-Albornoz, L. Plaza, Concatenated Transformer Models based on Levels of Agreements for Sexism Detection, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[29] G. Rizzi, D. Gimeno-Gómez, E. Fersini, C.-D. Martínez-Hinarejos, PINK at EXIST2024: A Cross-Lingual and Multi-Modal Transformer Approach for Sexism Detection in Memes, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[30] J. Ma, R. Li, RoJiNG-CL at EXIST 2024: Sexism Identification in Memes by Integrating Prompting and Fine-Tuning, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[31] A. Naebzadeh, M. Nobakhtian, S. Eetemadi, NICA at EXIST CLEF Tasks 2024, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[32] F. Maqbool, E. Fersini, A Contrastive Learning based Approach to Detect Sexism in Memes, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[33] A. Menárguez Box, D. Torres Bertomeu, DiTana-PV at sEXism Identification in Social neTworks (EXIST) Tasks 4 and 6: The Effect of Translation in Sexism Identification, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[34] Á. Carrillo-Casado, J. Román-Pásaro, J. Mata-Vázquez, V. Pachón-Álvarez, I2C-UHU at EXIST 2024: Transformer-Based Detection of Sexism and Source Intention in Memes Using a Learning with Disagreement Approach, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[35] N. Maqbool, Sexism Identification in Social Networks: Advances in Automated Detection – A Report on the Exist Task at CLEF, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[36] M. Guerrero-García, M. Cerrejón-Naranjo, J. Mata-Vázquez, V. Pachón-Álvarez, I2C-UHU at EXIST2024: Learning from Divergence and Perspectivism for Sexism Identification and Source Intent Classification, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[37] A. Shanbhag, S. Jadhav, A. Date, S. Joshi, S. Sonawane, The Wisdom of Weighing: Stacking Ensembles for a More Balanced Sexism Detector, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[38] J. Tavarez-Rodríguez, F. Sánchez-Vega, A. Rosales-Pérez, A. P. López-Monroy, Better Together: LLM and Neural Classification Transformers to Detect Sexism, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[39] G. Aru, N. Emmolo, A. Piras, S. Marzeddu, J. Raffi, L. C. Passaro, RoBEXedda: Enhancing Sexism Detection in Tweets for the EXIST 2024 Challenge, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[40] D. D. Barua, M. S. U. R. Sourove, F. Haider, F. T. Shifat, M. F. Ishmam, M. Fahim, F. A. Bhuiyan, Penta ML at EXIST 2024: Tagging Sexism in Online Multimodal Content With Attention-enhanced Modal Context, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[41] K. Villarreal-Haro, F. Sánchez-Vega, A. Rosales-Pérez, A. P. López-Monroy, Stacked Reflective Reasoning in Large Neural Language Models, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[42] S. Khan, G. Pergola, A. Jhumka, Multilingual Sexism Identification via Fusion of Large Language Models, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[43] U. W. Pasha, Multilingual Sexism Detection in Memes: A CLIP-Enhanced Machine Learning Approach, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[44] E. Martinez, J. Cuadrado, J. C. Martinez-Santos, E. Puertas, VerbaNex AI at CLEF EXIST 2024: Detection of Online Sexism using Transformer Models and Profiling Techniques, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[45] M. P. Jimenez-Martinez, J. M. Raygoza-Romero, C. E. Sánchez-Torres, I. H. Lopez-Nava, M. Montes-y Gómez, Enhancing Sexism Detection in Tweets with Annotator-Integrated Ensemble Methods and Multimodal Embeddings for Memes, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[46] F. T. Shifat, F. Haider, M. S. U. R. Sourove, D. D. Barua, M. F. Ishmam, M. Fahim, F. A. Bhuiyan, Penta-nlp at EXIST 2024 Task 1–3: Sexism Identification, Source Intention, Sexism Categorization In Tweets, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[47] L. M. Quan, D. V. Thin, Sexism Identification in Social Networks with Generation-based Approach, in: Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.