<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Overview of EXIST 2025: Learning with Disagreement for Sexism Identification and Characterization in Tweets, Memes, and TikTok Videos (Extended Overview)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Plaza</string-name>
          <email>lplaza@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jorge Carrillo-de-Albornoz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iván Arcos</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damiano Spina</string-name>
          <email>damiano.spina@rmit.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrique Amigó</string-name>
          <email>enrique@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julio Gonzalo</string-name>
          <email>julio@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roser Morante</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>RMIT University</institution>
          ,
          <addr-line>3000 Melbourne</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Nacional de Educación a Distancia (UNED)</institution>
          ,
          <addr-line>28040 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Politécnica de Valencia (UPV)</institution>
          ,
          <addr-line>46022 Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI)</institution>
          ,
          <addr-line>46022 Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>338</fpage>
      <lpage>347</lpage>
      <abstract>
        <p>This paper presents the EXIST 2025 Lab on sexism detection and categorization in social media, which took place at the CLEF 2025 conference and marks the fifth edition of the EXIST Shared Task. Building on the success of previous editions, EXIST 2025 addresses the growing concern over the spread of offensive and discriminatory content targeting women across online platforms, which significantly impacts women's well-being and freedom of expression. The lab comprises nine tasks in two languages (English and Spanish), organized around three core objectives: sexism identification, source intention detection, and sexism categorization. These tasks are applied across three media types (text: tweets; image: memes; video: TikToks), offering a multimodal perspective that allows for a deeper understanding of how sexism manifests across different formats and user interactions. As in previous editions, EXIST 2025 adopts the “Learning With Disagreement” paradigm, using annotations from multiple annotators that reflect diverse and at times conflicting viewpoints. This overview describes the task design, datasets, evaluation methodology, participating systems, and results of EXIST 2025, which has surpassed participation expectations with 244 registered teams from 38 countries, 114 teams from 23 countries submitting runs, a total of 873 runs processed, and 33 working notes published. Warning: Some of the examples included in this paper may contain offensive language and explicit descriptions of sexist behavior, which may be disturbing to the reader.</p>
      </abstract>
      <kwd-group>
        <kwd>sexism identification</kwd>
        <kwd>sexism categorization</kwd>
        <kwd>learning with disagreement</kwd>
        <kwd>tweets</kwd>
        <kwd>memes</kwd>
        <kwd>TikTok videos</kwd>
        <kwd>human-centric AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Sexism refers to prejudice or discrimination based on a person’s sex or gender, often manifesting in
the belief that one gender is superior to another. It can take many forms, from overt aggression and
harassment to subtler behaviors and norms that reinforce inequality. While sexism affects individuals
of all genders, it disproportionately impacts women, particularly in digital spaces.</p>
      <p>In recent years, online platforms like Twitter and TikTok have become breeding grounds for the
proliferation of sexist discourse. On Twitter, sexism often manifests through harassment, trolling, and
misogynistic hashtags that normalize discriminatory narratives [1, 2]. TikTok, by contrast, poses unique
challenges due to its algorithm-driven content promotion and its popularity among younger audiences.
Its recommendation system can generate filter bubbles that reinforce sexist ideologies [3], while visual
trends and content moderation disparities contribute to the hypersexualization and objectification of
women [4, 5]. These dynamics not only perpetuate traditional gender stereotypes but can also shape
the perceptions and behaviors of young users.</p>
      <p>To tackle these challenges, the sEXism Identification in Social neTworks (EXIST) campaign was
launched in 2021. EXIST is a series of shared tasks and scientific events aimed at identifying, analyzing,
and mitigating sexist content on social networks. The first two editions were hosted under the IberLEF
forum [6, 7], and focused on textual data. In 2023, EXIST became a CLEF Lab [8], introducing a third
task centered on detecting the communicative intention behind sexist messages and adopting for the
first time the Learning with Disagreement (LeWiDi) paradigm [9]. This paradigm acknowledges that
disagreements among annotators are not noise, but valuable signals that reflect the subjectivity inherent
to tasks like sexism detection. The fourth edition of EXIST (2024) expanded the challenge to multimodal
data by introducing tasks involving memes. Memes, while often humorous, are increasingly used to
spread prejudices under the guise of irony [10, 11, 12, 13]. Their blend of text and image makes them
particularly insidious vectors for normalizing sexist stereotypes, especially when humor is used to
reduce the perceived harm [14, 15].</p>
      <p>EXIST 2025 marks the fifth edition of the challenge and represents its most ambitious iteration yet.
Held again as a CLEF Lab,1 it comprises nine tasks in total—covering three core objectives (sexism
identification, source intention detection, and sexism categorization) across three modalities: tweets
(text), memes (image), and TikToks (video). This multimodal and bilingual (English and Spanish)
design aims to capture the varied ways in which sexism is expressed and interpreted online, enabling
researchers to develop AI models that are sensitive to both linguistic and visual cues, as well as the
platform-specific dynamics that influence sexist content dissemination.</p>
      <p>Throughout its four previous editions, more than 100 teams from universities and companies around
the world have participated in EXIST, developing and testing state-of-the-art models to address this
pressing social issue. The 2025 edition continues to foster international participation, with 244 registered
teams from 38 countries. Of these, 114 teams from 23 countries submitted valid runs, resulting in a total
of 873 system submissions.</p>
      <p>In the following sections, we present a detailed overview of the tasks, datasets, annotation process,
evaluation methodology, and system results for EXIST 2025.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Tasks</title>
      <p>The 2025 edition of EXIST features nine tasks, which are described below. The languages addressed are
English and Spanish and the datasets are collections of tweets, memes and TikTok videos (see Section
3). For the tasks on TikTok videos, all the partitions of the dataset are new, whereas for the tasks on
tweets and memes we employ the EXIST 2023 and 2024 datasets, respectively.</p>
      <sec id="sec-2-1">
        <title>2.1. Task 1.1: Sexism Identification in Tweets</title>
        <p>This is a binary classification task where systems must decide whether or not a given tweet expresses sexist ideas, either because it is sexist itself, it describes a sexist situation, or it criticizes a sexist behavior. The following examples from the dataset show a sexist and a not sexist message, respectively.</p>
        <p>(<xref ref-type="bibr" rid="ref1">1</xref>) Sexist. It’s less of #adaywithoutwomen and more of a day without feminists, which, to be quite honest, sounds lovely.</p>
        <p>(<xref ref-type="bibr" rid="ref2">2</xref>) Not sexist. Just saw a woman wearing a mask outside spank her very tightly leashed dog and I gotta say I love learning absolutely everything about a stranger in a single instant.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task 1.2: Source Intention in Tweets</title>
        <p>This task aims to categorize the message according to the intention of the author. We propose the following ternary classification of tweets:
• Direct sexist message. The intention is to write a message that is sexist by itself or incites sexism, as in:
(<xref ref-type="bibr" rid="ref3">3</xref>) A woman needs love, to fill the fridge, if a man can give this to her in return for her services, I don’t see what else she needs.
• Reported sexist message. The intention is to report and share a sexist situation suffered by a woman or women in first or third person, as in:
(<xref ref-type="bibr" rid="ref4">4</xref>) I doze in the subway, I open my eyes feeling something weird: the hand of the man sat next to me on my leg #SquealOnYourPig.
• Judgemental message. The intention is to condemn sexist situations or behaviours, as in:
(<xref ref-type="bibr" rid="ref5">5</xref>) As usual, the woman was the one quitting her job for the family’s welfare...</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Task 1.3: Sexism Categorization in Tweets</title>
        <p>Many facets of a woman’s life may be the focus of sexist attitudes, including domestic role, career opportunities, and sexual image, to name a few. Accordingly, each sexist tweet must be assigned one or more of the following categories:
• Ideological and inequality. It includes messages that discredit the feminist movement. It also includes messages that reject inequality between men and women, or present men as victims of gender-based oppression.
(<xref ref-type="bibr" rid="ref6">6</xref>) #Feminism is a war on men, but it’s also a war on women. It’s a war on female nature, a war on femininity.
• Stereotyping and dominance. It includes messages that suggest women are more suitable or inappropriate for certain tasks, and somehow inferior to men.
(<xref ref-type="bibr" rid="ref7">7</xref>) Most women no longer have the desire or the knowledge to develop a high quality character, even if they wanted to.
• Objectification. It includes messages where women are presented as objects apart from their dignity and personal aspects. We also include messages that assume or describe certain physical qualities that women must have in order to fulfill traditional gender roles.
(<xref ref-type="bibr" rid="ref8">8</xref>) No offense but I’ve never seen an attractive african american hooker. Not a single one.
• Sexual violence. It includes messages where sexual suggestions, requests or harassment of a sexual nature (rape or sexual assault) are made.
(<xref ref-type="bibr" rid="ref9">9</xref>) I wanna touch your tits... you can’t imagine what I can do on your body.
• Misogyny and non-sexual violence. It includes expressions of hatred and violence towards women.
(10) Domestic abuse is never okay... Unless your wife is a bitch.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Task 2.1: Sexism Identification in Memes</title>
        <p>As in Task 1.1, this is a binary classification task consisting of deciding whether or not a meme
is sexist, as in Figure 1.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Task 2.2: Source Intention in Memes</title>
        <p>As in Task 1.2, this task aims to categorize the meme according to the intention of the author.
However, in this task systems should only classify memes in two classes: direct or judgemental,
as shown in Figure 2.</p>
        <p>[Figure 1: (a) sexist meme; (b) non sexist meme. Figure 3: (a) ideological &amp; inequality; (b) objectification; (c) stereotyping &amp; dominance; (d) sexual violence; (e) misogyny &amp; non-sexual violence.]</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Task 2.3: Sexism Categorization in Memes</title>
        <p>This task aims to classify sexist memes according to the categorization provided for Task 1.3.
Figure 3 shows one meme of each sexist category.</p>
      </sec>
      <sec id="sec-2-7">
        <title>2.7. Task 3.1: Sexism Identification in TikToks</title>
        <p>As in Tasks 1.1 and 2.1, systems must determine whether short videos shared on TikTok are
sexist.</p>
      </sec>
      <sec id="sec-2-8">
        <title>2.8. Task 3.2: Source Intention in TikToks</title>
        <p>As in Tasks 1.2 and 2.2, this task aims to categorize TikTok short videos according to the intention
of the author, as direct or judgemental.</p>
      </sec>
      <sec id="sec-2-9">
        <title>2.9. Task 3.3: Sexism Categorization in TikToks</title>
        <p>As in Tasks 1.3 and 2.3, this task aims to categorize short videos according to the sexism categories
provided for Task 1.3.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The EXIST 2025 dataset comprises three types of data: the tweets from the EXIST 2023 dataset, the
memes from the EXIST 2024 dataset and a new dataset of TikTok videos. Plaza et al. [8] and [16] provide
a detailed description of the tweets and memes datasets, respectively. Here we provide a summarized
description of the three datasets.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Sampling</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. EXIST 2023 Tweets Dataset</title>
          <p>We first collected different popular expressions and terms, both in English and Spanish, commonly
used to underestimate the role of women in our society. These expressions were later used as seeds to
retrieve Twitter data. To mitigate seed bias, we also gathered other common hashtags and expressions
less frequently used in sexist contexts, to ensure a balanced distribution between sexist and not sexist
expressions. This first set of seeds contains more than 400 expressions.</p>
          <p>The set of seeds was then used to extract tweets in English and Spanish (more than 8,000,000 tweets
were downloaded). The crawling was performed from September 1, 2021 to September 30, 2022.
100 tweets were downloaded for each seed per day (retweets and promotional tweets were excluded).
To ensure an appropriate balance between seeds, we removed those with fewer than 60 tweets.
The final set contains 183 seeds for Spanish and 163 seeds for English.</p>
          <p>To mitigate terminology and temporal bias, the final sets of tweets were selected as follows:
for each seed, approximately 20 tweets were randomly selected within the period from September 1, 2021
to February 28, 2022 for the training set, taking into account a representative temporal distribution among
tweets of the same seed. Similarly, 3 tweets per seed were selected for the development set within the
period from May 1 to May 31, 2022, and 6 tweets per seed were selected for the test set within the period
from August 1 to September 30, 2022. Only one tweet per author was included in the final
selection to avoid author bias. Finally, tweets containing fewer than 5 words were removed. As a result,
we have more than 3,200 tweets per language for the training set, around 500 per language for the
development set, and nearly 1,000 tweets per language for the test set.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. EXIST 2024 Memes Dataset</title>
          <p>We first curated a lexicon of terms and expressions leading to sexist memes. The set of seeds encompasses
diverse topics and contains 250 terms, with 112 in English and 138 in Spanish. The terms were used as
search queries on Google Images to obtain the top 100 images. Rigorous manual cleaning procedures
were applied, defining memes and ensuring the removal of noise such as textless images, text-only
images, ads, and duplicates. The final set consists of more than 3,000 memes per language.</p>
          <p>Since the proportion of memes per term was heterogeneous, we discarded the most unbalanced
seeds and made sure that all seeds have at least five memes. To avoid introducing selection bias, we
randomly selected memes, ensuring the appropriate distribution per seed. As a result, we have 2,000
memes per language for the training set and 500 memes per language for the test set.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. TikTok Dataset</title>
          <p>The data was collected with Apify’s TikTok Hashtag Scraper tool,2 using a previously curated list
of 185 Spanish hashtags and 61 English hashtags associated with potentially sexist content. More than
3,500 videos in English and Spanish were downloaded from different TikTok accounts. Rigorous manual
cleaning procedures were applied, ensuring the removal of noise such as ads and duplicates.</p>
          <p>The collected TikTok videos were divided into training and test sets following a chronological
and author-based partitioning strategy. This approach ensured temporal coherence while preventing
data leakage. To achieve this, authors present in the training set were excluded from the test set,
preventing the model from learning author-specific patterns and enhancing its generalization capabilities.
Additionally, each hashtag (seed) was required to contribute a minimum number of videos, ensuring a
more uniform distribution across the dataset. The final selection of videos was conducted randomly but
maintained a temporal distribution to ensure diversity and avoid overrepresentation of any specific
time period.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Datasets Size</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. EXIST 2023 Tweets Dataset</title>
          <p>The dataset consists of three partitions per language. The distribution of tweets per partition and
language is shown in Table 1.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. EXIST 2024 Memes Dataset</title>
          <p>The memes dataset is provided in two partitions per language, training and test. The distribution per
partition and language is shown in Table 2.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. TikTok Dataset</title>
          <p>The TikTok dataset consists of three partitions per language. The distribution of videos per partition is
shown in Table 3.
2https://apify.com/clockworks/tiktok-hashtag-scraper</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Labeling with Disagreements</title>
        <p>The LeWiDi paradigm was adopted to label the TikTok videos, in the same way it was adopted to
label the tweets and memes datasets for EXIST 2023 and 2024, respectively. Differently from previous
EXIST editions, the annotation was performed by trained annotators instead of crowd workers. The
annotation was conducted using Servipoli’s service,3 with eight students organized in pairs consisting
of one male and one female student, in order to avoid biases. Each pair was tasked with annotating
1,000 TikTok videos.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Methodology and Metrics</title>
      <p>As in EXIST 2023 and 2024, we have carried out a soft evaluation and a hard evaluation. The soft
evaluation relates to the LeWiDi paradigm and is intended to measure the ability of the model to capture
disagreements, by considering the probability distribution of labels in the output as a soft label and
comparing it with the probability distribution of the annotations. The hard evaluation is the standard
paradigm and assumes that a single label is provided by the systems for every instance in the dataset.</p>
      <p>From the point of view of evaluation metrics, the tasks can be described as follows:
• Sexism identification (Tasks 1.1, 2.1, and 3.1): binary classification, monolabel.
• Source intention (Tasks 1.2, 2.2, and 3.2): multiclass hierarchical classification, monolabel. The hierarchy
of classes has a first level with two categories, sexist/not sexist, and a second level for the sexist
category with mutually-exclusive subcategories (direct/reported/judgemental for tweets; direct/judgemental
for memes and TikToks). A suitable evaluation metric must reflect the fact that a confusion between
not sexist and a sexist category is more severe than a confusion between two sexist subcategories.
• Sexism categorization (Tasks 1.3, 2.3, and 3.3): multiclass hierarchical classification, multilabel. Again
the first level is a binary distinction between sexist/not sexist, and there is a second level for
the sexist category that includes five subcategories: ideological and inequality, stereotyping and
dominance, objectification, sexual violence, and misogyny and non-sexual violence. These classes
are not mutually exclusive: a tweet may belong to several subcategories at the same time.</p>
      <p>The LeWiDi paradigm can be considered on both sides of the evaluation process:
• The ground truth. In a hard evaluation setting, the variability in the human annotations is
reduced by selecting one and only one gold category per instance, the hard label. In a soft
evaluation setting, the gold standard label for one instance is the set of all the human annotations
existing for that instance. Therefore, the evaluation metric incorporates the proportion of human
annotators that have selected each category (soft labels). Note that in the identification and source
intention tasks, which are monolabel problems, the probabilities of the classes must sum to one. But
in the categorization tasks, which are multilabel, each annotator may select more than one category
for a single instance; therefore, the sum of the class probabilities may be larger than one.
• The system output. In a hard, traditional setting, the system predicts one or more categories
for each instance. In a soft setting, the system predicts a probability for each category, for each
instance. The evaluation score is maximized when the predicted probabilities match the actual
probabilities in a soft ground truth.</p>
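      <p>For concreteness, the derivation of soft labels from raw annotations can be sketched as follows (a minimal sketch: the function, label names, and annotation counts are illustrative and not part of the official evaluation code):</p>

```python
from collections import Counter

def soft_label(annotations, classes):
    """Soft label: the probability of each class is the fraction of
    annotators who selected it. For multilabel tasks each annotator
    contributes a label *set*, so probabilities may sum above 1."""
    n = len(annotations)
    counts = Counter(label for labels in annotations for label in labels)
    return {c: counts[c] / n for c in classes}

# Monolabel example (sexism identification), six hypothetical annotators:
mono = [["sexist"]] * 4 + [["not sexist"]] * 2
print(soft_label(mono, ["sexist", "not sexist"]))  # sexist: 4/6, not sexist: 2/6

# Multilabel example (sexism categorization), two hypothetical annotators:
multi = [["objectification", "sexual violence"], ["objectification"]]
print(soft_label(multi, ["objectification", "sexual violence"]))
# the probabilities sum to 1.5 > 1, as allowed for multilabel tasks
```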
      <p>In EXIST 2025, for each of the tasks, two types of evaluation have been performed:
1. Soft-soft evaluation. For systems that provide probabilities for each category, we perform a
soft-soft evaluation that compares the probabilities assigned by the system with the probabilities
assigned by the set of human annotators. The probabilities of the classes for each instance are
calculated according to the distribution of labels and the number of annotators for that instance.
We use a modification of the original ICM metric (Information Contrast Measure [17]), ICM-Soft
(see details below), as the official evaluation metric in this variant, and we also provide results for
the normalized version of ICM-Soft (ICM-Soft Norm).
2. Hard-hard evaluation. For systems that provide a hard, conventional output, we perform a
hard-hard evaluation. To derive the hard labels in the ground truth from the different annotators’
labels, we use a probabilistic threshold computed for each task. As a result, for the identification tasks, the
class annotated by more than 3 annotators is selected; for the source intention tasks, the class annotated by
more than 2 annotators is selected; and for the categorization tasks (multilabel), the classes annotated by
more than 1 annotator are selected. The instances for which there is no majority class (i.e., no
class receives more probability than the threshold) are removed from this evaluation scheme. The
official metric for this evaluation is the original ICM, as defined by [17]. We also report a normalized
version of ICM (ICM Norm) and F1 (F1YES). In the identification tasks, we use F1 for the positive class. In
the source intention and categorization tasks, we use the macro-average of F1 over all classes (Macro F1). Note, however, that
F1 is not ideal in our experimental setting: although it can handle multilabel situations, it does
not take into account the relationships between classes. In particular, a confusion between not
sexist and any of the sexist subclasses, and a confusion between two of the sexist subclasses, are
penalized equally.</p>
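      <p>The threshold-based derivation of hard labels can be sketched as follows (an illustrative sketch: the function and label names are ours, and the example assumes six annotations per instance):</p>

```python
from collections import Counter

def hard_labels(annotations, threshold):
    """Keep every label chosen by MORE THAN `threshold` annotators.
    An empty result means there is no majority class, so the instance
    is removed from the hard-hard evaluation."""
    counts = Counter(annotations)
    return [label for label, n in counts.items() if n > threshold]

# Hypothetical instance with six annotations:
votes = ["sexist", "sexist", "sexist", "sexist", "not sexist", "not sexist"]
print(hard_labels(votes, threshold=3))  # identification threshold -> ['sexist']
print(hard_labels(votes, threshold=5))  # stricter threshold -> [] (discarded)
```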
      <p>ICM is a similarity function that generalizes Pointwise Mutual Information (PMI), and can be used
to evaluate outputs in classification problems by computing their similarity to the ground truth. The
general definition of ICM is:</p>
      <p>ICM(A, B) = α₁ IC(A) + α₂ IC(B) − β IC(A ∪ B)
where IC(A) is the Information Content of the instance represented by the set of features A. ICM
maps into PMI when all parameters take a value of 1. The general definition of ICM by [17] is applied
to cases where categories have a hierarchical structure and instances may belong to more than one
category. The resulting evaluation metric is proved to be analytically superior to the alternatives in the
state of the art. The definition of ICM in this context is:</p>
      <p>ICM(s(d), g(d)) = 2 IC(s(d)) + 2 IC(g(d)) − 3 IC(s(d) ∪ g(d))
where IC(·) stands for Information Content, s(d) is the set of categories assigned to document d by
system s, and g(d) is the set of categories assigned to document d in the gold standard. The score for
a perfect output (s(d) = g(d)) is the gold standard Information Content, IC(g(d)). The score for a
zero-information system (no category assignment) is −IC(g(d)). We use these two boundaries for
normalisation purposes, truncating to 0 the scores lower than −IC(g(d)).</p>
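      <p>As an illustration, ICM can be sketched for a flat, monolabel case as follows (a minimal sketch: we assume independent categories and invented class frequencies; the real metric additionally handles the category hierarchy):</p>

```python
import math

def ic(categories, prior):
    """Information Content of a category set, assuming independent
    flat categories with known gold-standard frequencies."""
    return sum(-math.log2(prior[c]) for c in categories)

def icm(system, gold, prior):
    """ICM(s, g) = 2*IC(s) + 2*IC(g) - 3*IC(s U g)."""
    return (2 * ic(system, prior) + 2 * ic(gold, prior)
            - 3 * ic(system | gold, prior))

prior = {"sexist": 0.4, "not sexist": 0.6}   # hypothetical class frequencies
gold = {"sexist"}
print(icm({"sexist"}, gold, prior))       # perfect output: equals IC(gold)
print(icm({"not sexist"}, gold, prior))   # confusion: negative score
```

<p>Note that a perfect output scores IC(g(d)) and a wrong assignment scores below zero, matching the normalisation boundaries described above.</p>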
      <p>To the best of our knowledge, no existing metric fits hierarchical multilabel
classification problems in a LeWiDi scenario, so we have defined an extension of ICM (ICM-Soft) that
accepts both soft system outputs and soft ground truth assignments. ICM-Soft works as follows: first,
we define the Information Content of a single assignment of a category c with an agreement a to a
given instance as the probability that instances in the gold standard exceed the agreement level a for
the category c:</p>
      <p>IC({⟨c, a⟩}) = − log₂ P({d ∈ D : g_c(d) ≥ a})
In order to estimate P, we compute the mean and standard deviation of the agreement levels for each class
across instances, and apply the cumulative probability over the inferred normal distribution. In the
case of zero variance, we consider that the probability for values equal to or below the mean is 1
(zero IC), and the probability for values above the mean must be smoothed; this is not, however, the case
for the EXIST datasets.</p>
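      <p>The normal-distribution estimate of P can be sketched as follows (illustrative only: the agreement values are invented, and we use Python's statistics.NormalDist for the cumulative probability):</p>

```python
import math
from statistics import NormalDist, mean, stdev

def ic_soft_single(agreements, a):
    """IC of assigning a category with agreement level `a`:
    -log2 of the probability that the per-instance agreement for this
    category reaches `a`, under a normal distribution fitted to the
    observed agreement levels."""
    dist = NormalDist(mean(agreements), stdev(agreements))
    p = 1.0 - dist.cdf(a)       # P(agreement >= a)
    return -math.log2(p)

# Hypothetical agreement levels for one category across six instances:
agreement = [0.17, 0.33, 0.5, 0.67, 0.83, 1.0]
print(ic_soft_single(agreement, 1.0))  # rare (high) agreement -> high IC
print(ic_soft_single(agreement, 0.5))  # common agreement -> low IC
```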
      <p>Due to the multilabel and hierarchical nature of the classification task, for each classification instance
the gold standard, the system output, and their union (g(d), s(d), and s(d) ∪ g(d)) are sets of category
assignments. The union of the assignments is computed as a fuzzy set, i.e., taking the maximum
agreement value for each category. In order to estimate its information content, we apply a recursive
function similar to the one described by Amigó and Delgado [17] for assignment sets, which avoids the
redundant information of parent categories:</p>
      <p>IC(⋃ᵢ₌₁ⁿ {⟨cᵢ, aᵢ⟩}) = IC({⟨c₁, a₁⟩}) + IC(⋃ᵢ₌₂ⁿ {⟨cᵢ, aᵢ⟩}) − IC(⋃ᵢ₌₂ⁿ {⟨lca(c₁, cᵢ), min(a₁, aᵢ)⟩}) (11)
where lca(cᵢ, cⱼ) is the lowest common ancestor of categories cᵢ and cⱼ.</p>
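      <p>The lowest common ancestor can be computed over the two-level EXIST hierarchy with a simple child-to-parent map (a sketch with shortened, illustrative category names):</p>

```python
def lca(c1, c2, parent):
    """Lowest common ancestor of two categories, given a child -> parent
    map in which top-level categories point to the root 'ROOT'."""
    ancestors = set()
    node = c1
    while node != "ROOT":
        ancestors.add(node)
        node = parent[node]
    ancestors.add("ROOT")
    node = c2
    while node not in ancestors:
        node = parent[node]
    return node

# Two-level hierarchy: sexist subcategories hang below 'sexist'.
parent = {"ideological": "sexist", "objectification": "sexist",
          "sexist": "ROOT", "not sexist": "ROOT"}
print(lca("ideological", "objectification", parent))  # -> 'sexist'
print(lca("ideological", "not sexist", parent))       # -> 'ROOT'
```

<p>Two sexist subcategories share the ancestor sexist, while a sexist subcategory and not sexist only share the root; this is what makes the former confusion less severe under ICM.</p>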
    </sec>
    <sec id="sec-5">
      <title>5. Overview of Approaches</title>
      <p>This section offers an overview of the methodological approaches submitted to EXIST 2025.</p>
      <p>Although 244 teams from 38 different countries registered for participation, 114 teams finally
submitted results, for a total of 873 runs. Teams were allowed to participate in any
of the nine tasks and submit hard and/or soft outputs. Table 4 summarizes the participation in the
different tasks and evaluation contexts. In what follows, we organize the discussion by media type, which allows
for a clearer comparison of modeling strategies across different modalities and highlights trends and
innovations specific to each content type.</p>
      <sec id="sec-5-1">
        <title>5.1. Sexism Detection in Tweets</title>
        <p>Sexism detection in tweets was predominantly approached through Natural Language Processing
(NLP) techniques and neural network-based models. The majority of teams relied on pre-trained large
language models (LLMs), such as BERT, RoBERTa, and domain-specific variants like BERTweet or
HateBERT, often fine-tuned on the EXIST datasets. While transformer-based models dominated, a
minority of teams used traditional machine learning techniques such as Support Vector Machines (SVM)
or Random Forests with TF-IDF, as well as rule-based or lexicon-based methods.</p>
        <p>Many teams applied data preprocessing techniques tailored to social media content, including emoji
normalization, hashtag segmentation, and URL removal. Data augmentation methods, such as
backtranslation, synonym replacement, or oversampling of minority classes, were also employed to mitigate
class imbalance and improve generalization.</p>
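        <p>Preprocessing of this kind can be sketched as follows (illustrative only: no specific team's pipeline is reproduced, and the hashtag segmentation is a naive camel-case split):</p>

```python
import re

def preprocess(tweet):
    """Illustrative social-media preprocessing: URL removal and
    camel-case hashtag segmentation."""
    tweet = re.sub(r"https?://\S+", "", tweet)  # drop URLs
    def split_hashtag(match):
        # insert a space at lowercase -> uppercase boundaries
        return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))
    return re.sub(r"#(\w+)", split_hashtag, tweet).strip()

print(preprocess("Check this #ADayWithoutWomen http://t.co/xyz"))
# -> 'Check this ADay Without Women'
```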
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Sexism Detection in Memes</title>
        <p>For memes, the inherently multimodal nature of the data led teams to combine computer vision and text
analysis methods. Convolutional Neural Networks (CNNs) and visual feature extractors such as CLIP
and ResNet were used to process image data. Meanwhile, embedded text within memes was handled
using transformer-based NLP models.</p>
        <p>Teams used both early fusion (merging textual and visual embeddings before classification) and late
fusion (aggregating predictions from separate pipelines). Although multimodal fusion was key, some
teams focused primarily on one modality, revealing diverse strategic preferences.</p>
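<p>The two fusion styles can be contrasted with toy vectors and a stand-in linear classifier; all dimensions and weights below are invented for illustration:</p>

```python
from statistics import mean

# Toy per-modality representations (stand-ins for transformer/CLIP features).
text_emb  = [0.2, 0.9]
image_emb = [0.7, 0.1, 0.5]

def linear_clf(features, weights):
    """A stand-in classifier: weighted sum clamped to a [0, 1] 'sexist' score."""
    score = sum(f * w for f, w in zip(features, weights))
    return max(0.0, min(1.0, score))

# Early fusion: concatenate the embeddings, then classify once.
fused = text_emb + image_emb
early_score = linear_clf(fused, [0.3, 0.4, 0.1, 0.1, 0.2])

# Late fusion: classify each modality separately, then aggregate predictions.
text_score  = linear_clf(text_emb,  [0.5, 0.5])
image_score = linear_clf(image_emb, [0.4, 0.3, 0.3])
late_score  = mean([text_score, image_score])

print(round(early_score, 3), round(late_score, 3))
```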
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Sexism Detection in TikTok Videos</title>
        <p>Sexism detection in TikToks required integrating audio, visual, and textual information, making
multimodal analysis indispensable. Despite the complexity of the modality, the dominant methods remained
rooted in NLP (particularly for transcript analysis), followed by computer vision models. Multimodal
fusion strategies—especially late fusion—were key in top-performing systems, and some teams adopted
zero-shot or prompt-based learning using general-purpose LLMs such as GPT-3.</p>
        <p>Given TikTok’s social dynamics, models were also designed to be sensitive to context, sometimes
incorporating meta-information, such as hashtags or background music features.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Summary of Approaches per Team</title>
        <p>Next we provide a summary of the methodological approaches followed by the EXIST 2025 teams that
submitted a description paper for the Working Notes. We start with the teams that participated only in
some or all subtasks of Task 1 on processing tweets.</p>
        <p>ANLP-Uniso [18] uses the mT5 model for contextual embeddings and a system that integrates several
machine learning and deep learning classifiers, including both traditional models (Logistic Regression,
SVM) and neural networks (RNN, GRU, hybrid FNN+GRU). To enhance classification accuracy, they
apply extensive preprocessing, feature normalization, dimensionality reduction via PCA, and data
balancing techniques such as SMOTE and class weighting.</p>
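<p>Class weighting, one of the balancing techniques mentioned, is commonly computed with the "balanced" heuristic (as popularized by scikit-learn); a minimal sketch with invented labels:</p>

```python
from collections import Counter

def balanced_class_weights(labels):
    """'Balanced' heuristic: weight_c = n_samples / (n_classes * count_c),
    so rare classes contribute more to the training loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# 6 NO vs 2 YES: the minority class gets a 3x larger weight.
weights = balanced_class_weights(["NO"] * 6 + ["YES"] * 2)
print(weights)
```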
        <p>NLPDame [19] addresses Sub-task 1.3 with a methodology that includes fine-tuning twelve
transformer LLMs within a tailored multi-head and multi-task model architecture that employs CLS, mean,
and max pooling for multi-label text classification. The multi-head architecture is chosen to deal with
multilinguality, while the multi-task architecture incorporates sentiment analysis to enhance the
multi-label classification process. The methodology also involves utilizing the open-source multilingual LLM
Llama-3.2-3B-Instruct and prompt engineering to classify tweets. Additionally, a method incorporating
RAG (Retrieval Augmented Generation), chain-of-thought reasoning and annotators’ profiles was used
to provide contextual information within the LLM prompt engineering framework. A majority voting
system was submitted that combines the predictions from (i) the twelve transformer models with LLM
prompt engineering, and (ii) the same twelve models with LLM prompt engineering extended with
chain-of-thought, annotators’ profiles, and RAG. Various loss functions and thresholds were applied,
as well as class positive weights to tackle class imbalance.</p>
        <p>ECORBI-UPV [20] leverages semantic embeddings generated with pre-trained models from Google’s
Generative AI suite, evaluated in both frozen and fine-tuned form. For classification they use traditional
machine learning models, such as Random Forest, SVM, and MLP.</p>
        <p>Mumul03 [21] employs ModernBERT-large and incorporates demographic information from the
annotator such as gender, ethnicity, age, and other attributes into the model input. By modeling
individual annotator perspectives and aggregating predictions across submodels, they aim at capturing
the subjectivity in annotations.</p>
        <p>Fosu-students [22] reformulates the binary classification problem of Task 1 into a seven-class task.
They implement ModernBERT-large with layered learning rate decay for hierarchical feature
optimization. The model is enhanced with Supervised Contrastive Learning (SCL) to improve discrimination
of nuanced sexism expressions through metric learning. Their architecture incorporates: (1) task
reformulation from binary to fine-grained seven-class prediction, (2) ModernBERT’s memory-efficient
attention mechanisms for long-context understanding, and (3) a hybrid CE+SCL loss (λ=0.9) for robust
representation learning.</p>
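<p>One common formulation of such a hybrid objective is L = λ·CE + (1 − λ)·SupCon with λ = 0.9; the sketch below implements that combination on toy logits and 2-D embeddings. The temperature τ, the batch, and the logits are assumptions for illustration, not the team's settings:</p>

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def supcon_loss(embs, labels, tau=0.1):
    """Supervised contrastive loss (Khosla et al.) over one batch of
    L2-normalized embeddings: pulls same-label pairs together."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    total, n = 0.0, len(embs)
    for i in range(n):
        pos = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not pos:
            continue  # anchors without positives contribute nothing
        denom = sum(math.exp(dot(embs[i], embs[a]) / tau)
                    for a in range(n) if a != i)
        total -= sum(math.log(math.exp(dot(embs[i], embs[p]) / tau) / denom)
                     for p in pos) / len(pos)
    return total / n

lam = 0.9  # the weighting reported by the team
# Toy 7-class logits (target = class 0) and a tiny batch of unit vectors.
ce = cross_entropy([2.0, 0.5, -1.0, 0.1, 0.0, -0.5, 0.3], target=0)
embs = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [-0.28, 0.96]]
labels = [0, 0, 1, 1]
scl = supcon_loss(embs, labels)
loss = lam * ce + (1 - lam) * scl
print(round(loss, 4))
```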
        <p>Warwick [23] develops a hybrid detection framework that integrates the outputs of multiple neural
language models, each encoding different perspectives on the task. Their system combines fine-tuned
monolingual transformers (BERTweet for English, RoBERTuito for Spanish) with instruction-tuned
LLMs such as Claude 3 Sonnet and LLaMA3-70B-Instruct. These models are combined within a
confidence-based multi-stage pipeline: high-confidence predictions from task-specialized models are
preserved, while uncertain instances are routed to general-purpose LLMs for zero-shot classification.
This dynamic strategy balances the precision of specialized models with the broader judgment of
instruction-tuned LLMs.</p>
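<p>The confidence-gated routing can be sketched as follows; the threshold and the stub models are hypothetical placeholders for the fine-tuned transformers and the instruction-tuned LLMs:</p>

```python
def route(tweet, specialist, fallback_llm, threshold=0.8):
    """Confidence-gated pipeline: keep the specialist's prediction when it
    is confident; otherwise defer to a general-purpose LLM (zero-shot)."""
    label, confidence = specialist(tweet)
    if confidence >= threshold:
        return label, "specialist"
    return fallback_llm(tweet), "llm-fallback"

# Stub models standing in for the fine-tuned transformer and the LLM.
specialist = lambda t: ("YES", 0.95) if "sexist" in t else ("NO", 0.55)
fallback   = lambda t: "NO"

print(route("clearly sexist remark", specialist, fallback))
print(route("ambiguous text", specialist, fallback))
```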
        <p>CLiC [24] employs BERT fine-tuning for Task 1.1 and DSPy-based prompt optimization for Tasks
1.2 and 1.3. They explore BERT-based methods for Task 1.1 and contrasting prompt-based methods,
including variants with annotator information and RAG, for the subsequent tasks.</p>
        <p>NetGuardAI [25] experiments with several transformer-based models, including DeBERTa,
mDeBERTa, XLM-RoBERTa, Detoxify, and HateBERT, alongside three levels of text preprocessing: Light,
Classic, and Aggressive Cleaning. Although they tested various data augmentation strategies, such
as translation-based augmentation using Meta AI’s NLLB model and pseudo-labeling with the EDOS
dataset, the final submitted system does not include these enhancements.</p>
        <p>EquityExplorer-2.0 [26] proposes a pipeline that combines label-aware translation, domain-adaptive
pre-training, and ensemble learning. A central component of their system is a prompt-based
Spanish-to-English translation step, designed to preserve the tone and task-relevant semantics of the original
message, selectively incorporating label cues during training. They aimed at enabling the use of
high-performance monolingual models, while maintaining semantic fidelity across languages. They further
adapt DeBERTa-v3-Large and RoBERTa-Large using 2 million unlabeled posts from the EDOS dataset
and fine-tune them individually and in a fused configuration (DTFN). Final predictions are generated
via majority voting, with a tie-handling rule that improves robustness.</p>
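<p>Majority voting with a deterministic tie-break can be sketched as below; the preference order shown is a hypothetical stand-in for the team's actual tie-handling rule:</p>

```python
from collections import Counter

def majority_vote(predictions, tie_break_order):
    """Majority voting over ensemble members; ties are resolved by a fixed
    preference order (a hypothetical stand-in for the team's tie rule)."""
    counts = Counter(predictions)
    best = max(counts.values())
    tied = [label for label, c in counts.items() if c == best]
    if len(tied) == 1:
        return tied[0]
    # Tie: pick the label that appears earliest in the preference order.
    return min(tied, key=tie_break_order.index)

order = ["YES", "NO"]  # e.g. prefer flagging potentially sexist content
print(majority_vote(["YES", "NO", "YES"], order))
print(majority_vote(["YES", "NO"], order))
```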
        <p>Exist@CeDRI [27] uses a combination of multiple text augmentation strategies, including AEDA
(punctuation-based), synonym replacement, back-translation, and light code-switching via round-trip
translation, in order to enhance model reliability and deal with data sparsity. Their architecture builds
on XLM-RoBERTa-large, fine-tuned for three subtasks: binary sexism detection, source classification,
and sexism categorization. Both soft and hard label strategies are incorporated to account for annotation
disagreement, and label smoothing and class-weighted loss functions are applied to manage class
imbalance.</p>
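<p>AEDA-style augmentation inserts random punctuation marks while leaving the original words untouched; a minimal sketch (the insertion ratio is an assumption):</p>

```python
import random

PUNCT = [".", ";", "?", ":", "!", ","]  # the AEDA punctuation set

def aeda(sentence, ratio=0.3, seed=0):
    """AEDA-style augmentation: insert random punctuation marks at random
    positions, preserving the original words and their order."""
    rng = random.Random(seed)
    words = sentence.split()
    n_insert = max(1, int(ratio * len(words)))
    for _ in range(n_insert):
        pos = rng.randint(0, len(words))
        words.insert(pos, rng.choice(PUNCT))
    return " ".join(words)

print(aeda("this tweet needs more training data"))
```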
        <p>Awakened [28] employs an adaptive Mixture of Transformers architecture. The system combines nine
transformer-based models—spanning both English-specific and multilingual variants—each specialized
by language, platform, or task. A dynamic weighting mechanism automatically adjusts the contribution
of each model in the ensemble, based on the detected language and performance metrics, in order to
enable robust and context-aware classification across diverse linguistic settings.</p>
        <p>Dandys-de-BERTganim [29] adopts a multi-task learning architecture with language-specific
transformers for English and Spanish, integrating demographic information from annotators as contextual
signals. They enhance model generalization through data augmentation techniques such as
back-translation and a punctuation-based augmentation method. Furthermore, they introduce a soft-labeling
data reader to better reflect annotation disagreement, aligning with the LeWiDi paradigm.</p>
        <p>DuthThrace [30] develops a transformer-based multilingual architecture, fine-tuned with techniques
such as oversampling, class weighting, and soft-label learning to account for class imbalance and
annotator disagreement.</p>
        <p>CIMAT-CS-NLP [31] proposes a method based on a single multitask query to LLMs, designing
a query that first generates chain-of-thought justifications and then requests answers for all tasks
simultaneously. To automate query refinement, they apply evolutionary computation, optimizing the
F1-macro on a development subset. Experiments are performed with DeepSeek-R1-Distill-Llama-8B
and Gemini-1.5-Flash. They fine-tune a BERT-like model with the LLM-generated justifications, with
DeepSeek achieving similar performance to the Gemini-based justifications despite the reduced model
size.</p>
        <p>UC3M-LI [32] develops a variety of systems for Task 1.1 and Task 1.2, combining traditional
machine learning models, Transformer-based architectures, ensemble methods, and hybrid CNN-BERT
approaches. Their approach incorporates data augmentation and multilingual modeling strategies to
address challenges such as label disagreement and language variation.</p>
        <p>Cyberpuffs [33] uses several LLMs, prominently multilingual BERT and XLM-RoBERTa, combined
with an ensemble learning approach to process tweets. They employ data augmentation techniques
such as cross-translation, EASE, and AEDA, and develop separate models for English and Spanish
to optimize language-specific predictions. Model evaluation is conducted using hard labels, derived
through majority annotator voting, and soft labels, derived from class probability distributions.</p>
        <p>COMFOR [34] approaches the tasks with an SVM based on a comprehensive feature representation,
including embeddings and lexical features. For the third subtask, this classifier was used as the basis for
a classifier chain.</p>
        <p>CIMAT-GTO [35] uses a hybrid setting aimed at taking advantage of the reasoning produced by
generative LLMs using justification-guided knowledge expansion when fine-tuning a smaller transformer-based
model for classification.</p>
        <p>Mario [36] applies hierarchical Low-Rank Adaptation (LoRA) of Llama 3.1 8B. Their method introduces
conditional adapter routing that explicitly models label dependencies across the three hierarchically
structured subtasks. Unlike conventional LoRA applications that target only attention layers, they
apply adaptation to all linear transformations, enhancing the model’s capacity to capture task-specific
patterns. They train separate LoRA adapters (rank=16, QLoRA 4-bit) for each subtask using unified
multilingual training that leverages Llama 3.1’s native bilingual capabilities. The method requires
minimal preprocessing and uses standard supervised learning.</p>
        <p>FHSTP [37] proposes three machine learning models to address these tasks, including a Speech Concept
Bottleneck Model (SCBM), a Speech Concept Bottleneck Model with Transformer (SCBMT), and a
fine-tuned XLM-RoBERTa transformer model that serves as baseline. SCBM uses descriptive adjectives
as human-interpretable bottleneck concepts. SCBM leverages LLMs to map input texts to an abstract
adjective-based representation, which is then utilized to train a lightweight classifier for downstream
tasks. SCBMT extends this approach by fusing transformer-based contextual embeddings with the
adjective-based representation, aiming to balance interpretability and classification performance.</p>
        <p>NYCU-NLP [38] integrates annotator demographics and leverages bilingual fusion by combining
original and cross-translated tweets. They implement a hierarchical pipeline and compare three distinct
modeling strategies: a fine-tuned transformer-based dual-encoder architecture with early and late
fusion, a zero-shot auto-regressive LLM, and a zero-shot diffusion-based LLM. The transformer-based
approach consistently achieves the highest performance across most metrics.</p>
        <p>Next we present the approaches of teams that participated only in Task 2 on processing memes.</p>
        <p>TrankilTwice [39] participates in Task 2.1 with an end-to-end system integrating LLM-based
prompting strategies, cross-modal language encoding, and graph-based modeling at meme level, observing
performance gaps across languages.</p>
        <p>NaturalThinkers [40] integrates visual and textual feature extraction using BLIP (Bootstrapping
Language-Image Pre-training), BERT, and ViT (Vision Transformer), followed by an attention-based
fusion mechanism. A multi-layer perceptron (MLP) then produces the final classification, and the
system is exposed through a Gradio-based user interface.</p>
        <p>ArcosGPT [41] adds BLIP-generated image captions to OCR text. By further including a GPT-4o
description of the memes they obtain an increase of 8.2 points. They obtain the best overall performance
with a ViT+RoBERTa fusion model.</p>
        <p>CLTL [42] follows a hard majority voting ensemble strategy to process memes, where the component
models include a multimodal model that combines the representations of Swin Transformer V2 and
a pre-trained language model (RoBERTa or BERT), and text-only models that use meme text and
image captions as input. The text-only approaches include pre-trained transformer models (RoBERTa,
BERT, and a BERTweet model fine-tuned for sexism detection) and a conventional machine learning
approach, namely an SVM with stylometric and emotion-based features.</p>
        <p>I2C-UHU-Altair [43] uses LLMs and vision-language models (VLMs) to process both textual and
visual information in memes. To enhance model robustness, they adopt the LeWiDi framework, as an
attempt to allow the system to benefit from divergent annotations that reflect the inherent ambiguity
and subjectivity in sociolinguistic tasks.</p>
        <p>GrootWatch [44] participates in both the tweets and memes tasks. For tweet classification, they used
a multi-task headed BERT model enriched with relevant information surrounding the tweet, helping
the model achieve a full understanding of the tweet and its context. For memes, they used a VLM-based
application to detect and categorise sexism in different scenarios.</p>
        <p>The following are the approaches of the teams that participated only in the TikTok tasks.</p>
        <p>ECA-SIMM-UVa [45] follows a segmentation-oriented approach, splitting TikTok videos into textual,
audio, and video channels, driven by the hypothesis that sexism can manifest in spoken words, embedded
text, speaker tone, or visual content (text, pictures or other images). They train individual deep learning
classifiers for each channel and explore various prediction fusion mechanisms, such as One Is Enough
(OIE), Majority Voting, and Probabilistic OIQ for hard evaluation, as well as Logistic Regression and
Weighted Sum for soft evaluation, to combine predictions. Models using the textual channel show
superior performance, especially when using the original text provided with each sample in the dataset.
These models consistently outperform audio and video channels, indicating that textual information is
the most informative source for sexism detection in this context.</p>
        <p>DS@GT EXIST [46] implements a multimodal framework for automated sexism detection in
short-form videos, incorporating audio, visual, and textual signals. They explore the use of transformer-based
models, including RoBERTa for text, VideoMAE for video, and CNN-LSTM pipelines for audio, and
introduce a generative AI-enhanced pipeline using Gemini to produce video summaries and analyses,
which are combined with traditional modalities.</p>
        <p>Finally, a few teams participated in all three tasks, processing tweets, memes, and TikTok videos.</p>
        <p>UMUTeam [47] addresses all three subtasks with multilingual Transformer-based models, including
XLM-RoBERTa (base and large versions) for text, ViT for image features, and VideoMAE for video
input. They apply specialized preprocessing and label handling for each modality. Soft-label learning
is implemented using mean squared error (MSE) loss for Subtasks 1 and 2, which involve binary
and multiclass classification, respectively, and binary cross-entropy (BCE) loss for Subtask 3, which
is a multilabel classification problem. In all cases, annotator votes are transformed into probability
distributions to capture label uncertainty. For hard-label variants, discrete predictions are obtained by
selecting the class or classes with the highest probability from the model’s output during the evaluation
stage.</p>
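<p>The conversion from annotator votes to soft labels, and from soft labels back to hard predictions, can be sketched as follows (the votes shown are invented):</p>

```python
from collections import Counter

def votes_to_soft(votes, classes):
    """Turn raw annotator votes into a probability distribution (soft label)."""
    counts = Counter(votes)
    total = len(votes)
    return [counts[c] / total for c in classes]

def soft_to_hard(soft, classes):
    """Hard label: the class (or classes) with maximum probability."""
    top = max(soft)
    return [c for c, p in zip(classes, soft) if p == top]

classes = ["YES", "NO"]
soft = votes_to_soft(["YES", "YES", "YES", "YES", "NO", "NO"], classes)
print(soft, soft_to_hard(soft, classes))
```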
        <p>CogniCIC [48] explores tailored methodologies to process tweets, memes, and TikTok videos. For
Subtask 1 they compare two approaches: the transformer-based HateBERT model and the generative
Claude 3.7 model. HateBERT is optimized through tweet preprocessing, regularized training, and
multitask learning, while Claude 3.7 leverages advanced multimodal capabilities, integrating visual
and textual cues for flexible and effective content interpretation. For Subtasks 2 and 3 they use Claude
3.7, which incorporates multimodal inputs, including visual frames from memes and videos, enabling
nuanced distinctions, such as direct sexist expressions versus judgmental critiques.</p>
        <p>Bergro [49] follows a generalizable BERT-based approach to identify and classify the source intent of
sexism across different social network channels. This approach focuses on individual models trained
on tweets that are then applied to both meme (image) and TikTok data using OCR and annotations,
respectively. This is an example of a single model fine-tuned on one media type and applied to multiple
media types with minimal data preprocessing required.</p>
        <p>BeatrizRuiz [50] uses three transformer-based models (DistilBERT, XLM-RoBERTa, and DistilGPT-2)
to address all tasks. The results show that, while all models tend to overpredict sexist content and
underutilize the non-sexist class in complex subtasks, DistilBERT demonstrates the most balanced
performance in binary classification, XLM-RoBERTa shows robustness but a propensity for
overgeneralization, and DistilGPT-2 exhibits greater flexibility in multilabel assignments, despite its generative
architecture.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>In the following subsections, we present the results of both the participants and the baseline systems
for each task, organized by evaluation mode (soft or hard).</p>
      <sec id="sec-6-1">
        <title>6.1. Task 1.1: Sexism Identification in Tweets</title>
        <sec id="sec-6-1-1">
          <title>6.1.1. Soft Evaluation</title>
        </sec>
        <sec id="sec-6-1-2">
          <title>6.1.2. Hard Evaluation</title>
          <p>(Leaderboard for the hard evaluation of Task 1.1: systems ranked by ICM-Hard, headed by
BERT-Simpson_3, GrootWatch_3, and hfstp_2.)</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Task 1.2: Source Intention in Tweets</title>
        <sec id="sec-6-2-1">
          <title>6.2.1. Soft Evaluation</title>
          <p>A total of 36 systems outperformed the strongest baseline (EXIST2025-test_majority-class, where all
instances are labeled as ‘NO’), indicating moderate variation in system effectiveness. All systems also
outperformed the EXIST2025-test_minority-class baseline.</p>
          <p>The relative difference between the best and fifth-best teams (GrootWatch and NetGuardAI) was
15.7%, suggesting relatively close performance among the top submissions. This narrow spread points
to a convergence in probabilistic modeling strategies among leading participants, despite overall scores
being lower than in other tasks—likely due to the increased ambiguity inherent in intent classification.</p>
          <p>Leaderboard for EXIST 2025 Task 1.2 (author intention analysis in tweets), for the soft evaluation. Metrics:
ICM-S = ICM Soft, ICM-S Nr = ICM Soft Norm, CE = Cross Entropy.</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>6.3.2. Hard Evaluation</title>
          <p>For the Hard–Hard evaluation of Task 1.3, a total of 130 systems were submitted (see Table 11). The
normalized ICM-Hard values ranged from 0.0000 to 0.6514, with an average of 0.353 and a standard
deviation of 0.193. Remarkably, 106 systems surpassed the best baseline (EXIST2025-test_majority-class),
demonstrating high effectiveness in predicting the aggregated ground truth labels. The range between
the top and fifth systems was only 9.1%, highlighting a tight cluster of top performances. This compact
variation among the leaders suggests strong generalization in handling categorical distinctions of sexism
in tweets when annotations are aggregated. All except four systems achieved better results than the
minority class baseline (all instances labeled as ‘SEXUAL-VIOLENCE’).</p>
          <p>(Table 11 tail: the lowest-ranked runs, from teams BeatrizRuiz and TheMagicToken, obtained
ICM-Hard scores between −3.1920 and −4.9143.)</p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>6.4. Task 2.1: Sexism Identification in Memes</title>
        <sec id="sec-6-3-1">
          <title>6.4.1. Soft Evaluation</title>
          <p>Table 12 presents the results for the classification of memes as sexist or not sexist. A total of 8 systems
participated in the Soft–Soft evaluation. The normalized scores ranged from 0.0650 to 0.5110, with a
mean of 0.373 and a standard deviation of 0.149. All but one system outperformed the strongest baseline
(EXIST2025-test_majority-class), indicating that most submissions were effective under this probabilistic
evaluation. The relative difference between the highest and lowest among the top five submissions
from different teams was substantial (87.3%), with a notable drop from the fourth to fifth system. This
wide spread suggests room for improvement and divergence in approaches to modeling soft labels in
multimodal data.</p>
        </sec>
        <sec id="sec-6-3-2">
          <title>6.4.2. Hard Evaluation</title>
          <p>Table 13 presents the results for the hard-hard evaluation of Task 2.1. This task received 18 valid system
submissions. The normalized ICM-Hard values ranged from 0.1711 to 0.6877, with an average of 0.471
and a standard deviation of 0.145. Out of these, 16 systems outperformed the
EXIST2025-test_majority-class baseline. The top five systems from distinct teams showed a moderate performance spread, with a
28.3% relative difference between the highest and lowest performers in this top group. All submissions
surpassed the EXIST2025-test_minority-class baseline. Compared to Task 1.1, the distribution in Task 2.1
reflects greater difficulty in aligning with aggregated hard labels in multimodal settings, likely due to
the inherent ambiguity and subjective interpretation of memes.</p>
        </sec>
      </sec>
      <sec id="sec-6-4">
        <title>6.5. Task 2.2: Source Intention in Memes</title>
        <sec id="sec-6-4-1">
          <title>6.5.1. Soft Evaluation</title>
          <p>Table 14 presents the results for the classification of memes according to the intention of the author,
with the outputs provided as the probabilities of the different classes. Only 5 systems participated in
the Soft–Soft evaluation. The average normalized score across systems was 0.228, with a standard
deviation of 0.101. All five systems surpassed the majority baseline. Taking into account the top ranked
submissions from distinct teams, the relative difference between the best and the worst among this
top-4 was 81.7%, indicating a wide spread in system quality.</p>
        </sec>
        <sec id="sec-6-4-2">
          <title>6.5.2. Hard Evaluation</title>
          <p>Table 15 presents the results for the hard-hard evaluation of Task 2.2. We received 15 system submissions.
The normalized ICM-Hard metric ranged from 0.0000 to 0.5784, with an average of 0.308 and a standard
deviation of 0.169. Thirteen systems outperformed the EXIST2025-test_majority-class baseline, reflecting
strong participation despite the challenging nature of the task. Concerning the five best submissions
from different teams, the top system outperformed the fifth by 52.7%, a considerable difference suggesting
uneven performance across modeling strategies. Nonetheless, the narrow gap among the three leading
systems (near 10%) points to the emergence of competitive approaches for intent recognition, even in
the presence of aggregated hard annotations derived from subjectively interpreted multimodal inputs.</p>
          <p>(Soft-evaluation leaderboard fragment for Task 2.2, ICM-Soft: gold standard 4.7018; UMUTeam_1
−1.6327; I2C-UHU-Altair_1 −2.0736; surrey-mm-group_1 −2.4423; UMUTeam_2 −2.4994; Nogroupnocry_1
−4.1395; majority-class baseline −5.0745; minority-class baseline −18.9382.)</p>
        </sec>
      </sec>
      <sec id="sec-6-5">
        <title>6.6. Task 2.3: Sexism Categorization in Memes</title>
        <sec id="sec-6-5-1">
          <title>6.6.1. Soft Evaluation</title>
        </sec>
        <sec id="sec-6-5-2">
          <title>6.6.2. Hard Evaluation</title>
          <p>Finally, Table 17 presents the results for classifying memes based on the aspects of women being attacked,
with outputs provided as a single class prediction. A total of 14 systems participated (excluding the gold
and baselines). Thirteen of them scored above the best baseline, with an average normalized ICM-Hard
of 0.262 and a standard deviation of 0.158. The relative difference between the top and fifth-best system
from different teams was 59.5%, indicating competitive but not saturated performance across top ranks.
All systems clearly outperformed the EXIST2025-test_minority-class baseline.</p>
          <p>(Table 17 leaderboard, ICM-Hard: gold standard 2.4100; CogniCIC_1 0.0244; GrootWatch_3 −0.0798;
GrootWatch_2 −0.3550; ArcosGPT_1 −0.4187; GrootWatch_1 −0.5812; I2C-UHU-Altair_1 −0.9958;
I2C-UHU-Altair_2 −1.1838; CLTL_2 −1.4243; CLTL_3 −1.5325; UMUTeam_1 −1.5624; CLTL_1 −1.6077;
UMUTeam_2 −1.8869; NaturalThinker_1 −2.0376; majority-class baseline −2.0711; surrey-mm-group_1
−2.9992; minority-class baseline −3.3135.)</p>
        </sec>
      </sec>
      <sec id="sec-6-6">
        <title>6.7. Task 3.1: Sexism Identification in Videos</title>
        <sec id="sec-6-6-1">
          <title>6.7.1. Soft Evaluation</title>
          <p>Table 18 presents the results for classifying videos as sexist or not sexist. The Soft–Soft evaluation of
Task 3.1 attracted 34 participating systems. The normalized ICM-Soft values, which reflect alignment
with the probabilistic distribution of annotator labels, ranged from 0.1481 to 0.5590. The average
normalized score was 0.3584, with a standard deviation of 0.174, indicating considerable variance in
system quality. A total of 25 systems outperformed the strongest baseline
(EXIST2025-test_majority-class). The difference between the best and worst among the top five teams was approximately 18.2%,
reflecting a modest but meaningful spread. Interestingly, most high-scoring systems came from teams
with distinct modeling pipelines, suggesting diverse yet effective approaches to handling annotator
disagreement in the multimodal context of video classification.</p>
        </sec>
        <sec id="sec-6-6-2">
          <title>6.7.2. Hard Evaluation</title>
          <p>Finally, Table 19 presents the results for classifying videos on sexism identification in a hard-hard
context. For this task, 41 systems submitted valid runs. Normalized ICM-Hard scores spanned from
0.1954 to 0.6001, with a mean of 0.4913 and a standard deviation of 0.1033. Nearly all participants (39
out of 41) exceeded the majority-class baseline (EXIST2025-test_majority-class), showing strong global
performance. The top five teams, as can be observed from Table 19, were closely matched, with only a
4.0% difference between the best and lowest performer among the top five.</p>
        </sec>
      </sec>
      <sec id="sec-6-7">
        <title>6.8. Task 3.2: Source Intention in Videos</title>
        <sec id="sec-6-7-1">
          <title>6.8.1. Soft Evaluation</title>
          <p>Table 20 presents the results for the classification of videos according to the intention of the author,
with the outputs provided as the probabilities of the different classes. In this task, the 29 participating
systems showed normalized ICM-Soft scores that ranged from 0.0000 to 0.3728, with a mean of 0.252
and a standard deviation of 0.084. A total of 26 systems surpassed the strongest baseline
(EXIST2025-test_majority-class), indicating a generally competitive field. The difference between the best and the
fifth ranked systems from distinct teams was modest, at 12.0%, revealing a cluster of high-performing
submissions.</p>
        </sec>
        <sec id="sec-6-7-2">
          <title>6.8.2. Hard Evaluation</title>
          <p>Table 21 presents the results for the hard-hard evaluation of Task 3.2. The normalized ICM-Hard scores
for the 36 systems submitted ranged from 0.0000 to 0.5018, with a mean of 0.375 and a standard deviation
of 0.116. Most systems (33 out of 36) outperformed the majority-class baseline. The best systems from
five different teams showed a relative difference between the highest and lowest normalized scores of
only 4.3%, reflecting a tight performance range. Interestingly, while the average performance remains
moderate, the consistency among top runs suggests that author intent in video—despite its multimodal
complexity—can be reliably modeled when annotations are aggregated, albeit with room for improving
discriminatory power across subtle categories.</p>
        </sec>
      </sec>
      <sec id="sec-6-8">
        <title>6.9. Task 3.3: Sexism Categorization in Videos</title>
        <sec id="sec-6-8-1">
          <title>6.9.1. Soft Evaluation</title>
          <p>Table 22 presents the results for classifying videos based on the aspects of women being attacked,
with outputs provided as class probabilities. A total of 34 participant systems were submitted for
this task. The normalized ICM-Soft scores ranged from 0.0000 to 0.1593, with a mean of 0.051 and
standard deviation of 0.052. The majority baseline achieved a normalized ICM score of 0.0931, and
was outperformed by 4 systems, while the minority baseline was not surpassed by any system. The
top 5 systems from different teams achieved normalized ICM-Soft scores between 0.1593 and 0.0931.
The relative difference between the best and the fifth-ranked system within this top group was 41.6%.
Despite the low overall values, a meaningful gap between systems can be observed, which underlines
the difficulty of probabilistic categorization in multi-class scenarios over multimodal video content.</p>
        </sec>
        <sec id="sec-6-8-2">
          <title>6.9.2. Hard Evaluation</title>
<p>Finally, Table 23 presents the results for classifying videos based on the aspects of women being
attacked, with outputs provided as a single class prediction. This task attracted 41 participant systems.
Normalized ICM-Hard scores spanned from 0.0000 to 0.3765, with a mean of 0.243 and a standard deviation
of 0.116. A total of 30 systems outperformed the majority baseline, while 13 did better than the minority
baseline. The top 5 systems from distinct teams achieved normalized ICM-Hard scores ranging from
0.3765 to 0.3585, showing a very tight performance band with only a 4.78% relative difference between
the highest and the lowest scoring among them.</p>
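          <p>The relative differences reported throughout this section can be reproduced with a simple
helper function (an illustrative sketch; the organizers' exact rounding convention is an assumption):</p>

```python
def relative_difference(best: float, other: float) -> float:
    """Relative difference between the best score and another score,
    expressed as a fraction of the best score."""
    return (best - other) / best

# Top-5 band reported for the hard evaluation of Task 3.3:
print(f"{relative_difference(0.3765, 0.3585):.2%}")  # -> 4.78%
```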
        </sec>
      </sec>
      <sec id="sec-6-9">
        <title>6.10. Cross-task Performance Analysis</title>
<p>Figure 4 shows the results of Cross Entropy (horizontal axes) and normalized ICM-Soft (vertical axes).
All the plots include the gold standard with the maximum score. The first row (Tasks 1.1, 2.1, and 3.1)
corresponds to the sexism detection tasks, i.e., binary single-label classification on texts, images, and
videos, respectively. The baseline approaches, consisting of labeling everything as the majority class or
as the minority class, are marked in blue and red, respectively.</p>
        <p>In terms of both Cross Entropy and ICM-Soft, the results of these two baselines fall below those of
the other participant runs, indicating that the proposed systems contribute some informative value.</p>
        <p>Only in the case of the video task (Task 3.1) are there some runs that fall below the baseline in terms of
ICM. This may be due to the fact that ICM penalizes false information based on class frequency.</p>
<p>Another observation is that, while high ICM values imply good Cross Entropy values, the reverse is
not true: several runs achieve good performance (low scores) according to Cross Entropy but obtain
low ICM scores. Although clusters of outputs with similarly high ICM scores lie far from the baseline
on the horizontal (Cross Entropy) axis, all the graphs show, at comparably good Cross Entropy values,
ICM scores spanning from the maximum down to the baseline. This may be due, among other factors,
to the fact that ICM considers not only the similarity of the assigned values for each class, but also the
distribution of classes throughout the corpus. In any case, in terms of ICM, there remains a significant
gap between the best-performing systems and the perfect solution. The gap is notably larger for the
image and video tasks (Tasks 2.1 and 3.1).</p>
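        <p>For the probabilistic setting, Cross Entropy can be sketched as follows (a minimal illustration,
not the official evaluation code; the six-annotator vote split is a hypothetical example). Under soft
evaluation, matching the annotator distribution scores better than an overconfident prediction of the
majority class:</p>

```python
import math

def cross_entropy(gold: list[float], pred: list[float], eps: float = 1e-12) -> float:
    """Cross entropy between a gold soft label and a system's predicted
    class probabilities (lower is better)."""
    return -sum(g * math.log(max(p, eps)) for g, p in zip(gold, pred))

# Gold soft label for an item annotated by six people: 4 x YES, 2 x NO.
gold = [4 / 6, 2 / 6]
calibrated = [4 / 6, 2 / 6]    # matches the annotator distribution
overconfident = [0.9, 0.1]     # right majority class, poor calibration
print(cross_entropy(gold, calibrated) < cross_entropy(gold, overconfident))  # True
```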
        <p>The second row corresponds to intention detection tasks. These are hierarchical classification tasks
with an initial YES/NO decision and two or three sub-classes for the YES category. In this case, there is
also an accumulation of runs with high performance in Cross Entropy but low ICM, suggesting that the
second metric captures additional aspects. Most runs outperform the baselines, but the gap between the
best run and the perfect output in terms of ICM is larger than in sexism detection, indicating a higher
complexity of the task.</p>
        <p>Finally, the third row corresponds to hierarchical multi-label classification tasks involving multiple
categories of sexism. In this case, since the tasks are multi-label, the Cross Entropy metric is not
applicable. The plots show system rankings ordered from lowest to highest ICM. An interesting finding
is that, in this case, many of the runs—including the minority-class baseline—do not surpass the zero
threshold in normalized ICM. This suggests that some outputs, in terms of information content, do not
outperform the empty output. In other words, the amount of noisy information exceeds the amount of
useful information. As the number of categories increases and the task requires capturing annotation
ambiguity (multi-label classification), the gap between the best run and the perfect output increases
significantly compared to the previous tasks.</p>
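        <p>The notion of information content underlying ICM can be illustrated with a one-line sketch
(this shows the class-frequency intuition only, not the organizers' full ICM implementation; the
frequencies are hypothetical):</p>

```python
import math

def information_content(p: float) -> float:
    """Information content, in bits, of a class with corpus frequency p."""
    return -math.log2(p)

# A rare class carries more information (and a wrong rare-class
# prediction adds more noise) than a frequent one:
print(information_content(0.05) > information_content(0.7))  # True
```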
        <p>On the other hand, Figure 5 displays evaluation results for the hard evaluation versions, in which
the assignment of items to classes depends on whether different thresholds of annotator agreement
are met. The plot shows F1 scores for the positive class in the first row (sexism identification), and
the average F1 score across all classes for the remaining tasks. The vertical axes show the results for
ICM-Hard.</p>
        <p>In general, a strong correlation between both metrics can be observed above a certain score threshold.
This is because both F1 and ICM take class specificity or frequency within the corpus into account.</p>
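        <p>The F1 variants plotted in Figure 5 can be computed with a minimal, self-contained sketch
(the gold and predicted labels below are hypothetical):</p>

```python
def f1_per_class(gold: list[str], pred: list[str], cls: str) -> float:
    """F1 score of a single class from hard (single-label) predictions."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Average F1 across all classes observed in the gold standard."""
    classes = sorted(set(gold))
    return sum(f1_per_class(gold, pred, c) for c in classes) / len(classes)

gold = ["YES", "YES", "NO", "NO", "YES", "NO"]
pred = ["YES", "NO", "NO", "NO", "YES", "YES"]
print(f1_per_class(gold, pred, "YES"))  # F1 of the positive class
print(macro_f1(gold, pred))             # average F1 over both classes
```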
        <p>Again, most runs outperform the baselines. Moreover, by observing the gap between the best run
and the ideal output, we can see that task difficulty increases as we move to setups with more classes,
multi-labeling, or hierarchical structures (rows). An increase in task difficulty is also observed as we
move from text-based tasks (first column) to images (second column) and videos (third column).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <p>The following discussion analyzes system performance across the full range of tasks proposed in EXIST
2025, which include the detection, intent classification, and fine-grained categorization of sexist content.
For the first time in the series, these tasks have been applied not only to textual data (tweets), but
also to memes and short-form videos (TikToks), enabling a broad multimodal evaluation. The section
is structured into three parts, each focusing on one of the core challenges: sexism detection, source
intention, and categorization, allowing us to examine how the nature of the input content (text, image,
or video) affects model effectiveness.</p>
      <sec id="sec-7-1">
        <title>7.1. System Performance Across Text, Memes, and Video in Sexism Detection</title>
<p>As can be observed in Table 24, which summarizes the best results for Subtasks 1.1, 2.1 and 3.1
(sexism detection in tweets, memes and TikTok videos, respectively), the tweets (text) dataset yielded
the highest detection performance, while memes and especially videos proved more challenging. In
the Soft-Soft evaluation (probabilistic outputs), the top system on tweets achieved an ICM-Soft Norm of
∼ 0.67, notably higher than the top systems on memes (0.511) and videos (0.559), as shown in Table 24.
In the Hard-Hard evaluation (binary outputs), tweet data again saw the best results with the top F1
(positive class) ∼ 0.817 and a normalized ICM-Hard of ∼ 0.84. Memes were intermediate (top F1 ∼ 0.781,
Norm ∼ 0.688), and videos the lowest (top F1 ∼ 0.694, Norm ∼ 0.600). These gaps suggest that the data
source significantly influences system performance. Models detect sexism in raw text more effectively
than in images or videos, likely due to the noise and information loss introduced when dealing with
multimedia content.</p>
<p>Even state-of-the-art multimodal systems face difficulties with blurry or stylized text and background
clutter in memes, which can explain the reduced accuracy on the meme and video datasets. The lower
results on Subtask 3.1 (videos) align with the expectation that multimodal sexism detection is a novel and
challenging problem, less studied than text-based sexism and complicated by needing to interpret visual
or audio context. Overall, tweet-based models outperformed those on OCR-derived text, underlining
how a clean text signal (tweets) is easier for current NLP systems to handle compared to extracted text
from images or videos.</p>
        <p>The lower performance observed in memes and videos is not solely attributable to the multimodal
nature of these formats. Beyond the technical challenges of processing visual and audio data, these
media often rely on implicit cultural references, sarcasm, irony, and contextual humor that are difficult
to interpret automatically. Memes, in particular, tend to condense layered meanings into very short texts
superimposed on images, often requiring familiarity with platform-specific discourse, internet slang, or
ongoing social debates. Similarly, TikTok videos frequently reference adolescent trends, in-group codes,
and popular audio tracks, which may be opaque to both annotators and systems unless they share that
sociocultural context. These aspects introduce a level of pragmatic and cultural ambiguity that goes
beyond the limitations of current vision or language models, and point to the need for systems that can
integrate both multimodal understanding and world knowledge to interpret such content effectively.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. System Performance Across Text, Memes, and Video in Sexism Source Intention</title>
        <p>This task required systems to predict the intention behind online sexist content, with a hierarchical
multiclass setup. The classification pipeline first determines whether the content is sexist, and then
predicts the fine-grained intention: DIRECT, REPORTED (tweets only), or JUDGEMENTAL. Table 25
presents the top systems and their evaluation metrics for each modality and context.</p>
        <p>As observed in Table 25, tweet-based systems once again outperform meme and video systems,
especially in the Soft-Soft (probabilistic) evaluation. However, absolute values of all metrics are lower
than in binary sexism detection, reflecting the increased difficulty of intention identification, particularly
in noisy or OCR-extracted content. Notably, the performance gap between modalities is less pronounced
in Macro F1 than in ICM-Soft, suggesting that top systems are better at predicting the main class, but
struggle with fine calibration to the true distribution of annotator votes.</p>
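        <p>The true distribution of annotator votes against which soft outputs are scored can be obtained
by simple vote normalization (a sketch; the class names and the vote pattern are illustrative):</p>

```python
from collections import Counter

def votes_to_soft_label(votes: list[str], classes: list[str]) -> dict[str, float]:
    """Convert raw annotator votes into a soft label: each class's
    probability is its share of the votes."""
    counts = Counter(votes)
    return {c: counts.get(c, 0) / len(votes) for c in classes}

classes = ["DIRECT", "JUDGEMENTAL", "NO"]
votes = ["DIRECT", "DIRECT", "JUDGEMENTAL", "DIRECT", "NO", "DIRECT"]
print(votes_to_soft_label(votes, classes))  # DIRECT gets 4/6 of the mass
```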
        <p>The gap between tweet, meme, and video results is partially explained by the challenges posed by
multimodal and OCR-derived content, as in Tasks 1.1, 2.1 and 3.1. Additionally, the removal of the
REPORTED class from memes and videos (a design choice based on data inspection) means that systems
face a simpler but less nuanced label space in those domains. This may contribute to the relatively high
Macro F1 in memes and videos, as models need only differentiate between fewer classes.</p>
        <p>Moreover, the higher prevalence of the DIRECT class in memes aligns with the nature of meme
content, which often features explicit or humorous sexist material. Systems tuned to this distribution
may perform well in memes but generalize poorly to tweets, where REPORTED and JUDGEMENTAL
are more common and context-dependent.</p>
      </sec>
      <sec id="sec-7-3">
        <title>7.3. System Performance Across Text, Memes, and Video in Sexism Categorization</title>
        <p>Tasks 1.3, 2.3 and 3.3 addressed the multilabel, multiclass, and hierarchical classification of online sexist
content, where systems must not only detect sexist content, but also assign one or more fine-grained
categories indicating the facet of womanhood under attack. The categories include IDEOLOGICAL
AND INEQUALITY, STEREOTYPING AND DOMINANCE, OBJECTIFICATION, SEXUAL VIOLENCE, and
MISOGYNY AND NON-SEXUAL VIOLENCE.</p>
        <p>Table 26 presents the top system performances for each data source and context. The overall pattern
mirrors previous tasks: tweet-based systems consistently outperform those on memes and videos,
especially in the probabilistic (Soft-Soft) context. However, absolute metrics are lower than for binary
or intention-based sexism detection, reflecting the increased complexity of the multilabel, hierarchical
setup and the annotation ambiguity intrinsic to these subtle categories.</p>
        <p>In all modalities, ICM-Soft Norm scores are considerably lower than in previous tasks, indicating that
systems struggle to accurately capture the distribution of annotator opinions and to model multilabel
uncertainty. Notably, even the best systems on tweets barely exceed 0.41 in ICM-Soft Norm, with further
drops for memes and videos.</p>
      </sec>
      <sec id="sec-7-4">
        <title>7.4. Performance Trends on Tweet-based Tasks (2023–2025)</title>
        <p>To better understand the progress in sexism detection over time, we compared the best-performing
systems across the three tweet-based tasks (Tasks 1.1, 1.2, and 1.3) in the last three editions of EXIST.
The results, shown in Table 27, include both ICM-Soft scores and their normalized counterparts (when
available).</p>
          <p>Table 27. Best ICM-Soft scores per edition (normalized scores in parentheses, when available):
Task 1.1: 2023 = 0.90; 2024 = 1.09 (0.68); 2025 = 1.06 (0.67).
Task 1.2: 2023 = -1.34; 2024 = -0.25 (0.48); 2025 = -0.43 (0.46).
Task 1.3: 2023 = -2.32; 2024 = -1.18 (0.44); 2025 = -1.10 (0.44).</p>
        <p>The data suggests a clear performance improvement from 2023 to 2024, likely reflecting the broader
adoption of large language models and increasingly refined prompt engineering and fine-tuning
strategies. This gain is particularly visible in the source intention and category classification tasks (1.2 and
1.3), which traditionally require more nuanced modeling.</p>
        <p>Interestingly, 2025 shows no clear progress over 2024, despite a significant increase in the number of
participants and submitted runs. In fact, the best normalized scores for Tasks 1.2 and 1.3 in 2025 are
slightly lower than the previous year. This raises the important question: are we reaching a performance
ceiling on these tasks when using the same dataset? One possible explanation is saturation — as systems
converge toward similar architectures and training data, gains become increasingly marginal. Moreover,
when using the same test data over multiple editions, top systems may begin to approach the upper
bounds of what can be achieved without new annotation rounds or more diverse evaluation settings.</p>
        <p>These findings highlight the importance of refreshing datasets, increasing task complexity, or shifting
focus to novel and underexplored modalities to maintain scientific progress and distinguish truly
innovative approaches.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusions</title>
      <p>The objective of the EXIST challenge is to foster research on the automatic detection and modeling
of sexism in online environments, with a particular emphasis on social networks. The 2025 edition of
the lab, organized as part of CLEF, attracted 114 participant teams and received a total of 873 system
runs. Participants explored a wide range of approaches, including vision transformer models, data
augmentation via automatic translation and duplication, the use of data from previous EXIST editions,
multilingual and Twitter-specific language models, as well as transfer learning from related domains
such as hate speech, toxicity, and sentiment analysis.</p>
      <p>The tasks in EXIST 2025 addressed the problem of sexism detection and classification across three types
of content—text (tweets), images (memes), and video (TikToks)—demonstrating the comprehensive and
multimodal scope of the challenge. This multimodal design reflects the complexity of real-world social
media platforms, where sexist messages may be conveyed through language, visuals, or a combination
of both.</p>
      <p>While many participating systems followed the conventional strategy of producing hard-label outputs,
a substantial number took advantage of the multi-annotator nature of the dataset to submit soft-label
predictions. This shift indicates a growing interest within the research community in building models
that can handle subjectivity, disagreement, and nuanced interpretations of harmful content.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work has been financed by the European Union (NextGenerationEU funds) through the “Plan de
Recuperación, Transformación y Resiliencia”, by the Ministry of Digital Transformation and by the
UNED University. However, the points of view and opinions expressed in this document are solely those
of the author(s) and do not necessarily reflect those of the European Union or European Commission.
Neither the European Union nor the European Commission can be considered responsible for them.
It has also been financed by the Spanish Ministry of Science and Innovation (project FairTransNLP
(PID2021-124361OB-C31 and PID2021-124361OB-C32)) funded by MCIN/AEI/10.13039/501100011033
and by ERDF, EU A way of making Europe, and by the Australian Research Council (DE200100064 and
CE200100005).</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to check grammar and
spelling.</p>
    </sec>
  </body>
  <back>
  </back>
</article>