<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Overview of EXIST 2025: Learning with Disagreement for Sexism Identification and Characterization in Tweets, Memes, and TikTok Videos (Extended Overview)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Plaza</string-name>
          <email>lplaza@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jorge Carrillo-de-Albornoz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iván Arcos</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damiano Spina</string-name>
          <email>damiano.spina@rmit.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrique Amigó</string-name>
          <email>enrique@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julio Gonzalo</string-name>
          <email>julio@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roser Morante</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>RMIT University</institution>
          ,
          <addr-line>3000 Melbourne</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Nacional de Educación a Distancia (UNED)</institution>
          ,
          <addr-line>28040 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Politécnica de Valencia (UPV)</institution>
          ,
          <addr-line>46022 Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI)</institution>
          ,
          <addr-line>46022 Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>338</fpage>
      <lpage>347</lpage>
      <abstract>
        <p>This paper presents the EXIST 2025 Lab on sexism detection and categorization in social media, which took place at the CLEF 2025 conference and marks the fifth edition of the EXIST Shared Task. Building on the success of previous editions, EXIST 2025 addresses the growing concern over the spread of offensive and discriminatory content targeting women across online platforms, which significantly impacts women's well-being and freedom of expression. The lab comprises nine tasks in two languages (English and Spanish), organized around three core objectives: sexism identification, source intention detection, and sexism categorization. These tasks are applied across three media types (text: tweets; image: memes; video: TikToks), offering a multimodal perspective that allows for a deeper understanding of how sexism manifests across different formats and user interactions. As in previous editions, EXIST 2025 adopts the “Learning With Disagreement” paradigm, using annotations from multiple annotators that reflect diverse and at times conflicting viewpoints. This overview describes the task design, datasets, evaluation methodology, participating systems, and results of EXIST 2025, which has surpassed participation expectations with 244 registered teams from 38 countries, 114 teams from 23 countries submitting runs, a total of 873 runs processed, and 33 working notes published. Warning: Some of the examples included in this paper may contain offensive language and explicit descriptions of sexist behavior, which may be disturbing to the reader.</p>
      </abstract>
      <kwd-group>
        <kwd>sexism identification</kwd>
        <kwd>sexism categorization</kwd>
        <kwd>learning with disagreement</kwd>
        <kwd>tweets</kwd>
        <kwd>memes</kwd>
        <kwd>TikTok videos</kwd>
        <kwd>human-centric AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Sexism refers to prejudice or discrimination based on a person’s sex or gender, often manifesting in
the belief that one gender is superior to another. It can take many forms, from overt aggression and
harassment to subtler behaviors and norms that reinforce inequality. While sexism affects individuals
of all genders, it disproportionately impacts women, particularly in digital spaces.</p>
      <p>In recent years, online platforms like Twitter and TikTok have become breeding grounds for the
proliferation of sexist discourse. On Twitter, sexism often manifests through harassment, trolling, and
misogynistic hashtags that normalize discriminatory narratives [1, 2]. TikTok, by contrast, poses unique
challenges due to its algorithm-driven content promotion and its popularity among younger audiences.
Its recommendation system can generate filter bubbles that reinforce sexist ideologies [3], while visual
trends and content moderation disparities contribute to the hypersexualization and objectification of
women [4, 5]. These dynamics not only perpetuate traditional gender stereotypes but can also shape
the perceptions and behaviors of young users.</p>
      <p>To tackle these challenges, the sEXism Identification in Social neTworks (EXIST) campaign was
launched in 2021. EXIST is a series of shared tasks and scientific events aimed at identifying, analyzing,
and mitigating sexist content on social networks. The first two editions were hosted under the IberLEF
forum [6, 7], and focused on textual data. In 2023, EXIST became a CLEF Lab [8], introducing a third
task centered on detecting the communicative intention behind sexist messages and adopting for the
first time the Learning with Disagreement (LeWiDi) paradigm [9]. This paradigm acknowledges that
disagreements among annotators are not noise, but valuable signals that reflect the subjectivity inherent
to tasks like sexism detection. The fourth edition of EXIST (2024) expanded the challenge to multimodal
data by introducing tasks involving memes. Memes, while often humorous, are increasingly used to
spread prejudices under the guise of irony [10, 11, 12, 13]. Their blend of text and image makes them
particularly insidious vectors for normalizing sexist stereotypes, especially when humor is used to
reduce the perceived harm [14, 15].</p>
      <p>EXIST 2025 marks the fifth edition of the challenge and represents its most ambitious iteration yet.
Held again as a CLEF Lab,1 it comprises nine tasks in total—covering three core objectives (sexism
identification, source intention detection, and sexism categorization) across three modalities: tweets
(text), memes (image), and TikToks (video). This multimodal and bilingual (English and Spanish)
design aims to capture the varied ways in which sexism is expressed and interpreted online, enabling
researchers to develop AI models that are sensitive to both linguistic and visual cues, as well as the
platform-specific dynamics that influence sexist content dissemination.</p>
      <p>Throughout its four previous editions, more than 100 teams from universities and companies around
the world have participated in EXIST, developing and testing state-of-the-art models to address this
pressing social issue. The 2025 edition continues to foster international participation, with 244 registered
teams from 38 countries. Of these, 114 teams from 23 countries submitted valid runs, resulting in a total
of 873 system submissions.</p>
      <p>In the following sections, we present a detailed overview of the tasks, datasets, annotation process,
evaluation methodology, and system results for EXIST 2025.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Tasks</title>
      <p>The 2025 edition of EXIST features nine tasks, which are described below. The languages addressed are
English and Spanish and the datasets are collections of tweets, memes and TikTok videos (see Section
3). For the tasks on TikTok videos, all the partitions of the dataset are new, whereas for the tasks on
tweets and memes we employ the EXIST 2023 and 2024 datasets, respectively.</p>
      <sec id="sec-2-1">
        <title>2.1. Task 1.1: Sexism Identification in Tweets</title>
        <p>This is a binary classification task where systems must decide whether or not a given tweet expresses sexist ideas, either because it is sexist itself, it describes a sexist situation, or it criticizes a sexist behavior. The following examples from the dataset show a sexist and a not sexist message, respectively.</p>
        <p>(<xref ref-type="bibr" rid="ref1">1</xref>) Sexist. It’s less of #adaywithoutwomen and more of a day without feminists, which, to be quite honest, sounds lovely.</p>
        <p>(<xref ref-type="bibr" rid="ref2">2</xref>) Not sexist. Just saw a woman wearing a mask outside spank her very tightly leashed dog and I gotta say I love learning absolutely everything about a stranger in a single instant.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task 1.2: Source Intention in Tweets</title>
        <p>This task aims to categorize the message according to the intention of the author. We propose the following ternary classification of tweets:
• Direct sexist message. The intention is to write a message that is sexist by itself or incites sexism, as in:
(<xref ref-type="bibr" rid="ref3">3</xref>) A woman needs love, to fill the fridge, if a man can give this to her in return for her services, I don’t see what else she needs.
• Reported sexist message. The intention is to report and share a sexist situation suffered by a woman or women in first or third person, as in:
(<xref ref-type="bibr" rid="ref4">4</xref>) I doze in the subway, I open my eyes feeling something weird: the hand of the man sat next to me on my leg #SquealOnYourPig.
• Judgemental message. The intention is to condemn sexist situations or behaviours, as in:
(<xref ref-type="bibr" rid="ref5">5</xref>) As usual, the woman was the one quitting her job for the family’s welfare...</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Task 1.3: Sexism Categorization in Tweets</title>
        <p>Many facets of a woman’s life may be the focus of sexist attitudes, including domestic role, career opportunities, and sexual image, to name a few. Accordingly, each sexist tweet must be assigned one or more of the following categories:
• Ideological and inequality. It includes messages that discredit the feminist movement. It also includes messages that reject inequality between men and women, or present men as victims of gender-based oppression.
(<xref ref-type="bibr" rid="ref6">6</xref>) #Feminism is a war on men, but it’s also a war on women. It’s a war on female nature, a war on femininity.
• Stereotyping and dominance. It includes messages that suggest women are more suitable or inappropriate for certain tasks, and somehow inferior to men.
(<xref ref-type="bibr" rid="ref7">7</xref>) Most women no longer have the desire or the knowledge to develop a high quality character, even if they wanted to.
• Objectification. It includes messages where women are presented as objects apart from their dignity and personal aspects. We also include messages that assume or describe certain physical qualities that women must have in order to fulfill traditional gender roles.
(<xref ref-type="bibr" rid="ref8">8</xref>) No offense but I’ve never seen an attractive african american hooker. Not a single one.
• Sexual violence. It includes messages where sexual suggestions, requests or harassment of a sexual nature (rape or sexual assault) are made.
(<xref ref-type="bibr" rid="ref9">9</xref>) I wanna touch your tits... you can’t imagine what I can do on your body.
• Misogyny and non-sexual violence. It includes expressions of hatred and violence towards women.
(10) Domestic abuse is never okay... Unless your wife is a bitch.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Task 2.1: Sexism Identification in Memes</title>
        <p>As in Task 1.1, this is a binary classification task consisting of deciding whether or not a meme
is sexist, as in Figure 1.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Task 2.2: Source Intention in Memes</title>
        <p>As in Task 1.2, this task aims to categorize the meme according to the intention of the author.
However, in this task systems should only classify memes in two classes: direct or judgemental,
as shown in Figure 2.</p>
        <p>[Figure 1: (a) sexist meme; (b) non sexist meme. Figure 3: (a) ideological &amp; inequality; (b) objectification; (c) stereotyping &amp; dominance; (d) sexual violence; (e) misogyny &amp; non-sexual violence.]</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Task 2.3: Sexism Categorization in Memes</title>
        <p>This task aims to classify sexist memes according to the categorization provided for Task 1.3.
Figure 3 shows one meme of each sexist category.</p>
      </sec>
      <sec id="sec-2-7">
        <title>2.7. Task 3.1: Sexism Identification in TikToks</title>
        <p>As in Tasks 1.1 and 2.1, systems must determine whether short videos shared on TikTok are
sexist.</p>
      </sec>
      <sec id="sec-2-8">
        <title>2.8. Task 3.2: Source Intention in TikToks</title>
        <p>As in Tasks 1.2 and 2.2, this task aims to categorize TikTok short videos according to the intention
of the author, as direct or judgemental.</p>
      </sec>
      <sec id="sec-2-9">
        <title>2.9. Task 3.3: Sexism Categorization in TikToks</title>
        <p>As in Tasks 1.3 and 2.3, this task aims to categorize short videos according to the sexism categories
provided for Task 1.3.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The EXIST 2025 dataset comprises three types of data: the tweets from the EXIST 2023 dataset, the
memes from the EXIST 2024 dataset and a new dataset of TikTok videos. Plaza et al. [8] and [16] provide
a detailed description of the tweets and memes datasets, respectively. Here we provide a summarized
description of the three datasets.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Sampling</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. EXIST 2023 Tweets Dataset</title>
          <p>We first collected different popular expressions and terms, both in English and Spanish, commonly
used to underestimate the role of women in our society. These expressions were later used as seeds to
retrieve Twitter data. To mitigate seed bias, we also gathered other common hashtags and expressions
less frequently used in sexist contexts, to ensure a balanced distribution between sexist and not sexist
expressions. This first set of seeds contains more than 400 expressions.</p>
          <p>The set of seeds was then used to extract tweets in English and Spanish (more than 8,000,000 tweets
were downloaded). The crawling was performed from September 1, 2021 to September 30, 2022.
100 tweets were downloaded for each seed per day (retweets and promotional tweets were excluded).
To ensure an appropriate balance between seeds, we removed those with fewer than 60 tweets.
The final set contains 183 seeds for Spanish and 163 seeds for English.</p>
          <p>To mitigate terminology and temporal bias, the final sets of tweets were selected as follows:
for each seed, approximately 20 tweets were randomly selected within the period from September 1, 2021
to February 28, 2022 for the training set, taking into account a representative temporal distribution among
tweets of the same seed. Similarly, 3 tweets per seed were selected for the development set within the
period from May 1 to May 31, 2022, and 6 tweets per seed were selected for the test set within the period
from August 1 to September 30, 2022. Only one tweet per author was included in the final
selection to avoid author bias. Finally, tweets containing fewer than 5 words were removed. As a result,
we have more than 3,200 tweets per language for the training set, around 500 per language for the
development set, and nearly 1,000 tweets per language for the test set.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. EXIST 2024 Memes Dataset</title>
          <p>We first curated a lexicon of terms and expressions leading to sexist memes. The set of seeds encompasses
diverse topics and contains 250 terms, with 112 in English and 138 in Spanish. The terms were used as
search queries on Google Images to obtain the top 100 images. Rigorous manual cleaning procedures
were applied, defining memes and ensuring the removal of noise such as textless images, text-only
images, ads, and duplicates. The final set consists of more than 3,000 memes per language.</p>
          <p>Since the proportion of memes per term was heterogeneous, we discarded the most unbalanced
seeds and made sure that all seeds have at least five memes. To avoid introducing selection bias, we
randomly selected memes, ensuring the appropriate distribution per seed. As a result, we have 2,000
memes per language for the training set and 500 memes per language for the test set.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. TikTok Dataset</title>
          <p>The data was collected with Apify’s TikTok Hashtag Scraper tool,2 using a previously curated list
of 185 Spanish hashtags and 61 English hashtags associated with potentially sexist content. More than
3,500 videos in English and Spanish were downloaded from different TikTok accounts. Rigorous manual
cleaning procedures were applied, ensuring the removal of noise such as ads and duplicates.</p>
          <p>The collected TikTok videos were divided into training and test sets following a chronological
and author-based partitioning strategy. This approach ensured temporal coherence while preventing
data leakage. To achieve this, authors present in the training set were excluded from the test set,
preventing the model from learning author-specific patterns and enhancing its generalization capabilities.
Additionally, each hashtag (seed) was required to contribute a minimum number of videos, ensuring a
more uniform distribution across the dataset. The final selection of videos was conducted randomly but
maintained a temporal distribution to ensure diversity and avoid overrepresentation of any specific
time period.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Datasets Size</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. EXIST 2023 Tweets Dataset</title>
          <p>The dataset consists of three partitions per language. The distribution of tweets per partition and
language is shown in Table 1.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. EXIST 2024 Memes Dataset</title>
          <p>The memes dataset is provided in two partitions per language, training and test. The distribution per
partition and language is shown in Table 2.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. TikTok Dataset</title>
          <p>The TikTok dataset consists of three partitions per language. The distribution of videos per partition is
shown in Table 3.
2https://apify.com/clockworks/tiktok-hashtag-scraper</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Labeling with Disagreements</title>
        <p>The LeWiDi paradigm was adopted to label the TikTok videos, in the same way it was adopted to
label the tweets and memes datasets for EXIST 2023 and 2024, respectively. Differently from previous
EXIST editions, the annotation was performed by trained annotators instead of crowd workers. The
annotation was conducted using Servipoli’s service,3 with eight students organized in pairs consisting
of one male and one female student, in order to avoid biases. Each pair was tasked with annotating
1,000 TikTok videos.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Methodology and Metrics</title>
      <p>As in EXIST 2023 and 2024, we have carried out a soft evaluation and a hard evaluation. The soft
evaluation relates to the LeWiDi paradigm and is intended to measure the ability of the model to capture
disagreements, by considering the probability distribution of labels in the output as a soft label and
comparing it with the probability distribution of the annotations. The hard evaluation is the standard
paradigm and assumes that a single label is provided by the systems for every instance in the dataset.</p>
      <p>From the point of view of evaluation metrics, the tasks can be described as follows:
• Sexism identification (Tasks 1.1, 2.1, and 3.1): binary classification, monolabel.
• Source intention (Tasks 1.2, 2.2, and 3.2): multiclass hierarchical classification, monolabel. The hierarchy
of classes has a first level with two categories, sexist/not sexist, and a second level for the sexist
category with mutually-exclusive subcategories (direct/reported/judgemental for tweets; direct/judgemental
for memes and TikToks). A suitable evaluation metric must reflect the fact that a confusion between
not sexist and a sexist category is more severe than a confusion between two sexist subcategories.
• Sexism categorization (Tasks 1.3, 2.3, and 3.3): multiclass hierarchical classification, multilabel. Again
the first level is a binary distinction between sexist/not sexist, and there is a second level for
the sexist category that includes five subcategories: ideological and inequality, stereotyping and
dominance, objectification, sexual violence, and misogyny and non-sexual violence. These classes
are not mutually exclusive: a tweet may belong to several subcategories at the same time.</p>
      <p>The LeWiDi paradigm can be considered on both sides of the evaluation process:
• The ground truth. In a hard evaluation setting, the variability in the human annotations is
reduced by selecting one and only one gold category per instance, the hard label. In a soft
evaluation setting, the gold standard label for one instance is the set of all the human annotations
existing for that instance. Therefore, the evaluation metric incorporates the proportion of human
annotators that have selected each category (soft labels). Note that in the identification and source
intention tasks, which are monolabel problems, the probabilities of the classes must sum to one. But
in the categorization tasks, which are multilabel, each annotator may select more than one category
for a single instance; therefore, the sum of the class probabilities may be larger than one.
• The system output. In a hard, traditional setting, the system predicts one or more categories
for each instance. In a soft setting, the system predicts a probability for each category, for each
instance. The evaluation score is maximized when the predicted probabilities match the actual
probabilities in a soft ground truth.</p>
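      <p>For concreteness, the derivation of soft labels from raw annotations can be sketched as follows (a minimal sketch: the function, label names, and annotation counts are illustrative and not part of the official evaluation code):</p>

```python
from collections import Counter

def soft_label(annotations, classes):
    """Soft label: the probability of each class is the fraction of
    annotators who selected it. For multilabel tasks each annotator
    contributes a label *set*, so probabilities may sum above 1."""
    n = len(annotations)
    counts = Counter(label for labels in annotations for label in labels)
    return {c: counts[c] / n for c in classes}

# Monolabel example (sexism identification), six hypothetical annotators:
mono = [["sexist"]] * 4 + [["not sexist"]] * 2
print(soft_label(mono, ["sexist", "not sexist"]))  # sexist: 4/6, not sexist: 2/6

# Multilabel example (sexism categorization), two hypothetical annotators:
multi = [["objectification", "sexual violence"], ["objectification"]]
print(soft_label(multi, ["objectification", "sexual violence"]))
# the probabilities sum to 1.5 > 1, as allowed for multilabel tasks
```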
      <p>In EXIST 2025, for each of the tasks, two types of evaluation have been performed:
1. Soft-soft evaluation. For systems that provide probabilities for each category, we perform a
soft-soft evaluation that compares the probabilities assigned by the system with the probabilities
assigned by the set of human annotators. The probabilities of the classes for each instance are
calculated according to the distribution of labels and the number of annotators for that instance.
We use a modification of the original ICM metric (Information Contrast Measure [17]), ICM-Soft
(see details below), as the official evaluation metric in this variant, and we also provide results for
the normalized version of ICM-Soft (ICM-Soft Norm).
2. Hard-hard evaluation. For systems that provide a hard, conventional output, we perform a
hard-hard evaluation. To derive the hard labels in the ground truth from the different annotators’
labels, we use a probabilistic threshold computed for each task. As a result, for the identification tasks, the
class annotated by more than 3 annotators is selected; for the source intention tasks, the class annotated by
more than 2 annotators is selected; and for the categorization tasks (multilabel), the classes annotated by
more than 1 annotator are selected. The instances for which there is no majority class (i.e., no
class receives more probability than the threshold) are removed from this evaluation scheme. The
official metric for this evaluation is the original ICM, as defined by [17]. We also report a normalized
version of ICM (ICM Norm) and F1 (F1YES). In the identification tasks, we use F1 for the positive class. In
the source intention and categorization tasks, we use the macro-average of F1 over all classes (Macro F1). Note, however, that
F1 is not ideal in our experimental setting: although it can handle multilabel situations, it does
not take into account the relationships between classes. In particular, a confusion between not
sexist and any of the sexist subclasses, and a confusion between two of the sexist subclasses, are
penalized equally.</p>
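      <p>The threshold-based derivation of hard labels can be sketched as follows (an illustrative sketch: the function and label names are ours, and the example assumes six annotations per instance):</p>

```python
from collections import Counter

def hard_labels(annotations, threshold):
    """Keep every label chosen by MORE THAN `threshold` annotators.
    An empty result means there is no majority class, so the instance
    is removed from the hard-hard evaluation."""
    counts = Counter(annotations)
    return [label for label, n in counts.items() if n > threshold]

# Hypothetical instance with six annotations:
votes = ["sexist", "sexist", "sexist", "sexist", "not sexist", "not sexist"]
print(hard_labels(votes, threshold=3))  # identification threshold -> ['sexist']
print(hard_labels(votes, threshold=5))  # stricter threshold -> [] (discarded)
```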
      <p>ICM is a similarity function that generalizes Pointwise Mutual Information (PMI), and can be used
to evaluate outputs in classification problems by computing their similarity to the ground truth. The
general definition of ICM is:</p>
      <p>ICM(A, B) = α₁ IC(A) + α₂ IC(B) − β IC(A ∪ B)
where IC(A) is the Information Content of the instance represented by the set of features A. ICM
maps into PMI when all parameters take a value of 1. The general definition of ICM by [17] is applied
to cases where categories have a hierarchical structure and instances may belong to more than one
category. The resulting evaluation metric is proved to be analytically superior to the alternatives in the
state of the art. The definition of ICM in this context is:</p>
      <p>ICM(s(d), g(d)) = 2 IC(s(d)) + 2 IC(g(d)) − 3 IC(s(d) ∪ g(d))
where IC(·) stands for Information Content, s(d) is the set of categories assigned to document d by
system s, and g(d) is the set of categories assigned to document d in the gold standard. The score for
a perfect output (s(d) = g(d)) is the gold standard Information Content, IC(g(d)). The score for a
zero-information system (no category assignment) is −IC(g(d)). We use these two boundaries for
normalisation purposes, truncating to 0 the scores lower than −IC(g(d)).</p>
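      <p>As an illustration, ICM can be sketched for a flat, monolabel case as follows (a minimal sketch: we assume independent categories and invented class frequencies; the real metric additionally handles the category hierarchy):</p>

```python
import math

def ic(categories, prior):
    """Information Content of a category set, assuming independent
    flat categories with known gold-standard frequencies."""
    return sum(-math.log2(prior[c]) for c in categories)

def icm(system, gold, prior):
    """ICM(s, g) = 2*IC(s) + 2*IC(g) - 3*IC(s U g)."""
    return (2 * ic(system, prior) + 2 * ic(gold, prior)
            - 3 * ic(system | gold, prior))

prior = {"sexist": 0.4, "not sexist": 0.6}   # hypothetical class frequencies
gold = {"sexist"}
print(icm({"sexist"}, gold, prior))       # perfect output: equals IC(gold)
print(icm({"not sexist"}, gold, prior))   # confusion: negative score
```

<p>Note that a perfect output scores IC(g(d)) and a wrong assignment scores below zero, matching the normalisation boundaries described above.</p>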
      <p>To the best of our knowledge, no existing metric fits hierarchical multilabel
classification problems in a LeWiDi scenario, so we have defined an extension of ICM (ICM-Soft) that
accepts both soft system outputs and soft ground truth assignments. ICM-Soft works as follows: first,
we define the Information Content of a single assignment of a category c with an agreement a to a
given instance as the probability that instances in the gold standard exceed the agreement level a for
the category c:</p>
      <p>IC({⟨c, a⟩}) = − log₂ P({d ∈ D : g_c(d) ≥ a})
In order to estimate P, we compute the mean and standard deviation of the agreement levels for each class
across instances, and apply the cumulative probability over the inferred normal distribution. In the
case of zero variance, we consider that the probability for values equal to or below the mean is 1
(zero IC), and the probability for values above the mean must be smoothed; this is not, however, the case
for the EXIST datasets.</p>
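      <p>The normal-distribution estimate of P can be sketched as follows (illustrative only: the agreement values are invented, and we use Python's statistics.NormalDist for the cumulative probability):</p>

```python
import math
from statistics import NormalDist, mean, stdev

def ic_soft_single(agreements, a):
    """IC of assigning a category with agreement level `a`:
    -log2 of the probability that the per-instance agreement for this
    category reaches `a`, under a normal distribution fitted to the
    observed agreement levels."""
    dist = NormalDist(mean(agreements), stdev(agreements))
    p = 1.0 - dist.cdf(a)       # P(agreement >= a)
    return -math.log2(p)

# Hypothetical agreement levels for one category across six instances:
agreement = [0.17, 0.33, 0.5, 0.67, 0.83, 1.0]
print(ic_soft_single(agreement, 1.0))  # rare (high) agreement -> high IC
print(ic_soft_single(agreement, 0.5))  # common agreement -> low IC
```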
      <p>Due to the multilabel and hierarchical nature of the classification task, for each classification instance
the gold standard, the system output, and their union (g(d), s(d), and s(d) ∪ g(d)) are sets of category
assignments. The union of the assignments is computed as a fuzzy set, i.e., taking the maximum
agreement value for each category. In order to estimate its information content, we apply a recursive
function similar to the one described by Amigó and Delgado [17] for assignment sets, which avoids the
redundant information of parent categories:</p>
      <p>IC(⋃ᵢ₌₁ⁿ {⟨cᵢ, aᵢ⟩}) = IC({⟨c₁, a₁⟩}) + IC(⋃ᵢ₌₂ⁿ {⟨cᵢ, aᵢ⟩}) − IC(⋃ᵢ₌₂ⁿ {⟨lca(c₁, cᵢ), min(a₁, aᵢ)⟩}) (11)
where lca(cᵢ, cⱼ) is the lowest common ancestor of categories cᵢ and cⱼ.</p>
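      <p>The lowest common ancestor can be computed over the two-level EXIST hierarchy with a simple child-to-parent map (a sketch with shortened, illustrative category names):</p>

```python
def lca(c1, c2, parent):
    """Lowest common ancestor of two categories, given a child -> parent
    map in which top-level categories point to the root 'ROOT'."""
    ancestors = set()
    node = c1
    while node != "ROOT":
        ancestors.add(node)
        node = parent[node]
    ancestors.add("ROOT")
    node = c2
    while node not in ancestors:
        node = parent[node]
    return node

# Two-level hierarchy: sexist subcategories hang below 'sexist'.
parent = {"ideological": "sexist", "objectification": "sexist",
          "sexist": "ROOT", "not sexist": "ROOT"}
print(lca("ideological", "objectification", parent))  # -> 'sexist'
print(lca("ideological", "not sexist", parent))       # -> 'ROOT'
```

<p>Two sexist subcategories share the ancestor sexist, while a sexist subcategory and not sexist only share the root; this is what makes the former confusion less severe under ICM.</p>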
    </sec>
    <sec id="sec-5">
      <title>5. Overview of Approaches</title>
      <p>This section offers an overview of the methodological approaches submitted to EXIST 2025.</p>
      <p>Although 244 teams from 38 different countries registered for participation, 114 teams finally
submitted results, for a total of 873 runs. Teams were allowed to participate in any
of the nine tasks and submit hard and/or soft outputs. Table 4 summarizes the participation in the
different tasks and evaluation contexts. In what follows, we organize the discussion by media type, which allows
for a clearer comparison of modeling strategies across different modalities and highlights trends and
innovations specific to each content type.</p>
      <sec id="sec-5-1">
        <title>5.1. Sexism Detection in Tweets</title>
        <p>Sexism detection in tweets was predominantly approached through Natural Language Processing
(NLP) techniques and neural network-based models. The majority of teams relied on pre-trained large
language models (LLMs), such as BERT, RoBERTa, and domain-specific variants like BERTweet or
HateBERT, often fine-tuned on the EXIST datasets. While transformer-based models dominated, a
minority of teams used traditional machine learning techniques such as Support Vector Machines (SVM)
or Random Forests with TF-IDF, as well as rule-based or lexicon-based methods.</p>
        <p>Many teams applied data preprocessing techniques tailored to social media content, including emoji
normalization, hashtag segmentation, and URL removal. Data augmentation methods, such as
backtranslation, synonym replacement, or oversampling of minority classes, were also employed to mitigate
class imbalance and improve generalization.</p>
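        <p>Preprocessing of this kind can be sketched as follows (illustrative only: no specific team's pipeline is reproduced, and the hashtag segmentation is a naive camel-case split):</p>

```python
import re

def preprocess(tweet):
    """Illustrative social-media preprocessing: URL removal and
    camel-case hashtag segmentation."""
    tweet = re.sub(r"https?://\S+", "", tweet)  # drop URLs
    def split_hashtag(match):
        # insert a space at lowercase -> uppercase boundaries
        return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))
    return re.sub(r"#(\w+)", split_hashtag, tweet).strip()

print(preprocess("Check this #ADayWithoutWomen http://t.co/xyz"))
# -> 'Check this ADay Without Women'
```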
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Sexism Detection in Memes</title>
        <p>For memes, the inherently multimodal nature of the data led teams to combine computer vision and text
analysis methods. Convolutional Neural Networks (CNNs) and visual feature extractors such as CLIP
and ResNet were used to process image data. Meanwhile, embedded text within memes was handled
using transformer-based NLP models.</p>
        <p>Teams used both early fusion (merging textual and visual embeddings before classification) and late
fusion (aggregating predictions from separate pipelines). Although multimodal fusion was key, some
teams focused primarily on one modality, revealing diverse strategic preferences.</p>
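<p>The two fusion styles can be contrasted with toy vectors and a stand-in linear classifier; all dimensions and weights below are invented for illustration:</p>

```python
from statistics import mean

# Toy per-modality representations (stand-ins for transformer/CLIP features).
text_emb  = [0.2, 0.9]
image_emb = [0.7, 0.1, 0.5]

def linear_clf(features, weights):
    """A stand-in classifier: weighted sum clamped to a [0, 1] 'sexist' score."""
    score = sum(f * w for f, w in zip(features, weights))
    return max(0.0, min(1.0, score))

# Early fusion: concatenate the embeddings, then classify once.
fused = text_emb + image_emb
early_score = linear_clf(fused, [0.3, 0.4, 0.1, 0.1, 0.2])

# Late fusion: classify each modality separately, then aggregate predictions.
text_score  = linear_clf(text_emb,  [0.5, 0.5])
image_score = linear_clf(image_emb, [0.4, 0.3, 0.3])
late_score  = mean([text_score, image_score])

print(round(early_score, 3), round(late_score, 3))
```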
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Sexism Detection in TikTok Videos</title>
        <p>Sexism detection in TikToks required integrating audio, visual, and textual information, making
multimodal analysis indispensable. Despite the complexity of the modality, the dominant methods remained
rooted in NLP (particularly for transcript analysis), followed by computer vision models. Multimodal
fusion strategies—especially late fusion—were key in top-performing systems, and some teams adopted
zero-shot or prompt-based learning using general-purpose LLMs such as GPT-3.</p>
        <p>Given TikTok’s social dynamics, models were also designed to be sensitive to context, sometimes
incorporating meta-information, such as hashtags or background music features.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Summary of Approaches per Team</title>
        <p>Next we provide a summary of the methodological approaches followed by the EXIST 2025 teams that
submitted a description paper for the Working Notes. We start with the teams that participated only in
some or all subtasks of Task 1 on processing tweets.</p>
        <p>ANLP-Uniso [18] uses the mT5 model for contextual embeddings and a system that integrates several
machine learning and deep learning classifiers, including both traditional models (Logistic Regression,
SVM) and neural networks (RNN, GRU, hybrid FNN+GRU). To enhance classification accuracy, they
apply extensive preprocessing, feature normalization, dimensionality reduction via PCA, and data
balancing techniques such as SMOTE and class weighting.</p>
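<p>Class weighting, one of the balancing techniques mentioned, is commonly computed with the "balanced" heuristic (as popularized by scikit-learn); a minimal sketch with invented labels:</p>

```python
from collections import Counter

def balanced_class_weights(labels):
    """'Balanced' heuristic: weight_c = n_samples / (n_classes * count_c),
    so rare classes contribute more to the training loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# 6 NO vs 2 YES: the minority class gets a 3x larger weight.
weights = balanced_class_weights(["NO"] * 6 + ["YES"] * 2)
print(weights)
```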
        <p>NLPDame [19] addresses Sub-task 1.3 with a methodology that includes fine-tuning twelve
transformer LLMs within a tailored multi-head and multi-task model architecture that employs CLS, mean,
and max pooling for multi-label text classification. The multi-head architecture is chosen to deal with
multilinguality, while the multi-task architecture incorporates sentiment analysis to enhance the
multi-label classification process. The methodology also involves utilizing the open-source multilingual LLM
Llama-3.2-3B-Instruct and prompt engineering to classify tweets. Additionally, a method incorporating
RAG (Retrieval Augmented Generation), chain-of-thought reasoning and annotators’ profiles was used
to provide contextual information within the LLM prompt engineering framework. A majority voting
system was submitted that combines the predictions from (i) the twelve transformer models with LLM
prompt engineering, and (ii) the same twelve models with LLM prompt engineering extended with
chain-of-thought, annotators’ profiles, and RAG. Various loss functions and thresholds were applied,
as well as class positive weights to tackle class imbalance.</p>
        <p>ECORBI-UPV [20] leverages semantic embeddings generated with pre-trained models from Google’s
Generative AI suite, evaluated in both frozen and fine-tuned form. For classification they use traditional
machine learning models, such as Random Forest, SVM, and MLP.</p>
        <p>Mumul03 [21] employs ModernBERT-large and incorporates demographic information from the
annotator such as gender, ethnicity, age, and other attributes into the model input. By modeling
individual annotator perspectives and aggregating predictions across submodels, they aim at capturing
the subjectivity in annotations.</p>
        <p>Fosu-students [22] reformulates the binary classification problem of Task 1 into a seven-class task.
They implement ModernBERT-large with layered learning rate decay for hierarchical feature
optimization. The model is enhanced with Supervised Contrastive Learning (SCL) to improve discrimination
of nuanced sexism expressions through metric learning. Their architecture incorporates: (1) task
reformulation from binary to fine-grained seven-class prediction, (2) ModernBERT’s memory-efficient
attention mechanisms for long-context understanding, and (3) a hybrid CE+SCL loss (λ=0.9) for robust
representation learning.</p>
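<p>One common formulation of such a hybrid objective is L = λ·CE + (1 − λ)·SupCon with λ = 0.9; the sketch below implements that combination on toy logits and 2-D embeddings. The temperature τ, the batch, and the logits are assumptions for illustration, not the team's settings:</p>

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def supcon_loss(embs, labels, tau=0.1):
    """Supervised contrastive loss (Khosla et al.) over one batch of
    L2-normalized embeddings: pulls same-label pairs together."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    total, n = 0.0, len(embs)
    for i in range(n):
        pos = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not pos:
            continue  # anchors without positives contribute nothing
        denom = sum(math.exp(dot(embs[i], embs[a]) / tau)
                    for a in range(n) if a != i)
        total -= sum(math.log(math.exp(dot(embs[i], embs[p]) / tau) / denom)
                     for p in pos) / len(pos)
    return total / n

lam = 0.9  # the weighting reported by the team
# Toy 7-class logits (target = class 0) and a tiny batch of unit vectors.
ce = cross_entropy([2.0, 0.5, -1.0, 0.1, 0.0, -0.5, 0.3], target=0)
embs = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [-0.28, 0.96]]
labels = [0, 0, 1, 1]
scl = supcon_loss(embs, labels)
loss = lam * ce + (1 - lam) * scl
print(round(loss, 4))
```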
        <p>Warwick [23] develops a hybrid detection framework that integrates the outputs of multiple neural
language models, each encoding different perspectives on the task. Their system combines fine-tuned
monolingual transformers (BERTweet for English, RoBERTuito for Spanish) with instruction-tuned
LLMs such as Claude 3 Sonnet and LLaMA3-70B-Instruct. These models are combined within a
confidence-based multi-stage pipeline: high-confidence predictions from task-specialized models are
preserved, while uncertain instances are routed to general-purpose LLMs for zero-shot classification.
This dynamic strategy balances the precision of specialized models with the broader judgment of
instruction-tuned LLMs.</p>
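<p>The confidence-gated routing can be sketched as follows; the threshold and the stub models are hypothetical placeholders for the fine-tuned transformers and the instruction-tuned LLMs:</p>

```python
def route(tweet, specialist, fallback_llm, threshold=0.8):
    """Confidence-gated pipeline: keep the specialist's prediction when it
    is confident; otherwise defer to a general-purpose LLM (zero-shot)."""
    label, confidence = specialist(tweet)
    if confidence >= threshold:
        return label, "specialist"
    return fallback_llm(tweet), "llm-fallback"

# Stub models standing in for the fine-tuned transformer and the LLM.
specialist = lambda t: ("YES", 0.95) if "sexist" in t else ("NO", 0.55)
fallback   = lambda t: "NO"

print(route("clearly sexist remark", specialist, fallback))
print(route("ambiguous text", specialist, fallback))
```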
        <p>CLiC [24] employs BERT fine-tuning for Task 1.1 and DSPy-based prompt optimization for Tasks
1.2 and 1.3. They explore BERT-based methods for Task 1.1 and contrasting prompt-based methods,
including variants with annotator information and RAG, for the subsequent tasks.</p>
        <p>NetGuardAI [25] experiments with several transformer-based models, including DeBERTa,
mDeBERTa, XLM-RoBERTa, Detoxify, and HateBERT, alongside three levels of text preprocessing: Light,
Classic, and Aggressive Cleaning. Although they tested various data augmentation strategies, such
as translation-based augmentation using Meta AI’s NLLB model and pseudo-labeling with the EDOS
dataset, the final submitted system does not include these enhancements.</p>
        <p>EquityExplorer-2.0 [26] proposes a pipeline that combines label-aware translation, domain-adaptive
pre-training, and ensemble learning. A central component of their system is a prompt-based
Spanish-to-English translation step, designed to preserve the tone and task-relevant semantics of the original
message, selectively incorporating label cues during training. They aimed at enabling the use of
high-performance monolingual models, while maintaining semantic fidelity across languages. They further
adapt DeBERTa-v3-Large and RoBERTa-Large using 2 million unlabeled posts from the EDOS dataset
and fine-tune them individually and in a fused configuration (DTFN). Final predictions are generated
via majority voting, with a tie-handling rule that improves robustness.</p>
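<p>Majority voting with a deterministic tie-break can be sketched as below; the preference order shown is a hypothetical stand-in for the team's actual tie-handling rule:</p>

```python
from collections import Counter

def majority_vote(predictions, tie_break_order):
    """Majority voting over ensemble members; ties are resolved by a fixed
    preference order (a hypothetical stand-in for the team's tie rule)."""
    counts = Counter(predictions)
    best = max(counts.values())
    tied = [label for label, c in counts.items() if c == best]
    if len(tied) == 1:
        return tied[0]
    # Tie: pick the label that appears earliest in the preference order.
    return min(tied, key=tie_break_order.index)

order = ["YES", "NO"]  # e.g. prefer flagging potentially sexist content
print(majority_vote(["YES", "NO", "YES"], order))
print(majority_vote(["YES", "NO"], order))
```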
        <p>Exist@CeDRI [27] uses a combination of multiple text augmentation strategies, including AEDA
(punctuation-based), synonym replacement, back-translation, and light code-switching via round-trip
translation, in order to enhance model reliability and deal with data sparsity. Their architecture builds
on XLM-RoBERTa-large, fine-tuned for three subtasks: binary sexism detection, source classification,
and sexism categorization. Both soft and hard label strategies are incorporated to account for annotation
disagreement, and label smoothing and class-weighted loss functions are applied to manage class
imbalance.</p>
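<p>AEDA-style augmentation inserts random punctuation marks while leaving the original words untouched; a minimal sketch (the insertion ratio is an assumption):</p>

```python
import random

PUNCT = [".", ";", "?", ":", "!", ","]  # the AEDA punctuation set

def aeda(sentence, ratio=0.3, seed=0):
    """AEDA-style augmentation: insert random punctuation marks at random
    positions, preserving the original words and their order."""
    rng = random.Random(seed)
    words = sentence.split()
    n_insert = max(1, int(ratio * len(words)))
    for _ in range(n_insert):
        pos = rng.randint(0, len(words))
        words.insert(pos, rng.choice(PUNCT))
    return " ".join(words)

print(aeda("this tweet needs more training data"))
```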
        <p>Awakened [28] employs an adaptive Mixture of Transformers architecture. The system combines nine
transformer-based models—spanning both English-specific and multilingual variants—each specialized
by language, platform, or task. A dynamic weighting mechanism automatically adjusts the contribution
of each model in the ensemble, based on the detected language and performance metrics, in order to
enable robust and context-aware classification across diverse linguistic settings.</p>
        <p>Dandys-de-BERTganim [29] adopts a multi-task learning architecture with language-specific
transformers for English and Spanish, integrating demographic information from annotators as contextual
signals. They enhance model generalization through data augmentation techniques such as
back-translation and a punctuation-based augmentation method. Furthermore, they introduce a soft-labeling
data reader to better reflect annotation disagreement, aligning with the LeWiDi paradigm.</p>
        <p>DuthThrace [30] develops a transformer-based multilingual architecture, fine-tuned with techniques
such as oversampling, class weighting, and soft-label learning to account for class imbalance and
annotator disagreement.</p>
        <p>CIMAT-CS-NLP [31] proposes a method based on a single multitask query to LLMs, designing
a query that first generates chain-of-thought justifications and then requests answers for all tasks
simultaneously. To automate query refinement, they apply evolutionary computation, optimizing the
F1-macro on a development subset. Experiments are performed with DeepSeek-R1-Distill-Llama-8B
and Gemini-1.5-Flash. They fine-tune a BERT-like model with the LLM-generated justifications, with
DeepSeek achieving similar performance to the Gemini-based justifications despite the reduced model
size.</p>
        <p>UC3M-LI [32] develops a variety of systems for Task 1.1 and Task 1.2, combining traditional
machine learning models, Transformer-based architectures, ensemble methods, and hybrid CNN-BERT
approaches. Their approach incorporates data augmentation and multilingual modeling strategies to
address challenges such as label disagreement and language variation.</p>
        <p>Cyberpuffs [33] uses several LLMs, prominently multilingual BERT and XLM-RoBERTa, combined
with an ensemble learning approach to process tweets. They employ data augmentation techniques
such as cross-translation, EASE, and AEDA, and develop separate models for English and Spanish
to optimize language-specific predictions. Model evaluation is conducted using hard labels, derived
through majority annotator voting, and soft labels, derived from class probability distributions.</p>
        <p>COMFOR [34] approaches the tasks with an SVM based on a comprehensive feature representation,
including embeddings and lexical features. For the third subtask, this classifier was used as the basis for
a classifier chain.</p>
        <p>CIMAT-GTO [35] uses a hybrid setting aimed at taking advantage of the reasoning produced by
generative LLMs using justification-guided knowledge expansion when fine-tuning a smaller transformer-based
model for classification.</p>
        <p>Mario [36] applies hierarchical Low-Rank Adaptation (LoRA) of Llama 3.1 8B. Their method introduces
conditional adapter routing that explicitly models label dependencies across the three hierarchically
structured subtasks. Unlike conventional LoRA applications that target only attention layers, they
apply adaptation to all linear transformations, enhancing the model’s capacity to capture task-specific
patterns. They train separate LoRA adapters (rank=16, QLoRA 4-bit) for each subtask using unified
multilingual training that leverages Llama 3.1’s native bilingual capabilities. The method requires
minimal preprocessing and uses standard supervised learning.</p>
        <p>FHSTP [37] proposes three machine learning models to address these tasks, including a Speech Concept
Bottleneck Model (SCBM), a Speech Concept Bottleneck Model with Transformer (SCBMT), and a
fine-tuned XLM-RoBERTa transformer model that serves as baseline. SCBM uses descriptive adjectives
as human-interpretable bottleneck concepts. SCBM leverages LLMs to map input texts to an abstract
adjective-based representation, which is then utilized to train a lightweight classifier for downstream
tasks. SCBMT extends this approach by fusing transformer-based contextual embeddings with the
adjective-based representation, aiming to balance interpretability and classification performance.</p>
        <p>NYCU-NLP [38] integrates annotator demographics and leverages bilingual fusion by combining
original and cross-translated tweets. They implement a hierarchical pipeline and compare three distinct
modeling strategies: a fine-tuned transformer-based dual-encoder architecture with early and late
fusion, a zero-shot auto-regressive LLM, and a zero-shot diffusion-based LLM. The transformer-based
approach consistently achieves the highest performance across most metrics.</p>
        <p>Next we present the approaches of teams that participated only in Task 2 on processing memes.</p>
        <p>TrankilTwice [39] participates in Task 2.1 with an end-to-end system integrating LLM-based
prompting strategies, cross-modal language encoding, and graph-based modeling at meme level, observing
performance gaps across languages.</p>
        <p>NaturalThinkers [40] integrates visual and textual feature extraction using BLIP (Bootstrapping
Language-Image Pre-training), BERT, and ViT (Vision Transformer), followed by an attention-based
fusion mechanism. A multi-layer perceptron (MLP) then produces the final classification, and the
system is exposed through a Gradio-based user interface.</p>
        <p>ArcosGPT [41] adds BLIP-generated image captions to OCR text. By further including a GPT-4o
description of the memes they obtain an increase of 8.2 points. They obtain the best overall performance
with a ViT+RoBERTa fusion model.</p>
        <p>CLTL [42] follows a hard majority voting ensemble strategy to process memes, where the component
models include a multimodal model that combines the representations of Swin Transformer V2 and
a pre-trained language model (RoBERTa or BERT), and text-only models that use meme text and
image captions as input. The text-only approaches include pre-trained transformer models (RoBERTa,
BERT, and a BERTweet model fine-tuned for sexism detection) and a conventional machine learning
approach, namely an SVM with stylometric and emotion-based features.</p>
        <p>I2C-UHU-Altair [43] uses LLMs and vision-language models (VLMs) to process both textual and
visual information in memes. To enhance model robustness, they adopt the LeWiDi framework, as an
attempt to allow the system to benefit from divergent annotations that reflect the inherent ambiguity
and subjectivity in sociolinguistic tasks.</p>
        <p>GrootWatch [44] participates in both the tweets and memes tasks. For tweet classification, they used
a multi-task headed BERT model enriched with relevant information surrounding the tweet, helping
the model achieve a full understanding of the tweet and its context. For memes, they used a VLM-based
application to detect and categorise sexism in different scenarios.</p>
        <p>The following are the approaches of the teams that participated only in the TikTok tasks.</p>
        <p>ECA-SIMM-UVa [45] follows a segmentation-oriented approach, splitting TikTok videos into textual,
audio, and video channels, driven by the hypothesis that sexism can manifest in spoken words, embedded
text, speaker tone, or visual content (text, pictures or other images). They train individual deep learning
classifiers for each channel and explore various prediction fusion mechanisms, such as One Is Enough
(OIE), Majority Voting, and Probabilistic OIQ for hard evaluation, as well as Logistic Regression and
Weighted Sum for soft evaluation, to combine predictions. Models using the textual channel show
superior performance, especially when using the original text provided with each sample in the dataset.
These models consistently outperform audio and video channels, indicating that textual information is
the most informative source for sexism detection in this context.</p>
        <p>DS@GT EXIST [46] implements a multimodal framework for automated sexism detection in
short-form videos, incorporating audio, visual, and textual signals. They explore the use of transformer-based
models, including RoBERTa for text, VideoMAE for video, and CNN-LSTM pipelines for audio, and
introduce a generative AI-enhanced pipeline using Gemini to produce video summaries and analyses,
which are combined with traditional modalities.</p>
        <p>Finally, a few teams participated in all three tasks, processing tweets, memes, and TikTok videos.</p>
        <p>UMUTeam [47] addresses all three subtasks with multilingual Transformer-based models, including
XLM-RoBERTa (base and large versions) for text, ViT for image features, and VideoMAE for video
input. They apply specialized preprocessing and label handling for each modality. Soft-label learning
is implemented using mean squared error (MSE) loss for Subtasks 1 and 2, which involve binary
and multiclass classification, respectively, and binary cross-entropy (BCE) loss for Subtask 3, which
is a multilabel classification problem. In all cases, annotator votes are transformed into probability
distributions to capture label uncertainty. For hard-label variants, discrete predictions are obtained by
selecting the class or classes with the highest probability from the model’s output during the evaluation
stage.</p>
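<p>The conversion from annotator votes to soft labels, and from soft labels back to hard predictions, can be sketched as follows (the votes shown are invented):</p>

```python
from collections import Counter

def votes_to_soft(votes, classes):
    """Turn raw annotator votes into a probability distribution (soft label)."""
    counts = Counter(votes)
    total = len(votes)
    return [counts[c] / total for c in classes]

def soft_to_hard(soft, classes):
    """Hard label: the class (or classes) with maximum probability."""
    top = max(soft)
    return [c for c, p in zip(classes, soft) if p == top]

classes = ["YES", "NO"]
soft = votes_to_soft(["YES", "YES", "YES", "YES", "NO", "NO"], classes)
print(soft, soft_to_hard(soft, classes))
```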
        <p>CogniCIC [48] explores tailored methodologies to process tweets, memes, and TikTok videos. For
Subtask 1 they compare two approaches: the transformer-based HateBERT model and the generative
Claude 3.7 model. HateBERT is optimized through tweet preprocessing, regularized training, and
multitask learning, while Claude 3.7 leverages advanced multimodal capabilities, integrating visual
and textual cues for flexible and effective content interpretation. For Subtasks 2 and 3 they use Claude
3.7, which incorporates multimodal inputs, including visual frames from memes and videos, enabling
nuanced distinctions, such as direct sexist expressions versus judgmental critiques.</p>
        <p>Bergro [49] follows a generalizable BERT-based approach to identify and classify the source intent of
sexism across different social network channels. This approach focuses on individual models trained
on tweets that are then applied to both meme (image) and TikTok data using OCR and annotations,
respectively. This is an example of a single model fine-tuned on one media type and applied to multiple
media types with minimal data preprocessing required.</p>
        <p>BeatrizRuiz [50] uses three transformer-based models (DistilBERT, XLM-RoBERTa, and DistilGPT-2)
to address all tasks. The results show that, while all models tend to overpredict sexist content and
underutilize the non-sexist class in complex subtasks, DistilBERT demonstrates the most balanced
performance in binary classification, XLM-RoBERTa shows robustness but a propensity for
overgeneralization, and DistilGPT-2 exhibits greater flexibility in multilabel assignments, despite its generative
architecture.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>In the following subsections, we present the results of both the participants and the baseline systems
for each task, organized by evaluation mode (soft or hard).</p>
      <sec id="sec-6-1">
        <title>6.1. Task 1.1: Sexism Identification in Tweets</title>
        <sec id="sec-6-1-1">
          <title>6.1.1. Soft Evaluation</title>
        </sec>
        <sec id="sec-6-1-2">
          <title>6.1.2. Hard Evaluation</title>
          <p>(Leaderboard for the hard evaluation of Task 1.1: systems ranked by ICM-Hard, headed by
BERT-Simpson_3, GrootWatch_3, and hfstp_2.)</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Task 1.2: Source Intention in Tweets</title>
        <sec id="sec-6-2-1">
          <title>6.2.1. Soft Evaluation</title>
          <p>A total of 36 systems outperformed the strongest baseline (EXIST2025-test_majority-class, where all
instances are labeled as ‘NO’), indicating moderate variation in system effectiveness. All systems also
outperformed the EXIST2025-test_minority-class baseline.</p>
          <p>The relative difference between the best and fifth-best teams (GrootWatch and NetGuardAI) was
15.7%, suggesting relatively close performance among the top submissions. This narrow spread points
to a convergence in probabilistic modeling strategies among leading participants, despite overall scores
being lower than in other tasks—likely due to the increased ambiguity inherent in intent classification.</p>
          <p>Leaderboard for EXIST 2025 Task 1.2 (author intention analysis in tweets), for the soft evaluation. Metrics:
ICM-S = ICM Soft, ICM-S Nr = ICM Soft Norm, CE = Cross Entropy.</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>6.3.2. Hard Evaluation</title>
          <p>For the Hard–Hard evaluation of Task 1.3, a total of 130 systems were submitted (see Table 11). The
normalized ICM-Hard values ranged from 0.0000 to 0.6514, with an average of 0.353 and a standard
deviation of 0.193. Remarkably, 106 systems surpassed the best baseline (EXIST2025-test_majority-class),
demonstrating high effectiveness in predicting the aggregated ground truth labels. The range between
the top and fifth systems was only 9.1%, highlighting a tight cluster of top performances. This compact
variation among the leaders suggests strong generalization in handling categorical distinctions of sexism
in tweets when annotations are aggregated. All except four systems achieved better results than the
minority class baseline (all instances labeled as ‘SEXUAL-VIOLENCE’).</p>
          <p>(Table 11 tail: the lowest-ranked runs, from teams BeatrizRuiz and TheMagicToken, obtained
ICM-Hard scores between −3.1920 and −4.9143.)</p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>6.4. Task 2.1: Sexism Identification in Memes</title>
        <sec id="sec-6-3-1">
          <title>6.4.1. Soft Evaluation</title>
          <p>Table 12 presents the results for the classification of memes as sexist or not sexist. A total of 8 systems
participated in the Soft–Soft evaluation. The normalized scores ranged from 0.0650 to 0.5110, with a
mean of 0.373 and a standard deviation of 0.149. All but one system outperformed the strongest baseline
(EXIST2025-test_majority-class), indicating that most submissions were effective under this probabilistic
evaluation. The relative difference between the highest and lowest among the top five submissions
from different teams was substantial (87.3%), with a notable drop from the fourth to fifth system. This
wide spread suggests room for improvement and divergence in approaches to modeling soft labels in
multimodal data.</p>
        </sec>
        <sec id="sec-6-3-2">
          <title>6.4.2. Hard Evaluation</title>
          <p>Table 13 presents the results for the hard-hard evaluation of Task 2.1. This task received 18 valid system
submissions. The normalized ICM-Hard values ranged from 0.1711 to 0.6877, with an average of 0.471
and a standard deviation of 0.145. Out of these, 16 systems outperformed the
EXIST2025-test_majority-class baseline. The top five systems from distinct teams showed a moderate performance spread, with a
28.3% relative difference between the highest and lowest performers in this top group. All submissions
surpassed the EXIST2025-test_minority-class baseline. Compared to Task 1.1, the distribution in Task 2.1
reflects greater difficulty in aligning with aggregated hard labels in multimodal settings, likely due to
the inherent ambiguity and subjective interpretation of memes.</p>
        </sec>
      </sec>
      <sec id="sec-6-4">
        <title>6.5. Task 2.2: Source Intention in Memes</title>
        <sec id="sec-6-4-1">
          <title>6.5.1. Soft Evaluation</title>
          <p>Table 14 presents the results for the classification of memes according to the intention of the author,
with the outputs provided as the probabilities of the different classes. Only 5 systems participated in
the Soft–Soft evaluation. The average normalized score across systems was 0.228, with a standard
deviation of 0.101. All five systems surpassed the majority baseline. Taking into account the top ranked
submissions from distinct teams, the relative difference between the best and the worst among this
top-4 was 81.7%, indicating a wide spread in system quality.</p>
        </sec>
        <sec id="sec-6-4-2">
          <title>6.5.2. Hard Evaluation</title>
          <p>Table 15 presents the results for the hard-hard evaluation of Task 2.2. We received 15 system submissions.
The normalized ICM-Hard metric ranged from 0.0000 to 0.5784, with an average of 0.308 and a standard
deviation of 0.169. Thirteen systems outperformed the EXIST2025-test_majority-class baseline, reflecting
strong participation despite the challenging nature of the task. Concerning the five best submissions
from different teams, the top system outperformed the fifth by 52.7%, a considerable difference suggesting
uneven performance across modeling strategies. Nonetheless, the narrow gap among the three leading
systems (near 10%) points to the emergence of competitive approaches for intent recognition, even in
the presence of aggregated hard annotations derived from subjectively interpreted multimodal inputs.</p>
          <p>(Soft-evaluation leaderboard fragment for Task 2.2, ICM-Soft: gold standard 4.7018; UMUTeam_1
−1.6327; I2C-UHU-Altair_1 −2.0736; surrey-mm-group_1 −2.4423; UMUTeam_2 −2.4994; Nogroupnocry_1
−4.1395; majority-class baseline −5.0745; minority-class baseline −18.9382.)</p>
        </sec>
      </sec>
      <sec id="sec-6-5">
        <title>6.6. Task 2.3: Sexism Categorization in Memes</title>
        <sec id="sec-6-5-1">
          <title>6.6.1. Soft Evaluation</title>
        </sec>
        <sec id="sec-6-5-2">
          <title>6.6.2. Hard Evaluation</title>
          <p>Finally, Table 17 presents the results for classifying memes based on the aspects of women being attacked,
with outputs provided as a single class prediction. A total of 14 systems participated (excluding the gold
and baselines). Thirteen of them scored above the best baseline, with an average normalized ICM-Hard
of 0.262 and a standard deviation of 0.158. The relative difference between the top and fifth-best system
from different teams was 59.5%, indicating competitive but not saturated performance across top ranks.
All systems clearly outperformed the EXIST2025-test_minority-class baseline.</p>
          <p>(Table 17 leaderboard, ICM-Hard: gold standard 2.4100; CogniCIC_1 0.0244; GrootWatch_3 −0.0798;
GrootWatch_2 −0.3550; ArcosGPT_1 −0.4187; GrootWatch_1 −0.5812; I2C-UHU-Altair_1 −0.9958;
I2C-UHU-Altair_2 −1.1838; CLTL_2 −1.4243; CLTL_3 −1.5325; UMUTeam_1 −1.5624; CLTL_1 −1.6077;
UMUTeam_2 −1.8869; NaturalThinker_1 −2.0376; majority-class baseline −2.0711; surrey-mm-group_1
−2.9992; minority-class baseline −3.3135.)</p>
        </sec>
      </sec>
      <sec id="sec-6-6">
        <title>6.7. Task 3.1: Sexism Identification in Videos</title>
        <sec id="sec-6-6-1">
          <title>6.7.1. Soft Evaluation</title>
          <p>Table 18 presents the results for classifying videos as sexist or not sexist. The Soft–Soft evaluation of
Task 3.1 attracted 34 participating systems. The normalized ICM-Soft values, which reflect alignment
with the probabilistic distribution of annotator labels, ranged from 0.1481 to 0.5590. The average
normalized score was 0.3584, with a standard deviation of 0.174, indicating considerable variance in
system quality. A total of 25 systems outperformed the strongest baseline
(EXIST2025-test_majority-class). The difference between the best and worst among the top five teams was approximately 18.2%,
reflecting a modest but meaningful spread. Interestingly, most high-scoring systems came from teams
with distinct modeling pipelines, suggesting diverse yet effective approaches to handling annotator
disagreement in the multimodal context of video classification.</p>
        </sec>
        <sec id="sec-6-6-2">
          <title>6.7.2. Hard Evaluation</title>
          <p>Finally, Table 19 presents the results for classifying videos on sexism identification in a hard-hard
context. For this task, 41 systems submitted valid runs. Normalized ICM-Hard scores spanned from
0.1954 to 0.6001, with a mean of 0.4913 and a standard deviation of 0.1033. Nearly all participants (39
out of 41) exceeded the majority-class baseline (EXIST2025-test_majority-class), showing strong global
performance. The top five teams, as can be observed from Table 19, were closely matched, with only a
4.0% difference between the best and lowest performer among the top five.</p>
        </sec>
      </sec>
      <sec id="sec-6-7">
        <title>6.8. Task 3.2: Source Intention in Videos</title>
        <sec id="sec-6-7-1">
          <title>6.8.1. Soft Evaluation</title>
          <p>Table 20 presents the results for the classification of videos according to the intention of the author,
with the outputs provided as the probabilities of the different classes. In this task, the 29 participating
systems showed normalized ICM-Soft scores that ranged from 0.0000 to 0.3728, with a mean of 0.252
and a standard deviation of 0.084. A total of 26 systems surpassed the strongest baseline
(EXIST2025-test_majority-class), indicating a generally competitive field. The difference between the best and the
fifth ranked systems from distinct teams was modest, at 12.0%, revealing a cluster of high-performing
submissions.</p>
        </sec>
        <sec id="sec-6-7-2">
          <title>6.8.2. Hard Evaluation</title>
          <p>Table 21 presents the results for the hard-hard evaluation of Task 3.2. The normalized ICM-Hard scores
for the 36 systems submitted ranged from 0.0000 to 0.5018, with a mean of 0.375 and a standard deviation
of 0.116. Most systems (33 out of 36) outperformed the majority-class baseline. The best systems from
five different teams showed a relative difference between the highest and lowest normalized scores of
only 4.3%, reflecting a tight performance range. Interestingly, while the average performance remains
moderate, the consistency among top runs suggests that author intent in video—despite its multimodal
complexity—can be reliably modeled when annotations are aggregated, albeit with room for improving
discriminatory power across subtle categories.</p>
        </sec>
      </sec>
      <sec id="sec-6-8">
        <title>6.9. Task 3.3: Sexism Categorization in Videos</title>
        <sec id="sec-6-8-1">
          <title>6.9.1. Soft Evaluation</title>
          <p>Table 22 presents the results for classifying videos based on the aspects of women being attacked,
with outputs provided as class probabilities. A total of 34 participant systems were submitted for
this task. The normalized ICM-Soft scores ranged from 0.0000 to 0.1593, with a mean of 0.051 and
standard deviation of 0.052. The majority baseline achieved a normalized ICM score of 0.0931, and
was outperformed by 4 systems, while the minority baseline was not surpassed by any system. The
top 5 systems from different teams achieved normalized ICM-Soft scores between 0.1593 and 0.0931.
The relative difference between the best and the fifth-ranked system within this top group was 41.6%.
Despite the low overall values, a meaningful gap between systems can be observed, which underlines
the difficulty of probabilistic categorization in multi-class scenarios over multimodal video content.</p>
        </sec>
        <sec id="sec-6-8-2">
          <title>6.9.2. Hard Evaluation</title>
<p>Finally, Table 23 presents the results for classifying videos based on the aspects of women being
attacked, with outputs provided as a single class prediction. This task attracted 41 participant systems.
Normalized ICM-Hard scores spanned from 0.0000 to 0.3765, with a mean of 0.243 and a standard deviation
of 0.116. A total of 30 systems outperformed the majority baseline, while 13 did better than the minority
baseline. The top 5 systems from distinct teams achieved normalized ICM-Hard scores ranging from
0.3765 to 0.3585, showing a very tight performance band with only a 4.78% relative difference between
the highest and the lowest scoring among them.</p>
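          <p>The relative differences reported throughout this section can be reproduced with a simple
helper function (an illustrative sketch; the organizers' exact rounding convention is an assumption):</p>

```python
def relative_difference(best: float, other: float) -> float:
    """Relative difference between the best score and another score,
    expressed as a fraction of the best score."""
    return (best - other) / best

# Top-5 band reported for the hard evaluation of Task 3.3:
print(f"{relative_difference(0.3765, 0.3585):.2%}")  # -> 4.78%
```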
        </sec>
      </sec>
      <sec id="sec-6-9">
        <title>6.10. Cross-task Performance Analysis</title>
<p>Figure 4 shows the results of Cross Entropy (horizontal axes) and normalized ICM-Soft (vertical axes).
All the plots include the gold standard with the maximum score. The first row (Tasks 1.1, 2.1, and 3.1)
corresponds to the sexism detection tasks, i.e., binary single-label classification on texts, images, and
videos, respectively. The baseline approaches, consisting of labeling everything as the majority class or
as the minority class, are marked in blue and red, respectively.</p>
        <p>In terms of both Cross Entropy and ICM-Soft, the results of these two baselines fall below those of
the other participant runs, indicating that the proposed systems contribute some informative value.</p>
        <p>Only in the case of the video task (Task 3.1) are there some runs that fall below the baseline in terms of
ICM. This may be due to the fact that ICM penalizes false information based on class frequency.</p>
<p>Another observation is that, while high ICM values imply good Cross Entropy values, the reverse is
not true: several runs achieve good performance (low scores) according to Cross Entropy but obtain
low ICM scores. Although clusters of outputs with similarly high ICM scores lie far from the baseline
on the horizontal (Cross Entropy) axis, all the graphs show, at comparably good Cross Entropy values,
ICM scores spanning from the maximum down to the baseline. This may be due, among other factors,
to the fact that ICM considers not only the similarity of the assigned values for each class, but also the
distribution of classes throughout the corpus. In any case, in terms of ICM, there remains a significant
gap between the best-performing systems and the perfect solution. The gap is notably larger for the
image and video tasks (Tasks 2.1 and 3.1).</p>
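        <p>For the probabilistic setting, Cross Entropy can be sketched as follows (a minimal illustration,
not the official evaluation code; the six-annotator vote split is a hypothetical example). Under soft
evaluation, matching the annotator distribution scores better than an overconfident prediction of the
majority class:</p>

```python
import math

def cross_entropy(gold: list[float], pred: list[float], eps: float = 1e-12) -> float:
    """Cross entropy between a gold soft label and a system's predicted
    class probabilities (lower is better)."""
    return -sum(g * math.log(max(p, eps)) for g, p in zip(gold, pred))

# Gold soft label for an item annotated by six people: 4 x YES, 2 x NO.
gold = [4 / 6, 2 / 6]
calibrated = [4 / 6, 2 / 6]    # matches the annotator distribution
overconfident = [0.9, 0.1]     # right majority class, poor calibration
print(cross_entropy(gold, calibrated) < cross_entropy(gold, overconfident))  # True
```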
        <p>The second row corresponds to intention detection tasks. These are hierarchical classification tasks
with an initial YES/NO decision and two or three sub-classes for the YES category. In this case, there is
also an accumulation of runs with high performance in Cross Entropy but low ICM, suggesting that the
second metric captures additional aspects. Most runs outperform the baselines, but the gap between the
best run and the perfect output in terms of ICM is larger than in sexism detection, indicating a higher
complexity of the task.</p>
        <p>Finally, the third row corresponds to hierarchical multi-label classification tasks involving multiple
categories of sexism. In this case, since the tasks are multi-label, the Cross Entropy metric is not
applicable. The plots show system rankings ordered from lowest to highest ICM. An interesting finding
is that, in this case, many of the runs—including the minority-class baseline—do not surpass the zero
threshold in normalized ICM. This suggests that some outputs, in terms of information content, do not
outperform the empty output. In other words, the amount of noisy information exceeds the amount of
useful information. As the number of categories increases and the task requires capturing annotation
ambiguity (multi-label classification), the gap between the best run and the perfect output increases
significantly compared to the previous tasks.</p>
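        <p>The notion of information content underlying ICM can be illustrated with a one-line sketch
(this shows the class-frequency intuition only, not the organizers' full ICM implementation; the
frequencies are hypothetical):</p>

```python
import math

def information_content(p: float) -> float:
    """Information content, in bits, of a class with corpus frequency p."""
    return -math.log2(p)

# A rare class carries more information (and a wrong rare-class
# prediction adds more noise) than a frequent one:
print(information_content(0.05) > information_content(0.7))  # True
```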
        <p>On the other hand, Figure 5 displays evaluation results for the hard evaluation versions, in which
the assignment of items to classes depends on whether different thresholds of annotator agreement
are met. The plot shows F1 scores for the positive class in the first row (sexism identification), and
the average F1 score across all classes for the remaining tasks. The vertical axes show the results for
ICM-Hard.</p>
        <p>In general, a strong correlation between both metrics can be observed above a certain score threshold.
This is because both F1 and ICM take class specificity or frequency within the corpus into account.</p>
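        <p>The F1 variants plotted in Figure 5 can be computed with a minimal, self-contained sketch
(the gold and predicted labels below are hypothetical):</p>

```python
def f1_per_class(gold: list[str], pred: list[str], cls: str) -> float:
    """F1 score of a single class from hard (single-label) predictions."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Average F1 across all classes observed in the gold standard."""
    classes = sorted(set(gold))
    return sum(f1_per_class(gold, pred, c) for c in classes) / len(classes)

gold = ["YES", "YES", "NO", "NO", "YES", "NO"]
pred = ["YES", "NO", "NO", "NO", "YES", "YES"]
print(f1_per_class(gold, pred, "YES"))  # F1 of the positive class
print(macro_f1(gold, pred))             # average F1 over both classes
```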
        <p>Again, most runs outperform the baselines. Moreover, by observing the gap between the best run
and the ideal output, we can see that task difficulty increases as we move to setups with more classes,
multi-labeling, or hierarchical structures (rows). An increase in task difficulty is also observed as we
move from text-based tasks (first column) to images (second column) and videos (third column).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <p>The following discussion analyzes system performance across the full range of tasks proposed in EXIST
2025, which include the detection, intent classification, and fine-grained categorization of sexist content.
For the first time in the series, these tasks have been applied not only to textual data (tweets), but
also to memes and short-form videos (TikToks), enabling a broad multimodal evaluation. The section
is structured into three parts, each focusing on one of the core challenges: sexism detection, source
intention, and categorization, allowing us to examine how the nature of the input content (text, image,
or video) affects model effectiveness.</p>
      <sec id="sec-7-1">
        <title>7.1. System Performance Across Text, Memes, and Video in Sexism Detection</title>
<p>As can be observed in Table 24, which summarizes the best results for Subtasks 1.1, 2.1 and 3.1
(sexism detection in tweets, memes and TikTok videos, respectively), the tweets (text) dataset yielded
the highest detection performance, while memes and especially videos proved more challenging. In
the Soft-Soft evaluation (probabilistic outputs), the top system on tweets achieved an ICM-Soft Norm of
∼ 0.67, notably higher than the top systems on memes (0.511) and videos (0.559), as shown in Table 24.
In the Hard-Hard evaluation (binary outputs), tweet data again saw the best results with the top F1
(positive class) ∼ 0.817 and a normalized ICM-Hard of ∼ 0.84. Memes were intermediate (top F1 ∼ 0.781,
Norm ∼ 0.688), and videos the lowest (top F1 ∼ 0.694, Norm ∼ 0.600). These gaps suggest that the data
source significantly influences system performance. Models detect sexism in raw text more effectively
than in images or videos, likely due to the noise and information loss introduced when dealing with
multimedia content.</p>
<p>Even state-of-the-art multimodal systems face difficulties with blurry or stylized text and background
clutter in memes, which can explain the reduced accuracy on the meme and video datasets. The lower
results on Subtask 3.1 (videos) align with the expectation that multimodal sexism detection is a novel and
challenging problem, less studied than text-based sexism and complicated by needing to interpret visual
or audio context. Overall, tweet-based models outperformed those on OCR-derived text, underlining
how a clean text signal (tweets) is easier for current NLP systems to handle compared to extracted text
from images or videos.</p>
        <p>The lower performance observed in memes and videos is not solely attributable to the multimodal
nature of these formats. Beyond the technical challenges of processing visual and audio data, these
media often rely on implicit cultural references, sarcasm, irony, and contextual humor that are difficult
to interpret automatically. Memes, in particular, tend to condense layered meanings into very short texts
superimposed on images, often requiring familiarity with platform-specific discourse, internet slang, or
ongoing social debates. Similarly, TikTok videos frequently reference adolescent trends, in-group codes,
and popular audio tracks, which may be opaque to both annotators and systems unless they share that
sociocultural context. These aspects introduce a level of pragmatic and cultural ambiguity that goes
beyond the limitations of current vision or language models, and point to the need for systems that can
integrate both multimodal understanding and world knowledge to interpret such content effectively.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. System Performance Across Text, Memes, and Video in Sexism Source Intention</title>
        <p>This task required systems to predict the intention behind online sexist content, with a hierarchical
multiclass setup. The classification pipeline first determines whether the content is sexist, and then
predicts the fine-grained intention: DIRECT, REPORTED (tweets only), or JUDGEMENTAL. Table 25
presents the top systems and their evaluation metrics for each modality and context.</p>
        <p>As observed in Table 25, tweet-based systems once again outperform meme and video systems,
especially in the Soft-Soft (probabilistic) evaluation. However, absolute values of all metrics are lower
than in binary sexism detection, reflecting the increased difficulty of intention identification, particularly
in noisy or OCR-extracted content. Notably, the performance gap between modalities is less pronounced
in Macro F1 than in ICM-Soft, suggesting that top systems are better at predicting the main class, but
struggle with fine calibration to the true distribution of annotator votes.</p>
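        <p>The true distribution of annotator votes against which soft outputs are scored can be obtained
by simple vote normalization (a sketch; the class names and the vote pattern are illustrative):</p>

```python
from collections import Counter

def votes_to_soft_label(votes: list[str], classes: list[str]) -> dict[str, float]:
    """Convert raw annotator votes into a soft label: each class's
    probability is its share of the votes."""
    counts = Counter(votes)
    return {c: counts.get(c, 0) / len(votes) for c in classes}

classes = ["DIRECT", "JUDGEMENTAL", "NO"]
votes = ["DIRECT", "DIRECT", "JUDGEMENTAL", "DIRECT", "NO", "DIRECT"]
print(votes_to_soft_label(votes, classes))  # DIRECT gets 4/6 of the mass
```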
        <p>The gap between tweet, meme, and video results is partially explained by the challenges posed by
multimodal and OCR-derived content, as in Tasks 1.1, 2.1 and 3.1. Additionally, the removal of the
REPORTED class from memes and videos (a design choice based on data inspection) means that systems
face a simpler but less nuanced label space in those domains. This may contribute to the relatively high
Macro F1 in memes and videos, as models need only differentiate between fewer classes.</p>
        <p>Moreover, the higher prevalence of the DIRECT class in memes aligns with the nature of meme
content, which often features explicit or humorous sexist material. Systems tuned to this distribution
may perform well in memes but generalize poorly to tweets, where REPORTED and JUDGEMENTAL
are more common and context-dependent.</p>
      </sec>
      <sec id="sec-7-3">
        <title>7.3. System Performance Across Text, Memes, and Video in Sexism Categorization</title>
        <p>Tasks 1.3, 2.3 and 3.3 addressed the multilabel, multiclass, and hierarchical classification of online sexist
content, where systems must not only detect sexist content, but also assign one or more fine-grained
categories indicating the facet of womanhood under attack. The categories include IDEOLOGICAL
AND INEQUALITY, STEREOTYPING AND DOMINANCE, OBJECTIFICATION, SEXUAL VIOLENCE, and
MISOGYNY AND NON-SEXUAL VIOLENCE.</p>
        <p>Table 26 presents the top system performances for each data source and context. The overall pattern
mirrors previous tasks: tweet-based systems consistently outperform those on memes and videos,
especially in the probabilistic (Soft-Soft) context. However, absolute metrics are lower than for binary
or intention-based sexism detection, reflecting the increased complexity of the multilabel, hierarchical
setup and the annotation ambiguity intrinsic to these subtle categories.</p>
        <p>In all modalities, ICM-Soft Norm scores are considerably lower than in previous tasks, indicating that
systems struggle to accurately capture the distribution of annotator opinions and to model multilabel
uncertainty. Notably, even the best systems on tweets barely exceed 0.41 in ICM-Soft Norm, with further
drops for memes and videos.</p>
      </sec>
      <sec id="sec-7-4">
        <title>7.4. Performance Trends on Tweet-based Tasks (2023–2025)</title>
        <p>To better understand the progress in sexism detection over time, we compared the best-performing
systems across the three tweet-based tasks (Tasks 1.1, 1.2, and 1.3) in the last three editions of EXIST.
The results, shown in Table 27, include both ICM-Soft scores and their normalized counterparts (when
available).</p>
          <p>Table 27. Best ICM-Soft scores per edition (normalized scores in parentheses, when available):
Task 1.1: 2023 = 0.90; 2024 = 1.09 (0.68); 2025 = 1.06 (0.67).
Task 1.2: 2023 = -1.34; 2024 = -0.25 (0.48); 2025 = -0.43 (0.46).
Task 1.3: 2023 = -2.32; 2024 = -1.18 (0.44); 2025 = -1.10 (0.44).</p>
        <p>The data suggests a clear performance improvement from 2023 to 2024, likely reflecting the broader
adoption of large language models and increasingly refined prompt engineering and fine-tuning
strategies. This gain is particularly visible in the source intention and category classification tasks (1.2 and
1.3), which traditionally require more nuanced modeling.</p>
        <p>Interestingly, 2025 shows no clear progress over 2024, despite a significant increase in the number of
participants and submitted runs. In fact, the best normalized scores for Tasks 1.2 and 1.3 in 2025 are
slightly lower than the previous year. This raises the important question: are we reaching a performance
ceiling on these tasks when using the same dataset? One possible explanation is saturation — as systems
converge toward similar architectures and training data, gains become increasingly marginal. Moreover,
when using the same test data over multiple editions, top systems may begin to approach the upper
bounds of what can be achieved without new annotation rounds or more diverse evaluation settings.</p>
        <p>These findings highlight the importance of refreshing datasets, increasing task complexity, or shifting
focus to novel and underexplored modalities to maintain scientific progress and distinguish truly
innovative approaches.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusions</title>
      <p>The objective of the EXIST challenge is to foster research on the automatic detection and modeling
of sexism in online environments, with a particular emphasis on social networks. The 2025 edition of
the lab, organized as part of CLEF, attracted 114 participant teams and received a total of 873 system
runs. Participants explored a wide range of approaches, including vision transformer models, data
augmentation via automatic translation and duplication, the use of data from previous EXIST editions,
multilingual and Twitter-specific language models, as well as transfer learning from related domains
such as hate speech, toxicity, and sentiment analysis.</p>
      <p>The tasks in EXIST 2025 addressed the problem of sexism detection and classification across three types
of content—text (tweets), images (memes), and video (TikToks)—demonstrating the comprehensive and
multimodal scope of the challenge. This multimodal design reflects the complexity of real-world social
media platforms, where sexist messages may be conveyed through language, visuals, or a combination
of both.</p>
      <p>While many participating systems followed the conventional strategy of producing hard-label outputs,
a substantial number took advantage of the multi-annotator nature of the dataset to submit soft-label
predictions. This shift indicates a growing interest within the research community in building models
that can handle subjectivity, disagreement, and nuanced interpretations of harmful content.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work has been financed by the European Union (NextGenerationEU funds) through the “Plan de
Recuperación, Transformación y Resiliencia”, by the Ministry of Digital Transformation and by the
UNED University. However, the points of view and opinions expressed in this document are solely those
of the author(s) and do not necessarily reflect those of the European Union or European Commission.
Neither the European Union nor the European Commission can be considered responsible for them.
It has also been financed by the Spanish Ministry of Science and Innovation (project FairTransNLP
(PID2021-124361OB-C31 and PID2021-124361OB-C32)) funded by MCIN/AEI/10.13039/501100011033
and by ERDF, EU A way of making Europe, and by the Australian Research Council (DE200100064 and
CE200100005).</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to check grammar and
spelling.</p>
    </sec>
  </body>
  <back>
  </back>
</article>