<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modeling Annotator Subjectivity for Sexism Detection on Social Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qizhang Chen</string-name>
          <email>qizhangc13@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leilei Kong</string-name>
          <email>kongleilei1979@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanfu Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Changxin Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper presents Mumul03's submission to the EXIST 2025 shared task on sexism detection. We employ ModernBERT-large, used for the first time in this task, and incorporate annotator demographic information, such as gender, ethnicity, age, and other attributes, into the model input. By modeling individual annotator perspectives and aggregating predictions across sub-models, our system effectively captures subjectivity in the annotation process. Our system ranked 7th out of 64 submissions for Task 1 in the Soft-Soft category setting. This paper reports our findings on classifying sexism in social media text, offering substantial insights for the EXIST 2025 challenge.</p>
      </abstract>
      <kwd-group>
<kwd>Sexism detection</kwd>
        <kwd>Text classification</kwd>
        <kwd>Transformers</kwd>
        <kwd>EXIST 2025</kwd>
        <kwd>Natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sexism, defined as prejudice or discrimination based on sex or gender, often targets women and girls
through subtle and explicit expressions in everyday life [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Although public awareness has grown,
this bias remains deeply rooted and is increasingly manifest in online platforms [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Online sexism not
only undermines women’s self-perception and opportunities but also poses risks to young audiences
by normalizing misogynistic content. Recent studies show that platforms such as TikTok can quickly
expose users to such harmful material [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In this context, NLP technologies have become vital tools in detecting and mitigating online sexism
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We participate in the EXIST 2025 shared task to explore robust modeling approaches for this
challenge.
      </p>
      <p>
        As part of the EXIST 2025 evaluation, Task 1 is defined as a binary classification task that determines
whether a tweet—originally posted on Twitter (now rebranded as X)—contains sexist content, including
both explicit and implicit expressions of gender bias. Task 2 is a three-class classification task focused on
identifying the author’s intent, categorizing tweets into direct sexism, reported sexism, or judgmental
commentary [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These tasks are designed to improve automatic systems’ understanding of both the
presence and contextual framing of sexism on social media.
      </p>
      <p>
To effectively address these tasks, it is essential to select a model capable of understanding subtle
linguistic cues. ModernBERT-large offers an enhanced bidirectional attention mechanism and
context-aware encoding, enabling it to capture nuanced semantic signals such as sarcasm and implicit bias
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These features are crucial for detecting sexism, where meaning is often shaped by context and
subjectivity.
      </p>
      <p>
        However, relying on a single model to assess whether a statement is sexist introduces bias, as
such judgments are inherently subjective. Prior research shows that annotator backgrounds such
as gender and cultural context significantly influence their perceptions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Therefore, employing a
single predictive model risks masking these individual-level variations, ultimately undermining the
representational richness afforded by diverse annotator perspectives [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        To address this, we adopt the Learning with Disagreement (LwD) paradigm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and propose a
multi-model ensemble framework built on ModernBERT-large. Each sub-model represents a distinct annotator
perspective, and we aggregate their predictions to simulate a diversity of viewpoints. This approach
better reflects the pluralistic nature of human judgment in the detection of sexism.
      </p>
      <p>The rest of this paper is organized as follows. Section 2 reviews related work; Section 3 details our
system; Section 4 presents experiments and results; Section 5 concludes.</p>
    </sec>
    <sec id="sec-2">
<title>2. Related Work</title>
      <p>
        Recent advances in sexism detection have generally followed three strategic directions: multiview
ensembles, which aggregate predictions from multiple models or perspectives to improve robustness;
transformer fine-tuning, where large pretrained models are adapted to task-specific data using
techniques like data augmentation; and prompt-based large language models (LLMs), which use zero- or
few-shot in-context learning to perform sexism detection without task-specific fine-tuning. Each of
these approaches has shown strong empirical performance in benchmark evaluations such as EXIST [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and SemEval [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Multiview Ensembles</title>
        <p>
          Model ensembling enhances robustness, particularly in multilingual and cross-domain contexts. Methods
such as prediction averaging have been shown to improve F1 scores and reduce classification uncertainty
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. At EXIST 2024, the NYCU-NLP team implemented a multilingual, multitask architecture that
combined DeBERTa-v3 and XLM-RoBERTa with demographic features of annotators, achieving top
performance across all subtasks [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Similarly, the CIMAT-CS-NLP team integrated GPT-4-style models
with fine-tuned multilingual transformers, ranking among the top systems in Task 1 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The BAZI
team optimized XLM-RoBERTa for soft-label uncertainty and secured second place in Task 2 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Transformer Fine-tuning</title>
        <p>
          Transformer-based encoders are widely used for supervised sexism detection due to their contextual
understanding and multilingual adaptability. Commonly used models include XLM-RoBERTa,
DeBERTaV3, mBERT, and Spanish-specific variants like BETO and BERTIN. To improve generalization, teams
employed techniques such as soft-label learning, multi-label classification, and data augmentation. The
AlexPUPB team, for example, fine-tuned compact models like XLM-RoBERTa and MiniLMv2 using soft
labels derived from annotator distributions [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Back-translation was also used to augment data by
introducing lexical variation without altering semantics [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Prompt-based Large Language Models</title>
        <p>
          Prompt-based large language models (LLMs) have emerged as strong baselines in recent EXIST tasks
[
          <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
          ]. The mc-mistral team showed that a single Mistral-7B model, prompted with a compact few-shot
format mixing English and Spanish examples, outperformed many supervised baselines without
fine-tuning [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The CIMAT-CS-NLP team achieved competitive zero-shot performance with Google Gemini
by combining formal definitions of sexism and expert-role prompts [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. These results underscore the
potential of careful prompt design for low-resource, multilingual classification.
        </p>
        <p>
          Recent studies have extended prompting by incorporating socio-demographic context. Magnossão
de Paula et al. prompted LLMs with demographic personas (e.g., “a male over 45”) and found that
model outputs tended to align more with female annotators by default, though alignment effects were
inconsistent [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Jiang et al. further demonstrated that including annotator profiles—such as gender
and ideological stance—improved the model’s ability to replicate individual judgments [21]. These
approaches highlight both the promise and limitations of demographic-aware prompting for subjective
NLP tasks.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets and Evaluation Metrics</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>In this study, we only participated in the English subset of Task 1 (sexism identification) and Task 2
(source intention classification) under the soft evaluation setting. Therefore, this section focuses solely
on the portion of the dataset relevant to our experiments.</p>
        <p>The English subset of the EXIST 2025 dataset consists of 4,727 tweets, which are annotated for various
types of sexist expressions, including both explicit and reported sexism. The dataset is divided into a
training set (3,260 tweets), a development set (489 tweets), and a test set (978 tweets).</p>
        <p>For each sample, the following attributes are provided in a JSON format:
• id_EXIST: a unique identifier for the tweet.
• tweet: the text of the tweet.
• number_annotators: the number of persons that have annotated the tweet.
• annotators: a unique identifier for each of the annotators.
• gender_annotators: the gender of the different annotators (“F” or “M”, for female and male
respectively).
• age_annotators: the age group of the different annotators (grouped in “18–22”, “23–45”, or
“46+”).
• labels_task1: a set of labels (one for each of the annotators) that indicate if the tweet contains
sexist expressions or refers to sexist behaviors or not (“YES” or “NO”).
• labels_task2: a set of labels (one for each of the annotators) recording the intention of the
person who wrote the tweet (“DIRECT”, “REPORTED”, “JUDGEMENTAL”, “–”, and “UNKNOWN”).</p>
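<p>As an illustration, a record with the schema above maps naturally to one training example per annotator. The record below is fabricated for demonstration only (field names follow the list above; all values are invented):
```python
# Fabricated record following the EXIST 2025 field names described above.
record = {
    "id_EXIST": "100001",
    "tweet": "example tweet text",
    "number_annotators": 6,
    "annotators": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "gender_annotators": ["F", "F", "F", "M", "M", "M"],
    "age_annotators": ["18-22", "23-45", "46+", "18-22", "23-45", "46+"],
    "labels_task1": ["YES", "YES", "NO", "YES", "NO", "YES"],
}

def per_annotator_examples(rec):
    """Split one record into one (tweet, metadata, label) example per annotator."""
    return [
        {"tweet": rec["tweet"],
         "gender": rec["gender_annotators"][i],
         "age": rec["age_annotators"][i],
         "label": rec["labels_task1"][i]}
        for i in range(rec["number_annotators"])
    ]
```
</p>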
        <p>The dataset is annotated by a diverse group of individuals in terms of gender, age, ethnicity, education
level, and country of residence. This diversity contributes to the robustness and fairness of the
annotations, helping the dataset to capture a wide range of perspectives and reduce potential annotation
bias.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation Metrics</title>
<p>This evaluation targets systems that output probability distributions over categories instead of single-class predictions. To effectively assess system performance in scenarios involving multiple and potentially conflicting annotations, ICM-Soft (a soft-label extension of the Information Contrast Measure) is adopted as the official evaluation metric. Additionally, results are also reported using its normalized form (ICM-Soft Norm) and Cross Entropy, providing complementary views on prediction quality.
• ICM-Soft: a soft-label extension of the ICM metric, designed to compare the predicted probability distribution with the distribution of human annotations. It is particularly suitable for tasks where label disagreement or subjectivity is common, as it evaluates how closely the model’s predictions align with the collective opinions of annotators.
• ICM-Soft Norm: a normalized version of ICM-Soft, which rescales the raw scores to facilitate comparison across different tasks or systems.
• Cross Entropy: a widely used metric in classification tasks that measures the difference between the predicted and true probability distributions. Lower values indicate better alignment with the true distribution.</p>
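<p>For concreteness, the Cross Entropy metric can be computed directly between a gold annotator distribution and a predicted distribution; this is a minimal sketch, not the official evaluation script:
```python
import math

def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """H(p, q) = -sum over c of p(c) * log q(c); lower values are better."""
    return -sum(p * math.log(max(pred_dist.get(c, 0.0), eps))
                for c, p in true_dist.items())

# A gold distribution from 4 "YES" / 2 "NO" annotations, and two predictions.
gold = {"YES": 4 / 6, "NO": 2 / 6}
close = {"YES": 0.7, "NO": 0.3}
far = {"YES": 0.3, "NO": 0.7}
print(cross_entropy(gold, close), cross_entropy(gold, far))
```
The prediction closer to the gold distribution yields the lower score, matching the metric’s interpretation.</p>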
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. System Description</title>
      <p>Our architecture, illustrated in Figure 1, consists of multiple models based on transformers independently
trained. Each model receives a version of the same tweet augmented with the demographic information
of a specific annotator. These models are trained using a unified configuration and are designed to
capture diverse annotator perspectives. Their predictions are then aggregated to produce the final
output. The following subsections provide a detailed breakdown of our system’s construction, including
data preprocessing, metadata integration, model fine-tuning, and output aggregation.</p>
      <sec id="sec-4-1">
        <title>4.1. Data Preprocessing</title>
        <p>Given that the dataset comprises tweets from X, the textual content is inherently informal and often
includes elements such as hashtags, user mentions, emojis, and URLs. These noisy components can
hinder model performance if not properly normalized. To improve the quality and consistency of
the input data, we applied a series of preprocessing steps aimed at reducing noise and linguistic
variability—transforming raw tweets into normalized, standardized versions more suitable for model
input. Below, we outline the key preprocessing operations applied during this stage:
1. Hashtags were stripped of the “#” symbol to retain only the core word.
2. User mentions were replaced with the placeholder token user.
3. URLs were substituted with the token url.
4. Emojis were converted into their textual representations using the emoji Python package, enabling
standardized tokenization and reducing encoding inconsistencies.</p>
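<p>The four preprocessing steps can be sketched as a single normalization function; the regular expressions are our own simplified approximations of the rules described above:
```python
import re

def normalize_tweet(text):
    # 1. Strip the "#" symbol from hashtags, retaining only the core word.
    text = re.sub(r"#(\w+)", r"\1", text)
    # 2. Replace user mentions with the placeholder token "user".
    text = re.sub(r"@\w+", "user", text)
    # 3. Substitute URLs with the token "url".
    text = re.sub(r"https?://\S+", "url", text)
    # 4. Convert emojis to text via the third-party `emoji` package, if available.
    try:
        import emoji
        text = emoji.demojize(text)
    except ImportError:
        pass
    return text.strip()

print(normalize_tweet("@user1 check this #SexismWatch https://t.co/abc"))
```
</p>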
        <p>To further increase data diversity and robustness, we employed the AEDA (An Easier Data
Augmentation) technique [22]. AEDA introduces controlled perturbations by randomly inserting punctuation
marks (e.g., ., ;, ?, :, !, ,) into sentences. In this way, the normalized tweets are transformed into
augmented versions that enrich the training data without altering semantic content, thereby promoting
model generalization.</p>
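<p>A minimal AEDA sketch following the description above (the insertion ratio and seeding are our own illustrative choices, not necessarily those of the original paper):
```python
import random

PUNCT = [".", ";", "?", ":", "!", ","]

def aeda(sentence, ratio=1 / 3, seed=0):
    """Randomly insert punctuation marks between words without altering content."""
    rng = random.Random(seed)
    words = sentence.split()
    n_insert = max(1, int(len(words) * ratio))
    positions = rng.sample(range(1, len(words) + 1), min(n_insert, len(words)))
    for pos in sorted(positions, reverse=True):
        words.insert(pos, rng.choice(PUNCT))
    return " ".join(words)

print(aeda("this is a simple example sentence"))
```
Filtering the inserted punctuation back out recovers the original sentence, i.e., the semantic content is untouched.</p>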
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Integrating Metadata</title>
        <p>Each data sample is annotated by up to six annotators, with each annotator providing demographic
metadata, including gender, age, ethnicity, education level, and country of residence. We split the
dataset by annotator, resulting in six parallel subsets. Each subset contains the same tweet texts but is
paired with metadata corresponding to a specific annotator, forming one-to-one tweet–annotator pairs.</p>
        <p>To construct the input, we concatenate the original tweet with the annotator’s metadata using the
[SEP] token as a separator between fields. This format allows both the tweet content and the associated
demographic information to be included in a single input sequence. Figure 2 illustrates an example of
this formatting.</p>
        <p>This method integrates annotator information directly into the model input without requiring
architectural modifications. Each tweet is thus associated with multiple instances, each reflecting a
distinct annotator perspective.</p>
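<p>A minimal sketch of this input construction (the exact field order and verbalization of the metadata are illustrative assumptions; Figure 2 shows the actual format):
```python
def build_input(tweet, meta, sep="[SEP]"):
    """Concatenate a tweet with one annotator's demographic metadata,
    separating fields with the [SEP] token."""
    fields = [tweet] + [f"{key}: {value}" for key, value in meta.items()]
    return f" {sep} ".join(fields)

meta = {"gender": "F", "age": "23-45"}
print(build_input("example tweet text", meta))
```
</p>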
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Fine-tuning Models</title>
        <p>To model the subjective perspectives of individual annotators, we fine-tune six model instances
independently. While all sub-models share the same architecture and training configuration, each is trained
on a distinct subset created by pairing tweets with the metadata of a specific annotator. Input sequences
are formed by concatenating the tweet text with the annotator’s demographic information, ensuring
structural consistency across samples while enabling the model to capture perspective-specific patterns.</p>
<p>During training, we use the standard cross-entropy loss function as the optimization objective, defined as:
ℒ_CE = −∑_{c=1}^{C} y_c · log(p_c)
where y_c denotes the one-hot encoded ground-truth label for class c, and p_c is the predicted probability for class c. This loss function measures the discrepancy between prediction and truth, helping improve classification accuracy.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Post Processing</title>
<p>The final output is obtained by aggregating predictions from six independently fine-tuned models, each generating a discrete class label for a given input. We then aggregate these class predictions to construct a soft label, i.e., a probability distribution reflecting the relative frequency of each predicted class. To formalize this process, the probability p̂_c of class c is computed as:</p>
<p>p̂_c = (1/M) · ∑_{m=1}^{M} I(f_m(x) = c)</p>
<p>where M denotes the number of models, f_m(x) is the predicted label from the m-th model, and I(·) is the indicator function. This method captures both model consensus and uncertainty, aligning with the Soft-Soft evaluation protocol and effectively reflecting the subjective variation among annotators.</p>
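<p>The aggregation rule above amounts to a relative-frequency count over the six sub-model votes; a minimal sketch:
```python
def aggregate(votes, classes=("YES", "NO")):
    """Soft label: p_hat(c) = (1/M) * sum over m of 1[f_m(x) == c]."""
    m = len(votes)
    return {c: sum(1 for v in votes if v == c) / m for c in classes}

# Six sub-model predictions for one tweet.
print(aggregate(["YES", "YES", "NO", "YES", "NO", "YES"]))
```
</p>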
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments and Results</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setting</title>
        <p>All experiments were conducted on a single NVIDIA A800 GPU (80GB memory). We utilized the
HuggingFace transformers library for model training, with AdamW as the optimizer. The learning rate
was set to 2e-5, and the weight decay was 0.1. Training was carried out for 8 epochs with a batch size
of 16, and a linear warmup strategy was applied with 10% warmup ratio. The maximum input sequence
length was set to 128. Mixed-precision training (fp16) was enabled to improve memory efficiency.
Model evaluation was performed at the end of each epoch, and the checkpoint with the best validation
performance was saved.</p>
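<p>The hyperparameters above translate into a HuggingFace configuration along these lines (a sketch assuming a standard Trainer setup; argument names may vary slightly across transformers versions, and the 128-token limit is applied at tokenization time rather than here):
```python
# Requires the third-party `transformers` package; AdamW is the Trainer default.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",          # where per-epoch checkpoints are saved
    learning_rate=2e-5,
    weight_decay=0.1,
    num_train_epochs=8,
    per_device_train_batch_size=16,
    warmup_ratio=0.1,                    # linear warmup over 10% of steps
    fp16=True,                           # mixed-precision training
    evaluation_strategy="epoch",         # evaluate at the end of each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the best validation checkpoint
)
```
</p>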
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Main Results on Development Set</title>
        <p>This section presents the results of our models during the development phase, under the soft-label
evaluation setting. We assess three transformer-based architectures: ModernBERT-large,
DeBERTaV3-large, and XLM-RoBERTa-large. Each model was tested across four configurations: (1) the baseline
model, (2) baseline + data augmentation (DA), (3) baseline + annotator metadata integration (AMI),
and (4) baseline + data augmentation (DA) + annotator metadata integration (AMI). To ensure a fair
comparison, Task 2 models used ground-truth labels from Task 1 for the hierarchical structure, instead
of predictions, allowing us to isolate the effect of architecture and training strategy.</p>
        <p>The results, as shown in Table 1, demonstrate that combining data preprocessing, data augmentation
(DA), and annotator metadata integration (AMI) leads to consistent performance improvements across
all models. In Task 1, XLM-RoBERTa-large improved from an ICM-Soft score of 0.5042 (baseline) to
0.7123 (+DA+AMI), and DeBERTaV3-large from 0.5187 to 0.7486. Similar gains were observed in Task 2,
confirming the effectiveness of this strategy.</p>
        <p>When examining the individual contributions of DA and AMI, we find that AMI yields stronger
improvements when applied independently. For instance, with ModernBERT-large, the ICM-Soft score
increased from 0.5619 (baseline) to 0.6937 using AMI, whereas DA achieved 0.6781—an approximate
13% relative improvement in favor of AMI. This suggests that annotator metadata contributes more
directly to modeling subjectivity and improving the accuracy of learned label distributions.</p>
        <p>Among all architectures, ModernBERT-large consistently achieved the best performance, with
ICM-Soft scores of 0.7839 in Task 1 and -3.2791 in Task 2. Its strong adaptability to different annotator
perspectives further demonstrates its superior ability to model subjective label distributions and to
address the complex classification challenges posed by sexism detection. These results validate our
choice of using it as the backbone for the EXIST 2025 task.</p>
        <p>To further understand the model’s behavior, we conducted an analysis of misclassified samples.
We found that tweets with strong emotional tone or vulgar language were often misjudged as
sexist, even when annotated as non-sexist. This suggests a bias in handling emotionally charged but
gender-irrelevant content. Moreover, the model exhibited instability on samples with high annotator
disagreement, reflecting the need for further improvement in handling highly subjective or semantically
ambiguous texts.</p>
      </sec>
      <sec id="sec-5-3">
<title>5.3. Official Leaderboard Performance</title>
        <p>Our final submission was based on the ModernBERT-large model, incorporating data augmentation
and annotator metadata integration. In the final evaluation of EXIST 2025, our system ranked 7th out
of 64 submissions for Task 1 in the Soft-Soft setting, and 20th out of 53 submissions for Task 2.</p>
        <p>As illustrated in Tables 2 and 3, our proposed approach demonstrates competitive performance,
particularly in Task 1, where it ranked 7th with an ICM-Soft score of 0.7135—substantially exceeding the
baseline systems, which yielded scores of -2.1991 and -3.8158. While the system achieved a relatively
lower rank of 20th in Task 2, it still outperformed the majority of baseline models in both ICM-Soft and
cross-entropy metrics, reflecting its robustness and adaptability across tasks. These results underscore
the efectiveness of the ModernBERT-large architecture in modeling annotator subjectivity. Overall,
our method successfully captures the distributional nature of annotator disagreement under the
Soft-Soft evaluation paradigm, delivering high-fidelity probabilistic outputs without introducing additional
architectural complexity, thereby highlighting its practicality and scalability for real-world deployment.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper presents our approach for Tasks 1 and 2 of the EXIST 2025 shared task. We employed
the ModernBERT-large model, enhanced through data augmentation and the integration of annotator
demographic metadata to better model subjectivity in sexism detection. Our system demonstrated
strong performance, ranking 7th in Task 1 under the Soft-Soft evaluation setting. These results highlight
the effectiveness and practicality of our method in capturing annotator disagreement and addressing
the nuanced nature of social bias in language.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is supported by the Quality Engineering Projects for Teaching Quality and Teaching Reform
in Undergraduate Colleges and Universities of Guangdong Province (No.20251067).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
<p>During the preparation of this work, the author(s) used GPT-o3 for grammar and spelling checking. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
<p>[21] A. Jiang, N. Vitsakis, T. Dinkar, G. Abercrombie, I. Konstas, Re-examining sexism and misogyny classification with annotator attitudes, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 15103–15125. URL: https://aclanthology.org/2024.findings-emnlp.887/. doi:10.18653/v1/2024.findings-emnlp.887.
[22] A. Karimi, L. Rossi, A. Prati, AEDA: An easier data augmentation technique for text classification, 2021. URL: https://arxiv.org/abs/2108.13230. arXiv:2108.13230.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          , I. Arcos,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <article-title>Overview of exist 2025 - learning with disagreement for sexism identification and characterization in tweets, memes, and tiktok videos</article-title>
          , in: J.
          <string-name>
            <surname>Carrillo-de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. C. de Albornoz</surname>
            , I. Arcos,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Morante</surname>
          </string-name>
          , Overview of exist 2025:
          <article-title>Learning with disagreement for sexism identification and characterization in tweets, memes, and tiktok videos (extended overview)</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
<article-title>Perpetuating online sexism offline: Anonymity, interactivity, and the effects of sexist hashtags on social media</article-title>
          ,
          <source>Computers in Human Behavior</source>
          <volume>52</volume>
          (
          <year>2015</year>
          )
          <fpage>436</fpage>
          -
          <lpage>442</lpage>
. URL: https://www.sciencedirect.com/science/article/pii/S0747563215004641. doi:10.1016/j.chb.2015.06.024.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[4] University College London,
          <article-title>Social media algorithms amplify misogynistic content to teens</article-title>
          , https://www.ucl.ac.uk/news/2024/feb/social-media-algorithms-amplify-misogynistic-content-teens,
          <year>2024</year>
          . UCL News, published February 5, 2024.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>R. P.</given-names> <surname>Díaz Redondo</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Fernández Vilas</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ramos Merino</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Valladares Rodríguez</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Torres Guijarro</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Hafez</surname></string-name>,
          <article-title>Anti-sexism alert system: Identification of sexist comments on social media using AI techniques</article-title>,
          <source>Applied Sciences</source> <volume>13</volume> (<year>2023</year>) 4341.
          URL: http://dx.doi.org/10.3390/app13074341. doi:10.3390/app13074341.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>B.</given-names> <surname>Warner</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Chaffin</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Clavié</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Weller</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Hallström</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Taghadouini</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Gallagher</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Biswas</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Ladhak</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Aarsen</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Cooper</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Adams</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Howard</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Poli</surname></string-name>,
          <article-title>Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference</article-title>, <year>2024</year>.
          URL: https://arxiv.org/abs/2412.13663. arXiv:2412.13663.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>N.</given-names> <surname>Tahaei</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bergler</surname></string-name>,
          <article-title>Analysis of annotator demographics in sexism detection</article-title>,
          in: <source>Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)</source>,
          Association for Computational Linguistics, Bangkok, Thailand, <year>2024</year>, pp. <fpage>376</fpage>-<lpage>383</lpage>.
          URL: https://aclanthology.org/2024.gebnlp-1.24/. doi:10.18653/v1/2024.gebnlp-1.24.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>V.</given-names> <surname>Prabhakaran</surname></string-name>,
          <string-name><given-names>A. M.</given-names> <surname>Davani</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Díaz</surname></string-name>,
          <article-title>On releasing annotator-level labels and information in datasets</article-title>,
          <source>ArXiv</source> abs/2110.05699 (<year>2021</year>). URL: https://api.semanticscholar.org/CorpusID:238634705.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>H.</given-names> <surname>Kirk</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Yin</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Vidgen</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Röttger</surname></string-name>,
          <article-title>SemEval-2023 task 10: Explainable detection of online sexism</article-title>,
          in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.),
          <source>Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</source>,
          Association for Computational Linguistics, Toronto, Canada, <year>2023</year>, pp. <fpage>2193</fpage>-<lpage>2210</lpage>.
          URL: https://aclanthology.org/2023.semeval-1.305/. doi:10.18653/v1/2023.semeval-1.305.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>F.</given-names> <surname>Wenzel</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Snoek</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Tran</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Jenatton</surname></string-name>,
          <article-title>Hyperparameter ensembles for robustness and uncertainty quantification</article-title>,
          in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems</source>, volume <volume>33</volume>,
          Curran Associates, Inc., <year>2020</year>, pp. <fpage>6514</fpage>-<lpage>6527</lpage>.
          URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/481fbfa59da2581098e841b7afc122f1-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>Y.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Hsieh</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Lee</surname></string-name>,
          <article-title>Multitask multilingual learning with annotator demographics for sexism detection</article-title>,
          in: <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>,
          volume <volume>3740</volume> of CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, <year>2024</year>.
          URL: https://ceur-ws.org/Vol-3740/. EXIST 2024 Lab at CLEF 2024.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>V.</given-names> <surname>Reyes-Meza</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gómez-Adame</surname></string-name>,
          <string-name><given-names>H. J.</given-names> <surname>Escalante</surname></string-name>,
          <article-title>Hybrid systems for sexism detection: Combining GPT-4 and multilingual transformers</article-title>,
          in: <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>,
          volume <volume>3740</volume> of CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, <year>2024</year>.
          URL: https://ceur-ws.org/Vol-3740/paper-120.pdf. EXIST 2024 Lab at CLEF 2024.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>L.</given-names> <surname>Bazikyan</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Pérez</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Gupta</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Sánchez</surname></string-name>, et al.,
          <article-title>Soft label optimization with XLM-RoBERTa for multilingual sexism detection</article-title>,
          in: <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>,
          volume <volume>3740</volume> of CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, <year>2024</year>.
          URL: https://ceur-ws.org/Vol-3740/paper-XYZ.pdf. EXIST 2024 Lab at CLEF 2024.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>A.</given-names> <surname>Petrescu</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Rivas-Gervilla</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Gutiérrez-Fandiño</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Plaza</surname></string-name>,
          <article-title>AlexPUPB at EXIST 2023: Identifying sexism in social networks</article-title>,
          in: <source>Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum</source>,
          volume <volume>3497</volume> of CEUR Workshop Proceedings, <year>2023</year>.
          URL: https://ceur-ws.org/Vol-3497/paper-088.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>D. R.</given-names> <surname>Beddiar</surname></string-name>,
          <string-name><given-names>M. S.</given-names> <surname>Jahan</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Oussalah</surname></string-name>,
          <article-title>Data expansion using back translation and paraphrasing for hate speech detection</article-title>,
          <source>ArXiv</source> abs/2106.04681 (<year>2021</year>). URL: https://api.semanticscholar.org/CorpusID:235376976.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>L.</given-names> <surname>Plaza</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Carrillo-de Albornoz</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Morante</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Amigó</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gonzalo</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Spina</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <article-title>Overview of EXIST 2023 - Learning with Disagreement for Sexism Identification and Characterization</article-title>,
          in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>,
          Springer Nature Switzerland, Thessaloniki, Greece, <year>2023</year>, pp. <fpage>316</fpage>-<lpage>342</lpage>.
          doi:10.1007/978-3-031-42448-9_23.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>L.</given-names> <surname>Plaza</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Carrillo-de Albornoz</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Ruiz</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Maeso</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Chulvi</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Amigó</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gonzalo</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Morante</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Spina</surname></string-name>,
          <article-title>Overview of EXIST 2024 - Learning with disagreement for sexism identification and characterization in tweets and memes</article-title>,
          in: <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>,
          Springer Nature Switzerland, Grenoble, France, <year>2024</year>, pp. <fpage>93</fpage>-<lpage>117</lpage>.
          doi:10.1007/978-3-031-71908-0_5.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>M.</given-names> <surname>Siino</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Tinnirello</surname></string-name>,
          <article-title>Prompt engineering for identifying sexism using GPT Mistral 7B</article-title>,
          in: <source>Working Notes of CLEF 2024</source>, volume <volume>3740</volume> of CEUR Workshop Proceedings, <year>2024</year>, pp. <fpage>1228</fpage>-<lpage>1236</lpage>.
          URL: https://ceur-ws.org/Vol-3740/paper-115.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><given-names>J.</given-names> <surname>Tavárez-Rodríguez</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Sánchez-Vega</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rosales-Pérez</surname></string-name>,
          <string-name><given-names>A. P.</given-names> <surname>López-Monroy</surname></string-name>,
          <article-title>Better together: LLM and neural classification transformers to detect sexism</article-title>,
          in: <source>Working Notes of CLEF 2024</source>, volume <volume>3740</volume> of CEUR Workshop Proceedings, <year>2024</year>, pp. <fpage>1260</fpage>-<lpage>1273</lpage>.
          URL: https://ceur-ws.org/Vol-3740/paper-118.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><given-names>A. F. M.</given-names> <surname>de Paula</surname></string-name>,
          <string-name><given-names>J. S.</given-names> <surname>Culpepper</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Moffat</surname></string-name>,
          <string-name><given-names>S. P.</given-names> <surname>Cherumanal</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Scholer</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Trippas</surname></string-name>,
          <article-title>The effects of demographic instructions on LLM personas</article-title>, <year>2025</year>.
          URL: https://arxiv.org/abs/2505.11795. doi:10.48550/arXiv.2505.11795. arXiv:2505.11795, accepted at SIGIR 2025, Padua, Italy.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>