<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLiC at EXIST 2025: Combining Fine-tuning and Prompting with Learning with Disagreement for Sexism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pol Pastells</string-name>
          <email>pol.pastells@ub.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauro Vázquez</string-name>
          <email>mauro.vazquez@ub.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mireia Farrús</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariona Taulé</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre de Llenguatge i Computació (CLiC), Universitat de Barcelona</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Complex Systems (UBICS), Universitat de Barcelona</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We present the CLiC group's participation in the EXIST 2025 shared task, focusing on sexism detection in social media content. Our work addresses three subtasks: sexism identification (Task 1.1), source intention detection (Task 1.2), and sexism categorization (Task 1.3). We employed BERT [1] fine-tuning for Task 1.1 (binary sexism classification) and DSPy-based prompt optimization for Tasks 1.2 and 1.3, leveraging the initial classification outcomes. A key aspect of our approach is a Learning with Disagreement framework that utilizes annotator demographic information to model diverse perceptions of sexism. Our experimental design included three runs, exploring BERT-based methods for Task 1.1 and contrasting prompt-based methods, including variants with annotator information and Retrieval-Augmented Generation (RAG), for the subsequent tasks. Results demonstrate that BERT fine-tuning significantly surpassed prompt-based methods for Task 1.1, where our approach secured 9th place out of 67 participants in the soft label category. The integration of annotator information proved vital, leading to substantial performance gains across all tasks. The impact of RAG, however, remained inconclusive. These findings highlight the enduring effectiveness of fine-tuned models for core classification, while emphasizing the necessity of annotator-aware approaches for handling subjective concepts like sexism. Our code is available at https://github.com/clic-ub/EXIST_2025.</p>
      </abstract>
      <kwd-group>
        <kwd>Sexism identification</kwd>
        <kwd>sexism categorization</kwd>
        <kwd>learning with disagreement</kwd>
        <kwd>prompting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sexism detection in social media has become increasingly important as online platforms struggle to
moderate harmful content. The EXIST 2025 challenge [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] addresses this need through multimodal
evaluation, though our participation focused specifically on the textual components: Task 1.1 (sexism
identification), Task 1.2 (source intention detection), and Task 1.3 (sexism categorization). While
transformer-based fine-tuning has dominated recent EXIST editions, large language models (LLMs)
have achieved state-of-the-art performance across numerous NLP tasks through prompt engineering.
This creates an important methodological gap: shared tasks continue relying on fine-tuning approaches
despite LLMs’ broader success with prompt-based methods.
      </p>
      <p>
        Motivated by this, our primary objective was to investigate the performance of prompt-based methods,
specifically using DSPy [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for systematic prompt optimization, in text classification problems within the
EXIST framework. DSPy automatically generates and refines prompts through latent space exploration,
offering a more principled comparison with traditional fine-tuning than manual prompt engineering.
      </p>
      <p>
        We also employed a BERT fine-tuning approach for Task 1.1. This served as a well-tested baseline for
classification tasks (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for example) and provided a strong foundation of binary sexism classification
upon which to build for Tasks 1.2 and 1.3. Comparing this fine-tuning approach with the prompting
techniques allowed us to evaluate the viability of relying solely on methods like few-shot prompting,
example selection, and instruction optimization.
      </p>
      <p>Beyond evaluating different modeling paradigms, we specifically aimed to assess the impact of
incorporating annotator information and retrieval-augmented generation (RAG) on model
performance. Recognizing that the perception of sexism varies across demographic groups, our approach integrates
annotator perspectives through a Learning with Disagreement (LeWiDi) framework. We systematically
evaluated whether incorporating these annotator perspectives and RAG improves performance across
the different modeling approaches tested.</p>
      <p>To investigate these research questions, we designed three distinct runs for each task, summarized in
Table 1. These runs allowed us to compare the BERT baseline, prompting with RAG, prompting with
annotator information (AnI), and a combination of prompting, RAG, and AnI across the three EXIST
subtasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The EXIST challenge has driven significant advances in automated sexism detection since its inception
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Notable approaches from recent editions include multilingual and monolingual BERT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] models
with ensemble strategies, with winning systems typically employing combinations of transformer
models such as mBERT, XLM-RoBERTa [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and RoBERTa [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] variants [
        <xref ref-type="bibr" rid="ref5 ref9">5, 9</xref>
        ]. These approaches have
consistently demonstrated that transformer-based models outperform traditional machine learning
methods for sexism detection tasks.
      </p>
      <p>
        Traditional annotation approaches favor majority opinion when multiple annotators disagree,
potentially overlooking valuable insights that could enhance model effectiveness. The Learning with
Disagreement (LeWiDi) framework [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] addresses this limitation by incorporating annotator
perspectives directly into the learning process, moving beyond simple majority voting to leverage the full
spectrum of annotator disagreement as a source of information rather than noise.
      </p>
      <p>
        Despite large language models (LLMs) achieving state-of-the-art performance across numerous
NLP tasks, shared tasks like EXIST continue to be dominated by BERT-based fine-tuning approaches.
There has been limited exploration of prompt engineering techniques for sexism detection, with only
one attempt at using prompt engineering on EXIST 2024 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This gap between the broader NLP
landscape and shared task methodologies leaves systematic prompt optimization and comprehensive
comparisons with fine-tuning approaches underexplored. Our work addresses this gap by comparing
BERT fine-tuning with DSPy-based automated prompt optimization while incorporating the Learning
with Disagreement framework across multiple sexism detection subtasks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>The EXIST 2025 Task 1 dataset contains 6,920 training tweets (3,660 Spanish, 3,260 English) with
annotations from 6 demographically diverse annotators per instance. Each annotator is characterized
by age, gender, ethnicity, education level, and country, enabling perspective-aware modeling. The
development and test sets have 1,038 and 2,076 instances, respectively. The instances provided include
the language of the tweet (lang), the content (text), and annotator demographics (gender, age, ethnicity,
study level, country), for the 6 annotators involved in each example. In terms of age and gender, the
dataset is completely balanced, and for the other annotator details, there is no apparent bias.</p>
      <sec id="sec-3-1">
        <title>3.1. Preprocessing</title>
        <p>For both training and inference, we preprocessed tweets by removing URLs and user mentions,
converting emojis to their textual descriptions, and retaining all hashtags. Following the LeWiDi framework,
we leverage annotator disagreement as signal rather than noise. Each original instance was expanded
into 6 annotator-specific examples (see Section 4.2).</p>
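        <p>The preprocessing steps above can be sketched as follows. The helper name and regular expressions are our own illustration, and emoji conversion is shown with a tiny stand-in lexicon rather than a full mapping (a real run could use, e.g., the emoji package's demojize function):</p>

```python
import re

# Tiny stand-in for a full emoji-to-text lexicon (illustrative only).
EMOJI_TEXT = {"😂": ":face_with_tears_of_joy:", "❤": ":red_heart:"}

def preprocess_tweet(text: str) -> str:
    """Remove URLs and user mentions, textualize emojis, keep hashtags."""
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)           # drop user mentions
    for emoji_char, description in EMOJI_TEXT.items():
        text = text.replace(emoji_char, f" {description} ")
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(preprocess_tweet("Check this https://t.co/abc @user #metoo 😂"))
# → Check this #metoo :face_with_tears_of_joy:
```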
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Our approach leverages two distinct methodologies to tackle the three subtasks of sexism detection. For
Task 1.1 (binary sexism identification), we employ traditional BERT fine-tuning with annotator-aware
prompts to establish a strong baseline classification. For Tasks 1.2 and 1.3 (multiclass and multilabel
classification), we use DSPy’s prompt optimization framework, building upon the binary predictions
from Task 1.1. This hybrid approach allows us to compare the effectiveness of fine-tuned models versus
prompt-engineered large language models while systematically evaluating the impact of annotator
information and retrieval-augmented generation across all tasks.</p>
      <p>All experiments were conducted on a single RTX 4090 GPU with 24GB VRAM.</p>
      <sec id="sec-4-1">
        <title>4.1. DSPy and MIPROV2</title>
        <p>
          DSPy is a Python framework [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] that aims to improve prompt quality. Instead of dealing with hard-coded
prompts, it focuses on developing a systematic parameterized approach to optimize each component
using actual code. The parameters for each module in the pipeline include the LLM, the input and
output fields, and the few-shot examples.
        </p>
        <p>
          We were motivated to pursue prompt optimization over weight optimization by the strong results
in the Better Together paper [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Their core finding is that jointly optimizing prompts and weights
improves performance more than either alone. However, they also show that prompt optimization
alone often outperforms weight optimization across three models and three tasks, and in some cases, it
even rivals the combined approach.
        </p>
        <p>
          As an optimizer, we selected MIPROv2, the faster and more accurate version of MIPRO [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], according
to DSPy’s benchmarks. At its core, it uses an iterative loop where it generates some prompt instructions
as well as a set of few-shot examples, tests this prompt on a batch of training data, and evaluates the
performance using a provided metric.
        </p>
        <p>To generate satisfactory instructions (see Figure 1a), MIPROv2 may use another LLM called the "proposer"
(the same LLM in our case) that leverages the available context and information for the task. This includes
summaries of the data properties, input/output descriptors, a description of the prediction pipeline,
and some successful task executions. It also receives a history of previously tested prompts along with
their performance. To obtain demonstrations, the optimizer performs bootstrapping on the available
training data to get candidates and then generates sets of them via random sampling. Finally, it uses
Bayesian Optimization to search the space of possibilities, assigning performance scores to prompt
components.</p>
        <p>The implementation of MIPROv2 in DSPy allows for flexible configuration depending on
the task and available data. The max_labeled_demos parameter sets the maximum number
of few-shot examples taken from the training set, while max_bootstrapped_demos controls
how many of them can be generated via bootstrapping (augmented). In addition, MIPROv2 provides
three levels of exploration: light, medium, and heavy.</p>
        <p>DSPy also offers predefined modules to produce outputs. In our case, we
used ChainOfThought, which forces the model to output a reasoning field before making a prediction,
increasing explainability and taking advantage of additional test-time computation.</p>
        <p>(Figure 1b: the RAG step retrieves the most similar training example, which is added to the final prompt.)</p>
        <p>
          To perform optimization on the prompts and inference over the tasks, we used the open-source model
Qwen2.5-7B-Instruct [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task 1.1: Sexism Identification in Tweets</title>
        <p>Task 1.1 was a binary classification problem, where each tweet must be classified as either sexist or
non-sexist.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. BERT models</title>
          <p>
            We fine-tuned ModernBERT-large [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] with the English tweets and RoBERTa-large-BNE [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] with the
Spanish ones. We decided to add the given annotator information for context, as providing context
to BERT models may improve the results [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ], as well as to take into account the possible biases each
annotator may have. Thus, we performed both training and prediction using each annotator's information. We
cleaned the annotator information to construct a prompt that was fed to the BERT models (technically
we modified the input text, as BERT is not an instruction-tuned model and does not take prompts as input). For
English, the prompt had the structure shown in Listing 1, which (for text id 600,253) leads to Example (a).
For Spanish, we translated the annotator information and used a Spanish prompt. This way, we obtained
6 predictions for each text, which we can compare with the 6 human annotations.
          </p>
          <p>Listing 1: Prompt generation function for English text</p>
          <p>english_prompt = ("Given the following text: \n{ " + row.text + " }\n"
                  f"A {row.age} year-old {row.ethnicities} "
                  f"{row.gender} from {row.countries} {row.study_levels} "
                  "perceives it as sexist?")</p>
          <p>(a) Given the following text:
{ Its nice that young women have a rapist to look up to! She really is an icon of empowerment.
Women aren’t guilty of rape if they identify as innocent. }
A 46+ year-old White or Caucasian woman from Spain with a Bachelor’s degree perceives it as
sexist?</p>
          <p>Furthermore, the models were fine-tuned for a regression task using soft labels. The global soft label
for each text was computed as the average of the 6 annotators' hard labels (see Equation 1), and the soft label for
each annotator was set to the average of the global soft label and the vote of the specific annotator (the
hard label, which can only be 0 or 1), as shown in Equation 2.</p>
          <p>SoftLabel_t = (1/6) ∑_{a ∈ Annotators} HardLabel_{t,a},   (1)</p>
          <p>SoftLabel_{t,a} = (SoftLabel_t + HardLabel_{t,a}) / 2,   (2)</p>
          <p>where t refers to the text index and a to the annotator.</p>
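          <p>A minimal sketch of the soft-label construction in Equations 1 and 2 (the function and variable names are ours):</p>

```python
def soft_labels(hard_labels: list[int]) -> tuple[float, list[float]]:
    """Compute the global soft label (Eq. 1) and the per-annotator soft
    labels (Eq. 2) from the 6 binary annotator votes of one text."""
    global_soft = sum(hard_labels) / len(hard_labels)             # Eq. (1)
    per_annotator = [(global_soft + h) / 2 for h in hard_labels]  # Eq. (2)
    return global_soft, per_annotator

g, per = soft_labels([1, 1, 0, 0, 0, 0])
print(g, per)  # global soft label 1/3; e.g. (1/3 + 1) / 2 for a "sexist" vote
```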
          <p>Both models were trained using a context length of 256 tokens for a maximum of 5 epochs, with
a batch size of 32. We validated every 100 steps and kept the best model. The learning rate for
RoBERTa-large-BNE was set to 5 × 10⁻⁶ and for ModernBERT-large to 1 × 10⁻⁵.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Using RAG and Annotator Information</title>
          <p>In this particular run, to optimize the initial prompt, we used MIPROv2 with the heavy configuration,
accuracy as the training metric, max_bootstrapped_demos = 4 and max_labeled_demos = 6. We
also differentiated between languages, creating two separate prompts.</p>
          <p>
            For each inference example, a Retrieval-Augmented Generation (RAG) step was applied to the initial
prompt. This process, illustrated in Figure 1b, involved retrieving the most similar example from the
training set. The retrieved example, along with its soft labels (representing the combined predictions of
the 6 annotators), was then added to the prompt. This provided the model with insight into how similar
queries were handled during training. Tweet text similarity was calculated using the all-MiniLM-L6-v2
model (a fine-tuned version of [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] created by SBERT).
          </p>
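          <p>The retrieval step can be sketched as follows. We assume tweet embeddings have already been computed (in our pipeline, with all-MiniLM-L6-v2); the toy vectors and function names below are illustrative:</p>

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query_emb, train_embs, train_examples):
    """Return the training example whose embedding is closest to the query."""
    best = max(range(len(train_embs)), key=lambda i: cosine(query_emb, train_embs[i]))
    return train_examples[best]

# Toy 2-d embeddings standing in for 384-d sentence embeddings.
train = [("tweet A", 0.2), ("tweet B", 0.9)]   # (text, soft label)
embs = [[1.0, 0.0], [0.6, 0.8]]
print(most_similar([0.7, 0.7], embs, train))   # → ('tweet B', 0.9)
```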
          <p>Then, with the specific prompt for each test example, we predicted whether the text is sexist or not for
each of the 6 annotators. To obtain the soft and hard labels from the per-annotator predictions
(each transformed into a binary value, 0 or 1), we used the intuitive approach:</p>
          <p>SoftLabelPred_t = (1/6) ∑_{a ∈ Annotators} Prediction_{t,a},   (3)</p>
          <p>HardLabelPred_t = 0 if SoftLabelPred_t ≤ 0.5, and 1 if SoftLabelPred_t &gt; 0.5.   (4)</p>
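          <p>Equations 3 and 4 amount to averaging the 6 binary predictions and thresholding at 0.5; a sketch with our own names:</p>

```python
def aggregate_predictions(preds: list[int]) -> tuple[float, int]:
    """Combine 6 per-annotator binary predictions into a soft label (Eq. 3)
    and a hard label obtained by thresholding at 0.5 (Eq. 4)."""
    soft = sum(preds) / len(preds)   # Eq. (3)
    hard = 1 if soft > 0.5 else 0    # Eq. (4): ties (soft == 0.5) go to 0
    return soft, hard

print(aggregate_predictions([1, 1, 1, 0, 0, 1]))
```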
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Other Considerations</title>
          <p>
            Besides the usage of plain classes as output, we also considered other structures, including: forcing
the model to output a confidence value for its prediction (in [0, 1]), using integers in {0, 1, ..., 10} to express
the level of sexism instead of a binary class, similarly using floats, and explicitly asking for
a reasoning field to justify the prediction.
          </p>
          <p>
            The usage of confidence and reasoning was kept at inference time, as it forced the model to reason
further and increased explainability. On the other hand, we discarded the usage of integers
and floats, as we perceived a certain bias towards values like 0.5, 7, or 10. These tendencies are probably
due to mode collapse or training biases, as can be seen in [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Task 1.2: Source Intention in Tweets</title>
        <p>Task 1.2 corresponds to a multiclass problem where each sexist tweet must be classified as either
judgmental, direct, or reported sexism. As a starting point for this task, we used the binary classification
from Task 1.1 obtained via BERT fine-tuning, as such techniques had already yielded good results
in the past.</p>
        <p>To propagate the results, we considered two scenarios. If the soft label from Task 1.1 does not surpass
the 0.5 threshold, this means it would have been classified as a non-sexist tweet in Task 1.2 as well
(see Equation 5). Therefore, we did not try to predict its class. If the value was over the threshold, we
predicted the class that suited the criteria the best, normalized accordingly, and assigned the same value
to the non-sexist class from Task 1.2 (Equation 6).</p>
        <p>Pred1.2[No Class] = Pred1.1[Not Sexist],   (5)</p>
        <p>Pred1.2[Class] ← Pred1.2[Class] × Pred1.1[Sexist].   (6)</p>
        <p>The prompt optimization process for the English and Spanish versions was performed using MIPROv2
with the medium configuration, accuracy as the training metric, max_bootstrapped_demos = 1 and
max_labeled_demos = 6.</p>
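        <p>The propagation in Equations 5 and 6 can be sketched as follows; the dictionary layout and the "NO" key are our own illustration:</p>

```python
def propagate(task11_sexist_soft: float, class_scores: dict[str, float]) -> dict[str, float]:
    """Propagate the Task 1.1 soft label into Task 1.2 class soft labels.
    `task11_sexist_soft` is the Task 1.1 soft label for the sexist class;
    `class_scores` are the per-class scores predicted for sexist tweets."""
    out = {"NO": 1.0 - task11_sexist_soft}  # Eq. (5): non-sexist mass carries over
    if task11_sexist_soft > 0.5:            # below threshold: no class prediction
        total = sum(class_scores.values()) or 1.0
        for cls, score in class_scores.items():
            # Eq. (6): normalize, then rescale by the Task 1.1 sexist probability
            out[cls] = (score / total) * task11_sexist_soft
    return out

print(propagate(0.8, {"DIRECT": 2, "REPORTED": 1, "JUDGEMENTAL": 1}))
```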
        <p>To be able to analyze the impact of each of the elements present in the few-shot prompt construction
(RAG and Annotator Specific Prediction), we performed the following runs: only RAG, only annotator
information disclosed on the prompt, and both RAG and annotator information. This approach follows
the same scheme as Figure 1b, changing the output field to be one of the sexist classes instead of just
binary classification.</p>
        <p>To generate the soft labels for each class we used the same concept as in Equations 3 and 4. However,
for each class and annotator the associated prediction would be 1 if the class was the chosen one and 0
otherwise. Intuitively, the hard label was selected to be the class with the highest soft label.</p>
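        <p>Under this scheme, the per-class soft label is simply the fraction of the 6 annotator predictions choosing that class, and the hard label is the argmax; a sketch with our own names:</p>

```python
from collections import Counter

def class_labels(annotator_choices: list[str]) -> tuple[dict[str, float], str]:
    """Per-class soft labels (fraction of annotator predictions choosing each
    class) and the hard label (class with the highest soft label)."""
    counts = Counter(annotator_choices)
    soft = {cls: n / len(annotator_choices) for cls, n in counts.items()}
    hard = max(soft, key=soft.get)
    return soft, hard

soft, hard = class_labels(["DIRECT", "DIRECT", "REPORTED", "DIRECT", "JUDGEMENTAL", "DIRECT"])
print(soft, hard)  # DIRECT gets 4/6 and becomes the hard label
```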
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Task 1.3: Sexism Categorization in Tweets</title>
        <p>Task 1.3 corresponds to a multilabel problem where each sexist tweet can be marked with multiple
labels representing different sexist behaviors, namely: objectification, ideological inequality,
stereotyping dominance, sexual violence, and misogyny non-sexual violence. Again, for this task, we used the
predictions from Task 1.1 obtained via fine-tuning of the BERT models.</p>
        <p>
          To obtain both the English and Spanish prompts, we followed the same technique as in the
previous tasks, with the configuration being: medium configuration, max_labeled_demos = 6 and
max_bootstrapped_demos = 1. The main difference compared to the other tasks lies in how we
scored the predictions for the training metric: correctly guessing whether a label was present added 1
to the score, which was then normalized to the [0, 1] range.
        </p>
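        <p>This per-label scoring can be sketched as follows (the function name and set-based representation are ours): each correctly guessed presence or absence adds 1, normalized by the number of labels:</p>

```python
LABELS = ["objectification", "ideological inequality", "stereotyping dominance",
          "sexual violence", "misogyny non-sexual violence"]

def label_accuracy(gold: set[str], pred: set[str]) -> float:
    """Score in [0, 1]: fraction of the 5 labels whose presence or
    absence was guessed correctly."""
    correct = sum((lab in gold) == (lab in pred) for lab in LABELS)
    return correct / len(LABELS)

print(label_accuracy({"sexual violence"}, {"sexual violence", "objectification"}))  # → 0.8
```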
        <p>These prompts were optimized with modified output fields, as we configured 5 optional Pydantic
outputs, one for each possible label. We used the same approach as in Task 1.2 to propagate the results,
meaning that the model would not process a tweet predicted as non-sexist in Task 1.3 and that the final
predictions were updated as shown in Equations 5 and 6.</p>
        <p>To generate the labels from the predictions output by the model, we followed the approaches presented
in Sections 4.2 and 4.3, with the difference that each label has its own associated hard and soft label.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>
        For Task 1.1, BERT fine-tuning clearly outperformed the prompt-based runs. This performance gap contradicts findings from [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], suggesting that sexism detection may require
domain-specific knowledge that is better captured through fine-tuning than through prompting. The subjective nature
of sexism judgments may necessitate parameter updates rather than instruction optimization.
The inclusion of Annotator Information (AnI) consistently improves DSPy performance: run 1, which lacks
it, performs worse than runs 2 and 3 in both Tables 3 and 4, demonstrating that
perspective-aware modeling benefits prompt-based approaches. Finally, it remains inconclusive whether
Retrieval-Augmented Generation (RAG) leads to performance gains, since runs 2 and 3 show comparable
results across tasks, with no consistent advantage.
      </p>
      <p>Note that the poor performance of run 1 in Task 1.2 and Task 1.3 with soft labels is due to the LLM
generating predictions without annotator information, meaning that the soft labels are effectively hard
labels. These runs were submitted to the soft label category for completeness.</p>
      <p>(Results tables reporting ICM, ICM Norm, ICM Soft, and ICM Soft Norm scores for each task are omitted here.)</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we present CLiC’s participation in the EXIST 2025 shared task for Tasks 1.1, 1.2, and 1.3.
The main objective was to evaluate the viability of prompt engineering on its own. Our findings show
that prompt-based methods alone fail to match the performance of standard techniques such as BERT
fine-tuning for binary sexism classification, and the performance on the multiclass and multilabel
tasks is not near the top of the rankings either. We also observed that incorporating annotator information into
prompt optimization leads to improved results. However, the effect of Retrieval-Augmented Generation
(RAG) on performance remains inconclusive. Future work could explore the combined impact of model
fine-tuning and prompt optimization for similar tasks, which we were unable to pursue due to
resource constraints, as well as the application of these techniques to larger, more powerful LLMs.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been possible as part of the FairTransNLP-Language project (PID2021-124361OB-C33),
funded by MICIU/AEI/10.13039/501100011033/FEDER, UE. It has also been funded by the Generalitat de
Catalunya (2024 PROD 00016 and 2021 SGR 00313 grants).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Generated prompts</title>
      <p>As an example, we present the optimized prompt for Task 1.2 in Spanish in Listing 2. The remaining
prompts are available together with our code at https://github.com/clic-ub/EXIST_2025. We can see that
the prompt includes a reasoning field to elicit a more elaborate thought process and, although it is not the
case here, it could also include augmented examples; given the large amount of training data, we decided to
avoid this scenario. The prompt is modified depending on the run: if no annotator information
is needed, the associated fields are removed, and if RAG is used, extra examples are added
to the demos section for each query. The optimized instructions, as well as the fields and examples,
differ for each specific task and language. Some fields that appear in the examples are not
sent to the LLM, such as hard_label or soft_label; the fields that are stored along with the prompt are simply
determined by how the training set is formed. It is also possible for the few-shot examples to be
incorporated into the instructions, as was the case for the generated prompt for Task 1.3.
Listing 2: DSPy Spanish prompt for Task 1.2
"predict": {
  "demos": [
    {
      "text": "No es que Awada sea una estúpida descerebrada (o sí). Pero le inventan notas donde la describen como tal porque es el modelo de mujer dócil y sumisa que los machirulos de derecha esperan para el resto.",
      "language": "Spanish",
      "category": "sexist",
      "labels_task1_2": "DIRECT",
      "hard_label": 1,
      "soft_label": 1.0,
      "annotator_gender": "female",
      "annotator_age": "18-22",
      "annotator_ethnicity": "Hispano or Latino",
      "annotator_studies": "Bachelor's degree",
      "annotator_country": "Chile"
    },
    { },
    { },
    {
      "text": "Tu mujer rebelde y locata contigo puede que se vaya pero te avisa y no te traiciona ",
      "language": "Spanish",
      "category": "sexist",
      "labels_task1_2": "DIRECT",
      "hard_label": 0,
      "soft_label": 0.33333333330000003,
      "annotator_gender": "female",
      "annotator_age": "46+",
      "annotator_ethnicity": "Hispano or Latino",
      "annotator_studies": "Bachelor's degree",
      "annotator_country": "Mexico"
    },
    {
      "text": " ANDA EN PRIMERA PUES COMO SABEMOS MUCHOS HOMBRES LAS MUJERES NO SABEN
  ],
  "signature": {
    "instructions": "Dado el texto en español, proporciona una categoría que indique el tipo de sexismo presente (DIRECT, REPORTED, JUDGEMENTAL), una explicación detallada de por qué se clasifica así y un nivel de confianza en tu clasificación. Considera el contexto del texto y cualquier información demográfica relevante proporcionada por el anotador, como su género, edad, etnia, estudios y país.",
    "fields": [
      { "prefix": "Text:", "description": "${text}" },
      { "prefix": "Language:", "description": "${language}" },
      { "prefix": "Annotator Gender:", "description": "${annotator_gender}" },
      { "prefix": "Annotator Age:", "description": "${annotator_age}" },
      { "prefix": "Annotator Ethnicity:", "description": "${annotator_ethnicity}" },
      { "prefix": "Annotator Studies:", "description": "${annotator_studies}" },
      { "prefix": "Reasoning: Let's think step by step in order to", "description": "${reasoning}" }</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers</article-title>
          ),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. C. de Albornoz</surname>
            , I. Arcos,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Morante</surname>
          </string-name>
          , Overview of exist 2025:
          <article-title>Learning with disagreement for sexism identification and characterization in tweets, memes, and tiktok videos, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ). Jorge
          <string-name>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          , Julio Gonzalo, Laura Plaza, Alba García Seco de Herrera, Josiane Mothe, Florina Piroi, Paolo Rosso, Damiano Spina, Guglielmo Faggioli, Nicola Ferro (Eds.),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>L.</given-names> <surname>Plaza</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Carrillo-de-Albornoz</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Arcos</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Spina</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Amigó</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gonzalo</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Morante</surname></string-name>,
          <article-title>Overview of EXIST 2025: Learning with disagreement for sexism identification and characterization in tweets, memes, and TikTok videos (extended overview)</article-title>,
          <source>in: CLEF 2025 Working Notes</source>, Guglielmo Faggioli, Nicola Ferro, Paolo Rosso, Damiano Spina (Eds.),
          <year>2025</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singhvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maheshwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vardhamanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Haq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name><given-names>T. T.</given-names> <surname>Joshi</surname></string-name>,
          <string-name>
            <given-names>H.</given-names>
            <surname>Moazam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <article-title>DSPy: Compiling declarative language model calls into self-improving pipelines</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>T.-M.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>Z.-Y.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>J.-Y.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>L.-H.</given-names> <surname>Lee</surname></string-name>,
          <article-title>NYCU-NLP at EXALT 2024: Assembling large language models for cross-lingual emotion and trigger detection</article-title>,
          <source>in: Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, &amp; Social Media Analysis</source>, Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>, pp.
          <fpage>505</fpage>-<lpage>510</lpage>. URL: https://aclanthology.org/2024.wassa-1.50/. doi:10.18653/v1/2024.wassa-1.50.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodríguez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Plaza</surname></string-name>,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Comet</surname>
          </string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Donoso</surname></string-name>,
          <article-title>Overview of EXIST 2021: Sexism identification in social networks</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <fpage>195</fpage>
          -
          <lpage>207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Grave</surname></string-name>,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          , arXiv preprint arXiv:1911.02116 (<year>2019</year>).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          , arXiv preprint arXiv:1907.11692 (<year>2019</year>).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <article-title>Overview of EXIST 2024: Learning with disagreement for sexism identification and characterization in tweets and memes</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Leonardelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Uma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Abercrombie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almanea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fornaciari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rieser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Poesio</surname>
          </string-name>
          ,
          <article-title>SemEval-2023 Task 11: Learning with disagreements (LeWiDi)</article-title>,
          <source>arXiv preprint arXiv:2304.14803</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Siino</surname>
          </string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Tinnirello</surname></string-name>
          ,
          <article-title>Prompt engineering for identifying sexism using GPT Mistral 7B</article-title>
          , Working Notes of CLEF (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <article-title>Fine-tuning and prompt optimization: Two great steps that work better together</article-title>
          ,
          <source>arXiv preprint arXiv:2407.10930</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Opsahl-Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Ryan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Purtell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Broman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <article-title>Optimizing instructions and demonstrations for multi-stage language model programs</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2406.11695. arXiv:2406.11695.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><surname>Qwen Team</surname></string-name>,
          <article-title>Qwen2.5: A party of foundation models</article-title>,
          <year>2024</year>. URL: https://qwenlm.github.io/blog/qwen2.5/.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Warner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chafin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Clavié</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hallström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Taghadouini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ladhak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aarsen</surname>
          </string-name>
          , et al.,
          <article-title>Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference</article-title>
          ,
          <source>arXiv preprint arXiv:2412.13663</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Fandiño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Estapé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pàmies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Palao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Ocampo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Carrino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Oller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Penagos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <article-title>MarIA: Spanish language models</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>68</volume>
          (
          <year>2022</year>
          ). URL: https://upcommons.upc.edu/handle/2117/367156. doi:10.26342/2022-68-3.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pastells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Schmeisser-Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Frenda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <article-title>Context-aware stereotype detection: Conversational thread analysis on BERT-based models</article-title>
          ,
          <source>in: SEPLN Posters</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers</article-title>,
          <year>2020</year>. URL: https://arxiv.org/abs/2002.10957. arXiv:2002.10957.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Janus</surname>
          </string-name>
          , Mysteries of mode collapse,
          <year>2022</year>. URL: https://www.alignmentforum.org/posts/t9svvNPNmFf5Qa3TA/mysteries-of-mode-collapse, accessed: 2025-06-10.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Delgado</surname>
          </string-name>
          ,
          <article-title>Evaluating extreme hierarchical multi-label classification</article-title>,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>,
          <year>2022</year>
          , pp.
          <fpage>5809</fpage>
          -
          <lpage>5819</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>