<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Reasoning of Auto-Revealed Insights via Knowledge Injection and Evolutionary Prompting for Sexism Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kapioma Villarreal-Haro</string-name>
          <email>kapioma.villarreal@cimat.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Judith Tavarez-Rodríguez</string-name>
          <email>judith.tavarez@cimat.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillermo Segura-Gómez</string-name>
          <email>guillermo.segura@cimat.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Rosales-Pérez</string-name>
          <email>alejandro.rosales@cimat.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Sánchez-Vega</string-name>
          <email>fernando.sanchez@cimat.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrián Pastor López-Monroy</string-name>
          <email>pastor.lopez@cimat.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Mathematics Research Center (CIMAT)</institution>
          ,
          <addr-line>Jalisco S/N Valenciana, 36023, Guanajuato, Guanajuato</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mathematics Research Center (CIMAT)</institution>
          ,
          <addr-line>Monterrey, Av. Alianza Centro 502, Apodaca, 66628, Nuevo León</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Secretaría de Ciencia, Humanidades, Tecnología e Innovación (Secihti)</institution>
          ,
          <addr-line>Av. Insurgentes Sur 1582, Col. Crédito Constructor, 03940, CDMX</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper addresses the CIMAT-CS-NLP team's participation in the three EXIST 2025 text-based tasks (hard settings), consisting of sexism detection, intent analysis, and sexism categorization in tweets. The proposed method is based on a single multitask query to a large language model (LLM), with a prompt that first generates auto-revealed insights for the analysis, and then requests answers for all tasks simultaneously. To automate query refinement, we apply evolutionary computation, optimizing the F1-macro on a development subset. We then utilize the LLM to generate justification texts supporting the categorization obtained with the refined prompt, and use them to fine-tune a multilingual RoBERTa-based model for each task. Our experiments show that the evolved prompt boosts some classification metrics, in comparison with the initial prompt. Experiments with DeepSeek-R1-Distill-Llama-8B and Gemini-1.5-Flash show that (i) explicit reasoning can be induced even in models not originally optimized for it, and (ii) Gemini-1.5-Flash achieves the highest scores in a few-shot scenario, while DeepSeek-R1-Distill-Llama-8B ofers a competitive, smaller and open-source alternative. Our findings highlight the advantage of inducing reasoning in an LLM to contextualize tweets and subsequently using them in ifne-tuning to analyze, to various degrees, sexism in social media text.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Online Sexism</kwd>
        <kwd>Multilingual Sexism Detection</kwd>
        <kwd>Sexism Source Intention</kwd>
        <kwd>Sexism Categorization</kwd>
        <kwd>LLMs</kwd>
        <kwd>Reasoning Models</kwd>
        <kwd>Few-Shot Learning</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Evolutionary Computation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In 1975, the UN proclaimed International Women’s Year, placed gender equality on the global agenda,
and declared March 8 International Women’s Day [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Despite all the eforts made over more than
half a century, gender inequality remains a problem in society. The persistence of sexist attitudes
legitimizes exclusion and violence directed against women, which is reflected in current figures on
gender gaps and disparities, where women are clearly the most afected [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        The expansion of digital media has facilitated the widespread dissemination of sexist messages.
Recent United Nations reports state that, in a wide range of countries, between 16 and 58 percent of
women and girls have been subjected to online gender-based violence [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Unfortunately, practicing
digital violence is easier with the use of social media, which transcends all geographical barriers and
has the potential for messages to come from anonymous sources. In particular, the social network X
(formerly Twitter) allows for almost instant virality, while the short format makes it dificult to perceive
the full context, which complicates content moderation. This problem raises the need for automated
systems capable of detecting, interpreting, and categorizing sexism in tweets with high precision and in
real time [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Furthermore, traditional binary approaches (sexist/non-sexist) are insuficient to address all the
nuances of today’s society, so a more in-depth analysis becomes necessary given the complexity of the
problem. Detecting the intent of messages, as well as identifying them at a more granular level, helps us
understand the problem of sexism on social media, and, based on this understanding, they can lead to
real solutions. In this context, several initiatives have emerged to combine eforts within the scientific
community and advance toward better detection of sexism, both in binary and finer-grained terms.
EXIST (sEXism Identification in Social neTworks) 2025 is the fifth edition of this shared task [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] and
its fundamental objective is to make sexism detection a more far-reaching endeavor, attacking sexism
from its explicit forms, such as misogyny, to its more subtle expressions, such as micro-violence. This
year’s edition features several multimodal challenges: text, images, and videos. Each one with three
tasks: identification, intent analysis, and fine-grained categorization.
      </p>
      <p>Moreover, recent technologies such as large language models (LLMs) have demonstrated a surprising
ability to generalize in classification tasks without requiring fine-tuning for the specific task. However,
the performance of an LLM can vary dramatically depending on the wording of the prompt, the
selection of examples for few-shot schemes, and the output format of the response text. Identifying the
optimal configuration often involves manual trial and error, a process that is unsystematic and dificult
to reproduce.</p>
      <p>In this work, we address the three text tasks, focusing in the hard-settings, as last year’s edition
showed that there is still room for improvement, especially in the intention analysis of sexist content
and fine-grained categorization tasks. The main tool we use to perform the classifications are specific
requests to LLMs, where the prompt was designed to let the model extract relevant insights from
the tweet and the context of the tasks before issuing a categorization. With this prompt, we seek to
extract the answers to the three text tasks in a single query. Furthermore, the prompt was refined using
evolutionary computation, resulting in some improvements in the development set. We subsequently
prompt the LLM to extract reasoning texts explaining its own predictions, which are injected in the
ifne-tuning process of a RoBERTa-based model to perform the final classifications for each task. Our
results suggest that using a single multi-task prompt that let the model identify relevant elements of
the tweet by itself helps language models better respond to the tasks. Besides, the model’s reasoning
encoded in justifications to the answer helps in the fine-tuning of RoBERTa-based models. Additionally,
evolutionary computation-assisted prompt engineering is a promising avenue for leveraging the power
of LLMs and avoiding manual prompt refinement, helping to improve classification metrics.
Our main contributions are:
1. We designed a single prompt that encourages the model to reason deeply about relevant
information and insights for the analysis of the tweets and then answer the three tasks in a single
query.
2. We used evolutionary computation to evolve the prompt and thus optimize the F1-macro on a
development subset, automating what previously relied on manual intuition.
3. We compared the eficiency of the evolved prompt on two models: DeepSeek-R1-Distill-Llama-8B,
and Gemini-1.5-Flash, only with few-shot examples within the prompt itself, demonstrating that
reasoning can be induced in non-reasoning models, the latter being better in performance.
4. We found that fine-tuning a RoBERTa-based model with rationales that Gemini-1.5-Flash produced
from DeepSeek-R1-Distill-Llama-8B answers yields performance on par with using rationales
generated entirely by Gemini-1.5-Flash. The key benefit is that the DeepSeek pipeline relies on a
smaller, fully open-source model.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Sexism Detection Since the first edition of the EXIST forum, transformer fine-tuning has been
the preferred tactic for classification tasks [
        <xref ref-type="bibr" rid="ref10 ref11 ref8 ref9">8, 9, 10, 11</xref>
        ]. Some of the best strategies from past years
and strategies found in literature include ensembles of ensembles of monolingual BERT and RoBERTa
transformers [12], fine-tuning RoBERTa-based models using information from annotators [ 13, 14, 15]
and fine-tuning of GPT-NeoX and BERTIN-GPT-J-6B models [ 16] with the latter achieving first place in
the hard setting for tasks 1 and 2 in 2023. Thus, it is promising to explore diferent and more recent
LLMs and techniques for sexism identification, analysis of intention and categorization tasks.
Strategies with LLMs were tested in EXIST 2024, were several teams relied on prompt engineering
processes to perform zero-shot or few-shot classifications [ 17, 18, 19]. In fact, in [20], authors generated
rationales for each tweet with Meta-Llama-3-8B and used these rationales to fine-tune a
Twitterspecialized XLM-RoBERTa, yielding better results than fine-tuning using only the tweets.
In this work, we employed the use of rationales for solving the three tasks in one single query to
the model, and subsequently we refined performance through an additional fine-tuning phase on a
RoBERTa-based model.
      </p>
      <p>Evolutionary Optimization for Prompt Engineering in Sensitive Tasks LLMs are powerful tools
for performing complex classification tasks, as they can generate accurate results without requiring
modifications to the model architecture. LLMs leverage extensive pretraining on massive corpora,
consisting of billions of words, to model language in unprecedented ways. However, this also induces
them to have certain biases or prejudices towards topics that could be considered sensitive. Recent
work has demonstrated that LLMs exhibit gender bias [21], and reinforce negative stereotypes [22]. In
fact, there is evidence of an amplification of gender bias by LLMs, reflecting and potentially aggravating
the societal perceptions from which these models learn [23]. Essentially, they mirror collective societal
thinking.</p>
      <p>It has been found that LLMs exhibit a significant dependence on the prompt used [ 24]. By altering the
prompt, it is even possible to mitigate some of the biases that LLMs may display. Studies have shown
that certain prompting strategies, such as chain-of-thought, can reduce the impact of gender bias in
these models [25, 26]. For this reason, it becomes crucial to focus on identifying the features that help us
better detect gender bias from the prompt itself. Training a language model is highly resource-intensive,
and building a model free of bias is even more complex due to the substantial amount of biased data,
especially regarding gender, found across the Internet [23].</p>
      <p>Eforts in prompt engineering for sexism detection have combined various architectures to improve
detection accuracy [27]. Recent studies have opted to use multilingual models like XLM-R or mBERT,
paired with language-adapted prompts, while others have explored zero-shot and few-shot techniques
alongside diferent LLM variants. It has been noted that using well-chosen examples, explicit prompts,
and structured formats can significantly enhance performance in tasks such as binary classification or
intention detection [17, 18]. Nevertheless, these methods often depend heavily on the expertise of the
individual designing them, and they generally lack a systematic approach for exploring useful prompt
variants. This lack of automation limits their scalability, particularly in multilingual or finely annotated
scenarios, such as those found in the EXIST framework.</p>
      <p>To address the challenge of prompt optimization, automatic methods based on evolutionary
algorithms, such as EvoPrompt [28] and PromptBreeder [29], have recently been proposed. These approaches
enable the exploration of efective prompt variants without requiring human intervention. Although
they have shown promising results in general tasks, their application in sensitive areas like sexism
detection remains limited.</p>
      <p>Most research on prompt optimization has been carried out on general tasks, but recent studies have
started to focus specifically on sensitive contexts such as sexism. For instance, it has been suggested to
enrich prompt statements with semantic definitions and explicit descriptions of the sexist phenomenon,
which helps in identifying subtle cases that models often overlook [30]. Another line of work, more
geared toward analyzing biases within LLMs themselves, involves the use of soft prompts, as in [31],
where specific embeddings are adjusted to assess how models react to sensitive inputs with varying
gender cues.</p>
      <p>However, these studies often concentrate on qualitative analysis, partial evaluations, or require
internal modifications to the models. In contrast, our approach in the first phase preserves the original
model and focuses on systematically enhancing prompt formulation, without additional training, by
aligning advances in evolutionary optimization with a real and complex issue such as sexism. Moreover,
we propose a multitask and scalable strategy that not only performs classification but also induces
reasoning, enabling broader generalization to other sensitive contexts.</p>
      <p>Reasoning in Generative LLMs While focusing on improving prompt formulation and structure
is one way to enhance LLM performance, alternative approaches involve modifying how the models
process and respond to those prompts by leveraging their reasoning capabilities. Simple techniques to
encourage the models to “reason” before answering improve the already impressive results achieved
by Generative LLMs in classification tasks across several domains. For instance, chain-of-thought
prompting, which involves adding a few demonstrations that guide reasoning towards a solution, helps
models reproduce such reasoning before generating the final answer, ultimately improving performance
[32].</p>
      <p>While chain-of-thought prompting is useful, it has been shown that explicit demonstrations of how
to conduct the reasoning are not necessary, as models are capable of reproducing this behavior by
themselves. This is achieved by simply indicating the model to “reason” by adding “let’s think step by
step” before each answer. Inspired by these ideas, new models have been trained to generate internal
chains of thought before producing final answers, with popular models like DeepSeek demonstrating
competitive performance against other state-of-the-art approaches [33].</p>
      <p>Whether or not models are “really thinking” and despite the limitations that they face [34], the
richness and knowledge encoded in their “reasoning” can not be neglected and can be leveraged by
other models. There is work where smaller models can benefit from chain-of-thought reasoning by
learning in a distillation process to reproduce both the reasoning and final outputs generated by larger
models [35]. Other work leverages this knowledge by incorporating reasoning produced by larger
models as knowledge expansion of the original input to improve classification metrics [ 20]. In this
approach, although the reasoning is not reproduced, the knowledge embedded in the reasoning is still
utilized to improve model performance [20]. Other more direct approaches focus only on using final
categorical answers as a means of knowledge expansion, where, despite not incorporating reasoning, it
serves as an indirect knowledge transfer coming from larger models [17]. Other approaches do not rely
solely on either reasoning or final answer categories but instead incorporate reasoning generated after
producing answers to justify the output, leveraging both aspects [36]. This work exemplifies that the
reasoning capabilities of LLMs are valuable not only because they lead to more accurate final answers,
but also because the knowledge and insights expressed in the process can be leveraged efectively by
smaller models in diferent settings.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The systems developed to address the three text classification tasks in the hard setting combine generative
LLMs with fine-tuning techniques in a multi-stage process. The methodology consists of three main
stages: (1) prompt-based classification with evolutionary optimization, (2) justification generation to
extract reasoning behind answers, and (3) fine-tuning with knowledge injection.</p>
      <sec id="sec-3-1">
        <title>3.1. Prompt-Based Classification with Auto-Discovery of Relevant Elements</title>
        <p>The first stage of the model consists of an auto-discovery and classification approach that lets the model
identify relevant elements autonomously. It has been shown that guiding the model by decomposing
the task into parts can improve performance in domains like hate speech detection, but it faces the</p>
        <p>New
population
Mutated
population
Evaluation</p>
        <p>Evolutionary Algorithm</p>
        <p>with
Systematic Mutations</p>
        <p>Initial population</p>
        <p>Selection</p>
        <p>Crossover</p>
        <p>Evaluation
Predefined
mutation block
Select</p>
        <p>Mutation</p>
        <p>Systematic mutation
dificulty of carefully selecting its constituents [ 37]. To avoid the manual definition of components, we
aim for the models to discover them on their own.</p>
        <p>We define a prompt, where we instruct generative LLMs to self-identify relevant elements that help
understand the tweet and are relevant for the tasks, and then to perform classification on the three tasks,
all in a single query. The prompt incorporates the task definitions and few-shot examples, which are
optimized through an evolutionary process described in the following subsection. The output includes
highlighted elements related to the input tweet, and classification predictions for all three tasks.</p>
        <p>Performance of the optimized prompt at this stage is evaluated using as core models two diferent
generative LLMs: DeepSeek-R1-Distill-Llama-8B (AD-Dsk), a smaller model optimized for reasoning
and Gemini-1.5-Flash (AD-Gem), a larger non-reasoner general-purpose model.
3.1.1. Evolutionary Optimization
The optimization process adopts a classic genetic algorithm framework, incorporating selection,
crossover, and mutation phases. The initial population includes both handcrafted prompts and
variants generated using GPT-4. To retain the highest-performing prompts across generations, we use
tournament and top-k selection strategies. Unlike conventional methods that rely on purely stochastic
mutations, we implemented a curated library of structured mutations, specifically designed to improve
the eficiency of the prompt in this context. This structured mutation strategy draws inspiration from
the concept of systematic mutation, recently introduced by Segura-Gómez et al. [38], in which each
mutation type is defined as a task-aware transformation intended to enhance prompt performance. In our
configuration, each mutation was selected based on empirical trends observed in prompt engineering
and validated through ablation studies. The complete evolutionary process, which encapsulates the
behavior of systematic mutations, is illustrated in Figure ??.</p>
        <p>
          According to the proposed approach, we define a single prompt designed to address the reasoning of
the language model and solve all three tasks in a single call. This initial prompt or template includes
formatting instructions, text generation, and most importantly, specific instructions for each subtask. While
this complete prompt is used as-is in the final system, during the optimization phase, it is segmented by
subtask, with each section optimized independently. This strategy enables the evolutionary process to
focus on a specific task at a time, thereby enhancing both its eficiency and stability. Additionally, it
allows for fine-grained control over the modifications applied to each functional block of the prompt.
Optimization Setup The evolutionary process was performed on a subset of 500 randomly selected
examples from the dev set. This subset is based on the collection of tweets originally introduced in
EXIST 2023 and follows the “Learning With Disagreement” paradigm, in which each tweet is annotated
by six distinct individuals [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. These annotators represent a diverse range of perspectives, difering in
gender, age range, ethnicity, education level, and country of origin.
        </p>
        <p>The selected subset includes tweets in both English and Spanish. Each tweet is annotated with both
consensus labels, computed via majority vote, and the full distribution of individual annotator labels.
For the purposes of the evolutionary process, only the consensus labels were used. This choice aligns
with the requirements of the fitness function, which operates on predefined classes: a single target
label in binary classification, or a reduced set of categories in multiclass tasks. This subset was used
consistently throughout the evolutionary cycle to evaluate the efectiveness of the generated prompts.</p>
        <p>The procedure was executed over 10 generations with a population of 10 prompts. In each generation,
the best-performing candidates were selected via tournament selection, combined through crossover,
and modified using a structured mutation. Prompts were evaluated on the dev set using the
Meta-Llama3.1-8B model to generate outputs, which were then scored using macro F1 as the fitness function. The
process prioritized interpretability and stability over aggressive exploration, aiming to generate prompt
variants that remained robust across diferent tasks and could generalize well when integrated into the
full pipeline. A detailed description of the prompt evolution system and intermediate prompt versions
is provided in Appendix A.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Process justifications before fine tuning</title>
        <p>To leverage the model’s existing knowledge in the fine-tuning step, we further prompt a generative
LLM to justify the answers to the tasks generated during the Prompt-Based Classification. Thereby, we
obtain justification texts that explain the answers provided, showcasing the internal reasoning of the
models in a posterior process. We use Gemini-1.5-Flash to generate justification for both model outputs
in the previous steps.</p>
        <p>Gemini justifications over AD-Gem answers serve as a way to extract reasoning from Gemini’s own
classification. While we can extract reasoning directly from DeepSeek, this approach presented several
challenges: consistency in the answers was not guaranteed, labels were not always present, and early
development results were unfavorable. Instead, we extract Gemini justifications over AD-DSk serve to
standardize answers in DeepSeek while also incorporating reasoning of Gemini explaining DeepSeek’s
classification.</p>
        <p>These justifications will serve as knowledge injection to expand tweets in the next classification
stage.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Justification Enhanced (JE) Fine Tuning</title>
        <p>The final stage consists of a fine-tuning process with knowledge injection obtained from the justifications
obtained from the previous step, following the approach presented in [36]. We focus on an
XLMRoBERTa model trained on multilingual tweets (T-XLM-RoBERTa) [39].</p>
        <p>The knowledge injection process is implemented by concatenating justifications with the original
tweets, creating a more informative representation for classification. The base T-XLM-RoBERTa model is
then fine-tuned separately to predict each of the three tasks using the enriched input. A final correction
process is applied to tasks 1.2 and 1.3 based on the answers of task 1.1: if the answer to 1.1 was negative,
we correct tasks 1.2 and 1.3 answers to reflect this dependency.</p>
        <p>These fine-tuned models are identified as (JE-DSk-FT) and (JE-Gem-FT), depending on the model used
to generate the answers that serve as guide to the justifications ( AD-DSk and AD-Gem respectively).</p>
        <p>Figure 2 outlines the overall methodology for individual systems, including the alternative tested in
the early stages.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Summary of the submitted models</title>
        <p>Models submitted correspond to the individual fine-tuned versions JE-DSk-FT and JE-Gem-FT. A third
model is obtained by combining multiple models, expecting to obtain a more robust panorama of the
answers. The models considered are those obtained in the Prompt-Based Classification with
AutoDiscovery of Relevant Elements (AD-DSk and AD-Gem), as well as the ones obtained in Justification
Enhanced (JE) Fine Tuning (JE-DSk-FT and JE-Gem-FT ).</p>
        <p>The ensemble model, identified as MixDSkGem, is produced by assigning a confidence score according
to the performance of each model in the dev set. Final answers are then weighted using these confidence
scores to finally output the answer with the highest confidence. For task 1.1 (binary classification) and
task 1.2 (multiclass classification), the highest-scored label is submitted, while for task 1.3 (multi-label
classification), we treated each label as a binary problem to decide if it is included in the labels submitted.</p>
        <p>Table 1 contains a summary of the systems and the corresponding runs in the oficial leaderboard.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Results on dev</title>
        <p>During the prompt evolution phase, we evaluated candidate prompts using Meta-Llama-3.1-8B as the
inference model and macro F1 score as the fitness function. Table 2 summarizes the results obtained
using the best evolved prompt versus the original base prompt across the three tasks on a 500-example
subset of the development data.</p>
        <p>The evolved prompt led to consistent improvements across all tasks, with the most notable gain
observed in Task 1.2, which benefited from clearer intention phrasing. These results validate the
efectiveness of our systematic mutation strategy under a controlled evaluation setup. While these
scores are not directly comparable to the ICM-Hard-Norm metrics reported for Gemini-based evaluations,
they were essential for guiding the prompt optimization process and served as a foundation for the
ifnal prompt adopted in downstream components of the system. A detailed description of the prompt
evolution process and intermediate versions is provided in Appendix A.</p>
        <p>Final prompt evolution performance was evaluated using the complete dev dataset with
Gemini1.5-Flash. Seed prompts performance show improvement from ranges of ICM-Norm from [0.71 −
0.73], [0.54 − 0.57], and [0.48 − 0.49] in tasks 1.1, 1.2, and 1.3 respectively to 0.74, 0.55, and 0.50
compared to the final prompt version. We observe an overall improvement of the final prompt in
tasks 1.1 and 1.3 compared to all seed prompts, while task 1.2 shows minimal improvement but still
outperforms the worst-performing seeds.</p>
        <p>Table 3 shows the performance of auto-discovery prompting-based models and fine-tuned ones on
the dev dataset. From these results, we observed that fine-tuning the T-XLM-RoBERTa model with
justifications generated by LLMs significantly improves classification performance across all tasks,
compared to using only the classifications produced directly through prompting. Ensemble model based
on confidence voting Mix-DSkGem improves performances compared to individual models in tasks
1.1 and 1.2, and experiments a loss in task 1.3 compared with the fine-tuned models, suggesting the
merging process for this task is not optimal.</p>
        <p>Furthermore, in the literature [17], we found that prompting with Gemini-1.0-pro without Auto
Discovery in the prompt (class_definitions_refined_prompt), achieved an ICM-Hard-Norm metric of
0.7318 in the dev set for task 1, and fine-tuning a T-XLM-RoBERTa model with only the tweets achieved
a 0.7940 ICM-Hard-Norm metric for task 1. This tells us that our system has the potential to obtain
competitive metrics in the test, specially the JE-Gem-FT model, with the advantage that the three tasks
can be obtained directly from a single request to the model, which helps reduce latency.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results on test</title>
        <p>The majority of our submissions ranked in the top ten of all the evaluations, specially those for tasks
1 and 2. Our best submitted system ranked 4th for task 2 and it consists of the T-XLM-RoBERTa model
which was fine-tuned with the tweets plus the justifications generated by the Gemini-1.5-Flash model.
It is worth to notice that the ranking order achieved in the dev set is not preserved in the test set
for task 1, since the JE-Gem-FT system obtained a lower ICM-Hard metric. Still, the diference
is only 0.0005 points for the F1 metric, and for task 2, the diference is only 0.0021. This tells
us that the fine-tuning process helps an 8B parameter model like DeepSeek contribute
similarly to a larger model like Gemini-1.5-Flash, but with the advantage of being open source.</p>
        <p>Regarding the results of N-LLM-R-Stack-Ra and Resp_aware_in + FT XLM-RoBERTa, these systems
correspond to fine-tuned XLM-RoBERTa models with tweets + generated-text-from-LLMs.
N-LLM-R-Stack-Ra consisted in asking to a Meta-Llama-3-8B-Instruct model to generate analysis that
supported the idea that the tweet was not related to sexism (negative reasoning). This analysis was
concatenated along with the tweet itself and then used as input for fine-tuning a T-XLM-RoBERTa
model. The Resp_aware_in + FT XLM-RoBERTa system consisted in prompting a Gemini-1.0-pro model
to classify the tweet as sexist or not. Then, the YES or NO answer was concatenated along with the
tweet and a XLM-RoBERTa model was fine-tuned with this new input.</p>
        <p>As we can notice, incorporating justifications into the fine-tuning inputs significantly improves
classification performance, more than adding negative reasoning or adding the LLM
classification directly to the tweet. The previous demonstrates that the RoBERTa-based model benefits from
the contextual information extracted during LLM Auto Discovery prompting and it contributes to more
accurate detection of sexist tweets and a better identification of intent.
JE-Gem-FT
Mix-DSkGem
JE-DSk-FT
JE-Gem-FT
JE-DSk-FT
Mix-DSkGem
JE-Gem-FT
Mix-DSkGem
JE-DSk-FT</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Limitations</title>
      <p>Our results confirm that incorporating LLM-generated justifications during fine-tuning boosts the
performance of our systems to competitive positions on the EXIST 2025 leader-board. Moreover, the
prompting method yields to a framework for solving the three task in a single query to the LLM, leading
to better metrics when coupled with the fine-tuning process.</p>
      <p>The use of evolutionary computation proved promising for optimizing the prompt. The final
evaluation metrics did not change significantly, but we hypothesized that using this technique with smaller
prompts may be more beneficial in the prompt engineering process.</p>
      <p>The experiments show that incorporating smaller open-source models, such as
DeepSeek-R1-DistillLlama-8B, can achieve results similar to much larger models like Gemini-1.5-Flash for the process of
enrichment with reasoning and explanations.</p>
      <p>Validation of the stability of the observed improvements remains to be explored, as the slight variation
between the dev and test rankings suggests that our models show room for improvement in terms of
generalization.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Villarreal-Haro acknowledges Secretaría de Ciencia, Humanidades, Tecnología e Innovaciones (SECIHTI)
for its support provided by the program Becas Nacionales Para Estudios de Posgrados (CVU 1309535).
Segura-Gómez acknowledges financial support from SECIHTI through the graduate scholarship provided
by the program Becas Nacionales Para Estudios de Posgrados number 4006758 and CVU 1308651.
TavarezRodríguez acknowledges SECIHTI and Centro de Investigación en Matemáticas (CIMAT) for the support
through the PhD scholarship (CVU 859147). Sanchez-Vega acknowledges SECIHTI for its support
through the program “Investigadoras e Investigadores por México” (Project ID.11989, No.1311).
RosalesPérez acknowledges SECIHTI for its support through the grant project "Búsqueda de arquitecturas
neuronales eficientes y efectivas" (CBF2023-2024-2797). The authors gratefully acknowledge SECIHTI
and CIMAT for the computing resources provided by the CIMAT Bajio Supercomputing Laboratory
(#300832), and the support of Google Cloud Platform for the credits granted under its trial program.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly in order to: Drafting
content, Text Translation, Grammar and spelling checking, Citation management. After using these
tool(s)/service(s), the authors reviewed and edited the content as needed and take full responsibility for
the publication’s content.
D. Spina, Overview of exist 2024–learning with disagreement for sexism identification and
characterization in tweets and memes (extended overview), in: Conference and Labs of the
Evaluation Forum, 2024.
[12] E. Villa-Cueva, F. Sanchez-Vega, A. P. López-Monroy, Bi-ensembles of transformer for online
bilingual sexism detection., in: IberLEF@ SEPLN, 2022.
[13] Y.-Z. Fang, L.-H. Lee, J.-D. Huang, Nycu-nlp at exist 2024–leveraging transformers with diverse
annotations for sexism identification in social networks, Working Notes of CLEF (2024).
[14] M. P. Jimenez-Martinez, I. H. Lopez-Nava, M. Montes-y Gómez, An analysis of the impact of gender
and age on perceiving and identifying sexist posts, in: E. Mezura-Montes, H. G. Acosta-Mesa, J. A.
Carrasco-Ochoa, J. F. Martínez-Trinidad, J. A. Olvera-López (Eds.), Pattern Recognition, Springer
Nature Switzerland, Cham, 2024, pp. 308–318.
[15] M. P. Jimenez-Martinez, I. H. Lopez-Nava, M. Montes-y Gómez, Enhancing the detection of sexist
messages through a multi-profile-based ensemble approach, Computación y Sistemas 29 (2025)
283–294.
[16] L. Tian, N. Huang, X. Zhang, Eficient multilingual sexism detection via large language model
cascades., in: CLEF (Working Notes), 2023, pp. 1083–1090.
[17] J. Tavarez-Rodríguez, F. Sánchez-Vega, A. Rosales-Pérez, A. P. López-Monroy, Better together: Llm
and neural classification transformers to detect sexism, Working Notes of CLEF (2024).
[18] A. Azadi, B. Ansari, S. Zamani, S. Eetemadi, Bilingual sexism classification: fine-tuned xlm-roberta
and gpt-3.5 few-shot learning, arXiv preprint arXiv:2406.07287 (2024).
[19] S. Khan, G. Pergola, A. Jhumka, Multilingual sexism identification via fusion of large language
models, Working Notes of CLEF (2024).
[20] K. Villarreal-Haro, F. Sánchez-Vega, A. Rosales-Pérez, A. P. López-Monroy, Stacked reflective
reasoning in large neural language models, Working Notes of CLEF (2024).
[21] S. Soundararajan, S. J. Delany, Investigating gender bias in large language models through text
generation, in: Proceedings of the 7th International Conference on Natural Language and Speech
Processing (ICNLSP 2024), 2024, pp. 410–424.
[22] R. Ostrow, A. Lopez, Llms reproduce stereotypes of sexual and gender minorities, arXiv preprint
arXiv:2501.05926 (2025).
[23] H. Kotek, R. Dockum, D. Sun, Gender bias and stereotypes in large language models, in: Proceedings
of the ACM collective intelligence conference, 2023, pp. 12–24.
[24] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, A. Chadha, A systematic survey of prompt
engineering in large language models: Techniques and applications, arXiv preprint arXiv:2402.07927
(2024).
[25] A. Sant, C. Escolano, A. Mash, F. D. L. Fornaciari, M. Melero, The power of prompts: Evaluating
and mitigating gender bias in mt with llms, arXiv preprint arXiv:2407.18786 (2024).
[26] S. Dwivedi, S. Ghosh, S. Dwivedi, Breaking the bias: Gender fairness in llms using prompt
engineering and in-context learning, Rupkatha Journal on Interdisciplinary Studies in Humanities
15 (2023).
[27] M. Siino, I. Tinnirello, Prompt engineering for identifying sexism using gpt mistral 7b, Working</p>
      <p>Notes of CLEF (2024).
[28] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu, J. Bian, Y. Yang, Connecting large
language models with evolutionary algorithms yields powerful prompt optimizers, arXiv preprint
arXiv:2309.08532 (2023).
[29] C. Fernando, D. Banarse, H. Michalewski, S. Osindero, T. Rocktäschel, Promptbreeder:
Selfreferential self-improvement via prompt evolution, arXiv preprint arXiv:2309.16797 (2023).
[30] S. Khan, A. Jhumka, G. Pergola, Explaining matters: Leveraging definitions and semantic expansion
for sexism detection, arXiv preprint arXiv:2506.06238 (2025).
[31] J.-J. Tian, D. Emerson, S. Z. Miyandoab, D. Pandya, L. Seyyed-Kalantari, F. K. Khattak, Soft-prompt
tuning for large language models to evaluate bias, arXiv preprint arXiv:2306.04735 (2023).
[32] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou,
Chainof-thought prompting elicits reasoning in large language models, in: Proceedings of the 36th
International Conference on Neural Information Processing Systems, NIPS ’22, Curran Associates
Inc., Red Hook, NY, USA, 2022.
[33] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al.,
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, arXiv preprint
arXiv:2501.12948 (2025).
[34] P. Shojaee, I. Mirzadeh, K. Alizadeh, M. Horton, S. Bengio, M. Farajtabar, The illusion of thinking:
Understanding the strengths and limitations of reasoning models via the lens of problem complexity,
arXiv preprint arXiv:2506.06941 (2025). URL: https://arxiv.org/abs/2506.06941, submitted 7 June
2025.
[35] L. H. Li, J. Hessel, Y. Yu, X. Ren, K.-W. Chang, Y. Choi, Symbolic chain-of-thought distillation:</p>
      <p>Small models can also" think" step-by-step, arXiv preprint arXiv:2306.14050 (2023).
[36] K. Villarreal-Haro, F. Sánchez-Vega, A. P. López-Monroy, Knowledge expansion guided by
justification for improved sexism categorization, Working Notes of CLEF (2025).
[37] B. AlKhamissi, F. Ladhak, S. Iyer, V. Stoyanov, Z. Kozareva, X. Li, P. Fung, L. Mathias, A. Celikyilmaz,
M. Diab, ToKen: Task decomposition and knowledge infusion for few-shot hate speech detection,
in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical
Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi,
United Arab Emirates, 2022, pp. 2109–2120. URL: https://aclanthology.org/2022.emnlp-main.136/.
doi:10.18653/v1/2022.emnlp-main.136.
[38] G. Segura-Gómez, A. P. Lopez-Monroy, F. Sanchez-Vega, A. Rosales-Pérez, Nlp-cimat at semeval
2025 task 11: Prompt optimization for llms via genetic algorithms and systematic mutation applied
on emotion detection, in: Proceedings of the 19th International Workshop on Semantic Evaluation
(SemEval-2025), Association for Computational Linguistics, Vienna, Austria, 2025. To appear.
[39] F. Barbieri, L. Espinosa Anke, J. Camacho-Collados, XLM-T: Multilingual language models in
Twitter for sentiment analysis and beyond, in: Proceedings of the Thirteenth Language Resources
and Evaluation Conference, European Language Resources Association, Marseille, France, 2022,
pp. 258–266. URL: https://aclanthology.org/2022.lrec-1.27.
[40] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M.
Funtowicz, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the
2020 conference on empirical methods in natural language processing: system demonstrations,
2020, pp. 38–45.
[41] T. Dettmers, M. Lewis, Y. Belkada, L. Zettlemoyer, Gpt3. int8 (): 8-bit matrix multiplication for
transformers at scale, Advances in neural information processing systems 35 (2022) 30318–30332.
[42] E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, Gptq: Accurate post-training quantization for
generative pre-trained transformers, arXiv preprint arXiv:2210.17323 (2022).
[43] C. Centro de Investigación en Matemáticas A.C, Laboratorio de supercómputo del bajío, n.d. URL:
https://supercomputobajio.cimat.mx/, accessed on June 15, 2025.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Appendix: Evaluated Prompts</title>
      <p>This appendix presents the top-performing prompt variations at diferent stages of the evolutionary
process. Each version includes the full phrasing used for the three EXIST 2025 subtasks.</p>
      <p>All evaluations, prompt generations, and structured mutations were performed using LLaMA 3.1
8B in an 8-bit quantized configuration. The implementation leveraged PyTorch, the transformers
library [40], and bitsandbytes [41, 42] for eficient low-precision inference. All experiments were
executed on two NVIDIA Titan RTX GPUs (24 GB DDR6), provided by the Supercomputing Laboratory
at CIMAT Bajío [43].</p>
      <p>Due to computational constraints, the evolutionary search was limited to 10 generations with a
population size of 10. The entire optimization process was inference-based, with no gradient updates
or additional training. Below, we present the best prompt per generation, showing how task-specific
instructions evolved through systematic mutation.</p>
      <sec id="sec-8-1">
        <title>Prompt - Gen 0 (Seed)</title>
        <p>Macro F1: T1.1 = 0.8098, T1.2 = 0.3109, T1.3 = 0.0381</p>
        <p>TASK 1: Is there any gender-based assumption or judgment in this message?
TASK 2: Identify the intention behind the tweet: is it a direct sexist statement, a report of sexism,
or a judgement against sexism?
TASK 3: How would you classify the kind of sexism this tweet reflects — is it about stereotypes,
objectification, or maybe something more violent?</p>
      </sec>
      <sec id="sec-8-2">
        <title>Prompt - Gen 4</title>
        <p>Macro F1: T1.1 = 0.8250, T1.2 = 0.4350, T1.3 = 0.1650</p>
        <p>TASK 1: Evaluate the given tweet for the presence of sexism, categorizing it as ’Direct Sexism’ if
sexism is directly expressed or stated, ’Described Sexism’ if sexism is described or referenced,
’Criticized Sexism’ if sexism is being criticized or challenged, or ’NO’ if no sexism is detected,
and also assess the tone and context to identify language that targets, belittles, or promotes sexist
situations based on a person’s gender.</p>
        <p>TASK 2: As a social media analyst, consider the context and underlying message of the tweet:
does it actively promote or perpetuate sexism, serve as a means of documenting sexist behavior,
narrate or bear witness to a sexist moment, or express disapproval towards sexist attitudes by
explicitly promoting sexism, serving as a means of documenting sexist behavior, or pushing back
against and objecting to it; determine whether the tweet embodies sexism directly (DIRECT),
reports a sexist event that has occurred (REPORTED), or delivers a critical judgment of sexism
(JUDGEMENTAL).</p>
        <p>TASK 3: Analyze the tweet for sexism and categorize it based on the specific type it exhibits.
Classify the sexism as one of three types: (1) Mocking feminism, which involves ridiculing or
disparaging feminist ideologies or individuals. (2) Objectifying women by reducing them to body
parts, where women are represented or addressed solely in terms of their physical appearance. (3)
Expressing direct aggression, which involves using threatening, derogatory, or violent language
towards women. Tag the corresponding sexism class accordingly, indicating the type of sexism
the tweet embodies.</p>
      </sec>
      <sec id="sec-8-3">
        <title>Prompt - Gen 6</title>
        <p>TASK 1: Evaluate the given tweet as a Social Media Content Moderator for the presence of
sexism, categorizing it as ’Direct Sexism’ if sexism is directly expressed or stated, ’Described
Sexism’ if sexism is described or referenced, ’Criticized Sexism’ if sexism is being criticized or
challenged, or ’NO’ if no sexism is detected, and also assess the tone and context to identify
language that targets, belittles, or promotes sexist situations based on a person’s gender, including
microaggressions or sexist undertones.</p>
        <p>TASK 2: As a social media analyst, consider the context and underlying message of the tweet:
does it actively promote or perpetuate sexism, serve as a means of documenting sexist behavior,
narrate or bear witness to a sexist moment, or express disapproval towards sexist attitudes by
explicitly promoting sexism, serving as a means of documenting sexist behavior, or pushing back
against and objecting to it; determine whether the tweet embodies sexism directly (DIRECT),
reports a sexist event that has occurred (REPORTED), or delivers a critical judgment of sexism
(JUDGEMENTAL).</p>
        <p>TASK 3: Analyze the tweet for sexism and classify it into one of three categories: (1) Mocking
feminism by ridiculing or disparaging feminist ideologies or individuals, (2) Objectifying women
by reducing them to body parts where women are represented or addressed solely in terms of
their physical appearance, (3) Expressing direct aggression by using threatening, derogatory, or
violent language towards women. Tag the corresponding sexism class accordingly, indicating the
type of sexism the tweet embodies.</p>
      </sec>
      <sec id="sec-8-4">
        <title>Prompt - Gen 8</title>
        <p>Macro F1: T1.1 = 0.8470, T1.2 = 0.5350, T1.3 = 0.2250</p>
        <p>TASK 1: Evaluate the given tweet for the presence of sexism... provide specific examples of sexist
language and ofer suggestions for how the author could rephrase the tweet to avoid perpetuating
sexism, be clear about the criteria used for categorization, and provide a detailed breakdown of
the tone and context.</p>
        <p>TASK 2: As a social media analyst, determine the tweet’s intent: DIRECT, REPORTED, or
JUDGEMENTAL.</p>
        <p>TASK 3: As a Social Media Analyst, analyze the tweet for sexism, classifying it into one of three
categories: (1) Mocking feminism by ridiculing or disparaging feminist ideologies or individuals,
(2) Objectifying women by reducing them to body parts where women are represented or
addressed solely in terms of their physical appearance, or (3) Expressing direct aggression by
using threatening, derogatory, or violent language towards women, and indicate the type of
sexism the tweet embodies with concise reasoning.</p>
      </sec>
      <sec id="sec-8-5">
        <title>Prompt - Gen 10 (Final)</title>
        <p>Macro F1: T1.1 = 0.8478, T1.2 = 0.5724, T1.3 = 0.2477</p>
        <p>TASK 1: Evaluate the tweet, assigning YES if sexism is detected (direct, described, or criticized),
and categorize it as Direct Sexism, Described Sexism, or Criticized Sexism if sexism is present, or
NO if not.</p>
        <p>TASK 2: Given a tweet that contains a sexist message, identify the intention behind the message.
Choose one of the following categories: DIRECT, REPORTED, JUDGEMENTAL.
TASK 3: If the tweet was not classified as sexist in TASK 1, respond with: NO.
Otherwise, identify and list all the types of sexism that apply: IDEOLOGICAL-INEQUALITY,
STEREOTYPING-DOMINANCE, OBJECTIFICATION, SEXUAL-VIOLENCE,
MISOGYNY-NONSEXUAL-VIOLENCE.</p>
      </sec>
      <sec id="sec-8-6">
        <title>Final Prompt Template</title>
        <p>The final prompt is designed to elicit detailed and structured responses from the language model in a
JSON format. It consolidates general tweet-level metadata with task-specific classifications for sexism
detection, intention, and category. The evolved instructions are highlighted below.</p>
        <p>TWEET: { }
INSTRUCTIONS: Return a JSON object with the following attributes. Answer based on the
content of the tweet.
"RELEVANT_INFO": Include 4 diferent sections where you add relevant information and
insights for the analysis.
"DISCLAIMERS_AND_WARNINGS": Warnings and disclaimers on potential bias or sensitive
content in the analysis.
"TASK1": Evaluate the tweet, assigning YES if sexism is detected (direct, described, or criticized),
and categorize it as Direct Sexism, Described Sexism, or Criticized Sexism if sexism is present, or
NO if not.</p>
        <p>Examples for TASK1:
YES: "It’s less of #adaywithoutwomen and more of a day without feminists, which, to be quite
honest, sounds lovely"
YES: "I’m sorry but women cannot drive, call me sexist or whatever but it is true"
YES: "You look like a whore in those pants - My brother of 13 when he saw me in a leather pant"
NO: "Where all the white women at?"
NO: "The shocking video of a woman at the wheel who miraculously escapes an assassination
attempt"
NO: "Congratulations to the local high school basketball team on their big championship win!
The whole town is so proud"
"TASK2": Given a tweet that contains a sexist message, identify the intention behind the message.
Choose one of the following categories: DIRECT: The author expresses sexist ideas or incites
sexism. REPORTED: The author is sharing or describing a sexist event they witnessed or
experienced JUDGEMENTAL: The author is condemning or criticizing sexist behavior.
Examples for TASK2:
DIRECT: "A woman needs love, to fill the fridge, if a man can give this to her in return for her
services, I don’t see what else she needs."
REPORTED: "I doze in the subway; I open my eyes feeling something weird: the hand of the
man sat next to me on my leg #SquealOnYourPig."
JUDGEMENTAL: "As usual, the woman was the one quitting her job for the family’s welfare."
"TASK3": If the tweet was not classified as sexist in TASK 1, respond with: NO. Otherwise, identify
and list all the types of sexism that apply. Choose from the following categories:
IDEOLOGICALINEQUALITY: The tweet discredits the feminist movement, rejects gender inequality, or portrays
men as victims of gender-based oppression. STEREOTYPING-DOMINANCE: The tweet promotes
traditional gender roles, portrays women as less capable or suitable for certain tasks, or claims
male superiority. OBJECTIFICATION: The tweet reduces women to their physical attributes
or treats them as objects without personal agency. SEXUAL-VIOLENCE: The tweet includes
sexual harassment, explicit sexual suggestions, or references to sexual assault.
MISOGYNY-NONSEXUAL-VIOLENCE: The tweet expresses hatred or non-sexual violence toward women.
Examples for TASK3:
IDEOLOGICAL-INEQUALITY: "#Feminism is a war on men, but it’s also a war on women. It’s a
war on female nature, a war on femininity"
IDEOLOGICAL-INEQUALITY: "I think the whole equality thing is getting out of hand. We are
diferent, that’s how we’re made!"
STEREOTYPING-DOMINANCE: "Most women no longer have the desire or the knowledge to
develop a high quality character, even if they wanted to."
STEREOTYPING-DOMINANCE: "I feel like every time I flirt with a girl they start to imagine all
the ways they can utilize me"
OBJECTIFICATION: "No ofense but I’ve never seen an attractive African American hooker. Not
a single one"
OBJECTIFICATION: "Most of the times I just want women for sex just because every time I try
to make a relationship out of it they always break my heart"
SEXUAL-VIOLENCE: "Wanna touch your tits... you can’t imagine what I can do on your body"
SEXUAL-VIOLENCE: "Fuck that cunt, I would with my fist"
MISOGYNY-NON-SEXUAL-VIOLENCE: "Domestic abuse is never okay... Unless your wife is a
bitch"
MISOGYNY-NON-SEXUAL-VIOLENCE: "Some women are so toxic they don’t even know they
are draining everyone around them in poison. If you lack self-awareness you won’t even notice
how toxic you really are"</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>United</given-names>
            <surname>Nations General Assembly</surname>
          </string-name>
          ,
          <source>Resolution</source>
          <volume>3010</volume>
          (xxvii):
          <article-title>International women's year</article-title>
          , https: //undocs.org/en/A/RES/3010(XXVII),
          <year>1972</year>
          .
          <article-title>Proclaimed 1975 as International Women's Year and called for the first</article-title>
          <source>World Conference on Women.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>United</given-names>
            <surname>Nations</surname>
          </string-name>
          , Background - international
          <source>women's day</source>
          ,
          <year>2025</year>
          . URL: https://www.un.org/ en/observances/womens
          <article-title>-day/background, states that the UN began celebrating International Women's Day on 8 March during International Women's Year (</article-title>
          <year>1975</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>World</given-names>
            <surname>Economic</surname>
          </string-name>
          <string-name>
            <surname>Forum</surname>
          </string-name>
          ,
          <source>Global gender gap report</source>
          <year>2024</year>
          ,
          <year>2024</year>
          . URL: https://www.weforum.org/ reports/global-gender
          <source>-gap-report-2024</source>
          ,
          <article-title>the report shows that women continue to be the most afected by disparities in economic participation, political representation, health, and education, highlighting how sexist attitudes legitimize exclusion and gender-based violence</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>UN</given-names>
            <surname>Women</surname>
          </string-name>
          ,
          <article-title>Creating safe digital spaces free of trolls, doxing</article-title>
          and hate speech,
          <year>2023</year>
          . URL: https://www.unwomen.org/en/news-stories/explainer/2023/11/ creating-safe
          <article-title>-digital-spaces-free-of-trolls-doxing-and-hate-speech, explainer article</article-title>
          ,
          <source>UN Women Headquarters.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodríguez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          , L. Plaza,
          <article-title>Automatic classification of sexism in social networks: An empirical study on twitter data</article-title>
          ,
          <source>IEEE Access 8</source>
          (
          <year>2020</year>
          )
          <fpage>219563</fpage>
          -
          <lpage>219576</lpage>
          . doi:
          <volume>10</volume>
          .1109/ ACCESS.
          <year>2020</year>
          .
          <volume>3042604</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. C. de Albornoz</surname>
            , I. Arcos,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Morante</surname>
          </string-name>
          , Overview of EXIST 2025:
          <article-title>Learning with disagreement for sexism identification and characterization in tweets, memes, and TikTok videos</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality</source>
          , Multimodality, and Interaction.,
          <source>Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. C. de Albornoz</surname>
            , I. Arcos,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Morante</surname>
          </string-name>
          , Overview of EXIST 2025:
          <article-title>Learning with disagreement for sexism identification and characterization in tweets, memes, and TikTok videos (extended overview)</article-title>
          ,
          <source>in: CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodríguez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          , L. Plaza,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Comet</surname>
          </string-name>
          , T. Donoso, Overview of exist 2021:
          <article-title>sexism identification in social networks</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <fpage>195</fpage>
          -
          <lpage>207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodríguez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mendieta-Aragón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marco-Remón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Makeienko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , Overview of exist 2022:
          <article-title>sexism identification in social networks</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>69</volume>
          (
          <year>2022</year>
          )
          <fpage>229</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          , J. C. de Albornoz,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of exist 2023-learning with disagreement for sexism identification and characterization (extended overview)</article-title>
          .
          <source>, CLEF (Working Notes)</source>
          (
          <year>2023</year>
          )
          <fpage>813</fpage>
          -
          <lpage>854</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. C. de Albornoz</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Ruiz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Maeso</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chulvi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
          </string-name>
          , R. Morante,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>