<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, X. He, S. N. Wang, X. Li, S. Narang,
et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Sexism Identification in Social Networks using LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leire Dominguez-Sol</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabel Segura Bedmar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Human Language and Accessibility Technologies Group (HULAT), Computer Science and Engineering Department, Universidad Carlos III de Madrid</institution>
          ,
          <addr-line>Leganés, 28911, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2210</year>
      </pub-date>
      <volume>11416</volume>
      <fpage>1</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes our participation in the EXIST 2025 shared task on sexism detection in social media. We developed a variety of systems for both Task 1.1 (binary classification of sexism) and Task 1.2 (fine-grained categorization), combining traditional machine learning models, Transformer-based architectures, ensemble methods, and hybrid CNN-BERT approaches. Our approach incorporates data augmentation, and multilingual modeling strategies to address challenges such as label disagreement and language variation. Results indicate that ensembles of fine-tuned models and hybrid architectures are especially efective in handling noisy annotations and capturing nuanced sexist content. This work highlights the importance of combining architectural diversity with robust preprocessing and evaluation strategies in sensitive NLP tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Sexism detection</kwd>
        <kwd>NLP</kwd>
        <kwd>Transformer models</kwd>
        <kwd>Ensemble</kwd>
        <kwd>Multilingual classification</kwd>
        <kwd>EXIST 2025</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sexism generally refers to prejudice, stereotyping, or discrimination, generally targeting women, based
on sex or gender.. According to the European Institute for Gender Equality, it includes “any act, gesture,
visual representation, spoken or written words, practice or behavior, based upon the idea that a person
or a group is inferior because of their sex” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is based in beliefs about the fundamental nature and
roles of men and women, usually ending in the view that one sex is superior or more valuable than the
other. It can appear at multiple levels of society, from laws and institutions, to personal interactions,
as well as internalized attitudes. It can also be expressed in many ways and forms, including speech,
writing, and images, among others.
      </p>
      <p>
        The consequences of sexism are deep and significant. It often leads to discrimination, marginalization,
and unfair treatment towards women. The roots of sexism can be traced back to ancient civilizations,
where social structures were built on rigid gender roles. Traditionally, men have been seen as strong
leaders who participate in public life, while women have often been expected to focus on home and
family, occupying more subordinate roles. Throughout the centuries, these divisions have become
deeply embedded in cultural norms, legal systems, and even everyday language [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In addition to its
direct consequences, sexism reinforces existing power hierarchies and cultural narratives that normalize
inequality. Such dynamics produce cycles of disadvantage that are challenging to disrupt without
deliberate and collective societal change. Addressing sexism, therefore, is not only a matter of individual
behavior but a broader project of transforming institutions, cultural values, and social expectations.
      </p>
      <p>
        While sexism has been transmitted through traditional media, education, and institutions over the
years, the rise of digital platforms has given it new dimensions. In recent years, social media has
become a powerful space where sexist attitudes can be amplified, normalized and even monetized. This
digital environment facilitates the fast spread of sexist content, including stereotypes, hate speech, and
harassment which often targets women and girls. Studies show that a significant proportion of young
women (72,2%) are exposed to sexist remarks, body shaming, and unrealistic beauty standards on social
platforms, leading to negative emotions such as frustration, anxiety, and social isolation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Research also demonstrates that social media algorithms can increase the visibility of harmful content
by promoting posts that generates strong emotional reactions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This digital sexism not only mirrors
existing inequalities but also creates new forms of psychological, social, and political harm.
      </p>
      <p>Moreover, the persistence of sexist discourse in online spaces can have broader societal impacts. It
afects the younger generations views on gender roles and relationships, shapes public discourse, and
influences political decision-making. In this way, preventing sexism on social media is crucial for both
individual protection and the creation of a more inclusive and equitable online community.</p>
      <p>
        One of the primary issues in combating sexism on social media resides in its scale and quick expansion.
Unlike conventional forms of discrimination, digital sexism can be pervasive, instantaneous, and
impossible to monitor. It is often embedded in memes, comments and videos, which, along with the fact
that content is often user-generated and rapidly shared, makes it harder to detect using conventional
systems and even attempts to intervene [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>In response to this growing concern, multiple initiatives have emerged from both the public or private
sectors with the goal of both raising awareness and actively counteract gender bias. Among the most
innovative ones are coding competitions and hackathons, which tackle the creativity and technical
skills of diverse participants to propose tangible solutions for gender equality.</p>
      <p>
        One notable example is the womENcourage Hackathon [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which challenges participants to design
and prototype technological solutions addressing issues like bias in language technology. The 2024
edition, themed “AI Fair Play Hackathon: Ensuring Equity in Language Technology,” specifically targets
the detection and mitigation of sexist stereotypes and unfair treatment in digital communication [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Another important competition is EXIST (sEXism Identification in Social neTworks) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], organized
within the IberLEF (Iberian Languages Evaluation Forum), whose challenge is to bring together experts
in natural language processing (NLP), machine learning, and data science to develop automated systems
capable of identifying sexist content in text from social platforms such as Twitter [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Beyond the technical aspect, these competitions also increase awareness of the relevance of ethical
and social dimensions in artificial intelligence. They serve as platforms for advocacy, education, and
empowerment. By engaging a broad audience, which include students, teachers, and technologists
among others, they help build a shared understanding of the challenges posed by online sexism and the
collective responsibility to develop fair, inclusive, and socially aware technological solutions.</p>
      <p>
        In recent years, the EXIST shared task has established itself as a key benchmark for the automatic
detection of sexism in social media texts. Since its first edition in 2021, the top-performing systems
have predominantly relied on Transformer-based models often combined through ensemble strategies
or enhanced with data augmentation techniques. In the 2024 edition, the best-performing system
for the text-based task integrated multilingual Transformers with weighted ensembles, achieving
state-of-the-art results through the fusion of diferent architectural variants and voting mechanisms [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Building upon these findings, the present work contributes to the EXIST 2025 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] challenge by
developing a comprehensive pipeline that combines fine-tuned Transformer models with linguistic and
architectural enhancements. In particular, the project explores monolingual and multilingual models,
ensemble techniques and LLM prompting. It also incorporates variants such as CNN-BERT hybrids and
Retrieval-Augmented Classification (RAC), along with a detailed evaluation based on both overall and
per-class performance.
      </p>
      <p>The main contributions of this work include:
1. The implementation and evaluation of multiple Transformer-based architectures under the
constraints of the EXIST 2025 setup
2. The integration of ensemble models and data augmentation techniques with the goal of improving
robustness in ambiguous cases.</p>
      <p>This paper also aims to provide a critical perspective on the social implications of automated sexism
detection, especially in multilingual and cross-cultural contexts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>This section presents the methodological framework designed to address the two classification tasks
defined in the EXIST 2025 challenge. As outlined in previous sections, the detection and interpretation
of sexism in tweets present a number of inherent challenges for both tasks: the annotations are provided
by multiple annotators with potential disagreements; the linguistic expression of sexism can be subtle,
sarcastic, or culturally specific; and the dataset includes tweets in two languages, which requires models
capable of understanding both English and Spanish.</p>
      <p>To address these challenges and maximize model performance, a wide range of modeling approaches,
from traditional machine learning baselines to state-of-the-art transformers architectures was explored.
First, standard classifiers trained on bag-of-words and TF-IDF representations to establish a classical
baseline, were implemented. Focus was then shifted to fine-tuning a series of Transformer-based models,
including both multilingual and monolingual models trained specifically for English or Spanish. These
models form the core of the system. All models model were trained on both classification tasks of the
EXIST 2025 challenge. For Task 1.1, each model was trained to predict whether a tweet is sexist. For
Task 1.2, each model was trained to classify the intention behind the sexist content into three categories:
DIRECT, REPORTED, or JUDGEMENTAL</p>
      <p>
        In parallel, prompt-based learning with large language models (LLMs) was explored as well, testing
zero-shot and few-shot strategies [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This allowed to assess the performance of foundation models
without fine-tuning, and to compare them against supervised approaches under the same task definitions.
Beyond purely textual approaches, it was also experimented with hybrid models, integrating
BERTstyle encoders with convolutional neural network (CNN) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] components to assess whether injecting
additional learned features could improve classification.
      </p>
      <p>
        To further enhance performance and robustness, back translation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] was applied as a data
augmentation strategy, introducing lexical and syntactic variation through intermediate translation. This
technique was integrated into the training pipeline of Transformer-based models to improve
generalization across both languages.
      </p>
      <p>
        Finally, ensemble strategies were explored, combining predictions from multiple Transformer models
via majority voting and F1-weighted averaging. In parallel, we implemented a Retrieval-Augmented
Classification (RAC) framework [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], where a Sentence-BERT model was used to retrieve the top-3
most similar training tweets, which were concatenated as contextual input to a base Transformer. This
setup aimed to enhance the model’s ability to understand nuanced or ambiguous tweets by providing
relevant examples. RAC experiments were conducted on both monolingual and multilingual variants of
the underlying LLMs.
      </p>
      <p>The two classification tasks addressed in this work required slightly diferent modeling strategies.
For Task 1.1, the objective was to predict whether a tweet contains or refers to sexist content. This
was framed as a standard binary classification problem (YES / NO), and each multilingual model was
ifne-tuned accordingly. The label aggregation strategy used to define the ground truth 2.1 was kept
consistent within each experiment and was applied to both the training and test data. All multilingual
models were evaluated on both labeling schemes (majority vote or female-vote) to assess the impact of
the annotation strategy on classification performance.</p>
      <p>In contrast, Task 1.2 required a more sophisticated design. Two modeling approaches were explored:
1. Two-step pipeline: In this setup, a model was first trained on the binary labels of Task 1.1 to
identify sexist tweets. Then, a separate model was trained to classify the intention (DIRECT,
REPORTED, or JUDGEMENTAL) using only the subset of tweets labeled as YES. At inference time,
Task 1.1 was applied first, and the stance classification model was applied only to those tweets
predicted as sexist. This approach mirrors the task’s conceptual structure but could compound
errors from the first stage into the second.
2. Unified multiclass classification : Alternatively, a single model was trained to predict among
four classes: the three stance categories plus a fourth NO class representing non-sexist content.
This allowed the model to learn both the binary and fine-grained labels simultaneously, eliminating
the need for a separate model or an explicit dependency on the output of Task 1.1. Label
construction in this approach was adapted accordingly, applying the same aggregation strategies
but assigning the NO label to tweets originally labeled as non-sexist in Task 1.1.</p>
      <p>Both strategies were implemented for each model and for both label aggregation methods, enabling a
comprehensive comparison across architectures and training paradigms.</p>
      <p>The remainder of this section details each one of these approaches, explaining the motivations behind
their selection, their implementation details, and how they fit into the overall system design, as well as
two diferent approaches for obtaining the target label from the dataset.</p>
      <sec id="sec-2-1">
        <title>2.1. Label Construction</title>
        <p>One of the key challenges of this task lies in the definition of the target label, as we mentioned. To
obtain the labels required for Task 1.1 and Task 1.2, we had to construct the label column ourselves.
This introduces a critical design choice that directly afects the dataset composition and, consequently,
model performance. We explored and compared two strategies for Task 1.1, and applied analogous logic
to Task 1.2.</p>
        <p>1. Strict Majority Voting: A tweet is labeled as sexist (YES) if it has at least four or more of the six
annotators’ labels set to YES. Otherwise, it is labeled as non-sexist (NO). In cases where there is
no majority (i.e., a tie of 3 YES and 3 NO), the tweet is discarded and excluded from the dataset.
This aggregation strategy reflects the approach oficially used by the organizers for generating
the hard ground truth labels in the test set, ensuring consistency between training and evaluation.
2. Filtered Aggregation: In this alternative approach, we implemented a more conservative
aggregation strategy that leverages annotator metadata. Specifically, instead of discarding tweets with
tied or ambiguous votes, we resolved these cases by assigning the label that was most frequently
selected by the female annotators. This decision was motivated by the fact that the dataset
includes the gender of each annotator, and the annotation setup always includes three women and
three men per tweet. As a result, we could break ties systematically by prioritizing the majority
vote among female annotators. As a result, this version retains all tweets, including those that
would be discarded under strict majority voting, and therefore yields a larger training set.</p>
        <p>For Task 1.2, which focuses on identifying the author’s intention behind a sexist tweet, we applied a
labeling strategy that mirrors the logic used for Task 1.1, while accounting for the specific structure of
this subtask. Since Task 1.2 is only defined for tweets labeled as sexist (i.e., with label YES in Task 1.1),
we first filtered the dataset accordingly.</p>
        <p>To assign a unique intention label (DIRECT, REPORTED or JUDGEMENTAL), it was considered only
the annotations provided by those annotators who had also labeled the tweet as sexist in Task 1.1. If
a clear majority emerged among these annotators (i.e., one label appeared more frequently than any
other), that label was assigned as the final Task 1.2 label. In cases where there was no majority, meaning
a tie among the most frequent labels, the intention labels provided by the three female annotators were
examined, restricted again to those who had voted YES in Task 1.1. If this subset exhibited a majority
for one label, that label was assigned.</p>
        <p>When neither of the previous steps yielded a conclusive result, a final fallback mechanism was
applied: one of the most frequent valid labels among the Task 1.2, was randomly selected provided the
annotators had voted YES in Task 1.1. This random choice was constrained to valid classes (excluding
and UNKNOWN). If no valid label was available at all, an edge case occurring when all labels were either
missing or invalid, we assigned the label UNKNOWN as a last resort, which was later filtered out of our
training set.</p>
        <p>This multi-step resolution strategy ensures that we maximize the number of usable annotations
for Task 1.2 without introducing arbitrary biases, while also making full use of the gender metadata
provided by the dataset. It also ensures that annotation disagreements are resolved using a reproducible
and interpretable process that leverages annotator metadata when necessary.</p>
        <p>Both approaches were used during experimentation to understand their efect on model performance,
dataset balance, and generalization. The choice of label construction has a direct impact on the class
distribution and the number of training examples, which must be considered carefully when designing
classifiers and evaluation pipelines.</p>
        <p>Finally, it is worth noting that the dataset comprises tweets collected from Twitter, a platform
characterized by informal language, abbreviations, emojis, and a strong reliance on cultural and social
context. This makes the detection of sexism particularly challenging, as sexist content is often expressed
in implicit or ironic ways, rather than through overtly ofensive language. This motivates the use of
context-aware models such as Transformers and Large Language Models (LLMs), which are capable of
capturing nuanced patterns in language, even with limited surface-level cues.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Traditional Machine Learning Approaches</title>
        <p>This section presents the traditional machine learning approach used as a baseline for the binary
classification task. The method relies on transforming raw textual data into structured feature representations,
followed by training well-established classifiers. The goal is to assess how these classical pipelines
perform when applied to social media texts, before comparing them with modern deep learning methods.</p>
        <p>
          To transform raw tweets into numerical feature representations, TF-IDF vectorization was applied.
This model is an extension of the Bag of Words (BoW) model, where BoW represents each document as
a vector in which each feature corresponds to a word from a vocabulary, and its associated value is the
frequency of that word in the document. The vocabulary is built from a corpus of documents, where
each unique word across the entire corpus defines a feature in the bag-of-words representation. TF-IDF
scales term frequencies inversely by their document frequency, down-weighting frequent terms that
are less informative and emphasizing rare but potentially discriminative ones. A detailed description
of these text representation models can be found in Salton and Buckley (1988) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Bigrams were
included in addition to unigrams (individual words), allowing the models to capture short contextual
expressions that are often relevant in social media texts. Three diferent text normalization strategies
were compared in order to assess their impact on classification performance:
1. The basic variant involved lowercasing, accent normalization (e.g., á → a), stopword removal,
using NLTK stopword lists for English and Spanish [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], and tokenization. User mentions and
non-alphabetic characters were removed via regular expressions. No morphological reduction
was applied in this version.
2. The lemmatized variant followed a similar cleaning procedure but applied lemmatization using
spaCy tool to reduce each token to its base form. This strategy was intended to group inflected
forms of the same word (e.g., "drives", "driving" → "drive"), potentially improving generalization.
3. The stemmed variant replaced lemmatization with stemming, using the Snowball stemmer
implemented in the NLTK library for each language [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Stemming reduces words to their root
forms using rule-based truncation (e.g., “driving”, “driven” → “driv”), which is computationally
cheaper but less linguistically accurate [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>As a baseline for the binary classification task, traditional machine learning algorithms were
implemented using the previously described text representation pipelines. Two classifiers were selected
for this purpose: Multinomial Naive Bayes (NB) [15] and Support Vector Machine (SVM) [16] with a
linear kernel. Both models have been widely used in text classification tasks due to their eficiency
and competitive performance, especially when combined with appropriate preprocessing and feature
extraction methods [15, 16].</p>
        <p>Naive Bayes classifiers rely on probabilistic reasoning under the assumption of feature independence.
In the case of the Multinomial Naive Bayes model, word frequencies are treated as features drawn
from a multinomial distribution, which is particularly suited for bag-of-words representations [17]. On
the other hand, the Support Vector Machine classifier constructs a hyperplane in a high-dimensional
space to separate classes, and has been shown to perform well in sparse feature spaces, such as those
generated by text data [18].</p>
        <p>Each preprocessing method was applied independently to the training and test sets, resulting in three
parallel datasets. These were then paired with each classifier (NB or SVM), producing a total of six
distinct classification pipelines.</p>
        <p>The final architecture for each pipeline can be summarized as shown in Figure 1.</p>
        <p>This baseline setup was designed not only to provide initial reference scores, but also to evaluate how
diferent levels of linguistic normalization afect model performance in a controlled environment. These
results later serve as a benchmark for more advanced architectures explored in subsequent sections.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Transformer-Based Models for Monolingual Datasets</title>
        <p>Transformer-based architectures have become the foundation of modern NLP due to their ability to
model contextual relationships between words using self-attention mechanisms [19]. Unlike recurrent
or convolutional neural networks, Transformers allow for direct access to all positions in the input
sequence simultaneously through self-attention mechanisms. This architecture enables eficient parallel
training and has led to significant improvements in a wide range of natural language understanding
tasks [19]. Although a more detailed overview of their theoretical foundations is provided in the State
of the Art section, a brief introduction is included here to contextualize the models used in this work.</p>
        <p>The original Transformer architecture, introduced by Vaswani et al. (2017) [19], consisted of both
an encoder and a decoder, and was originally designed for machine translation tasks. Subsequent
models have adapted this architecture for other NLP applications by using only one of the two modules.
Encoder-only architectures, such as BERT [20], are commonly used for classification and understanding
tasks, while decoder-only architectures, like GPT [21], are typically employed for text generation.</p>
        <p>In the standard Transformer encoder, each token in the input sequence is represented by a
contextualized embedding computed via multi-head self-attention and feed-forward layers. These representations
are then used for downstream tasks through additional components, such as classification heads. In
the case of text classification, a special token (typically [CLS]) is added to the input and its final
representation is used as the aggregate embedding for classification.</p>
        <p>Two monolingual Transformer models were selected for fine-tuning: BERT and RoBERTa. These
models were pretrained exclusively on English or Spanish corpora and subsequently fine-tuned on
task-specific data depending on the language of the tweet.</p>
        <p>BERT (Bidirectional Encoder Representations from Transformers) was introduced by Devlin et al.
in 2019 as one of the first language models to leverage deep bidirectional representations by jointly
conditioning on both left and right context in all layers [20]. The model is built upon the encoder
architecture of the Transformer and is pretrained using two self-supervised objectives:
• Masked Language Modeling (MLM): A subset of tokens (typically 15%) in the input is replaced
with a special [MASK] token, and the model is trained to predict the original tokens based on the
surrounding context [20].
• Next Sentence Prediction (NSP): Given a pair of sentences, the model is trained to predict whether
the second sentence logically follows the first. This objective helps the model capture
intersentence relationships [20].</p>
        <p>In this work, the base version of BERT (BERT-Base) was employed, which consists of 12 Transformer
encoder layers, 12 self-attention heads per layer, and a hidden size of 768, amounting to approximately
110 million parameters. The English model used was bert-base-uncased [22], pretrained on BookCorpus
and English Wikipedia, comprising over 3.3 billion words.</p>
        <p>RoBERTa (Robustly Optimized BERT Approach) is a variant of BERT introduced by Facebook AI in
2019 with the aim of improving the pretraining process of Transformer-based language models [23].
While maintaining the original BERT architecture, RoBERTa introduces a number of key modifications
that enhance model performance across a wide range of NLP tasks.</p>
        <p>One of the main diferences lies in the removal of the Next Sentence Prediction (NSP) objective. In
BERT, NSP was used during pretraining to help the model learn sentence-level coherence, but later
studies showed that removing it can lead to improved performance in downstream tasks. RoBERTa
discards NSP and focuses entirely on Masked Language Modeling (MLM), which enables the model to
better capture token-level dependencies.</p>
        <p>Another important modification is the use of dynamic masking. While BERT applies masking once
during data preprocessing, RoBERTa regenerates mask patterns at each epoch. This introduces more
variability and forces the model to learn contextual representations more robustly. In addition, RoBERTa
was trained on significantly more data (over 160 GB of text) from multiple sources such as BookCorpus,
English Wikipedia, CC-News, OpenWebText, and Stories [23]. This extended corpus, combined with
longer training schedules, larger batch sizes, and higher learning rates, results in a more powerful and
generalizable language model. In this work, the pretrained roberta-base-bne [24] was used for Spanish
tweets, which comprises around 125 million parameters.</p>
        <p>Both models were retrieved from the Hugging Face Model Hub and fine-tuned independently on
their respective language subsets and later merged to create the final output for evaluation.</p>
        <p>In both cases, predictions were generated independently for English and Spanish, and then merged
during inference to construct unified outputs for evaluation. This strategy enabled the models to
specialize in the linguistic nuances of each language while still contributing to the overall multilingual
classification pipeline.</p>
        <p>This monolingual modeling approach allowed the classifiers to benefit from language-specific
pretraining, which can be especially advantageous for handling subtle or idiomatic expressions of sexism
that vary across linguistic and cultural contexts.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Transformer-Based Models for Multilingual Datasets</title>
        <p>Multilingual versions of transformer models are pretrained on corpora covering many languages,
making them capable of handling cross-lingual input without requiring explicit translation or separate
models. This is particularly advantageous in the context of EXIST 2025, where the dataset consists of
tweets in both English and Spanish.</p>
        <p>For this reason, several multilingual Transformer models were fine-tuned to address both subtasks.
All models were fine-tuned on the whole dataset (containing tweets from both languages) and trained
using a consistent setup, including identical training procedures and evaluation metrics. Tokenization
was performed using the model-specific tokenizer, and classification was implemented via a linear layer
applied to the [CLS] representation, followed by a task-appropriate activation function.</p>
        <p>The multilingual models evaluated in this work include bert-base-multilingual-cased (Multilingual
BERT) [25], mdeberta-v3-base (Multilingual DeBERTa) [26], distilbert-base-multilingual-cased
(Multilingual DistilBERT) [27], and xlm-roberta-base (XLM-RoBERTa) [28]. In addition, the monolingual English
model roberta-base [29] was included in the experiments to assess its performance on multilingual data.</p>
        <p>Each of these models will be described, focusing on their specific architecture, pretraining objectives,
and adaptation to the EXIST 2025 tasks.</p>
        <p>The general architecture and training objectives of BERT were already introduced in the monolingual
Section 2.3 for English tweets. In this section, the focus shifts to its multilingual counterpart,
bertbase-multilingual-cased, which was employed to jointly model both English and Spanish texts.</p>
        <p>This model, available on the Hugging Face Model Hub [25], was pretrained on the concatenation of
Wikipedia dumps from the 104 largest languages using a cased WordPiece vocabulary. Its multilingual
pretraining makes it capable of handling inputs in both English and Spanish without the need for
language-specific customization. Unicode support and case preservation further enhance its applicability
to social media content, where capitalized words and named entities often carry important semantic
information.</p>
        <p>The architectural details and pretraining objectives of RoBERTa were already introduced in the
monolingual section 2.3, where a Spanish-adapted variant, roberta-base-bne, was fine-tuned on
Spanish tweets. In addition to this monolingual model, the general-purpose English model
robertabase from Facebook AI [29] was also evaluated in the multilingual track to assess its robustness when
ifne-tuned on bilingual data. This model shares the same architecture as the Spanish variant, but was
pretrained on English-only corpora.</p>
        <p>The motivation behind this choice was to investigate whether a high-capacity model trained
exclusively on English data could generalize efectively when exposed to a mixed-language dataset without
any additional cross-lingual adaptation. While this model lacks native multilingual support, it was
included to explore its ability to handle bilingual social media content implicitly through fine-tuning on
the combined English–Spanish dataset.</p>
        <p>mDeBERTa (Multilingual Decoding-enhanced BERT with disentangled attention) is a multilingual
extension of DeBERTa, a Transformer-based model architecture introduced by Microsoft with the
goal of improving contextual representation by modifying both attention mechanisms and positional
encoding strategies. The original DeBERTa architecture separates the representation of content and
positional information in the attention layers and includes an enhanced mask decoder, resulting in
better performance on several benchmarks compared to BERT and RoBERTa [30].</p>
        <p>Two main innovations characterize the DeBERTa architecture. First, the disentangled attention
mechanism separates the representation of token content and position across diferent vector subspaces.
Unlike traditional Transformer models, where positional and semantic information are combined within
the same vector, DeBERTa encodes them independently, allowing the model to better capture structural
and contextual relationships. Second, an Enhanced Mask Decoder (EMD) is used during pretraining,
which improves the model’s ability to recover masked tokens, resulting in more informative internal
representations.</p>
        <p>The multilingual version, mdeberta-v3-base, builds upon the improvements of DeBERTa v3 and
extends them to a cross-lingual setting. Unlike previous multilingual models that rely on shared
vocabularies and token-level transfer, mDeBERTa leverages a disentangled attention mechanism, which
allows it to represent word meaning and position more flexibly, leading to improved performance across
a range of languages.</p>
        <p>In this project, the version mdeberta-v3-base from Microsoft was used, retrieved via Hugging Face
[26]. It consists of 12 layers, 768 hidden dimensions, and 12 attention heads, and was fine-tuned on
the combined English–Spanish dataset provided by the EXIST 2025 competition. Tokenization was
performed using the default tokenizer for mDeBERTa, which uses SentencePiece with a vocabulary of
250,000 subwords.</p>
        <p>Due to its architectural innovations and multilingual capabilities, mDeBERTa serves as a strong
candidate for addressing complex tasks such as sexist language detection in heterogeneous social media
content.</p>
        <p>DistilBERT is a lightweight and faster version of the original BERT model, developed through a
process known as knowledge distillation. It was introduced by Sanh et al. in 2019 with the goal of
reducing the size and computational cost of BERT, while maintaining most of its performance on
standard NLP benchmarks [31]. The model is designed to serve as a compact student network that
learns to approximate the output distributions of a larger, pretrained teacher model, in this case, BERT.</p>
        <p>Architecturally, DistilBERT follows the same encoder-based Transformer design as BERT, but with
a reduced number of layers and simplified training objectives. Specifically, it reduces the number of
Transformer layers from 12 to 6, resulting in a model that is roughly 40% smaller and 60% faster in
inference time, with approximately 66 million parameters. Despite this reduction, it preserves the same
embedding dimension (768) and the same number of attention heads (12) as BERT-base. In addition,
regularization techniques such as dropout and input noise were incorporated during training to improve
generalization and stability.</p>
        <p>The model was pretrained using a triple loss function combining distillation loss, masked language
modeling (MLM) loss, and cosine embedding loss, which together enable the student model to mimic
the hidden states and output predictions of its teacher [31].</p>
        <p>In this work, the distilbert-base-multilingual-cased model from Hugging Face [27] was used.
Unlike the original DistilBERT, which was pretrained on English-only corpora, this multilingual version
was trained on data from 104 languages and tokenization was carried out using the default WordPiece
tokenizer associated with the model.</p>
        <p>XLM-RoBERTa is a multilingual extension of the RoBERTa architecture, introduced by Facebook AI
in 2020 as part of their eforts to improve cross-lingual language modeling [ 32]. It inherits the same
Transformer-based encoder design and masked language modeling (MLM) objective from RoBERTa,
but is pretrained on a significantly broader linguistic corpus. While RoBERTa was trained exclusively
on English text, XLM-RoBERTa was trained on 100 languages using data from CommonCrawl, totaling
2.5 terabytes of filtered text.</p>
        <p>Its tokenizer is based on SentencePiece, which allows the model to operate on byte-level subwords,
making it suitable for multilingual processing without requiring language-specific preprocessing or
vocabularies. This enables robust cross-lingual transfer and zero-shot learning.</p>
        <p>The variant used in this work is xlm-roberta-base, which consists of 12 layers, 768 hidden
dimensions, and 12 attention heads, summing up to approximately 270 million parameters. The model was
obtained from Hugging Face [28] and fine-tuned on the EXIST 2025 dataset.</p>
        <p>XLM-RoBERTa serves as one of the most powerful pre-trained multilingual models available,
demonstrating competitive performance in a variety of multilingual benchmarks such as XNLI and MLQA.
In this context, it was included to assess whether a model explicitly trained on multilingual corpora
outperforms other alternatives in detecting sexist content across both Spanish and English tweets.</p>
        <p>In sum, we use the following transformers to address the multilingual setting of both tasks.</p>
        <p>All models were trained and adapted to each task by simply adjusting the number of output classes.
The training pipeline used identical hyperparameters, including a linear learning rate scheduler, early
stopping, and evaluation based on validation loss. Table 2 summarizes the key training configurations
used in this work.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Hybrid CNN-BERT Architectures</title>
        <p>To investigate whether convolutional representations can complement transformer-based embeddings
in the task of sexism detection, a hybrid architecture combining CNNs with pre-trained BERT-style
tokenizers was implemented. This approach seeks to leverage the local pattern extraction capability of
CNNs alongside the rich contextual embeddings provided by Transformer tokenizers.</p>
        <p>Rather than using the contextual embeddings generated by the full transformer encoder, the proposed
architecture integrates only the corresponding tokenization and vocabulary structure of the pre-trained
models. The tokens were first converted into integer input_ids using pre-trained tokenizers such
as bert-base-uncased, bert-base-multilingual-cased,roberta-base, and distilbert-base-uncased.
These token IDs were then passed to a CNN model, without using the transformer encoders themselves,
enabling a lighter-weight architecture while still preserving the semantic structure captured by the
tokenizer.</p>
        <p>
          The CNN model adopted in this work is based on the classical The CNN model adopted in this work
is based on the classical TextCNN architecture introduced by Kim (2014) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. It includes an embedding
layer followed by parallel convolutional layers with kernel sizes of 3, 4, and 5, each capturing n-gram
features of diferent lengths. These feature maps are max-pooled and concatenated before being passed
through a dropout layer and a fully connected classification head. This design allows the model to
learn multiple local patterns that are potentially relevant to the task, such as discriminatory phrases or
idiomatic expressions.
        </p>
        <p>This hybrid CNN-BERT setup ofers a lightweight yet competitive alternative to full fine-tuning of
large transformer models, enabling faster training while still benefiting from pre-trained linguistic
knowledge embedded in the tokenizer vocabularies. The final architecture implemented in this hybrid
setup is summarized in Figure 2, where the input tweets are first tokenized using pre-trained Transformer
tokenizers and subsequently processed by a convolutional neural network with multiple kernel sizes
and max-pooling layers.</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.6. LLM Prompting</title>
        <p>
          Recent advances in large-scale pretraining have enabled the development of LLMs, which are
transformer-based architectures trained on massive text corpora to learn general-purpose language
representations [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Unlike traditional models that require task-specific fine-tuning, LLMs such as GPT-3
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], GPT-4 [33], LLaMA [34], and PaLM [35] are capable of performing a wide range of downstream
tasks, such as classification, question answering, summarization, or sentiment analysis, by conditioning
on natural language instructions alone, a paradigm known as prompt-based learning or in-context
learning.
        </p>
        <p>Prompt:
Classify the following tweet as sexist (YES) or not sexist (NO):
"I guess she got the job because she’s cute."</p>
        <p>Input: Given a tweet, classify it as either YES (sexist) or NO (non-sexist).</p>
        <p>Example 1: "She got the job just because she’s a woman." → YES
Example 2: "Everyone deserves equal rights regardless of gender." → NO</p>
        <p>Test Tweet: "Maybe she’d understand it if she stopped being so emotional." →</p>
        <p>These LLMs are pretrained using the causal language modeling objective, which predicts the next
token given a preceding context. This simple objective, when scaled up to billions of parameters and
trained on hundreds of billions of tokens, results in emergent capabilities such as instruction-following,
multilinguality, reasoning, and even zero-shot or few-shot generalization. Importantly, these capabilities
can be obtained without updating the model weights, simply by providing a well-crafted input prompt.</p>
        <p>
          In this setting, the model is given a task description and, optionally, a few examples in natural language,
and it generates an output that satisfies the task requirement (see example in Table 3 and Table 4). This
framework is highly appealing in practical scenarios because it allows for task adaptation without
additional fine-tuning, making it especially suitable when labeled data is scarce, or when inference is
needed across multiple tasks or domains [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In this work, the following prompting strategies have
been explored:
1. Zero-shot prompting, where the model receives only an instruction or a question describing
the task (see Table 3). In this setup, the model is asked to classify a tweet as sexist (YES) or not
sexist (NO) using only a natural language prompt, without access to labeled data or examples.
This evaluates the model’s ability to generalize from its pretrained knowledge. Outputs were
post-processed to normalize variations like yes, Sí, or NOPE into canonical labels.
2. Few-shot prompting, where the model is provided with a natural language instruction along with
a small set of manually curated tweet–label pairs to illustrate the task, typically 3–5 examples (see
table 4). This strategy, popularized by GPT-3 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], enables the model to infer the desired output
format by analogy. The examples were selected to reflect diverse linguistic patterns and label
contexts. The same decoding setup as in the zero-shot configuration was used, and outputs were
post-processed to standardize the label format.
        </p>
        <p>The goal of evaluating these methods in the EXIST 2025 challenge was to understand whether LLMs
can classify sexist content in tweets without task-specific fine-tuning, and how their performance
compares to supervised models trained with domain-specific labels.</p>
        <p>To carry out the experiments in all prompting strategies, we relied on several publicly available large
language models from the Hugging Face Model Hub. These include encoder-only models fine-tuned for
Natural Language Inference (NLI), such as vicgalle/xlm-roberta-large-xnli-anli [36],
joeddav/xlm-robertalarge-xnli [37], MoritzLaurer/mDeBERTa-v3-base-mnli-xnli [38], and cross-encoder/nli-deberta-v3-large
[39], as well as the generative model google/flan-t5-large [40]. These models were selected to cover
both classification-oriented and generative prompting paradigms.</p>
        <p>The first group of models is based on pretrained transformers (such as RoBERTa, XLM-RoBERTa,
DeBERTa, and mDeBERTa) that were fine-tuned for NLI tasks like MNLI or XNLI. In zero-shot settings,
these models can be applied to new classification tasks by reframing the input as an NLI
hypothesispremise pair and selecting the most probable label. Since their architectures are encoder-only and
operate on fixed input-output label sets, they were not compatible with few-shot prompting.</p>
        <p>In contrast, flan-t5-large is a 770M-parameter encoder–decoder model developed by Google as part
of the FLAN project [41]. It was instruction-tuned on a large collection of diverse NLP tasks using
textual prompts and natural language feedback. Unlike traditional fine-tuning that adapts models to a
specific dataset, instruction tuning guides the model to follow task instructions expressed in natural
language. This strategy has shown to significantly enhance the zero-shot and few-shot generalization
abilities of LLMs.</p>
        <p>In summary, prompting ofers a flexible and lightweight approach to adapting LLMs to classification
tasks and provides a valuable alternative when labeled data is scarce or quick prototyping is required.</p>
      </sec>
      <sec id="sec-2-7">
        <title>2.7. Transformer Ensembles</title>
        <p>Transformer ensemble models are a well-established technique in machine learning to improve the
robustness, stability, and predictive accuracy of a system by combining the outputs of multiple individual
models. Rather than relying on a single model, ensembles aggregate predictions from a diverse set of
architectures or training runs, mitigating individual biases and taking advantage of complementary
strengths [42].</p>
        <p>In this project, ensemble methods were applied to the multilingual Transformer models described
previously (see section 2.4. The goal was to leverage the diferent generalization patterns and inductive
biases of architectures such as BERT, RoBERTa, mDeBERTa, DistilBERT, and XLM-RoBERTa to create a
unified output that is more reliable than any of the individual models alone.</p>
        <p>Each of the five multilingual models was fine-tuned independently on the combined English–Spanish
dataset for both tasks. During inference, the predictions produced by each model were collected and
combined using two ensemble strategies:
1. Majority Voting: For each tweet, the predicted labels from all five multilingual models were
collected and aggregated. In Task 1.1, the number of models is odd, and only two possible
labels ("YES" or "NO") exist. As a result, a unique majority always emerges, making tie-breaking
unnecessary.</p>
        <p>However, in Task 1.2, the set of possible labels includes "DIRECT", "REPORTED",
"JUDGEMENTAL", and "NO", which makes ties more likely. In these cases, the following rule was adopted:
• If a single label receives more votes than any other, it is selected as the final prediction.
• In case of a tie between multiple labels, the final label is randomly chosen among the tied
candidates. However, if one of the tied labels is NO and at least one of the others is not, NO
is excluded from the random selection to favor the detection of sexist content.</p>
        <p>This heuristic prioritizes detecting sexist content over misclassifying it as non-sexist, aligning
with the task’s sensitivity to false negatives in real-world applications.
2. Weighted ICM Voting: In this strategy, each model’s prediction was weighted according to its
validation ICM-score. The final prediction for each tweet was the class with the highest total
weighted score across all model outputs. This method gives more influence to models that have
demonstrated stronger performance during training, particularly useful when certain models are
more reliable under specific linguistic patterns or data conditions.</p>
        <p>The ensemble architecture consists of two main stages. First, each input tweet is passed independently
through all five multilingual models. Each model returns a label prediction, which is then aggregated by
a decision module. Depending on the selected strategy, this module either computes the most frequently
predicted label (majority vote) or calculates a weighted score for each label using the ICM-score weights
(weighted voting). The final decision is the label with the highest aggregated support. Figure 3 illustrates
the architecture followed by an ensemble with majority vote.</p>
        <p>By combining multiple models in this way, the ensemble approach aims to reduce variance and
capture diverse perspectives embedded in diferent pretraining corpora and architectures. This is
particularly beneficial in challenging classification tasks such as sexism detection in social media, where
subtle linguistic cues and cross-lingual variability require nuanced interpretation.
mDeBERTa
DistilBERT</p>
        <p>RoBERTa</p>
        <p>XLM-RoBERTa</p>
      </sec>
      <sec id="sec-2-8">
        <title>2.8. Retrieval-Augmented Classification (RAC) with Fine-Tuning</title>
        <p>In addition to traditional fine-tuning and prompting strategies, this work explores a Retrieval-Augmented
Classification (RAC) approach, inspired by the Retrieval-Augmented Generation (RAG) framework
introduced by Lewis et al. (2020) in the context of open-domain question answering and
knowledgeintensive tasks [43]. While RAG was originally designed for generation, its core idea, which is injecting
relevant retrieved documents into the model’s input, has inspired adaptations to classification tasks as
well [44].</p>
        <p>RAC is a hybrid modeling paradigm that combines neural language models with a non-parametric
retrieval mechanism to improve performance on downstream tasks, particularly in low-data or
linguistically complex settings. Unlike standard prompting methods that rely on a fixed template or a small
set of manually selected examples, RAC dynamically retrieves semantically similar examples from a
support set, that is, a predefined corpus consisting of labeled training examples. Each instance in this
corpus is composed of a tweet and its corresponding ground-truth label.</p>
        <p>These retrieved examples are incorporated into the input both during training and inference, providing
additional context that is specific to each input tweet. This allows the model to condition its prediction
on relevant past examples, efectively adapting the decision boundary in a local and context-aware
manner. In this work, the RAC framework was applied both during training and inference. The models
were fine-tuned on inputs enriched with retrieved examples, allowing them to learn from semantically
similar instances throughout the training process. This approach enables the model to incorporate
retrieved context into its reasoning not only at inference time, but also during the optimization of
model parameters. This setup enables the model to leverage the diversity of the training corpus by
ifne-tuning on inputs enriched with semantically similar examples, allowing the model to adapt its
internal parameters based on the retrieved context.</p>
        <p>The main motivation for using RAC lies in its ability to enhance classification performance in
lowresource or ambiguous settings by injecting external examples that mirror the semantics of the input.
Prior work has shown that this strategy can improve generalization, particularly when the test data
contains nuanced patterns that benefit from analogical reasoning or contextual reinforcement [45].</p>
        <p>To implement the RAC framework, a two-stage pipeline was implemented. The first stage involves
retrieving contextually relevant examples for each tweet, and the second stage augments the tweet with
this retrieved information before feeding it to the classifier.</p>
        <p>For the retrieval step, a Sentence-BERT model, paraphrase-multilingual-MiniLM-L12-v2 [46],
was used to encode the training tweets into dense vector representations. This model supports over 50
languages and is based on a distilled version of the Transformer architecture, optimized for cross-lingual
sentence similarity tasks. These embeddings were normalized and indexed using FAISS with inner
product similarity, equivalent to cosine similarity on normalized vectors. For each tweet in the training
and test sets, the top-k most similar examples were retrieved from the training corpus. In order to
prevent unintended information leakage during training, self-retrieval was prevented by excluding the
query tweet from its own results.</p>
        <p>This structure ensures that the classifier receives both the original tweet and relevant auxiliary
content that may guide its decision. The complete pipeline was implemented using PyTorch and
Hugging Face Transformers, allowing seamless integration with existing training and evaluation code.</p>
        <p>By choice, the top three retrieved results (k =3) were used as contextual information. However, the
retrieval setup is flexible and can be easily extended to incorporate a larger number of neighbors if
needed. The same retrieval and context injection procedure was applied symmetrically to both training
and test sets to ensure consistency and avoid domain shift between training and inference.</p>
        <p>This setup allows the downstream classifier to make predictions not only based on the original
tweet, but also informed by semantically related examples from the training distribution. Unlike
earlier stages of this project, where static, hand-crafted contexts were used, this approach leverages
real semantic similarity computed from large-scale data, aligning more closely with the principles of
retrieval-augmented learning and enabling more informed and context-aware predictions.</p>
        <p>Once the retrieval step had enriched each input tweet with semantically related content, the resulting
inputs were integrated directly into a Transformer-based sequence classification model, such as
XLMRoBERTa, which was then fine-tuned on these enriched inputs. Specifically, each training and test
instance consisted of the original tweet followed by a context string containing the top-k retrieved
neighbors, structured as:
[TWEET]: {original tweet} [CONTEXT]: {retrieved tweet 1}. {retrieved tweet
2}. {retrieved tweet 3}</p>
        <p>This format was treated as a single input sequence and tokenized using the pretrained tokenizer
corresponding to the base model, in the context of this project, the five multilingual models 2.4. The
tokenized inputs were then fed into a Transformer encoder with a classification head, following the
same fine-tuning procedure described in earlier sections.</p>
        <p>
          However, no architectural changes were made to the base classifier. The only modification was the
inclusion of retrieved context within the input string. This decision was based on the design philosophy
of retrieval-augmented classification (RAC): enhancing model reasoning through external information
without modifying its internal structure [
          <xref ref-type="bibr" rid="ref12">47, 12</xref>
          ]. This seamless integration into the pipeline provided
two main benefits:
1. It enabled the model to directly leverage semantically similar examples from the training tweets,
both during training and inference, without requiring external knowledge bases. The retrieval
step was always performed using the available training data.
2. It was fully compatible with the existing fine-tuning framework in Hugging Face Transformers,
as the retrieval-augmented inputs could be processed and tokenized in the same way as standard
text classification inputs, without introducing architectural changes to the model.
        </p>
      </sec>
      <sec id="sec-2-9">
        <title>2.9. Data Augmentation</title>
        <p>Data augmentation is a well-established technique in Natural Language Processing (NLP) aimed at
improving model generalization and robustness, especially in scenarios with limited annotated data
or class imbalance. The goal is to artificially expand the training dataset with additional examples
that preserve the underlying semantics but vary in form, thus reducing overfitting and enhancing
performance on unseen inputs [48].</p>
        <p>Among the various augmentation strategies, back translation has proven to be particularly efective
for text classification tasks [ 49], as it introduces lexical and syntactic diversity while preserving the
original semantic intent of the input text. It involves translating a text into an intermediate language
and then translating it back to the original language. This process generates paraphrased versions of
the input while maintaining its core meaning. In this work, back translation was applied bidirectionally
for both English and Spanish tweets. The intermediate language selected was French, based on its wide
support in machine translation models and its linguistic distance from both source languages.</p>
        <p>To implement this, a two-stage translation pipeline was built using pretrained MarianMT models
from the Hugging Face Model Hub [50]. For English tweets, the pipeline involved translation from
English to French (en → fr) followed by French to English (fr → en). For Spanish tweets, a similar
pipeline was constructed using es → fr followed by fr → es. The process was batch-optimized using
PyTorch and executed on GPU for eficiency.</p>
        <p>Language detection was performed using the langdetect library to route each tweet to the appropriate
translation path. The architecture ensured that only tweets in English or Spanish were processed, and
unsupported languages or malformed entries were skipped or defaulted to their original version.</p>
        <p>To augment the training data, the entire training set was passed through this pipeline, generating
one augmented instance for each original example. The resulting paraphrased tweets were then merged
with the original training set, efectively doubling its size. This process aimed to introduce lexical and
syntactic variation, improving the model’s ability to generalize across diverse formulations of sexist or
non-sexist content. It is important to establish that only the training set was augmented to prevent data
leakage or evaluation bias.</p>
        <p>This strategy aimed to increase linguistic diversity in the training data, particularly in expressions of
sexism that may vary subtly across phrasing. By exposing the model to semantically equivalent yet
lexically distinct examples, the goal was to improve robustness and reduce overfitting to specific lexical
patterns.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Results</title>
      <p>This section presents the evaluation framework and results obtained for the models developed in
this work, following the oficial guidelines of the EXIST 2025 lab. Our analysis focuses on the
subtasks addressed in this study: Task 1.1 (binary classification) and Task 1.2 (multi-class classification),
both applied to tweets. These tasks aim to detect sexist content and identify the author’s intention,
respectively.</p>
      <p>To ensure the consistency and fairness of the evaluation process, all models were assessed using the
oficial metric defined by the organizers of EXIST 2025: the Information Contrast Measure (ICM). This
measure, introduced by Amigó and Delgado (2022) [51], was specifically designed to address some of
the limitations of classical evaluation metrics in complex classification scenarios such as those involving
semantic hierarchies, label imbalance, or subjective categories, all of which are inherent to the problem
of sexism detection.</p>
      <p>ICM is grounded in information theory and builds upon the notion of Pointwise Mutual Information
(PMI), which quantifies the association between two discrete variables by measuring the divergence
of their co-occurrence from statistical independence. While PMI has traditionally been used in
unsupervised tasks, ICM adapts this concept to the supervised classification setting, where the goal is to
compare predicted labels with gold standard annotations.</p>
      <p>More formally, ICM quantifies the informativeness of a system’s predictions by contrasting the
information content conveyed by the predicted labels against a baseline of random predictions. In doing
so, it not only captures whether the prediction is correct, but also how informative and meaningful the
prediction is with respect to the underlying class distribution. This makes it particularly appropriate
for tasks where misclassifications may difer in severity depending on the semantic distance between
the predicted and true labels.</p>
      <p>Furthermore, the normalized version of ICM (ICM-Norm) scales the resulting value between 0 and
1, enhancing its interpretability and allowing fair comparisons across diferent systems. A score of 1
indicates optimal informativeness, whereas a score close to 0 reflects low informativeness, comparable
to a random or uninformative classifier.</p>
      <p>Given these properties, ICM ofers two key advantages over standard metrics such as accuracy or
F1-score:
1. It accounts for the semantic structure of the label space, penalizing severe misclassifications more
than minor ones.
2. It provides a quantitative estimate of informational value, rather than treating all correct
predictions equally.</p>
      <sec id="sec-3-1">
        <title>The evaluation of submitted runs supports two formats:</title>
        <p>• Hard-hard evaluation: both system outputs and gold annotations are converted into hard labels.</p>
        <p>This means that only one definitive label is assigned per instance. For example, in Task 1.1, a
tweet is either “YES” or “NO”. The hard label is derived from majority vote among annotators.
• Soft-soft evaluation: both the model predictions and the gold annotations are expressed as
probability distributions. This is especially relevant in Learning with Disagreement (LeWiDi)
scenarios, where annotators may disagree, and the ground truth is represented as a distribution
over labels.</p>
        <p>These metrics evaluate the similarity between the predicted and true label sets, accounting for
hierarchy, probability mass, and class severity. In addition, traditional metrics such as the F1-score are
reported for interpretability, although they are not optimal for the hierarchical structure of the task.</p>
        <p>Task 1.2 poses an additional challenge due to its hierarchical class structure. Predictions must
distinguish not only between "NO" and "YES", but also among the three sexist intentions: DIRECT,
REPORTED, and JUDGEMENTAL.</p>
        <p>ICM naturally accommodates this hierarchy by penalizing misclassifications between YES and NO
more strongly than between, for instance, REPORTED and DIRECT.</p>
        <p>
          In this project, both ICM and its normalized variant (ICM-Norm) were computed using the oficial
evaluation library PyEvALL, as recommended by the competition guidelines. These metrics serve as
the primary basis for comparing model performance across all experimental conditions [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>The gold labels were constructed using majority voting among annotators, and only tweets with a
clear majority were retained for training and evaluation. Tweets with ambiguous or tied votes were
excluded to ensure robust training signals.</p>
        <sec id="sec-3-1-1">
          <title>3.1. Results for Task 1.1 - Binary Classification</title>
          <p>All models in this section were trained and evaluated using the hard-hard evaluation mode, with ICM
as the oficial performance metric. The decision to use the hard-hard evaluation mode, rather than
soft-soft, was motivated by two key considerations. First, the hard-hard setting aligns more closely with
standard classification protocols, where the model is required to make a single, definitive prediction for
each instance, facilitating comparability with classical baselines. Second, adopting the soft-soft mode
would have required calibrating probabilistic thresholds across multiple classes and levels, introducing
additional complexity that was beyond the scope of this work. Future work may explore soft evaluation
modes to further investigate the impact of uncertainty in model predictions. Additionally, accuracy and
F1-score were computed to facilitate comparison with standard classification baselines.</p>
          <p>Below, we report the results for each family of models. It is also important to clarify that, although the
dataset contains tweets in both Spanish and English, this work addressed the task as a single multilingual
classification problem. The evaluation was conducted jointly across both languages, without reporting
results separately for each language and conducted on the oficial validation set of EXIST 2025, using
the provided gold labels and evaluation scripts. It is important to note that the final test set of the
challenge was not publicly available at the time of writing which explains the use of the validation set.
3.1.1. Traditional Machine Learning Approaches
As an initial benchmark for the binary classification task, we implemented six pipelines combining
traditional text classification algorithms with varying levels of text normalization. The models evaluated
were Multinomial Naive Bayes (NB) and Support Vector Machine (SVM), each trained with three
preprocessing variants: Basic, Lemmatized, and Stemmed (see Section 2.2). This setup allowed us to
explore the impact of linguistic normalization on model performance under controlled conditions.</p>
          <p>Table 5 summarizes the overall performance of each model using the oficial evaluation metrics for
the EXIST 2025 challenge over the validation set. The best performance was achieved by the SVM
model using stemmed input text, obtaining an ICM score of 0.3686 and an F1-score of 0.7986. This
configuration also produced the highest ICM-Norm value of 0.6845, indicating strong relevance in the
predictions relative to a random baseline.</p>
          <p>In general, SVM-based models outperformed Naive Bayes across all preprocessing strategies,
particularly in terms of F1-score. The stemmed and lemmatized versions showed consistent gains over
the basic variant, suggesting that reducing morphological variation improves model generalization.
Interestingly, lemmatization and stemming produced similar trends in ICM, although stemming slightly
outperformed lemmatization in all configurations.</p>
          <p>To better understand the model’s behavior across both classes ("YES" and "NO"), we also computed
the per-class F1-scores. Table 6 shows the breakdown of F1 performance for each class. While most
configurations achieved relatively balanced performance, the SVM-stemmed model again stood out with
F1 scores of 0.7798 for the sexist class and 0.7996 for the non-sexist class, highlighting its robustness
across both categories.</p>
          <p>Overall, these results confirm that traditional learning pipelines can achieve strong baseline
performance on the binary sexism detection task. Although more advanced transformer-based models will
probably outperform them in absolute terms, these models remain competitive in terms of eficiency,
interpretability, and implementation cost.
3.1.2. Monolingual Transformer Models
In this analysis, the monolingual setup was constructed by combining the predictions from two
independent models: one trained exclusively on Spanish tweets using the roberta-base-bne model, and another</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Majority Label (≥ 4 votes)</title>
      <p>Female Majority Vote
trained on English tweets using the bert-base-uncased model. These models, previously described in
Section 2.3, were trained monolingually, meaning that each model was both trained and evaluated solely
on data in its corresponding language. This unified evaluation allows assessing global performance
while maintaining language-specific training conditions.</p>
      <p>To evaluate the impact of diferent label construction strategies, two versions of this setup were
compared using the same classification pipeline:
1. The first version used majority voting among the six annotators to define the final binary label.</p>
      <p>Tweets with a tie (3 YES, 3 NO) were excluded.
2. The second version applied the female majority resolution, where tied votes were resolved based
on the majority opinion of the three female annotators.</p>
      <p>Table 7 presents the evaluation results for the majority voting version. The model achieved an ICM
score of 0.541, an ICM-Norm of 0.770, and an overall F1-score of 0.847. These values indicate strong
alignment with the hierarchical structure of the labels and high binary classification performance.</p>
      <p>In contrast, the table shows for the results obtained using the female-majority strategy, that the
ICM decreased to 0.482, the ICM-Norm to 0.741, and the overall F1-score dropped to 0.784. These
results suggest that, while the female-resolved strategy enables the inclusion of more data points by
avoiding tweet exclusion, it introduces more label noise or inconsistency with the gold standard used
in evaluation.</p>
      <p>When analyzing class-wise performance in Table 8, the superiority of the majority label strategy
remains evident. Both the "YES" and "NO" classes achieve higher F1-scores under this setting. It is worth
noting that in both versions, separate F1-scores were also computed for the YES and NO classes. The
results show a consistent trend: although both models perform well in identifying non-sexist content
(NO), their ability to detect sexist tweets (YES) is more sensitive to the label construction strategy.</p>
      <p>Overall, the monolingual setup using the majority-vote labels achieved the best balance between
structural consistency (ICM), normalized agreement (ICM-Norm), and binary performance (F1).
3.1.3. Multilingual Transformer Models
This section presents the results obtained by multilingual Transformer models in the binary classification
task. The models evaluated in this section correspond to the multilingual Transformer architectures
previously introduced in Section 2.4, where their theoretical foundations and implementation details
were described. All models were evaluated under both label aggregation strategies, as consistently
applied throughout the study.</p>
      <p>As observed in Table 9, all models show slightly better performance under the majority label setting
compared to the female-only voting strategy. The model mdeberta-v3-base stands out across both
configurations, achieving the highest ICM (0.515), ICM-Norm (0.757), and F1 (0.839) when using the
majority label. This suggests its robustness across linguistic variations and annotation schemes.
Xlmroberta-base also performs consistently well compared to the other models.</p>
      <p>Table 10 provides a more detailed view of the classification behavior for each class. The scores for
F1_NO tend to be slightly higher, indicating that detecting non-sexist content is somewhat easier across
all models. Still, models like mdeberta-v3-base and xlm-roberta-base achieve strong performance in both
classes, maintaining F1_YES scores above 0.83 under the majority label setup.</p>
      <p>In contrast, when trained with female-only voting labels, all models experienced a slight drop in
both F1_YES and F1_NO. This confirms the trend already observed in monolingual models, where
the female-only labeling introduces greater variability or subjectivity, potentially afecting model
generalization.</p>
      <p>Overall, multilingual models proved highly competitive, particularly mdeberta-v3-base, which
outperformed others across all metrics. These findings validate the usefulness of multilingual pretraining and
highlight the efectiveness of cross-lingual encoders in capturing complex societal constructs such as
sexism.</p>
      <p>Compared to the monolingual Transformer models, the best multilingual models achieved slightly
lower ICM scores but remained close in overall F1 performance, especially under the majority label
setting. In contrast, all multilingual Transformers clearly outperformed the traditional machine learning
models (SVM and Naive Bayes), both in terms of structural alignment and class-wise F1-scores. This
reinforces the advantage of deep multilingual architectures in complex classification tasks like sexism
detection.
3.1.4. Ensemble Methods
To further improve classification performance, ensemble techniques were explored by combining
the outputs of the five multilingual Transformer models introduced in section 2.4. This ensemble of
Transformer-based models was designed to leverage the diversity in architectures and training dynamics
to enhance robustness and generalization. Two strategies were implemented: majority voting and
weighted voting based on ICM.</p>
      <p>The following tables show the results of the ensemble under both label aggregation settings. Notably,
the majority label configuration yielded slightly better scores overall. Among the two ensemble
strategies, the majority voting performed best, with marginal diferences compared to the other,</p>
      <p>As shown in Table 12, all ensemble strategies yielded highly similar results under both label settings,
indicating robustness across aggregation methods. Under the majority label, the best performance was
achieved using majority voting, with an ICM of 0.525 and F1 of 0.842, marginally outperforming the
weighted alternative. Similarly, for the female-vote strategy, majority voting again achieved the highest
F1 score (0.799), though the diferences across methods were minimal.</p>
      <p>In Table 13, we observe that ensemble methods maintain a strong balance between classes, with
slightly higher scores for the "NO" class across all configurations. This trend holds for both label
aggregation strategies, with F1_NO exceeding F1_YES by approximately 1%–2% in most cases.</p>
      <p>When using the majority label strategy, ensemble models achieved F1_YES scores above 0.836 and
F1_NO values around 0.846, indicating excellent sensitivity to both classes. The female-vote variant
showed a small drop in performance, with F1_YES ranging from 0.785 to 0.787 and F1_NO remaining
above 0.811. These results are still strong, but the wider gap suggests that detecting sexist tweets
becomes slightly more challenging under the female-only label aggregation.</p>
      <p>Overall, ensemble transformers not only improved macro performance (as seen in the previous
section), but also ensured stable and high-quality predictions across both classes. This consistency
reinforces their reliability, especially in applications where balanced classification between positive and
negative cases is essential.
3.1.5. Hybrid CNN-BERT Architectures
The hybrid CNN-BERT architectures evaluated in this section aim to combine the contextual encoding
capacity of pretrained Transformer models with the local feature extraction ability of Convolutional
Neural Networks (CNNs). The goal was to assess whether adding a CNN classification head on top of
static token embeddings would improve the model’s performance.</p>
      <p>All models in this section were trained using the majority label strategy, which had shown consistently
better results in previous experiments. The architecture used was based on extracting token embeddings
from the pretrained BERT-based encoder (without fine-tuning) and feeding them into a CNN classifier
trained on top. The convolutional block included filters of multiple sizes to capture diferent n-gram
patterns, followed by max pooling, dropout, and a fully connected layer for classification. The tokenizers
corresponding to each encoder were used during preprocessing.</p>
      <p>The following table summarizes the results obtained using this hybrid architecture:
Among the four models tested, the best performing configuration was CNN +
bert-base-multilingualcased, achieving the highest ICM (0.291), ICM-Norm (0.646), and F1 (0.762). This suggests that
multilingual pretraining, even without encoder fine-tuning, provides stronger token-level representations
when paired with CNN classifiers.</p>
      <p>To better understand class-level behavior, Table 15 reports the F1 scores for each class (YES and NO).</p>
      <p>In terms of class-specific performance, all models exhibit slightly higher F1 scores for the non-sexist
class. This is consistent with prior experiments and reflects the tendency of classifiers to favor dominant
or less ambiguous categories. The gap between F1_YES and F1_NO is most pronounced in the RoBERTa
configuration, potentially indicating limitations in capturing the subtle linguistic indicators of sexist
intent when representations are not fine-tuned.</p>
      <p>Overall, while hybrid CNN-BERT models are outperformed by fully fine-tuned Transformer models,
their performance is competitive and demonstrates that learned static embeddings can be successfully
reused with simpler architectures when training eficiency or architectural modularity is a priority.
3.1.6. LLM Prompting
This section evaluates the performance of large language models (LLMs) using prompt-based learning
strategies, specifically zero-shot and few-shot prompting. Due to the computational cost associated
with running inference over the entire dataset, the experiments in this section were conducted on a
representative sample of the data. In the event that one of the LLM approaches had proven superior, a
full-scale evaluation would have followed. However, as discussed below, this was not the case.</p>
      <p>In the zero-shot setting, the model is given a task description, such as “Is this tweet sexist?” and
candidate labels (“YES” / “NO”) without having seen any task-specific examples. The evaluation includes
ifve diferent models:</p>
      <sec id="sec-4-1">
        <title>1. vicgalle/xlm-roberta-large-xnli-anli</title>
        <p>2. MoritzLaurer/mDeBERTa-v3-base-mnli-xnli
3. joeddav/xlm-roberta-large-xnli
4. cross-encoder/nli-deberta-v3-large
5. google/flan-t5-large</p>
        <p>All models except the last one were evaluated using the Hugging Face zero-shot-classification pipeline.
These models are optimized for natural language inference (NLI) and treat classification as a
premisehypothesis matching task. In contrast, flan-t5-large is a generative model that processes prompts as
free-form text, which also enables few-shot learning.</p>
        <p>The results for zero-shot prompting are summarized in Table 16.</p>
        <p>As shown, none of the evaluated zero-shot models outperform traditional fine-tuned classifiers.
Among them, vicgalle/xlm-roberta-large-xnli-anli yields the best results with an F1-score of 0.578 and
an ICM score of -0.128, although still significantly lower than supervised approaches. Surprisingly,
flant5-large performs poorly in the zero-shot setting, possibly due to its generative nature and sensitivity to
prompt formulation.</p>
        <p>Although all models were tested in a zero-shot setting, few-shot prompting was only applicable to
google/flan-t5-large , which supports textual prompts with in-context examples. Other models, designed
for structured classification via NLI pipelines, do not support this mode of inference.</p>
        <p>In this experiment, we designed a multilingual instructional prompt including five curated examples,
three for the "YES" class and two for the "NO" class. The examples were chosen to represent various
expressions of sexism and non-sexism in both English and Spanish.</p>
        <p>Despite the richness and balanced nature of the prompt, performance did not improve over the
zero-shot variant. As shown in Table 17 all key metrics were extremely low.</p>
        <p>Per-class F1 scores, presented in Table 18, further confirm that the model failed to distinguish between
the two classes, barely exceeding random guessing.</p>
        <p>These results suggest that, at least under this setup, the model was unable to efectively leverage
incontext examples. This may be due to the limited size of the prompt, diferences in domain between the
examples and the test data, or intrinsic limitations in adapting generative LLMs to binary classification
without additional fine-tuning or calibration strategies.
3.1.7. Retrieval-Augmented Classification (RAC)
This section presents the performance of the RAC-enhanced Transformer models described in Section
2.8 . Each model was trained using the majority label aggregation strategy and evaluated using the ICM
metric suite and per-class F1 scores. Both multilingual and monolingual variants were tested, and the
retrieval mechanism was implemented using a Sentence-BERT encoder with top-3 neighbor retrieval
from the training corpus.</p>
        <p>As shown in Table 19, integrating retrieved context via RAC did not yield consistent improvements
over the baseline models. Both XLM-RoBERTa and mDeBERTa-v3 show reduced ICM and ICM-Norm
scores compared to their vanilla versions, although their overall F1-scores remain relatively strong. The
monolingual ensemble, however, experienced a considerable drop in performance across all metrics,
suggesting that the injected context may have introduced noise rather than helpful information.</p>
        <p>In Table 20, we observe that mDeBERTa-v3 achieves the highest F1_NO score (0.838772), while
XLMRoBERTa shows more balanced performance across both classes. On the other hand, the monolingual
RAC model drastically underperforms in F1_YES, indicating a reduced ability to identify sexist tweets
once the context is injected. This highlights that multilingual models handle the additional contextual
input more efectively than monolingual architectures.</p>
        <p>When comparing multilingual models, mdeberta-v3-base shows the strongest performance under
the RAC setup, consistent with its competitive results in the standard fine-tuning setting. However,
xlm-roberta-base struggles to benefit from context injection, indicating potential limitations in how
it integrates cross-sentence information.</p>
        <p>To conclude, the Retrieval-Augmented Classification setup does not consistently outperform the base
ifne-tuned models. These findings suggest that, while promising in theory, RAC requires more careful
retrieval and filtering mechanisms to deliver measurable gains in this task.
3.1.8. Error Analysis
Table 21 presents a comparative summary of all models evaluated on Task 1.1 using ICM. Among all
individual models, the best overall performance was achieved by the monolingual model trained with
majority label aggregation, reaching an ICM of 0.5414. In the multilingual category, ensemble methods
also improved performance, with the best ensemble (majority voting with majority labels) achieving an
ICM of 0.5305.</p>
        <p>Before examining the specific types of errors made by individual models, it is essential to explore
how and why even high-performing systems can fail in nuanced classification scenarios. Error analysis
provides qualitative insights that complement aggregate metrics like ICM or F1-score. By identifying
systematic misclassifications, especially between semantically close classes or in the presence of
linguistic ambiguity, we can better understand model behavior and uncover areas where further improvements
or adjustments may be required.</p>
        <p>To obtain a richer understanding of the model’s limitations and decision patterns, we perform an error
analysis focused on the final predictions. This analysis is conducted exclusively on the best-performing
model for Task 1.1, which is the monolingual approach using majority-based label construction.</p>
        <p>We begin by analyzing the overall distribution of prediction outcomes. A confusion matrix reveals
that the model maintains a relatively balanced precision and recall across the "YES" and "NO" classes,
though it makes a comparable number of false positives and false negatives (Figure 4). To better
understand the nature of these errors, we label each prediction as either correct, a false positive, or a
false negative, which serves as the foundation for the following analyses.</p>
        <p>Table 22 shows the classification performance of the model evaluated. The results indicate that
the model achieves balanced precision and recall across both classes, with a slight advantage in the
detection of the "NO" class. The overall F1-scores are comparable, and the global accuracy reaches
84.8%, confirming the robustness of the system across categories</p>
        <p>An initial qualitative analysis is provided using word clouds generated from the tweets that were
bert-base-multilingual-cased (Majority)
distilbert-base-multilingual-cased (Majority)
mdeberta-v3-base (Majority)
roberta-base (Majority)
xlm-roberta-base (Majority)
bert-base-multilingual-cased (Female)
distilbert-base-multilingual-cased (Female)
mdeberta-v3-base (Female)
roberta-base (Female)
xlm-roberta-base (Female)
CNN + bert-base-uncased
CNN + bert-base-multilingual-cased
CNN + roberta-base
CNN + distilbert-base-uncased
vicgalle/xlm-roberta-large-xnli-anli
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli
joeddav/xlm-roberta-large-xnli
cross-encoder/nli-deberta-v3-large
google/flan-t5-large (zero-shot)
Majority Voting (Majority Label)
Weighted ICM (Majority Label)
Majority Voting (Female Vote)
Weighted ICM (Female Vote)
XLM-RoBERTa (RAC)
mDeBERTa-v3 (RAC)
Monolingual Ensemble (RAC)</p>
        <p>ICM
misclassified. Separate visualizations were created for false positives and false negatives to highlight
recurring terms associated with each error type (Figures 5a and 5b).</p>
        <p>In both error categories, terms related to gender and identity, such as women, men, género, and persona,
appear prominently, indicating that the model often struggles with correctly interpreting tweets that
revolve around gender issues. This could be attributed to semantic overlap between neutral statements
and those expressing subtle forms of sexism, making the classification task inherently challenging.</p>
        <p>In the case of false positives, the word cloud (Figure 5a) shows a notable presence of words like women,
men, abuso, género, and educación, which often occur in educational or awareness-raising contexts. This
suggests that the model may be overly sensitive to these terms, flagging non-sexist content as sexist
due to keyword presence without fully grasping the intent or tone.</p>
        <p>Conversely, the false negatives word cloud (Figure 5b) includes emotionally charged or socially critical
terms such as violencia, padre, culpa, and money. This implies that in some cases, tweets containing
NO
YES
actual sexist content may go undetected if the language is less explicit or uses indirect expressions.</p>
        <p>Overall, this visualization highlights the limitations of surface-level keyword-based associations and
underlines the need for better contextual reasoning in detecting subtle or implicit forms of sexism.</p>
        <p>We also examine whether tweet length is associated with classification performance. Figure 6 shows
a boxplot comparing tweet length (in number of characters) across error types. While the overall
distributions are similar, correct predictions tend to have slightly longer tweets on average (median:
173), compared to false positives (median: 162) and false negatives (median: 163). This suggests that
longer tweets may provide more contextual clues that help the model make correct decisions.</p>
        <p>To explore whether sentiment influences the model’s behavior, we compute the sentiment polarity of
each tweet using the TextBlob library.
—Insistir en que los genitales determinan el género. Puede ser microagresión por falta de
educación, puede ser transfobia pura y dura. —Relacionar determinados diagnósticos con el
colectivo LGTBIQA+. Me ahorro el comentario.
@Dialogo_es @FuerzasMilCol si asesinar mujeres embarazadas, bombardear niños y sacar
los ojos a estudiantes, pero con los actores armados si son cagado no a otro perro con ese
hueso
Calling A Man Bald Is Sexual Harassment" https://t.co/MiSGkB89bs via @YouTube
the question is do i wear the very short skirt that literally shows my ass with kneehighs or
opt out for a short
NO
NO
NO
NO</p>
        <p>Pred
YES
YES
YES
YES</p>
        <p>As shown in Figure 7, tweets misclassified by the model tend to display a slightly higher variance in
sentiment polarity compared to correctly classified instances. This suggests that emotionally charged
tweets may pose additional challenges for the classifier, possibly due to the presence of sarcasm, indirect
language, or nuanced judgment.</p>
        <p>We also examine the distribution of errors across languages (Figure 8). Since the monolingual strategy
involved using a dedicated Spanish model for tweets in Spanish and a separate English model for tweets
in English, this visualization helps compare their behavior. The number of correct predictions is slightly
higher in Spanish, but both languages show a similar and balanced distribution of false positives and
false negatives. This indicates that both monolingual models performed comparably and that no major
disparity in misclassification patterns emerged between languages.</p>
        <p>Lastly, to complement the quantitative findings, we conduct a qualitative inspection of misclassified
examples, with special attention to challenging cases. We reviewed a selection of tweets from both false
positives and false negatives, focusing on instances that involve irony, implicit bias, or subtle linguistic
cues. These examples help illustrate the types of content that are particularly dificult for the model to
classify correctly and provide insight into the nuanced nature of sexist discourse on social media.</p>
        <p>Together, these findings provide valuable insight into how the model succeeds and where it fails and
highlights the importance of complementing quantitative evaluation with detailed qualitative analysis
con q motivación voy a entrenar ahora si ya no veo a ninguna de las chicas
@Mistywoman1 Have I missed something, or has phallocentrism become the latest cult?
#Women should focus on money freedom, and not retirement! Today’s women may still
think about #retirement as "not having to work."
True</p>
        <p>Pred
YES
YES
YES</p>
        <p>NO
NO
NO
in sensitive classification tasks.</p>
        <sec id="sec-4-1-1">
          <title>3.2. Results for Task 1.2 - Multi-class</title>
          <p>This section presents the evaluation results for Task 1.2 of the EXIST 2025 challenge, which focuses
on identifying the type of sexist content expressed in tweets. Unlike Task 1.1, which involved binary
classification, this task adopts a more granular and hierarchical structure, requiring models to distinguish
between four possible labels: NO, DIRECT, REPORTED, and JUDGEMENTAL.</p>
          <p>As in the binary setting, all models were trained and evaluated using the oficial validation set provided
by the organizers, following the hard-hard evaluation protocol. The label space exhibits a hierarchical
structure, with DIRECT, REPORTED, and JUDGEMENTAL being subclasses of the broader category YES,
while NO denotes the absence of sexist intent. This hierarchy introduces additional complexity, as
misclassifications across the YES–NO boundary are considered more severe than misclassifications
among YES subtypes.</p>
          <p>All experiments in this section were assessed using the Information Contrast Measure (ICM) and
its normalized version (ICM-Norm), which remain the oficial metrics of the competition. These were
Monolingual - (binary+multi, Female Voting) 0.112682
Monolingual - (binary+multi, Majority Voting) -0.746904
Monolingual - (4-label, Majority Voting) 0.053612
complemented by standard classification metrics such as macro-averaged F1-score and per-class F1
scores to support interpretability and comparative analysis. The labels for Task 1.2 were constructed
using the same procedure as in Task 1.1</p>
          <p>Moreover, two training strategies were adopted to handle the hierarchical nature of Task 1.2:
1. In the first strategy, Binary + Multi-Class Pipeline, the classification process was split into two
stages: a binary classifier was first applied to determine whether a tweet was sexist ( YES) or not
(NO), based on models previously described for Task 1.1. Only tweets predicted as sexist were
then passed through a second classifier trained exclusively to distinguish among the three sexism
subcategories: DIRECT, REPORTED, and JUDGEMENTAL. The training data for this second model
was filtered accordingly to include only tweets labeled with these three classes.
2. The second strategy, Single Multi-Class Model (4 Labels), followed a unified approach, where a
single model was trained to classify among all four possible classes simultaneously: NO, DIRECT,
REPORTED, and JUDGEMENTAL. This setup allowed for direct inference in a single step, without
the need for a preceding binary decision. Both strategies were evaluated using the same oficial
metrics and validation set to enable fair comparison.</p>
          <p>The remainder of this section is organized by model family, beginning with monolingual transformer
models and progressing toward multilingual, hybrid, LLM prompting and ensemble-based systems.
Each subsection reports quantitative results and qualitative observations based on the model’s behavior
and its ability to discriminate among nuanced categories of sexist expression. Retrieval-augmented
classification (RAC) approaches were not explored in Task 1.2, as their performance in Task 1.1 did not
yield notable improvements compared to standard transformer-based models.
3.2.1. Monolingual Transformer Models
This section presents the results obtained by monolingual Transformer-based models in Task 1.2. We
evaluated two pretrained models: bert-base-uncased, a general-purpose English BERT model, and
roberta-base-bne, a RoBERTa variant trained on a large Spanish corpus. Both were fine-tuned for the
multiclass classification task using the diferent strategies described earlier.</p>
          <p>The following table reports the global metrics for each configuration:</p>
          <p>Among the three configurations, the best results were achieved by the first configuration model
trained using the binary-to-multiclass pipeline with labels derived from female annotators. It reached
the highest ICM (0.113) and macro F1 (0.495), suggesting that this configuration captures sexist nuances
more efectively than those based on majority voting.</p>
          <p>The unified 4-label model trained with bert-base-uncased performed slightly better than the second
configuration but still lagged behind the binary-to-multiclass pipeline using female-vote labels. Although
this single-step approach simplifies inference by avoiding the need for a cascade, its performance
remained modest, especially in capturing nuanced sexist categories.</p>
          <p>In contrast, the model trained on majority vote labels via the binary+multi pipeline underperformed
across all metrics, indicating that this configuration may introduce label noise or be less robust to the
subtleties of sexism classification.</p>
          <p>The next table shows the per-class F1 scores to analyze where the diferences in performance arise:
bert-base-uncased (binary+multi, Female Voting)
roberta-base-bne (binary+multi, Majority Voting)
bert-base-uncased (4-label, Majority Voting)</p>
          <p>F1_NO</p>
          <p>From the per-class analysis, we observe that all models are most accurate when predicting the NO
class. The unified 4-label model obtains the highest F1 for NO (0.87), reflecting a bias toward the majority
class. However, its performance on minority categories like JUDGEMENTAL remains weak (0.16).</p>
          <p>On the other hand, the model trained with female-only labels (binary+multi) achieves more balanced
results across all classes, especially for REPORTED (0.36) and JUDGEMENTAL (0.26), suggesting better
sensitivity to minority and potentially ambiguous instances.
3.2.2. Multilingual Transformer Models
To evaluate the multilingual capabilities of transformer-based systems on Task 1.2, we explored a diverse
set of 25 model combinations, covering various architectures (e.g., bert-base-multilingual-cased,
xlm-roberta-base, mdeberta-v3-base) and training configurations. These include the
binary+multi-class pipeline, under the female-majority and majority-vote label strategies, and unified
4-label classification.</p>
          <p>Given the large number of experiments (55), Table 27 highlights the top five multilingual
configurations based on ICM for each configuration, which reflects the system’s ability to preserve label
hierarchy and semantic alignment.</p>
          <p>Among the top-performing setups, the binary-to-multiclass pipeline using majority-vote labels
achieved the highest overall ICM, with the best configuration combining mdeberta-v3-base for
binary classification and bert-base-multilingual-cased for multiclassification (ICM = 0.1982).
Models trained using female-vote labels demonstrated more consistent and stable ICM values across
the top-5, albeit slightly lower in absolute terms. This may point to a trade-of between label specificity
and sample diversity.</p>
          <p>Lastly, unified 4-label configurations delivered moderate results, highlighting the challenges of
capturing class granularity without hierarchical decomposition. Nonetheless, all evaluated configurations
outperform the baseline systems provided by the organizers, demonstrating the efectiveness of both
the training strategies and the selected Transformer architectures.
3.2.3. Hybrid CNN-BERT Architectures
This subsection presents the results obtained with hybrid models that combine convolutional neural
networks (CNNs) with BERT-based embeddings for the classification task. These models use pre-trained
BERT variants to encode the input tweets, followed by a CNN architecture designed to capture local
n-gram patterns from the contextualized token embeddings.</p>
          <p>Table 28 shows the performance metrics for these models. Overall, the results indicate that CNN-BERT
hybrids underperform in comparison to full transformer-based architectures. The best performing setup,
CNN + distilbert-base-uncased, reached an ICM of -0.222653 and macro-F1 of 0.286267. While
this configuration shows some ability to diferentiate among classes, its limited capacity to capture
deeper contextual relationships likely hinders its performance.</p>
          <p>Interestingly, all hybrid models still outperform the oficial majority and minority baselines provided
by the competition organizers, demonstrating that even simplified architectures can retain value when
leveraging pre-trained embeddings.
mDeBERTa-v3 + BERT-multilingual
mDeBERTa-v3 + DistilBERT-multilingual
XLM-RoBERTa + BERT-multilingual
XLM-RoBERTa + DistilBERT-multilingual
mDeBERTa-v3 + XLM-RoBERTa
mDeBERTa-v3 + BERT-multilingual
mDeBERTa-v3 + DistilBERT-multilingual
mDeBERTa-v3 + mDeBERTa-v3
BERT-multilingual + BERT-multilingual
RoBERTa-base + BERT-multilingual
mDeBERTa-v3
XLM-RoBERTa
RoBERTa-base
BERT-multilingual
DistilBERT-multilingual
Majority baseline
Minority baseline</p>
          <p>Binary + Multiclass (Female Voting)
Binary + Multiclass (Majority Voting)
Single Multi-Class Model (4 Labels)</p>
          <p>Baselines
CNN + distilbert-base-uncased -0.222653
CNN + roberta-base -0.256391
CNN + bert-base-uncased -0.246319
CNN + bert-base-multilingual-cased -0.283758</p>
          <p>ICM</p>
          <p>ICM-Norm
3.2.4. LLM Prompting
This section presents the results obtained by LLMs under the zero-shot and few-shot prompting
paradigms. As discussed in the methodology, the models were queried using task-specific instructions
without any fine-tuning on the EXIST 2025 data. The goal was to evaluate their ability to handle the
multi-class classification task of sexist intent detection, where predictions must distinguish between
four categories: NO, DIRECT, REPORTED, and JUDGEMENTAL.</p>
          <p>For the zero-shot experiments, five models were evaluated:</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>1. vicgalle/xlm-roberta-large-xnli-anli 2. MoritzLaurer/mDeBERTa-v3-base-mnli-xnli 3. joeddav/xlm-roberta-large-xnli</title>
        <p>-1.023420
-2.908090
In addition, the flan-t5-large model was also used in a few-shot configuration, leveraging manually
constructed prompts with representative examples of each class.</p>
        <p>All evaluations were performed on a representative sample of the validation set, and the results are
summarized in the table below.</p>
        <p>The results reveal that none of the prompting-based LLMs reached competitive performance on Task
1.2. All zero-shot models obtained negative ICM scores, with only marginal F1 scores, especially in
the more nuanced categories. While some models showed better performance on the NO class, they
struggled to correctly identify the sexist subcategories such as REPORTED or JUDGEMENTAL.</p>
        <p>The few-shot configuration with flan-t5-large also failed to deliver meaningful improvements, despite
the inclusion of manually curated examples. This confirms that Task 1.2, due to its fine-grained semantic
structure and hierarchical label space, poses a significant challenge for prompting-only methods.</p>
        <p>These findings support the broader observation made in Task 1.1: instruction-following models are
still limited in their ability to perform complex social classification tasks without in-domain training
or adaptation. While prompting remains a flexible and cost-efective approach, it does not currently
outperform supervised learning in nuanced classification problems like sexist intent detection.
3.2.5. Ensemble Methods
To assess whether combining predictions from multiple multilingual models can further improve
performance, we evaluated ensemble methods based on majority voting. As in Task 1.1, we tested two
configurations: (i) plain majority vote, and (ii) weighted vote based on ICM.</p>
        <p>Although ensemble methods yielded a slight performance boost in Task 1.1, this trend did not hold
in Task 1.2. Despite improving over the competition baselines, ensemble models underperformed
compared to the best individual models. This indicates that, in the context of multiclass classification,
aggregation through majority or weighted voting can dilute the strengths of specialized models, leading
to lower ICM scores than those achieved by carefully tuned single-model configurations.
Binary+Multi (Female Voting)
Binary+Multi (Majority Voting)
4-label (Majority Voting)
mDeBERTa-v3 + BERT (Female)
mDeBERTa-v3 + DistilBERT (Female)
XLM-RoBERTa + BERT (Female)
XLM-RoBERTa + DistilBERT (Female)
mDeBERTa-v3 + XLM-RoBERTa (Female)
mDeBERTa-v3 + BERT (Majority)
mDeBERTa-v3 + DistilBERT (Majority)
BERT + BERT (Majority)
RoBERTa-base + BERT (Majority)
mDeBERTa-v3 (4-label)
CNN + distilbert-base-uncased
CNN + roberta-base
CNN + bert-base-uncased
CNN + bert-base-multilingual-cased
vicgalle/xlm-roberta-large-xnli-anli
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli
joeddav/xlm-roberta-large-xnli
cross-encoder/nli-deberta-v3-large
google/flan-t5-large (zero-shot)
Majority Voting (Majority Label)
Weighted ICM (Majority Label)
Majority Voting (Female Vote)
Weighted ICM (Female Vote)
-1.5708</p>
        <p>flan-t5-large (few-shot)
3.2.6. Error Analysis
Table 31 summarizes the performance of all model families evaluated in Task 1.2 using the Information
Contrast Measure, the oficial metric of the EXIST 2025 challenge. This task required multi-class
classification of sexist intent, including fine-grained distinctions between types of sexist expression. The
configurations encompass monolingual and multilingual transformers, hybrid CNN-BERT architectures,
zero- and few-shot LLM prompting, as well as ensemble strategies. For multilingual setups, we evaluated
both cascaded pipelines (binary followed by multi-class) and unified 4-label models.</p>
        <p>To complement the quantitative evaluation presented above, we conducted a detailed error analysis
focused on the best-performing model in Task 1.2. The objective of this analysis is to better understand
the limitations and failure modes of the system, particularly in distinguishing between the nuanced
classes of sexist content defined in the EXIST 2025 framework.</p>
        <p>The selected model for this study is the multilingual pipeline composed of mDeBERTa-v3 followed by
BERT-multilingual, trained using majority-vote labels. This configuration achieved the highest ICM
among all tested systems in Task 1.2 and serves as a strong candidate for qualitative inspection.</p>
        <p>This analysis aims to uncover potential sources of bias or ambiguity that challenge the model’s
decision-making process, and to inform future improvements in handling complex, real-world instances
of sexist expression.</p>
        <p>Figure 9 shows the confusion matrix over the validation set. Most predictions for the NO class are
accurate, with 409 correct predictions. However, the model frequently confuses NO with DIRECT (46
instances) and, to a lesser extent, with REPORTED and JUDGEMENTAL. Among the sexist categories, the</p>
        <p>NO</p>
        <p>DIRECT</p>
        <p>REPORTED
JUDGEMENTAL</p>
        <p>DIRECT</p>
        <p>NO
DIRECT
DIRECT
46
32
29
28
most common misclassification is between REPORTED and DIRECT (29 cases), which are semantically
closer and often share overlapping lexical patterns.</p>
        <p>To further analyze these errors, Table 32 summarizes the most frequent misclassification pairs.</p>
        <p>These patterns suggest that the model tends to favor more explicit categories like DIRECT over more
contextual ones such as JUDGEMENTAL or REPORTED, especially when lacking overt sexist keywords.</p>
        <p>To explore whether the model’s errors correlate with the length of the input, we compared the
character length of tweets across true labels (Figure 10). We observed that tweets labeled as REPORTED
and JUDGEMENTAL tend to be longer on average than NO and DIRECT, possibly indicating that more
complex or narrative-like expressions are harder to classify correctly.</p>
        <p>To identify lexical patterns that may contribute to prediction biases, we generated word clouds for
tweets predicted as each of the four classes, as shown in Figures 11a–11d.</p>
        <p>These visualizations confirm the presence of overlapping terms across classes. For example, “mujer”,
“mujeres”, “men”, “women”, and “feminismo” appear across all sexist categories, making fine-grained
distinction particularly challenging.</p>
        <p>We also examined the sentiment polarity of tweets predicted as each class using TextBlob. Figure 12
shows the distribution. Tweets predicted as NO tend to have slightly more positive sentiment, while
other categories remain centered near neutrality. However, the presence of negative sentiment in</p>
        <p>(a) Predicted as NO
(b) Predicted as REPORTED
(c) Predicted as JUDGEMENTAL
(d) Predicted as DIRECT</p>
        <p>JUDGEMENTAL and REPORTED tweets may reflect the emotional tone of feminist critique, which the
model may misinterpret.</p>
        <p>To complement the quantitative results, Table 33 shows a selection of representative misclassified
tweets. These examples illustrate typical edge cases, such as subtle irony, vague references to gender, or
complex constructions that defy straightforward labeling.</p>
        <p>Overall, this analysis reveals that most misclassifications stem from subtle semantic diferences
@JetCelestial @polgara28951124 @CharlotteEmmaUK I don’t know what you’re on
about? My point was that you shouldn’t mock other people’s looks, because it’s very
shallow and makes you look like a cunt. Thanks for your input though
@joelaltonmoore @KMisGrand @TheMrBarramundi @ZipPulse A vasectomy is a
simple surgery done in an ofice, hospital, or clinic. 24 hours later recovery is pretty
much complete! Tubal ligation recovery takes 1-3 weeks, longer if it’s done following
a C-section or childbirth. Many women can’t aford to take 1-3 weeks of work!
@tuckednuts Notice something else tho. . . Almost every single one, is a man. That’s
such a fucking problem that this is how some men believe they should let out their
angers and hatred. This society needs to do better, it does not even benefit the men.</p>
        <p>Patriarchy/misogyny fails everyone
@TheNon_Nun Never knew you all share tits, how did we stop being pals again yeah
?
made this top and my mom gave me the “you look like a whore” expression anyways
here’s this shirt i braided myself https://t.co/0j2mMv47wl</p>
        <p>DIRECT</p>
        <p>REPORTED
JUDGEMENTAL</p>
        <p>NO
DIRECT</p>
        <p>DIRECT
That’s why they always tell you..you will never understand women. JUDGEMENTAL
https://t.co/Q6KaaNVYRL
between classes, the use of implicit or ironic language, and overlapping vocabulary across categories.
While the model performs well at detecting clear sexist content (DIRECT), it still struggles with nuanced
discourse and context-dependent expressions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusions and Future Work</title>
      <sec id="sec-5-1">
        <title>4.1. Key Findings</title>
        <p>This paper addressed the problem of automatic sexism detection in social media through participation
in the EXIST 2025 shared task. The work explored two complementary tasks: the binary classification
of sexist content (Task 1.1) and the multi-class classification of the communicative intent behind sexist
expressions (Task 1.2). A wide range of models were implemented and compared, from traditional
machine learning pipelines to advanced Transformer-based architectures, hybrid CNN-BERT structures,
System
Mario_1
CIMAT-GTO_2
CIMAT-GTO_3
UC3M-LI</p>
        <p>NYCU-NLP_1
System
Mario_1
CIMAT-GTO_3
CIMAT-GTO_2
UC3M-LI
ABCD Team_1
2025
2025
2025
2025
2024
2025
2025
2025
2025
2024
Edition ICM-Hard Norm</p>
        <p>F1-YES
Edition ICM-Hard Norm</p>
        <p>Macro F1
prompting with large language models, ensemble techniques, and retrieval-augmented classification.</p>
        <p>Among the key findings:
• The best overall performance in Task 1.1 was achieved by a monolingual Transformer-based
approach combining RoBERTa-base for Spanish and BERT-base for English, reaching an ICM of
0.541 and an F1-score of 0.847.
• In Task 1.2, the highest performance was obtained by a two-step multilingual pipeline combining
mDeBERTa-v3 for binary classification and multilingual BERT for the stance classification, with
an ICM of 0.198 and F1-score of 0.564.
• Ensemble methods provided moderate gains in Task 1.1 but proved less efective in Task 1.2,
where the hierarchical label structure introduced noise into majority-vote aggregation.
• Prompting-based approaches, both zero-shot and few-shot, were consistently outperformed by
ifne-tuned supervised models, confirming that complex socio-linguistic tasks like sexism detection
still benefit from task-specific adaptation.
• Label construction strategies had a significant impact on performance. In particular, resolving
ties through majority voting among female annotators improved the quality of the training data,
especially in Task 1.2.</p>
        <p>Additional techniques such as data augmentation and the use of hybrid CNN-BERT models contributed
to robustness but did not surpass the performance of fine-tuned Transformers.</p>
        <p>
          In addition to the internal experimental results, the oficial results for the EXIST 2025 shared task
have been published and are available on the competition’s oficial website [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>In Task 1.1, the top three performing systems were Mario_1 (ICM-Hard Norm: 0.8405, F1-YES: 0.8167),
CIMAT-GTO_2 (ICM-Hard Norm: 0.8165, F1-YES: 0.7996), and CIMAT-GTO_3 (ICM-Hard Norm: 0.8144,
F1-YES: 0.7968). Our team (UC3M-LI) achieved an ICM-Hard Norm of 0.7581 and an F1-YES of 0.7613,
positioning it competitively but behind the leading teams. For reference, the best system in the 2024
edition was NYCU-NLP_1, which achieved an ICM-Hard Norm of 0.8002 and an F1-YES of 0.7944.</p>
        <p>In Task 1.2, the top three systems in 2025 were Mario_1 (ICM-Hard Norm: 0.6623, Macro F1: 0.5692),
CIMAT-GTO_3 (ICM-Hard Norm: 0.6521, Macro F1: 0.5555), and CIMAT-GTO_2 (ICM-Hard Norm:
0.6428, Macro F1: 0.5582). The UC3M-LI system obtained an ICM-Hard Norm of 0.5175 and a Macro F1
of 0.4583. For comparison, the best system in the 2024 edition for this task was ABCD Team_1, with an
ICM-Hard Norm of 0.6320 and a Macro F1 of 0.5677.</p>
        <p>These comparative results confirm that while our systems delivered robust performance in both tasks,
a measurable performance gap remains when compared to the top-ranked teams. This gap highlights
areas for future improvement, particularly in addressing the hierarchical complexity of Task 1.2 and
further optimizing the handling of ambiguous and nuanced cases.</p>
        <p>Overall, the project demonstrated that multilingual and monolingual Transformer models remain the
most reliable and efective tools for sexism detection in textual data. Moreover, incorporating annotator
metadata in the labeling process provided both methodological insight and a practical advantage in
improving dataset quality.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Limitations</title>
        <p>Despite the strong performance of several models, this work presents some limitations:
• The dataset used for training and evaluation, while large, presents a moderate class imbalance,
particularly in Task 1.2, where the "DIRECT" category is notably more frequent than the others.
Additionally, the subjective nature of the task may introduce annotation ambiguity, which could
have constrained the performance ceiling.
• The reliance on hard label aggregation (majority vote or female vote) may oversimplify cases
where annotators genuinely disagree, potentially discarding valuable uncertainty information.
• The exploration of prompting methods was limited to a small sample due to computational
constraints, and no prompt tuning or calibration was applied.
• The use of retrieval-augmented classification (RAC) did not yield clear improvements, likely
due to the simplicity of the retrieval strategy (top-k similarity), which may not have provided
semantically helpful context.
• Finally, due to time and resource constraints, the generalization of models was not tested across
other sexism-related datasets, which limits conclusions about domain transfer.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Future Directions</title>
        <p>Building on the results and insights of this project, several directions can be considered for future
research:
• Explore label modeling techniques that go beyond hard aggregation, such as soft labels or
learningwith-disagreement (LwD), to better capture annotator uncertainty and subjectivity.
• Integrate demographic metadata (gender, age, education level) not only in labeling decisions but
also as input features to train fairer and more interpretable models.
• Investigate more advanced prompting techniques, including chain-of-thought prompting,
instruction tuning, or LLM calibration, and test models like GPT-4 or LLaMA 3 in full evaluations. In this
work, a preliminary attempt was made using the instruction-tuned model google/gemma-2b-it
[52]. However, the results were not satisfactory, and due to time limitations, no further tuning or
prompt optimization was explored. As a consequence, this approach was excluded from the main
methodology, although its initial evaluation is documented and may serve as a basis for future
experiments.
• Refine the retrieval component in RAC models by incorporating semantic filtering, diversity-aware
sampling, or using multi-hop context.
• Extend the framework to multimodal classification by including image-text memes and applying
joint embedding models such as CLIP or BLIP.
• Finally, although monolingual models were trained separately for Spanish and English, no
language-specific evaluation was reported. All systems, including multilingual ones, were assessed
on the entire dataset without distinguishing performance per language. Future work could address
this by analyzing potential diferences in model behavior or bias across languages.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Acknowledgments</title>
      <p>Grant PID2023-148577OB-C21 (Human-Centered AI: User-Driven Adapted Language Models-
HUMAN_AI) by MICIU/AEI/ 10.13039/501100011033 and by FEDER/UE.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used Chat-GPT-4 in order to: Grammar and spelling
check. After using these tool, the author reviewed and edited the content as needed and take full
responsibility for the publication’s content.
[15] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge
University Press, 2008.
[16] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning (1995). doi:10.1007/</p>
      <p>BF00994018.
[17] A. McCallum, K. Nigam, A comparison of event models for naive bayes text classification, in:</p>
      <p>AAAI-98 workshop on learning for text categorization, 1998.
[18] T. Joachims, Text categorization with support vector machines: Learning with many relevant
features, in: European conference on machine learning, Springer, 1998.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,</p>
      <p>Attention is all you need, in: Advances in Neural Information Processing Systems (NeurIPS), 2017.
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019. URL:
https://aclanthology.org/N19-1423.
[21] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised
multitask learners, OpenAI Blog (2019).
[22] G. A. L. Team, Bert base model (uncased), https://huggingface.co/bert-base-uncased, 2018.
[23] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,</p>
      <p>Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[24] B. S. Center, T. Group, Roberta base bne model for spanish, https://huggingface.co/BSC-TeMU/
roberta-base-bne, 2021.
[25] Google AI Language, bert-base-multilingual-cased, https://huggingface.co/google-bert/
bert-base-multilingual-cased, ????
[26] Microsoft, microsoft/mdeberta-v3-base, https://huggingface.co/microsoft/mdeberta-v3-base, 2021.</p>
      <p>Accessed: 2025-06-15.
[27] Hugging Face, distilbert-base-multilingual-cased, https://huggingface.co/distilbert/
distilbert-base-multilingual-cased, 2019.
[28] Facebook AI, xlm-roberta-base, https://huggingface.co/facebook/xlm-roberta-base, 2020.
[29] Facebook AI, roberta-base, https://huggingface.co/facebook/roberta-base, 2019.
[30] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention,
arXiv preprint arXiv:2006.03654 (2020).
[31] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[32] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv
preprint arXiv:1911.02116 (2020).
[33] OpenAI, Gpt-4 technical report, https://openai.com/research/gpt-4, 2023.
[34] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
E. Hambro, F. Azhar, et al., Llama: Open and eficient foundation language models, arXiv preprint
arXiv:2302.13971 (2023).
[35] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung,
C. Sutton, S. Gehrmann, et al., Palm: Scaling language modeling with pathways, arXiv preprint
arXiv:2204.02311 (2022).
[36] V. Galle, vicgalle/xlm-roberta-large-xnli-anli, 2022. Accessed via Hugging Face at https://
huggingface.co/vicgalle/xlm-roberta-large-xnli-anli.
[37] J. Davison, joeddav/xlm-roberta-large-xnli, 2020. Accessed via Hugging Face at https://huggingface.</p>
      <p>co/joeddav/xlm-roberta-large-xnli.
[38] M. Laurer, Moritzlaurer/mdeberta-v3-base-mnli-xnli, 2023. Accessed via Hugging Face at https:
//huggingface.co/MoritzLaurer/mDeBERTa-v3-base-mnli-xnli.
[39] H. Face, cross-encoder/nli-deberta-v3-large, 2021. Accessed via Hugging Face at https://
huggingface.co/cross-encoder/nli-deberta-v3-large.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>European</given-names>
            <surname>Institute for Gender</surname>
          </string-name>
          <string-name>
            <surname>Equality</surname>
          </string-name>
          , Sexism - glossary &amp; thesaurus,
          <year>2017</year>
          . URL: https://eige. europa.eu/thesaurus/terms/1222.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Encyclopedia</given-names>
            <surname>Britannica</surname>
          </string-name>
          <string-name>
            <surname>Contributors</surname>
          </string-name>
          , Sexism,
          <year>2024</year>
          . URL: https://www.britannica.com/topic/ sexism.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Ministerio de Igualdad de España</surname>
          </string-name>
          ,
          <article-title>Casi tres de cada cuatro mujeres jóvenes se han visto expuestas a comentarios sobre su físico en redes sociales</article-title>
          ,
          <year>2024</year>
          . URL: https://www.igualdad.gob.es/comunicacion/notasprensa/ casi-tres-de
          <article-title>-cada-cuatro-mujeres-jovenes-se-han-visto-expuestas-a-comentarios-sobre-su-fisico-en-redes-sociales</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Amorim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <article-title>Measuring sexist content in social media: Evidence from machine learning analysis of tweets, Social Network Analysis and Mining (????)</article-title>
          . URL: https: //ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3740</volume>
          /paper-117.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Drakett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rickett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Day</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Milnes</surname>
          </string-name>
          ,
          <article-title>Old jokes, new media - online sexism and constructions of gender in internet memes</article-title>
          ,
          <source>Feminism &amp; Psychology</source>
          <volume>28</volume>
          (
          <year>2018</year>
          )
          <fpage>89</fpage>
          -
          <lpage>109</lpage>
          . URL: https://doi.org/10. 1177/0959353517727560. doi:
          <volume>10</volume>
          .1177/0959353517727560.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[6] ACM womENcourage</source>
          <year>2024</year>
          , Hackathon: Hacking sexism online,
          <year>2024</year>
          . URL: https://womencourage. acm.org/2024/hackathon/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>EXIST</surname>
          </string-name>
          <year>2025</year>
          Organizers,
          <article-title>Exist 2025: sexism identification in social networks</article-title>
          ,
          <year>2024</year>
          . URL: https: //nlp.uned.es/exist2025/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. C. de Albornoz</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Ruiz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Maeso</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chulvi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Chulvi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          ,
          <article-title>Overview of exist 2024-learning with disagreement for sexism identification and characterization in tweets and memes, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the CLEF 2024 Conference, Lecture Notes in Computer Science</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Convolutional neural networks for sentence classification</article-title>
          ,
          <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          ,
          <article-title>Understanding back-translation at scale</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kuksa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2020</year>
          . URL: https://proceedings. neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <article-title>Term-weighting approaches in automatic text retrieval</article-title>
          , Information Processing &amp;
          <string-name>
            <surname>Management</surname>
          </string-name>
          (
          <year>1988</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Klein</surname>
          </string-name>
          , E. Loper,
          <article-title>Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly Media Inc</article-title>
          .,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>