<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GrootWatch at EXIST 2025: Automatic Sexism Detection on Social Networks - Classification of Tweets and Memes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nathan Nowakowski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Calogiuri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Előd Egyed-Zsigmond</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diana Nurbakova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johan Erbani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sylvie Calabretto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INSA Lyon</institution>
          ,
          <addr-line>CNRS, Universite Claude Bernard Lyon 1, LIRIS, UMR5205, 69621 Villeurbanne</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper presents our participation in the EXIST (sEXism Identification in Social neTworks) challenge at CLEF 2025, focusing on the classification of tweets and memes. We participated in all the tasks for tweets and memes, including both hard and soft classifications for tweets and hard classification for memes. For tweet classification, we propose a multi-task headed BERT model enriched with relevant information surrounding the tweet, helping the model achieve a full understanding of the tweet and its context. For memes, the paper explores the use of a Vision-Language Model (VLM)-based application to detect and categorise sexism in different scenarios, leveraging the ability of such models to understand the relationship between images and text in situations where sexist ideas are often expressed subtly. Our solutions achieved excellent performance, ranking first in all soft-soft tweet classification tasks and second in all hard-hard meme classification tasks. Content Warning: This paper includes examples of hateful, explicit and sexist language presented for illustrative purposes.</p>
      </abstract>
      <kwd-group>
        <kwd>Sexism Identification</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Image Classification</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sexism, in the form of prejudiced or hateful comments, is a prevalent form of digital violence that
must be addressed in a context where social networks and digital platforms are ubiquitous. In 2024,
81% of French women reported experiencing sexist comments on these platforms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This concerning
situation presents a major societal challenge, requiring a balance between the ethical expectations of
moderation and the need to protect free expression. This work takes place against a backdrop in which
platforms such as Meta are drastically relaxing their moderation policies, exacerbating the risks of
polarisation and gendered hatred [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. At the same time, masculinist discourse is gaining in visibility,
making it essential to develop tools capable of mapping and countering these dynamics in real time.
Today’s forms of sexism extend beyond verbal attacks, with diverse representations such as videos,
comments, or images appearing on platforms like X (formerly Twitter), Instagram or TikTok [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Therefore, automatic identification of sexist content on social media becomes a crucial task. To foster
such initiatives, the EXIST 2025 challenge [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] comprises nine subtasks in two languages, English
and Spanish, which are the same three tasks (sexism identification, source intention detection, and
sexism categorisation) applied to three different types of data: text (tweets), image (memes), and video
(TikToks).
• Sexism Identification (Subtasks 1.1, 2.1 and 3.1): This binary task consists of deciding
whether a given message or meme is sexist or not.
• Source Intention Detection (Subtasks 1.2, 2.2 and 3.2): Once a message has been classified
as sexist, this task aims to categorise the message according to the intention of the author. For
tweets and videos, the categories are DIRECT, REPORTED, and JUDGEMENTAL. For memes,
due to their characteristics, the REPORTED label is virtually null, so systems should only classify
memes with DIRECT or JUDGEMENTAL labels.
• Sexism Categorisation (Subtasks 1.3, 2.3, 3.3): This task involves classifying sexist content
into one or more categories: IDEOLOGICAL AND INEQUALITY, STEREOTYPING AND
DOMINANCE, OBJECTIFICATION, SEXUAL VIOLENCE, and MISOGYNY AND
NON-SEXUAL VIOLENCE.
      </p>
      <p>The categories of sexism used in this study are defined on the EXIST 2025 challenge website. Given
the complexity and the need for comprehensive detection tools, we decided to tackle both tweet-based
subtasks (1.1, 1.2, 1.3) and meme-based subtasks (2.1, 2.2, 2.3) in our work. To address this challenge, we
evaluated and compared state-of-the-art techniques, incorporating our insights to propose two tailored
solutions: one for textual classification and another for meme classification.</p>
      <p>The remainder of the paper is organised as follows. In Section 2, we provide a brief overview
of approaches used for automatic detection of sexist content. We then describe the dataset and the
evaluation metrics in Section 3. We describe our proposed solutions for tweets and memes in Section 4.
We report the results of our experiments in Section 5. Section 6 concludes the paper and outlines the
directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In this section, we present the different approaches used to detect online sexism. These methods fall
into four broad categories: traditional approaches, Deep Learning-based approaches, transformer-based
approaches (BERT and LLM) [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], and multimodal approaches.
      </p>
      <p>
        Before the emergence of deep architectures, a number of studies used classic machine learning methods,
such as Logistic Regression, SVMs or Random Forests. These methods were generally combined with
feature extraction techniques (N-Grams, TF–IDF, Static Word Embeddings) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. While these approaches
provided reasonable performance, they were limited in their ability to handle contextual variations and
language evolution.
      </p>
      <p>
        Deep Learning models have made it possible to capture complex patterns using specialised architectures.
CNN-BiLSTM architectures, combining convolutional neural networks (CNNs) to detect local patterns
(e.g. offensive N-Grams) and BiLSTMs to model long-term contextual dependencies, marked a significant
advance [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        The advent of transformers has revolutionised the detection of sexism thanks to their ability to encode
the overall context of text:
• BERT and derivatives: Models such as RoBERTa [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] or DeBERTa [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], pre-trained on massive
corpora, capture semantic nuances and sexist undertones [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
• LLM and contextual reasoning: LLMs (e.g., Llama-3 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]) fine-tuned with methods like LoRA
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] incorporate advanced reasoning capabilities, essential for interpreting emerging cultural
references or sarcasm [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
• Enrichment by sentiment analysis: Sentiment analysis techniques are used to enrich transformer
models in order to detect emotional nuances and tonality. This approach proves effective in
spotting sexist comments sometimes disguised under a veneer of positive or neutral sentiment
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        Existing datasets have played a crucial role in advancing the field of online sexism detection. Notable
examples include:
• Sexist Stereotype Classification (SSC) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]: Collected from Instagram hashtags like
#bloodymen and #metoo, this English dataset contains 5,544 comments annotated manually and through
active learning.
• Semeval 2023 Task 10 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]: Focused on explainable detection of online sexism, this dataset
includes 20,000 English comments from Gab and Reddit, annotated by 19 female annotators with
expert review for disagreements.
• EXIST 2021-2025 [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref23 ref5">20, 21, 22, 23, 5</xref>
        ]: These datasets comprise tweets in English and Spanish,
with the detailed annotator demographics included starting from 2023. Notably, the definition
of sexism varies across sociocultural contexts and annotator biases. The adoption of paradigms
like Learning with Disagreements (LeWiDi) enables consideration of multiple and sometimes
contradictory annotations, thus improving model robustness [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>
        These datasets have contributed significantly to our understanding of online sexism, enabling researchers
to develop more accurate and robust detection methods. With the rise of visual social media platforms,
sexism is increasingly conveyed through multimodal forms such as memes, which blend text and images
to encode prejudice in subtle, culturally loaded ways. This shift has spurred research into models
capable of understanding both modalities simultaneously. Several key datasets support this area:
• MAMI (SemEval-2022) [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]: A benchmark dataset with 10,000 memes annotated for sexism and
fine-grained categories (e.g., shaming, objectification).
• MIMIC [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]: A Hindi-English code-mixed dataset tackling misogyny in multilingual, multimodal
memes with classification tasks.
• EXIST 2024–2025 [
        <xref ref-type="bibr" rid="ref23 ref5">23, 5</xref>
        ] : A shared task that extended sexism detection to memes and, more
recently, TikTok videos, leveraging the LeWiDi paradigm for multilingual and multimodal
challenges.
      </p>
      <p>
        The correlation between textual and visual elements in memes makes VLMs (Vision Language Models),
architectures built by combining large language models and vision encoders, suitable for the task.
Several studies have been done on applying VLMs to social media memes for semantic understanding
[
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and hate speech detection both in a zero-shot [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] and fine-tuning paradigm [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], showing the
effectiveness of this approach. Among the methodologies utilised, the principal models applied in this
research fall into the following categories:
• Transformer-based multimodal systems combining textual encoders, such as BERT, with
visual representations often extracted via CLIP [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].
• LLMs such as GPT-4 are integrated in the classification pipeline to enrich memes with inferred
context and deeper semantic understanding [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
• Multi-task VLMs such as Florence 2 [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] and Qwen 2.5 VL [
        <xref ref-type="bibr" rid="ref32">32</xref>
] show strong generalisation for
cross-modal inputs.
• Lightweight and multilingual models like Mistral 3.1 Small [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] and Aya Vision 8B [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ],
offer high performance with lower resource requirements, supporting deployment across varied
linguistic and visual settings.
      </p>
      <p>
        Despite notable advances, several challenges persist:
1. Knowledge obsolescence: Pre-trained models possess frozen knowledge that may not always
capture recent language and usage developments in tweets, limiting their relevance in current
contexts. Valavi et al. [
        <xref ref-type="bibr" rid="ref35">35</xref>
] emphasise the need to periodically refresh training data to maintain
high performance.
2. Contextual dependence: Correct classification often relies on information not present in the
text itself (e.g., current events, cultural references, emerging trends).
3. Oversight of visual cues: Many methods overlook the information present in images, relying
mostly on accompanying text for meme analysis [
        <xref ref-type="bibr" rid="ref36 ref37">36, 37</xref>
        ].
4. Costly integration: Some approaches integrate image features using large, proprietary models
like GPT-4 [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], but this comes at a significant computational cost and with limitations.
      </p>
      <p>In this paper, we tackle the aforementioned limitations by proposing two novel approaches that
enhance the performance and robustness of sexism detection models on social media, for both
tweets and memes. The details of our methodology are presented in Section 4, which outlines how
we address these challenges and advance the state-of-the-art in sexism detection.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset and Evaluation Overview</title>
      <p>Our study is based on the EXIST 2025 dataset, which offers a rich collection of tweets and memes
annotated for online sexism detection. We draw upon these subsets to train and evaluate our approach.
Tables 1 and 2 summarise the datasets for tweets and memes respectively.</p>
      <p>In the subsequent experimental phase, we will conduct model fine-tuning using the labelled training
set, followed by evaluation on the development dataset. Since the meme dataset does not provide a
development set, the training set was divided into two partitions with an 80/20 split (seed: 1234),
used respectively for fine-tuning and for evaluating results on never-before-seen data. Ultimately, our
approach will be benchmarked against the other participants using the held-out test set.</p>
      <p>
        The official evaluation metric for this challenge is the Information Contrast Measure (ICM) [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ].
Throughout this report, we will employ the normalised variant, ICM Norm, to assess the performance of our
models. We opted for ICM Norm due to its enhanced readability, which results from its normalisation to
a maximum value of 1. Due to the class imbalance in the dataset, as shown in Table 3, we also provide
the F1 score for hard classification to better capture the trade-off between precision and recall. With
respect to the given tables, the Unlabelled class corresponds to records where annotator consensus was
not reached, thereby precluding a definitive ground truth assignment. Furthermore, the percentages for
subtasks 1.3 and 2.3 do not add up to one hundred, as these are multi-label tasks.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Tweets</title>
        <sec id="sec-4-1-1">
          <title>4.1.1. Data processing</title>
          <p>This methodology is structured into two distinct components. The first part focuses on our approach to
tweet analysis, while the second part details our method for meme analysis.</p>
          <p>
            The previous comprehensive literature review on classification techniques revealed that BERT and
LLM models are at the forefront of natural language processing tasks. Given their state-of-the-art
performance, we focused our efforts on these models. Our initial step involved conducting multiple
tests to determine the optimal formatting for tweets to be processed by BERT. This process ensured
that the input data was structured to maximise the model’s performance. For LLM, Quan and Thin [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]
indicated that extensive formatting was unnecessary, simplifying our preprocessing pipeline. In the
end, tweets for BERT were pre-processed using the steps described in Table 4.
Example (original tweet → pre-processed tweet): “Feel #blessed that I have raised a caring &amp;amp;
loving 13 yo who is our Next Gen Feminist Ally. I was crying inside when I got this text. Not only we
must #BreakTheBias for women, we need to do it for our children. @GlobalFundWomen @UN_Women
@womensday @WomeninID https://t.co/UJvloR0IP” → “feel blessed that i have raised a caring &amp;amp;
loving 13 yo who is our next gen feminist ally. i was crying inside when i got this text. not only we
must break the bias for women, we need to do it for our children.”
          </p>
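A minimal sketch of such a pre-processing pipeline follows. The exact steps of Table 4 are not reproduced here; URL and mention removal, camel-case hashtag splitting, whitespace collapsing and lowercasing are assumptions based on the example shown.

```python
import re

def preprocess_tweet(text: str) -> str:
    """Illustrative tweet cleaning; the steps mirror common practice and
    are an assumption, not a reproduction of Table 4."""
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)           # drop @mentions
    # split camel-case hashtags: "#BreakTheBias" -> "Break The Bias"
    text = re.sub(r"#(\w+)",
                  lambda m: re.sub(r"(?<!^)(?=[A-Z])", " ", m.group(1)),
                  text)
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text.lower()

print(preprocess_tweet("Not only we must #BreakTheBias for women @UN_Women https://t.co/x"))
# -> "not only we must break the bias for women"
```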
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Annotator Information Analysis</title>
          <p>
            To investigate the impact of annotator characteristics on sexism detection, we conducted a
comprehensive analysis of annotator information using Chi-Squared tests and Logistic Regression models
with feature importance. To improve the model’s understanding of the subjective nature of sexism,
we identified study level, country of origin, and ethnicity as relevant annotator attributes through
our analysis. By integrating these attributes, we aimed to enhance the model’s ability to capture
diverse perspectives and biases. To achieve this, we vectorised the selected annotators’ information and
embedded it into the CLS token of the BERT model, prior to passing it to the classification head.
Table 5 illustrates a simplified example of the vectorisation process of annotator information, featuring
three annotators for clarity. Note that in our actual implementation, the final vector is 65 elements
long, encompassing a more extensive range of ethnicities (more than 3), study levels (more than 2),
and countries (more than 2). This simplified representation is intended to facilitate understanding and
presentation.
“ethnicities_annotators”:
[“White or Caucasian”, “Hispano or Latino”, “Asian”] ⇒ [1,0,0] + [0,1,0] + [0,0,1] = [1,1,1]
“study_levels_annotators”:
[“High school degree or equivalent”, “Master’s degree”, “High school degree or equivalent”] ⇒ [1,0] + [0,1] + [1,0] = [2,1]
“countries_annotators”:
[“Spain”, “Portugal”, “Portugal”] ⇒ [1,0] + [0,1] + [0,1] = [1,2]
⇒ Concatenation: [1,1,1,2,1,1,2]
⇒ Normalisation: [0.1,0.1,0.1,0.2,0.1,0.1,0.2]
          </p>
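The vectorisation and normalisation illustrated above can be sketched as follows. The vocabularies are truncated to the three-annotator example (the real vector is 65-dimensional), and rounding the normalised counts to one decimal place is our reading of the example values.

```python
from collections import Counter

# Illustrative vocabularies only; the actual implementation covers many more
# ethnicities, study levels and countries (65 dimensions in total).
ETHNICITIES = ["White or Caucasian", "Hispano or Latino", "Asian"]
STUDY_LEVELS = ["High school degree or equivalent", "Master's degree"]
COUNTRIES = ["Spain", "Portugal"]

def count_vector(values, vocab):
    """Sum of one-hot vectors: one count per vocabulary entry."""
    counts = Counter(values)
    return [counts.get(v, 0) for v in vocab]

def vectorise_annotators(ethnicities, study_levels, countries):
    """Concatenate the per-attribute count vectors, then normalise."""
    vec = (count_vector(ethnicities, ETHNICITIES)
           + count_vector(study_levels, STUDY_LEVELS)
           + count_vector(countries, COUNTRIES))
    total = sum(vec)
    return [round(c / total, 1) for c in vec]  # rounded as in the example
```

For the three annotators above, the concatenated counts [1,1,1,2,1,1,2] normalise to [0.1,0.1,0.1,0.2,0.1,0.1,0.2], matching the worked example.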
        </sec>
        <sec id="sec-4-1-3">
          <title>4.1.3. Fine-Tuning and Initial Results</title>
          <p>
            Our fine-tuning efforts with both BERT and LLM models yielded promising results (cf. Table 6), closely
approaching the top performances achieved in the previous year [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ]. The specific experimental
configurations employed are detailed in Appendix A. Notably, we drew inspiration from last year’s edition [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]
and complemented this with empirical tests on our side to determine the optimal hyperparameters and
prompts.
          </p>
          <p>However, we sought to further enhance our approach. An analysis of misclassification revealed
that certain tweets were incorrectly classified due to their ambiguity or references to recent topics not
present in the training data. For instance, tweets referencing very recent events or slang not included
in the model’s vocabulary posed significant challenges.</p>
        </sec>
        <sec id="sec-4-1-4">
          <title>4.1.4. Leveraging AI Agents for Contextual Information</title>
          <p>
            To overcome the limitations of traditional models, we leveraged the capabilities of AI agents that can
dynamically interact with their environment through tools, plan actions, and integrate external data in
real-time [
            <xref ref-type="bibr" rid="ref40">40</xref>
            ]. Our approach is exemplified in Figure 1, which illustrates the workflow of our agent when
faced with an ambiguous tweet referencing a meme about a pregnant woman in Oklahoma. The tweet
was initially misclassified as sexist by our base model; we hypothesise this was due to the presence of
keywords like ’woman’ and ’pregnant’, as our analysis of the TF-IDF representation of misclassified
samples revealed that these words tend to dominate the feature space, leading to incorrect sexist classifications.
To address this limitation, we propose an innovative solution: our AI agent intervenes to identify the
need for context (1) and dynamically queries a search engine (2) to gather relevant information. The
agent then analyses the search results (3), and extracts crucial context (4), enabling the capture of
sexist–or non-sexist–connotations or nuances linked to recent events that are invisible to static models.
If no additional context is required, the agent indicates "No external context needed". By harnessing the
potential of AI agents, we aim to improve the relevance and robustness of sexism detection, adapting to
the rapid evolution of language and diverse contexts on social media platforms.
          </p>
          <p>(Figure 1: workflow of the AI agent: the task prompt and tweet are given to the agent, which performs a web-search step for the ambiguous “pregnant woman in Oklahoma” tweet and returns the extracted context in its response.)</p>
          <p>We equipped our agent with the DuckDuckGo web search tool. We considered several options for
utilising this AI agent:
• Direct Classification by the Agent: The agent classifies the tweet directly using relevant
information gathered from the web search. Its user prompt is available in Appendix B.
• Context Retrieval by the Agent: The agent retrieves contextual information around the tweet
using the web search (the AI agent user prompt is available in Appendix C), which can then be
fed into a BERT or LLM.</p>
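The context-retrieval mode can be sketched as a simple loop. `ask_llm` and `web_search` below are hypothetical injected callables standing in for the Llama-3.3-70B-Instruct agent and the DuckDuckGo tool; the actual prompts are those given in Appendices B and C, not the placeholder strings used here.

```python
def retrieve_context(tweet: str, ask_llm, web_search) -> str:
    """Context-retrieval sketch: (1) decide whether external context is
    needed, (2) query a search engine, (3)+(4) analyse the results and
    extract the relevant context. The callables are hypothetical stand-ins
    for the real agent LLM and search tool."""
    decision = ask_llm("Does this tweet need external context (yes/no)? " + tweet)
    if decision.strip().lower().startswith("no"):
        return "No external context needed"
    query = ask_llm("Write a web search query clarifying this tweet: " + tweet)
    results = web_search(query)
    return ask_llm("Extract the context relevant to the tweet from: " + results)
```

The returned context string can then be paired with the tweet for the Siamese Dual Encoder, or appended to the LLM prompt.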
          <p>
            – BERT-based Architecture: We fed the retrieved context into a BERT model, employing a
Siamese Dual Encoder architecture (SDE) [
            <xref ref-type="bibr" rid="ref41">41</xref>
            ]. This design choice was motivated by our
empirical findings, as alternative architectures yielded inferior results.
– LLM-based Approach: We incorporated the retrieved context into the prompt, as detailed
in Appendix E, and then fine-tuned an LLM to classify the tweet.
          </p>
          <p>To optimise the LLM performance and output format for each experimental configuration, various prompts
have been empirically tested. Furthermore, a system prompt is appended to the LLM Agent by the
library, facilitating correct parsing of its output for tool calls. For more information on the library
details, see Section 5.1.</p>
          <p>We employed two distinct LLMs in our approach: one for powering the autonomous AI agent and
another for fine-tuning to classify tweets.</p>
          <p>• Autonomous AI Agent: We chose the Llama-3.3-70B-Instruct model for its ability to
handle complex tasks, which requires well-formatted responses to effectively leverage tools.
• Fine-Tuned LLM for Classification: Due to computational limitations, we used a smaller LLM,
Llama-3.2-3B-Instruct, for fine-tuning. Despite using 4-bit quantisation, we lacked the
necessary computational resources to fine-tune a 70B model.</p>
          <p>
            Our experiments, presented in Table 7, reveal that BERT models augmented with contextual
information outperform LLMs with context, underscoring the efficacy of contextual enrichment for encoder-only
architectures. In contrast, the incorporation of context into fine-tuned LLMs appears to degrade
performance, potentially due to the phenomenon of context hijacking [
            <xref ref-type="bibr" rid="ref42">42</xref>
            ], where the model overemphasises
contextual cues. Nonetheless, the AI agent’s direct classification surpasses the zero-shot baseline in Table
6. Consequently, we will pursue a BERT-based architecture to fully leverage the potential of contextual
retrieval, as the LLM approach does not seem to yield comparable performance gains.
          </p>
          <p>Moving forward, our approach will leverage contextual information retrieved and formatted by an
AI agent. As this external data is generated, evaluating its quality is a necessary consideration. An
initial evaluation of the generated contexts is presented in Appendix D. While we do not delve further
into this aspect in this paper, as it is not the primary focus, additional analysis may be merited for this
case and future applications.</p>
        </sec>
        <sec id="sec-4-1-5">
          <title>4.1.5. Soft Label Learning</title>
          <p>
            One of the significant challenges we encountered was annotator disagreement, i.e., the Unlabelled data.
When there was no clear majority—such as three "YES" and three "NO" or three "DIRECT" and three
"JUDGEMENTAL"—we could not use these data points because we were training the model for hard
label classification. This was not a trivial detail, as the amount of training data can impact model
performance [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ]. For instance, for the first task, we were losing around 10% of the data, and this loss
increased with tasks 1.2 and 1.3.
          </p>
          <p>
            A solution we identified was to train the model with probabilities rather than hard labels, aligning with
the principles of soft label learning (SLL) as explored in [
            <xref ref-type="bibr" rid="ref44">44</xref>
            ]. This study demonstrated that incorporating
information about the uncertainty of the outcome in classification models can significantly enhance
performance compared to the standard approach of hard label learning (HLL). For example, when a
tweet had annotations of five "YES" and one "NO," we previously provided "YES" as the training input.
With probabilities, the input would be [0.83, 0.17]. This new formatting approach allowed us to achieve
two key improvements: taking into account the whole training dataset and better capturing annotator
discordance, aligning more closely with the LeWiDi paradigm. Our experiments demonstrated that this
method improved the ICM-Hard Norm by 1 point and the ICM-Soft Norm by 2 points.
          </p>
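Converting raw annotator votes into such soft targets is straightforward; a minimal sketch:

```python
from collections import Counter

def soft_labels(annotations, classes=("YES", "NO")):
    """Turn raw annotator votes into a probability distribution, so that
    tied votes (e.g. 3 YES / 3 NO) remain usable instead of being dropped."""
    counts = Counter(annotations)
    n = len(annotations)
    return [counts.get(c, 0) / n for c in classes]

print(soft_labels(["YES"] * 5 + ["NO"]))      # five YES, one NO -> [0.83..., 0.16...]
print(soft_labels(["YES"] * 3 + ["NO"] * 3))  # tied vote -> [0.5, 0.5], kept for training
```

The resulting vectors serve directly as training targets, recovering the previously discarded Unlabelled examples.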
        </sec>
        <sec id="sec-4-1-6">
          <title>4.1.6. Model Runs and Performance</title>
          <p>
            Having selected the BERT architecture, we conducted extensive runs (summarised in Figure 2) with several
of these models, including XLM-RoBERTa [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], Deberta V3 [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] and ModernBERT [
            <xref ref-type="bibr" rid="ref45">45</xref>
            ] variants.
XLM-RoBERTa emerged as the best-performing model with contextual injection and annotator information.
          </p>
        </sec>
        <sec id="sec-4-1-7">
          <title>4.1.7. Multi-Task BERT Architecture</title>
          <p>
            One of the key advantages of selecting the BERT architecture is that, with minimal additional efort and
computational resources, we can accommodate all three tasks and both hard and soft labels within a
single multi-task BERT model [
            <xref ref-type="bibr" rid="ref46 ref47">46, 47</xref>
            ]. This design enables knowledge sharing across tasks by leveraging
the base layers of the BERT model, while task-specific output heads capture the unique characteristics
of each task.
          </p>
          <p>
            Building upon the best existing approach, which employed a multi-task BERT [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ], we sought to further
improve it. Notably, our analysis revealed that the probability of a "NO" label remains consistent across
all three tasks. This observation led us to propose a novel 2+1 architecture (cf. Figure 3), wherein one
classification head is dedicated to softmax labels (subtasks 1.1 and 1.2) and another to sigmoid labels
(subtask 1.3). Specifically, this design allocates the first two tasks to the first classification head and the
third task to the second classification head.
          </p>
          <p>A crucial aspect of our proposed architecture is that we leverage the consistency of "NO" probability
across all three tasks. By recognising this consistency, we adapted our training approach to compute the
loss of the Classifier B (subtask 1.3) only when the tweet is classified as sexist by the Classifier A. This
hierarchical design enables us to filter out non-sexist examples and focus on the relevant samples for
subtask 1.3, thereby improving performance and establishing coherence between the two classification
heads despite their distinctness.</p>
          <p>In contrast, our experiments with a single classification head for all categories did not perform well,
likely due to the large number of categories. Similarly, attempting to predict only the "YES" probability and
deriving the "NO" probability as 1 − P("YES") also yielded subpar results.</p>
          <p>Notably, this 2+1 architecture significantly impacted the performance of our results for subtasks 1.2 and
1.3. While subtask 1.1 results remained relatively consistent, our proposed architecture demonstrated
substantial improvements for the latter two tasks. In particular, it led to a substantial improvement
in soft classification, with an increase of two to three ICM Soft Norm points. The final results of our
model are presented in Section 5.2.</p>
          <p>(Figure 3: the 2+1 multi-task architecture. Tweet and context embeddings are fed to a shared BERT encoder; a softmax output head (Classifier A) predicts P(NO), DIRECT, REPORTED and JUDGEMENTAL, while a sigmoid output head (Classifier B) predicts the five sexism categories: IDEOLOGICAL AND INEQUALITY, STEREOTYPING AND DOMINANCE, OBJECTIFICATION, SEXUAL VIOLENCE, and MISOGYNY AND NON-SEXUAL VIOLENCE.)</p>
        </sec>
        <sec id="sec-4-1-8">
          <title>4.1.8. Result Formatting</title>
          <p>To format the results, we rounded the probabilities to the nearest 1/6 (as there are six annotators) and
ensured that the sum of probabilities was 1 for subtasks 1.1 and 1.2. For hard classification, we adopted
the following strategies: for subtasks 1.1 and 1.2, we selected the feature with the maximum probability;
for subtask 1.3, a multi-label classification task, we chose all features with probabilities exceeding
0.25. This threshold was determined through testing on the training and development datasets during
soft-to-hard label conversion.</p>
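          <p>The rounding and thresholding rules can be sketched as follows. This is an illustrative sketch under our own tie-breaking assumption of absorbing rounding drift into the largest entry; the paper does not specify how rounding conflicts are resolved:</p>

```python
def format_soft(probs):
    """Round each probability to the nearest 1/6 (six annotators) and
    repair the sum so the rounded distribution still adds up to 1."""
    sixths = [round(p * 6) for p in probs]
    # assumption: absorb any rounding drift into the largest entry
    sixths[sixths.index(max(sixths))] += 6 - sum(sixths)
    return [s / 6 for s in sixths]

def to_hard_multilabel(probs_by_class, threshold=0.25):
    """Subtask 1.3 hard labels: keep every category whose probability
    exceeds the 0.25 threshold chosen on the training/dev data."""
    return [c for c, p in probs_by_class.items() if p > threshold]
```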
          <p>In summary, our methodology involved a thorough literature review, extensive testing, and innovative
use of AI agents to enhance contextual understanding. It also incorporates annotator information to
address subjectivity and employs a multi-task headed approach, sharing base layers across tasks while
capturing unique characteristics through specific output heads.</p>
        </sec>
        <sec id="sec-4-2">
          <title>4.2. Memes</title>
        </sec>
        <sec id="sec-4-1-9">
          <title>4.2.1. Data preprocessing</title>
          <p>Regarding the Meme Dataset, we first wanted to verify the accuracy of the text and image pairs provided
together. For each meme, we extracted the superimposed text using Florence-2 Large and then
compared it with the provided text. The average Jaccard similarity in terms of unigrams and bigrams was
0.9518 and 0.9495, respectively, a minor difference that can be explained as follows:
• for unigrams, since diacritics matter, two semantically equal words can be treated as different
(e.g. "tenia" vs "tenía");
• for bigrams, defined as sequences of two adjacent words, the order of words has an
effect on the computed Jaccard similarity.</p>
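          <p>This comparison can be reproduced with a few lines of code. The sketch below assumes naive lowercase whitespace tokenisation, which is our assumption; the exact tokeniser used for the analysis is not specified:</p>

```python
def ngrams(text, n):
    """Split text into lowercase word n-grams."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def jaccard(a, b):
    """Jaccard similarity between two token lists, treated as sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

          <p>On a four-word sentence, the single diacritic difference "tenia"/"tenía" already lowers the unigram similarity to 0.6 and the bigram similarity to 0.2, illustrating why the aggregate scores sit slightly below 1.</p>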
          <p>Through this comparative analysis of the extracted and given texts, we observed that the superimposed
texts provided with the data exhibit superior transcription quality compared to those extracted using
Florence-2. Notably, these texts feature proper accentuation and sequentiality, resulting in a readability
closer to human standards. An exemplary illustration of these findings is presented in Figure 4.
Consequently, we opt to utilise the provided text instead of relying on a specific extraction technique.</p>
        </sec>
        <sec id="sec-4-1-10">
          <title>4.2.2. Approach overview</title>
          <p>To tackle meme classification, informed by our literature review which highlighted the necessity of
exploring multiple approaches, we investigated three complementary strategies:
• Caption-Based Classification: representation of meme images as textual captions and
classification of the captions using a fine-tuned text model.
• Frozen Multimodal Classification: usage of pretrained VLMs in zero-shot and few-shot settings
without fine-tuning.
• Fine-Tuned Multimodal Classification: fine-tuning of medium-to-large VLMs on labelled
sexist and non-sexist memes for task-specific performance.</p>
        </sec>
        <sec id="sec-4-1-11">
          <title>4.2.3. Caption-based Classification</title>
          <p>In this text-based classification approach, represented in Figure 5, meme images were first transformed
into textual descriptions using Qwen 2.5 VL 32B (1) and these captions (2), jointly with their respective
ground truths (3), were then used as input for fine-tuning XLM-RoBERTa (4). This two-stage pipeline
was designed to exploit the visual understanding of vision-language models and the adaptability of
multilingual transformers.</p>
          <p>To analyse the impact of visual description granularity, we generated two types of captions:
• Short Captions: concise descriptions capturing minimal visual content.
• Detailed Captions: rich, context-aware descriptions reflecting nuanced or subtle cues in the
image.</p>
          <p>Figure 6 shows how the textual representation of the same meme can differ. The prompts employed
for caption generation are disclosed in Appendices L to N.</p>
        </sec>
        <sec id="sec-4-1-12">
          <title>4.2.4. Frozen Multimodal Classification</title>
          <p>This approach used frozen vision-language models in zero-shot and few-shot scenarios without
task-specific fine-tuning, in order to simulate realistic low-resource classification settings. We evaluated the
following VLMs:
• Qwen-VL 2.5 in its 7B, 32B, and 72B variants (zero-shot and few-shot)
• Aya Vision 8B (zero-shot)
• Mistral Small 3.1 24B (zero-shot)
In the zero-shot setting, models were given only the meme image and a minimal classification prompt
(shown in Appendices G to K), with no prior examples. The prompts were largely
based on the guidelines provided to annotators for meme labelling across the three subtasks. We also
tested variants where the model received only the image, the image plus the superimposed text, or only
the superimposed text. These variations aimed to quantify the importance of the superimposed textual
content for the final prediction. To evaluate few-shot performance, we included six example memes in
the prompt using two different sampling strategies:
• Random Few-Shot Sampling: six random examples from the training set, with a balanced
split between sexist and non-sexist memes
• Polarised Few-Shot Sampling: three clearly sexist and three clearly non-sexist memes (i.e.,
with ≥5 of 6 annotators in agreement).</p>
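          <p>The two sampling strategies can be sketched as follows. This is an illustration only; the record fields 'label' and 'yes_votes' are our assumptions about the data format:</p>

```python
import random

def sample_few_shot(memes, strategy="balanced", k=6, seed=0):
    """Pick k in-prompt examples; each meme is a dict with a 'label'
    (YES/NO) and 'yes_votes' (number of YES annotations out of six)."""
    rng = random.Random(seed)
    yes = [m for m in memes if m["label"] == "YES"]
    no = [m for m in memes if m["label"] == "NO"]
    if strategy == "polarised":
        # keep only memes with >= 5 of 6 annotators in agreement
        yes = [m for m in yes if m["yes_votes"] >= 5]
        no = [m for m in no if m["yes_votes"] <= 1]
    return rng.sample(yes, k // 2) + rng.sample(no, k // 2)
```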
          <p>Images were resized to a maximum of 262,144 pixels (e.g., the size of a 512×512 image) while
maintaining their original proportions to fit within GPU memory constraints.</p>
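          <p>The resizing rule amounts to scaling both sides by the square root of the area ratio; a short sketch (integer truncation is our assumption):</p>

```python
import math

def target_size(width, height, max_pixels=262_144):
    """Largest size with the same aspect ratio whose area does not
    exceed max_pixels; images already under the cap are left alone."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, int(width * scale)), max(1, int(height * scale))
```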
        </sec>
        <sec id="sec-4-1-13">
          <title>4.2.5. Fine-Tuned Multimodal Classification</title>
          <p>Finally, we tested the effectiveness of fine-tuning a set of VLMs for both sexism identification and
classification. Specifically:
• For subtask 2.1, we fine-tuned Florence 2 and Qwen 2.5 VL (7B and 32B). The dataset used for
fine-tuning was gathered from the available ground truths for the task, for a total of 3,420 meme
image-label pairs.
• For subtasks 2.2 and 2.3, only Qwen 2.5 VL 32B was fine-tuned. The data curation criteria were
slightly different from those of subtask 2.1, since we excluded the memes labelled as not sexist
from the ground truths of the sexism identification task. As a result, the number of considered
records was 1,815 (of the 3,197 available ground truths) for source intention classification
and 2,868 (of the 4,250 available ground truths) for sexism categorisation.</p>
          <p>The experimental setup and the fine-tuning hyperparameters for both Florence-2 and Qwen 2.5
VL are presented in detail in Appendix O. In contrast to previously proposed methods for meme
analysis, our LLM-based solution offers a lightweight yet effective approach to detecting
and classifying sexism in memes while incorporating the entire visual content into the classification
pipeline. By avoiding high inference costs and proprietary APIs, this approach ensures compatibility
with low-to-mid-tier hardware and promotes reproducibility by reducing computational requirements.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. System setting</title>
        <p>All experiments were conducted using PyTorch 2.5.1, the Hugging Face Transformers 4.50 library, and
the Smolagents 1.4 library for AI agent development. The computational environment consisted of two
GPUs with the following specifications:
• NVIDIA A40 (46 GiB), driver version 555.42.06, CUDA 12.5
• NVIDIA A100 (40 GiB), driver version 555.42.02, CUDA 12.5</p>
        <p>
          Additionally, VLMs with more than 7 billion parameters were loaded using the Bitsandbytes 4-bit
quantisation technique, which reduces the size of the model and computational costs by representing
weights and activations with just 16 discrete levels [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. This technique significantly reduces memory
usage and accelerates inference while having minimal impact on model accuracy.
        </p>
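        <p>The idea of representing weights with 16 discrete levels can be conveyed with a toy symmetric absmax quantiser. Bitsandbytes itself uses a more elaborate blockwise NF4 scheme, so this sketch only illustrates the principle:</p>

```python
def quantise_4bit(values):
    """Toy symmetric absmax quantisation onto 16 integer levels (-8..7)."""
    absmax = max(abs(v) for v in values)
    scale = (absmax or 1.0) / 7  # map [-absmax, absmax] onto roughly -7..7
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantise(q, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [qi * scale for qi in q]
```

        <p>Storing only the 4-bit codes plus one scale per block is what yields the roughly fourfold memory reduction over 16-bit weights.</p>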
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Development Phase</title>
        <p>In this section, we present the performance metrics of our proposed methods across all three
tasks. For tweets, evaluations were conducted under both soft and hard contexts, whereas meme-based
methods were assessed under the hard evaluation setting. Our model was trained on the provided
training dataset and evaluated on the corresponding validation dataset for tweets. For memes, as
mentioned before, we split the training dataset to create a validation dataset, thereby enabling us to
assess the model’s generalisation capabilities.</p>
        <p>Regarding the sexism identification task in memes, the main results presented in Table 10 (full
results in Appendix F) indicate that models incorporating multimodal inputs generally perform better
on the task. Indeed, since the creation of memes and their virality across communities
are based on a strong correlation between textual and visual elements, analysing the textual content
alone can result in a partial or incomplete comprehension of the content.</p>
        <p>
          However, it is worth mentioning that zero-shot classification of the superimposed text using Qwen 2.5
VL 32B achieves results that are relatively close to the best ICM values obtained, while outperforming
other methods that leverage meme images in their pipeline. This suggests that, for this specific type of
meme, text plays a significant role in the final prediction. This may be explained by the fact that the
creators of the EXIST Meme dataset gathered images by curating a lexicon of 250 terms that were used
as search queries on Google Images. Additionally, textless images were removed manually, centring the
dataset on textual elements [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. Among text-based methods, the performance of caption-based
classification with XLM-RoBERTa was found to be inferior to that of superimposed-text-based prediction.
This suggests that captions may lack descriptive information crucial for a proper
classification.
        </p>
        <p>The fine-tuned Qwen 2.5 VL 32B model achieved the best results across all metrics, showing a
+7.8-point improvement in the ICM-Hard Norm metric compared to zero-shot classification performed
with the off-the-shelf version of the same model.</p>
        <p>To gain a clearer view of the results obtained by the best-performing method on subtask 2.1 Hard, we
calculated the proportion of misclassified memes for which the annotators gave unanimous answers
(i.e. all YES or all NO). Only 11.33% of misclassifications fall into this category, indicating that
the model rarely errs on memes for which there is full human agreement. Memes
with a single dissenting annotator account for 34.13% of misclassifications. More than
half of the misclassifications (54.54%) come from memes for which two annotators disagree with the
others, i.e. situations in which the evaluation of content is inherently more intricate from a human
perspective.</p>
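        <p>The bucketing behind these percentages can be sketched as follows (a hypothetical helper of our own; each input value is the number of YES votes out of six annotators for one misclassified meme):</p>

```python
from collections import Counter

def disagreement_profile(yes_votes_per_misclassified):
    """Share (%) of misclassified memes by number of dissenting
    annotators: 0 = unanimous, 1 = one dissenter, 2 = two dissenters."""
    buckets = Counter(min(v, 6 - v) for v in yes_votes_per_misclassified)
    n = len(yes_votes_per_misclassified)
    return {k: round(100 * c / n, 2) for k, c in sorted(buckets.items())}
```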
        <p>Given the superior performance of fine-tuned Qwen 2.5 VL 32B on subtask 2.1, we adopted this
method for subtasks 2.2 and 2.3. This decision allowed us to narrow the scope of the study and avoid
redundant evaluations. Additionally, we reduced the number of experiments to minimise computational
cost and environmental impact, striking a balance between empirical validation and responsible resource
usage. Tables 11 and 12 show the results for source intention classification in memes.
A closer examination of the per-class F1 scores indicates that the model is
highly capable of identifying memes that overtly promote sexist ideologies. Indeed, the relatively high
F1 score for the DIRECT class indicates that this category of content is more easily identifiable by the
model. Performance drops sharply for the JUDGEMENTAL class: the low F1 score of 0.1413 suggests
that the model has difficulty identifying content that criticises sexism. This may be due to the complex
nature of such memes, which often rely on sarcasm, as shown in Figure 7. Additionally, this degradation
in performance may be correlated with the under-representation of this class, which accounts for just 14.38%
of all ground truths.</p>
        <p>
          Similar considerations apply to the results obtained in the sexism categorisation task,
displayed in Tables 13 and 14. Moderate F1 scores are observed among the sexist categories IDEOLOGICAL-
INEQUALITY, STEREOTYPING-DOMINANCE and OBJECTIFICATION (between 0.56 and 0.58), each of
which also appears in over a quarter of the ground truth data. However, the model struggles to identify
the categories MISOGYNY-NON-SEXUAL-VIOLENCE and SEXUAL-VIOLENCE, which represent 11.65%
and 14.18% of the ground truths, respectively. The observation that the two lowest F1 sub-values are
associated with these classes, together with the considerations made on the results of subtask 2.2, suggests
that low statistical representation constitutes a strong learning limit for this model. In this field of
research, the relationship between the volume of available data and classification precision has
already been examined for other types of models [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ]. Providing a larger number of examples could
therefore improve the ability of fine-tuned Qwen 2.5 VL 32B to recognise more generalised patterns
associated with sexism categorisation and identify these instances more precisely.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Evaluation phase</title>
        <p>This section presents the results obtained by our team in
the EXIST 2025 challenge on the given test data.</p>
        <sec id="sec-5-3-1">
          <title>Tweet Classification</title>
          <p>We trained our tweet classification model with three different seeds (0, 1, and 42), resulting in three
submissions: GrootWatch_1, GrootWatch_2, and GrootWatch_3. The performance of these models
on the tweet test set is shown in Tables 15 to 17. Notably, our model consistently ranked first in the
Soft-Soft category across all languages for subtasks 1.1, 1.2, and 1.3. In the more challenging Hard-Hard
category, we always placed within the top 20 out of over 130 submissions.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>Meme Classification</title>
          <p>Based on the results on memes for the development dataset, we submitted our runs using the following
methods:
• GrootWatch_1: Zero-shot classification of the superimposed text with Qwen 2.5 VL 32B
• GrootWatch_2: Zero-shot classification of the meme images with Qwen 2.5 VL 32B
• GrootWatch_3: Classification with fine-tuned Qwen 2.5 VL 32B
For subtasks 2.2 and 2.3, we used the fine-tuned Qwen 2.5 VL 32B model based on the YES predictions
from the three distinct submissions on subtask 2.1. The results for meme classification in the hard
evaluation setting are shown in Tables 18 to 20. Our methods demonstrated remarkable strength, with
eight out of nine submissions achieving a top five ranking. The predictions obtained by fine-tuning
Qwen 2.5 VL 32B consistently ranked second across all subtasks, achieving first place in subtask 2.1 on
the Spanish instances.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>
        Our sexism detection approach achieved state-of-the-art performance in Soft-Soft classification for
tweet analysis. The combination of contextual information search, annotator profile integration, soft
label learning, and multi-task architecture proved particularly effective in this category. However, the
Hard-Hard category remains a challenging task. Notably, our results revealed that simply
using the soft probabilities to infer the hard label is not a sufficient strategy for tackling this challenge.
One potential avenue for future research lies in optimising the inference time for context retrieval with
AI agents. Currently, this process is relatively slow compared to direct inference with language models
such as BERT or a standalone LLM. To address this limitation, a possible solution could be the development of a shared dictionary
or database of contexts that can be efficiently queried and retrieved. In cases where the desired context
is not already present in the database, the system could be designed to search for it online and then
store it in the database for future reference. This approach has the potential to significantly reduce
inference times, enabling more efficient and scalable AI-powered language understanding.
Furthermore, despite the promise of incorporating context into language models, our experiments
suggest that fine-tuning LLMs with context actually degrades performance. A possible explanation for
this phenomenon is the concept of context hijacking [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ], where the model overemphasises contextual
cues and loses focus on the primary task. Further research is needed to verify this hypothesis and
uncover the underlying causes of this performance drop, which will be crucial in unlocking the full
potential of context-aware language models.
      </p>
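      <p>The proposed shared context store could be sketched as follows (a minimal sketch; retrieve_online is a hypothetical callable standing in for the slower agent-based online search):</p>

```python
class ContextCache:
    """Shared store of tweet contexts: serve cached entries and fall back
    to (and memoise) an online lookup on a cache miss."""

    def __init__(self, retrieve_online):
        self._store = {}
        self._retrieve_online = retrieve_online

    def get(self, tweet):
        if tweet not in self._store:  # cache miss: search online once
            self._store[tweet] = self._retrieve_online(tweet)
        return self._store[tweet]
```

      <p>Once a context has been fetched, every later query for the same tweet is answered from the store, so the expensive agent pipeline runs at most once per tweet.</p>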
      <p>Compared with the best results obtained on meme classification in the past edition of EXIST, which
were mostly based on textual elements, the results obtained by our team in the current edition confirm
that full integration of meme images into the classification pipeline leads to better performance.
Despite the top-tier results achieved, the proposed approaches present some limitations:
• Multi-task learning: Qwen 2.5 VL and Florence-2 were fine-tuned using the available ground
truths for the three subtasks to minimise cross-entropy loss. However, introducing a specific
loss function that captures the interaction between subtasks could help the model leverage the
full potential of the given data and achieve better performance.
• Meme Dataset split: The dataset was split 80/20 for training and testing. Despite the significant
computational time required for repeated VLM fine-tuning, future work may consider
cross-validation to obtain a more comprehensive assessment of model generalisation.</p>
      <p>
        Using optimal transport theory and the principle of maximum entropy, Erbani et al. [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ] proposed the
extended confusion matrix (TCM), which applies to single-label, multi-label, and soft-label classification
tasks. TCM keeps the familiar structure of a standard confusion matrix: a square matrix sized by
the number of classes, with diagonal entries representing correct predictions and off-diagonal entries
showing confusions.
• Subtask 1.1: The confusion matrix shows a strong diagonal, indicating strong performance.
• Subtask 1.2: The diagonal entries are higher than off-diagonal ones, showing good model
accuracy. The DIRECT and NO classes have the highest diagonal values but also strong column
values, suggesting the model over-predicts these classes. This is especially true for NO, which
shows the lightest row and the darkest column. JUDGEMENTAL and REPORTED have lower
diagonal values and are often confused with DIRECT and NO, especially REPORTED.
• Subtask 1.3: Again, diagonal values are higher than others, confirming good model behaviour.
      </p>
      <p>The NO class has the lowest row and highest column values, indicating over-prediction that
harms other classes. Notable confusions include MISOGYNY-NON-SEXUAL-VIOLENCE being
misclassified as NO, and SEXUAL-VIOLENCE being confused with
MISOGYNY-NON-SEXUAL-VIOLENCE, STEREOTYPING-DOMINANCE, or NO.</p>
      <p>Future work could build on this analysis to reduce current misclassifications and enhance our method.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a
scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well
as other organisations (see https://www.grid5000.fr).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 and DeepL Write to correct grammar and
spelling, rewrite unnatural phrases, and improve tone. After using these tools/services,
the authors reviewed and edited the content as needed and take full responsibility for the publication’s
content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Hyperparameter Settings and Prompts</title>
      <p>To achieve the results, we fine-tuned BERT with the following hyperparameters:
• Learning rate: 1e-5
• Batch size: 64
• Weight decay: 0.05
• Number of epochs: 5
For LLM fine-tuning using LoRA, we employed the following configuration:
• Batch size (training and evaluation): 32
• Gradient accumulation steps: 4
• Optimizer: PagedAdamW_8bit
• Learning rate: 5e-5
• Precision format: bf16
• Warm-up ratio: 0.1
• LoRA’s matrix decomposition rank (r): 4
• Alpha LoRA: 16
• Targeted modules: self_attn.q_proj, self_attn.k_proj, self_attn.v_proj,
self_attn.o_proj, mlp.gate_proj, mlp.up_proj, and mlp.down_proj
The zero-shot and fine-tuning prompts used in our experiments are:</p>
      <p>Task: Classify tweets as YES (sexist) or NO (not sexist).</p>
      <p>YES: Explicit sexism, descriptions of sexist situations, or criticism of sexism (even implied).
NO: Neutral content. Ignore non-sexist vulgarity. Use societal context.</p>
      <p>Answer: (Only YES or NO)</p>
      <p>Tweet: {tweet}</p>
    </sec>
    <sec id="sec-10">
      <title>B. User Prompt Subtask 1.1 - AI agent direct classification</title>
      <p>Task: Determine whether a tweet is sexist. Categories: YES: The tweet is inherently sexist,
describes a sexist situation, or criticises sexist behaviour. Examples:
• “Women are too emotional to hold leadership positions.”
• “At the meeting, all my ideas were ignored until a male colleague repeated them.”
• “Catcalling is not a compliment; it’s harassment.”
NO: The tweet does not contain sexist content, nor does it describe or criticise sexist
situations or behaviours. Examples:
• “Looking forward to the weekend!”
• “Really looking forward to today’s ‘women in web3’ lunch meetup! If you’re in the la
area and want to join, send me a dm!! See you ladies soon.”
• “Wow! Trouble making witches unite!”
Additional Guidelines:
• Ambiguous Language: If the tweet’s sexism is implied rather than explicit, classify it
as ’YES.’ If context is insufficient, classify it as ’NO.’
• Strong or Vulgar Language: Classify based on content relevance to sexism, not on the
presence of strong language alone.
• Contextual Understanding: Consider societal norms and the broader conversation
when evaluating the tweet.</p>
      <p>Your final answer will be YES or NO.</p>
      <p>Tweet: {tweet}</p>
    </sec>
    <sec id="sec-11">
      <title>C. User Prompt for AI agent context retrieval</title>
      <p>Task: Retrieve concise external context to clarify ambiguous tweets or cultural references
for sexism classification. Do NOT classify the tweet—only provide context that would help
a downstream model to decide.</p>
      <p>When to retrieve context:
• The tweet references events, lyrics, memes, or cultural artefacts unfamiliar to a general
audience.
• The language is ambiguous (e.g., sarcasm, coded terms, or terms with dual meanings).
• The tweet hints at a broader societal debate or news story.</p>
      <p>Guidelines:
1. No classification: Never output YES/NO. Your role is purely contextual.
2. Conciseness: Summarise external context in ≤ 100 tokens.
3. Relevance: Only include context directly tied to potential sexism (e.g., explain a
referenced event’s sexist controversy, not general info).</p>
      <p>4. No context? Output “No external context needed.”
Output Format: [Summary of context, or “No external context needed.”]
Examples:
1. Tweet: “Ugh, not another ‘Boss Babe’ anthem...”</p>
      <p>Output: “The term ’Boss Babe’ is associated with MLM schemes targeting women,
often criticised for exploiting feminist rhetoric. Some view it as empowering, others
as patronising.”
2. Tweet: “This is why we need more #NotAllMen energy.”</p>
      <p>Output: “#NotAllMen is a hashtag used to critique men who derail conversations
about sexism by insisting ’not all men’ are problematic. Often cited in debates about
systemic misogyny.”
3. Tweet: “Finally got tickets to the concert!”</p>
      <p>Output: “No external context needed.”</p>
    </sec>
    <sec id="sec-12">
      <title>D. Context Analysis</title>
      <p>We conducted a preliminary assessment of the generated contexts to evaluate their quality, relevance
and accuracy. Our aim was to explore how well the generated contexts align with the original tweets.
Methodology
We randomly selected 30 context samples from each dataset (train, dev, test) and evaluated them based
on three criteria:
• Relevance: How well did the generated context align with the original tweet? (Score: 1-5)
• Accuracy: Did the generated context provide correct information or insights? (Score: 1-5)
• Quality: Was the generated context coherent, well-structured, and easy to understand? (Score:
1-5)
• In case of ‘No external context needed.’: Was it appropriate not to generate external context for
the given tweet? (Score: 1-5)
Results
The small-scale study reveals that the generated contexts consistently achieve perfect scores in terms of
relevance (100%) and quality. Accuracy, however, is satisfactory but not outstanding, with an average
score of 3.7/5. Notably, the model correctly identified every case in which no additional
context was required, and we observed no hallucinations in the generated texts.
To delve deeper into context accuracy, we stratified the results according to the agreement rate of the six
annotators on the binary sexist classification of the tweet (only applicable to the training and development
datasets, as test dataset labels are not available).
[Table 21: Average context accuracy score by annotator agreement rate. 100% agreement: 3.0; 83%: 4.4;
66%: 4.3; 50%: 4.5.]</p>
      <p>As shown in Table 21, we observe that accuracy is less satisfactory when there is a high annotator
agreement rate for subtask 1.1. However, with lower agreement rates, accuracy tends to improve. While
this limited analysis provides an encouraging initial look at the generated contexts, we acknowledge
that more samples and evaluators are necessary to draw more robust conclusions.</p>
    </sec>
    <sec id="sec-13">
      <title>E. User Prompt Subtask 1.1 - LLM classification with context</title>
      <p>Task: Classify tweets as YES (sexist) or NO (not sexist).</p>
      <p>YES: Explicit sexism, descriptions of sexist situations, or criticism of sexism (even implied).
NO: Neutral content. Ignore non-sexist vulgarity. Use societal context.
Tweet: {tweet}</p>
      <p>Context: {context}</p>
    </sec>
    <sec id="sec-14">
      <title>F. Full Results Subtask 2.1 and Further Discussion</title>
      <p>The full set of experiments conducted for subtask 2.1 Hard is shown in Table 22. With regard to
multimodal models, it is interesting to note that an increase in model size does not necessarily lead
to improved performance. For instance, zero-shot classification using models from the Qwen 2.5 VL
family with the same prompt shows that Qwen 2.5 VL 32B outperforms the 7B release, while the 72B
version performs worse than the medium-sized model. Interestingly, despite differences in architecture
and parameter count, the results obtained from Aya Vision 8B, Mistral Small 3.1 24B, and the fine-tuned
Florence 2 model exhibit notable similarity. This convergence in performance suggests that these models
have achieved a similar level of representational capability for sexism detection in memes. Fine-tuned
Qwen 2.5 VL 7B showed an increase of +8.7% in the ICM-Hard Norm metric when compared
to zero-shot classification with the same model. This percentage increase is comparable to the value
computed for the 32B release.</p>
    </sec>
    <sec id="sec-15">
      <title>G. System Prompt Subtasks 2.1, 2.2 and 2.3</title>
      <p>"You are an Artificial Intelligence for sexism detection and classification in social media
contents."</p>
    </sec>
    <sec id="sec-16">
      <title>H. User Prompt Subtask 2.1 - Multimodal classification</title>
      <p>"Is this meme sexist towards women? Answer just YES if you judge the meme as sexist
towards women. Answer just NO if you judge the meme as not sexist towards women."</p>
    </sec>
    <sec id="sec-17">
      <title>I. User Prompt Subtask 2.1 - Text-based classification</title>
      <p>"Given the superimposed text of a meme ’[superimposed_text]’, is this meme sexist towards
women? Answer just YES if you judge the meme as sexist towards women. Answer just
NO if you judge the meme as not sexist towards women."</p>
    </sec>
    <sec id="sec-18">
      <title>J. User Prompt Subtask 2.2</title>
      <p>"Answer just DIRECT if the meme promotes or encourages sexism. Answer just
JUDGEMENTAL if the meme criticises or condemns sexist behaviour."</p>
    </sec>
    <sec id="sec-18-k">
      <title>K. User Prompt Subtask 2.3</title>
      <p>"Classify the given meme into one or more of these categories (multi-label allowed):
• IDEOLOGICAL-INEQUALITY if it rejects feminism or denies gender inequality.
• STEREOTYPING-DOMINANCE if it promotes traditional gender roles or male
superiority.
• OBJECTIFICATION if it reduces women to appearance or sexualises them.
• SEXUAL-VIOLENCE if it contains sexual harassment or assault references.
• MISOGYNY-NON-SEXUAL-VIOLENCE if it expresses hatred or non-sexual violence
toward women.</p>
      <p>The answer is strictly a list of strings, as in the following example:</p>
    </sec>
    <sec id="sec-19">
      <title>L. System Prompt for Meme Caption Generation</title>
      <p>"You are an Artificial Intelligence for meme captioning."</p>
    </sec>
    <sec id="sec-20">
      <title>M. User Prompt for Meme Caption Generation - Simple captions</title>
      <p>"Generate a caption in plain text of this meme without expressing a judgement on it. Answer
in 80 words maximum."</p>
    </sec>
    <sec id="sec-21">
      <title>N. User Prompt for Meme Caption Generation - Detailed captions</title>
      <p>"Generate a detailed caption in plain text of this meme without expressing a judgement on
it."</p>
    </sec>
    <sec id="sec-22">
      <title>O. Fine-tuning Setup for Florence-2 and Qwen 2.5 VL</title>
      <p>On Florence-2, the experiments were conducted by freezing the DaViT vision encoder and using a batch
size of 5. Training ran for 3 epochs with the AdamW optimiser, a linear learning-rate scheduler and no
warm-up steps. The model was optimised to minimise the cross-entropy loss between predicted and
target YES/NO labels, with validation performed after each epoch.
For Qwen 2.5 VL 7B and 32B, the fine-tuning strategy was different due to the larger size of the models,
in order to keep training times reasonable. We applied Low-Rank Adaptation (LoRA) to the query and
value projection layers using a rank of 8, a scaling factor of 16, and a dropout rate of 0.05. Only the
low-rank adapter weights were updated during training, resulting in a significant reduction in the
number of trainable parameters: 2,523,136 (0.0304% of the total) for the 7B model and 8,388,608
(0.0251% of the total) for the 32B model. The models were fine-tuned for 3 epochs with the image
resolution scaled up to 262,144 pixels and a batch size of 5. As with Florence-2, the loss function to
minimise was the cross-entropy loss.</p>
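      <p>The trainable-parameter counts above follow directly from the LoRA shapes: each adapted projection gains two low-rank matrices, A of size r × d_in and B of size d_out × r. The following sketch reproduces the 7B figure; the layer dimensions (hidden size 3584, 28 query heads and 4 KV heads of dimension 128, 28 decoder layers) are assumptions about the Qwen 2.5 VL 7B language backbone, not values taken from this paper.</p>
      <preformat>
```python
# Each LoRA adapter contributes rank * (d_in + d_out) parameters
# (matrix A is rank x d_in, matrix B is d_out x rank).
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

# Assumed Qwen 2.5 VL 7B language-model shapes (grouped-query attention).
hidden, head_dim, n_heads, n_kv_heads, n_layers, rank = 3584, 128, 28, 4, 28, 8

q = lora_params(hidden, n_heads * head_dim, rank)     # query projection adapter
v = lora_params(hidden, n_kv_heads * head_dim, rank)  # value projection adapter
total = n_layers * (q + v)
print(total)  # 2523136, matching the reported 7B trainable-parameter count
```
      </preformat>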
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Toluna Harris Interactive, Baromètre Sexisme, Etude 4, Haut Conseil à l'Egalité entre les Femmes et les Hommes, 2024. URL: https://www.haut-conseil-egalite.gouv.fr/IMG/pdf/rapport_toluna_harris_-_baromc_tre_sexisme_vague_4_-_2024_dgcs-hce_-_avec_note_vf.pdf.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Meta, More speech and fewer mistakes, 2025. URL: https://about.fb.com/news/2025/01/meta-more-speech-fewer-mistakes/.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Amnesty International, Les nouvelles politiques de Meta en matière de contenus risquent d'alimenter davantage de violences de masse et de génocides, 2025. URL: https://www.amnesty.org/fr/latest/news/2025/02/metas-new-content-policies-risk-fueling-more-mass-violence-and-genocide/.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. L. Gil Bermejo, C. Martos Sánchez, O. Vázquez Aguado, E. B. García-Navarro, Adolescents, ambivalent sexism and social networks, a conditioning factor in the healthcare of women, in: Healthcare, volume 9, MDPI, 2021, p. 721.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] L. Plaza, J. Carrillo-de Albornoz, I. Arcos, P. Rosso, D. Spina, E. Amigó, J. Gonzalo, R. Morante, Overview of EXIST 2025: Learning with disagreement for sexism identification and characterization in tweets, memes, and TikTok videos, in: J. Carrillo-de Albornoz, J. Gonzalo, L. Plaza, A. G. S. de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), 2025.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] L. Plaza, J. Carrillo-de Albornoz, I. Arcos, P. Rosso, D. Spina, E. Amigó, J. Gonzalo, R. Morante, Overview of EXIST 2025: Learning with disagreement for sexism identification and characterization in tweets, memes, and TikTok videos (extended overview), in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), CLEF 2025 Working Notes, 2025.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171-4186.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Chhabra, D. K. Vishwakarma, A literature survey on multimodal and multilingual automatic hate speech identification, Multimedia Systems 29 (2023) 1203-1230.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Vetagiri, P. Pakray, A. Das, A deep dive into automated sexism detection using fine-tuned deep learning and large language models, Engineering Applications of Artificial Intelligence 145 (2025) 110167.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, arXiv preprint arXiv:2111.09543 (2021).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y.-Z. Fang, L.-H. Lee, J.-D. Huang, NYCU-NLP at EXIST 2024: Leveraging transformers with diverse annotations for sexism identification in social networks, Working Notes of CLEF (2024).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al., The Llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., LoRA: Low-rank adaptation of large language models, ICLR 1 (2022) 3.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] L. M. Quan, D. V. Thin, Sexism identification in social networks with generation-based language models, in: Conference and Labs of the Evaluation Forum, 2024. URL: https://api.semanticscholar.org/CorpusID:271856112.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] F. Belbachir, T. Roustan, A. Soukane, Detecting online sexism: Integrating sentiment analysis with contextual language models, AI 5 (2024) 2852-2863.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Debnath, S. Sumukh, N. Bhakt, K. Garg, Sexist Stereotype Classification on Instagram Data, 2020. URL: https://github.com/djinn-anthrope/Sexist_Stereotype_Classification.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] H. R. Kirk, W. Yin, B. Vidgen, P. Röttger, SemEval-2023 Task 10: Explainable detection of online sexism, arXiv preprint arXiv:2303.04222 (2023).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] F. Rodríguez-Sánchez, J. Carrillo-de Albornoz, L. Plaza, J. Gonzalo, P. Rosso, M. Comet, T. Donoso, Overview of EXIST 2021: Sexism identification in social networks, Procesamiento del Lenguaje Natural 67 (2021) 195-207.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] F. Rodríguez-Sánchez, J. Carrillo-de Albornoz, L. Plaza, A. Mendieta-Aragón, G. Marco-Remón, M. Makeienko, M. Plaza, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2022: Sexism identification in social networks, Procesamiento del Lenguaje Natural 69 (2022) 229-240.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] L. Plaza, J. Carrillo-de Albornoz, R. Morante, E. Amigó, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2023: Learning with disagreement for sexism identification and characterization, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2023, pp. 316-342.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] L. Plaza, J. Carrillo-de Albornoz, E. Amigó, J. Gonzalo, R. Morante, P. Rosso, D. Spina, B. Chulvi, A. Maeso, V. Ruiz, EXIST 2024: Sexism identification in social networks and memes, in: Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part V, Springer-Verlag, Berlin, Heidelberg, 2024, pp. 498-504. URL: https://doi.org/10.1007/978-3-031-56069-9_68. doi:10.1007/978-3-031-56069-9_68.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] E. Leonardelli, A. Uma, G. Abercrombie, D. Almanea, V. Basile, T. Fornaciari, B. Plank, V. Rieser, M. Poesio, SemEval-2023 Task 11: Learning with disagreements (LeWiDi), 2023. URL: https://arxiv.org/abs/2304.14803. arXiv:2304.14803.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gasparini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saibene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lees</surname>
          </string-name>
          , J. Sorensen, SemEval
          <article-title>-2022 task 5: Multimedia automatic misogyny identification</article-title>
          , in: G. Emerson,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          , G. Stanovsky,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Palmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ratan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)</source>
          ,
          Association for Computational Linguistics
          , Seattle, United States,
          <year>2022</year>
          , pp.
          <fpage>533</fpage>
          -
          <lpage>549</lpage>
          . URL: https://aclanthology.org/2022.semeval-1.74/. doi:10.18653/v1/2022.semeval-1.74.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>MIMIC: Misogyny identification in multimodal internet content in Hindi-English code-mixed language</article-title>
          ,
          <source>ACM Trans. Asian Low-Resour. Lang. Inf. Process</source>
          . (
          <year>2024</year>
          ). URL: https://doi.org/10.1145/3656169. doi:10.1145/3656169. Just accepted.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Christensen</surname>
          </string-name>
          ,
          <article-title>Large vision-language models for knowledge-grounded data annotation of memes</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2501.13851. arXiv:2501.13851.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.-H.</given-names>
            <surname>Van</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Detecting and correcting hate speech in multimodal memes with large visual language model</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2311.06737. arXiv:2311.06737.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Watson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kearney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dale</surname>
          </string-name>
          ,
          <article-title>A review of vision-language models and their performance on the hateful memes challenge</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2305.06159. arXiv:2305.06159.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2103.00020. arXiv:2103.00020.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <article-title>Florence-2: Advancing a unified representation for a variety of vision tasks</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2311.06242. arXiv:2311.06242.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Qwen2.5-VL technical report</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.13923. arXiv:2502.13923.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <collab>Mistral AI</collab>
          ,
          <source>Mistral Small 3.1</source>
          , https://mistral.ai/news/mistral-small-3-1,
          <year>2025</year>
          . [Online; accessed 27-May-2025].
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ahmadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Venkitesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shmyhlo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Aryabumi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Beller-Morales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pekmez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ozuzu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Richemond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Locatelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Frosst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blunsom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fadaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Govindassamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gallé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ermis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Üstün</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hooker</surname>
          </string-name>
          ,
          <article-title>Aya vision: Advancing the frontier of multilingual multimodality</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.08751. arXiv:2505.08751.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>E.</given-names>
            <surname>Valavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hestness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ardalani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iansiti</surname>
          </string-name>
          ,
          <article-title>Time and the value of data</article-title>
          ,
          <source>arXiv preprint arXiv:2203.09118</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>A.</given-names>
            <surname>Menárguez-Box</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Torres-Bertomeu</surname>
          </string-name>
          ,
          <article-title>Ditana-pv at sexism identification in social networks (exist) tasks 4 and 6: The effect of translation in sexism identification</article-title>
          , in:
          <source>Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          . URL: https://api.semanticscholar.org/CorpusID:271844312.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <article-title>Concatenated transformer models based on levels of agreements for sexism detection</article-title>
          ,
          <source>Working Notes of CLEF</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Rojing-cl at exist 2024: Leveraging large language models for multimodal sexism detection in memes</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September 2024</source>
          , volume
          <volume>3740</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          , pp.
          <fpage>1080</fpage>
          -
          <lpage>1090</lpage>
          . URL: https://ceur-ws.org/Vol-3740/paper-100.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Delgado</surname>
          </string-name>
          ,
          <article-title>Evaluating extreme hierarchical multi-label classification</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Muresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Villavicencio</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          Association for Computational Linguistics
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>5809</fpage>
          -
          <lpage>5819</lpage>
          . URL: https://aclanthology.org/2022.acl-long.399/. doi:10.18653/v1/2022.acl-long.399.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          , et al.,
          <article-title>A survey on large language model based autonomous agents</article-title>
          ,
          <source>Frontiers of Computer Science</source>
          <volume>18</volume>
          (
          <year>2024</year>
          )
          <fpage>186345</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bikel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alfonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Zitouni</surname>
          </string-name>
          ,
          <article-title>Exploring dual encoder architectures for question answering</article-title>
          , in:
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kozareva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>9414</fpage>
          -
          <lpage>9419</lpage>
          . URL: https://aclanthology.org/2022.emnlp-main.640/. doi:10.18653/v1/2022.emnlp-main.640.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <article-title>Hijacking context in large multi-modal models</article-title>
          ,
          <source>arXiv preprint arXiv:2312.07553</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>The effects of in-domain corpus size on pre-training bert</article-title>
          ,
          <source>arXiv preprint arXiv:2212.07914</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>S.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Thierens</surname>
          </string-name>
          ,
          <article-title>Learning with confidence: Training better classifiers from soft labels</article-title>
          ,
          <source>arXiv preprint arXiv:2409.16071</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>B.</given-names>
            <surname>Warner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chafin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Clavié</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hallström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Taghadouini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ladhak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aarsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cooper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Poli</surname>
          </string-name>
          ,
          <article-title>Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2412.13663. arXiv:2412.13663.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Stickland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <article-title>BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning</article-title>
          , in: K. Chaudhuri, R. Salakhutdinov (Eds.),
          <source>Proceedings of the 36th International Conference on Machine Learning</source>
          , volume
          <volume>97</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR
          ,
          <year>2019</year>
          , pp.
          <fpage>5986</fpage>
          -
          <lpage>5995</lpage>
          . URL: https://proceedings.mlr.press/v97/stickland19a.html.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>An empirical study of multi-task learning on bert for biomedical text mining</article-title>
          ,
          <source>arXiv preprint arXiv:2005.02799</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>T.</given-names>
            <surname>Baltrušaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Morency</surname>
          </string-name>
          ,
          <article-title>Multimodal machine learning: A survey and taxonomy</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>41</volume>
          (
          <year>2018</year>
          )
          <fpage>423</fpage>
          -
          <lpage>443</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>J.</given-names>
            <surname>Erbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-É.</given-names>
            <surname>Portier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Egyed-Zsigmond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nurbakova</surname>
          </string-name>
          ,
          <article-title>Confusion Matrices: A Unified Theory</article-title>
          ,
          <source>IEEE Access</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          . URL: https://hal.science/hal-04820752. doi:10.1109/ACCESS.2024.3507199.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>