<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NLPDame at EXIST: Sexism Categorization in Tweets via Multi-Head Multi-Task Models, LLM &amp; RAG Voting Synergy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christina Christodoulou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>"Archimedes" Research Unit, "Athena" Research Center</institution>
          ,
          <addr-line>Maroussi, Attica</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science &amp; Engineering, University of Ioannina</institution>
          ,
          <addr-line>Ioannina</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Informatics &amp; Telecommunications, National Centre for Scientific Research (N.C.S.R.) "Demokritos"</institution>
          ,
          <addr-line>Aghia Paraskevi, Attica</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper details the author's participation as NLPDame in the fifth edition of EXIST (sEXism Identification in Social neTworks), held as a Lab at CLEF 2025. It outlines an approach for the textual component of sub-task 1.3, which pertains to the sexism categorization of tweets in English and Spanish using hard labels. The methodology includes fine-tuning 12 Transformer language models within a tailored multi-head and multi-task model architecture that employs CLS, mean, and max pooling for multi-label text classification. The multi-head architecture effectively addresses the multilingual nature of the dataset, while the multi-task architecture incorporates sentiment analysis to enhance the multi-label classification process. The methodology also involves utilizing the open-source multilingual LLM Llama-3.2-3B-Instruct, employing prompt engineering to perform classification. Additionally, a method incorporating RAG, chain-of-thought reasoning and annotators' profiles was used to provide contextual information within the LLM prompt engineering framework. Ultimately, majority voting was applied to the test submissions, combining the predictions from (i) the 12 Transformer models with LLM prompt engineering, and (ii) the 12 Transformer models with LLM prompt engineering, including chain-of-thought and annotators' profiles, along with RAG. The experimental approach consisted of data analysis, baseline experiments, a multi-step pre-processing pipeline, the application of various loss functions and thresholds, as well as the use of class positive weights to tackle class imbalance. For the hard-hard evaluation in both languages, 3 runs were submitted that ranked 4th, 5th, and 6th out of a total of 132 submissions. The highest-scoring run was based on majority voting over the predictions of the 12 Transformer models, utilizing LLM prompt engineering in conjunction with RAG.</p>
      </abstract>
      <kwd-group>
        <kwd>sexism categorization</kwd>
        <kwd>tweets</kwd>
        <kwd>sub-task 1.3</kwd>
        <kwd>hard labels</kwd>
        <kwd>multilingual multi-label text classification</kwd>
        <kwd>custom multi-head multi-task architectures</kwd>
        <kwd>LLM</kwd>
        <kwd>prompt engineering</kwd>
        <kwd>chain-of-thought</kwd>
        <kwd>RAG</kwd>
        <kwd>Transformers</kwd>
        <kwd>majority voting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The term sexism emerged during the second-wave feminist movement of the 1960s-1980s and
refers to the discrimination, prejudice, and stereotyping directed at individuals based on their sex,
predominantly affecting women and girls [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Sexism against women remains a pervasive issue in modern
society, manifesting in various forms such as workplace discrimination, wage disparity, misogyny,
stereotyping and objectification. Incidents of sexual harassment against women, occurring daily, reveal
the persistent challenges women face. Hence, the consequences of sexism are grave, adversely affecting
women’s physical health, mental health, career advancement, and overall quality of life.
      </p>
      <p>
        The rise and vast reach of social media have brought a surge in the expression of sexist behavior
online, exhibited in various forms such as gender inequality, stereotyping, objectification, sexual
violence and misogyny. This behavior has become more prevalent in online environments than
in person, largely due to the anonymity and invisibility offered by these environments. This anonymity
and invisibility foster what is known as the online disinhibition effect, whereby people are more likely to
act in ways they might not in face-to-face interactions, without considering the repercussions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Many women who have experienced or witnessed such behavior report feeling discouraged from
interacting and posting online, resulting in self-censorship and limited engagement on social media
platforms, as they encounter constant criticism, hate, and discrimination, and even feel threatened
not only for their own safety, but also for the safety of their families [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Content monitoring and
moderation on social media are therefore essential. Nevertheless, manual social media content moderation is an
arduous and time-consuming task, as well as susceptible to the moderator’s subjective judgment [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
The constant exposure to unpleasant, violent, and pornographic content during manual moderation
has a negative impact on the moderators’ mental health [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The significant challenges and severe
consequences associated with manual moderation, in conjunction with the widespread increase in
social media content, underscore the urgent necessity of developing automated systems designed to
detect and eliminate sexist content effectively.
      </p>
      <p>
        EXIST (sEXism Identification in Social neTworks) is a series of shared tasks, held since 2021, focusing on
detecting and categorizing online sexism, including both explicit and implicit expressions of sexism.
It comprises 3 tasks: Sexism Identification (binary classification), Source Intention (multi-class
classification) and Sexism Categorization (multi-label classification). This year, the fifth edition of
EXIST was held as a Lab in CLEF 2025. It consisted of these 3 tasks, which were further divided into
6 sub-tasks. The aim was to detect sexism not only across different formats, including text (tweets),
images (memes from Google Search), and videos (TikToks), but also across different languages, namely
English and Spanish [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The present paper outlines the system and results from the author’s participation as NLPDame in this
year’s sub-task 1.3 of the EXIST Lab at CLEF 2025, which focuses on Sexism Categorization in tweets
(hard labels) for both English and Spanish. The approach involved fine-tuning 12 multilingual
Transformer-based language models within a custom multi-head and multi-task architecture designed
specifically for multilingual and multi-label text classification, as well as sentiment analysis. This
approach aimed to capture linguistic nuances and enhance the accuracy of detecting categories of
sexism by integrating sentiment analysis to boost classification performance. Furthermore, the
open-source multilingual Large Language Model (LLM) Llama-3.2-3B-Instruct was leveraged with prompt
engineering, including the definitions of the sexism categories and the sentiment of the input tweet,
to perform multi-label sexism categorization. An additional method, incorporating chain-of-thought
reasoning and the annotators’ demographic background in the prompt in conjunction with
Retrieval-Augmented Generation (RAG), was implemented to provide contextual information within the LLM
prompt engineering framework. Ultimately, majority voting was applied to the test submissions,
merging predictions from (i) the 12 Transformer models using LLM prompt engineering and (ii) the 12
Transformer models using LLM prompt engineering (chain-of-thought and annotators’ profiles) along
with RAG. The experimental approach included comprehensive data analysis, baseline experiments, a
multi-step pre-processing pipeline, the application of various loss functions and thresholds, as well as
leveraging class positive weights to address class imbalance. The code for the presented approach is
available on GitHub.1</p>
      <p>The structure of this paper is as follows: Section 2 provides background information concerning
sexism detection shared tasks. Section 3 presents various aspects of the data, including data analysis
and pre-processing. Section 4 introduces an overview of the developed system, the experiments and
the submissions. Section 5 presents and discusses the results from both the development and test sets.
Finally, Section 6 concludes the paper by discussing the findings and potential future work, while
Section 7 addresses the limitations of the presented approach.
1https://github.com/christinacdl/EXIST_2025_CLEF</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        Developing automated systems using Natural Language Processing (NLP) and machine learning methods
to detect sexism in social media has gained considerable popularity as a shared task in recent years.
The EXIST task was first organized at IberLEF 2021, concentrating on identifying sexism in
English and Spanish texts. This involved both determining the presence of sexism (binary classification)
and categorizing it (multi-label classification) in posts from Twitter and Gab [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The following year,
sexism was approached in task 5 of SemEval-2022, titled Multimedia Automatic Misogyny Identification
(MAMI), for detecting multi-modal misogyny, particularly in English misogynous memes found online,
by leveraging both texts and images. It was divided into 2 sub-tasks: sub-task A, which involved
binary classification to determine whether a meme was misogynous, and sub-task B, which dealt with
the identification of various types of misogyny (multi-label classification) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Additionally, sexism
detection was featured in task 10 of SemEval-2023, named Explainable Detection of Online Sexism (EDOS),
which focused on detecting sexism in English texts from Gab and Reddit. It was divided into 3 sub-tasks: sub-task
A focused on identifying whether a post is sexist or not (binary classification). Sub-tasks B and C
aimed at multi-class classification by identifying 4 categories and 11 fine-grained vectors of sexism,
respectively [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Since its inception in 2021, the EXIST task has been held annually, concentrating exclusively on
detecting sexism in English and Spanish social media texts until 2024 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], when it expanded to
incorporate additional sub-tasks focusing on memes, and in 2025, it further extended to include videos
from TikTok [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The previous year, in the fourth edition of EXIST, 31 systems were submitted for
sub-task 3 (Sexism Categorization) under the hard-hard evaluation, which is discussed in the present paper.
28 out of the 31 systems managed to surpass the majority-class baseline (where all instances are labelled
as "NO"), while all systems outperformed the minority-class baseline (where all instances are labelled as
"SEXUAL-VIOLENCE").
      </p>
      <p>
        More particularly, the team ABCD achieved the first rank on the test set with an ICM of 0.3713,
ICM-NORM of 0.5862, and an F1 score of 0.6004. They also secured the second rank with an ICM of
0.3540, ICM-NORM of 0.5862, and an F1 score of 0.6042. The team utilized the xlm-RoBERTa small and
large models, the multilingual-T5 models, and Llama-2-7B-Instruct to attain these results. They
divided the dataset into 6 subsets based on the number of annotators, retaining identical tweets while
incorporating metadata specific to each annotator. This allowed them to fine-tune 6 component models
for each subset and develop an ensemble approach. For the task of sexism categorization, they only
considered the predictions for instances classified as sexist by the component models used in
sub-task 1, which focused on sexism identification (binary classification). Their methods included prompt
engineering and Low-Rank Adaptation (LoRA). Notably, Llama-2 achieved the first rank, while
xlm-RoBERTa-large secured the second [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The NYCU-NLP team achieved impressive results, securing the
third, fourth, and fifth positions in the hard-hard evaluation of sub-task 3. They applied extensive data
pre-processing methods, such as deleting redundant elements, standardizing text formats, increasing
data diversity through back-translation, and augmenting texts using the AEDA approach. They also
incorporated annotator demographics such as gender, age, and ethnicity into the DeBERTa-v3 and
xlm-RoBERTa models. Moreover, they adapted the Round to Closest Value approach to deal with
non-continuous annotation values and maintain precise probability distributions. Optimization of
shared layers across tasks based on hard parameter-sharing techniques was employed to enhance
generalization and computational efficiency [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Other teams, like Awakened, fine-tuned pre-trained
English-only and multilingual models, as well as domain-specific models such as
twitter-xlm-roberta-base-sentiment or roberta-hate-speech-dynabench-r4, while also leveraging ensembling
methods [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset &amp; Data Analysis</title>
        <p>
          Established in 2021, the EXIST shared task is designed to advance research in sexism detection, initially
concentrating on textual data such as tweets. The EXIST 2025 edition introduced a significant expansion
by combining multiple data sources and modalities, including TikTok videos, memes from Google
Search, and tweets, for a more holistic approach to multimedia and multimodal sexism detection [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The approach presented in this paper focuses solely on the textual component of the EXIST 2025
Dataset, with particular emphasis on sub-task 1.3, which pertains to hierarchical multi-label Sexism
Categorization in tweets. The approach employs the official hard labels (gold standard) for training and
evaluation purposes.
        </p>
        <p>The EXIST 2025 Tweet Dataset consists of tweets in both Spanish and English, resulting in a balanced
cross-lingual corpus of over 10,000 annotated tweets. Notably, adhering to the Learning with
Disagreement (LeWiDi) paradigm, each tweet is evaluated by 6 annotators from various socio-demographic
backgrounds. The background information of the annotators, including details such as gender, age,
ethnicity, education level, and country of origin, was provided along with the tweets and their labels
in JSON files. Each tweet was represented as a JSON object with the attributes shown in Figure 1.
Additionally, an evaluation folder was provided to the participants to assess their systems’ outputs,
which contained two sub-folders. One sub-folder included the official gold standards for all sub-tasks
for hard and soft evaluation contexts in both the training and development sets. The other sub-folder
contained the official baselines for each sub-task.</p>
        <p>"200006": {
  "id_EXIST": "200006",
  "lang": "en",
  "tweet": "According to a customer I have plenty of time to go spent the Stirling coins he wants to pay me with, in Derry. \"Just like any other woman, I'm sure of it.\" #EveryDaySexism in retail.",
  "number_annotators": 6,
  "annotators": ["Annotator_409", "Annotator_410", "Annotator_411", "Annotator_412", "Annotator_413", "Annotator_414"],
  "gender_annotators": ["F", "F", "M", "M", "M", "F"],
  "age_annotators": ["18-22", "23-45", "18-22", "23-45", "46+", "46+"],
  "ethnicities_annotators": ["White or Caucasian", "White or Caucasian", "White or Caucasian", "White or Caucasian", "White or Caucasian", "White or Caucasian"],
  "study_levels_annotators": ["Bachelor's degree", "Master's degree", "High school degree or equivalent", "Bachelor's degree", "Doctorate", "High school degree or equivalent"],
  "countries_annotators": ["Estonia", "Romania", "Slovenia", "Greece", "Spain", "United Kingdom"],
  "labels_task1_1": ["YES", "YES", "YES", "YES", "YES", "YES"],
  "labels_task1_2": ["REPORTED", "REPORTED", "REPORTED", "REPORTED", "REPORTED", "JUDGEMENTAL"],
  "labels_task1_3": [
    ["STEREOTYPING-DOMINANCE", "OBJECTIFICATION"],
    ["STEREOTYPING-DOMINANCE"],
    ["OBJECTIFICATION"],
    ["IDEOLOGICAL-INEQUALITY"],
    ["STEREOTYPING-DOMINANCE", "MISOGYNY-NON-SEXUAL-VIOLENCE"],
    ["OBJECTIFICATION"]
  ],
  "split": "TRAIN_EN"
}</p>
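        <p>For illustration, such an object can be parsed with Python’s standard json module. The following sketch is illustrative code, not the actual loader used in this work; it flattens the per-annotator sub-task 1.3 annotations of the example tweet into category counts:</p>

```python
import json
from collections import Counter

# Illustrative sketch (not the paper's loader): parse one tweet object in the
# EXIST 2025 JSON format shown above (abbreviated here) and count how often
# each sub-task 1.3 category was assigned across the 6 annotators.
raw = '''
{"200006": {
  "id_EXIST": "200006",
  "lang": "en",
  "labels_task1_3": [
    ["STEREOTYPING-DOMINANCE", "OBJECTIFICATION"],
    ["STEREOTYPING-DOMINANCE"],
    ["OBJECTIFICATION"],
    ["IDEOLOGICAL-INEQUALITY"],
    ["STEREOTYPING-DOMINANCE", "MISOGYNY-NON-SEXUAL-VIOLENCE"],
    ["OBJECTIFICATION"]
  ]
}}
'''

data = json.loads(raw)
tweet = data["200006"]
# Flatten the per-annotator label lists and count category frequencies.
counts = Counter(label for labels in tweet["labels_task1_3"] for label in labels)
print(counts["STEREOTYPING-DOMINANCE"])  # 3
print(counts["OBJECTIFICATION"])         # 3
```

        <p>Such per-category counts are also the basis for deriving label distributions like those summarized in Table 1.</p>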
        <p>For all experiments, the provided training and development sets with their hard labels were utilized
for training and evaluation, respectively. Table 1 presents an overview of the hard labels of the sub-task
as well as their distribution across the training and development sets, both overall and per language.
Hence, from Table 1, it can be observed that:
• The dataset is significantly imbalanced, with a skewed distribution toward the "NO" label in both
the training and development sets. This imbalance poses challenges for models, as they may be
prone to overpredicting non-sexist outputs unless appropriate class weight balancing techniques
or loss weighting strategies are implemented.
• The sexism categories "MISOGYNY-NON-SEXUAL-VIOLENCE" and "SEXUAL-VIOLENCE" are
relatively under-represented compared to more prevalent categories like
"STEREOTYPING-DOMINANCE" and "IDEOLOGICAL-INEQUALITY", especially in the development set. This
under-representation makes it more difficult to learn and evaluate nuanced patterns associated
with rarer, but concerning and significant, forms of sexist expression.
• The dataset features two languages, English and Spanish, and maintains a relatively balanced
distribution between them. Except for the "NO" category, the majority of examples in the other
categories are slightly skewed towards Spanish. This cross-linguistic aspect highlights the need
to integrate language-specific components into the model to effectively capture cultural and
linguistic variations.
• The dataset is designed for hierarchical multi-label classification, which means that each tweet
is first identified as either sexist or non-sexist. If a tweet is identified as sexist, it can then be
categorized into multiple overlapping types of sexism. This introduces considerable complexity
to the learning task compared to simpler binary or multi-class problems, as models must not
only accurately differentiate between sexist and non-sexist tweets but also identify various
combinations of intertwined sexist behaviors.</p>
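        <p>As a concrete illustration of such a loss weighting strategy, one common choice is to weight each category’s positive examples by the ratio of negatives to positives for that category (e.g., via the pos_weight argument of a binary cross-entropy loss). The sketch below uses made-up placeholder counts, not the actual Table 1 figures:</p>

```python
# Hedged sketch: per-class positive weights for a multi-label loss, computed
# as (negatives / positives) per class. The counts used here are invented
# placeholders for illustration only.
def positive_weights(label_counts, total_examples):
    """Return pos_weight_c = (N - n_c) / n_c for each class c."""
    return {
        label: (total_examples - count) / count
        for label, count in label_counts.items()
    }

counts = {"NO": 4000, "SEXUAL-VIOLENCE": 250}  # placeholder counts
weights = positive_weights(counts, total_examples=6000)
print(weights["NO"])               # 0.5  -> frequent class is down-weighted
print(weights["SEXUAL-VIOLENCE"])  # 23.0 -> rare class is up-weighted
```

        <p>This way, errors on rare categories such as "SEXUAL-VIOLENCE" contribute more to the loss than errors on the dominant "NO" class.</p>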
        <p>Overall, the inherently multilingual nature of the dataset, which encompasses both English and
Spanish tweets, was carefully considered throughout the stages of data pre-processing, model design,
training, and evaluation. This approach ensured that the system remains adaptable and fair across
different linguistic contexts. Additionally, the significant class imbalance within the data was tackled
by implementing targeted strategies during training to reduce bias and enhance the model’s sensitivity
to these nuanced yet crucial expressions of sexism. The background information of the annotators was
also incorporated into the prompt engineering process to provide context for the role of the LLM and
facilitate accurate classification. This method effectively addresses both the multilingual and imbalanced
characteristics of the dataset while taking advantage of the additional information provided, in order to develop a
robust system.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Pre-processing</title>
        <p>
          The pre-processing pipeline for both English and Spanish tweets followed a multi-step approach.
Previous research conducted by the author has demonstrated that pre-processing significantly enhances
system outcomes compared to utilizing raw texts for training, particularly for tweets, which
often include usernames, URLs, non-ASCII characters, and excessive punctuation [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The
pre-processed data were leveraged for all experiments, as described below:
1. Loading Raw Datasets: The provided JSON files for the training, development, and test sets were
read, identifier fields were standardized (e.g., id_EXIST), and the data were merged with the JSON
files containing the gold hard labels.
2. Data Cleaning: Duplicate tweets were removed from the training and development sets based on
text content. Rows with empty or null tweet entries or missing labels were also removed. No
data cleaning was applied to the test set. As can be observed from Table 2, a considerable number
of texts were removed from both sets. More specifically, 870 and 107 texts were deleted in total
from the training and development sets, respectively: 396 English and 474 Spanish tweets were discarded from
the training set, while 45 English and 62 Spanish tweets were removed from the development set.
        </p>
        <sec id="sec-3-2-7">
          <p>
            3. Emoji Conversion: Emojis were converted into their descriptive textual representations, using
language-specific mappings for English or Spanish from the Emoji library2.
4. HTML and URL Cleaning: HTML entities (e.g., &amp;amp; to and) were unescaped, and all URLs
were removed using regular expressions.
5. Contraction Expansion: Spanish contractions were expanded using a manually defined
dictionary (e.g., pa’ to para, toy to estoy). The contractions in English tweets were expanded
through the Ekphrasis library3 [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ].
6. Accent Normalization (Spanish Only): Accented characters were normalized to their base
forms (e.g., acción to accion) to ensure consistency.
7. Punctuation Pattern Substitution: Repeated punctuation patterns (multiple question marks,
multiple exclamation marks and mixed question and exclamation marks) were replaced with
standardized placeholders, "??", "!!" and "?!", respectively. Spanish-specific punctuation such as
inverted question marks (¿) or exclamation marks (¡) was removed.
8. Text Normalization with Ekphrasis library: The Ekphrasis library used the Twitter segmenter
and corrector to segment hashtags and correct common misspellings, respectively. The usernames,
emails, numbers, dates, and phone numbers were converted into &lt;user&gt;, &lt;email&gt;, &lt;number&gt;,
&lt;date&gt;, and &lt;phone&gt;, respectively, for anonymization [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. These tokens were added as special
tokens into the model tokenizers.
          </p>
          <p>
            9. Repetition Cleaning and Spacing Fixes: Excessive repetitions of characters or words were
cleaned, appropriate spaces were ensured after punctuation, and only a single &lt;user&gt; token was
maintained if multiple were present, as they were redundant.
2https://pypi.org/project/emoji/
3https://github.com/cbaziotis/ekphrasis
10. Tokenization with spaCy: Tweets were tokenized using spaCy’s language-specific models
(en_core_web_sm for English and es_core_news_sm for Spanish).
11. Hard Label Processing: Hard labels were converted into multi-hot vectors in a column indicating
the presence of each sexism category.
12. Sentiment Analysis: For the English tweets, sentiment scores were computed using the VADER
sentiment analyzer4 [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ]. Since VADER does not support Spanish, TextBlob5 was used to derive
polarity scores for the Spanish tweets. Based on these scores, each tweet was classified as Positive,
Neutral, or Negative.
13. Dataset Analysis (Optional): Dataset-level statistics were analyzed, including text length
distributions, language distributions, and label distributions. The visualization plots for sentiment
and label distributions were generated and saved as PNG files.
14. Data Merging (Optional): If required, the training and development sets were combined into a
large training set for training without evaluation.
15. Saving Final Outputs: The pre-processed sets were exported as CSV files to be used for training,
evaluation and prediction.
          </p>
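          <p>A few of the purely rule-based steps above (URL cleaning, punctuation pattern substitution, repetition cleaning) can be sketched with regular expressions as follows. This is a simplified illustration, not the author’s exact pipeline; the emoji, Ekphrasis, spaCy and sentiment steps would plug in around it:</p>

```python
import re

# Illustrative simplification of pipeline steps 4, 7 and 9: URL removal,
# punctuation pattern standardization, and repetition/spacing cleaning.
def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)       # step 4: remove URLs
    text = re.sub(r"\?{2,}", "??", text)           # step 7: repeated ? -> "??"
    text = re.sub(r"!{2,}", "!!", text)            # step 7: repeated ! -> "!!"
    text = re.sub(r"[?!]*\?![?!]*", "?!", text)    # step 7: mixed ?/! -> "?!"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # step 9: chars repeated 3+ times -> 2
    # step 9: collapse consecutive <user> tokens into a single one
    text = re.sub(r"(<user>\s*){2,}", "<user> ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Sooooo rude!!! <user> <user> see https://t.co/abc ???"))
# Soo rude!! <user> see ??
```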
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. System Overview</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>
          The multilingual nature, as well as the source of the data, led to the decision to leverage open-source
Transformer-based multilingual models from Hugging Face. These models are considered effective in
capturing language-specific nuances and contexts, yielding more accurate and reliable
results, as they have been pre-trained on various languages. More specifically, the large version of
XLM-RoBERTa from Facebook6 [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], the multilingual version of DeBERTa from Microsoft7 [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], the
multilingual XLM-R, which was re-trained on over 1 billion tweets in various languages posted until
December 2022 by the Cardiff NLP group at Cardiff University8 [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] as well as the multilingual
version of BERT from Google9 [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] were employed for the baseline experiments. To obtain baseline scores,
they were fine-tuned for multi-label text classification using AutoModelForSequenceClassification.
        </p>
        <p>The initial round of baseline experiments was conducted using only the
twitter-xlm-roberta-large-2022 model with various loss functions, in order to find the best option for this multi-label task and to assess
the need for positive weights (see Section 4.5). This model was chosen because of its relevant training
data and multilingual capabilities. The second round of baseline experiments included fine-tuning
the 4 aforementioned models on both languages together and on each language separately, with the best loss
function revealed by the first round of baseline experiments (Distribution-Balanced Loss). All baseline
experiments were carried out using the Transformers and Hugging Face libraries, in conjunction with
1 NVIDIA TITAN RTX GPU card with 24GB VRAM. Table 7 in Appendix A demonstrates that
twitter-xlm-roberta-large-2022 attains the highest scores across all metrics in both languages together and in
Spanish separately. Although xlm-roberta-large achieved a higher ICM score in the English-only
subset, Spanish remains the predominant language in the training and development data. Consequently,
the twitter-xlm-roberta-large-2022 model was chosen as the best foundation for conducting further
experiments in both languages.
4https://github.com/cjhutto/vaderSentiment
5https://github.com/sloria/TextBlob
6https://huggingface.co/FacebookAI/xlm-roberta-large
7https://huggingface.co/microsoft/mdeberta-v3-base
8https://huggingface.co/cardiffnlp/twitter-xlm-roberta-large-2022
9https://huggingface.co/google-bert/bert-base-multilingual-uncased</p>
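        <p>For context, hard multi-label predictions in such a setup are typically obtained by passing the per-category logits through a sigmoid and applying a decision threshold. The sketch below is illustrative; the threshold value, category order and fallback behavior are assumptions for this example, not the tuned settings of this work:</p>

```python
import math

# Hedged sketch: turning raw multi-label logits into hard label predictions
# via a sigmoid and a decision threshold. Category order and the 0.5
# threshold are illustrative assumptions.
CATEGORIES = ["IDEOLOGICAL-INEQUALITY", "STEREOTYPING-DOMINANCE",
              "OBJECTIFICATION", "SEXUAL-VIOLENCE",
              "MISOGYNY-NON-SEXUAL-VIOLENCE"]

def predict_labels(logits, threshold=0.5):
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    chosen = [c for c, p in zip(CATEGORIES, probs) if p >= threshold]
    # If no category clears the threshold, treat the tweet as non-sexist.
    return chosen if chosen else ["NO"]

print(predict_labels([2.1, -0.3, 0.4, -1.7, -2.5]))
# ['IDEOLOGICAL-INEQUALITY', 'OBJECTIFICATION']
```

        <p>Tuning the threshold per class is one of the knobs explored alongside the loss functions mentioned above.</p>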
        <p>
          Inspired by the multilingual multi-head model architecture for multi-label text classification developed
by the Hierocles of Alexandria team during Touché at CLEF 2024 [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], a more advanced version of
the model was created. This new architecture, which is presented in detail in Section
4.2, integrates multi-tasking capabilities by incorporating sentiment analysis in addition to sexism
categorization. This improvement aims not only to tackle the linguistic challenges inherent in each
language, but also to enhance the model’s ability to understand, detect and categorize sexism by adding
the sentiment expressed in each text as additional information.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Multi-head Multi-task Model Architecture</title>
        <p>The developed architecture is depicted in Figure 2 and was designed for multi-task and multilingual
classification, targeting both multi-label sexism categorization and multi-class sentiment analysis
(positive, neutral, negative) in English and Spanish tweets. At its foundation, the model leverages
a pre-trained XLM-RoBERTa encoder to obtain deep multilingual representations. On top of the
shared backbone, the architecture integrates language-specific classification heads for both tasks to
address each language’s linguistic challenges. More particularly, it includes 2 sexism classification heads
(one for English and one for Spanish) and 2 sentiment classification heads (one for English and one
for Spanish), allowing the model to learn language-adapted representations while benefiting from
shared multilingual knowledge. This multi-head architecture allows for cross-lingual flexibility while
maintaining specialization where needed.</p>
        <p>Each classification head was implemented as a custom Transformer stack, optionally consisting
of 1 to 4 Transformer layers, depending on the depth used in the ablation experiments. These Transformer
layers replicate components of the base architecture and are composed of:
• A self-attention mechanism for contextual token interactions.
• Layer normalization to stabilize and accelerate training.
• Feed-forward sub-layers introducing non-linearity and complexity.
• Residual connections for mitigating the vanishing gradient problem and allowing deeper networks.
• Dropout for preventing overfitting.</p>
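<p>The components listed above can be sketched as a single layer in code. The following is a minimal PyTorch sketch under stated assumptions: the layer sizes, names, and GELU activation are illustrative, not the author’s exact implementation.</p>

```python
import torch
import torch.nn as nn

class HeadTransformerLayer(nn.Module):
    """Illustrative sketch of one layer in a classification head's Transformer stack."""
    def __init__(self, hidden_size=1024, num_heads=8, ffn_size=4096, dropout=0.1):
        super().__init__()
        # Self-attention mechanism for contextual token interactions
        self.attn = nn.MultiheadAttention(hidden_size, num_heads,
                                          dropout=dropout, batch_first=True)
        # Layer normalization to stabilize and accelerate training
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        # Feed-forward sub-layer introducing non-linearity
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_size),
            nn.GELU(),
            nn.Linear(ffn_size, hidden_size),
        )
        # Dropout for preventing overfitting
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Residual connection around self-attention
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Residual connection around the feed-forward sub-layer
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```

<p>Stacking 1 to 4 such layers per head reproduces the depth variation explored in the ablation experiments.</p>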
        <p>
          Following processing by the classification heads, sentence-level representations can be obtained using
one of three available pooling methods. The option to select from these pooling methods was inspired
by the author’s previously proposed system for detecting signs of depression in social media texts,
which was presented in Task 4 of LT-EDI@RANLP 2023 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The pooling methods are the following:
• CLS pooling: returns the representation of the first token in the sequence (i.e., the [CLS] token).
• Mean pooling: returns the mean value of token embeddings, weighted by the attention mask.
• Max pooling: returns the maximum value across the token dimensions while using the attention
mask.
        </p>
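<p>The three pooling methods can be sketched as follows (a minimal PyTorch sketch; tensor names are illustrative, and mask denotes the attention mask with 1 for real tokens and 0 for padding):</p>

```python
import torch

def cls_pooling(hidden, mask):
    # Representation of the first ([CLS]) token in the sequence
    return hidden[:, 0]

def mean_pooling(hidden, mask):
    # Mean of token embeddings, weighted by the attention mask
    m = mask.unsqueeze(-1).float()
    return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)

def max_pooling(hidden, mask):
    # Maximum across the token dimension, ignoring padded positions
    masked = hidden.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
    return masked.max(dim=1).values
```
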
        <p>For sexism classification, each language-specific head receives the pooled sequence representation
from that language and the logits from the sentiment head corresponding to that language. This creates
a task-aware input and enables support for auxiliary learning. The additional sentiment information
is subsequently concatenated to the pooled representation before classification, providing contextual
information related to the sentiment of the tweet, which may assist in identifying overly subtle sexist
expressions. The proposed multi-task training process is as follows:
1. The input batch is passed through the shared XLM-RoBERTa encoder to generate contextualized
token embeddings.
2. Based on language identifiers in the batch (en, es), the model dynamically routes each instance to
the language-specific sentiment and sexism heads.
3. The sentiment heads for each language first generate sentiment logits, which are then used as
auxiliary features for the corresponding sexism heads.
4. The sexism heads for each language produce sexism logits using the sentiment logits as auxiliary
contextual input.
5. Both the sexism logits (multi-label) and sentiment logits (multi-class) are aggregated across the
batch and passed as input to the loss function.
6. A single, joint loss function is used to optimize the model. The joint loss function enables the 2
tasks to learn jointly and leverage both shared representations and task-specific knowledge.</p>
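<p>Steps 1 to 6 above can be sketched with toy stand-ins for the encoder and heads; all module names and sizes are illustrative assumptions, not the author’s implementation:</p>

```python
import torch
import torch.nn as nn

class MultiHeadMultiTaskModel(nn.Module):
    """Illustrative sketch of language routing with sentiment logits as auxiliary input."""
    def __init__(self, hidden=32, n_sexism=6, n_sentiment=3):
        super().__init__()
        self.encoder = nn.Linear(hidden, hidden)  # toy stand-in for XLM-RoBERTa
        self.sentiment_heads = nn.ModuleDict(
            {lang: nn.Linear(hidden, n_sentiment) for lang in ("en", "es")})
        # Sexism heads consume the pooled representation concatenated with
        # the sentiment logits as auxiliary contextual input
        self.sexism_heads = nn.ModuleDict(
            {lang: nn.Linear(hidden + n_sentiment, n_sexism) for lang in ("en", "es")})

    def forward(self, pooled, langs):
        sexism_logits, sentiment_logits = [], []
        # 2. Route each instance by its language identifier (en, es)
        for i, lang in enumerate(langs):
            rep = self.encoder(pooled[i : i + 1])
            # 3. The sentiment head first generates sentiment logits
            sent = self.sentiment_heads[lang](rep)
            # 4. The sexism head uses them as auxiliary contextual input
            sex = self.sexism_heads[lang](torch.cat([rep, sent], dim=-1))
            sentiment_logits.append(sent)
            sexism_logits.append(sex)
        # 5. Aggregate both sets of logits across the batch for the loss
        return torch.cat(sexism_logits), torch.cat(sentiment_logits)
```

<p>A joint loss would then combine, for example, a multi-label BCE term over the sexism logits with a cross-entropy term over the sentiment logits.</p>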
        <p>This architecture is particularly effective in the cross-lingual setting, as it allows the model to
share knowledge via the base encoder while maintaining per-language specialization in the task
heads. Furthermore, the integration of flexible pooling methods and adjustable head depth enables
experimentation and comparison to find the optimal configuration(s) for this task. Notably, adding
sentiment predictions as contextual auxiliary inputs to the sexism classification heads facilitates the
model’s ability to differentiate between positive, neutral and negative content, particularly in ambiguous
cases where linguistic cues alone may be insufficient.</p>
        <p>The best-performing model from the baseline experiments (See section 4.1),
twitter-xlm-roberta-large-2022, was leveraged as the foundation for this advanced model architecture. 3 multi-head, multi-task
models were developed, utilizing each of the 3 pooling methods, and starting with 1 Transformer
layer in the classification head for each language and task. To assess performance improvements,
additional Transformer layers were incorporated into the classification heads, resulting in a more
complex architecture. Consequently, models comprising 2, 3, and 4 layers were created. In total, 12
models were trained using the provided training set and evaluated with the development set. Their
results are illustrated in Table 3. The development and test predictions of these 12 models will be
later leveraged for majority ensemble learning (See section 4.7), along with the predictions of (1) an
LLM using prompt engineering, (2) an LLM using prompt engineering plus RAG (See section 4.3).
Additionally, these 12 models will be trained on the entire provided dataset, training and development
combined, with no validation during training. Their test predictions will be combined for another
majority ensemble learning. All experiments with this model architecture were also conducted using the
Transformers and Hugging Face libraries, alongside 1 NVIDIA TITAN RTX GPU card, which features
24GB of VRAM. The hyperparameters used for both the baselines and the multi-head, multi-task models
can be found in Table 8 in Appendix A.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. LLM Prompt Engineering &amp; RAG</title>
        <p>
          In a system previously proposed by the author for identifying hate speech, as well as the targets and
stance of hate speech, within the Climate Activism Stance and Hate Event Detection Shared Task at CASE 2024, an
LLM was fine-tuned using Parameter Efficient Fine-Tuning (PEFT), specifically employing Low-Rank
Adaptation (LoRA) and prompt tuning. The results demonstrated that using prompting for classification
yielded superior results compared to LoRA [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. For this reason, the state-of-the-art, open-source
Llama-3.2-3B-Instruct developed by Meta [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], which supports English and Spanish, was leveraged in
2 Python scripts to classify tweets into hierarchical sexism categories and to generate predictions for
both the development and test sets. This approach allows for a comparison of the performance of an
LLM with that of multi-head, multi-task Transformer models.
https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
        </p>
        <p>[Figure 2: Architecture diagram. A shared pre-trained XLM-RoBERTa encoder feeds language-specific sentiment heads (EN/ES, 1-4 Transformer layers, CLS/mean/max pooling); their sentiment logits serve as auxiliary inputs to the corresponding EN/ES sexism heads, which produce the final output logits.]</p>
        <p>The first script used the LLM with prompt engineering. The prompt incorporated a pre-processed
tweet in either English or Spanish, along with its sentiment derived during the pre-processing phase.
It also included the sexism categories and their definitions as provided by the EXIST organizers. The
instructions specified that the output should be a JSON containing all applicable labels; if no sexism
categories applied, the script was to return "NO". Additionally, the sentiment was to inform the
decision-making process, and no extra explanation or commentary was to be included in the output. The prompt
is shown in Figure 3 in Appendix A. After experimentation with the prompt, the best results achieved
an ICM = -1.4709, ICM-NORM = 0.1723, and F1 = 0.4147 on the development set in both languages.</p>
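<p>Parsing the instructed JSON output (a list of applicable labels, or ["NO"] when no category applies) can be sketched as follows; the fallback behavior on malformed output is an assumption for illustration:</p>

```python
import json

VALID = {"NO", "IDEOLOGICAL-INEQUALITY", "STEREOTYPING-DOMINANCE",
         "OBJECTIFICATION", "SEXUAL-VIOLENCE", "MISOGYNY-NON-SEXUAL-VIOLENCE"}

def parse_llm_labels(raw):
    # Keep only labels from the predefined category list; if the LLM output
    # is not valid JSON or contains no valid label, fall back to ["NO"]
    try:
        labels = [l for l in json.loads(raw) if l in VALID]
    except (json.JSONDecodeError, TypeError):
        labels = []
    return labels or ["NO"]
```
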
        <p>
          The second script combined LLM prompt engineering and Retrieval Augmented Generation (RAG).
RAG was employed as it provides external knowledge from the most relevant texts and minimizes
LLM hallucinations. It was used to provide context to assist the LLM in performing classification more
effectively [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. The prompt incorporated a pre-processed tweet in English or Spanish, along with its
sentiment derived during the pre-processing phase, the sexism categories and their definitions, a
chain-of-thought reasoning process, as well as a context block retrieved from the 3 most topically similar labeled
tweets, fetched through similarity search using ChromaDB and multilingual Sentence-Transformers
embeddings. These tweets were sourced from the training set when generating predictions for the
development set and from the combined training and development set when generating predictions
for the test set. The prompt also contained an annotator profile for the input tweet, synthesized from
demographic metadata provided in the EXIST dataset formatted as: "female/male, {age} years old,
{ethnicity}, {education}, living in {country}". The inclusion of annotator profiles was inspired by the
ABCD team’s approach from the previous year [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The instructions directed the model to adopt
the perspective of the provided annotator profile and to think step-by-step, evaluate multiple sexism
categories independently, incorporate sentiment and contextual examples into its decision-making,
and return only the final JSON-formatted label list without any extra explanation or commentary.
The prompt is shown in Figure 4 in Appendix A. The chain-of-thought reasoning was added to this
prompt after several iterations, as it was found to improve classification performance. After
experimentation with the prompt, the best results achieved an ICM = -0.9604, ICM-NORM = 0.2860,
and F1 = 0.4488 on the development set in both languages. These results demonstrate that prompt
engineering, when combined with chain-of-thought reasoning and RAG, can significantly enhance
LLM performance in specialized multi-label classification. This approach enables the LLM to act as an
annotator based on a specific profile, grounding its predictions in sentiment and relevant human-labeled
examples as contextual support, rather than relying solely on the isolated tweet.
        </p>
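<p>The retrieval and prompt-building step can be sketched as follows. The paper uses ChromaDB with multilingual Sentence-Transformers embeddings; this minimal pure-Python stand-in uses cosine similarity over precomputed embeddings, and the prompt template and helper names are paraphrased assumptions:</p>

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_vec, corpus, k=3):
    # corpus: list of (embedding, labeled_tweet) pairs from the training data,
    # standing in for the ChromaDB similarity search
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(tweet, sentiment, profile, examples):
    # Paraphrased prompt assembly: annotator profile, sentiment,
    # retrieved context block, and chain-of-thought instructions
    context = "\n".join(f"- {ex}" for ex in examples)
    return (
        f"You are annotating as: {profile}.\n"
        f"Sentiment: {sentiment}\n"
        f"Similar labeled tweets:\n{context}\n"
        "Think step-by-step, evaluate each sexism category independently, "
        "and return only a JSON list of applicable labels.\n"
        f"Tweet: {tweet}"
    )
```
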
        <p>By comparing the performance of the LLM (from both scripts) with the multi-head multi-task
Transformer models across both languages, it became evident that the LLM scored significantly lower.
Therefore, its predictions were included in the majority vote ensemble rather than being submitted
individually. Notably, the inclusion of the LLM in the majority vote ensemble resulted in higher scores
on the development set compared to using only the multi-head, multi-task Transformer models. The
experiments with both scripts were conducted using the Hugging Face and bitsandbytes libraries for
4-bit quantization and the LangChain library, alongside 2 NVIDIA TITAN RTX GPU cards, each
featuring 24GB of VRAM.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Evaluation &amp; Metrics</title>
        <p>The hard-hard evaluation protocol is used for systems producing discrete, non-probabilistic outputs.
The ground truth annotations produced by multiple human annotators are converted to hard labels
using probabilistic thresholds defined for each sub-task. For sub-task 1.3 in this paper, a label is assigned
to an instance only if more than one annotator has selected it. This thresholding approach captures
only the labels with at least some level of inter-annotator agreement, thus providing a conservative yet
reliable approximation of the ground truth.</p>
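<p>Under the stated rule for sub-task 1.3, the conversion of annotator votes to hard labels can be sketched as follows (the fallback to "NO" when no label reaches two votes is an assumption for illustration):</p>

```python
from collections import Counter

def to_hard_labels(annotations, min_votes=2):
    # annotations: one list of selected labels per annotator.
    # A label is kept only if more than one annotator selected it.
    votes = Counter(label for ann in annotations for label in ann)
    hard = sorted(label for label, n in votes.items()
                  if n >= min_votes and label != "NO")
    return hard if hard else ["NO"]
```
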
        <p>
          The official metric of sub-task 1.3 used in the hard-hard evaluation framework is the Information
Contrast Model (ICM) [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. This metric is appropriately tailored towards hierarchical classification
problems, as it penalizes classification errors based on the semantic distance between predicted and true
labels, thereby taking into account the relationships among classes. In addition to ICM, the evaluation
framework also reports the normalized version known as ICM-NORM, which enhances comparability
across tasks, as well as the F-measure (F1) implemented in the PyEvALL evaluation library. Although
the F1 score (the harmonic mean of precision and recall) is reported for reference, it is not considered
ideal for this context due to its limitation in capturing class relationships, as it treats errors between
conceptually distant classes (e.g., predicting "NO" instead of a specific sexist category) equivalently to
errors between similar positive classes (e.g., confusing two sexist categories), despite the former being
far more severe.
        </p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Loss Functions &amp; Class Weights</title>
        <p>
          Several loss functions were evaluated during the baseline experiments, including Binary Cross-Entropy
Loss with Logits, Focal Loss [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], Class-balanced Loss [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], Distribution-Balanced Loss [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], and
Class-balanced Negative Tolerant Regularization Loss [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. These were implemented by modifying the
Trainer class from Hugging Face to identify the most effective loss function for the task. These loss
functions were specifically chosen for their effectiveness in addressing data imbalance challenges in
multi-label contexts. They have previously been employed for the Human Value Detection shared task
by the PAI team at SemEval-2023 [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ], as well as by the Hierocles of Alexandria team during Touché at
CLEF 2024 [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ].
        </p>
        <p>Positive weights were calculated for each class and applied exclusively in experiments using the
Binary Cross-Entropy Loss with Logits to enhance model performance in under-represented areas.
Baseline experiments using the twitter-xlm-roberta-large-2022 model with the standard classification
head (AutoModelForSequenceClassification) indicated that the Distribution-Balanced Loss yielded the
most favorable results in terms of ICM, ICM-NORM, and F1 scores, as detailed in Table 6 in Appendix A.
https://huggingface.co/docs/bitsandbytes/main/index
https://github.com/langchain-ai/langchain
https://github.com/UNEDLENAR/PyEvALL
https://pytorch.org/docs/stable/generated/torch.nn.functional.binary_cross_entropy_with_logits.html</p>
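<p>A minimal sketch of how such per-class positive weights might be computed, assuming the common negatives-over-positives convention (the exact formula is not specified in the paper):</p>

```python
def positive_weights(label_matrix):
    # label_matrix: list of binary label vectors, one per training instance.
    # Weight per class = number of negatives / number of positives, so
    # under-represented labels contribute more to the loss.
    n = len(label_matrix)
    n_classes = len(label_matrix[0])
    weights = []
    for c in range(n_classes):
        pos = sum(row[c] for row in label_matrix)
        neg = n - pos
        weights.append(neg / pos if pos else 1.0)
    return weights
```

<p>The resulting vector would typically be passed as the pos_weight argument of PyTorch’s BCEWithLogitsLoss.</p>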
        <p>
          Distribution-Balanced Loss [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] addresses label imbalance and co-occurrence in multi-label
classification. It employs instance-level reweighting to adjust the contribution of each label based on its
inverse frequency in the dataset, while considering the number of positive labels in each instance. This
approach helps mitigate overfitting to dominant classes and retains the learning signal for rare labels.
Additionally, it applies a negative-tolerant regularization term to prevent overconfidence with negative
labels in cases where most labels are negative, avoiding suppression of rare labels. Consequently, this
loss function was selected for all submitted Transformer models without applying class positive weights.
        </p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Thresholds</title>
        <p>
          All Transformer experiments, in both baselines and multi-head multi-task models, were carried out
using various thresholds between 0.1 and 0.95. During fine-tuning, both the ICM and F1 scores were
incorporated into the compute_metrics function, based on the PyEvALL evaluation package in Python, and
were constantly monitored. A general ICM and F1 score for all classes, as well as ICM and F1 scores per
class were calculated in the compute_metrics function. For predictions post-training, the best thresholds
for each class were saved in a JSON file in order to calculate metrics and create predictions for the
development set. After applying the sigmoid function, predictions were converted to 1 if they met
or exceeded the threshold and to 0 if they fell below it. Consequently, 3 distinct prediction files were
generated: one based on the default threshold of 0.5, another using the best general threshold across all
classes, and a third employing the best threshold for each class. The development set predictions created
using the optimal class-specific thresholds achieved the highest scores across all metrics. Applying
the best threshold per class was also verified during last year’s winning participation in sub-task 1 of
Touché at CLEF 2024 [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Thus, all predictions submitted for the test set were generated based on the
best threshold for each class.
        </p>
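<p>The per-class threshold search can be sketched as follows; plain F1 is used here as a stand-in for the PyEvALL metrics, and the function names and grid step are illustrative assumptions:</p>

```python
def f1_score(y_true, y_pred):
    # Plain binary F1 as a stand-in for the PyEvALL metrics
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_thresholds(probs, labels, n_classes, grid=None):
    # Search thresholds in [0.1, 0.95] independently for each class
    # on the development set, keeping the best-scoring one per class
    grid = grid or [x / 100 for x in range(10, 96, 5)]
    best = {}
    for c in range(n_classes):
        y_true = [row[c] for row in labels]
        scored = [(f1_score(y_true, [p[c] >= t for p in probs]), t) for t in grid]
        best[c] = max(scored)[1]
    return best
```

<p>The per-class thresholds would then be saved to a JSON file and reused to binarize the sigmoid outputs at prediction time.</p>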
      </sec>
      <sec id="sec-4-7">
        <title>4.7. Majority Ensemble Learning</title>
        <p>A majority voting ensemble strategy of multiple multi-head multi-task Transformer-based models with
diferent pooling methods and number of layers was employed to enhance the robustness and
generalizability of the final development and test predictions. The idea was that the individual strengths and
weaknesses of these models would complement one another when addressing the sexism categorization
multi-label hierarchical task. The ensemble consisted of predictions from 12 runs of the
twitter-xlm-roberta-large-2022 multi-head multi-task model, each defined by a specified pooling method (CLS,
max, mean) and a classification head layer count ranging from 1 to 4. The ensemble also incorporated
predictions from Llama-3.2-3B-Instruct in 2 variations: (i) prompt-only using category definitions,
and (ii) prompt with RAG utilizing both category definitions and examples from the training set as
contextual guidance.</p>
        <p>The majority voting mechanism operated on both an instance and a label level. If a label was predicted
in 7 out of 13 predictions, it was included in the final output. This threshold of 7 out of 13 was established
as a simple majority, meaning that a label had to be predicted by more than half of the models to be
included in the final results. This was a fair compromise between precision and recall. The "NO"
label was included if no sub-category reached this threshold; in such instances, "NO" was enforced
to maintain the hierarchical nature of the task. Overall, the ensemble process effectively leveraged
the complementary strengths of various Transformer model predictions and LLM-based predictions,
resulting in more balanced and robust classification outputs.</p>
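<p>The voting rule described above can be sketched as follows (function and variable names are illustrative assumptions):</p>

```python
from collections import Counter

def majority_vote(prediction_sets, threshold=7):
    # prediction_sets: one label list per model (13 in total: 12 Transformer
    # models plus one LLM variant). A label enters the final output if at
    # least `threshold` prediction sets contain it; if no sub-category
    # reaches the threshold, "NO" is enforced.
    votes = Counter(label for preds in prediction_sets for label in set(preds))
    final = sorted(l for l, n in votes.items() if n >= threshold and l != "NO")
    return final if final else ["NO"]
```
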
        <p>
          Results produced from the provided development set, evaluated via the PyEvALL framework, indicated
that the ensemble of the 12 Transformer models along with the prompt-only Llama achieved the highest
performance metrics: ICM = 0.5352, ICM-NORM = 0.6192, and F1 = 0.6858. When the prompt-only
Llama-3.2-3B-Instruct was substituted with the combined prompt and RAG-based version, the
performance metrics were slightly lower but still comparable: ICM = 0.5325, ICM-NORM = 0.6186, and
F1 = 0.6852. These findings highlight the effectiveness of the majority voting strategy in aggregating
predictions from diverse models rather than relying on individual ones, while ensuring high-quality
classifications from both Transformer-based models and the LLM-based approach. The majority voting
initiative was inspired by the author’s previous work on sexism detection at Task 10 of SemEval-2023 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ],
where aggregating model predictions was shown to increase performance.
        </p>
        <p>The 3 runs submitted for the test set predictions, hard-hard evaluation for sub-task 1.3 are as follows:
• Run 1: Majority vote ensemble of 12 multi-head multi-task models trained on the provided
training and development sets combined, CLS, mean and max pooling, 1-4 Transformer layers
per classification head, along with the prompt-only Llama-3.2-3B-Instruct.
• Run 2: Majority vote ensemble of 12 multi-head multi-task models trained on the provided
training set and evaluated on the development set (See Table 3), CLS, mean and max pooling, 1-4
Transformer layers per classification head, along with the prompt-only Llama-3.2-3B-Instruct.
• Run 3: Majority vote ensemble of 12 multi-head multi-task models trained on the provided
training and development sets combined, CLS, mean and max pooling, 1-4 Transformer layers
per classification head, along with the combined prompt and RAG-based version of
Llama-3.2-3B-Instruct with the training and development sets combined as RAG documents.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results &amp; Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Development Set</title>
        <p>From Table 3, it is evident that the twitter-xlm-roberta-large-2022 utilizing max pooling achieved
the highest scores in the sentiment and sexism classification heads with 1 and 4 Transformer layers.
Conversely, the model employing mean pooling attained its peak performance with 2 Transformer
layers. The model applying CLS pooling outperformed others in the classification heads with 3 stacked
Transformer layers. Notably, adding 2 and 4 layers to the heads resulted in a decline in performance,
whereas 3 layers yielded lower yet competitive scores compared to 1 layer per head. Despite the variations
in scores, all of these models were incorporated in the majority ensemble.</p>
        <p>According to Table 4, which illustrates the F1 scores for each class of the multi-head multi-task
cardiffnlp/twitter-xlm-roberta-large-2022 models based on various pooling methods (CLS, mean, max)
and configurations with 1 to 4 Transformer layers per classification head, the "NO" class consistently
achieves the highest F1 scores, ranging from 0.86 to 0.87, across all model configurations. This reflects
the model’s ability to successfully detect non-sexist content, primarily due to the dominance of the "NO"
label in the class distribution. Regarding the sexist categories, the "STEREOTYPING-DOMINANCE"
and "IDEOLOGICAL-INEQUALITY" classes achieve moderately high F1 scores between approximately
0.63 and 0.71, showing that the models are effective in capturing concepts of inequality between
men and women and identifying mentions of stereotypical roles assigned to women, particularly
when utilizing CLS and max pooling for "STEREOTYPING-DOMINANCE" and mean pooling for
"IDEOLOGICAL-INEQUALITY", regardless of the number of layers per classification head. The category
"OBJECTIFICATION" falls within the moderate range, achieving scores between 0.59 and 0.63. This
suggests a balanced challenge arising from the need to navigate both explicit objectifying language
and more subtle instances of objectification towards women. Nevertheless, the rest of the sexism
categories mainly tend to achieve lower scores since they are the least represented in the dataset.
The "SEXUAL-VIOLENCE" class attains a lower F1 performance, ranging from 0.60 to 0.65, showing
the difficulty of the models in capturing sexual violence or sexual harassment indicators. Finally, the
"MISOGYNY-NON-SEXUAL-VIOLENCE" class consistently produces the lowest F1 scores across all
model configurations, mostly between 0.52 and 0.56, suggesting that the models struggle to detect
hatred or non-sexual violence expressions towards women.</p>
        <p>[Table 3: ICM, ICM-NORM, and F1 results of the multi-head multi-task models with cardiffnlp/twitter-xlm-roberta-large-2022 as foundation, trained with Distribution-Balanced Loss in both languages using the 3 pooling methods (CLS, mean and max pooling). The models were trained on the provided training set and evaluated on the provided development set; results are grouped by the number of Transformer layers (1-4) per classification head and were computed using the PyEvALL evaluation library.]</p>
        <p>[Table 4: Per-class F1 scores (NO, IDEOLOGICAL-INEQUALITY, STEREOTYPING-DOMINANCE, OBJECTIFICATION, SEXUAL-VIOLENCE, MISOGYNY-NON-SEXUAL-VIOLENCE) for the same model configurations, computed using the PyEvALL evaluation library.]</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Test Set</title>
        <p>
          Table 5 illustrates that all submitted runs (See section 4.7 for more details) outperformed the
two baselines in both languages, as well as individually for each language. The baselines include the
EXIST2025-test_majority-class, which categorizes all instances as the majority class, and the
EXIST2025-test_minority-class, which categorizes all instances as the minority class. It was shown that training
the models with a larger dataset by combining both the training and development sets (as in Runs 1
and 3) resulted in improved performance. In contrast, Run 2, which relied solely on the training set
for training, achieved the lowest rank among the three runs across both languages and within each
language separately. In both languages (denoted as ALL), Run 3 achieved the 4th rank with ICM =
0.4842, ICM-NORM = 0.6124, and F1 = 0.6335, exceeding the submission scores of the ABCD team
from last year’s EXIST competition, which secured the first rank [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. This performance highlights that
the inclusion of chain-of-thought reasoning, along with the annotators’ profiles in the LLM prompt and
the use of RAG to provide relevant contextual information from tweets, contributes to improved results
by reducing hallucinations. Run 3 maintained the same rank when evaluated solely in Spanish, possibly
due to the over-representation of this language in the dataset. However, it ranked slightly lower than
Run 1, which secured the 6th rank in English alone, with ICM = 0.3928, ICM-NORM = 0.5963, and F1 =
0.6108. Run 1 achieved the 5th rank across both languages as well as in each language separately and
recorded the highest score among the submitted runs for English only, with ICM = 0.3932, ICM-NORM
= 0.5964, and F1 = 0.6118. This indicates that incorporating category definitions and considering the
sentiment of the input tweets can significantly enhance performance, particularly when the tweets are
exclusively in English.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion &amp; Future Work</title>
      <p>The system developed for the EXIST Lab at CLEF 2025 participated in sub-task 1.3, focusing on tackling
sexism categorization in the tweet partition in hard evaluation through hierarchical multi-label text
classification. The submitted system involved fine-tuning the multilingual
twitter-xlm-roberta-large-2022 Transformer model 12 times, utilizing a multi-head and multi-task model architecture that facilitates
both sentiment analysis and sexism categorization. Variations in these 12 runs included CLS, mean and
max pooling, as well as adjustments to the number of layers in the classification heads. Additionally,
the system leveraged the multilingual Llama-3.2-3B-Instruct model and explored 2 methods: one
implementing classification via prompt engineering and the other combining classification with prompt
engineering and RAG to incorporate contextual information for improved performance. The predictions
from the 12 multi-head multi-task Transformer models and the LLM were aggregated using majority
ensemble learning, resulting in 3 submissions for hard-hard evaluation on the unlabeled test set (See
section 4.7). The experimental strategy employed a multi-step pre-processing pipeline, appropriate
loss functions for multi-label classification with imbalanced classes, positive class weights, and varying
thresholds per class, all aimed at alleviating class imbalance and enhancing the models’ ability to
comprehend and classify texts more effectively.</p>
      <p>The submissions achieved commendable rankings, securing the 4th, 5th, and 6th positions out of a
total of 132 submissions in both English and Spanish. All submitted runs surpassed the baseline scores
across both languages and each language separately. Run 3 and Run 1, which combined the training and
development sets for training, outperformed Run 2, which solely utilized the training set for training. It
was revealed that integrating contextual information through RAG and employing chain-of-thought
reasoning contributed significantly to successful LLM classification. Notably, Run 3, as the majority
ensemble approach with the 12 Transformer models along with RAG, chain-of-thought and annotators’
profiles included in the prompt, yielded the highest score and ranked 4th in both languages. This method
also achieved a 4th rank in Spanish, consistent with the other two runs, which maintained the same
rank for both languages. This pattern may be attributed to the minor over-representation of Spanish
within the EXIST dataset. When focusing solely on English, Run 1 attained a slightly higher score than
Run 3, ranking 5th, while Run 3 and Run 2 ranked 6th and 7th, respectively.</p>
      <p>To optimize model performance for multi-label sexism categorization, future research should explore
the use of larger Transformer language models within the multi-task multi-head architecture, such as
XLM-RoBERTa-xl or XLM-RoBERTa-xxl. Additionally, experimenting with larger LLMs for
classification and refining the prompt with chain-of-thought reasoning could yield significant improvements.
Incorporating data augmentation techniques or synthetic data generation may further enhance overall
performance.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations</title>
      <p>The data analysis and baseline experiments revealed a significant issue of class imbalance among the
labels. Despite the efforts to utilize loss functions specifically designed to address class imbalance and to
explore various thresholds and class-positive weights for each label, detecting instances of sexism while
simultaneously identifying one or more categories of sexism remains a challenging task. This difficulty
primarily arises from the overwhelming prevalence of "NO" sexism annotations in the training and
development datasets, along with an inadequate distribution of sexism categories, particularly concerning
labels such as "SEXUAL-VIOLENCE" and "MISOGYNY-NON-SEXUAL-VIOLENCE". Additionally, the
disproportionate number of Spanish tweets compared to English tweets poses additional challenges for
the models, due to the smaller token count available for English compared to Spanish. Compounding these
issues, limitations in GPU VRAM have hindered experimenting with larger Transformer models, such
as facebook/xlm-roberta-xl or LLMs with more parameters like Gemma 3.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The research that contributed to these results was funded by the European Union’s Horizon Europe
research and innovation programme, in the context of the TITAN project, grant agreement No. 101070658,
and the AI4TRUST project, grant agreement No. 101070190. This research has also been partially
supported by project MIS 5154714 of the National Recovery and Resilience Plan Greece 2.0 funded by
the European Union under the NextGenerationEU Program. The views expressed in this paper are
exclusively those of the author, and the European Commission is not responsible for any use that may
be made of the information it contains.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
<p>During the preparation of this work, the author used Grammarly for grammar and spelling checks.
After using this tool/service, the author reviewed and edited the content as needed and takes full
responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Appendix</title>
<p>[INST]
You are a sexism detection assistant analyzing tweets in English or Spanish. Each tweet is accompanied by its sentiment.
Your task is:
- To detect all applicable **sexism categories** from a predefined list.
- Use the sentiment to inform your judgment (e.g., aggressive tone may signal abuse).
- Consider **each category independently**; multiple categories can apply.
- If no sexism is present, return ONLY ["NO"].
### Sexism Categories (with definitions):
1. **IDEOLOGICAL-INEQUALITY**: The text discredits the feminist movement, rejects inequality between men and women, or presents men as victims of gender-based oppression.
2. **STEREOTYPING-DOMINANCE**: The text expresses false ideas about women that suggest they are more suitable to fulfill certain roles (mother, wife, caregiver, submissive, etc.), or inappropriate for certain tasks (e.g., driving, hard work), or claims that men are superior.
3. **OBJECTIFICATION**: The text presents women as objects, disregarding their dignity and personality, or assumes physical traits women must have to fulfill traditional gender roles (beauty standards, hypersexualization, women's bodies at men's disposal, etc.).
4. **SEXUAL-VIOLENCE**: The text includes or describes sexual suggestions, requests for sexual favors, or harassment of a sexual nature, including rape or sexual assault.
5. **MISOGYNY-NON-SEXUAL-VIOLENCE**: The text expresses hatred or non-sexual violence toward women (e.g., insults, aggression, or psychological abuse without sexual undertone).
6. **NO**: Use this only if none of the above categories are present.</p>
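<p>The prompts request the applicable categories as a JSON-style "labels" list. A minimal, hypothetical post-processing step (not taken verbatim from the paper) could extract and validate that list from the model's raw output before voting:</p>

```python
import json
import re

# The six category names from the prompt above.
VALID = {"IDEOLOGICAL-INEQUALITY", "STEREOTYPING-DOMINANCE", "OBJECTIFICATION",
         "SEXUAL-VIOLENCE", "MISOGYNY-NON-SEXUAL-VIOLENCE", "NO"}

def parse_labels(llm_output):
    """Pull the first JSON list out of the model's text, keep only valid
    category names, and fall back to ["NO"] if nothing usable is found."""
    match = re.search(r"\[.*?\]", llm_output, re.DOTALL)
    if not match:
        return ["NO"]
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return ["NO"]
    labels = [x for x in raw if isinstance(x, str) and x in VALID]
    # "NO" is exclusive: drop it if any sexism category is present.
    if len(labels) > 1 and "NO" in labels:
        labels.remove("NO")
    return labels or ["NO"]
```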
<p>"labels": ["&lt;CATEGORY1&gt;", "&lt;CATEGORY2&gt;", ...]
[INST]
You are a sexism detection assistant analyzing tweets in English or Spanish. You have the perspective of the following person:
### Sexism Categories (with definitions):
1. **IDEOLOGICAL-INEQUALITY**: The text discredits the feminist movement, rejects inequality between men and women, or presents men as victims of gender-based oppression.
2. **STEREOTYPING-DOMINANCE**: The text expresses false ideas about women that suggest they are more suitable to fulfill certain roles (mother, wife, caregiver, submissive, etc.), or inappropriate for certain tasks (e.g., driving, hard work), or claims that men are superior.
3. **OBJECTIFICATION**: The text presents women as objects, disregarding their dignity and personality, or assumes physical traits women must have to fulfill traditional gender roles (beauty standards, hypersexualization, women's bodies at men's disposal, etc.).
4. **SEXUAL-VIOLENCE**: The text includes or describes sexual suggestions, requests for sexual favors, or harassment of a sexual nature, including rape or sexual assault.
5. **MISOGYNY-NON-SEXUAL-VIOLENCE**: The text expresses hatred or non-sexual violence toward women (e.g., insults, aggression, or psychological abuse without sexual undertone).
6. **NO**: Use this only if none of the above categories are present.</p>
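<p>The submitted runs combined the 12 Transformer models' predictions with the LLM's via majority voting. A hedged sketch of one plausible per-label majority vote over such multi-label predictions (an illustration of the general technique, not the paper's exact procedure) follows:</p>

```python
from collections import Counter

LABELS = ["IDEOLOGICAL-INEQUALITY", "STEREOTYPING-DOMINANCE", "OBJECTIFICATION",
          "SEXUAL-VIOLENCE", "MISOGYNY-NON-SEXUAL-VIOLENCE", "NO"]

def majority_vote(predictions):
    """Given one label list per system, keep each label predicted by more than
    half of the systems; otherwise fall back to the single most frequent label
    (or "NO" when there are no votes at all)."""
    counts = Counter(label for pred in predictions for label in set(pred))
    winners = [l for l in LABELS if counts[l] > len(predictions) / 2]
    if not winners:
        winners = [counts.most_common(1)[0][0]] if counts else ["NO"]
    # "NO" is exclusive: drop it if any sexism category also won.
    if len(winners) > 1 and "NO" in winners:
        winners.remove("NO")
    return winners
```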
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Masequesmay</surname>
          </string-name>
          , Sexism,
          <year>2022</year>
          . URL: https://www.britannica.com/topic/sexism.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Wright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Harper</surname>
          </string-name>
          ,
<string-name>
<given-names>S.</given-names>
<surname>Wachs</surname>
</string-name>
,
<article-title>The associations between cyberbullying and callous-unemotional traits among adolescents: The moderating effect of online disinhibition</article-title>
,
<source>Personality and Individual Differences</source>
<volume>140</volume>
(
<year>2019</year>
)
<fpage>41</fpage>
-
<lpage>45</lpage>
. URL: https://www.sciencedirect.com/science/article/pii/S0191886918301910. doi:10.1016/j.paid.2018.04.001.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
<article-title>Perpetuating online sexism offline: Anonymity, interactivity, and the effects of sexist hashtags on social media</article-title>
          ,
          <source>Computers in Human Behavior</source>
          <volume>52</volume>
          (
          <year>2015</year>
          )
          <fpage>436</fpage>
          -
          <lpage>442</lpage>
. URL: https://www.sciencedirect.com/science/article/pii/S0747563215004641. doi:10.1016/j.chb.2015.06.024.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. d.</given-names>
            <surname>Meco</surname>
          </string-name>
          ,
<string-name>
<given-names>A.</given-names>
<surname>MacKay</surname>
</string-name>
,
<article-title>Social media, violence and gender norms: The need for a new digital social contract</article-title>
,
<year>2022</year>
. URL: https://www.alignplatform.org/resources/blog/social-media-violence-and-gender-norms-need-new-digital-social-contract.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Valenti</surname>
          </string-name>
          ,
          <article-title>Toxic twitter - women's experiences of violence and abuse on twitter</article-title>
          ,
          <year>2022</year>
. URL: https://www.amnesty.org/en/latest/news/2018/03/online-violence-against-women-chapter-3-2/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Pitsilis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ramampiaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Langseth</surname>
          </string-name>
          ,
<article-title>Effective hate-speech detection in twitter data using recurrent neural networks</article-title>
          ,
          <source>Applied Intelligence</source>
          <volume>48</volume>
          (
          <year>2018</year>
          )
          <fpage>4730</fpage>
          -
          <lpage>4742</lpage>
. URL: https://doi.org/10.1007/s10489-018-1242-y. doi:10.1007/s10489-018-1242-y.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arsht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Etcovitch</surname>
          </string-name>
          ,
          <article-title>The human cost of online content moderation</article-title>
          ,
          <year>2018</year>
. URL: https://jolt.law.harvard.edu/digest/the-human-cost-of-online-content-moderation.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. C. de Albornoz</surname>
            , I. Arcos,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Morante</surname>
          </string-name>
          , Overview of EXIST 2025:
          <article-title>Learning with Disagreement for Sexism Identification and Characterization in Tweets, Memes, and TikTok Videos</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. C. de Albornoz</surname>
            , I. Arcos,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Morante</surname>
          </string-name>
          , Overview of EXIST 2025:
          <article-title>Learning with Disagreement for Sexism Identification and Characterization in Tweets, Memes, and TikTok Videos (Extended Overview)</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Rodríguez-Sanchez</surname>
          </string-name>
          , J. C. de Albornoz, L. Plaza,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Comet</surname>
          </string-name>
, T. Donoso,
<article-title>Overview of EXIST 2021: sexism identification in social networks</article-title>
          ,
          <source>Proces. del Leng. Natural</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <fpage>195</fpage>
          -
          <lpage>207</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:250527210.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gasparini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saibene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lees</surname>
          </string-name>
, J. Sorensen,
<article-title>SemEval-2022 task 5: Multimedia automatic misogyny identification</article-title>
          , in: G. Emerson,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          , G. Stanovsky,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Palmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          Ratan (Eds.),
          <source>Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)</source>
          ,
Association for Computational Linguistics, Seattle, United States,
          <year>2022</year>
          , pp.
          <fpage>533</fpage>
          -
          <lpage>549</lpage>
. URL: https://aclanthology.org/2022.semeval-1.74/. doi:10.18653/v1/2022.semeval-1.74.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kirk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vidgen</surname>
          </string-name>
, P. Röttger,
<article-title>SemEval-2023 task 10: Explainable detection of online sexism</article-title>
          , in: A.
          <string-name>
            <surname>K. Ojha</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          <string-name>
            <surname>Doğruöz</surname>
            , G. Da San Martino, H. Tayyar Madabushi,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , E. Sartori (Eds.),
          <source>Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval2023)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>2193</fpage>
          -
          <lpage>2210</lpage>
. URL: https://aclanthology.org/2023.semeval-1.305/. doi:10.18653/v1/2023.semeval-1.305.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          , Overview of EXIST 2024 -
          <article-title>Learning with Disagreement for Sexism Identification and Characterization in Tweets</article-title>
          and Memes,
          <year>2024</year>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>117</lpage>
. doi:10.1007/978-3-031-71908-0_5.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Thin</surname>
          </string-name>
          ,
          <article-title>Sexism identification in social networks with generation-based language models</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscakova</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S.</surname>
          </string-name>
          Herrera (Eds.),
          <source>Working Notes Papers of the CLEF 2024 Evaluation Labs</source>
          , volume
          <volume>3740</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2024</year>
. URL: http://ceur-ws.org/Vol-3740/paper-109.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Y.-Z. Fang</surname>
            ,
            <given-names>L.-H.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.-D.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
          </string-name>
, NYCU-NLP at EXIST 2024:
          <article-title>Leveraging transformers with diverse annotations for sexism identification in social networks</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscakova</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S.</surname>
          </string-name>
          Herrera (Eds.),
          <source>Working Notes Papers of the CLEF 2024 Evaluation Labs</source>
          , volume
          <volume>3740</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2024</year>
. URL: http://ceur-ws.org/Vol-3740/paper-93.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Petrescu</surname>
          </string-name>
,
<string-name>
<given-names>C.-O.</given-names>
<surname>Truică</surname>
</string-name>
,
<string-name>
<given-names>E.-S.</given-names>
<surname>Apostol</surname>
</string-name>
          ,
          <article-title>Language-based mixture of transformers for exist2024</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscakova</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S.</surname>
          </string-name>
          Herrera (Eds.),
          <source>Working Notes Papers of the CLEF 2024 Evaluation Labs</source>
          , volume
          <volume>3740</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2024</year>
. URL: http://ceur-ws.org/Vol-3740/paper-108.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulou</surname>
          </string-name>
          , NLP_CHRISTINE at SemEval-2023 task 10:
          <article-title>Utilizing transformer contextual representations and ensemble learning for sexism detection on social media texts</article-title>
          , in: A.
          <string-name>
            <surname>K. Ojha</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          <string-name>
            <surname>Doğruöz</surname>
            , G. Da San Martino, H. Tayyar Madabushi,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , E. Sartori (Eds.),
          <source>Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</source>
          ,
Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>595</fpage>
          -
          <lpage>602</lpage>
. URL: https://aclanthology.org/2023.semeval-1.81/. doi:10.18653/v1/2023.semeval-1.81.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulou</surname>
          </string-name>
          , NLP_CHRISTINE@
          <article-title>LT-EDI-2023: RoBERTa &amp; DeBERTa fine-tuning for detecting signs of depression from social media text</article-title>
          , in: B.
          <string-name>
            <surname>R. Chakravarthi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Bharathi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
<surname>Griffith</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Bali</surname>
          </string-name>
          , P. Buitelaar (Eds.),
          <source>Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion</source>
, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria,
          <year>2023</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>116</lpage>
. URL: https://aclanthology.org/2023.ltedi-1.16/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulou</surname>
          </string-name>
          , NLPDame at ClimateActivism 2024:
          <article-title>Mistral sequence classification with PEFT for hate speech, targets and stance event detection</article-title>
          , in: A.
          <string-name>
            <surname>Hürriyetoğlu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Tanev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Thapa</surname>
          </string-name>
          , G. Uludoğan (Eds.),
          <source>Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE</source>
          <year>2024</year>
          ),
Association for Computational Linguistics, St. Julians, Malta,
          <year>2024</year>
          , pp.
          <fpage>96</fpage>
          -
          <lpage>104</lpage>
. URL: https://aclanthology.org/2024.case-1.13/.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>C.</given-names>
            <surname>Baziotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pelekis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doulkeridis</surname>
          </string-name>
,
<article-title>DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis</article-title>
          ,
          <source>in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          ,
Association for Computational Linguistics, Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>747</fpage>
          -
          <lpage>754</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Hutto</surname>
          </string-name>
          ,
<string-name>
<given-names>E. E.</given-names>
<surname>Gilbert</surname>
</string-name>
,
          <article-title>VADER: A parsimonious rule-based model for sentiment analysis of social media text</article-title>
, in: Eighth International Conference on Weblogs and Social Media (ICWSM-14), Ann Arbor, MI,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <year>2020</year>
          . arXiv:1911.02116.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing</article-title>
          ,
          <year>2021</year>
          . arXiv:2111.09543.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Espinosa</given-names>
            <surname>Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <article-title>XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>258</fpage>
          -
          <lpage>266</lpage>
          . URL: https://aclanthology.org/2022.lrec-1.27.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Loureiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Neves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Espinosa</given-names>
            <surname>Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <article-title>TimeLMs: Diachronic language models from Twitter</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>251</fpage>
          -
          <lpage>260</lpage>
          . URL: https://aclanthology.org/2022.acl-demo.25. doi:10.18653/v1/2022.acl-demo.25.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/1810.04805 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Legkas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zidianakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Koutrintzes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dagioglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Petasis</surname>
          </string-name>
          ,
          <article-title>Hierocles of Alexandria at Touché: Multi-task &amp; multi-head custom architecture with transformer-based models for human value detection</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Papers of the CLEF 2024 Evaluation Labs</source>
          , volume
          <volume>3740</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>3419</fpage>
          -
          <lpage>3432</lpage>
          . URL: http://ceur-ws.org/Vol-3740/paper-330.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hartshorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sravankumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korenev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hinsvark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>The Llama 3 herd of models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2312.10997. arXiv:2312.10997.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Delgado</surname>
          </string-name>
          ,
          <article-title>Evaluating extreme hierarchical multi-label classification</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>5809</fpage>
          -
          <lpage>5819</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1708.02002. arXiv:1708.02002.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <article-title>Class-balanced loss based on effective number of samples</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1901.05555. arXiv:1901.05555.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Distribution-balanced loss for multi-label classification in long-tailed datasets</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2007.09654. arXiv:2007.09654.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Giledereli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Köksal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Özgür</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ozkirimli</surname>
          </string-name>
          ,
          <article-title>Balancing methods for multi-label text classification with long-tailed class distribution</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2109.04712. arXiv:2109.04712.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>PAI at SemEval-2023 task 4: A general multi-label classification system with class-balanced loss function and ensemble module</article-title>
          , in:
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Ojha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Doğruöz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Da San Martino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tayyar Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sartori</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</source>
          , Association for Computational Linguistics
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>256</fpage>
          -
          <lpage>261</lpage>
          . URL: https://aclanthology.org/2023.semeval-1.34. doi:10.18653/v1/2023.semeval-1.34.
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>