<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>COMFOR at EXIST 2025: Support Vector Machines vs. Large Language Models in Sexism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Fritzsche</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jenny Felser</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Spranger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mittweida University of Applied Sciences</institution>
          ,
          <addr-line>Technikumplatz 17, Mittweida, 09648</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The increasing prevalence of sexist and misogynistic statements on social media poses a serious challenge. Due to the high volume of content, manual moderation is not feasible; therefore, automated detection systems are urgently needed. The EXIST task within CLEF 2025 is dedicated to the automatic identification of sexist content in social networks and the determination of the intention and subtype of sexism. This paper describes the COMFOR team's contribution to the first task of the competition, which focuses on tweets. A support vector machine (SVM) based on a comprehensive feature representation, including embeddings and lexical features, was used. For the third subtask, this classifier was used as the basis for a classifier chain. Additionally, the results of the ifrst subtask were compared with those of a large language model (LLM) with an assigned persona. Our best models achieved an Information Contrast Measure (ICM) of 0.4928 (Subtask 1.1), −0.2203 (Subtask 1.2), and −0.4635 (Subtask 1.3) in the hard evaluation of the English test data.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;sexism detection</kwd>
        <kwd>support vector machine</kwd>
        <kwd>large language models</kwd>
        <kwd>twitter</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Social networks are becoming increasingly important in today’s society as a source of information,
entertainment and even for exchanging opinions across geographical borders. Furthermore, a key
advantage of online communication is the anonymity that allows people to express their opinions freely.
However, this anonymity also has its downsides, as it can encourage the sharing of condescending or
ofensive content more frequently [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        A particularly critical phenomenon is so-called hate speech, i.e. statements that attack individuals
or groups based on characteristics attributed to the groups [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. According to a 2022 report by the
European Union, which examined ofensive language and harassment on YouTube, Telegram, Reddit,
and X, women were particularly afected by online hate compared to other social groups [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A study
conducted by the Federal Working Group “Gegen Hass im Netz” showed, for example, that misogynistic
online communication has increased from 2022 to 2023, and that women are targeted with disinhibited
language that threatens and maligns them [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These findings demonstrate that social media can
promote the spread of misogynistic views. Thus, users of social media specifically seek confirmation of
their opinions and resort to sexist comments, often regardless of the discussion context [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        To counteract the problem of online hate, including sexism, content moderators are tasked with
monitoring communication and promptly removing harmful content if necessary [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, given
the sheer volume of content published daily, purely manual moderation appears too labour-intensive
and time-consuming. Accordingly, there is an urgent need for methods to recognise sexist statements
automatically.
      </p>
      <p>
        The sEXism Identification in Social neTworks (EXIST) competition addresses this challenge by
calling for the identification of content in text, image, and video data that either expresses critical or
contemptuous views towards women or refers to such events [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. It must be noted that the organisers
equate misogyny with sexism, which is why this term is used throughout this document, even though
sexism can be directed against any gender. The tasks were bilingual, in both English and Spanish. In this
paper, we present our system for accomplishing the first task: detecting statements related to sexism in
tweets, focusing on the English language. This task was divided into three subtasks:
      </p>
      <p>To accomplish these tasks, two diferent systems were compared with each other: On the one hand, a
Support Vector Machine (SVM) was used, which was based on features such as embeddings, emojis and
the frequency of sexist words. On the other hand, large language models (LLMs) were used as part of a
few-shot prompting approach, in which the classification was carried out from the perspective of an
assigned role (e.g., a male or female person). However, due to the long computing times of the LLMs,
results could only be submitted for Subtask 1.1 in the competition using this approach.</p>
      <p>The rest of this paper is organised as follows: After discussing the current state of the literature in
section 2, the datasets used for this work and the methodology are briefly described in section 3. The
results are then discussed in section 4. Finally, we conclude in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Since this paper uses both traditional classification methods and advanced LLMs, the following section
provides an overview of recent approaches to sexism detection, both in the field of traditional methods
and modern language models, as well as related areas of application.</p>
      <sec id="sec-2-1">
        <title>2.1. Approaches based on Traditional Machine Learning</title>
        <p>
          The automatic detection of sexist tweets and the determination of their intention and subtype have
already been addressed in two previous shared tasks [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ], using the same dataset. However, only a
few participants employed traditional machine learning approaches in these competitions, for example
[
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ]. Nevertheless, the potential of these approaches should not be underestimated. For example,
during the GerMS-Detect task of GermEval 2024 [12] – a competition series focusing on the German
language - Donabauer [13] demonstrated that traditional methods, such as XGBoost, can achieve better
results in the binary detection of sexism and misogyny than transformer-based models, such as the
Bidirectional Encoder Representations from Transformers (BERT) model [14]. This result illustrates
that more powerful language models do not necessarily deliver better results than classic approaches.
Therefore, one focus of this work was on analysing traditional methods, particularly taking into account
diferent feature representations.
        </p>
        <p>SVM was chosen as the classification algorithm. As demonstrated by the survey conducted by
Abdollah Zadeh et al. [15], SVM is robust against outliers and particularly well-suited to high-dimensional data,
a property that is particularly relevant in text classification due to the large number of potential features.
However, the disadvantage of classic SVM is that it cannot handle overlapping classes. For Subtask 1.2,
in which multiple classes exist, this problem can be solved by using separate binary classifiers for each
class.</p>
        <p>Subtask 1.3, however, represents a genuine multi-label problem that requires more complex solutions.
Asti et al. [16] investigated various methods for multi-label classification, including the combination of
classification algorithms such as SVM, Multinomial Naive Bayes (MNB) and Random Forest (RF) with
transformation methods such as the binary relevance method [17], classifier chains [ 18] and the label
powerset transformation [19]. The combination of SVM and classifier chains, as introduced by Read
et al. [20] in particular, showed superior performance with identical features [16]. That means that
SVM can be used not only as a binary classifier for Subtask 1.1, but also, with appropriate extensions,
for the more complex Subtasks 1.2 and 1.3.</p>
        <p>A key element in modelling is the selection of suitable features. Fasoli et al. [21], for example,
compiled a list of sexist swear words and their social acceptability. Pamungkas et al. [22] also utilised
swear words from both formal and informal language, drawing on data from the NoSwearing website
[23], among other sources.</p>
        <p>In addition, binary, dictionary-based features can be defined, for example, by checking whether a
tweet contains terms associated with the word “woman” [21]. A hate word lexicon based on Bassignana
et al. [24] was also used, which is divided into 17 weightable subcategories.</p>
        <p>In the field of lexical features, classic representations such as Bag of Words (BoW), Bag of Hashtags and
Bag of Emojis were used [22]. In addition, newer approaches demonstrate that semantic representation
using word embeddings can yield better results [25]. Asudani et al. [25] provided a comprehensive
overview of various methods, including Word2Vec [26] and its further development Global Vectors
for Word Representation (GloVe) [27]. GloVe not only takes local contexts into account but also global
co-occurrence statistics, and is particularly well suited for this application thanks to pre-trained models
on Twitter data.</p>
        <p>In summary, the overview of related research indicates that SVM, in conjunction with a classifier
chain and features such as embeddings, presents an intriguing approach and can contribute to the
diversification of methods within shared tasks.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Approaches using Large Language Models</title>
        <p>
          In contrast to traditional approaches, participants in the two previous EXIST competitions primarily
employed smaller and larger language models [
          <xref ref-type="bibr" rid="ref12 ref13 ref14">28, 29, 30</xref>
          ]. In particular, variants of BERT were frequently
used, such as RoBERTa [
          <xref ref-type="bibr" rid="ref12">28</xref>
          ] applied for instance by Mohammadi et al. [
          <xref ref-type="bibr" rid="ref15">31</xref>
          ], Multilingual BERT (MBert)
[14] for classifying the English and Spanish tweets [
          <xref ref-type="bibr" rid="ref16">32</xref>
          ] and Twitter-specific models such as
TwitterRoBERTa [
          <xref ref-type="bibr" rid="ref13">29</xref>
          ] applied by Martinez et al. [
          <xref ref-type="bibr" rid="ref17">33</xref>
          ]. A common practice for detecting sexism in social media
using BERT is fine-tuning for the classification task [
          <xref ref-type="bibr" rid="ref14 ref18 ref19">30, 34, 35</xref>
          ], which is typically resource-intensive
[
          <xref ref-type="bibr" rid="ref20">36</xref>
          ].
        </p>
        <p>
          Instead of BERT-based and other so-called encoder-only models, larger, mostly decoder-only models,
for instance from the Generative Pre-trained Transformer (GPT) [
          <xref ref-type="bibr" rid="ref21">37</xref>
          ] or Large Language Model Meta AI
(LLaMA) family [
          <xref ref-type="bibr" rid="ref22">38</xref>
          ], have been proven promising for binary sexism and misogyny detection of English
social media posts [
          <xref ref-type="bibr" rid="ref23 ref24 ref25">39, 40, 41</xref>
          ] as well as for their categorisation into subforms [
          <xref ref-type="bibr" rid="ref24 ref26">40, 42</xref>
          ].
        </p>
        <p>
          The work of Samani et al. [
          <xref ref-type="bibr" rid="ref24">40</xref>
          ] is interesting in that they investigated various strategies for binary and
ifne-grained classification of sexism – zero-shot prompting, supervised fine-tuning and Reinforcement
Learning from Human Feedback (RLHF) – based on the dataset of the Explainable Detection of Online
Sexism (EDOS) task [
          <xref ref-type="bibr" rid="ref27">43</xref>
          ]. They concluded that the open-source model LLaMA, in particular, proved
to be efective using RLHF. However, another promising technique that the authors did not include
in their study is few-shot prompting (i.e., in-context learning). In this simple approach, the model is
shown a few examples of the task via prompting, without the need for larger annotated datasets or
human feedback [
          <xref ref-type="bibr" rid="ref28">44</xref>
          ]. This approach has proven successful, for example, in detecting hate speech in
low-resource languages [
          <xref ref-type="bibr" rid="ref29 ref30">45, 46</xref>
          ], which can be seen as a kind of generalisation of sexism detection
and is therefore also relevant to the task at hand. Nevertheless, both in the recognition of hate speech
[
          <xref ref-type="bibr" rid="ref29 ref31">45, 47</xref>
          ] and, more importantly, in the detection of sexism within the EXIST 2024 Task [
          <xref ref-type="bibr" rid="ref14">30</xref>
          ], the results
of LLMs applied in the few-shot scenario were outperformed by fine-tuned BERT-based models.
        </p>
        <p>
          One possible approach for improvement is to combine few-shot prompting with the specification of
a persona (i.e., role) or a specific perspective from which the LLM should perform the classification
[
          <xref ref-type="bibr" rid="ref26 ref32 ref33">48, 42, 49</xref>
          ]. Examples of this person-based prompting include assigning a political stance to LLMs for
hate speech classification [
          <xref ref-type="bibr" rid="ref32">48</xref>
          ], as well as assigning age and education level [
          <xref ref-type="bibr" rid="ref33">49</xref>
          ], age and gender [
          <xref ref-type="bibr" rid="ref26">42</xref>
          ],
or various sociodemographic characteristics including age, ethnicity, and sexuality for the detection of
sexism [
          <xref ref-type="bibr" rid="ref24">40</xref>
          ]. However, opinions difer as to whether assigning such characteristics to an LLM improves
classification results [
          <xref ref-type="bibr" rid="ref26">42</xref>
          ] or have almost no efect [
          <xref ref-type="bibr" rid="ref33">49</xref>
          ] – depending, among other things, on the language
of the dataset, the model used and the specific prompting strategy.
        </p>
        <p>
          In particular, Tian et al. [
          <xref ref-type="bibr" rid="ref26">42</xref>
          ] highlights the efectiveness of the persona-based prompting approach: a
single LLaMA-3 model [
          <xref ref-type="bibr" rid="ref34">50</xref>
          ] assigned gender and age outperformed a more complex cascading strategy
based on GPT models without persona assignment on the EXIST-2023 dataset [
          <xref ref-type="bibr" rid="ref34">50</xref>
          ]. Jiang et al. [
          <xref ref-type="bibr" rid="ref25">41</xref>
          ]
demonstrated that instructing an LLM to classify texts such as tweets from the perspective of a person
with specific sociodemographic characteristics can also be efective; however, overly complex prompts
or too detailed descriptions of the persona can negatively impact model performance. For this reason,
the present work focuses specifically on a single characteristic – gender – for the binary classification
of sexism. Since, according to Aoyagui et al. [
          <xref ref-type="bibr" rid="ref35">51</xref>
          ], diferent model types are fundamentally capable of
taking diferent perspectives when evaluating sexism, we do not limit ourselves to one model type, as
Jiang et al. [
          <xref ref-type="bibr" rid="ref25">41</xref>
          ], for example, does, but compare the performance of four diferent models.
        </p>
        <p>
          Despite their potential, it should not be forgotten that LLMs also have weaknesses, such as dificulties
in recognising implicit sexism [
          <xref ref-type="bibr" rid="ref23 ref36">39, 52</xref>
          ] and a high sensitivity to the design of prompts [
          <xref ref-type="bibr" rid="ref37">53</xref>
          ]. Accordingly,
it is interesting to investigate whether and to what extent LLMs outperform traditional approaches
such as SVM.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>To address the problem of sexism detection in social media, SVM and LLM-based approaches were
employed for Subtask 1.1, while the SVM system was utilised for Subtasks 1.2 and 1.3. We submitted a
total of six runs. An overview of the approaches is provided in Table 1. Concerning Subtask 1.3, run
COMFOR_1 and COMFOR_2 difer only in that hard labels were provided in the first run and soft labels
in the second. The following paragraphs provide a detailed description of the individual methods.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Description</title>
        <p>
          The Twitter dataset for all subtasks of Task 1 addressed in this paper was initially created for the previous
EXIST 2023 edition and described in detail by Plaza et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In total, the dataset comprises 4,727
English and 5,307 Spanish tweets, which the organisers had already divided into training, validation
and test data. Competition participants had the option of submitting results for only one language. We
chose English because many of the features for SVM were language-dependent and required tools [
          <xref ref-type="bibr" rid="ref38">54</xref>
          ]
and resources [23] that were only available in English at this scope. The organisers split the English
tweets into a training dataset of 3,260 tweets, a validation dataset of 489 tweets and a test dataset of 978
tweets. To increase the training data set for the SVM, the Spanish training data set, comprising 3,660
tweets, was translated and combined with the English training data set. The translation was done with
DeepL [
          <xref ref-type="bibr" rid="ref39">55</xref>
          ]. The combined training data set thus comprised 6,920 tweets.
        </p>
        <p>
          A special characteristic of the annotations provided for the training data is that the “learning with
disagreement” paradigm [
          <xref ref-type="bibr" rid="ref40">56</xref>
          ] was employed. That means that for each tweet, the individual annotations
of all six annotators, along with their socio-demographic information, were provided instead of
aggregated labels. An initial idea was therefore to account for the unreliability of individual annotators by
merging the labels of the training dataset into a final label using a weighted majority vote, as suggested
by Labudde and Spranger [
          <xref ref-type="bibr" rid="ref41">57</xref>
          ], for example. However, this approach was not feasible because the
same six annotators did not annotate every tweet; instead, a total of 1,065 annotators were involved in
the annotation process. Accordingly, it was not possible to make a statement about the reliability of
annotators by evaluating their labelling behaviour.
        </p>
        <p>
          Therefore, as is customary in the literature [
          <xref ref-type="bibr" rid="ref42 ref43">58, 59</xref>
          ], the labels for the three subtasks – both for the
augmented training data and validation data – were summarised as follows based on the majority
decisions of the annotators:
Subtask 1.1: In the first subtask, the binary decision of whether the tweet is sexist (YES) or not (NO),
a simple majority decision was made, with YES being assigned in the event of a tie.
Subtask 1.2: If the majority label of the tweet in the first subtask was NO, the tweet was also assigned
the label NO for the second subtask, which was to identify the author’s intention. The reason for
this was that only the intention of sexist tweets was to be determined [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Otherwise, a majority
decision was made for the three intention categories. In the event of a tie, the corresponding
tweets were removed from the dataset.
        </p>
        <p>Subtask 1.3: In the third subtask, the multi-label classification of the targeted facets of a woman’s life,
only the sexist tweets (majority label YES) were given a label. Due to the design of this task as
a multi-label classification task, a slightly modified procedure was necessary here: a label was
assigned as soon as at least two annotators agreed on an assignment. If two annotators did not
agree on at least one label for a tweet, this tweet was removed from the dataset.</p>
        <p>
          The resulting class distribution of the training dataset, enriched with the translated tweets, is presented
in Table 2, where NO indicates non-sexist tweets. For descriptions of the other classes, please refer to
Plaza et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. System 1: Support Vector Machine</title>
        <p>For all three subtasks, runs were submitted that utilised a traditional classification approach employing
an SVM. The implementation of this system is described in more detail in the following subsections.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Data Preprocessing</title>
          <p>
            Several cleanup steps were performed. Initially, URLs and brackets were removed, as those were not
relevant for classification. In addition, mentions beginning with an @ sign were replaced with the
placeholder “person”, as it was irrelevant who exactly was being addressed since no information about
users was available. However, for the second sub-task, which aims to distinguish whether the tweet
reports on sexism, describes it or is itself sexist [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] – it is crucial if someone is addressed at all. Emojis
were detected using Unicode patterns and mapped to their corresponding word expressions based on a
dictionary-based approach. They were then treated as ordinary tokens of the tweet. In this way, their
symbolic meaning could be preserved.
          </p>
          <p>
            Since hashtags can also contain sexist references [
            <xref ref-type="bibr" rid="ref44">60</xref>
            ], the original hashtags were used as the basis for
two features (see subsubsection 3.2.2), and compound hashtags were broken down into individual words
so that they could be considered part of the tweet, just like emojis. The Wordninja tool by Keredson
[
            <xref ref-type="bibr" rid="ref38">54</xref>
            ] was used for this purpose, as it can reliably separate hashtags even if their components are not
separated by capital letters. At the same time, this tool removes all special characters.
          </p>
          <p>
            Finally, all tweets were converted to lowercase for normalisation and lemmatised using the English
UDPipe language model by Straka and Straková [
            <xref ref-type="bibr" rid="ref45">61</xref>
            ]. Over- and undersampling strategies were not
used for the final model, as preliminary experiments indicated that these strategies deteriorated the
results.
          </p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Feature extraction</title>
          <p>To generate the feature representation of the tweets for all three subtasks, various types of features
were extracted and concatenated into a single feature vector for each instance. Those included a sparse
term vector representation based on term frequency–inverse document frequency (TF-IDF) weighting,
a dense embedding of the tweet, and additional numerical features such as the number of dictionary
words and the frequency of specific token types, including emojis. Each feature type is described in
more detail subsequently.</p>
          <p>TF-IDF vectors Following the BoW approach, a TF-IDF weighted document-term matrix was
generated, with each row containing the weighted term vector of a tweet.</p>
          <p>GloVe-Embeddings In addition, the GloVe model by Pennington et al. [27] was used, which was
trained on a large tweet corpus and is therefore particularly well suited for this data. Subsequently,
to obtain a vector representation for a tweet, the arithmetic mean of the embeddings of the words
occurring in that tweet was calculated.</p>
          <p>Emojis The semantic information of the emojis was taken into account by including their word
descriptions in the BoW representation and the embedding representation after preprocessing (see
subsubsection 3.2.1). Their word descriptions were included in the bag-of-words representation and the
embedding representation. To also take into account quantitative information about the use of emojis,
the number of emojis per tweet was added as a numerical feature.</p>
          <p>Hashtags The original hashtags served as the basis for two additional features: First, following the
approach of Pamungkas et al. [22], a TF-IDF weighted bag-of-hashtags representation was created,
analogous to the standard bag-of-words model. For each tweet, this approach yielded a sparse vector
whose elements corresponded to the TF-IDF values of the hashtags appearing in that tweet. Second, the
total number of hashtags per tweet served as an additional numerical feature.</p>
          <p>Swear words Words with an ofensive or hurtful character play a central role in the detection of
sexist language. Since traditional dictionaries often do not include such slang expressions, a list of swear
words from the website NoSwearing.com [23] was used. This source, which is continually expanded,
was particularly well-suited for identifying these terms. The number of swear words per tweet was
integrated as a numerical feature.</p>
          <p>Sexist words With a special focus on sexist language, a separate word list was created. To this end,
the 100 words with the highest Pearson correlation to the YES label (i.e. sexism) in the training data
were extracted. The frequency of these presumably sexist words served as a numerical feature.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Feature selection</title>
          <p>
            After feature extraction, feature selection was carried out using the approach proposed by Kuhn [
            <xref ref-type="bibr" rid="ref46">62</xref>
            ],
based on correlation analyses, to eliminate redundancies and make the subsequent training of the model
more eficient. Specifically, the Pearson correlation coeficient was computed for each pair of features.
If the absolute correlation value between two features exceeded a threshold of 0.5, the feature with a
higher mean absolute correlation to all other features was removed, as it was considered less impactful.
As a result of feature selection, individual terms were removed from the TF-IDF vector, while all other
feature types were retained without modification.
          </p>
        </sec>
        <sec id="sec-3-2-4">
          <title>3.2.4. Classification approach</title>
          <p>
            Using the extracted features, SVM models were trained for each subtask. A sigmoid kernel was chosen
because Srivastava and Bhambhu [
            <xref ref-type="bibr" rid="ref47">63</xref>
            ] generally recommend a non-linear kernel for classification tasks,
and the sigmoid kernel proved promising in initial experiments. The sigmoid kernel ℎ( ′) was
used, where  was set to the inverse of the number of features.
          </p>
          <p>The diferent types of classification tasks required the use of diferent strategies for using the SVM
algorithm:
Subtask 1.1: Binary classification SVM could be applied directly to this task.</p>
          <p>
            Subtask 1.2: Multi-class classification Following the one-vs-one approach [
            <xref ref-type="bibr" rid="ref48">64</xref>
            ], a binary classifier
was trained for each pair of classes. The class of new tweets was determined using the voting
strategy described by Chang and Lin [
            <xref ref-type="bibr" rid="ref49">65</xref>
            ].
          </p>
          <p>
            Subtask 1.3: Multi-label classification To take into account possible correlations between the
categories, an ensemble classifier chain was trained using the approach of [ 20]. For each category
(i.e. each facet), a binary classifier was created and linked in such a way that the predictions of
previous classifiers were incorporated into the following classifiers as additional features. Since
there is no natural order of the categories, the order of the classifiers in the chain was chosen
arbitrarily. The classifier chain provided the probability of tweets belonging to the categories.
These soft labels represented the COMFOR_2 run of the subtask for the soft-soft evaluation, as
described by Plaza et al. [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. To obtain hard assignments, a threshold value of 0.3 was set for the
membership probabilities for a label. These hard labels represented the COMFOR_1 run of the
subtask for the hard-hard evaluation.
          </p>
          <p>Subtasks 1.2 and 1.3 were treated as a hierarchical classification problem: The systems were trained
only on sexist annotated training data and applied exclusively to tweets that were predicted as sexist in
Subtask 1.1. This approach met the requirement to only classify sexist tweets further.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. System 2: Large Language Models</title>
        <p>
          For Subtask 1.1, the results of the traditional approach, based on SVM, were compared with those of
LLMs using persona-based prompting in combination with in-context learning. The focus was on
opensource LLMs rather than paid APIs such as GPT-4 [
          <xref ref-type="bibr" rid="ref21">37</xref>
          ]. Specifically, the performance of the following
models was evaluated using validation data: Qwen2.5-32B [
          <xref ref-type="bibr" rid="ref50">66</xref>
          ], Gemma 3-27B [
          <xref ref-type="bibr" rid="ref51">67</xref>
          ], LLama-3-8B [
          <xref ref-type="bibr" rid="ref34">50</xref>
          ]
and Mixtral 8x22B [
          <xref ref-type="bibr" rid="ref52">68</xref>
          ]. The analysis of open-source models for detecting sexism ofers the particular
advantage that this sensitive data can be better protected, as the models are run locally on the user’s
servers.
        </p>
        <p>The classification task was presented to all models using the same system and user prompt.
System prompt The system consists of an assignment to a persona or sociodemographic
characteristic, a task description, and a definition of sexism. The following roles (personas) were each assigned to
an LLM, which then annotated the entire English validation data set or test dataset:
• a male person
• a female person
• Alice Schwarzer, a well-known feminist</p>
        <p>
          The selection of the first two roles – a male person and a female person – was based on the work
of Tian et al. [
          <xref ref-type="bibr" rid="ref26">42</xref>
          ], Smith et al. [
          <xref ref-type="bibr" rid="ref33">49</xref>
          ] and on the assumption of the organisers of the shared task that
the gender of human annotators can influence the assessment of sexism [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Since the classification
task deals specifically with discrimination against women [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], it can be assumed that the “female” LLM
tends to classify tweets as sexist more often than the “male” one. To further reinforce this efect, the
persona of a prominent feminist, specifically Alice Schwarzer, was also introduced. The expectation
here was that an LLM assigned such a role would be particularly sensitive to microaggressions and
subtle forms of sexism. The idea of assigning a real personality to an LLM was taken up by Deshpande
et al. [
          <xref ref-type="bibr" rid="ref53">69</xref>
          ], among others. They showed that this approach can significantly change both the attitude
and the language of the model.
        </p>
        <p>System Prompt
You are &lt;persona&gt;. Your task is to decide whether a tweet expresses ideas related to sexism:
Consider the following definition: Ideas about sexism can be expressed in the following three
forms: the tweet is sexist itself, the tweet describes a sexist situation in which discrimination
towards women occurs, or the tweet criticizes a sexist behaviour. Remember that you should
annotate the tweet from the point of view of &lt;persona&gt;.</p>
        <p>
          In addition, the LLM was provided with both the specific task and a definition of sexism via the
system prompt—an approach also suggested, for example, by Reuver et al. [
          <xref ref-type="bibr" rid="ref54">70</xref>
          ]. The definition used is
based on that provided by the organisers of EXIST 2025 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and has been slightly adapted to clarify
which types of tweets are to be classified as sexist.
        </p>
        <p>
          User prompt The user prompt consisted of the following parts: a repetition of the task and the
assigned role, instructions on the desired output format, and examples of the classification task for
in-context learning. Specifically, the one-class shot strategy recommended by Assis et al. [
          <xref ref-type="bibr" rid="ref29">45</xref>
          ] for hate
speech detection was chosen, which involves providing only one annotated, preferably characteristic
example for each of the two classes in this case. The motivation behind this is that a single, well-chosen
example can be more efective and helpful in guiding the LLM than several examples of lower quality.
Therefore, the examples chosen were those mentioned by the organisers when presenting the tasks,
rather than examples from the training data, where there were often contradictions between the human
annotators [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>User Prompt
Determine whether the input tweet contains sexism. You should only reply with “YES” or “NO”.
Do not provide explanations or notes. Respond with a single word. Respond “YES” if the tweet
contains sexism and “NO” if it does not. Respond using JSON. Always remember that you should
annotate the tweets from the perspective of &lt;role&gt;. Examples of classification are:
• People really try to convince women with little to no ass that they should go out and buy
a body. Like bih, I don’t need a fat ass to get a man. Never have. Label: YES
• @messyworldorder it’s honestly so embarrassing to watch and they’ll be like “not all white
women are like that Label: NO</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>The following section compares the models using the validation data and discusses the final results of
the competition based on the test data.</p>
      <sec id="sec-4-1">
        <title>4.1. Comparison of Large Language Models</title>
        <p>
          First, the performance of the LLMs was compared on the validation dataset, with the Information
Contrast Measure (ICM) [
          <xref ref-type="bibr" rid="ref55">71</xref>
          ] score being the decisive factor. The organisers chose this score as the
evaluation criterion for calculating the ranking [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The ICM score achieved and the normalised form,
Macro ICM Norm, are presented in Table 3 for each of the four models, with the persona combination
that yielded the highest ICM score for the corresponding model highlighted in bold. The results coloured
in grey, without persona assignment, were obtained after the competition phase for further analysis.
        </p>
        <p>
          The results presented in Table 3 reveal apparent performance diferences between the models.
Regardless of the assigned persona, Gemma 3-27B consistently achieves the best scores, while
LLaMA3-8B performs the worst. For example, Gemma 3-27B achieves approximately three times the ICM value
of LLaMA-3-8B for male personas. This result supports the widely held assumption that model size is a
key performance factor in LLMs, as explained by Bender et al. [
          <xref ref-type="bibr" rid="ref56">72</xref>
          ]. However, since Qwen2.5-32B lags
behind Gemma 3-27B despite having a higher number of parameters, other influencing factors should
not be ruled out, such as whether a model takes a perpetrator or victim perspective for classification, as
described by Aoyagui et al. [
          <xref ref-type="bibr" rid="ref35">51</xref>
          ].
        </p>
        <p>
          Furthermore, there was no tendency for the models to favour a particular gender for better
classification results. While Gemma 3-27B and LLaMA-3-8B performed better when assigned the male
persona, Qwen2.5-32B and Mixtral 8x22B achieved the highest scores in the role of Alice Schwarzer.
Remarkably, none of the models achieved the best results with the unspecified female persona. One
possible explanation for the diferent performance of the models depending on the assigned persona
lies in the variation of the pre-training datasets [
          <xref ref-type="bibr" rid="ref22 ref50 ref51 ref52">38, 66, 67, 68</xref>
          ], which intensely influence the models’
level of knowledge [73]. In particular, it must be investigated to what extent an LLM’s knowledge of the
persona assigned to it in the classification task afects its performance. The behaviour of the smallest
model, LLaMA-3-8B, is particularly noteworthy: the ICM score for the generic female persona is twice
as high as for the Alice Schwarzer persona. That could indicate that the model has dificulty adequately
assuming the assigned role, possibly due to a lack of knowledge about the specific figure.
        </p>
        <p>For the larger models, the assigned persona appears to have less influence on performance, especially
in the Gemma 3-27B model, which raises the question of whether the assignment of personas makes
a remarkable contribution to the classification result. For clarification, the classification task was
subsequently repeated with all models, this time without persona assignment. The system prompt and
user prompt remained identical to those in paragraph 3.3 and paragraph 3.3; only the text elements
marked in red, which defined the persona, were omitted.</p>
        <p>
          As Table 3 (results highlighted in grey) demonstrates, persona assignment led to a slight improvement
in performance for all models except Gemma 3-27B. One possible explanation for the fact that the
best-performing model did not show any improvement is that specifying a specific perspective is
particularly helpful when a model is less confident in performing the task. However, it should also be
noted that the diference in performance between prompting with and without persona was minor,
especially for the larger models. This observation is also consistent with the findings of Civelli et al.
[
          <xref ref-type="bibr" rid="ref32">48</xref>
          ], who suggest that persona assignments can increase the stylistic diversity of model responses, but
do not necessarily lead to strongly diferent classification results.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results on validation data</title>
        <p>All systems used for submitting the runs were first evaluated on the English validation data (see Table 4).
For Subtask 1.1, the SVM-based system (COMFOR_1) and the two most powerful LLM-based approaches
were used. As shown in Table 3, the latter are Gemma 3-27B with an assigned male persona (COMFOR_2)
and a female persona (COMFOR_3). The runs listed in Table 4 for Subtask 1.2 and Subtask 1.3 are based
on the SVM system described in subsection 3.2.</p>
        <p>
          It should also be noted that, for technical reasons, the ICM-Hard and ICM-Hard Norm scores could not
be determined for the run COMFOR_1 for Subtask 1.3 based on the validation data. The run COMFOR_2
for Subtask 1.3 yielded so-called soft labels (see subsubsection 3.2.4). Accordingly, the metrics ICM-Soft
and ICM-Soft Norm were calculated for this run, as described by Plaza et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Subtask 1.1 When comparing the results of Subtask 1.1, it is noteworthy that the two LLM-based runs
- COMFOR_2 (Gemma 3-27B with male persona) and COMFOR_3 (Gemma 3-27B with female persona)
outperformed the SVM baseline (COMFOR_1) in terms of the ICM hard score, the primary evaluation
metric selected by the organizers, with a relative diference of 62.5%. All other calculated metrics also
yielded lower scores for the SVM-based system compared to the LLM-based approaches. One possible
reason for the lower performance of the SVM is the significant class imbalance in the training data
COMFOR_1
COMFOR_1
COMFOR_2
(see subsection 3.1). This problem did not afect the LLM, as no further training or fine-tuning was
performed on the training dataset.</p>
        <p>Another limitation of the SVM approach arises from the large number of concatenated features.
Although attempts were made to identify redundancy through correlation analyses, these experiments
should be repeated with alternative correlation measures such as XICOR [74], which also capture
non-linear relationships.</p>
        <p>Furthermore, it is questionable whether the combination of diferent lexical feature types – such as
GloVe embeddings and the BoW representation – ofers added value. A systematic investigation of the
respective contribution of individual feature types and their combinations to model performance is
therefore necessary. More targeted feature engineering could further improve the results of SVM and
may bring them closer to those of LLM-based approaches.</p>
        <p>Concerning the results of the LLM approaches, it is remarkable that the macro precision, macro recall
and macro F1 values hardly difer when assigning a male (COMFOR_2) or female persona (COMFOR_3).
This result confirms the previous impression that the selected persona in Gemma 3-27B has little
influence on the classification results.</p>
        <p>Subtask 1.2 Regarding Subtask 1.2, it is noticeable that a significantly lower 1 measure of only 0.47
and a negative ICM score were achieved. One possible reason for this result is that only tweets that
were classified as sexist (YES) by the SVM in Subtask 1.1 were passed on to the SVM for fine-grained
classification. Misclassifications by the SVM for Subtask 1.1, therefore, have a direct impact on the
performance of the SVM for Subtask 1.2.</p>
        <p>An alternative approach would be to address the task not as a hierarchical problem, but to directly
classify the tweets into four categories, including the NO category, i.e., no sexism. However, this
approach would exacerbate the problem of imbalance, as the non-sexist tweets strongly outnumber
the tweets assigned to the various categories of sexist tweets. The fact that the imbalance is already
problematic in the three categories of author intent is evident from the 1 scores achieved for these
individual categories:
• DIRECT: 0.76
• JUDGEMENTAL: 0.08
• REPORTED: 0.27</p>
        <p>
          The 1 measure for the overrepresented class DIRECT clearly exceeds that of the two
underrepresented classes. In addition to class imbalance, the performance diferences could also be due to the fact
that feature types such as BoW or dictionary-based features are particularly well suited for detecting
explicitly sexist tweets of the DIRECT category. Tweets in the JUDGEMENTAL category, however,
describe (social) circumstances perceived as sexist [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] without necessarily containing terms with clear
sexist connotations or swear words. Accordingly, it is not surprising that the SVM system with lexical
features achieved the lowest 1 score in this category.
        </p>
        <p>Subtask 1.3 Both the problem of imbalance and the dependence on the classifier’s performance from
Subtask 1.1 also afect Subtask 1.3. Nevertheless, the macro 1 measure for this task was higher than for
Subtask 1.2. This result was achieved through comparatively high macro precision, but at the expense
of lower macro recall. The decisive factor in this trade-of is primarily the threshold value at which a
tweet is assigned a label, which is why systematic testing of diferent threshold values could lead to
performance improvements.</p>
        <p>
          The results of the soft evaluation were considerably weaker, with a negative ICM soft score of −10.89
and an ICM soft norm of 0.0. A possible reason is the training on an aggregated gold standard (see
subsection 3.1), where agreement between two annotators was suficient for label assignment. Thus,
annotation uncertainties were not taken into account, which may have afected the prediction of the
actual proportion of annotators who assigned a label [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results on test data</title>
        <p>
          The final evaluation by the competition organisers resulted in several rankings: on the one hand, the
systems were compared only with the gold standard of one language, and on the other hand, with
the gold standard for both languages [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The results of the hard-all evaluation, i.e. comparison of
the predicted hard labels with the hard labels of the entire gold standard, can be found in Table 5. In
addition to the evaluation values calculated by the organisers, this table also shows the ranking within
the total number of submitted results.
        </p>
        <p>For all subtasks, even the first one, the submitted runs are listed at the bottom of the rankings.
However, these results can mainly be explained by the fact that we only submitted results for English
tweets. Therefore, the results and ranking of our runs, based solely on the English test data, which can
be found in Table 6, are more interesting.</p>
        <p>
          Table 6 clearly shows that LLM-based models continue to outperform SVM on the test data for
Subtask 1.1. In contrast to the results of the validation data, COMFOR_3, i.e. Gemma 3-27B with a female
persona, achieved a slightly higher ICM hard score than COMFOR_2. Ranked 77th out of 158, this run
is roughly in the middle of the submitted results. Accordingly, there is also room for improvement
for the LLM-based system. In addition to the aspects already discussed in subsection 4.2, it should
also be investigated whether, as emphasised by Jiang et al. [
          <xref ref-type="bibr" rid="ref25">41</xref>
          ], less information-overloaded prompts
could be more efective, for which persona assignment is combined with zero-shot prompting instead
of in-context learning. The results of the SVM on the English test data were similar to those on the
English validation data, with the low ranking once again highlighting the system’s weaknesses.
        </p>
        <p>
          Moreover, the SVM-based systems achieved poor results in Subtasks 1.2 and 1.3 with negative ICM
hard scores. However, the fact that over 20 % of the teams in Subtask 1.2 and 10 % in Subtask 1.3
achieved an ICM hard score of 0.0 [75] underscores the challenge of reliably predicting aggregated
labels in these classification tasks, considered subjective [
          <xref ref-type="bibr" rid="ref35 ref7">7, 51</xref>
          ].
        </p>
        <p>Finally, the results for the soft-soft evaluation of Subtask 1.3 are shown in Table 7, both for all test
data and for English data only.</p>
        <p>What is particularly notable here is that the ICM Soft norm is zero in both cases. However, when
comparing these results with those of the other participants, it becomes apparent that over 20
participants achieved a score of zero in this measure, which highlights that predicting the proportion of
annotators who assigned a label also poses challenges. In this context, the organisers of the Shared
Task themselves emphasised that label ambiguity and the lack of agreement between annotators make
this task particularly dificult [75].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The aim of the first task of the EXIST-2025 task was to detect sexism in tweets and then to classify
sexist content in more detail. In this paper, we presented our approach to these subtasks specifically for
English-language tweets. For binary sexism detection (Subtask 1.1), two approaches were compared: On
the one hand, LLMs were used in combination with persona-based prompting and in-context learning.
On the other hand, a traditional machine learning approach was used, in which an SVM was trained on
a dataset enriched with tweets translated from Spanish into English. Features based on a GloVe model
pre-trained specifically for Twitter, as well as additional lexical and statistical features, were used for
the SVM. The SVM was also used for the other subtasks: to classify the author’s intention (Subtask 1.2)
and to detect the facet of a woman’s life afected in the tweet (Subtask 1.3). For multi-label classification
in Subtask 1.3, the SVM served as the basis for the classifier chain method.</p>
      <p>For Subtask 1.1, our best model, the LLM Gemma 3-27B with an assigned male persona, achieved
an ICM score of 0.4928 on the English test data, outperforming the SVM approach, which achieved
an ICM score of 0.2033. In the other two subtasks, however, negative ICM values were achieved in
each case, a result primarily due to the strong class imbalance in the training data, which negatively
impacted the performance of the SVM.</p>
      <p>One possible approach to mitigate this problem is to follow the example of Lee et al. [76] and utilise
LLMs to generate additional training instances for underrepresented classes artificially. Furthermore,
since LLMs delivered better results than classic models in Subtask 1.1, it makes sense to extend their
use to Subtasks 1.2 and 1.3. To date, it has been demonstrated that assigning a gender-specific persona
has had only a minor impact on the classification results. Therefore, further potential can be tapped by
assigning alternative perspectives, such as a specific cultural background.</p>
      <p>Additionally, it has not yet been taken into account that the intense subjectivity of the task can
lead to disagreements among annotators. A possible solution here could be a combination of LLMs
with classic methods such as SVM. In this case, the LLM would not act as the sole classifier, but as a
kind of ‘artificial additional annotator’ that takes over a decision-making function in the event of label
inconsistencies and can thus contribute to the consistency of the training dataset.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT, Grammarly in order to: Grammar and
spelling check, Paraphrase and reword, Improve writing style. Further, the authors used DeepL in order
to: Text Translation. After using these tool/services, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.
[12] S. Gross, J. Petrak, L. Venhof, B. Krenn, GermEval2024 Shared Task: GerMS-Detect – Sexism
Detection in German Online News Fora, in: B. Krenn, J. Petrak, S. Gross (Eds.), Proceedings of
GermEval 2024 Task 1 GerMS-detect Workshop on Sexism Detection in German Online News Fora
(GerMS-detect 2024), Association for Computational Lingustics, Vienna, Austria, 2024, pp. 1–9.
[13] P. Donabauer, Pd2904 at GermEval2024 (Shared Task 1: GerMS-Detect): Exploring the Efectiveness
of Multi-Task Transformers vs. Traditional Models for Sexism Detection (Closed Tracks of Subtasks
1 and 2), in: B. Krenn, J. Petrak, S. Gross (Eds.), Proceedings of GermEval 2024 Task 1 GerMS-detect
Workshop on Sexism Detection in German Online News Fora (GerMS-detect 2024), Association
for Computational Lingustics, Vienna, Austria, 2024, pp. 39–47.
[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings
of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, volume 1, Association for Computational Linguistics,
Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[15] A. Abdollah Zadeh, J. Leevy, T. Khoshgoftaar, A Survey on the Choice Between Binary Classification
and One-Class Classification, in: Proceedings of the 27th ISSAT International Conference on
Reliability and Quality in Design, International Society of Science and Applied Technologies,
Virtual Event, 2022.
[16] A. D. Asti, I. Budi, M. O. Ibrohim, Multi-label Classification for Hate Speech and Abusive Language
in Indonesian-Local Languages, in: Proceedings of the 2021 International Conference on Advanced
Computer Science and Information Systems (ICACSIS), IEEE, Depok, Indonesia, 2021, pp. 1–6.
doi:10.1109/ICACSIS53237.2021.9631316.
[17] E. Montañes, R. Senge, J. Barranquero, J. Ramón Quevedo, J. José Del Coz, E. Hüllermeier,
Dependent binary relevance models for multi-label classification, Pattern Recognition 47 (2014)
1494–1508. doi:10.1016/j.patcog.2013.09.029.
[18] E. Alvares-Cherman, J. Metz, M. C. Monard, Incorporating label dependency into the binary
relevance framework for multi-label classification, Expert Systems with Applications 39 (2012)
1647–1655. doi:10.1016/j.eswa.2011.06.056.
[19] M. R. Boutell, J. Luo, X. Shen, C. M. Brown, Learning multi-label scene classification, Pattern</p>
      <p>Recognition 37 (2004) 1757–1771. doi:10.1016/j.patcog.2004.03.009.
[20] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier Chains for Multi-label Classification, in:
W. Buntine, M. Grobelnik, D. Mladenić, J. Shawe-Taylor (Eds.), Machine Learning and Knowledge
Discovery in Databases, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 254–269.
[21] F. Fasoli, A. Carnaghi, M. P. Paladino, Social acceptability of sexist derogatory and sexist
objectifying slurs across contexts, Language Sciences 52 (2015) 98–107. doi:10.1016/j.langsci.2015.
03.003.
[22] E. W. Pamungkas, V. Basile, V. Patti, Misogyny Detection in Twitter: A Multilingual and
CrossDomain Study, Information Processing &amp; Management 57 (2020) 102360. doi:10.1016/j.ipm.
2020.102360.
[23] All Slang Network, NoSwearing.com: Swear Word List, Dictionary, Filter, and API,
https://www.noswearing.com/, 2025.
[24] E. Bassignana, V. Basile, V. Patti, Hurtlex: A Multilingual Lexicon of Words to Hurt, in: Proceedings
of the 5th Italian Conference on Computational Linguistics (CLiC-it), volume 2253, CEUR-WS,
Turin, Italy, 2018, pp. 51–56. doi:10.4000/books.aaccademia.3085.
[25] D. S. Asudani, N. K. Nagwani, P. Singh, Impact of word embedding models on text analytics
in deep learning environment: A review, Artificial Intelligence Review 56 (2023) 10345–10425.
doi:10.1007/s10462-023-10419-1.
[26] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed Representations of Words and
Phrases and their Compositionality, Advances in Neural Information Processing Systems 26 (2013)
1–8.
[27] J. Pennington, R. Socher, C. D. Manning, GloVe: Global Vectors for Word Representation,
in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language
ProcessCan Language Models Be Too Big?, in: Proceedings of the 2021 ACM Conference on Fairness,
Accountability, and Transparency, ACM, Virtual Event Canada, 2021, pp. 610–623. doi:10.1145/
3442188.3445922.
[73] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh,
M. Lewis, L. Zettlemoyer, O. Levy, LIMA: Less Is More for Alignment, in: Proceedings of the 37th
International Conference on Neural Information Processing Systems, Nips ’23, Curran Associates
Inc., Red Hook, NY, USA, 2024, pp. 55006–55021. doi:10.48550/arXiv.2305.11206.
[74] S. Chatterjee, A New Coeficient of Correlation, Journal of the American Statistical Association
116 (2021) 2009–2022. doi:10.1080/01621459.2020.1758115.
[75] L. Plaza, J. Carrillo-de-Albornoz, I. Arcos, P. Rosso, D. Spina, E. Amigó, J. Gonzalo, R. Morante,
Overview of EXIST 2025: Learning with Disagreement for Sexism Identification and
Characterization in Tweets, Memes, and TikTok Videos, in: J. Carrillo-de-Albornoz, A. G. S. de Herrera,
J. Gonzalo, L. Plaza, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR
Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International
Conference of the CLEF Association (CLEF 2025), CEUR-WS, Madrid, Spain, 2025.
[76] D.-H. Lee, J. Pujara, M. Sewak, R. White, S. Jauhar, Making large language models better data
creators, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing, Association for Computational Linguistics, Singapore,
2023, pp. 15349–15360. doi:10.18653/v1/2023.emnlp-main.948.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Vandebosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Poels</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Pabian,</surname>
          </string-name>
          <article-title>The ecology of online hate speech: Mapping expert perspectives on the drivers for online hate perpetration with the Delphi method</article-title>
          ,
          <source>Aggressive Behavior</source>
          <volume>50</volume>
          (
          <year>2024</year>
          )
          <article-title>e22136</article-title>
          . doi:
          <volume>10</volume>
          .1002/ab.22136.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Guterres</surname>
          </string-name>
          ,
          <source>United Nations Strategy and Plan of Action on Hate Speech</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>European</given-names>
            <surname>Union Agency for Fundamental Rights</surname>
          </string-name>
          (Ed.), Online Content Moderation:
          <article-title>Current Challenges in Detecting Hate Speech</article-title>
          , Publications Ofice, Luxembourg,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .2811/ 332335.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dolezalek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fielitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Heindl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <surname>Tracing Online Misogyny - Eine Analyse Misogyner Ideologien Und Praktiken Aus Deutsch-Internationaler</surname>
            <given-names>Perspektive</given-names>
          </string-name>
          ,
          <source>Technical Report, Bundesarbeitsgemeinschaft Gegen Hass im Netz</source>
          , Berlin,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. L. Gil</given-names>
            <surname>Bermejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martos Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. Vázquez</given-names>
            <surname>Aguado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. B.</given-names>
            <surname>García-Navarro</surname>
          </string-name>
          ,
          <article-title>Adolescents, Ambivalent Sexism and Social Networks, a Conditioning Factor in the Healthcare of Women</article-title>
          .,
          <string-name>
            <surname>Healthcare</surname>
          </string-name>
          (Basel, Switzerland)
          <volume>9</volume>
          (
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          .3390/healthcare9060721.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lindekilde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Karg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Bang</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H. R.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          ,
          <article-title>Combatting online hate: Crowd moderation and the public goods problem</article-title>
          ,
          <source>Communications</source>
          <volume>49</volume>
          (
          <year>2024</year>
          )
          <fpage>444</fpage>
          -
          <lpage>467</lpage>
          . doi:
          <volume>10</volume>
          .1515/commun-2023-0109.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Arcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <surname>EXIST</surname>
          </string-name>
          <year>2025</year>
          :
          <article-title>Learning with Disagreement for Sexism Identification and Characterization in Tweets, Memes, and TikTok Videos</article-title>
          , in: C.
          <string-name>
            <surname>Hauf</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Jannach</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>F. M.</given-names>
          </string-name>
          <string-name>
            <surname>Nardini</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Pinelli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Silvestri</surname>
          </string-name>
          , N. Tonellotto (Eds.),
          <source>Proceedings of the 47th European Conference on Information Retrieval</source>
          , volume
          <volume>15576</volume>
          , Springer Nature Switzerland, Lucca, Italy,
          <year>2025</year>
          , pp.
          <fpage>442</fpage>
          -
          <lpage>449</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -88720-8_
          <fpage>65</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , Overview of EXIST 2023 -
          <article-title>Learning with Disagreement for Sexism Identification</article-title>
          and Characterization, Vienna, Austria,
          <year>2023</year>
          , pp.
          <fpage>316</fpage>
          -
          <lpage>342</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -42448-9_
          <fpage>23</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          , Overview of EXIST 2024 -
          <article-title>Learning with Disagreement for Sexism Identification and Characterization in Tweets and Memes</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>G. M.</given-names>
          </string-name>
          <string-name>
            <surname>Di Nunzio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer Nature Switzerland, Grenoble, France,
          <year>2024</year>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>117</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -71908-
          <issue>0</issue>
          _
          <fpage>5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Frick</surname>
          </string-name>
          , M. Steinebach, FraunhoferSIT@EXIST2024:
          <article-title>Leveraging Stacking Ensemble Learning for Sexism Detection</article-title>
          , in: Working Notes of CLEF 2024 -
          <article-title>Conference and Labs of the Evaluation Forum, CEUR-</article-title>
          <string-name>
            <surname>WS</surname>
          </string-name>
          , Grenoble, France,
          <year>2024</year>
          , pp.
          <fpage>993</fpage>
          -
          <lpage>1002</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sreekumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karthik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Thenmozhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gopalakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swaminathan</surname>
          </string-name>
          ,
          <article-title>Sexism Identification In Tweets Using Machine Learning Approaches</article-title>
          , in: Working Notes of CLEF 2024 -
          <article-title>Conference and Labs of the Evaluation Forum, CEUR-</article-title>
          <string-name>
            <surname>WS</surname>
          </string-name>
          , Grenoble, France,
          <year>2024</year>
          , pp.
          <fpage>1253</fpage>
          -
          <lpage>1259</lpage>
          . ing (EMNLP),
          <article-title>Association for Computational Linguistics</article-title>
          , Doha, Qatar,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          . doi:
          <volume>10</volume>
          .3115/v1/
          <fpage>D14</fpage>
          -1162.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , V. Stoyanov,
          <string-name>
            <surname>RoBERTa: A Robustly Optimized BERT Pretraining Approach</surname>
          </string-name>
          ,
          <year>2019</year>
          . doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>1907</year>
          .
          <volume>11692</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Espinosa</given-names>
            <surname>Anke</surname>
          </string-name>
          , L. Neves,
          <article-title>TweetEval: Unified benchmark and comparative evaluation for tweet classification</article-title>
          , in: T. Cohn,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          , Y. Liu (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2020</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>1644</fpage>
          -
          <lpage>1650</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .findings-emnlp.
          <volume>148</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Azadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ansari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eetemadi</surname>
          </string-name>
          , Bilingual Sexism Classification:
          <article-title>Fine-Tuned XLMRoBERTa and GPT-3.5 Few-Shot Learning</article-title>
          , in: Working Notes of CLEF 2024 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , arXiv, Grenoble, France,
          <year>2025</year>
          , pp.
          <fpage>958</fpage>
          -
          <lpage>965</lpage>
          . doi:
          <volume>10</volume>
          .48550/arXiv. 2406.07287. arXiv:
          <volume>2406</volume>
          .
          <fpage>07287</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mohammadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giachanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bagheri</surname>
          </string-name>
          ,
          <article-title>Towards Robust Online Sexism Detection: A MultiModel Approach with BERT, XLM-RoBERTa, and DistilBERT for EXIST 2023 Tasks</article-title>
          , in: Working Notes of CLEF 2023 -
          <article-title>Conference and Labs of the Evaluation Forum, CEUR-WS, Thessaloniki</article-title>
          , Greece,
          <year>2023</year>
          , pp.
          <fpage>1000</fpage>
          -
          <lpage>1011</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M.</given-names>
            <surname>Usmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Siddiqui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samad</surname>
          </string-name>
          ,
          <article-title>Sexism Identification in Tweets using BERT and XLM Roberta</article-title>
          , in: Working Notes of CLEF 2024 -
          <article-title>Conference and Labs of the Evaluation Forum, CEUR-</article-title>
          <string-name>
            <surname>WS</surname>
          </string-name>
          , Grenoble, France,
          <year>2024</year>
          , pp.
          <fpage>1274</fpage>
          -
          <lpage>1279</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>E.</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cuadrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. M.</given-names>
            <surname>Santos</surname>
          </string-name>
          , E. Puertas,
          <article-title>Notebook for the VerbaNex AI Lab at CLEF 2024</article-title>
          ,
          <article-title>CEUR-</article-title>
          <string-name>
            <surname>WS</surname>
          </string-name>
          , Grenoble, France,
          <year>2024</year>
          , pp.
          <fpage>1107</fpage>
          -
          <lpage>1113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [34]
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Raychawdhary</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Bhattacharya</surname>
            , G. Dozier,
            <given-names>C. D.</given-names>
          </string-name>
          <string-name>
            <surname>Seals</surname>
          </string-name>
          , AU_NLP at SemEval-2023 task 10:
          <article-title>Explainable detection of online sexism using fine-tuned RoBERTa</article-title>
          , in: A.
          <string-name>
            <surname>K. Ojha</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          <string-name>
            <surname>Doğruöz</surname>
            , G. Da San Martino, H. Tayyar Madabushi,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , E. Sartori (Eds.),
          <source>Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>717</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .semeval-
          <volume>1</volume>
          .
          <fpage>97</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Panwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mamidi</surname>
          </string-name>
          ,
          <string-name>
            <surname>DAP-LeR-DAug</surname>
          </string-name>
          :
          <article-title>Techniques for enhanced online sexism detection</article-title>
          , in: M.
          <string-name>
            <surname>Abbas</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          <string-name>
            <surname>Freihat</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP</source>
          <year>2023</year>
          ),
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2023</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kit</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Mokji</surname>
          </string-name>
          ,
          <article-title>Sentiment Analysis Using Pre-Trained Language Model With No Fine-Tuning and Less Resource</article-title>
          ,
          <source>IEEE access 10</source>
          (
          <year>2022</year>
          )
          <fpage>107056</fpage>
          -
          <lpage>107065</lpage>
          . doi:
          <volume>10</volume>
          .1109/ACCESS.
          <year>2022</year>
          .
          <volume>3212367</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [37] OpenAI, GPT-4
          <source>Technical Report</source>
          ,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2303.08774. arXiv:
          <volume>2303</volume>
          .
          <fpage>08774</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , M.
          <article-title>-</article-title>
          <string-name>
            <surname>A. Lachaux</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lacroix</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Rozière</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Hambro</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Azhar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Joulin</surname>
            , E. Grave, G. Lample, LLaMA: Open and
            <given-names>Eficient</given-names>
          </string-name>
          <string-name>
            <surname>Foundation Language Models</surname>
          </string-name>
          ,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .48550/ARXIV.2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vetagiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pakray</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
          </string-name>
          ,
          <article-title>A deep dive into automated sexism detection using fine-tuned deep learning and large language models</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>145</volume>
          (
          <year>2025</year>
          )
          <article-title>110167</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.engappai.
          <year>2025</year>
          .
          <volume>110167</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Samani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Large Language Models with Reinforcement Learning from Human Feedback Approach for Enhancing Explainable Sexism Detection</article-title>
          , in: O.
          <string-name>
            <surname>Rambow</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Wanner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Apidianaki</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Al-Khalifa</surname>
            ,
            <given-names>B. D.</given-names>
          </string-name>
          <string-name>
            <surname>Eugenio</surname>
          </string-name>
          , S. Schockaert (Eds.),
          <source>Proceedings of the 31st International Conference on Computational Linguistics (COLING)</source>
          ,
          <article-title>Association for Computational Linguistics, Abu Dhabi</article-title>
          ,
          <string-name>
            <surname>UAE</surname>
          </string-name>
          ,
          <year>2025</year>
          , pp.
          <fpage>6230</fpage>
          -
          <lpage>6243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vitsakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dinkar</surname>
          </string-name>
          , G. Abercrombie,
          <string-name>
            <surname>I. Konstas</surname>
          </string-name>
          ,
          <article-title>Re-examining Sexism and Misogyny Classification with Annotator Attitudes</article-title>
          , in: Y.
          <string-name>
            <surname>Al-Onaizan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>Y.-N.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2024</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>15103</fpage>
          -
          <lpage>15125</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          . findings-emnlp.
          <volume>887</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>X. Zhang,</surname>
          </string-name>
          <article-title>Large Language Model Cascades and Persona-Based In-Context Learning for Multilingual Sexism Detection</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>G. M.</given-names>
          </string-name>
          <string-name>
            <surname>Di Nunzio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco De Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , volume
          <volume>14958</volume>
          of Lecture Notes in Computer Science, Springer Nature Switzerland, Grenoble, France,
          <year>2024</year>
          , pp.
          <fpage>254</fpage>
          -
          <lpage>265</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -71736-9_
          <fpage>18</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kirk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vidgen</surname>
          </string-name>
          , P. Röttger, SemEval-2023 task 10:
          <article-title>Explainable detection of online sexism</article-title>
          , in: A.
          <string-name>
            <surname>K. Ojha</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          <string-name>
            <surname>Doğruöz</surname>
            , G. Da San Martino, H. Tayyar Madabushi,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , E. Sartori (Eds.),
          <source>Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>2193</fpage>
          -
          <lpage>2210</lpage>
          . doi:
          <volume>10</volume>
          .18653/ v1/
          <year>2023</year>
          .semeval-
          <volume>1</volume>
          .
          <fpage>305</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hayashi</surname>
          </string-name>
          , G. Neubig,
          <article-title>Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          . doi:
          <volume>10</volume>
          .1145/3560815.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>G.</given-names>
            <surname>Assis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Amorim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          , D. de Oliveira,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vianna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paes</surname>
          </string-name>
          ,
          <article-title>Exploring Portuguese Hate Speech Detection in Low-Resource Settings: Lightly Tuning Encoder Models or In-Context Learning of Large Models?</article-title>
          , in: P.
          <string-name>
            <surname>Gamallo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Claro</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Teixeira</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Real</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>H. G.</given-names>
          </string-name>
          <string-name>
            <surname>Oliveira</surname>
          </string-name>
          , R. Amaro (Eds.),
          <source>Proceedings of the 16th International Conference on Computational Processing of Portuguese</source>
          , volume
          <volume>1</volume>
          , Association for Computational Lingustics, Santiago de Compostela, Galicia/Spain,
          <year>2024</year>
          , pp.
          <fpage>301</fpage>
          -
          <lpage>311</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>U.</given-names>
            <surname>Sahin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Kucukkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ozcelik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Toraman</surname>
          </string-name>
          ,
          <article-title>Zero and Few-Shot Hate Speech Detection in Social Media Messages Related to Earthquake Disaster</article-title>
          ,
          <source>in: Proceedings of the 31st Signal Processing and Communications Applications Conference (SIU)</source>
          , IEEE, Istanbul, Turkiye,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . doi:
          <volume>10</volume>
          .1109/SIU59756.
          <year>2023</year>
          .
          <volume>10224056</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>K.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vishwamitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>An Investigation of Large Language Models for Real-World Hate Speech Detection</article-title>
          , in: 2023
          <source>International Conference on Machine Learning and Applications (ICMLA)</source>
          , IEEE, Jacksonville, FL, USA,
          <year>2023</year>
          , pp.
          <fpage>1568</fpage>
          -
          <lpage>1573</lpage>
          . doi:
          <volume>10</volume>
          . 1109/ICMLA58977.
          <year>2023</year>
          .
          <volume>00237</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>S.</given-names>
            <surname>Civelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bernardelle</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Demartini, The Impact of Persona-based Political Perspectives on Hateful Content Detection</article-title>
          ,
          <source>in: Companion Proceedings of the ACM on Web Conference</source>
          <year>2025</year>
          , ACM,
          <source>Sydney NSW Australia</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>1963</fpage>
          -
          <lpage>1968</lpage>
          . doi:
          <volume>10</volume>
          .1145/3701716.3718383.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Trippas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <source>RMIT-IR at EXIST Lab at CLEF</source>
          <year>2024</year>
          , in: Working Notes of CLEF 2024 -
          <article-title>Conference and Labs of the Evaluation Forum, CEUR-</article-title>
          <string-name>
            <surname>WS</surname>
          </string-name>
          , Grenoble, France,
          <year>2024</year>
          , pp.
          <fpage>1237</fpage>
          -
          <lpage>1252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          , et al.,
          <source>The Llama 3 Herd of Models</source>
          , arXiv e-prints (
          <year>2024</year>
          ) arXiv-
          <fpage>2407</fpage>
          . doi:
          <volume>10</volume>
          .48550/ arXiv.2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Aoyagui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stemmler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Ferguson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuzminykh</surname>
          </string-name>
          ,
          <article-title>A Matter of Perspective(s): Contrasting Human and LLM Argumentation in Subjective Decision-Making on Subtle Sexism</article-title>
          ,
          <source>in: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI'25)</source>
          , volume
          <volume>529</volume>
          of Chi '
          <volume>25</volume>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          . doi:
          <volume>10</volume>
          .1145/3706598.3713248.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Almendros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <source>Do Large Language Models Understand Mansplaining? Well</source>
          , Actually...,
          <string-name>
            <surname>LREC-COLING</surname>
          </string-name>
          <year>2024</year>
          (
          <year>2024</year>
          )
          <fpage>5235</fpage>
          -
          <lpage>5246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>L.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Designing of Prompts for Hate Speech Recognition with In-Context Learning</article-title>
          , in: 2022
          <source>International Conference on Computational Science and Computational Intelligence (CSCI)</source>
          , IEEE,
          <string-name>
            <surname>Las</surname>
            <given-names>Vegas</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NV</surname>
          </string-name>
          , USA,
          <year>2022</year>
          , pp.
          <fpage>319</fpage>
          -
          <lpage>320</lpage>
          . doi:
          <volume>10</volume>
          .1109/CSCI58124.
          <year>2022</year>
          .
          <volume>00063</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [54]
          <string-name>
            <surname>Keredson</surname>
          </string-name>
          , Word Ninja, https://github.com/keredson/wordninja,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [55]
          <string-name>
            <surname>DeepL</surname>
            <given-names>SE</given-names>
          </string-name>
          , DeepL Translate:
          <article-title>The world's most accurate translator</article-title>
          , https://www.deepl.com/en/translator,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>A.</given-names>
            <surname>Uma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fornaciari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dumitrache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chamberlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          , E. Simpson, M. Poesio, SemEval-2021
          <source>Task</source>
          <volume>12</volume>
          :
          <article-title>Learning with Disagreements</article-title>
          , in: A.
          <string-name>
            <surname>Palmer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Emerson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbelot</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          Zhu (Eds.),
          <source>Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>338</fpage>
          -
          <lpage>347</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .semeval-
          <volume>1</volume>
          .
          <fpage>41</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>D.</given-names>
            <surname>Labudde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spranger</surname>
          </string-name>
          ,
          <article-title>Towards Inter-Rater-Agreement-Learning</article-title>
          ,
          <source>in: Proceedings of the Tenth International Conference on Advances in Information Mining and Management (IMMM)</source>
          , IARIA Press, Lisabon, Portugal,
          <year>2020</year>
          , pp.
          <fpage>10</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>C.</given-names>
            <surname>Demus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schütz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Probol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siegel</surname>
          </string-name>
          , D. Labudde,
          <article-title>DeTox: A Comprehensive Dataset for German Ofensive Language and Conversation Analysis</article-title>
          , in: K.
          <string-name>
            <surname>Narang</surname>
            ,
            <given-names>A. Mostafazadeh</given-names>
          </string-name>
          <string-name>
            <surname>Davani</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Mathias</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Vidgen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          Talat (Eds.),
          <source>Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Seattle, Washington (Hybrid),
          <year>2022</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>153</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .woah-
          <volume>1</volume>
          .
          <fpage>14</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Anzovino</surname>
          </string-name>
          ,
          <source>Overview of the Task on Automatic Misogyny Identification at IberEval</source>
          <year>2018</year>
          , in:
          <source>Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval</source>
          <year>2018</year>
          ),
          <article-title>CEUR-WS, Sevilla</article-title>
          , Spain,
          <year>2018</year>
          , pp.
          <fpage>214</fpage>
          -
          <lpage>277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chiril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Benamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Moriceau</surname>
          </string-name>
          , “
          <article-title>Be nice to your wife! The restaurants are closed”: Can gender stereotype detection improve sexism classification?</article-title>
          , in: M.
          <article-title>-</article-title>
          <string-name>
            <surname>F. Moens</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Specia</surname>
          </string-name>
          , S. W.-t. Yih (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2021</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>2833</fpage>
          -
          <lpage>2844</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .findings-emnlp.
          <volume>242</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [61]
          <string-name>
            <given-names>M.</given-names>
            <surname>Straka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Straková</surname>
          </string-name>
          ,
          <source>Universal Dependencies 2</source>
          .
          <article-title>5 Models for UDPipe (</article-title>
          <year>2019</year>
          -12-06), http://hdl. handle.net/11234/1-
          <fpage>3131</fpage>
          ,
          <year>2019</year>
          . Accessed on 2025-
          <volume>07</volume>
          -30.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [62]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <article-title>Data Sets and Miscellaneous Functions in the caret Package</article-title>
          ,
          <source>Journal of Statistical Software</source>
          (
          <year>2011</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [63]
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bhambhu</surname>
          </string-name>
          ,
          <article-title>Data classification using support vector machine</article-title>
          ,
          <source>Journal of Theoretical and Applied Information Technology</source>
          <volume>12</volume>
          (
          <year>2010</year>
          )
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [64]
          <string-name>
            <given-names>U.</given-names>
            <surname>Kreßel</surname>
          </string-name>
          ,
          <article-title>Pairwise Classification and Support Vector Machines</article-title>
          , in: C. J.
          <string-name>
            <surname>Burges</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Schölkopf</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          <string-name>
            <surname>Smola</surname>
          </string-name>
          (Eds.),
          <source>Advances in Kernel Methods</source>
          , The MIT Press,
          <year>1998</year>
          . doi:
          <volume>10</volume>
          .7551/mitpress/ 1130.003.0020.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [65]
          <string-name>
            <surname>C.-C. Chang</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-J. Lin</surname>
            ,
            <given-names>LIBSVM:</given-names>
          </string-name>
          <article-title>A library for support vector machines</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>2</volume>
          (
          <year>2011</year>
          )
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          . doi:
          <volume>10</volume>
          .1145/1961189.1961199.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [66]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          , Y. Han,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          , G. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lu</surname>
          </string-name>
          , J. Ma,
          <string-name>
            <given-names>R.</given-names>
            <surname>Men</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , T. Zhu,
          <source>Qwen Technical Report</source>
          ,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .48550/ARXIV.2309.16609.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [67]
          <string-name>
            <surname>Gemma</surname>
            <given-names>Team</given-names>
          </string-name>
          ,
          <source>Gemma: Open Models Based on Gemini Research and Technology</source>
          ,
          <year>2024</year>
          . doi:
          <volume>10</volume>
          . 48550/arXiv.2403.08295.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [68]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Savary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. de las Casas,
          <string-name>
            <given-names>E. B.</given-names>
            <surname>Hanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel, G. Bour,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Lavaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , M.
          <article-title>-</article-title>
          <string-name>
            <surname>A. Lachaux</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Stock</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Antoniak</surname>
            ,
            <given-names>T. L.</given-names>
          </string-name>
          <string-name>
            <surname>Scao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Gervet</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lavril</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lacroix</surname>
            ,
            <given-names>W. E.</given-names>
          </string-name>
          <string-name>
            <surname>Sayed</surname>
          </string-name>
          , Mixtral of Experts,
          <year>2024</year>
          . doi:
          <volume>10</volume>
          .48550/ARXIV.2401.04088.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [69]
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murahari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rajpurohit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <article-title>Toxicity in chatgpt: Analyzing persona-assigned language models, in: Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics</article-title>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>1236</fpage>
          -
          <lpage>1270</lpage>
          . doi:
          <volume>10</volume>
          . 18653/v1/
          <year>2023</year>
          .findings-emnlp.
          <volume>88</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [70]
          <string-name>
            <given-names>M.</given-names>
            <surname>Reuver</surname>
          </string-name>
          , I. Sen,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melis</surname>
          </string-name>
          , G. Lapesa,
          <article-title>Tell me what you know about sexism: Expert-LLM interaction strategies and co-created definitions for zero-shot sexism detection</article-title>
          , in: L.
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ritter</surname>
          </string-name>
          , L. Wang (Eds.),
          <source>Findings of the Association for Computational Linguistics: NAACL</source>
          <year>2025</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Albuquerque, New Mexico,
          <year>2025</year>
          , pp.
          <fpage>8438</fpage>
          -
          <lpage>8467</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2025</year>
          .findings-naacl.
          <volume>470</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [71]
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Verdejo</surname>
          </string-name>
          ,
          <article-title>On the foundations of similarity in information access</article-title>
          ,
          <source>Information Retrieval Journal</source>
          <volume>23</volume>
          (
          <year>2020</year>
          )
          <fpage>216</fpage>
          -
          <lpage>254</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10791-020-09375-z.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [72]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McMillan-Major</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shmitchell</surname>
          </string-name>
          , On the Dangers of Stochastic Parrots:
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>