<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Stacked Reflective Reasoning in Large Neural Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kapioma Villarreal-Haro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Sánchez-Vega</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Rosales-Pérez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrián Pastor López-Monroy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Consejo Nacional de Ciencia y Tecnología (CONACYT)</institution>
          ,
          <addr-line>Av. Insurgentes Sur 1582, Col. Crédito Constructor, 03940, CDMX</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mathematics Research Center (CIMAT)</institution>
          ,
          <addr-line>Jalisco S/N Valenciana, 36023, Guanajuato, Guanajuato</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Mathematics Research Center (CIMAT)</institution>
          ,
          <addr-line>Monterrey, Av. Alianza Centro 502, Apodaca, 66628, Nuevo León</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Sexism, far from being merely a conceptual issue, is a concerning and pervasive social health problem that negatively impacts individuals' well-being and perception. In today's digital era, as sexism permeates online platforms, the creation of systems that detect this type of content is a challenging yet essential task. This paper presents the approach of the CIMAT-GTO team to Task 1 of EXIST 2024, which involves identifying tweets with sexism-related content. Our proposal takes advantage of the reasoning capabilities of Llama 3 in a two-step process. Initially, we generate rationales to analyze the nature of the tweets. Then, in a second step, we let the model reflect on the previously produced reasoning. The intuitive idea is to create text that supports opposite categories, and expect the model to contrast valid and invalid reasons by itself. We then use these generated rationales as extra information to complement the tweets and fine tune a Twitter-specialized XLM-RoBERTa model. Our experiments showed that incorporating Llama 3's rationales improves performance compared to only using tweets and yields competitive results in the task, demonstrating the potential of these methods.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Generative Large Language Models</kwd>
        <kwd>Large Language Model Reasoning</kwd>
        <kwd>Stacked Large Language Models</kwd>
        <kwd>Transformers</kwd>
        <kwd>Sexism Detection</kwd>
        <kwd>Social Media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In today’s world, social media is an essential platform for the communication and difusion of information
and opinions among individuals. However, social media interactions are often related to misleading
or harmful content. For instance, users might directly express bias in their own generated content.
Alternatively, they could engage by sharing and commenting on biased content created by other users.
Within these interactions, one major social concern is sexism, defined as prejudice or discrimination
based on sex or gender [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Sexism negatively afects the psychological well-being of women and men not only in everyday
face-to-face interactions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], but also in social media platforms [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this context, social media has been
used for two contrary purposes: 1) as a platform for bias dissemination where misleading information
and hateful behavior against women are spread, and 2) as a means for bias awareness and activism,
enabling users to address, report, and discuss misogynistic and sexist narratives [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        Instances of harmful sexist expressions in social media include hostile behavior and negative
evaluations of female job candidates [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Other problems involve the distribution and consumption of
content that perpetuates appearance anxiety, body shame, and eating disorder behaviors, primarily
among women [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. On the other hand, social media has also been used as a platform for positive impact
behaviors, such as mobilizing digital media in response to shaming, harassment, and rape culture [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        In the modern world, the eficient identification of sensitive content is crucial due to the vast volume
of data, human biases, and the profound impact of this task. Despite advancements in computer science
and the deployment of more accurate and sophisticated models, the challenge remains unsolved. In
this context, several eforts include shared tasks to address this issue, such as Automatic Misogyny
Identification at IberEval [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Multimedia Automatic Misogyny Identification and Explainable Detection
of Online Sexism at SemEval [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], and sEXism Identification in Social neTworks (EXIST) [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ].
      </p>
      <p>This paper describes CIMAT-GTO’s participation in Task 1 of EXIST 2024, which tackles the binary
detection of sexism in tweets. We propose a technique that takes advantage of the reasoning capabilities
of Large Language Models (LLMs). In a first step, the LLM generates reasoning supporting the target
categories for Task 1, which are sexism-related and not-sexism-related. In a second step, we exploit and
take advantage of this valuable information and feed it to the LLM to reflect on this reasoning. The
reasoning outputs are then used to extend the information provided by the tweets in fine tuning an
XLM-RoBERTa model pre-trained on multilingual tweets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. EXIST Shared Lab</title>
      <p>
        EXIST shared tasks aim to detect and capture sexism-related content in social networks while identifying
intention and fine-grained topics [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. EXIST has evolved from only analyzing content in text format
to multi-modal content. While the 2023 edition focused on detecting and categorizing sexist tweets, the
2024 edition extended its scope to encompass both tweets and meme images.
      </p>
      <p>For tweets, three primary tasks were established.</p>
      <p>• Task 1: Binary classification to identify tweets with sexism-related content.
• Task 2: Multi-class classification to identify the intention of the tweet.</p>
      <p>• Task 3: Multi-label classification to categorize the types of sexism expressed.</p>
      <p>Analogous tasks were introduced regarding meme images. Systems that address such shared tasks
can be set in two contexts: a hard setting, in which systems aim to predict a hard conventional output
category or set of categories, and a soft setting, in which systems are intended to provide probabilities
instead. In the case of Task 1, the categories consist of sexist and non-sexist tweets. In this paper, we
address Task 1 in the hard context.</p>
      <sec id="sec-2-1">
        <title>2.1. Tweet Dataset</title>
        <p>The tweet dataset contains 10, 034 tweets in Spanish and English; 6920 for training, 1038 for
development, and 2076 for testing. Each tweet was labeled by six annotators selected such that they had
diferent demographic characteristics to minimize bias in the labeling. The age ranges and gender of
the annotators belonged to the sets {18 − 22, 23 − 45, 46+} and { ,  }. All tweets were
annotated such that there was one annotator from each of the six possible gender-age groups.</p>
        <p>For Task 1, annotators were required to indicate whether the tweet was related to sexism or not.
In the following, we will refer to these two categories as sexist and not sexist for simplicity purposes,
since that’s the labeling convention used in EXIST. It is worth noting that the sexist category not only
encompasses tweets with direct harmful messages but also tweets where sexism-related situations are
being discussed or exposed. Therefore, this task is not only to detect direct hateful behavior against
women.</p>
        <p>Table 1 summarizes the train dataset partition according to language and label assigned in Task 1.
The dataset is mostly balanced among languages and sexism-related and not-sexism-related categories.
A minority, less than 13% of the tweets, do not have a majority class attached to them.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Previous Work</title>
      <sec id="sec-3-1">
        <title>3.1. Previous Editions of EXIST</title>
        <p>
          During EXIST 2023, the most commonly used approaches to address binary sexism classification included
the use of variations and combinations of the following three [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]:
1. Transformer-based architectures like BERT or RoBERTa. Models were either pre-trained and then
ifne-tuned or trained from scratch. They included general knowledge text or domain-specific text
like tweets.
2. Classical Machine and Deep Learning Methods utilizing input embeddings from pre-trained
models and additional attributes like toxicity and sentiment metrics, or linguistic and handcrafted
features.
3. Data augmentation techniques, external datasets, and ensembles of multiple models.
4. Addressing the task as a monolingual problem using separate models for each language, or as a
multilingual problem using cross-lingual or translation techniques.
        </p>
        <p>
          While classical methods and architectures remain popular and competitive, models like GPT, Llama,
or Gemini have not yet been deeply explored. In EXIST 2023, GPT-based large language model cascades
were shown to be competitive and ranked among the top systems. Notwithstanding the fact of being
efective, their strategy appears to be used in a classification setting rather than generating text [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
We speculate that the generation capabilities of LLMs to provide key ideas, identify elements, and
compare arguments that are substantial for a correct label assignment, as LLMs have shown to achieve
state-of-the-art in several tasks even without fine-tuning.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Large Language Models “Reasoning” Capabilities.</title>
        <p>
          Recent research on Large Language Models is focused on exploiting their “reasoning” capabilities.
An overview of the current state-of-the-art knowledge on reasoning in LLMs provides a review of
diferent techniques [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Some approaches involve traditional, fully supervised fine-tuning to generate
rationales on a specific domain. Others are prompt-based and in-context learning methods that do not
require fine-tuning. A third type consists of hybrid approaches that combine both of the previous.
        </p>
        <p>
          Given that fine-tuning massive generative LLMs is not eficient regarding computational resources,
prompt-based methods have gained popularity because of their knowledge capacities. Adding prompt
pieces in zero-shot scenarios such as “Think step by step” [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] or “Let’s first understand the problem
and devise a plan to solve the problem. Then, let’s carry out the plan and solve the problem step by step”
[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] encourage the models to provide rationales that guide the answer. Other prompts like “This is very
important to my career” have positively influenced the performance in some tasks [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. However, in
certain settings, prompts like “Think step by step” might lead the model to produce inaccurate answers
or generate harmful content [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>
          In few-shot scenarios, techniques like chains of thought have been shown to improve the answers by
demonstrating a thought process and encouraging the model to provide its own behind the answer [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
The order and quality of the few-shot demonstrations are crucial and impact the performance. Some
studies propose techniques for providing good permutations of the examples to enhance the quality
of the results [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Several strategies include encouraging the model to take advantage of multiple
prompts. This can be used to answer the same question and then apply approaches to regularize the
prompt consistency to obtain a final label [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. Other techniques propose using external “prompters”
that iteratively prompt the LLM to recall a series of knowledge and derive a “chain of thought” [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
Additional approaches include subdividing all the context into questions and enabling cross-model
communication during problem-solving to aggregate the answers [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
        </p>
        <p>All these strategies require a prompt-refining process, to some extent, to provide better context and
enhance the use of these generative LLMs. Still, this refinement process is usually not automatic and is
done through a highly qualitative assessment.</p>
        <p>
          Another important consideration is the source and target languages used to prompt an LLM. Studies
have shown a disparity between the performance of LLMs in English and non-English languages, with
LLMs generally performing better in English [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. These techniques and considerations have become
more popular and have been leveraged in large language models like GPT-3, Llama, Gemini, and Claude
to generate more accurate answers to diferent tasks.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Baseline</title>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing the data</title>
        <p>
          We will focus on the data’s hard label predictions. Although we will not dismiss the individual annotators’
labels, we will filter the tweets so that only those classified as sexist and not sexist by majority vote will
be included. We preprocess the tweets using the library “pysentimiento” [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] in the following way:
1. User handles are substituted with @user.
2. URL directions are replaced with the special token url.
3. The # symbol is substituted by the special token hashtag, and the content in multi-word hashtags
is split into separate words.
        </p>
        <p>4. Emojis are replaced with their text descriptions.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. XLM-RoBERTa fine tuned</title>
        <p>
          As a baseline, we worked with a Twitter-specific multilingual language model that consists of an
XLM-RoBERTa architecture trained on multilingual tweets [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. We fine-tuned the model to predict
the Task 1 labels. The input consists of the tokenized tweets. We built two variants: prediction of the
hard binary label and simultaneous prediction of the hard binary label and the single-annotator labels
grouped by age and gender. The second variant was chosen because it resulted in a slightly better
macro-F1 score for the hard binary labels. We will refer to this system as XLM-RoBERTa-Baseline.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. The Proposed Method</title>
      <p>We have an experimental setup in two stages: In the first stage, we generate “reasoning” texts using
an LLM that aims to understand the tweets’ nature. In the second stage, we use the generated texts to
process them further with a pre-trained XLM-RoBERTa model. We explain these details in the following
section.</p>
      <sec id="sec-5-1">
        <title>5.1. LLM Stacked Reasoning</title>
        <p>We relied on an autoregressive LLM to generate text analysis of a tweet. Rather than asking the model to
assign a label and provide an explanation in one straightforward step, we created rationales that support
the tweet’s target categories, sexist and not sexist. Then, we let the model compare both arguments
and choose the most accurate. We hypothesize that this is better than the direct closed question “Is this
tweet sexist?” because, despite the validity of the reasonings for each category, the model internally
evaluates the correctness of the statements produced. The setting is the same for both tweets in English
and Spanish, and all the generated analyses are in English.</p>
        <p>The first two steps occur independently:
1. Positive Reasoning. Generating analysis that supported the idea that the tweet was related to
sexism.
2. Negative Reasoning. Generating analysis that supported the idea that the tweet was not related
to sexism.</p>
        <p>In a further step, the LLM is asked to “reflect” on the opposite texts it produced:
3. Stacked Comparison Reasoning. The model is fed with the information generated in the Positive
and Negative Reasonings and has the chance to compare them.</p>
        <p>This process is illustrated in Figure 1. The LLM generates three rationales that provide insight into
the nature of the tweet. We call all the rationales to be Tweet Reasonings (marked as purple boxes in
Figure 1), and they will be further used in the next section.</p>
        <p>
          As an experimental setting, we ask the model to answer as a gender equality specialist, to respond
as unbiased as possible or to be concise. The exact prompts and an output example can be found in
Appendix A. The LLM selected for this reasoning stage was Llama 3, an open-source auto-regressive
language model developed by Meta AI [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. We used the Llama 3 8B Instruct version.
        </p>
        <p>In a conclusive step, we ask the model to provide a label synthesizing the reflection made in the</p>
        <sec id="sec-5-1-1">
          <title>Stacked Comparison Reasoning. The Binary Final Answer produced is not used in the second stage of</title>
          <p>our proposed methodology, but we use it to assess the method’s gain in performance. The answers
provided by this method are reported as B-StackLlama. Even though this workflow encourages the
model to produce better-structured answers, the final classification is still not accurate enough. The
second stage, which will be explained in the next section, addresses this issue.
5.1.1. Fine tune of XLM-RoBERTa
In this stage, we test the reasoning provided by the LLM in the previous step as a supplement to the
tweets. To enhance the XLM-RoBERTa baseline described before, we experimented with feeding the
tweet concatenated with the reasonings generated. Due to the maximum input token restriction, we
are limited to incorporating one reasoning at a time.</p>
          <p>We fine tune three diferent XLM-RoBERTa models using diferent inputs. The input of each model
consists of the concatenation using the special separator token [SEP] of a tweet with one of its
corresponding Tweet Reasonings generated in the previous stage. The diferent reasonings Positive, Negative or
Stacked Comparison Reasoning yield a diferent fine-tuned model, which we will call P-LLM-R-Stack-Ra,</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>N-LLM-R-Stack-Ra and C-LLM-R-Stack-Ra respectively.</title>
          <p>Figure 2 shows the process realized. In addition to the individual {P, N, C}-X-LLM-R-Stack-Ra models,
we create ensembles of individual models aiming to capture a more accurate label. Ensembles are
generated by aggregating the scores that individual systems assign to the labels. The final label is
designated as the one with the highest score.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Tweet Fine Grained Sexist Evaluation Questionnaire</title>
      <p>
        A more general task can be addressed as an aggregate of fine-grained tasks. In particular, there
are questionnaire-based retrieval models used to provide a final response diagnosis, like the case of
depression [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].
      </p>
      <p>Using this idea, in addition to the previous approach, we created a list of binary and close-option
questions meant to identify the nature of the tweet. The questionnaire focused on identifying ofensiveness,
intention, and whether fine-grained sexism-related topics were expressed directly or passively. The
complete questionnaire is included in Appendix B. The answers to these questions were fed through a
multi-layer feed-forward neural network that aimed to predict all three tweet-related tasks. Because this
method did not outperform the scores obtained by the previous method on the validation set by itself,
we used it only to enhance the previous techniques in an ensemble setting in the fashion described
before. Reference for this method is Q-Llama-MLP.</p>
      <p>Although this method seems promising, the quality of the answers is sensible to question formulation.
Appendix B shows an example of the refinement through a qualitative assessment of a question that
provides more accurate results in its final versions. We believe that all the questions in the original
questionnaire can be refined into versions that lead to more accurate and representative answers and
can be further used both to address the binary task and to identify fine-grained topics.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Evaluation Results</title>
      <p>In this section, we discuss the performance of our proposals over the validation set and the oficial
evaluation metrics at EXIST 2024.</p>
      <sec id="sec-7-1">
        <title>7.1. Preliminary evaluation over validation data</title>
        <p>This section is dedicated to evaluating the efectiveness and the impact of the key components of the
methods proposed in the previous sections. We include the system Ensemble , which is the ensemble
merging the two best performing individual systems N-LLM-R-Stack-Ra and C-LLM-R-Stack-Ra. In
extension to this ensemble, we also include Ensemble, which in addition to individual systems</p>
        <sec id="sec-7-1-1">
          <title>N-LLM-R-Stack-Ra and C-LLM-R-Stack-Ra incorporates Q-Llama-MLP. The ensemble setting is, as</title>
          <p>described before, the result of equally aggregating the scores of the individual scores of each category
to obtain the ensemble scores.</p>
          <p>We present in Table 2 the F1 of the positive class for the diferent systems proposed considering
both tweets in English and Spanish of the validation set. As observed, models {N, C, P}-LLM-R-Stack-Ra
outperform the baseline XLM-RoBERTa-Baseline. This demonstrates the benefit of incorporating the
reasoning against using only tweets. The Negative and Stacked Comparison Reasonings incorporation
perform slightly better than the Positive Reasonings. We also observe that B-StackLlama underperforms
XLM-RoBERTa-Baseline, indicating the deficiency of relying only on the reasonings. Regarding
Q-LlamaMLP, it does not beat XLM-RoBERTa-Baseline, partly due to the redundancy and inability to capture
the questions’ nature and meaning. Despite that, Q-Llama-MLP enhances the ensemble’s performance
slightly. The best performance overall belongs to Ensemble.</p>
          <p>During the prompting process of B-StackLkama, the binary final answer underestimates
sexismrelated tweets. The performance by itself is poor because the two-step process is biased to produce
a final negative label. We additionally observed the length of the Tweet Reasonings also influences
the quality of the response: too short, and the analysis might not have enough details; too long, and
the answer might be repetitive, afecting the performance during the nfie-tuning process of {N, C,
P}-LLM-R-Stack-Ra. We set up the reasoning generation to contain at most 200 tokens of the Positive and</p>
        </sec>
        <sec id="sec-7-1-2">
          <title>Negative Reasonings and the Stacked Comparison Reasoning to contain at most 250 tokens. An example</title>
          <p>showing the variation of the answer based on the token length is also included in Appendix A</p>
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Oficial Leaderbord</title>
        <p>
          The oficial metrics for EXIST 2024 are ICM-hard and F1 of the positive class, and scores are divided by
language [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ]. Table 3 summarizes the results obtained over the test set.
        </p>
        <p>The single system N-LLMRStack-Ra performs less efectively than Ensemble . We hypothesize that
the Negative and Stacked Comparison Reasonings complement the tweets with diferent information
during the fine-tuning process. In the case of the Negative Reasoning, as provided for all tweets, we expect
to learn an internal diferentiation between accurate and inaccurate supporting facts of the not-sexist
category. In the Comparison Stacked Reasoning, we expect to capture the contrast between reasonings
and correct the preference of one over the other if necessary. We think that the distinct aspects and
relationships these reasonings capture contribute to the improved performance in Ensemble .</p>
        <p>The ensemble Ensemble is the best performer of our systems. In particular, the diference between
the top score with this system considering the set of evaluation tweets in English and Spanish is less
than 0.01 in the ICM-Hard Norm and the F1 of the positive class, which shows the competitivity of our
method. Performance of this model is just slightly better than Ensemble , and we believe that even
though Q-Llama-MLP did not outperform the individual systems in validation, adding the output in
Ensemble provides a slight correction of the underestimation of sexist tweets.</p>
        <p>Performance over the Spanish tweets is slightly better than performance over English tweets, and we
conjecture that the multilingual setting benefits performance for both English and Spanish languages.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>In this paper, we propose a methodology to detect sexism that takes advantage of the capabilities
to generate rationales of LLMs, which was not used in EXIST’s previous editions. These rationales
show the potential of the LLM to support the binary classes and the capabilities to compare two
diferent reasoning processes. As shown in this paper, the LLM reflection alone is not enough to achieve
competitive results, and it is biased to underestimate sexist tweets, which is not desirable. That is why
using the rationales enhances the baseline results provided by an XLM-RoBERTa that is fine-tuned only
using the tweets. It allows for information that explores the nature of the tweet and a more accurate
classification. Results show that the proposed models are competitive and open the panorama of how
internal knowledge and reasoning capabilities of autoregressive LLMs can address this task.</p>
      <p>In future work, we plan to extend this approach to address other tasks, refine the prompts to obtain
better rationales, and study the capabilities and limitations of other autoregressive LLMs. Questionnaire
results can be explored to dive into the fine-grained classifications of sexism and analyze source intention
and topic classification. A more in-depth exploration regarding the length of the rationales generated,
the efects of prompting variation, alternative LLMs, and bias remains to be explored. This method also
shows insight into the biases encoded in the large language models, as the model we chose (Llama 3) is
misled to provide inaccurate explanations of the wrong category classification and, as observed in the
scores, fails to choose the correct classification label by itself.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Ethical Concerns</title>
      <p>It is important to note that the systems developed in this work predict binary labels based on annotators’
majority vote and might overlook people’s perceptions at an individual level. Another important
distinction is label names, where instead of “sexist” and “not-sexist” they could be more accurately
described as “sexism-related” and “non-sexism-related” to recognize two diferent perspectives: negative
intention (difusion of biased content) and positive intention that treats sensitive content (bias awareness
and discussion).</p>
      <p>The reasoning generated by the LLMs put into evidence the biased internal views they provide, where
sexism-related content is underestimated. Using these labels by themselves or trusting the rationales
generated should be considered carefully, as they can be highly misleading. Given the implications
of deploying detection systems for sensitive content, the proposed solution requires a more in-depth
analysis by social scientists, and ethics and fairness experts. Misusing these systems might have
significant implications, including the potential non-detection of toxic and dangerous content or the
unintended censorship of discussions about social problems.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>Villarreal-Haro acknowledges CONAHCYT for its support provided by the program Becas Nacionales
Para Estudios de Posgrados (CVU 1309535). We thank CONAHCYT for the computer resources provided
through the INAOE Supercomputing Laboratory’s Deep Learning Platform for Language Technologies
and CIMAT Bajio Super-computing Laboratory (#300832). Sanchez-Vega acknowledges CONAHCYT
for its support through the program “Investigadoras e Investigadores por México” (Project ID.11989,
No.1311). Rosales-Perez acknowledges CONAHCYT for its support through the project grant Búsqueda
de arquitecturas neuronales eficientes y efectivas (CBF2023-2024-2797).</p>
    </sec>
    <sec id="sec-11">
      <title>A. LLM Reasoning Prompt</title>
      <sec id="sec-11-1">
        <title>A.1. Prompt template</title>
        <p>In the experimental templates, we tried to set an objective environment for the answer given by the
system role. The exact prompts used for the task are the following:</p>
        <p>The first two prompts take as input {tweet}, and generate the Positive and Negative Reasonings are the
following:</p>
        <p>Positive Prompt.</p>
        <sec id="sec-11-1-1">
          <title>System Role</title>
        </sec>
        <sec id="sec-11-1-2">
          <title>You are a gender equality specialist. Think step by step. Answer as unbiased as possible.</title>
        </sec>
        <sec id="sec-11-1-3">
          <title>User</title>
        </sec>
        <sec id="sec-11-1-4">
          <title>Explain why the following tweet contains elements talking about sexism.</title>
        </sec>
        <sec id="sec-11-1-5">
          <title>Tweet: {tweet}</title>
        </sec>
        <sec id="sec-11-1-6">
          <title>Assistant: {...}</title>
          <p>Negative Prompt.</p>
        </sec>
        <sec id="sec-11-1-7">
          <title>Assistant: {...} System Role You are a gender equality specialist. Think step by step. Answer as unbiased as possible.</title>
        </sec>
        <sec id="sec-11-1-8">
          <title>User</title>
        </sec>
        <sec id="sec-11-1-9">
          <title>Explain why the following tweet does not contain elements talking about sexism.</title>
        </sec>
        <sec id="sec-11-1-10">
          <title>Tweet: {tweet}</title>
          <p>Positive and Negative Reasonings are limited to a maximum of 200 generated tokens. In a further step,
the tweet and both the Positive and Negative Reasonings are provided so that the model self-evaluates
both.</p>
          <p>Comparison Prompt.</p>
        </sec>
        <sec id="sec-11-1-11">
          <title>System Role</title>
        </sec>
        <sec id="sec-11-1-12">
          <title>You are a gender equality specialist. Think step by step. Be concise.</title>
        </sec>
        <sec id="sec-11-1-13">
          <title>User</title>
        </sec>
        <sec id="sec-11-1-14">
          <title>Consider the following tweet.</title>
        </sec>
        <sec id="sec-11-1-15">
          <title>Tweet: {tweet}</title>
        </sec>
        <sec id="sec-11-1-16">
          <title>Which analysis is the most accurate?</title>
        </sec>
        <sec id="sec-11-1-17">
          <title>Analysis 1: {positive reasoning}</title>
        </sec>
        <sec id="sec-11-1-18">
          <title>Analysis 2: {negative reasoning}</title>
        </sec>
        <sec id="sec-11-1-19">
          <title>Assistant: {...}</title>
          <p>The Comparison Stacked Reasoning is limited to a maximum of 250 tokens.</p>
        </sec>
      </sec>
      <sec id="sec-11-2">
        <title>A.2. Answer samples</title>
        <p>We show an example of the LLM output to illustrate the type of text reasoning we generated.</p>
        <p>The original tweet is the following:</p>
        <sec id="sec-11-2-1">
          <title>Writing a uni essay in my local pub with a cofee. Random old man keeps asking me drunk questions when I’m trying to concentrate; ends with "good luck, but you’ll just end up getting married and not use it anyway". #everydaysexism is alive and well".</title>
          <p>The positive and negative reasoning generated are:</p>
          <p>Positive Reasoning:
Let’s break down the tweet and identify the elements that suggest sexism:
1. The tweet starts by mentioning the author is writing a university essay in a pub, which is a
common scenario. This sets the context for the rest of the tweet.
2. The author mentions that a "random old man" is asking them "drunk questions" while they’re
trying to concentrate. This already has a slightly confrontational tone, as the author is implying that
the man is interrupting and being a distraction.</p>
        </sec>
        <sec id="sec-11-2-2">
          <title>3. The phrase "good luck, but you’ll just end up getting married and not use it anyway" is what</title>
          <p>specifically suggests sexism. The man’s comment is making an assumption about the author’s future
plans and goals, which are likely to be academic in nature. The comment implies that the author’s
ultimate goal is to get married and abandon their academic pursuits, which is a stereotype often
perpetuated about women.</p>
        </sec>
        <sec id="sec-11-2-3">
          <title>This type of comment is a classic example of sexism.</title>
          <p>Negative Reasoning:
The tweet does not contain elements of sexism because the content is focused on the author’s
experience of being interrupted by an old man while trying to write an essay, rather than any specific
gender-based discrimination or harassment. The tweet is more about the author’s frustration with
the distraction and the old man’s opinion about the author’s future plans, which is not related to the
author’s gender. While the author uses the hashtag #everydaysexism, which is often used to describe
the subtle and pervasive forms of sexism that people face in their daily lives, the tweet itself does not
contain any explicit sexist language, attitudes, or behaviors. The tweet is more about the author’s
annoyance with the old man’s behavior and the disruption it caused, rather than any gender-based
issue. In fact, the tweet could be interpreted as an example of ageism, as the old man’s comment
about the author’s future plans is not specific to the author’s gender, but rather a stereotypical
comment.</p>
          <p>The comparison reasoning where the model self-evaluates the outputs produced is the following:
Stacked Comparison Reasoning:
Analysis 1 is the most accurate. The tweet contains elements of sexism because:
1. The old man’s comment "good luck, but you’ll just end up getting married and not use it anyway"
is a stereotypical and gendered assumption about the author’s future plans and goals, which are
likely to be academic in nature. The comment implies that the author’s ultimate goal is to get married
and abandon their academic pursuits, which is a stereotype often perpetuated about women.</p>
        </sec>
        <sec id="sec-11-2-4">
          <title>2. The tweet also contains a subtle tone of annoyance and frustration, which may be attributed to the author’s experience as a woman, as they are more likely to face interruptions and condescending comments from men.</title>
        </sec>
        <sec id="sec-11-2-5">
          <title>The other analyses do not accurately identify the elements of sexism in the tweet. Analysis 2 is</title>
          <p>incorrect because it does not recognize the gendered nature of the old man’s comment and the impact
it has on the author. Analysis 2 is also incorrect because it does not identify the specific elements of
sexism in the tweet, instead, it focuses on broader issues of sexism and gender inequality. Analysis 1
accurately identifies the elements of sexism in the tweet, including the stereotypical and gendered
assumption made by the old man.</p>
          <p>It is essential to remark on some things. First, the model’s answer quality varies with length. If we
ask the model to generate positive reasoning with a smaller amount of tokens, the answer does not
provide as many insights that support the target category. The following is an example of a positive
reasoning limited to less than 50 tokens.</p>
          <p>Positive Reasoning:</p>
        </sec>
        <sec id="sec-11-2-6">
          <title>The tweet contains elements of sexism because it describes a situation where the author, a woman, is being subjected to unwanted and inappropriate comments from a stranger, specifically an older man, while she is trying to focus on her academic work.</title>
          <p>As we can observe, limiting the reasoning to fewer tokens in the rationale prevents the model from
fully breaking down the elements that it identifies as relevant. Allowing the model to produce a much
longer output will result in repetition of the rationale generating.</p>
          <p>Secondly, due to the safety settings of the model, in certain topics, the generated answer is as follows:
Positive Reasoning:</p>
        </sec>
        <sec id="sec-11-2-7">
          <title>I cannot analyze a tweet that contains explicit and harmful content, including sexual harassment.</title>
          <p>In the second step of our methodology, we expect that even though the model does not provide a
rationale supporting the targeted position, the model is learning that this type of analysis also provides
information about the nature of the tweet and helps to learn about the topics contained.</p>
          <p>Finally, these are all automatic reasonings generated, and as shown in the example, are internally
capable of supporting both sexist and not-sexist categories. We noticed that this two-step scheme is
biased to predict as a final answer negative results. So, these models and the extracted knowledge shall
be used carefully.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>B. Sexist Evaluation Questionnaire</title>
      <sec id="sec-12-1">
        <title>B.1. Refinement of questions</title>
        <p>As we discussed before, refining the questions impact the accuracy of the answers. Table 4 shows
an example of the qualitative analysis of the evolution of a question if asked to explain the answer
provided.</p>
      </sec>
      <sec id="sec-12-2">
        <title>B.2. List of questions</title>
        <p>The complete list of fine-grained sexist questions is the following. The model was asked to answer with
“Yes/No” for binary questions and only with the answer identificator for closed-option questions.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Stevenson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lindberg</surname>
          </string-name>
          , New Oxford American Dictionary, Third Edition, OUP USA,
          <year>2010</year>
          . URL: https://books.google.com.mx/books?id=sZoFRwAACAAJ.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Swim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hyers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferguson</surname>
          </string-name>
          ,
          <article-title>Everyday sexism: Evidence for its incidence, nature, and psychological impact from three daily diary studies</article-title>
          ,
          <source>Journal of Social Issues</source>
          <volume>57</volume>
          (
          <year>2001</year>
          )
          <fpage>31</fpage>
          -
          <lpage>53</lpage>
          . doi:
          <volume>10</volume>
          .1111/
          <fpage>0022</fpage>
          -
          <lpage>4537</lpage>
          .
          <fpage>00200</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Paciello</surname>
          </string-name>
          ,
          <string-name>
            <surname>F. D'Errico</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Saleri</surname>
          </string-name>
          , E. Lamponi,
          <article-title>Online sexist meme and its efects on moral and emotional processes in social media</article-title>
          ,
          <source>Comput. Hum. Behav</source>
          .
          <volume>116</volume>
          (
          <year>2020</year>
          )
          <article-title>106655</article-title>
          . doi:
          <volume>10</volume>
          .1016/j. chb.
          <year>2020</year>
          .
          <volume>106655</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Turley</surname>
          </string-name>
          , J. Fisher,
          <article-title>Tweeting back while shouting back: Social media and feminist activism</article-title>
          ,
          <source>Feminism &amp; Psychology</source>
          <volume>28</volume>
          (
          <year>2018</year>
          )
          <fpage>128</fpage>
          -
          <lpage>132</lpage>
          . URL: https://api.semanticscholar.org/CorpusID: 149235968.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tassone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Matheson</surname>
          </string-name>
          ,
          <article-title>Tweeting about sexism motivates further activism: A social identity perspective</article-title>
          .,
          <source>The British journal of social psychology</source>
          (
          <year>2020</year>
          ). doi:
          <volume>10</volume>
          .1111/bjso.12431.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Perpetuating online sexism ofline: Anonymity, interactivity, and the efects of sexist hashtags on social media</article-title>
          ,
          <source>Comput. Hum. Behav</source>
          .
          <volume>52</volume>
          (
          <year>2015</year>
          )
          <fpage>436</fpage>
          -
          <lpage>442</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:45231644.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <article-title>Concept of beauty in the age of the internet: Impact of social media on appearance anxiety and body shame</article-title>
          ,
          <source>Communications in Humanities Research</source>
          (
          <year>2023</year>
          ). URL: https://api. semanticscholar.org/CorpusID:266066470.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>K. M. Jessalynn Keller</surname>
          </string-name>
          , J. Ringrose,
          <article-title>Speaking 'unspeakable things': documenting digital feminist responses to rape culture</article-title>
          ,
          <source>Journal of Gender Studies</source>
          <volume>27</volume>
          (
          <year>2018</year>
          )
          <fpage>22</fpage>
          -
          <lpage>36</lpage>
          . URL: https://doi.org/10.1080/09589236.
          <year>2016</year>
          .
          <volume>1211511</volume>
          . doi:
          <volume>10</volume>
          .1080/09589236.
          <year>2016</year>
          .
          <volume>1211511</volume>
          . arXiv:https://doi.org/10.1080/09589236.
          <year>2016</year>
          .
          <volume>1211511</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Anzovino</surname>
          </string-name>
          ,
          <article-title>Overview of the task on automatic misogyny identification at ibereval 2018</article-title>
          , in: IberEval@SEPLN,
          <year>2018</year>
          . URL: https://api.semanticscholar.org/CorpusID:51942244.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gasparini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saibene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lees</surname>
          </string-name>
          , J. Sorensen, SemEval
          <article-title>-2022 task 5: Multimedia automatic misogyny identification</article-title>
          , in: G. Emerson,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          , G. Stanovsky,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Palmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          Ratan (Eds.),
          <source>Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Seattle, United States,
          <year>2022</year>
          , pp.
          <fpage>533</fpage>
          -
          <lpage>549</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .semeval-
          <volume>1</volume>
          .74. doi:
          <volume>10</volume>
          . 18653/v1/
          <year>2022</year>
          .semeval-
          <volume>1</volume>
          .
          <fpage>74</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kirk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vidgen</surname>
          </string-name>
          , P. Röttger, SemEval-2023 task 10:
          <article-title>Explainable detection of online sexism</article-title>
          , in: A.
          <string-name>
            <surname>K. Ojha</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          <string-name>
            <surname>Doğruöz</surname>
            , G. Da San Martino, H. Tayyar Madabushi,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , E. Sartori (Eds.),
          <source>Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval2023)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>2193</fpage>
          -
          <lpage>2210</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .semeval-
          <volume>1</volume>
          .305. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .semeval-
          <volume>1</volume>
          .
          <fpage>305</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          , Overview of EXIST 2024 -
          <article-title>Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ),
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          , Overview of EXIST 2024 -
          <article-title>Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview)</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          , J. C. de Albornoz,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of exist 2023 - learning with disagreement for sexism identification and characterization (extended overview)</article-title>
          , in: M.
          <string-name>
            <surname>Aliannejadi</surname>
            , G. Faggioli,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ferro</surname>
          </string-name>
          , M. Vlachos (Eds.),
          <source>Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>X. Zhang,</surname>
          </string-name>
          <article-title>Eficient multilingual sexism detection via large language model cascades</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2023</year>
          . URL: https://api.semanticscholar. org/CorpusID:264441414.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. C.-C. Chang</surname>
          </string-name>
          ,
          <article-title>Towards reasoning in large language models: A survey</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL</source>
          <year>2023</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>1049</fpage>
          -
          <lpage>1065</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .findings-acl.
          <volume>67</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .findings-acl.
          <volume>67</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot reasoners</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Neural Information Processing Systems</source>
          , NIPS '22, Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. K.-W. Lee</surname>
            ,
            <given-names>E.-P.</given-names>
          </string-name>
          <string-name>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <surname>Plan-</surname>
          </string-name>
          and
          <article-title>-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>2609</fpage>
          -
          <lpage>2634</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>147</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>147</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Hou,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          , Emotionprompt:
          <article-title>Leveraging psychology for large language models enhancement via emotional stimulus</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>O.</given-names>
            <surname>Shaikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Held,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>On second thought, let's not think step by step! bias and toxicity in zero-shot reasoning</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>4454</fpage>
          -
          <lpage>4470</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>244</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>244</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chainof-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Neural Information Processing Systems</source>
          , NIPS '22, Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bartolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stenetorp</surname>
          </string-name>
          ,
          <article-title>Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity</article-title>
          , in: S. Muresan,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Villavicencio (Eds.),
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>8086</fpage>
          -
          <lpage>8098</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>556</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>556</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          Berg-Kirkpatrick, G. Neubig,
          <article-title>Prompt consistency for zero-shot task generalization</article-title>
          , in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          , Y. Zhang (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2022</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>2613</fpage>
          -
          <lpage>2626</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .findings-emnlp.
          <volume>192</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .findings-emnlp.
          <volume>192</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Iteratively prompt pre-trained language models for chain of thought</article-title>
          , in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          , Y. Zhang (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>2714</fpage>
          -
          <lpage>2730</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .emnlp-main.
          <volume>174</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .emnlp-main.
          <volume>174</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <article-title>Exchange-of-thought: Enhancing large language model capabilities through cross-model communication</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>15135</fpage>
          -
          <lpage>15153</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .emnlp-main.
          <volume>936</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .emnlp-main.
          <volume>936</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Diddee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ochieng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nambi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ganu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Segal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          , S. Sitaram, MEGA:
          <article-title>Multilingual evaluation of generative AI</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>4232</fpage>
          -
          <lpage>4267</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .emnlp-main.
          <volume>258</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .emnlp-main.
          <volume>258</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>J. M. Pérez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Rajngewerc</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          <string-name>
            <surname>Giudici</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          <string-name>
            <surname>Furman</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Luque</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          <string-name>
            <surname>Alemany</surname>
            ,
            <given-names>M. V.</given-names>
          </string-name>
          <string-name>
            <surname>Martínez</surname>
          </string-name>
          ,
          <article-title>pysentimiento: A python toolkit for opinion mining and social nlp tasks</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2106</volume>
          .
          <fpage>09462</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Espinosa</given-names>
            <surname>Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <article-title>XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Béchet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Blache</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cieri</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goggi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Isahara</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Mazo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odijk</surname>
          </string-name>
          , S. Piperidis (Eds.),
          <source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>258</fpage>
          -
          <lpage>266</lpage>
          . URL: https://aclanthology. org/
          <year>2022</year>
          .lrec-
          <volume>1</volume>
          .
          <fpage>27</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <article-title>AI@Meta, Llama 3 model card (</article-title>
          <year>2024</year>
          ). URL: https://github.com/meta-llama/llama3/blob/main/ MODEL_CARD.md.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fernández-Iglesias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernandez-Pichel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aragon</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. E. Losada,</surname>
          </string-name>
          <article-title>DepressMind: A depression surveillance system for social media analysis</article-title>
          , in: N.
          <string-name>
            <surname>Aletras</surname>
          </string-name>
          , O. De Clercq (Eds.),
          <article-title>Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, St</article-title>
          . Julians, Malta,
          <year>2024</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>43</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .eacl-demo.
          <volume>5</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>