Stacked Reflective Reasoning in Large Neural Language Models
Notebook for the EXIST Lab at CLEF 2024

Kapioma Villarreal-Haro1,*,†, Fernando Sánchez-Vega1,2,†, Alejandro Rosales-Pérez3,† and Adrián Pastor López-Monroy1,†

1 Mathematics Research Center (CIMAT), Jalisco S/N Valenciana, 36023, Guanajuato, Guanajuato, México.
2 Consejo Nacional de Ciencia y Tecnología (CONACYT), Av. Insurgentes Sur 1582, Col. Crédito Constructor, 03940, CDMX, México.
3 Mathematics Research Center (CIMAT), Monterrey, Av. Alianza Centro 502, Apodaca, 66628, Nuevo León, México.

Abstract
Sexism, far from being merely a conceptual issue, is a concerning and pervasive social health problem that negatively impacts individuals' well-being and perception. In today's digital era, as sexism permeates online platforms, building systems that detect this type of content is a challenging yet essential task. This paper presents the approach of the CIMAT-GTO team to Task 1 of EXIST 2024, which involves identifying tweets with sexism-related content. Our proposal takes advantage of the reasoning capabilities of Llama 3 in a two-step process. First, we generate rationales that analyze the nature of the tweets. Then, in a second step, we let the model reflect on the previously produced reasoning. The intuitive idea is to generate text that supports opposite categories and expect the model to contrast valid and invalid reasons by itself. We then use these generated rationales as extra information to complement the tweets and fine-tune a Twitter-specialized XLM-RoBERTa model. Our experiments show that incorporating Llama 3's rationales improves performance compared to using only the tweets and yields competitive results in the task, demonstrating the potential of these methods.

Keywords
Generative Large Language Models, Large Language Model Reasoning, Stacked Large Language Models, Transformers, Sexism Detection, Social Media

1. Introduction

In today's world, social media is an essential platform for the communication and diffusion of information and opinions among individuals. However, social media interactions often involve misleading or harmful content. For instance, users might directly express bias in their own generated content. Alternatively, they could engage by sharing and commenting on biased content created by other users. Within these interactions, one major social concern is sexism, defined as prejudice or discrimination based on sex or gender [1]. Sexism negatively affects the psychological well-being of women and men not only in everyday face-to-face interactions [2], but also on social media platforms [3]. In this context, social media has been used for two contrary purposes: 1) as a platform for bias dissemination, where misleading information and hateful behavior against women are spread, and 2) as a means for bias awareness and activism, enabling users to address, report, and discuss misogynistic and sexist narratives [4, 5].

Instances of harmful sexist expressions on social media include hostile behavior and negative evaluations of female job candidates [6]. Other problems involve the distribution and consumption of content that perpetuates appearance anxiety, body shame, and eating-disorder behaviors, primarily among women [7]. On the other hand, social media has also been used as a platform for positive-impact behaviors, such as mobilizing digital media in response to shaming, harassment, and rape culture [8].
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
$ kapioma.villarreal@cimat.mx (K. Villarreal-Haro); fernando.sanchez@cimat.mx (F. Sánchez-Vega); alejandro.rosales@cimat.mx (A. Rosales-Pérez); pastor.lopez@cimat.mx (A. P. López-Monroy)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

In the modern world, the efficient identification of sensitive content is crucial due to the vast volume of data, human biases, and the profound impact of this task. Despite advancements in computer science and the deployment of more accurate and sophisticated models, the challenge remains unsolved. In this context, several efforts include shared tasks that address this issue, such as Automatic Misogyny Identification at IberEval [9], Multimedia Automatic Misogyny Identification and Explainable Detection of Online Sexism at SemEval [10, 11], and sEXism Identification in Social neTworks (EXIST) [12, 13].

This paper describes CIMAT-GTO's participation in Task 1 of EXIST 2024, which tackles the binary detection of sexism in tweets. We propose a technique that takes advantage of the reasoning capabilities of Large Language Models (LLMs). In a first step, the LLM generates reasoning supporting each of the target categories of Task 1, sexism-related and not sexism-related. In a second step, we feed this valuable information back to the LLM so that it reflects on its own reasoning. The reasoning outputs are then used to extend the information provided by the tweets when fine-tuning an XLM-RoBERTa model pre-trained on multilingual tweets.

2. EXIST Shared Lab

The EXIST shared tasks aim to detect and capture sexism-related content in social networks while identifying intention and fine-grained topics [12, 13]. EXIST has evolved from analyzing content in text format only to multi-modal content. While the 2023 edition focused on detecting and categorizing sexist tweets, the 2024 edition extended its scope to encompass both tweets and meme images. For tweets, three primary tasks were established.

• Task 1: Binary classification to identify tweets with sexism-related content.
• Task 2: Multi-class classification to identify the intention of the tweet.
• Task 3: Multi-label classification to categorize the types of sexism expressed.

Analogous tasks were introduced for meme images. Systems that address these shared tasks can be set in two contexts: a hard setting, in which systems aim to predict a conventional hard output category or set of categories, and a soft setting, in which systems are instead intended to provide probabilities. In the case of Task 1, the categories consist of sexist and non-sexist tweets. In this paper, we address Task 1 in the hard context.

2.1. Tweet Dataset

The tweet dataset contains 10,034 tweets in Spanish and English: 6920 for training, 1038 for development, and 2076 for testing. Each tweet was labeled by six annotators selected to have different demographic characteristics in order to minimize bias in the labeling. The age ranges and genders of the annotators belonged to the sets {18–22, 23–45, 46+} and {Female, Male}. All tweets were annotated such that there was one annotator from each of the six possible gender-age groups. For Task 1, annotators were required to indicate whether the tweet was related to sexism or not.
In the following, we will refer to these two categories as sexist and not sexist for simplicity, since that is the labeling convention used in EXIST. It is worth noting that the sexist category encompasses not only tweets with direct harmful messages but also tweets where sexism-related situations are being discussed or exposed. Therefore, the task is not limited to detecting direct hateful behavior against women.

Table 1 summarizes the training partition according to language and the label assigned in Task 1. The dataset is mostly balanced between languages and between the sexism-related and not-sexism-related categories. A minority, fewer than 13% of the tweets, do not have a majority class attached to them.

Table 1
Distribution of the training set. Classes are mostly balanced between languages. Among the tweets with no ties between annotator votes, the sexist and not-sexist categories are mostly balanced.

Training (6920)                    Spanish (3660)    English (3260)
No majority class available        466               390
Majority class available           3194              2870
  Related to sexism                1560              1137
  Not related to sexism            1634              1733

3. Previous Work

3.1. Previous Editions of EXIST

During EXIST 2023, the most commonly used approaches to binary sexism classification included variations and combinations of the following [14]:

1. Transformer-based architectures like BERT or RoBERTa. Models were either pre-trained and then fine-tuned or trained from scratch, using general-knowledge text or domain-specific text like tweets.
2. Classical machine learning and deep learning methods utilizing input embeddings from pre-trained models and additional attributes like toxicity and sentiment metrics, or linguistic and handcrafted features.
3. Data augmentation techniques, external datasets, and ensembles of multiple models.
4. Addressing the task as a monolingual problem using separate models for each language, or as a multilingual problem using cross-lingual or translation techniques.

While classical methods and architectures remain popular and competitive, models like GPT, Llama, or Gemini have not yet been deeply explored. In EXIST 2023, GPT-based large language model cascades were shown to be competitive and ranked among the top systems. Although effective, their strategy was used in a classification setting rather than to generate text [15]. We speculate that the generation capabilities of LLMs can provide key ideas, identify elements, and compare arguments that are substantial for a correct label assignment, since LLMs have achieved state-of-the-art results in several tasks even without fine-tuning.

3.2. Large Language Models "Reasoning" Capabilities

Recent research on Large Language Models focuses on exploiting their "reasoning" capabilities. An overview of the current state-of-the-art knowledge on reasoning in LLMs provides a review of the different techniques [16]. Some approaches involve traditional, fully supervised fine-tuning to generate rationales in a specific domain. Others are prompt-based and in-context learning methods that do not require fine-tuning. A third type consists of hybrid approaches that combine the previous two. Given that fine-tuning massive generative LLMs is inefficient in terms of computational resources, prompt-based methods, which leverage the knowledge already encoded in the models, have gained popularity.
Adding prompt pieces in zero-shot scenarios, such as "Think step by step" [17] or "Let's first understand the problem and devise a plan to solve the problem. Then, let's carry out the plan and solve the problem step by step" [18], encourages the models to provide rationales that guide the answer. Other prompts, like "This is very important to my career", have positively influenced performance in some tasks [19]. However, in certain settings, prompts like "Think step by step" might lead the model to produce inaccurate answers or generate harmful content [20].

In few-shot scenarios, techniques like chains of thought have been shown to improve the answers by demonstrating a thought process and encouraging the model to provide its own reasoning behind the answer [21]. The order and quality of the few-shot demonstrations are crucial and impact performance. Some studies propose techniques for finding good permutations of the examples to enhance the quality of the results [22].

Several strategies encourage the model to take advantage of multiple prompts. These can be used to answer the same question, after which consistency across prompts is regularized to obtain a final label [23]. Other techniques propose using external "prompters" that iteratively prompt the LLM to recall a series of knowledge and derive a "chain of thought" [24]. Additional approaches include subdividing the context into questions and enabling cross-model communication during problem-solving to aggregate the answers [25]. All these strategies require a prompt-refining process, to some extent, to provide better context and enhance the use of these generative LLMs. Still, this refinement process is usually not automatic and is done through a highly qualitative assessment.

Another important consideration is the source and target languages used to prompt an LLM. Studies have shown a disparity between the performance of LLMs in English and non-English languages, with LLMs generally performing better in English [26]. These techniques and considerations have become more popular and have been leveraged in large language models like GPT-3, Llama, Gemini, and Claude to generate more accurate answers to different tasks.

4. Baseline

4.1. Preprocessing the data

We focus on the hard-label predictions. Although we do not dismiss the individual annotators' labels, we filter the tweets so that only those classified as sexist or not sexist by majority vote are included. We preprocess the tweets using the library "pysentimiento" [27] in the following way:

1. User handles are substituted with @user.
2. URLs are replaced with the special token url.
3. The # symbol is substituted by the special token hashtag, and the content of multi-word hashtags is split into separate words.
4. Emojis are replaced with their text descriptions.

A sketch of these steps, together with the baseline input preparation, is shown below.

4.2. XLM-RoBERTa fine-tuned

As a baseline, we worked with a Twitter-specific multilingual language model that consists of an XLM-RoBERTa architecture trained on multilingual tweets [28]. We fine-tuned the model to predict the Task 1 labels. The input consists of the tokenized tweets. We built two variants: prediction of the hard binary label alone, and simultaneous prediction of the hard binary label and the single-annotator labels grouped by age and gender. The second variant was chosen because it resulted in a slightly better macro-F1 score for the hard binary labels. We will refer to this system as XLM-RoBERTa-Baseline.
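The following is a minimal sketch of the preprocessing and baseline input preparation described above. It assumes pysentimiento's documented preprocess_tweet function with its user_token, url_token, and hashtag_token arguments, and the XLM-T checkpoint "cardiffnlp/twitter-xlm-roberta-base" from [28]; the exact argument values and the helper name clean are illustrative assumptions, not the paper's code.

```python
# Sketch of the preprocessing (Section 4.1) and baseline tokenization (Section 4.2).
from pysentimiento.preprocessing import preprocess_tweet
from transformers import AutoTokenizer

def clean(tweet: str, lang: str) -> str:
    """Apply the four substitutions of Section 4.1 (assumed configuration)."""
    return preprocess_tweet(
        tweet,
        lang=lang,
        user_token="@user",       # 1. user handles -> @user
        url_token="url",          # 2. URLs -> url
        hashtag_token="hashtag",  # 3. '#' -> hashtag; multi-word hashtags split
        demoji=True,              # 4. emojis -> text descriptions
    )

# Twitter-specialized multilingual XLM-RoBERTa (XLM-T [28], assumed checkpoint).
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-xlm-roberta-base")

example = clean("@someone check this https://t.co/xyz #EverydaySexism 😡", lang="en")
inputs = tokenizer(example, truncation=True, max_length=128, return_tensors="pt")
```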
5. The Proposed Method

Our experimental setup has two stages. In the first stage, we generate "reasoning" texts using an LLM, aiming to understand the tweets' nature. In the second stage, we use the generated texts to further process the tweets with a pre-trained XLM-RoBERTa model. We explain the details in the following sections.

Figure 1: LLM generation of the Positive, Negative, and Stacked Comparison Reasonings of a tweet.

5.1. LLM Stacked Reasoning

We relied on an autoregressive LLM to generate textual analyses of a tweet. Rather than asking the model to assign a label and provide an explanation in one straightforward step, we created rationales that support each of the tweet's target categories, sexist and not sexist. Then, we let the model compare both arguments and choose the most accurate one. We hypothesize that this is better than the direct closed question "Is this tweet sexist?" because, regardless of the validity of the reasonings for each category, the model internally evaluates the correctness of the statements produced. The setting is the same for tweets in English and Spanish, and all the generated analyses are in English. The first two steps occur independently:

1. Positive Reasoning. Generating an analysis that supports the idea that the tweet is related to sexism.
2. Negative Reasoning. Generating an analysis that supports the idea that the tweet is not related to sexism.

In a further step, the LLM is asked to "reflect" on the opposite texts it produced:

3. Stacked Comparison Reasoning. The model is fed the information generated in the Positive and Negative Reasonings and has the chance to compare them.

This process is illustrated in Figure 1. The LLM generates three rationales that provide insight into the nature of the tweet. We refer to these rationales collectively as Tweet Reasonings (marked as purple boxes in Figure 1); they are further used in the next section. As an experimental setting, we ask the model to answer as a gender equality specialist, to respond as unbiasedly as possible, or to be concise. The exact prompts and an output example can be found in Appendix A.

The LLM selected for this reasoning stage was Llama 3, an open-source auto-regressive language model developed by Meta AI [29]. We used the Llama 3 8B Instruct version. In a conclusive step, we ask the model to provide a label synthesizing the reflection made in the Stacked Comparison Reasoning. The Binary Final Answer produced is not used in the second stage of our proposed methodology, but we use it to assess the method's gain in performance. The answers provided by this method are reported as B-StackLlama. Even though this workflow encourages the model to produce better-structured answers, the final classification is still not accurate enough. The second stage, explained in the next section, addresses this issue.
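To make the three-step generation concrete, the following is a minimal sketch of how the Tweet Reasonings could be produced, assuming the Llama 3 8B Instruct checkpoint and a recent transformers version whose text-generation pipeline accepts chat messages. The prompts paraphrase those in Appendix A; the pipeline usage and the helper names ask and tweet_reasonings are illustrative assumptions, not the exact experimental code.

```python
# Sketch of the stacked reasoning generation of Section 5.1.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

SYSTEM = ("You are a gender equality specialist. Think step by step. "
          "Answer as unbiased as possible.")

def ask(system: str, user: str, max_new_tokens: int) -> str:
    messages = [{"role": "system", "content": system},
                {"role": "user", "content": user}]
    out = generator(messages, max_new_tokens=max_new_tokens)
    return out[0]["generated_text"][-1]["content"]  # assistant reply

def tweet_reasonings(tweet: str) -> dict:
    # Steps 1 and 2 run independently; both are capped at 200 new tokens.
    positive = ask(SYSTEM, "Explain why the following tweet contains elements "
                           f"talking about sexism. Tweet: {tweet}", 200)
    negative = ask(SYSTEM, "Explain why the following tweet does not contain "
                           f"elements talking about sexism. Tweet: {tweet}", 200)
    # Step 3 stacks both rationales and lets the model compare them (250 tokens).
    comparison = ask("You are a gender equality specialist. Think step by step. "
                     "Be concise.",
                     f"Consider the following tweet. Tweet: {tweet} "
                     "Which analysis is the most accurate? "
                     f"Analysis 1: {positive} Analysis 2: {negative}", 250)
    return {"positive": positive, "negative": negative, "comparison": comparison}
```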
5.1.1. Fine-tuning of XLM-RoBERTa

In this stage, we test the reasoning provided by the LLM in the previous step as a supplement to the tweets. To enhance the XLM-RoBERTa baseline described before, we experimented with feeding the model the tweet concatenated with the generated reasonings. Due to the maximum-input-token restriction, we are limited to incorporating one reasoning at a time.

Figure 2: Fine-tuning of Tweet-focused XLM-RoBERTa. The input is the concatenation (+), using the special separator token [SEP], of a tweet with either the Positive, Negative, or Stacked Comparison Reasoning generated by an LLM (purple boxes in Figure 1).

We fine-tune three different XLM-RoBERTa models using different inputs. The input of each model consists of the concatenation, using the special separator token [SEP], of a tweet with one of its corresponding Tweet Reasonings generated in the previous stage. The Positive, Negative, and Stacked Comparison Reasonings each yield a different fine-tuned model, which we call P-LLM-R-Stack-Ra, N-LLM-R-Stack-Ra, and C-LLM-R-Stack-Ra, respectively. Figure 2 shows this process.

In addition to the individual {P, N, C}-LLM-R-Stack-Ra models, we create ensembles of individual models, aiming to capture a more accurate label. Ensembles are generated by aggregating the scores that the individual systems assign to the labels; the final label is the one with the highest aggregated score. A sketch of the paired inputs and the score aggregation is shown below.
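The following sketch shows one way the paired inputs and the ensemble aggregation could look. With Hugging Face tokenizers, passing the tweet and the reasoning as a text pair inserts the separator tokens automatically, which corresponds to the [SEP]-style concatenation of Figure 2; the checkpoint name, the equal weighting, and the helper names are assumptions.

```python
# Sketch of the tweet + reasoning pairing and score aggregation (Section 5.1.1).
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-xlm-roberta-base")

# Passing (tweet, reasoning) as a text pair yields <s> tweet </s></s> reasoning </s>
# for RoBERTa-style models, i.e. the separator-joined input of Figure 2.
def pair_input(tweet: str, reasoning: str):
    return tokenizer(tweet, reasoning, truncation=True, max_length=512)

# Ensemble: average the per-class scores of the individual systems with equal
# weights, then pick the class with the highest aggregated score.
def ensemble_label(system_scores: list[np.ndarray]) -> int:
    return int(np.mean(np.stack(system_scores), axis=0).argmax())

# e.g. scores from N-LLM-R-Stack-Ra and C-LLM-R-Stack-Ra for one tweet:
label = ensemble_label([np.array([0.30, 0.70]), np.array([0.45, 0.55])])  # -> 1
```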
6. Tweet Fine-Grained Sexist Evaluation Questionnaire

A more general task can be addressed as an aggregate of fine-grained tasks. In particular, questionnaire-based retrieval models have been used to provide a final diagnosis, as in the case of depression [30]. Following this idea, in addition to the previous approach, we created a list of binary and closed-option questions meant to identify the nature of the tweet. The questionnaire focuses on identifying offensiveness, intention, and whether fine-grained sexism-related topics are expressed directly or passively. The complete questionnaire is included in Appendix B.

The answers to these questions were fed through a multi-layer feed-forward neural network that aimed to predict all three tweet-related tasks. Because this method did not outperform the previous method on the validation set by itself, we used it only to enhance the previous techniques in an ensemble, in the fashion described before. We refer to this method as Q-Llama-MLP. Although this method seems promising, the quality of the answers is sensitive to question formulation. Appendix B shows an example of refining a question through qualitative assessment so that its final versions provide more accurate results. We believe that all the questions in the original questionnaire can be refined into versions that lead to more accurate and representative answers and can further be used both to address the binary task and to identify fine-grained topics.

7. Evaluation Results

In this section, we discuss the performance of our proposals over the validation set and under the official evaluation metrics of EXIST 2024.

7.1. Preliminary evaluation over validation data

This section evaluates the effectiveness and impact of the key components of the methods proposed in the previous sections. We include the system Ensemble𝑁𝐶, the ensemble merging the two best-performing individual systems, N-LLM-R-Stack-Ra and C-LLM-R-Stack-Ra. Extending this ensemble, we also include Ensemble𝑁𝐶𝑄, which incorporates Q-Llama-MLP in addition to N-LLM-R-Stack-Ra and C-LLM-R-Stack-Ra. The ensemble setting is, as described before, the result of aggregating with equal weights the scores that the individual systems assign to each category.

Table 2
Results over the validation set. Hard evaluation. Incorporating the reasonings benefits the system. The best systems overall are ensembles.

System                  Input information                         lang    F1 positive class
XLM-RoBERTa-Baseline    Tweets only                               en/es   0.8291
B-StackLlama            Tweets only                               en/es   0.5966
N-LLM-R-Stack-Ra        Tweets + Negative Reasoning               en/es   0.8489
C-LLM-R-Stack-Ra        Tweets + Stacked Comparison Reasoning     en/es   0.8474
P-LLM-R-Stack-Ra        Tweets + Positive Reasoning               en/es   0.8325
Ensemble𝑁𝐶              {C, N}-LLM-R-Stack-Ra                     en/es   0.8568
Q-Llama-MLP             Questionnaire Answers                     en/es   0.7994
Ensemble𝑁𝐶𝑄             {C, N}-LLM-R-Stack-Ra and Q-Llama-MLP     en/es   0.8591

Table 2 presents the F1 of the positive class for the proposed systems, considering the English and Spanish tweets of the validation set together. As observed, the models {N, C, P}-LLM-R-Stack-Ra outperform the baseline XLM-RoBERTa-Baseline, demonstrating the benefit of incorporating the reasonings over using only the tweets. Incorporating the Negative and Stacked Comparison Reasonings performs slightly better than incorporating the Positive Reasonings. We also observe that B-StackLlama underperforms XLM-RoBERTa-Baseline, indicating the deficiency of relying only on the reasonings. Q-Llama-MLP does not beat XLM-RoBERTa-Baseline either, partly due to the redundancy of the questions and the inability to capture their nature and meaning. Despite that, Q-Llama-MLP slightly enhances the ensemble's performance. The best performance overall belongs to Ensemble𝑁𝐶𝑄.

During the prompting process of B-StackLlama, the Binary Final Answer underestimates sexism-related tweets: the performance by itself is poor because the two-step process is biased toward producing a negative final label. We additionally observed that the length of the Tweet Reasonings also influences the quality of the response: too short, and the analysis might not have enough details; too long, and the answer might become repetitive, affecting performance during the fine-tuning of {N, C, P}-LLM-R-Stack-Ra. We set up the generation so that the Positive and Negative Reasonings contain at most 200 tokens and the Stacked Comparison Reasoning contains at most 250 tokens. An example showing how the answer varies with the token limit is included in Appendix A.

7.2. Official Leaderboard

The official metrics for EXIST 2024 are ICM-hard and the F1 of the positive class, and scores are reported by language [12, 13]. Table 3 summarizes the results obtained over the test set. The single system N-LLM-R-Stack-Ra performs less effectively than Ensemble𝑁𝐶. We hypothesize that the Negative and Stacked Comparison Reasonings complement the tweets with different information during the fine-tuning process. In the case of the Negative Reasoning, as it is provided for all tweets, we expect the model to learn an internal differentiation between accurate and inaccurate facts supporting the not-sexist category. In the case of the Stacked Comparison Reasoning, we expect the model to capture the contrast between reasonings and correct the preference of one over the other when necessary. We think that the distinct aspects and relationships these reasonings capture contribute to the improved performance of Ensemble𝑁𝐶.

The ensemble Ensemble𝑁𝐶𝑄 is the best performer among our systems. In particular, considering the evaluation tweets in English and Spanish together, the difference between the top score and this system is less than 0.01 in both the normalized ICM-Hard and the F1 of the positive class, which shows the competitiveness of our method.
The performance of this model is only slightly better than that of Ensemble𝑁𝐶, and we believe that, even though Q-Llama-MLP did not outperform the individual systems in validation, adding its output to Ensemble𝑁𝐶𝑄 slightly corrects the underestimation of sexist tweets. Performance over the Spanish tweets is slightly better than over the English tweets, and we conjecture that the multilingual setting benefits performance in both languages.

Table 3
Results over the test set. Hard evaluation. The best system overall is Ensemble𝑁𝐶𝑄. The best results are achieved on the Spanish tweets.

System              Rank    lang    ICM-Hard Norm    F1 positive class
Top Score           1       en/es   0.800            0.7944
Ensemble𝑁𝐶𝑄         5       en/es   0.7939           0.7903
Ensemble𝑁𝐶          6       en/es   0.7914           0.7887
N-LLM-R-Stack-Ra    14      en/es   0.7718           0.7694
Top Score           1       es      0.8108           0.8238
Ensemble𝑁𝐶𝑄         4       es      0.8017           0.8123
Ensemble𝑁𝐶          6       es      0.7986           0.8071
N-LLM-R-Stack-Ra    9       es      0.7830           0.7936
Top Score           1       en      0.8153           0.7610
Ensemble𝑁𝐶𝑄         8       en      0.7784           0.7594
Ensemble𝑁𝐶          9       en      0.7767           0.7626
N-LLM-R-Stack-Ra    25      en      0.7517           0.7350

The systems N-LLM-R-Stack-Ra, Ensemble𝑁𝐶, and Ensemble𝑁𝐶𝑄 correspond to the runs CIMAT-GTO_1.json, CIMAT-GTO_2.json, and CIMAT-GTO_3.json in the EXIST official leaderboard.

8. Conclusion

In this paper, we propose a methodology to detect sexism that takes advantage of the rationale-generation capabilities of LLMs, which were not used in EXIST's previous editions. These rationales show the potential of the LLM to support the binary classes and its capability to compare two different reasoning processes. As shown in this paper, the LLM reflection alone is not enough to achieve competitive results, and it is biased toward underestimating sexist tweets, which is not desirable. Instead, using the rationales to complement the tweets enhances the baseline results of an XLM-RoBERTa fine-tuned only on the tweets: it provides information that explores the nature of the tweet and enables a more accurate classification. The results show that the proposed models are competitive and open up the question of how the internal knowledge and reasoning capabilities of autoregressive LLMs can address this task.

In future work, we plan to extend this approach to other tasks, refine the prompts to obtain better rationales, and study the capabilities and limitations of other autoregressive LLMs. The questionnaire results can be explored to dive into the fine-grained classifications of sexism and to analyze source intention and topic classification. A more in-depth exploration of the length of the generated rationales, the effects of prompt variation, alternative LLMs, and bias remains to be done. This method also gives insight into the biases encoded in large language models: the model we chose (Llama 3) can be misled into providing inaccurate explanations for the wrong category and, as observed in the scores, fails to choose the correct classification label by itself.

9. Ethical Concerns

It is important to note that the systems developed in this work predict binary labels based on the annotators' majority vote and might overlook people's perceptions at an individual level. Another important distinction concerns the label names: instead of "sexist" and "not sexist", they could be more accurately described as "sexism-related" and "non-sexism-related" to recognize two different perspectives, negative intention (diffusion of biased content) and positive intention addressing sensitive content (bias awareness and discussion).
The reasoning generated by the LLM puts in evidence the biased internal views it encodes, whereby sexism-related content is underestimated. Using these labels by themselves or trusting the generated rationales should be considered carefully, as they can be highly misleading. Given the implications of deploying detection systems for sensitive content, the proposed solution requires a more in-depth analysis by social scientists and by ethics and fairness experts. Misusing these systems might have significant implications, including the potential non-detection of toxic and dangerous content or the unintended censorship of discussions about social problems.

Acknowledgments

Villarreal-Haro acknowledges CONAHCYT for its support through the program Becas Nacionales para Estudios de Posgrado (CVU 1309535). We thank CONAHCYT for the computing resources provided through the INAOE Supercomputing Laboratory's Deep Learning Platform for Language Technologies and the CIMAT Bajío Supercomputing Laboratory (#300832). Sánchez-Vega acknowledges CONAHCYT for its support through the program "Investigadoras e Investigadores por México" (Project ID 11989, No. 1311). Rosales-Pérez acknowledges CONAHCYT for its support through the project grant Búsqueda de arquitecturas neuronales eficientes y efectivas (CBF2023-2024-2797).

References

[1] A. Stevenson, C. Lindberg, New Oxford American Dictionary, Third Edition, OUP USA, 2010. URL: https://books.google.com.mx/books?id=sZoFRwAACAAJ.
[2] J. Swim, L. Hyers, L. Cohen, M. Ferguson, Everyday sexism: Evidence for its incidence, nature, and psychological impact from three daily diary studies, Journal of Social Issues 57 (2001) 31–53. doi:10.1111/0022-4537.00200.
[3] M. Paciello, F. D'Errico, G. Saleri, E. Lamponi, Online sexist meme and its effects on moral and emotional processes in social media, Comput. Hum. Behav. 116 (2020) 106655. doi:10.1016/j.chb.2020.106655.
[4] E. L. Turley, J. Fisher, Tweeting back while shouting back: Social media and feminist activism, Feminism & Psychology 28 (2018) 128–132. URL: https://api.semanticscholar.org/CorpusID:149235968.
[5] M. Foster, A. Tassone, K. Matheson, Tweeting about sexism motivates further activism: A social identity perspective, The British Journal of Social Psychology (2020). doi:10.1111/bjso.12431.
[6] J. Fox, C. Cruz, J. Y. Lee, Perpetuating online sexism offline: Anonymity, interactivity, and the effects of sexist hashtags on social media, Comput. Hum. Behav. 52 (2015) 436–442. URL: https://api.semanticscholar.org/CorpusID:45231644.
[7] Z. Gong, Concept of beauty in the age of the internet: Impact of social media on appearance anxiety and body shame, Communications in Humanities Research (2023). URL: https://api.semanticscholar.org/CorpusID:266066470.
[8] J. Keller, K. Mendes, J. Ringrose, Speaking 'unspeakable things': documenting digital feminist responses to rape culture, Journal of Gender Studies 27 (2018) 22–36. URL: https://doi.org/10.1080/09589236.2016.1211511. doi:10.1080/09589236.2016.1211511.
[9] E. Fersini, P. Rosso, M. E. Anzovino, Overview of the task on automatic misogyny identification at IberEval 2018, in: IberEval@SEPLN, 2018. URL: https://api.semanticscholar.org/CorpusID:51942244.
[10] E. Fersini, F. Gasparini, G. Rizzi, A. Saibene, B. Chulvi, P. Rosso, A. Lees, J. Sorensen, SemEval-2022 task 5: Multimedia automatic misogyny identification, in: G. Emerson, N. Schluter, G. Stanovsky, R. Kumar, A. Palmer, N. Schneider, S. Singh, S. Ratan (Eds.), Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Association for Computational Linguistics, Seattle, United States, 2022, pp. 533–549. URL: https://aclanthology.org/2022.semeval-1.74. doi:10.18653/v1/2022.semeval-1.74.
[11] H. Kirk, W. Yin, B. Vidgen, P. Röttger, SemEval-2023 task 10: Explainable detection of online sexism, in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.), Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2193–2210. URL: https://aclanthology.org/2023.semeval-1.305. doi:10.18653/v1/2023.semeval-1.305.
[12] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[13] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[14] L. Plaza, J. C. de Albornoz, R. Morante, E. Amigó, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2023 – Learning with Disagreement for Sexism Identification and Characterization (Extended Overview), in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of CLEF 2023 – Conference and Labs of the Evaluation Forum, 2023.
[15] L. Tian, N. Huang, X. Zhang, Efficient multilingual sexism detection via large language model cascades, in: Conference and Labs of the Evaluation Forum, 2023. URL: https://api.semanticscholar.org/CorpusID:264441414.
[16] J. Huang, K. C.-C. Chang, Towards reasoning in large language models: A survey, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1049–1065. URL: https://aclanthology.org/2023.findings-acl.67. doi:10.18653/v1/2023.findings-acl.67.
[17] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Curran Associates Inc., Red Hook, NY, USA, 2024.
[18] L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.-W. Lee, E.-P. Lim, Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2609–2634. URL: https://aclanthology.org/2023.acl-long.147. doi:10.18653/v1/2023.acl-long.147.
[19] C. Li, J. Wang, K. Zhu, Y. Zhang, W. Hou, J. Lian, X. Xie, EmotionPrompt: Leveraging psychology for large language models enhancement via emotional stimulus, 2023.
[20] O. Shaikh, H. Zhang, W. Held, M. Bernstein, D. Yang, On second thought, let's not think step by step! Bias and toxicity in zero-shot reasoning, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 4454–4470. URL: https://aclanthology.org/2023.acl-long.244. doi:10.18653/v1/2023.acl-long.244.
[21] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Curran Associates Inc., Red Hook, NY, USA, 2024.
[22] Y. Lu, M. Bartolo, A. Moore, S. Riedel, P. Stenetorp, Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 8086–8098. URL: https://aclanthology.org/2022.acl-long.556. doi:10.18653/v1/2022.acl-long.556.
[23] C. Zhou, J. He, X. Ma, T. Berg-Kirkpatrick, G. Neubig, Prompt consistency for zero-shot task generalization, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 2613–2626. URL: https://aclanthology.org/2022.findings-emnlp.192. doi:10.18653/v1/2022.findings-emnlp.192.
[24] B. Wang, X. Deng, H. Sun, Iteratively prompt pre-trained language models for chain of thought, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 2714–2730. URL: https://aclanthology.org/2022.emnlp-main.174. doi:10.18653/v1/2022.emnlp-main.174.
[25] Z. Yin, Q. Sun, C. Chang, Q. Guo, J. Dai, X. Huang, X. Qiu, Exchange-of-thought: Enhancing large language model capabilities through cross-model communication, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 15135–15153. URL: https://aclanthology.org/2023.emnlp-main.936. doi:10.18653/v1/2023.emnlp-main.936.
[26] K. Ahuja, H. Diddee, R. Hada, M. Ochieng, K. Ramesh, P. Jain, A. Nambi, T. Ganu, S. Segal, M. Ahmed, K. Bali, S. Sitaram, MEGA: Multilingual evaluation of generative AI, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 4232–4267. URL: https://aclanthology.org/2023.emnlp-main.258. doi:10.18653/v1/2023.emnlp-main.258.
[27] J. M. Pérez, M. Rajngewerc, J. C. Giudici, D. A. Furman, F. Luque, L. A. Alemany, M. V. Martínez, pysentimiento: A Python toolkit for opinion mining and social NLP tasks, 2023. arXiv:2106.09462.
[28] F. Barbieri, L. Espinosa Anke, J. Camacho-Collados, XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 258–266. URL: https://aclanthology.org/2022.lrec-1.27.
[29] AI@Meta, Llama 3 model card, 2024. URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[30] R. Fernández-Iglesias, M. Fernandez-Pichel, M. Aragon, D. E. Losada, DepressMind: A depression surveillance system for social media analysis, in: N. Aletras, O. De Clercq (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, St. Julians, Malta, 2024, pp. 35–43. URL: https://aclanthology.org/2024.eacl-demo.5.

A. LLM Reasoning Prompt

A.1. Prompt template

In the experimental templates, we tried to set an objective environment for the answer through the system role. The exact prompts used for the task are the following. The first two prompts take {tweet} as input and generate the Positive and Negative Reasonings:

Positive Prompt.
System Role: You are a gender equality specialist. Think step by step. Answer as unbiased as possible.
User: Explain why the following tweet contains elements talking about sexism. Tweet: {tweet}
Assistant: {...}

Negative Prompt.
System Role: You are a gender equality specialist. Think step by step. Answer as unbiased as possible.
User: Explain why the following tweet does not contain elements talking about sexism. Tweet: {tweet}
Assistant: {...}

The Positive and Negative Reasonings are limited to a maximum of 200 generated tokens. In a further step, the tweet and both the Positive and Negative Reasonings are provided so that the model evaluates them both.

Comparison Prompt.
System Role: You are a gender equality specialist. Think step by step. Be concise.
User: Consider the following tweet. Tweet: {tweet} Which analysis is the most accurate? Analysis 1: {positive reasoning} Analysis 2: {negative reasoning}
Assistant: {...}

The Stacked Comparison Reasoning is limited to a maximum of 250 tokens.

A.2. Answer samples

We show an example of the LLM output to illustrate the type of reasoning text we generated. The original tweet is the following:

"Writing a uni essay in my local pub with a coffee. Random old man keeps asking me drunk questions when I'm trying to concentrate; ends with "good luck, but you'll just end up getting married and not use it anyway". #everydaysexism is alive and well."

The Positive and Negative Reasonings generated are:

Positive Reasoning: Let's break down the tweet and identify the elements that suggest sexism: 1. The tweet starts by mentioning the author is writing a university essay in a pub, which is a common scenario. This sets the context for the rest of the tweet. 2. The author mentions that a "random old man" is asking them "drunk questions" while they're trying to concentrate. This already has a slightly confrontational tone, as the author is implying that the man is interrupting and being a distraction. 3. The phrase "good luck, but you'll just end up getting married and not use it anyway" is what specifically suggests sexism. The man's comment is making an assumption about the author's future plans and goals, which are likely to be academic in nature. The comment implies that the author's ultimate goal is to get married and abandon their academic pursuits, which is a stereotype often perpetuated about women. This type of comment is a classic example of sexism.
Negative Reasoning: The tweet does not contain elements of sexism because the content is focused on the author's experience of being interrupted by an old man while trying to write an essay, rather than any specific gender-based discrimination or harassment. The tweet is more about the author's frustration with the distraction and the old man's opinion about the author's future plans, which is not related to the author's gender. While the author uses the hashtag #everydaysexism, which is often used to describe the subtle and pervasive forms of sexism that people face in their daily lives, the tweet itself does not contain any explicit sexist language, attitudes, or behaviors. The tweet is more about the author's annoyance with the old man's behavior and the disruption it caused, rather than any gender-based issue. In fact, the tweet could be interpreted as an example of ageism, as the old man's comment about the author's future plans is not specific to the author's gender, but rather a stereotypical comment.

The comparison reasoning, where the model self-evaluates the outputs it produced, is the following:

Stacked Comparison Reasoning: Analysis 1 is the most accurate. The tweet contains elements of sexism because: 1. The old man's comment "good luck, but you'll just end up getting married and not use it anyway" is a stereotypical and gendered assumption about the author's future plans and goals, which are likely to be academic in nature. The comment implies that the author's ultimate goal is to get married and abandon their academic pursuits, which is a stereotype often perpetuated about women. 2. The tweet also contains a subtle tone of annoyance and frustration, which may be attributed to the author's experience as a woman, as they are more likely to face interruptions and condescending comments from men. The other analyses do not accurately identify the elements of sexism in the tweet. Analysis 2 is incorrect because it does not recognize the gendered nature of the old man's comment and the impact it has on the author. Analysis 2 is also incorrect because it does not identify the specific elements of sexism in the tweet; instead, it focuses on broader issues of sexism and gender inequality. Analysis 1 accurately identifies the elements of sexism in the tweet, including the stereotypical and gendered assumption made by the old man.

It is essential to remark on some things. First, the model's answer quality varies with length. If we ask the model to generate the positive reasoning with a smaller token budget, the answer does not provide as many insights supporting the target category. The following is an example of a Positive Reasoning limited to fewer than 50 tokens.

Positive Reasoning: The tweet contains elements of sexism because it describes a situation where the author, a woman, is being subjected to unwanted and inappropriate comments from a stranger, specifically an older man, while she is trying to focus on her academic work.

As we can observe, limiting the rationale to fewer tokens prevents the model from fully breaking down the elements it identifies as relevant, while allowing the model to produce a much longer output results in repetition within the generated rationale. Secondly, due to the safety settings of the model, for certain topics the generated answer is as follows:

Positive Reasoning: I cannot analyze a tweet that contains explicit and harmful content, including sexual harassment.
In the second step of our methodology, we expect that, even though the model does not provide a rationale supporting the targeted position, the model learns that this type of analysis also provides information about the nature of the tweet and helps to identify the topics it contains. Finally, all these reasonings are generated automatically and, as shown in the example, the model is internally capable of supporting both the sexist and not-sexist categories. We noticed that this two-step scheme is biased toward predicting a negative final answer, so these models and the extracted knowledge should be used carefully.

B. Sexist Evaluation Questionnaire

B.1. Refinement of questions

As discussed before, refining the questions impacts the accuracy of the answers. Table 4 shows an example of the qualitative analysis of the evolution of a question when the model is asked to explain the answer it provides.

Table 4
Example of question refinement to get better answers.

Question: Does the tweet discredit feminism?
Version: Original question
Qualitative analysis: The model is biased to hallucinate reasons why the tweet, in an implicit way, discredits feminism. It does not seem to distinguish the feminist movement by itself and rather bases the answer on the ideas feminism discusses.

Question: Does the tweet discredit the feminist movement?
Version: Modification 1 of the question
Qualitative analysis: Answers remained biased, but there is a correction to the object of study: the feminist movement itself.

Question: Does the tweet directly discredit the feminist movement?
Version: Modification 2 of the question
Qualitative analysis: The answer is less biased; reasons are not forced.

B.2. List of questions

The complete list of fine-grained sexism questions is the following. The model was asked to answer with "Yes/No" for binary questions and only with the answer identifier for closed-option questions. A sketch of how the answers could be encoded as features for Q-Llama-MLP follows the list.

1. Does the tweet contain stereotypes or discrimination against women?
2. Choose only the best option:
   a) The tweet is offensive to everyone
   b) The tweet is offensive mostly to women
   c) The tweet is not offensive
3. Choose only the best option:
   a) The tweet talks about sexism and is sexist
   b) The tweet talks about sexism but is not sexist
   c) The tweet is not talking about sexism
4. Choose only the best option:
   a) The tweet is not related to sexism
   b) The tweet is describing or reporting a sexist situation suffered by a woman
   c) The tweet criticizes a sexist behavior
   d) The tweet is expressing a sexist message
5. Does the tweet directly discredit the feminist movement?
6. Does the tweet devalue women's struggles?
7. Does the tweet deny the existence of gender inequality?
8. Does the tweet portray men as victims of gender oppression?
9. Does the tweet treat being male as the default or norm?
10. Does the tweet imply men are superior to women?
11. Does the tweet suggest women are unsuitable for certain tasks?
12. Does the tweet suggest traits or abilities are determined by gender?
13. Does the tweet objectify or dehumanize women?
14. Does the tweet reinforce traditional gender roles for women?
15. Does the tweet contain sexual references?
16. Does the tweet contain sexual harassment towards women?
17. Does the tweet express hatred or misogyny towards women?
18. Does the tweet include threats of violence against women?
19. Does the tweet use gendered insults or slurs?
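As an illustration of how these answers could feed the Q-Llama-MLP classifier of Section 6, the following sketch one-hot encodes the Yes/No and multiple-choice answers into a fixed-length vector and passes it through a small feed-forward network. The encoding scheme, the layer sizes, the binary output head, and the names encode_answers and mlp are illustrative assumptions, not the exact experimental configuration (the actual network predicted all three tweet-related tasks).

```python
# Hypothetical encoding of the 19 questionnaire answers (Appendix B.2) and a
# small feed-forward classifier in the spirit of Q-Llama-MLP (Section 6).
import torch
import torch.nn as nn

# Questions 2-4 are multiple-choice with 3, 3, and 4 options; the rest are Yes/No.
N_OPTIONS = {2: 3, 3: 3, 4: 4}

def encode_answers(answers: dict[int, str]) -> torch.Tensor:
    """Map answers like {1: "Yes", 2: "b", ...} to a flat one-hot feature vector."""
    feats: list[float] = []
    for q in range(1, 20):
        if q in N_OPTIONS:  # one-hot over the options a, b, c, (d)
            onehot = [0.0] * N_OPTIONS[q]
            onehot["abcd".index(answers[q])] = 1.0
            feats.extend(onehot)
        else:               # binary question: 1.0 for "Yes", 0.0 for "No"
            feats.append(1.0 if answers[q] == "Yes" else 0.0)
    return torch.tensor(feats)  # 16 binary + 10 one-hot = 26 features

# A small multi-layer feed-forward network over the encoded answers; the hidden
# size and the single binary head (Task 1 only) are simplifying assumptions.
mlp = nn.Sequential(nn.Linear(26, 64), nn.ReLU(), nn.Linear(64, 2))
answers = {q: "No" for q in range(1, 20)} | {2: "c", 3: "c", 4: "a"}
logits = mlp(encode_answers(answers))
```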