<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">RMIT-IR at EXIST Lab at CLEF 2024 Notebook for the EXIST Lab at CLEF 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tony</forename><forename type="middle">Kim</forename><surname>Smith</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">RMIT University</orgName>
								<address>
									<settlement>Melbourne</settlement>
									<country key="AU">Australia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ruda</forename><surname>Nie</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Tay Nguyen University</orgName>
								<address>
									<settlement>Buon Ma Thuot</settlement>
									<country key="VN">Vietnam</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Johanne</forename><forename type="middle">R</forename><surname>Trippas</surname></persName>
							<email>j.trippas@rmit.edu.au</email>
							<affiliation key="aff0">
								<orgName type="institution">RMIT University</orgName>
								<address>
									<settlement>Melbourne</settlement>
									<country key="AU">Australia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Damiano</forename><surname>Spina</surname></persName>
							<email>damiano.spina@rmit.edu.au</email>
							<affiliation key="aff0">
								<orgName type="institution">RMIT University</orgName>
								<address>
									<settlement>Melbourne</settlement>
									<country key="AU">Australia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">RMIT-IR at EXIST Lab at CLEF 2024 Notebook for the EXIST Lab at CLEF 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">F5BE85C55E241D505993918B90A501BC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>sexism characterization</term>
					<term>large language models</term>
					<term>in-context learning</term>
					<term>multi-modal contrastive learning</term>
					<term>0009-0009-4752-6679 (T. K. Smith)</term>
					<term>0000-0002-1194-2496 (H. R. Nie)</term>
					<term>0000-0002-7801-0239 (J. R. Trippas)</term>
					<term>0000-0001-9913-433X (D. Spina)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes RMIT-IR team's participation in the EXIST Lab at CLEF 2024. The proposed approaches aim to address sexism characterization on microblog posts (Tasks 1, 2, and 3) and sexism identification on memes (Task 4). For Tasks 1-3, we studied the effectiveness of zero-shot In-Context Learning (ICL) <ref type="bibr" target="#b0">[1]</ref> with off-the-shelf pre-trained Large Language Models (LLMs) to mimic the scenario of minimal intervention of a practitioner aiming to build sexism characterization systems. Our approaches for meme classification (Task 4) utilize CLIP (Contrastive Language-Image Pre-training) [2] to experiment with multi-modal embeddings and zero-shot sexism identification models. We report the performance of our approaches under the learning with disagreements regime (Soft evaluation) and also for label predictions (Hard evaluation). The code of our submission is available at https://github.com/rmit-ir/exist2024/.</p><p>Warning: Some of the examples included in this paper may contain offensive language and explicit descriptions of sexist behavior, which may be disturbing to the reader.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Social media has had a considerable impact on human societies. Applications such as Facebook, YouTube, Instagram, and TikTok have helped move the zeitgeist while creating large communities in the millions. However, many social media platforms have issues with people creating and posting harmful information. The challenge of detecting and managing such harmful content remains a concern for social media companies, contributing to consequences ranging from misinformation to adverse effects on mental health <ref type="bibr" target="#b2">[3]</ref>. In addition, the rise of social media has empowered influencers who often unwittingly or deliberately propagate harmful stereotypes and negative gender norms. This type of content attracts an audience and drives advertising revenue, perpetuating a cycle of negativity <ref type="bibr" target="#b3">[4]</ref>. As a result, it often fosters negative behaviour towards women and minority groups, impacting many people negatively <ref type="bibr" target="#b4">[5]</ref>.</p><p>Sexism is the belief that the members of one sex or gender are less than the members of the other sex, especially that women are less able than men <ref type="bibr" target="#b5">[6]</ref>. This can be categorized into hostile sexism and benevolent sexism.</p><p>Sexism can limit the opportunities and roles people of different sexes and genders are expected to take. It can be conveyed through any form of expression, like images, cartoons, memes, objects, gestures, and symbols, and can be spread offline or online. This oppression can take different forms, such as economic exploitation and social domination <ref type="bibr" target="#b6">[7]</ref>.</p><p>Sexist attitudes and behaviours can perpetuate stereotypes of social and gender roles based on one's biological sex. 
Usually, people are socialized with sexist concepts that teach traditional gender roles for males and females <ref type="bibr" target="#b7">[8]</ref>. Hostile sexism represents a form of sexist ideology, marked by explicit hostility towards women and the perception of them as inferior and submissive <ref type="bibr" target="#b8">[9]</ref>. This deeply ingrained perception often results in the mistreatment of women at both individual and institutional levels <ref type="bibr" target="#b7">[8]</ref>. Benevolent sexism is a nuanced manifestation that ingrains in men the belief that they should be responsible for providing for women in intimate relationships <ref type="bibr" target="#b8">[9]</ref>. This belief system dictates specific roles and behaviours for women, such as expecting them to demonstrate motherly instincts, subtly reinforcing traditional gender roles. A society that has high rates of hostile and benevolent sexism often has high rates of violence against women, such as domestic violence, rape, and the commodification of women and their bodies <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>.</p><p>There has been a recent increase in research on identifying different forms of hate speech, corresponding with advancements in generative pre-trained transformers and, in general, large language models (LLMs) <ref type="bibr" target="#b11">[12]</ref>. Researchers are asking how LLMs can be trained to identify subtle and overt sexist content <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>. However, many questions on how state-of-the-art LLMs can be used for sexism detection are still open. What criteria should be used to evaluate what constitutes sexism in varied cultural contexts? If a dataset with binary classifications<ref type="foot" target="#foot_0">1</ref> is employed, can a machine learning model accurately capture the nuances within the text? 
And how do we address the evolution of language with new slang and phrases continually emerging? These questions highlight the complexity of sexism detection. The cost and technical skills required to create a system incorporating LLMs that can identify sexism make it unattainable for most individuals. We aim to simplify the process using pre-trained LLMs and prompts to address the EXIST lab tasks of classifying and labelling tweets.</p><p>In addition to the text classification in Tasks 1-3, we address the problem of identifying sexism in multi-modal formats for Task 4. Memes -ideas, images, or videos that are spread very quickly on the internet <ref type="bibr" target="#b15">[16]</ref> -exist not only in text form but also include any accompanying images. Therefore, combining text and the attached image (i.e., making the input multi-modal) can be more conducive to identifying whether a meme is sexist. Multi-modal models are usually proposed to deal with multi-modal datasets for classification tasks. Among existing multi-modal systems, Contrastive Language-Image Pre-Training (CLIP) <ref type="bibr" target="#b1">[2]</ref> is a powerful vision-and-language (VL) pre-trained model that learns directly from raw text about images. In addition, CLIP has the ability to map data of different modalities, text and images, into a shared embedding space. Hence, CLIP has been shown to be a powerful tool for zero-shot image and text classification <ref type="bibr" target="#b1">[2]</ref>. Furthermore, CLIP can be beneficial for image-text feature fusion, which can boost model performance on natural language processing (NLP) downstream tasks such as text classification <ref type="bibr" target="#b16">[17]</ref> and multi-modal sarcasm detection <ref type="bibr" target="#b17">[18]</ref>. 
Motivated by the success of CLIP on various VL downstream tasks, this study aims to investigate the following research questions for Task 4:</p><p>• How effective is CLIP for zero-shot sexism identification?</p><p>• How can the naturally inherited multi-modal knowledge from pre-trained CLIP be extracted to identify sexism effectively?</p><p>To address the first research question, we propose Prompt-CLIP for zero-shot sexism identification. For the latter question, we employ CLIP to perform supervised sexism classification.</p><p>Inspired by the impressive performance of multi-view CLIP for sarcasm detection in a previous study <ref type="bibr" target="#b17">[18]</ref>, we adopted multi-view CLIP for supervised sexism classification, namely, text-image multi-view CLIP (TIMV-CLIP), and proposed text-image multi-modal models via CLIP-Guided Learning (TI-CLIP) as a baseline. The paper is organized as follows. Details about the tasks we participated in are described in Section 2. Section 3 provides details about the proposed approaches. In Section 4, we provide and discuss the results. Finally, we conclude in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Tasks Addressed</head><p>The sEXism Identification in Social neTworks (EXIST) <ref type="bibr" target="#b18">[19]</ref> lab at the Conference and Labs of the Evaluation Forum (CLEF) 2024 <ref type="bibr" target="#b19">[20]</ref> aims to identify and characterize sexism using the learning with disagreements paradigm <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23]</ref>. This edition of the EXIST lab consists of sexism characterization on microblog posts (tweets) and memes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Tasks 1-3: Sexism Characterization of Microblog Posts</head><p>• Task 1: Addresses sexism identification in tweets as a binary classification, requiring the system to classify whether a tweet is sexist (YES) or not (NO).</p><p>• Task 2: Focuses on determining the source intention in tweets as a multi-class classification, requiring the system to classify the tweet's intention as Direct, Reported, or Judgemental.</p><p>• Task 3: Involves sexism categorization in tweets as a multi-label classification, requiring the system to classify tweets into categories such as Ideological Inequality, Stereotyping Dominance, Objectification, Sexual Violence, and Misogyny-Non-Sexual Violence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Task 4: Sexism Identification of Memes</head><p>While the above tasks address sexism identification in text, Task 4 deals with multi-modal input. Task 4 aims to address sexism identification as a binary classification, requiring the systems to classify whether a given meme is sexist or not. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Evaluation approaches</head><p>• Soft-Soft Evaluation: For systems that produce probabilities for each category, soft-soft evaluation is provided to compare the probabilities assigned by the systems with those assigned by the set of human annotators. The official evaluation metric is ICM-soft <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b22">23]</ref>. Cross Entropy is also reported.</p><p>• Hard-Hard Evaluation: Hard labels are derived from the different annotators' labels through a probabilistic threshold computed for each task. Hard-hard evaluation is provided to evaluate systems that return Hard labels as output by comparing against a ground truth that combines multiple annotations into one. The original ICM <ref type="bibr" target="#b24">[25]</ref> and 𝐹1 score are used as evaluation metrics.</p></div>
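To make the Soft evaluation setting concrete, the sketch below (a simplified illustration, not the official ICM-soft implementation) derives a soft label from six hypothetical annotator votes and computes the reported Cross Entropy against a system's predicted probabilities:

```python
import math

def soft_label(annotations, classes):
    # Derive a soft label from raw annotator votes,
    # e.g. 4 YES / 2 NO yields {YES: 0.667, NO: 0.333}.
    total = len(annotations)
    return {c: annotations.count(c) / total for c in classes}

def cross_entropy(system_probs, gold_probs, eps=1e-12):
    # Cross entropy of the system distribution against the
    # gold (annotator-derived) distribution; eps guards log(0).
    return -sum(gold_probs[c] * math.log(max(system_probs.get(c, 0.0), eps))
                for c in gold_probs)

# Hypothetical instance: six annotator votes and a system prediction.
gold = soft_label(["YES", "YES", "YES", "YES", "NO", "NO"], ["YES", "NO"])
system = {"YES": 0.7, "NO": 0.3}
ce = cross_entropy(system, gold)
```

Note that ICM-soft additionally accounts for the information content of each category, so it is not reducible to this cross entropy.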
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Proposed Approaches</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Unsupervised In-Context Learning for Sexism Characterization in Microblog Posts</head><p>Our goal was to examine the procedure of developing a functional solution with readily available LLMs while minimizing the manual effort required from the practitioner. As shown in Figure <ref type="figure">1</ref>, the basic architecture involves the practitioner issuing a labeling or classification task and asking the LLM to generate an accurate output. To ensure that the responses followed the predefined criteria for each task, the outputs were systematically stored in JSON format and manually inspected for errors in the "value" field. Responses such as "yes", "YES.", or variations with additional text or punctuation like "Yes, the ... is sexist" required manual corrections to conform to the expected format. Instances of token rate limits that resulted in HTTP errors were addressed by re-running the task for the affected tweet using its unique ID. These occurrences were uncommon, making manual correction a more efficient solution than an automated task given the time constraint. The prompts used for runs submitted to Tasks 1, 2, and 3 were designed with multiple parts:</p><p>• Definition of the underlying concept being addressed in the task (e.g., sexism): Sexism, prejudice or discrimination based on sex or gender, especially against women and girls. Although its origin is unclear, the term sexism emerged from the "second-wave" feminism of the 1960s through '80s and was most likely modeled on the civil rights movement's term racism (prejudice or discrimination based on race). Sexism can be a belief that one sex is superior to or more valuable than another sex. It imposes limits on what men and boys can and should do and what women and girls can and should do. 
The concept of sexism was originally formulated to raise consciousness about the oppression of girls and women, although by the early 21st century it had sometimes been expanded to include the oppression of any sex, including men and boys, intersex people, and transgender people.</p><p>• Instruction to Address Task and to Obtain Consistent Outputs: You are a robot who detects sexism from text given in the prompt.</p><p>• Perspectivism:</p><p>- • Instance to Classify:</p><formula xml:id="formula_0">#### [tweet] ####</formula><p>An example of a prompt submitted and output obtained from gpt-4-turbo to classify instance with id_EXIST 600090 using RMIT-IR_3 for Task 2 (Soft):</p><p>• Input: Sexism, prejudice or discrimination based on sex or gender, especially against women and girls. Although its origin is unclear, the term sexism emerged from the "second-wave" feminism of the 1960s through '80s and was most likely modeled on the civil rights movement's term racism (prejudice or discrimination based on race). Sexism can be a belief that one sex is superior to or more valuable than another sex. To find the distribution of responses, we initially tried to use GPT to figure out the likelihood percentage. Unfortunately, GPT only gave absolute values (either 100 or 0) or a consistent split of 70/30 most of the time. We directed the model to generate six responses for each tweet, which matched the number of annotators per tweet. For example, a set of responses like "YES", "NO", "YES", "NO", "YES", and "NO" would result in a calculated distribution of 50%. Additionally, in our final submissions, experimental runs two and three included prompts that provided additional context, such as the annotators' gender or educational backgrounds. This was done to see if providing relevant background information would improve the LLM's ability to predict annotator responses. 
The formats for each task can be seen above along with an example prompt used in RMIT-IR_3 for Task 2 (Soft).</p></div>
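The aggregation of six generated answers into a soft label, as described above, can be sketched as follows (an illustrative reconstruction; the exact parsing rules are in our repository, and the regular expression and normalization steps here are assumptions):

```python
import re
from collections import Counter

VALID = {"YES", "NO"}  # Task 1 label set; Tasks 2-3 use larger sets.

def parse_responses(raw, valid=VALID):
    # Extract bracketed answers like "[NO], [YES]", normalizing case
    # and discarding answers outside the expected label set.
    tokens = [t.strip().upper() for t in re.findall(r"\[([^\]]+)\]", raw)]
    return [t for t in tokens if t in valid]

def to_distribution(responses):
    # Six responses mimic the six annotators per tweet;
    # label frequency gives the soft label.
    counts = Counter(responses)
    total = len(responses)
    return {label: counts[label] / total for label in counts}

dist = to_distribution(parse_responses("[YES], [NO], [YES], [NO], [YES], [NO]"))
# dist is {"YES": 0.5, "NO": 0.5}
```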
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1.">Runs Submitted to Tasks 1-3</head><p>We used OpenAI's API to submit prompts to the pre-trained model gpt-4-turbo-2024-04-09 <ref type="bibr" target="#b25">[26]</ref>. For each tweet in the test set, we instantiated the prompt from above by appending the textual content of the instance. We used the syntax #### [tweet] #### to provide explicit delimiters to the model. For Soft tasks, we asked for six instances and then created a distribution based on the frequency of the predicted labels. We experimented with multiple versions of prompt templates using the development set supplied by the EXIST organizers (we did not use the training set).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Summary of the runs submitted to Tasks 1-3. The Output Format was according to the type of task (Soft and Hard) as detailed previously. The instance to classify was appended at the end of the prompt. All three runs shared the Definition, Instruction, and Output Format components; RMIT-IR_1 used no perspectivism clue, RMIT-IR_2 added the annotators' Level of Education, and RMIT-IR_3 added both Level of Education and Gender.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We found that the following elements were especially effective in directing the model to concentrate on the specific task and to ensure the responses were properly formatted (i.e., single-word answers and capitalized):</p><p>• Employing a role-playing technique of framing the task with the prompt "You are a robot who detects sexism from text given in the prompt."</p><p>• Giving explicit formatting instructions such as "Give me 6 answers with NO or YES. Format: [NO], [YES]".</p></div>
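A minimal sketch of how such a prompt can be assembled (an illustration of the structure only; the component texts are abbreviated here, and the resulting string would be sent as a single user message to gpt-4-turbo-2024-04-09 via OpenAI's chat API):

```python
# Abbreviated stand-ins for the full prompt components described above.
DEFINITION = ("Sexism, prejudice or discrimination based on sex or gender, "
              "especially against women and girls.")
INSTRUCTION = "You are a robot who detects sexism from text given in the prompt."
OUTPUT_FORMAT = "Give me 6 answers with NO or YES. Format: [NO], [YES]"

def build_prompt(tweet, perspectivism=""):
    # Assemble the prompt parts in the order used for Tasks 1-3 and
    # append the instance wrapped in explicit #### delimiters.
    parts = [DEFINITION, INSTRUCTION, perspectivism, OUTPUT_FORMAT,
             "#### " + tweet + " ####"]
    return "\n".join(p for p in parts if p)

prompt = build_prompt("example tweet text")
```

The #### delimiters mark the instance boundaries so the model does not conflate the tweet with the instructions.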
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Multi-modal Contrastive Learning for Sexism Identification on Memes</head><p>Inspired by the successful applications of CLIP <ref type="bibr" target="#b1">[2]</ref> for NLP <ref type="bibr" target="#b16">[17]</ref> and computer vision tasks <ref type="bibr" target="#b26">[27]</ref>, <ref type="bibr" target="#b27">[28]</ref>, we adopted CLIP for the sexism identification task (Task 4). Unlike conventional methods that rely heavily on labelled image-text pairs, CLIP is a cross-modality model pre-trained with 400M noisy image-text pairs collected from the internet to learn high-level semantic features. CLIP consists of two encoders that embed texts and images into a shared embedding space.</p><p>For a matched image-text pair, CLIP is encouraged to maximize the cosine similarity between the embeddings of the two modalities; for unmatched pairs, the similarity is minimized, enabling the model to find the most suitable image-text pairings. Our motivation for using CLIP-based learning for sexism identification is to capture cross-modal ambiguity by explicitly measuring the correlation between texts and images of targeted memes and to guide the feature-fusing and decision-making stages. We propose Prompt-CLIP to address zero-shot sexism classification, and two supervised contrastive learning models based on CLIP: Text-Image multi-modal model via CLIP-guided learning (TI-CLIP) and Text-Image Multi-View multi-modal model via CLIP-guided learning (TIMV-CLIP). The architecture of TIMV-CLIP is shown in Figure <ref type="figure" target="#fig_1">2</ref>.</p><p>• TI-CLIP: The overall architecture of TI-CLIP consists of two feature encoding models used to encode texts and images. 
These embeddings are then combined into a multi-modal embedding before passing into a feedforward network for sexism classification.</p><p>• TIMV-CLIP: We adopted a novel multi-view CLIP framework (MV-CLIP) <ref type="bibr" target="#b17">[18]</ref> for sexism identification, namely TIMV-CLIP (Figure <ref type="figure" target="#fig_1">2</ref>). In addition to encoding image and text as TI-CLIP, TIMV-CLIP further considers modelling relationships across text and image  modality using a transformer encoder, which aims to capture the interaction across different modalities. Unlike MV-CLIP in a previous study <ref type="bibr" target="#b17">[18]</ref>, TIMV-CLIP employs BERT Base Multilingual (mBERT) <ref type="bibr" target="#b28">[29]</ref> to encode texts.</p><p>• Prompt-CLIP: Prompt-CLIP performs zero-shot sexism identification. Prompt-CLIP uses a pre-trained CLIP model to create a custom classifier without training and considers images as inputs. It further encodes pre-defined classes (sexism and not sexism) with more description, known as prompts, into a learned latent space, and compares their similarity to the image latent space. In this study, we used "an image contains no information about sexism" and "an image contains information about sexism and against women" as prompts for Prompt-CLIP. The pre-trained text encoder transforms the class names (e.g., prompts) into a text embedding vector, while the pre-trained Image Encoder embeds the image.</p><p>• Model Training: We first randomly split the training subset into training (80%) and validation (20%) for cross-validation purposes. We implemented TI-CLIP and TIMV-CLIP based on the Hugging Face library <ref type="bibr" target="#b29">[30]</ref> and adopted clip-vit-base-patch32 as the backbone. Both TI-CLIP and TIMV-CLIP were trained directly with Soft labels. We use Adam as an optimizer to optimize the parameters in both TI-CLIP and TIMV-CLIP models. 
After several trials with other hyperparameters, we selected the parameters that performed best on the validation set: a batch size of 32, a learning rate of 1e-6 for CLIP and 5e-4 for the remaining components, a dropout rate of 0.3, and 10 training epochs. </p></div>
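The zero-shot mechanism behind Prompt-CLIP can be illustrated with toy vectors (a didactic sketch: real embeddings come from the clip-vit-base-patch32 encoders and are 512-dimensional, and the 3-d vectors and temperature value here are assumptions):

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_probs(image_emb, prompt_embs, temperature=0.01):
    # CLIP-style zero-shot classification: softmax over scaled cosine
    # similarities between the image embedding and each class-prompt embedding.
    sims = [cosine(image_emb, p) / temperature for p in prompt_embs]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

# Toy 3-d embeddings standing in for CLIP's 512-d outputs.
image = [0.9, 0.1, 0.0]
prompts = [[1.0, 0.0, 0.0],   # "an image contains no information about sexism"
           [0.0, 1.0, 0.0]]   # "an image contains information about sexism ..."
probs = zero_shot_probs(image, prompts)
```

The resulting probabilities can be read directly as a Soft prediction for the two classes.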
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Runs Submitted to Task 4</head><p>The proposed multi-modal sexism identification models mainly focus on Soft label predictions.</p><p>For the Hard submissions, hard labels are directly assigned by applying the max function, i.e., based on the highest probability score. Table <ref type="table" target="#tab_3">2</ref> presents the submitted runs for Task 4, which can be summarized as follows:</p><p>• RMIT-IR_1: For the first submission, the trained TI-CLIP model was used to predict whether given memes are sexist or not sexist.</p><p>• RMIT-IR_2: We used the trained TIMV-CLIP to generate the second submission.</p><p>• RMIT-IR_3: Prompt-CLIP was used to predict Soft and Hard labels for the third submission.</p></div>
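The Hard-label derivation described above amounts to an argmax over the Soft probabilities, e.g.:

```python
def to_hard_label(soft):
    # Hard label = class with the highest probability (ties broken arbitrarily).
    return max(soft, key=soft.get)

label = to_hard_label({"sexist": 0.38, "not sexist": 0.62})
# label is "not sexist"
```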
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and Discussion</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Tasks 1-3</head><p>The performance of our proposed approaches for Tasks 1, 2, and 3 is presented in Tables <ref type="table" target="#tab_5">3,  4</ref>, and 5, respectively. As shown in Table <ref type="table" target="#tab_4">3</ref>, simplifying the architecture required for creating an LLM-based system that can identify sexism shows promise, as evidenced by the classification of English and Spanish tweets. Although the current model achieved a 49% ICM-Soft score, which is 17% lower than the best-performing run, this result indicates the potential of prompting for classification tasks. The cost of this process, which included testing with a development dataset and submitting with a gold dataset, was close to $150 AUD. Notably, it did not require knowledge of cloud computing, expensive hardware, or much energy. The average time taken to produce an output was about 90 minutes. Looking at Table <ref type="table" target="#tab_4">3</ref>, Task 1, the use of the Task-specific Prompt yielded an ICM-Soft Norm score of 49%, securing the 23rd position overall. Interestingly, the inclusion of additional clues such as the annotator's education level and gender did not bolster performance; instead, it diminished the score. Moreover, the model scored higher on Spanish test instances than on English test instances, despite GPT's predominantly English training. This underscores the robust cross-lingual applicability of the model. In Task 2, our analysis revealed challenges in multi-class classification, as shown in Table <ref type="table" target="#tab_5">4</ref>. The approach yielded a 13% ICM-Soft Norm score, indicating considerable difficulty in discerning the intention of the tweets. 
The introduction of additional clues, such as the annotator's education level, generally led to a decline in performance. However, adding gender information resulted in a slight improvement, elevating the score from almost 0% to 3%. The results indicated that the approach performed more effectively in English without additional clues; however, its performance diminished once clues were introduced. Conversely, our analysis demonstrated that the GPT model exhibited greater efficacy with clues in Spanish, suggesting potential advantages of providing contextual information in non-English scenarios. Task 3, a multi-label classification, was also challenging. The initial ICM-Soft Norm score, as shown in Table <ref type="table" target="#tab_6">5</ref>, stood at 11%. Following a similar trend to Task 2, the integration of clues such as the annotator's level of education resulted in a reduction in performance to 8%. When both education and gender clues were included, performance decreased further to 4%. The English scores are quite similar to those of Task 2. Notably, the initial performance on the Spanish dataset exceeded that of Task 2 but declined with the addition of educational clues and declined further with the incorporation of both education and gender clues. This observation underscores a consistent pattern of diminishing returns with the incorporation of more specific annotator information.</p><p>Our approaches for the second and third runs involve few-shot and in-context learning. The experiments for these runs were conducted using gpt-4-turbo; future tests should include gpt-4o along with other pre-trained LLMs to determine their efficacy in this context. Our experimental results show that involving few-shot and in-context learning does not improve model performance on sexism identification in tweets (as shown in Tables <ref type="table" target="#tab_6">3-5</ref>). 
Although prompting requires less coding and understanding of LLMs, producing the exact desired response 100% of the time was challenging. The prompts had to be carefully designed to ensure that the GPT provided a single and consistent answer, especially when dealing with distributions. Although pre-processing text for an LLM is more complex than post-processing answers from GPTs, ensuring the response is in the correct format is simpler. Another unexpected aspect of this architecture was the ability to assist the GPT with hints. We tested how adding biases by including the annotator's education level and gender affected its ability to classify or label tweets. This gave insights into how such biases can influence model performance and classification accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Task 4</head><p>Table <ref type="table" target="#tab_7">6</ref> presents the results of the proposed approaches for Task 4. Among the proposed approaches, TIMV-CLIP performs best in all cases (English+Spanish, English, or Spanish test instances) considering both Soft and Hard evaluation scenarios. This indicates the importance of effectively utilizing deep interactions between the texts and images of memes with CLIP. Furthermore, TIMV-CLIP (RMIT-IR_2) achieved the best performance on English test instances with an ICM-Soft Norm score of 0.4998, ranking first on the leaderboard for the Soft evaluation on English test instances. This observation confirms the advantages of CLIP for text-image pair classification tasks. However, the performance of TIMV-CLIP dropped on Spanish test instances, which lowered its overall performance across all test instances. We believe using a translation component for Spanish text in memes could lead to better overall performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>This paper proposed unsupervised in-context learning with off-the-shelf pre-trained LLMs to address sexism characterization on microblog posts (Tasks 1, 2, and 3). Dealing with multi-modal inputs, we proposed multi-modal contrastive learning, including Prompt-CLIP, TI-CLIP, and TIMV-CLIP for sexism identification in memes (Task 4).</p><p>The results of our experiment demonstrated the effectiveness of TIMV-CLIP under the Learning with Disagreements regime, indicating the need to consider capturing sexism cues from different perspectives, including image, text, and image-text interactions.</p><p>Future work includes further experimentation with unsupervised In-Context Learning in other tasks or meta-tasks such as MonsterCLEF <ref type="bibr" target="#b30">[31]</ref>, and the inclusion of machine translation for multi-modal contrastive learning.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Level of Education: For each response, consider the perspective of individuals representing the following study levels: [study_levels_annotators] -Level of Education and Gender: For each response, consider the perspective of individuals representing the following study levels: [study_levels_annotators] and gender: [gender_annotators]. • Output Format: -Task 1 (Soft): Give me 6 answers with NO or YES. Format: [NO], [YES] -Task 1 (Hard): Give me 1 answer with [NO] or [YES] -Task 2 (Soft): Give me 6 answers with NO, DIRECT, REPORTED or JUDGEMEN-TAL using commas for each answer. Example: [NO], [DIRECT], [REPORTED], [JUDGEMENTAL], [JUDGEMENTAL], [NO] -Task 2 (Hard): Give me 1 answer with NO, DIRECT, REPORTED or JUDGEMEN-TAL using commas for each answer. 
Example: [NO], [DIRECT], [REPORTED], [JUDGEMENTAL], [JUDGEMENTAL], [NO] -Task 3 (Soft): Give me 6 answers with NO, IDEOLOGICAL-INEQUALITY, STEREOTYPING-DOMINANCE, OBJECTIFICATION, SEXUAL-VIOLENCE, or MISOGYNY-NON-SEXUAL-VIOLENCE using commas for each answer. Example: [NO], [IDEOLOGICAL-INEQUALITY], [STEREOTYPING-DOMINANCE], [OBJECTIFICATION], [SEXUAL-VIOLENCE], [MISOGYNY-NON-SEXUAL-VIOLENCE] -Task 3 (Hard): Give me 1 answer with NO, IDEOLOGICAL-INEQUALITY, STEREOTYPING-DOMINANCE, OBJECTIFICATION, SEXUAL-VIOLENCE, or MISOGYNY-NON-SEXUAL-VIOLENCE using commas for each answer. Example: [NO], [IDEOLOGICAL-INEQUALITY], [STEREOTYPING-DOMINANCE], [OBJECTIFICATION], [SEXUAL-VIOLENCE], [MISOGYNY-NON-SEXUAL-VIOLENCE]</figDesc></figure>
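The Soft output format described above asks the LLM for six bracketed answers, one per simulated annotator, which must then be turned into a label distribution for Soft evaluation. A minimal sketch of such a parser; the function name and regular expression are our own illustrative choices, not the exact implementation used in the runs:

```python
import re
from collections import Counter

def parse_soft_answers(response, labels):
    """Parse a '[NO], [DIRECT], ...' style LLM response into a soft
    label distribution over the given task labels."""
    # Keep only bracketed tokens that are valid labels for this task.
    found = [m for m in re.findall(r"\[([A-Z-]+)\]", response) if m in labels]
    if not found:
        return {lab: 0.0 for lab in labels}
    counts = Counter(found)
    # Normalize annotator votes into probabilities.
    return {lab: counts.get(lab, 0) / len(found) for lab in labels}

resp = "[NO], [DIRECT], [REPORTED], [JUDGEMENTAL], [JUDGEMENTAL], [NO]"
dist = parse_soft_answers(resp, ["NO", "DIRECT", "REPORTED", "JUDGEMENTAL"])
```

In practice a parser like this would also need to handle malformed responses (missing brackets, fewer than six answers), which the sketch only partially covers via the label filter.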
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The architecture of our proposed multi-modal model TIMV-CLIP for supervised sexism identification on memes (Task 4).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>It imposes limits on what men and boys can and should do and what women and girls can and should do. The concept of sexism was originally formulated to raise consciousness about the oppression of girls and women, although by the early 21st century it had sometimes been expanded to include the oppression of any sex, including men and boys, intersex people, and transgender people. You are a robot who detects sexism from text given in the prompt. For each response, consider the perspective of individuals representing the following</figDesc><table /><note>study levels: ["High school degree or equivalent", "Bachelor's degree", "Bachelor's degree", "Bachelor's degree", "Bachelor's degree", "High school degree or equivalent"]. Give me 6 answers with NO, DIRECT, REPORTED or JUDGEMENTAL using commas for each answer. Example: [NO], [DIRECT], [REPORTED], [JUDGEMENTAL], [JUDGEMENTAL], [NO]. #### Girls, don't let anyone ever tell you, you're not as good as a man #gender #girlpower #equity #### • Output: [NO],[NO],[NO],[NO],[NO],[NO]</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Runs submitted to Task 4: sexism identification on memes.</figDesc><table><row><cell>Run</cell><cell>Model</cell></row><row><cell cols="2">RMIT-IR_1 TI-CLIP (feedforward network)</cell></row><row><cell cols="2">RMIT-IR_2 TIMV-CLIP (Transformer encoder)</cell></row><row><cell cols="2">RMIT-IR_3 Prompt-CLIP (zero-shot)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3</head><label>3</label><figDesc>Results of the proposed approaches for Task 1 (Soft).</figDesc><table><row><cell>Rank</cell><cell>ICM-Soft</cell><cell>ICM-Soft Norm</cell><cell>Cross Entropy</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4</head><label>4</label><figDesc>Results of the proposed approaches for Task 2 (Soft).</figDesc><table><row><cell>Rank</cell><cell>ICM-Soft</cell><cell>ICM-Soft Norm</cell><cell>Cross Entropy</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 5</head><label>5</label><figDesc>Results of the proposed approaches for Task 3 (Soft).</figDesc><table><row><cell>Rank</cell><cell>ICM-Soft</cell><cell>ICM-Soft Norm</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 6</head><label>6</label><figDesc>Results of the proposed approaches for Task 4 (Soft and Hard).</figDesc><table><row><cell></cell><cell cols="6">Rank (Soft) ICM-Soft ICM-Soft Norm Cross Entropy ICM-Hard ICM-Hard Norm</cell><cell>F1 YES</cell></row><row><cell>All Test Instances (English + Spanish)</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>EXIST2024_gold</cell><cell>0</cell><cell>3.1107</cell><cell>1.0000</cell><cell>0.5852</cell><cell>0.9832</cell><cell>1.0000</cell><cell>1.0000</cell></row><row><cell>EXIST2024_majority_class</cell><cell>36</cell><cell>−2.3568</cell><cell>0.1212</cell><cell>4.4015</cell><cell>−0.4038</cell><cell>0.2947</cell><cell>0.6821</cell></row><row><cell>EXIST2024_minority_class</cell><cell>38</cell><cell>−3.5089</cell><cell>0.0000</cell><cell>5.5672</cell><cell>−0.6468</cell><cell>0.1711</cell><cell>0.0000</cell></row><row><cell>TI-CLIP (RMIT-IR_1)</cell><cell>29</cell><cell>−1.2819</cell><cell>0.2940</cell><cell>1.0128</cell><cell>−0.6468</cell><cell>0.1711</cell><cell>0.0000</cell></row><row><cell>TIMV-CLIP (RMIT-IR_2)</cell><cell>8</cell><cell>−0.3780</cell><cell>0.4392</cell><cell>0.9852</cell><cell>−0.0123</cell><cell>0.4938</cell><cell>0.6726</cell></row><row><cell>Prompt-CLIP (RMIT-IR_3)</cell><cell>24</cell><cell>−1.0894</cell><cell>0.3249</cell><cell>1.1206</cell><cell>−0.2601</cell><cell>0.3677</cell><cell>0.6040</cell></row><row><cell>English Test 
Instances</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>EXIST2024_gold</cell><cell>0</cell><cell>3.0794</cell><cell>1.0000</cell><cell>0.5528</cell><cell>0.9848</cell><cell>1.0000</cell><cell>1.0000</cell></row><row><cell>EXIST2024_majority_class</cell><cell>34</cell><cell>−2.2236</cell><cell>0.1390</cell><cell>4.4798</cell><cell>−0.4076</cell><cell>0.2931</cell><cell>0.6880</cell></row><row><cell>EXIST2024_minority_class</cell><cell>36</cell><cell>−3.1235</cell><cell>0.0000</cell><cell>5.4888</cell><cell>−0.6381</cell><cell>0.1761</cell><cell>0.0000</cell></row><row><cell>TI-CLIP (RMIT-IR_1)</cell><cell>33</cell><cell>−1.2889</cell><cell>0.2907</cell><cell>1.0115</cell><cell>−0.6381</cell><cell>0.1761</cell><cell>0.0000</cell></row><row><cell>TIMV-CLIP (RMIT-IR_2)</cell><cell>1</cell><cell>−0.0011</cell><cell>0.4998</cell><cell>0.9243</cell><cell>0.1536</cell><cell>0.5780</cell><cell>0.7250</cell></row><row><cell>Prompt-CLIP (RMIT-IR_3)</cell><cell>25</cell><cell>−1.0106</cell><cell>0.3359</cell><cell>1.1316</cell><cell>−0.2089</cell><cell>0.3940</cell><cell>0.5641</cell></row><row><cell>Spanish Test Instances</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>EXIST2024_gold</cell><cell>0</cell><cell>3.1360</cell><cell>1.0000</cell><cell>0.6160</cell><cell>0.9815</cell><cell>1.0000</cell><cell>1.0000</cell></row><row><cell>EXIST2024_majority_class</cell><cell>36</cell><cell>−2.4997</cell><cell>0.1014</cell><cell>4.3270</cell><cell>−0.4001</cell><cell>0.2962</cell><cell>0.6765</cell></row><row><cell>EXIST2024_minority_class</cell><cell>38</cell><cell>−3.9408</cell><cell>0.0000</cell><cell>5.6416</cell><cell>−0.6557</cell><cell>0.1660</cell><cell>0.0000</cell></row><row><cell>TI-CLIP 
(RMIT-IR_1)</cell><cell>29</cell><cell>−1.2730</cell><cell>0.2970</cell><cell>1.0141</cell><cell>−0.6557</cell><cell>0.1660</cell><cell>0.0000</cell></row><row><cell>TIMV-CLIP (RMIT-IR_2)</cell><cell>17</cell><cell>−0.7851</cell><cell>0.3748</cell><cell>1.0431</cell><cell>−0.1762</cell><cell>0.4103</cell><cell>0.6192</cell></row><row><cell>Prompt-CLIP (RMIT-IR_3)</cell><cell>27</cell><cell>−1.1903</cell><cell>0.3102</cell><cell>1.1101</cell><cell>−0.3140</cell><cell>0.3400</cell><cell>0.6332</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">We acknowledge that the classification of sex and gender into two categories is a simplification of people's identities.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research has been carried out in the unceded lands of the Woi Wurrung and Boon Wurrung peoples of the eastern Kulin Nation. We pay our respects to their Ancestors and Elders, past, present, and emerging. This research is partially supported by the Australian Research Council (ARC, project nr. DE200100064 and CE200100005).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Q</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sui</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.00234</idno>
		<title level="m">A Survey on In-context Learning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Learning Transferable Visual Models From Natural Language Supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<ptr target="http://proceedings.mlr.press/v139/radford21a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 38th International Conference on Machine Learning, ICML 2021</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Meila</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</editor>
		<meeting>the 38th International Conference on Machine Learning, ICML 2021<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021-07">July 2021. 2021</date>
			<biblScope unit="volume">139</biblScope>
			<biblScope unit="page" from="8748" to="8763" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Social Media and Mental Health</title>
		<author>
			<persName><forename type="first">L</forename><surname>Braghieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Makarin</surname></persName>
		</author>
		<idno type="DOI">10.1257/aer.20211218</idno>
	</analytic>
	<monogr>
		<title level="j">American Economic Review</title>
		<imprint>
			<biblScope unit="volume">112</biblScope>
			<biblScope unit="page" from="3660" to="3693" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Young Adults&apos; Folk Theories of How Social Media Harms Its Users</title>
		<author>
			<persName><forename type="first">R</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kananovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">G</forename><surname>Johnson</surname></persName>
		</author>
		<idno type="DOI">10.1080/15205436.2021.1970186</idno>
	</analytic>
	<monogr>
		<title level="j">Mass Communication and Society</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page" from="23" to="46" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The Problem of Anti-feminist &apos;Manfluencer&apos; Andrew Tate in Australian Schools: Women Teachers&apos; Experiences of Resurgent Male Supremacy</title>
		<author>
			<persName><forename type="first">S</forename><surname>Wescott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<idno type="DOI">10.1080/09540253.2023.2292622</idno>
	</analytic>
	<monogr>
		<title level="j">Gender and Education</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="167" to="182" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<ptr target="https://dictionary.cambridge.org/dictionary/english/sexism" />
		<title level="m">Cambridge Dictionary, Definition of sexism</title>
				<imprint>
			<date type="published" when="2024-07-04">2024. 2024-07-04</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">The State and the Oppression of Women</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mcintosh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Feminism and Materialism (RLE Feminist Theory)</title>
				<imprint>
			<publisher>Routledge</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="254" to="289" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Masequesmay</surname></persName>
		</author>
		<ptr target="https://www.britannica.com/topic/sexism" />
		<title level="m">Sexism | Definition, Types, Examples, &amp; Facts | Britannica</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Ambivalent sexism</title>
		<author>
			<persName><forename type="first">P</forename><surname>Glick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Fiske</surname></persName>
		</author>
		<idno type="DOI">10.1016/S0065-2601(01)80005-8</idno>
	</analytic>
	<monogr>
		<title level="j">Advances in Experimental Social Psychology</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="115" to="188" />
			<date type="published" when="2001">2001</date>
			<publisher>Academic Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Misogynistic Tweets Correlate with Violence against Women</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Blake</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>O'dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">F</forename><surname>Denson</surname></persName>
		</author>
		<idno type="DOI">10.1177/0956797620968529</idno>
	</analytic>
	<monogr>
		<title level="j">Psychological science</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="315" to="325" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Online Misogyny</title>
		<author>
			<persName><forename type="first">K</forename><surname>Barker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Jurasz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of International Affairs</title>
		<imprint>
			<biblScope unit="volume">72</biblScope>
			<biblScope unit="page" from="95" to="114" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Multi-class Hate Speech Detection in the Norwegian Language Using FAST-RNN and Multilingual Fine-Tuned Transformers</title>
		<author>
			<persName><forename type="first">E</forename><surname>Hashmi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">Y</forename><surname>Yayilgan</surname></persName>
		</author>
		<idno type="DOI">10.1007/s40747-024-01392-5</idno>
	</analytic>
	<monogr>
		<title level="j">Complex &amp; Intelligent Systems</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="4535" to="4556" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">UMUTeam at EXIST 2023: Sexism Identification and Categorisation Fine-tuning Multilingual Large Language Models</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3497/paper-080.pdf" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2023 -Conference and Labs of the Evaluation Forum</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Aliannejadi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Vlachos</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="985" to="999" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Multi-task Learning Neural Framework for Categorizing Sexism</title>
		<author>
			<persName><forename type="first">H</forename><surname>Abburi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chhaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Varma</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.csl.2023.101535</idno>
	</analytic>
	<monogr>
		<title level="j">Comput. Speech Lang</title>
		<imprint>
			<biblScope unit="volume">83</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Overview of EXIST 2023: sEXism Identification in Social neTworks</title>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Morante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Spina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-28241-6_68</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ECIR&apos;</title>
				<meeting>ECIR&apos;</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="593" to="599" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<ptr target="https://dictionary.cambridge.org/dictionary/english/meme" />
		<title level="m">Cambridge Dictionary, Definition of meme</title>
				<imprint>
			<date type="published" when="2024-07-04">2024. 2024-07-04</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">CLIPText: A New Paradigm for Zero-shot Text Classification</title>
		<author>
			<persName><forename type="first">L</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Che</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-acl.69</idno>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting><address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1077" to="1088" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System</title>
		<author>
			<persName><forename type="first">L</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Che</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Xu</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-acl.689</idno>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting><address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="10834" to="10845" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Morante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Spina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maeso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ruiz</surname></persName>
		</author>
		<ptr target="http://nlp.uned.es/exist2024/" />
		<title level="m">EXIST: sEXism Identification in Social neTworks</title>
				<imprint>
			<date type="published" when="2024-07-04">2024. 2024-07-04</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<ptr target="https://www.clef-initiative.eu/" />
		<title level="m">Conference and Labs of the Evaluation Forum (CLEF)</title>
				<imprint>
			<date type="published" when="2024-07-04">2024. 2024-07-04</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">EXIST 2024: sEXism Identification in Social neTworks and Memes</title>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Morante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Spina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maeso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ruiz</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-56069-9_68</idno>
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024</title>
				<meeting><address><addrLine>Glasgow, UK; Part V; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2024">March 24-28, 2024. 2024</date>
			<biblScope unit="page" from="498" to="504" />
		</imprint>
	</monogr>
	<note>Proceedings</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Overview of EXIST 2024 -Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes</title>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De-Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maeso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Morante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Spina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association</title>
				<meeting><address><addrLine>CLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024. 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Overview of EXIST 2024 -Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview)</title>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De-Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maeso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Morante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Spina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024 -Conference and Labs of the Evaluation Forum</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>Herrera</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Overview of EXIST 2023 – Learning with Disagreement for Sexism Identification and Characterization (Extended Overview)</title>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De-Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Morante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Spina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3497/paper-070.pdf" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2023 – Conference and Labs of the Evaluation Forum</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Aliannejadi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Vlachos</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="813" to="854" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Evaluating Extreme Hierarchical Multi-label Classification</title>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Delgado</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.399</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="5809" to="5819" />
		</imprint>
	</monogr>
	<note>(Volume 1: Long Papers), Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m">GPT-4 Turbo in the OpenAI API</title>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<ptr target="https://help.openai.com/en/articles/8555510-gpt-4-turbo-in-the-openai-api" />
		<imprint>
			<date type="published" when="2024-07-04">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">CMA-CLIP: Cross-Modality Attention CLIP for Text-Image Classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICIP46576.2022.9897323</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Image Processing (ICIP)</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="2846" to="2850" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">V</forename><surname>Conde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Turgutlu</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPRW53098.2021.00444</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3951" to="3955" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Doran</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Solorio</surname></persName>
		</editor>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Transformers: State-of-the-Art Natural Language Processing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chaumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Delangue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cistac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Louf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Funtowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Davison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shleifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Von Platen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jernite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Plu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">Le</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gugger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Drame</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Lhoest</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rush</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-demos.6</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Schlangen</surname></persName>
		</editor>
		<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="38" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Karlgren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<ptr target="https://monsterclef.dei.unipd.it/" />
		<title level="m">MonsterCLEF: One Lab to Rule Them All</title>
				<imprint>
			<date type="published" when="2024-07-04">2024</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
