<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Enhancing Sexism Detection in Tweets with Annotator-Integrated Ensemble Methods and Multimodal Embeddings for Memes: Notebook for the EXIST Lab at CLEF 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Martha</forename><forename type="middle">Paola</forename><surname>Jimenez-Martinez</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Centro de Investigación Científica y de Educación Superior de Ensenada</orgName>
								<address>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Joan</forename><forename type="middle">Manuel</forename><surname>Raygoza-Romero</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Centro de Investigación Científica y de Educación Superior de Ensenada</orgName>
								<address>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Carlos</forename><forename type="middle">Eduardo</forename><surname>Sánchez-Torres</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Universidad Autónoma de Baja California</orgName>
								<address>
									<settlement>Ensenada</settlement>
									<region>Baja California</region>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Irvin</forename><forename type="middle">Hussein</forename><surname>Lopez-Nava</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Centro de Investigación Científica y de Educación Superior de Ensenada</orgName>
								<address>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">Universidad Autónoma de Baja California</orgName>
								<address>
									<settlement>Ensenada</settlement>
									<region>Baja California</region>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Manuel</forename><surname>Montes-y-Gómez</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Instituto Nacional de Astrofísica, Óptica y Electrónica</orgName>
								<address>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Enhancing Sexism Detection in Tweets with Annotator-Integrated Ensemble Methods and Multimodal Embeddings for Memes: Notebook for the EXIST Lab at CLEF 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">86252BD35207730DE681027B74B0F079</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:59+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Sexism detection</term>
					<term>Sexism identification</term>
					<term>Sexism classification</term>
					<term>Social media</term>
					<term>Transformer models</term>
					<term>0009-0005-8701-9875 (M. P. Jimenez-Martinez)</term>
					<term>0000-0003-3085-5678 (J. M. Raygoza-Romero)</term>
					<term>0000-0001-5799-4067 (C. E. Sánchez-Torres)</term>
					<term>0000-0003-3979-9465 (I. H. Lopez-Nava)</term>
					<term>0000-0002-7601-501X (M. Montes-y-Gómez)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper details MMICI's participation in the EXIST challenge at CLEF 2024, focusing on the identification and categorization of sexism in social media and memes. For tweets, we employed pre-trained transformer models and ensemble voting approaches. For memes, we utilized CLIP embeddings using a Vision Transformer (ViT) model and two types of classifiers: feed-forward neural networks and factorization machines. The tasks encompassed detecting sexism in tweets and memes, as well as categorizing their type and the author's intention. Our methodology for tweets integrates annotator profiles, such as gender and age, to enhance the accuracy of sexism identification, source intention, and sexism categorization. For memes, we utilized all annotator features (gender, age, ethnicity, study level, and country) for the same tasks. The results demonstrate the effectiveness of our models across various tasks, emphasizing the integration of diverse perspectives. Notably, our best performances include a 10th place ranking in Task 1, a 15th place ranking in Task 2, and a 13th place ranking in Task 3 for Spanish tweets. For memes, we achieved a 3rd place ranking in Task 4 for English memes, two 1st place rankings in Task 5 for both English and Spanish memes, and a 2nd place ranking in Task 6 for English memes. These results underscore the importance of incorporating the demographic factors of annotators and taking advantage of multimodal embeddings for robust performance in sexism detection.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>According to the Cambridge Dictionary, sexism is defined as "(actions based on) the belief that the members of one sex are less intelligent, able, skilful, etc. than the members of the other sex, especially that women are less able than men" <ref type="bibr">[1]</ref>. In contrast, the Royal Spanish Academy defines sexism as "discrimination against individuals based on their sex" (in Spanish: discriminación de las personas por razón de sexo) <ref type="bibr">[2]</ref>. Both interpretations, based on the meaning and expression in both languages, agree that sexism not only reflects but also communicates and perpetuates the stereotypes and roles historically assigned to women and men in society. This perpetuation of stereotypes is a significant factor in the struggle for gender equity <ref type="bibr" target="#b0">[3]</ref>.</p><p>Research on gender ideologies employs the Ambivalent Sexism Inventory and the Ambivalence toward Men Inventory. The Ambivalent Sexism Inventory measures hostile sexism, which reflects antagonistic attitudes towards women, and benevolent sexism, which consists of subjectively favorable but patronizing beliefs about women. The Ambivalence toward Men Inventory assesses hostility toward men, rooted in the resentment of men's perceived greater power, and benevolence toward men, which involves favorable views of men as protectors and providers. Ambivalent sexism theory posits that hostile sexism and benevolent sexism arise due to social and biological factors common across cultures, such as patriarchy, gender differentiation, and heterosexuality. Systemically, hostile sexism and benevolent sexism function as complementary ideologies that justify and perpetuate gender inequality, showing a strong correlation across cultures. 
This underscores the necessity of addressing both hostile and benevolent forms of sexism in the pursuit of gender equality <ref type="bibr" target="#b1">[4]</ref>.</p><p>This paper details MMICI's participation in the "sEXism Identification in Social neTworks" (EXIST) shared task at CLEF 2024. EXIST aims to broadly capture instances of sexism, ranging from overt misogyny to subtler expressions of implicit sexist behavior, a task it has been undertaking since 2021. The goal of utilizing automatic tools is not only to detect and alert against sexist behaviors and discourses but also to estimate the prevalence of sexist and abusive situations on social media platforms, identify the most common forms of sexism, and understand how sexism manifests in these media <ref type="bibr" target="#b0">[3]</ref>.</p><p>Over the years, EXIST has evolved significantly. In 2021 and 2022, it provided a dataset with definitive (hard) labels for each tweet. However, starting from 2023 and continuing into 2024, the task expanded to generate six different labels per tweet, each derived from six distinct annotator profiles. These profiles include three women and three men from distinct age groups: 18-22, 23-45, and 46+. Furthermore, the most recent edition incorporates the demographic parameters of the annotators, such as gender, age, level of education, ethnicity, and country of residence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Dataset EXIST 2024</head><p>In its fourth edition <ref type="bibr" target="#b2">[5]</ref>, the task has incorporated new challenges involving images, specifically memes. The six tasks are as follows:</p><p>• Task 1: Sexism Identification in Tweets involves identifying whether a tweet is sexist or not.</p><p>• Task 2: Source Intention in Tweets follows, where once a tweet is classified as sexist, it involves categorizing the intention of the author, whether direct, reported, or judgemental.</p><p>• Task 3: Sexism Categorization in Tweets involves assigning each sexist tweet one or more categories of sexism.</p><p>• Task 4: Sexism Identification in Memes involves identifying whether a meme is sexist or not.</p><p>• Task 5: Source Intention in Memes involves categorizing the intention behind a sexist meme.</p><p>• Task 6: Sexism Categorization in Memes involves assigning each sexist meme one or more categories of sexism.</p><p>These tasks aim to enhance the understanding and detection of sexism across various forms of social media content in both English and Spanish, ultimately supporting efforts to combat sexism online. Given that information is provided from expressions in different languages, it cannot be assumed that models for detecting sexism in one language can be applied directly to another. This is due to the syntactic and semantic differences in the manifestations of sexism across various countries and contexts <ref type="bibr" target="#b3">[6]</ref>. To better illustrate the differences between expressions in both languages, Table <ref type="table" target="#tab_3">1</ref> provides examples of the labels for the different dataset tasks on which all annotators reached a consensus.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>TASK 1: Sexism Identification in Tweets</head><p>@messyworldorder it's honestly so embarrassing to watch and they'll be like "not all white women are like that"</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>TASK 2: Source Intention in Tweets</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Direct</head><p>Una mujer necesita amor, llenar la nevera, si un hombre puede darle esto a cambio de sus servicios (tareas domésticas, cocinar, etc.), no veo qué más necesita.</p><p>Women shouldn't code. . . perhaps be influencer/creator instead. . . it's their natural strength.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Reported</head><p>Me duermo en el metro, abro los ojos sintiendo algo raro: la mano del hombre sentado a mi lado en mi pierna #SquealOnYourPig.</p><p>Today, one of my year 1 class pupils could not believe he'd lost a race against a girl.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Judgemental</head><p>Como de costumbre, la mujer fue la que dejó su trabajo por el bienestar de la familia. . .</p><p>21st century and we are still earning 25% less than men #Idonotrenounce.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>TASK 3: Sexism Categorization in Tweets</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ideological and Inequality</head><p>Mi hermana y mi madre se burlan de mí por defender todo el tiempo los derechos de todos y me acaban de decir feminazi, la completaron.</p><p>I think the whole equality thing is getting out of hand. We are different, thats how were made!</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Stereotyping and Dominance</head><p>@Paula2R @faber_acuria A las mujeres hay que amarlas. . . solo eso. . . Nunca las entenderás.</p><p>Most women no longer have the desire or the knowledge to develop a high quality character, even if they wanted to.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Objectification</head><p>"Pareces una puta con ese pantalón" -Mi hermano de 13 cuando me vio con un pantalón de cuero.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Overview of the proposal</head><p>Previous research has developed methods to model annotators in subjective tasks, allowing for the prediction of personalized labels for each annotator. For instance, Akhtar et al. <ref type="bibr" target="#b4">[7]</ref> conducted an exhaustive search to classify annotators into two groups based on their annotation patterns. Their study demonstrated that an ensemble model, composed of two distinct classifiers representing the perspectives of each group, outperformed the traditional single-task model that only considers aggregated labels. Additionally, traditional classification methods typically aggregate labels through majority voting or averaging before training. However, this approach has been found to potentially "silence the voices" of socio-demographic minority groups <ref type="bibr" target="#b4">[7]</ref>. One of the objectives of this study is to leverage the individual opinions of annotators, or group them based on specific demographic characteristics, to ensure that their "voices" are effectively integrated into the sexism detection models.</p><p>Building on these concepts, our approach to the EXIST tasks encompasses multiple strategies across the different runs:</p><p>• Run 1 for Tasks 1, 2, and 3: The model predicts labels by employing an ensemble method that combines outputs based on different age groups and gender.</p><p>• Run 2 for Tasks 1, 2, and 3: The model predicts labels by employing an ensemble method that integrates outputs from the profiles of the six annotators.</p><p>• Run 3 for Tasks 1, 2, and 3: The model predicts labels using a majority vote approach, where the final prediction is based on the consensus among all annotators.</p><p>• Runs 1 and 2 for Tasks 4 and 5: Our approach uses embeddings for both the text and the image of each meme, which represent deep features of the meme. Additionally, annotator attributes are incorporated to develop a model capable of predicting labels for each annotator. The final label is determined by a voting mechanism among the predictions of the annotators.</p><p>• Runs 1 and 2 for Task 6: A specialized model is trained for each label using only sexism data, with the data balanced for each class. Embeddings for the meme text and image are utilized. The final output combines the model's prediction for non-sexist cases (from Task 4) with the outputs of the specialized models for each sexism category to produce a single prediction.</p><p>• Run 3 for Tasks 4, 5, and 6: The system predicts labels by concatenating the annotator's profile embedding with an image embedding in the same space. A multimodal embedding model assesses the relationship between annotators and items, and a voting mechanism is then applied to determine the final score.</p><p>Our general approach is presented in Figure <ref type="figure" target="#fig_0">1</ref> and integrates text and visual processing using transformer models to extract features and perform classifications. Texts (tweets) are preprocessed and fed into a transformer model to generate text embeddings, while images (memes) are processed through a vision transformer model to produce visual embeddings. Annotator features are extracted from the text embeddings, and a classifier is trained using these features along with the text and visual embeddings. An ensemble technique is applied to combine the outputs of the models, enhancing the accuracy of the classifier. The performance is then evaluated across the specific tasks to ensure a comprehensive assessment and optimization of the results.</p><p>For Spanish text analysis, our dataset comprised 2526 samples for training, 639 for validation, and 490 for testing. For English, it comprised 1832 samples for training and 574 for validation, with a local test set of 978 samples serving as a benchmark for evaluating the generalization capabilities of our models. The metrics used for each task are as follows: for Task 1 and Task 4, we report the ICM-Hard Norm and the F1-score of the positive class (sexism); for Tasks 2, 3, 5, and 6, we report the ICM-Hard Norm and the macro F1-score, i.e., the average of the F1-scores over all classes.</p></div>
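As a concrete illustration of the macro F1-score used for Tasks 2, 3, 5, and 6, the sketch below computes a one-vs-rest F1 per class and averages over classes; the intention labels and predictions are invented purely for illustration.

```python
def f1_per_class(y_true, y_pred, label):
    # Precision/recall/F1 for a single class, treated one-vs-rest.
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    # Macro F1: the unweighted average of the per-class F1 scores.
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)

y_true = ["direct", "reported", "direct", "judgemental"]
y_pred = ["direct", "direct", "direct", "judgemental"]
score = macro_f1(y_true, y_pred)  # (0.8 + 1.0 + 0.0) / 3 = 0.6
```

Because every class contributes equally to the average, rare classes weigh as much as frequent ones, which is why macro F1 suits the imbalanced label distributions of these tasks.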
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Sexism Detection in Tweets</head><p>Firstly, for the detection of sexism in tweets, we focus on integrating annotator information, particularly considering their profiles such as gender and age, as summarized in Figure <ref type="figure" target="#fig_1">2</ref>.</p><p>• Text Preprocessing: Mentions within the tweets were substituted with '@USER', while any URLs were replaced with 'HTTPURL'.</p><p>• Transformer Model: We used pre-trained models specific to tweets: "cardiffnlp/twitter-roberta-base-sentiment" for English and "pysentimiento/robertuito-base-uncased" for Spanish, both from Hugging Face, since these models were trained with data in the respective languages.</p><p>• Ensemble: We employed two different ensembles. The first used a majority vote over the outputs of six different models, one for each annotator. The second used a majority vote over the outputs of five different models, grouped by gender and age (females, males, 18-22, 23-45, and 46+).</p><p>As mentioned in the previous section, our runs for the first three text-focused tasks were:</p><p>1. Run 1: An ensemble was created from the outputs of five different models, grouped by gender and age. A majority vote was taken from the outputs of these five models, with the label being assigned if three or more groups agreed.</p><p>2. Run 2: An ensemble was created from the outputs of six different models, one for each annotator. A majority vote was taken from the outputs of these six models, with the label being assigned if four or more annotators agreed.</p><p>3. Run 3: As the baseline, a majority vote was taken directly from the six annotators' labels. Similarly, the label was assigned if four or more annotators agreed.</p><p>To ensure a decision was always made in each ensemble without ties, we used probabilistic voting rather than hard voting from each model. 
This means that even if three models classify a tweet as sexist and three do not, the probabilities are compared, and the decision is made based on the highest probability, ensuring a definitive decision for all predictions in the ensembles.</p><p>For Task 2, which requires determining the intention of the tweets (single label), the label was assigned based on the highest probability prediction among the types of intentions if the tweet was sexist. To achieve this, a binary model was trained for each label. This approach ensures that the classification is both precise and comprehensive, taking into account the nuanced nature of the intentions expressed in the tweets.</p><p>For Task 3, which involves identifying the types of sexism in a tweet (multi-label), the ensemble takes into account all types of sexism indicated by the annotators. For example, if one annotator labels a tweet as objectification and another labels it as misogynistic and sexual violence, all three types of sexism are included in our ensemble prediction. Furthermore, we employ multiple binary classification models for each label, allowing us to address each facet of identified sexism with specificity and precision.</p><p>To analyze in depth the impact of considering individualized annotators' opinions (A1-A6), opinions grouped before the classification models (All), opinions by demographic group (Females, Males, 18-22, 23-45, 46+), or assembled at the end (Ensemble Annotators, Ensemble Groups), Figure <ref type="figure" target="#fig_3">3</ref> presents the results of the sexism identification, intention, and categorization models, respectively.</p><p>The selection of the group ensemble and annotator ensemble approaches as Run 1 and Run 2, respectively, is grounded in their ability to integrate a wide range of perspectives and individual judgments. 
The group ensemble, by combining different demographics, offers an enriched and balanced overview, which is crucial for tackling the complexity of the tasks at hand. On the other hand, the annotator ensemble capitalizes on the diversity of individual judgments, ensuring a robust and competitive performance. Finally, the direct majority vote of annotators is established as the baseline (Run 3) due to its simplicity and effectiveness, providing a clear reference for evaluating ensemble methods. These choices are backed by the best results obtained in each task, where the group ensemble consistently outperforms others in terms of performance and ability to capture the inherent complexity in the datasets.</p></div>
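The tie-free probabilistic voting described above can be sketched as follows. Averaging the positive-class probabilities is one plausible realization of "comparing the probabilities"; the per-model probabilities here are invented stand-ins for the sigmoid outputs of the fine-tuned transformer classifiers.

```python
def soft_vote(probs, threshold=0.5):
    # Probabilistic voting: average each model's probability of the
    # positive (sexist) class and threshold the mean, so a 3-vs-3 tie
    # among hard votes still yields a definitive decision.
    mean_p = sum(probs) / len(probs)
    return ("YES" if mean_p >= threshold else "NO"), mean_p

# Six annotator models: three say sexist and three do not, but the
# confident positives dominate the average, breaking the tie.
probs = [0.91, 0.85, 0.62, 0.48, 0.40, 0.35]
label, mean_p = soft_vote(probs)  # label == "YES"
```

With hard votes this example would deadlock at 3-3; the soft vote resolves it from the models' confidence.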
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Sexism Detection in Memes</head><p>We chose a different path for Tasks 4, 5, and 6, as shown in Figure <ref type="figure" target="#fig_4">4</ref>. Although we leveraged the annotator data from the dataset, the text preprocessing steps depicted in Figure <ref type="figure" target="#fig_0">1</ref> were not applied. Instead, embeddings were extracted directly from the raw data using different approaches. We utilized CLIP embeddings for the memes and text along with annotator features. We decided to address the tasks from the textual domain due to the high variability in the representation and graphic styles of the memes (see examples in Table <ref type="table" target="#tab_3">1</ref>).</p><p>In Runs 1 and 2, annotator features were represented using one-hot encoding for gender, age range, ethnicity, study level, and country. The reader can explore this approach in subsection 5.1. In Run 3, a descriptive text was created from the annotator features, from which embeddings were extracted. For Runs 1 and 2, the classifier used was a feed-forward neural network (FNN) with two hidden layers, containing 4096 and 512 neurons, respectively. Following these layers, a dropout layer with a dropout rate of 0.1 was applied. In Run 3, we further leverage the annotator-meme relationship and propose a Factorization Machine model, a collaborative filtering technique, to predict the annotation based on annotator and meme CLIP embeddings. We explain more about this approach in subsection 5.2.</p></div>
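The FNN described above (two hidden layers of 4096 and 512 neurons followed by dropout of 0.1) can be sketched as a plain forward pass. The hidden sizes and dropout rate come from the text; the 512-dimensional CLIP text/image embeddings and the size of the one-hot annotator vector are assumptions for illustration, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def fnn_forward(x, params, train=False, p_drop=0.1):
    # Two hidden layers (4096 and 512 units) with ReLU, dropout after
    # the second hidden layer, and a sigmoid output neuron.
    h1 = relu(x @ params["W1"] + params["b1"])
    h2 = relu(h1 @ params["W2"] + params["b2"])
    if train:  # inverted dropout, applied only at training time
        h2 = h2 * (rng.random(h2.shape) >= p_drop) / (1.0 - p_drop)
    logit = h2 @ params["w3"] + params["b3"]
    return 1.0 / (1.0 + np.exp(-logit))  # probability of "sexist"

# Input: CLIP text (512-d) and image (512-d) embeddings concatenated
# with a one-hot annotator profile (size 20 chosen for illustration).
d_in = 512 + 512 + 20
params = {
    "W1": rng.normal(0, 0.01, (d_in, 4096)), "b1": np.zeros(4096),
    "W2": rng.normal(0, 0.01, (4096, 512)), "b2": np.zeros(512),
    "w3": rng.normal(0, 0.01, 512), "b3": 0.0,
}
p = fnn_forward(rng.normal(size=d_in), params)  # a probability in (0, 1)
```

In the actual pipeline these parameters would of course be learned; the sketch only fixes the shapes and the flow of data.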
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Feed-forward neural network with CLIP embeddings</head><p>For Task 4, the output layer of the FNN consisted of a single neuron producing the probability of the sexism class. We evaluated various approaches for the model: using only text embeddings, using only image embeddings, using both text and image embeddings, utilizing a general model (without annotator characteristics), and some combinations of the outputs of these models. Table <ref type="table" target="#tab_6">2</ref> outlines the features of each evaluated model.</p><p>It is essential to define two concepts: early fusion and late fusion. In early fusion, the model simultaneously receives both text and image embeddings, meaning the model's input includes annotator features, text embeddings, and image embeddings (as in the "Text+Image" and "Text+Image General" models). In late fusion, the outputs of two models are combined. For example, in the "Text|Image" model, the outputs of the "Text" model (trained only with text embeddings) and the "Image" model (trained only with image embeddings) are combined by averaging. Similarly, the "Text|Image &amp; Text|Image General" model averages the outputs of the "Text|Image" and "Text+Image General" models. The two models that achieved the highest mean F1 scores with low variation in performance were selected as Run 1 and Run 2, respectively.</p><p>For Task 5, the output layer of the FNN consisted of 3 neurons, yielding the probability of each label. Similar to Task 4, we evaluated several approaches for the model: using only text embeddings, using only image embeddings, using both text and image embeddings, and a combination of the outputs of the "Text" and "Image" models by averaging their outputs. 
Figure <ref type="figure" target="#fig_6">6</ref> displays the macro F1 score across 10 runs for each model in Task 5.</p><p>The results in Figure <ref type="figure" target="#fig_6">6</ref> indicate that the "Text+Image" model achieves a higher mean F1 score with low variation in performance; this model corresponds to Run 1. For Run 2, the "Text|Image" model was selected. Although it did not achieve the highest F1 score, it demonstrated a strong MSE score comparable to the "Text+Image" model.</p><p>For Task 6, the output layer of the FNN consisted of one neuron yielding the probability for each label of sexism categorization. We created 5 models, one per category, each trained exclusively on data from sexist memes, with a random subset of negative training cases equal in size to the positive training cases. Consequently, each model was trained on a balanced dataset. The probability output from the Task 4 model was used as the probability of the not sexism label and then combined with the outputs of these 5 models to produce a final prediction.</p><p>There are two exceptional cases to consider: i) If the probability of not sexism is higher than 0.5, as is that of one of the 5 categories of sexism, the final prediction is always not sexism. ii) If the probability of not sexism is lower than 0.5, as are those of the 5 categories of sexism, the meme is still classified as sexist, and the category of sexism with the highest probability is selected. Similar to Tasks 4 and 5, we evaluated various approaches for the model. Figure <ref type="figure" target="#fig_7">7</ref> presents the macro F1 scores for each model. We observed similar performance among the "Text", "Text+Image", and "Text|Image" models. Based on these results, we selected the "Text|Image" model for Run 1 and the "Text+Image" model for Run 2. The "Text" model was not chosen, as we believe that the combination of text and image embeddings yields better results.</p></div>
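The Task 6 combination rule described above, including both exceptional cases, can be sketched as follows. The category names follow the task, while the probabilities are invented for illustration.

```python
def combine_task6(p_not_sexist, category_probs):
    # Rule i): if the 'not sexist' probability from the Task 4 model
    # exceeds 0.5, predict NO even if some category is also above 0.5.
    if p_not_sexist > 0.5:
        return "NO"
    # Rule ii): otherwise the meme is sexist; pick the category with
    # the highest probability among the five specialized models.
    return max(category_probs, key=category_probs.get)

cats = {
    "IDEOLOGICAL-INEQUALITY": 0.7,
    "STEREOTYPING-DOMINANCE": 0.4,
    "OBJECTIFICATION": 0.2,
    "SEXUAL-VIOLENCE": 0.1,
    "MISOGYNY-NON-SEXUAL-VIOLENCE": 0.3,
}
pred = combine_task6(0.3, cats)  # "IDEOLOGICAL-INEQUALITY"
```

Note that the Task 4 output acts as a gate: the five category models are only consulted once the meme has effectively been declared sexist.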
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Multimodal Collaborative Filtering employing CLIP embeddings and Factorization Machines</head><p>In this approach, we model the problem similarly to assigning a score in a recommendation system, or to predicting links between nodes in a bipartite graph, leveraging the fact that we have both the annotator and the item features: given known subject-item preferences, predict new subject-item preferences. Formally, let 𝑈 be the set of all subjects and 𝑉 the set of all items; our core task is to find a real-valued scalar function 𝑠𝑐𝑜𝑟𝑒(𝑢, 𝑣) where 𝑢 ∈ 𝑈 and 𝑣 ∈ 𝑉. To provide a hard label or multi-label, 𝑘 subjects vote with their encoded scores. Hence, we have reduced our problem to a score prediction problem. For each user 𝑢 ∈ 𝑈, let 𝑢 ∈ R^𝐷 be its 𝐷-dimensional embedding; likewise, for each item 𝑣 ∈ 𝑉, let 𝑣 ∈ R^𝐷 be its 𝐷-dimensional embedding. So, 𝑠𝑐𝑜𝑟𝑒(𝑢, 𝑣) ≡ 𝑓 : R^𝐷 × R^𝐷 → R. In this approach, memes and annotators are transformed into the same embedding space using CLIP. Specifically, user demographics such as age, gender, and ethnicity are encoded with a phrase such as "A female aged 18-22, of Hispanic or Latino ethnicity, with a high school degree or equivalent, and located in Mexico" into one CLIP embedding. In contrast, the meme, which may include both image and text components, is encoded into another CLIP embedding. These embeddings capture the nuanced features of both the user and the meme content. We then concatenate these two embeddings into a single embedding that represents the combined features of the user and the meme.</p><p>For instance, Table 3 illustrates a complete utility matrix for Task 4 with known score entries 𝑓(𝑢, 𝑣), where 0 represents the label "NO" and 1 represents the label "YES". Our encoding method ignores "UNKNOWN" labels, but other encodings are possible. In this case, the voting policy is the selection of the class annotated by more than 3 subjects.</p></div>
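The construction of the joint annotator-meme embedding can be sketched as below. The phrase template follows the example in the text; the stand-in encoder (a deterministic random unit vector) merely replaces the real CLIP text/image encoders, and the 512-dimensional embedding size is an assumption.

```python
import hashlib
import numpy as np

def annotator_phrase(gender, age, ethnicity, study, country):
    # Render annotator demographics as the descriptive phrase fed to
    # the CLIP text encoder (template from the example in the text).
    return (f"A {gender} aged {age}, of {ethnicity} ethnicity, "
            f"with a {study}, and located in {country}")

def fake_clip_encode(text, dim=512):
    # Stand-in for CLIP: a deterministic unit vector per input. The
    # real pipeline would call the CLIP text/image encoders here.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

u = fake_clip_encode(annotator_phrase(
    "female", "18-22", "Hispanic or Latino",
    "high school degree or equivalent", "Mexico"))
v = fake_clip_encode("meme image and overlaid text")
x = np.concatenate([u, v])  # joint input for score(u, v)
```

Because annotator and meme live in the same embedding space, the concatenated vector `x` is what the Factorization Machine consumes to predict 𝑠𝑐𝑜𝑟𝑒(𝑢, 𝑣).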
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>An example of a utility matrix for Task 4</p><formula xml:id="formula_0">𝑉1 𝑉2 𝑉3
𝑈1 1 0 1
𝑈2 0 1 1
𝑈3 1 0 0
𝑈4 1 1 1
𝑈5 0 0 0
𝑈6 1 0 0
𝑉𝑜𝑡𝑖𝑛𝑔 1 0 Undefined
𝐿𝑎𝑏𝑒𝑙 YES NO</formula><p>For Task 5, the 𝑠𝑐𝑜𝑟𝑒 function is encoded similarly to Task 4, with the addition of a voting policy and a method to map scores to hard labels. The voting policy is the arithmetic mean of the votes, which leads to the following encoding to predict the hard label: 𝑠𝑐𝑜𝑟𝑒 ∈ [0, 0.67] =⇒ No; 𝑠𝑐𝑜𝑟𝑒 ∈ (0.67, 1.34] =⇒ Direct; 𝑠𝑐𝑜𝑟𝑒 ∈ (1.34, 2] =⇒ Judgemental. We apply softmax over the votes to find the probabilities, thus solving the soft-soft task.</p><p>For Task 6, the different label combinations are encoded into a compact bit set as follows: each 𝑙𝑎𝑏𝑒𝑙 𝑖 is a bit 2^𝑖 where 𝑖 ≥ 0, and the union of the bits encodes a combination. We provide an example below:</p><formula xml:id="formula_1">𝑠𝑐𝑜𝑟𝑒(𝑢, 𝑣) = 0𝑏000001 =⇒ -
𝑠𝑐𝑜𝑟𝑒(𝑢, 𝑣) = 0𝑏000010 =⇒ IDEOLOGICAL-INEQUALITY
𝑠𝑐𝑜𝑟𝑒(𝑢, 𝑣) = 0𝑏000100 =⇒ MISOGYNY-NON-SEXUAL-VIOLENCE
𝑠𝑐𝑜𝑟𝑒(𝑢, 𝑣) = 0𝑏000010 | 0𝑏000001 = 0𝑏000011 =⇒ -, IDEOLOGICAL-INEQUALITY</formula><p>Similarly to Task 5, we count the number of common bits and apply softmax to find the probability distribution.</p><p>We have defined how to decode 𝑠𝑐𝑜𝑟𝑒 to solve the tasks, but how can we learn 𝑠𝑐𝑜𝑟𝑒 from the annotator and meme CLIP embeddings? Embedding-based models include memory-based CF, model-based CF, neighborhood methods, Neural Graph Collaborative Filtering, Factorization Machines <ref type="bibr" target="#b5">[8]</ref>, and GCN-based CF. Among these, the Factorization Machine model stands out for being efficient and accurate, enabling it to effectively predict the score <ref type="bibr" target="#b6">[9]</ref> from the concatenated embedding. Figure <ref type="figure" target="#fig_8">8</ref> shows how well this approach performs on our validation dataset after 10 runs.</p></div>
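The Task 6 bit-set encoding can be made concrete as below. The first three bit assignments follow the example above ('-' = 2^0, IDEOLOGICAL-INEQUALITY = 2^1, MISOGYNY-NON-SEXUAL-VIOLENCE = 2^2); the order of the remaining labels is an assumption.

```python
LABELS = [
    "-",                              # bit 2**0: not sexist
    "IDEOLOGICAL-INEQUALITY",         # bit 2**1
    "MISOGYNY-NON-SEXUAL-VIOLENCE",   # bit 2**2
    "STEREOTYPING-DOMINANCE",         # remaining order assumed
    "OBJECTIFICATION",
    "SEXUAL-VIOLENCE",
]
BIT = {label: 1 << i for i, label in enumerate(LABELS)}

def encode(labels):
    # Union of label bits: each label_i contributes bit 2**i.
    score = 0
    for lab in labels:
        score |= BIT[lab]
    return score

def common_bits(a, b):
    # Number of shared labels between two encoded scores.
    return bin(a & b).count("1")

assert encode(["-"]) == 0b000001
assert encode(["IDEOLOGICAL-INEQUALITY"]) == 0b000010
assert encode(["-", "IDEOLOGICAL-INEQUALITY"]) == 0b000011
```

Counting common bits between a predicted and a reference bit set yields the overlap values over which the softmax is applied.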
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Outcomes of the Evaluation Phase</head><p>Table <ref type="table" target="#tab_7">4</ref> presents the combined results for both English and Spanish submissions in the sexism detection challenge across six different tasks. Each task involves several runs evaluated using two metrics: Hard-Hard and Soft-Soft. Below, we describe the results, focusing on the best runs for each task. For Task 1 (Tweets), the best performance was achieved by run MMICI_3, which ranked 17th in the Hard-Hard metric with an ICM-Hard Norm of 0.7676, and an F1 score of 0.7637. In the Soft-Soft metric, this run ranked 21st with an ICM-Soft Norm of 0.5736, indicating it was the most effective in both metrics for this task.</p><p>For Task 4 (Memes), run MMICI_2 excelled, ranking 8th in the Hard-Hard metric with an ICM-Hard Norm of 0.5515, and an F1 score of 0.7261. For Task 5, the top run was MMICI_1, which ranked 7th in the Hard-Hard metric with an ICM-Hard Norm of 0.3934, and an F1 score of 0.4179. In the Soft-Soft metric, this run performed even better, ranking 2nd with an ICM-Soft Norm of 0.3654, making it the most effective in both categories. Lastly, for Task 6, the best run was MMICI_1, which ranked 3rd in the Hard-Hard metric with an ICM-Hard Norm of 0.2954, and an F1 score of 0.4342. The results for the Spanish submissions are showcased in Table <ref type="table" target="#tab_8">5</ref>. Hereafter, we delve into these outcomes, centering our attention on the most successful runs for each task. For Task 1, the best performance was achieved by run MMICI_3, which ranked 10th in the Hard-Hard metric with an ICM-Hard Norm of 0.7802, and an F1 score of 0.7892. In Task 2, the best run was MMICI_1, ranking 15th in the Hard-Hard metric with an ICM-Hard Norm of 0.5522, and an F1 score of 0.5133. For Task 3, the top run was MMICI_1, ranking 13th in the Hard-Hard metric with an ICM-Hard Norm of 0.4586, and an F1 score of 0.5486. 
In Task 4, run MMICI_2 excelled, ranking 14th in the Hard-Hard metric with an ICM-Hard Norm of 0.4900, and an F1 score of 0.6997. For Task 5, the top run was MMICI_1, which ranked 7th in the Hard-Hard metric with an ICM-Hard Norm of 0.3945, and an F1 score of 0.4198. In the Soft-Soft metric, this run performed even better, ranking 1st with an ICM-Soft Norm of 0.3461, making it the best-performing in both categories. Lastly, in Task 6, the best run was MMICI_1, which ranked 4th in the Hard-Hard metric with an ICM-Hard Norm of 0.2473, and an F1 score of 0.3868. The outcomes for the English submissions are outlined in Table <ref type="table" target="#tab_9">6</ref>. We elaborate on these results, highlighting the top performance for each task. In Task 4, run MMICI_2 excelled, ranking 3rd in the Hard-Hard metric with an ICM-Hard Norm of 0.6129, and an F1 score of 0.7559. For Task 5, the top run was MMICI_3, which ranked 1st in the Hard-Hard metric with an ICM-Hard Norm of 0.4413, and an F1 score of 0.4094. Lastly, in Task 6, the best run was MMICI_1, which ranked 2nd in the Hard-Hard metric with an ICM-Hard Norm of 0.3419, and an F1 score of 0.4726.</p><p>The strong results achieved with memes can be attributed to the use of CLIP (Contrastive Language-Image Pre-training) embeddings. CLIP effectively learns visual concepts from natural language descriptions, aligning images and text within a shared embedding space. This alignment is achieved by training on a vast dataset of images paired with their corresponding textual descriptions, enabling the model to understand and relate visual and textual information. Using CLIP, Vision Transformers can be employed for image encoding and Text Transformers for text encoding, resulting in a unified model that excels in multi-modal tasks. The Vision Transformer processes the image data, while the Text Transformer processes the text data. 
Both sets of embeddings are then projected into a common space where their similarities can be measured and aligned, allowing the model to leverage the strengths of both visual and textual information effectively. This approach enabled the extraction of sexist expressions from memes in the dataset across both languages. By transferring the representation to the textual domain, it became possible to adopt state-of-the-art techniques for the classification tasks.</p><p>In summary, the combined analysis of English and Spanish submissions in the sexism detection challenge illuminates diverse approaches and performances across tasks. Each language cohort showcased distinct strengths, with notable runs such as MMICI_1 and MMICI_3 consistently demonstrating effectiveness across multiple tasks. These results underscore the complexity of sexism detection and highlight the importance of multilingual evaluation frameworks. Further exploration and refinement of these methodologies promise continued advancements in combating bias and fostering inclusivity in online content. </p></div>
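The shared-space idea described above can be illustrated with a small NumPy sketch. The projection heads here are random matrices standing in for CLIP's learned ones, and all dimensions are illustrative, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs: image features from a Vision Transformer
# and text features from a Text Transformer (dimensions are illustrative).
image_feat = rng.normal(size=(4, 768))  # 4 memes
text_feat = rng.normal(size=(4, 512))   # their 4 overlaid texts

# Projection heads map both modalities into one shared space; CLIP learns
# these during training, here they are random purely for illustration.
W_img = rng.normal(size=(768, 256))
W_txt = rng.normal(size=(512, 256))

def project(feats, weights):
    """Project into the shared space and L2-normalize each row."""
    z = feats @ weights
    return z / np.linalg.norm(z, axis=1, keepdims=True)

img_emb = project(image_feat, W_img)
txt_emb = project(text_feat, W_txt)

# Cosine similarity between every image and every text: the quantity that
# CLIP's contrastive objective aligns.
similarity = img_emb @ txt_emb.T                    # shape (4, 4)

# Concatenating both views gives one per-meme vector for the classifiers.
fused = np.concatenate([img_emb, txt_emb], axis=1)  # shape (4, 512)
```

Because both rows are unit-normalized, every entry of `similarity` is a cosine in [-1, 1], so scores are comparable across modalities.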
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>This paper has detailed MMICI's participation in the EXIST shared task at CLEF 2024, focusing on the detection and categorization of sexism in social media content. By leveraging innovative methodologies, including ensemble approaches that incorporate diverse annotator profiles and multimodal embeddings, our models have demonstrated substantial efficacy in identifying and understanding sexism in both tweets and memes. The results of our evaluation phase reveal that our ensemble methods, particularly those combining annotator profiles with text and image embeddings, achieve robust performance across multiple tasks. Specifically, our runs have shown competitive results in detecting sexism, discerning the intent behind sexist content, and categorizing different types of sexism. For instance, the ensemble approaches used in Runs 1 and 2 consistently outperformed traditional majority voting methods, highlighting the value of integrating diverse perspectives in addressing complex subjective tasks like sexism detection. Our approach emphasizes the importance of considering individual annotator characteristics, such as gender and age, to ensure that our models capture a wide range of viewpoints and avoid silencing minority voices. In most tasks, our baseline strategy performed the best. However, for Tasks 2 and 3 in Spanish, our ensembles surpassed the baseline by capturing a broader range of perspectives. 
This nuanced understanding of sexism, facilitated by advanced machine learning techniques and diverse data representation, is crucial for effectively combating sexist behaviors and discourses online.</p><p>As future work, there is significant potential in exploring the additional data collected on annotators in the EXIST 2024 dataset, including their ethnicities, study levels, and countries of origin, to enhance the cross-lingual and cross-cultural analysis capabilities of sexism detection systems. Developing models that effectively handle multiple languages and cultural contexts, possibly through cross-lingual transfer learning and the creation of culturally nuanced models, would improve global applicability. Additionally, further exploration of Transformer-based models and the creation of ensembles can leverage their strengths to improve detection accuracy. Expanding the dataset to include more diverse and underrepresented demographic groups would also contribute to building more robust and generalizable models. This could involve collecting additional annotated data from various social media platforms and cultural contexts. Moreover, improving multimodal techniques by leveraging advanced neural network architectures and incorporating additional features can further enhance model performance in detecting sexism.</p><p>Overall, our participation in the EXIST task underscores the potential of advanced ensemble methods and multimodal analysis in improving the detection and categorization of sexism in social media. 
These methods not only enhance the accuracy of automatic tools but also contribute to a deeper understanding of how sexism manifests in various forms, thereby supporting broader efforts to promote gender equity and reduce discrimination in digital spaces.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overview of the proposal for Sexism Detection in EXIST 2024.</figDesc><graphic coords="5,72.00,65.61,451.29,214.82" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Leveraging Annotator Consensus and Profiles for Sexism Detection in Tweets.</figDesc><graphic coords="6,72.00,65.60,451.26,149.74" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 5</head><label>5</label><figDesc>displays the F1 score for the positive case (sexism) across 10 runs for each model in Task 4. The results indicate that the "Text|Image" model and the "Text|Image &amp; Text|Image General" model (a) Task 1: Sexism Identification in Tweets. (b) Task 2: Source Intention in Tweets. (c) Task 3: Sexism Categorization in Tweets.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Classification results for Sexism Detection in Tweets</figDesc><graphic coords="8,101.33,505.36,392.62,195.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Leveraging Annotator Consensus and Profiles for Sexism Detection in Memes.</figDesc><graphic coords="9,72.00,65.60,451.28,155.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Classification results of different approaches for task 4.</figDesc><graphic coords="9,94.57,443.52,406.16,146.51" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Classification results of different approaches for task 5.</figDesc><graphic coords="10,162.25,65.61,270.77,122.84" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Classification results of different approaches for task 6.</figDesc><graphic coords="10,150.98,383.26,293.33,133.08" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: F1 score of hard-hard task 4, 5, 6 employing Collaborative Filtering.</figDesc><graphic coords="12,139.69,65.61,315.90,103.70" type="bitmap" /></figure>
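Figure 8 reports the F1 obtained when a Factorization Machine predicts score(u, v) from the concatenated annotator and CLIP features. A minimal second-order FM forward pass can be sketched as follows (the feature layout, dimensions, and random weights are illustrative, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(1)

def fm_score(x, w0, w, V):
    """Second-order Factorization Machine forward pass (Rendle, 2010).

    x  : feature vector, e.g. annotator features concatenated with a
         CLIP meme embedding (illustrative layout, our assumption);
    w0 : global bias; w : linear weights; V : (n_features, k) factors.
    """
    linear = w0 + w @ x
    # Pairwise interactions computed in O(n*k) instead of O(n^2):
    #   sum_{i<j} <v_i, v_j> x_i x_j
    # = 0.5 * sum_f [ (V^T x)_f^2 - ((V^2)^T x^2)_f ]
    vx = V.T @ x
    pairwise = 0.5 * np.sum(vx ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + pairwise

# Illustrative sizes: a few annotator dims plus a 512-d CLIP embedding.
n, k = 532, 8
x = rng.normal(size=n)
w0, w = 0.0, rng.normal(size=n) * 0.01
V = rng.normal(size=(n, k)) * 0.01
y_hat = fm_score(x, w0, w, V)
```

The O(nk) rewriting of the pairwise term is what makes FMs efficient on high-dimensional concatenated embeddings, which is why they were a practical fit here.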
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>• Task 3: Sexism Categorization in Tweets focuses</head><label></label><figDesc></figDesc><table /><note>on classifying sexist tweets into specific categories such as ideological and inequality, stereotyping and dominance, objectification, sexual violence, misogyny, and non-sexual violence.• Task 4: Sexism Identification in Memes is similar to Task 1 but applied to memes, determining whether a meme is sexist. • Task</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>5: Source Intention in Memes mirrors</head><label></label><figDesc>Task 2 but for memes, categorizing them based on the author's intention, either direct or judgmental. •</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Task 6: Sexism Categorization in Memes parallels</head><label></label><figDesc></figDesc><table /><note>Task 3, classifying sexist memes into the same categories as tweets.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 1 :</head><label>1</label><figDesc>Examples of Tweets and Memes from the dataset EXIST 2024</figDesc><table><row><cell>Task</cell><cell>Label</cell><cell>Example 1 (Spanish)</cell><cell>Example 2 (English)</cell></row><row><cell>TASK 1: Sexism</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Identification in</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Tweets</cell><cell></cell><cell></cell><cell></cell></row></table><note>SexistMujer al volante, tenga cuidado! People really try to convince women with little to no ass that they should go out and buy a body. Like bih, I don't need a fat ass to get a man. Never have.Continued on next page</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 1 -</head><label>1</label><figDesc>Continued from previous page</figDesc><table><row><cell>Task</cell><cell>Label</cell><cell>Example 1 (ES)</cell><cell>Example 2 (EN)</cell></row><row><cell></cell><cell>Not Sexist</cell><cell>Alguien me explica que zorra hace la gente en el cajero que se demora tanto.</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>TASK 4: Sexism Identification in Memes Sexist Not Sexist TASK 5: Source Intention in Memes Direct Judgemental Continued on next pageTable 1 -Continued from previous page Task Label Example 1 (ES) Example 2 (EN) TASK 6: Sexism Categorization in Memes</head><label>1</label><figDesc></figDesc><table><row><cell>Ideological and</cell><cell></cell><cell></cell></row><row><cell>Inequality</cell><cell></cell><cell></cell></row><row><cell>Stereotyping</cell><cell></cell><cell></cell></row><row><cell>and Dominance</cell><cell></cell><cell></cell></row><row><cell></cell><cell></cell><cell>Don't get married than blame all woman for</cell></row><row><cell>Objectification</cell><cell></cell><cell>your poor investment. You should of got a hooker but instead you choose to go get a wed-</cell></row><row><cell></cell><cell></cell><cell>ding ring.</cell></row><row><cell>Sexual Violence</cell><cell>#MeToo Estas 4 no han conseguido su objetivo. El juez estima que se abrieron de patas</cell><cell>Fuck that cunt, I would with my fist.</cell></row><row><cell>Misogyny and Non-Sexual Vio-lence</cell><cell>Las mujeres de hoy en dia te enseñar a querer. . . estar soltero</cell><cell>Some woman are so toxic they don't even know they are draining everyone around them in poison. If you lack self awareness you won't even notice how toxic you really are.</cell></row><row><cell>Sexual Violence</cell><cell></cell><cell></cell></row><row><cell>Misogyny and</cell><cell></cell><cell></cell></row><row><cell>Non-Sexual Vio-</cell><cell></cell><cell></cell></row><row><cell>lence</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 2</head><label>2</label><figDesc>Features of different models for the task 4.</figDesc><table><row><cell>Model Name</cell><cell cols="5">Annotator Features Text Embeddings Image Embbedings Early Fusion Late Fusion</cell></row><row><cell>Text</cell><cell>Yes</cell><cell>Yes</cell><cell>No</cell><cell>N/A</cell><cell>N/A</cell></row><row><cell>Image</cell><cell>Yes</cell><cell>No</cell><cell>Yes</cell><cell>N/A</cell><cell>N/A</cell></row><row><cell>Text+Image</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>No</cell></row><row><cell>Text General</cell><cell>No</cell><cell>Yes</cell><cell>No</cell><cell>N/A</cell><cell>N/A</cell></row><row><cell>Image General</cell><cell>No</cell><cell>No</cell><cell>Yes</cell><cell>N/A</cell><cell>N/A</cell></row><row><cell>Text+Image General</cell><cell>No</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>No</cell></row><row><cell>Text|Image</cell><cell>Yes</cell><cell>Yes</cell><cell>Yes</cell><cell>No</cell><cell>Yes</cell></row><row><cell>Text|Image &amp; Text|Image General</cell><cell>Yes&amp;No</cell><cell>Yes</cell><cell>Yes</cell><cell>No</cell><cell>Yes</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 4</head><label>4</label><figDesc>Results of Submission on Leaderboard for both Spanish and English (ALL)</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell cols="2">Hard-Hard</cell><cell></cell><cell></cell><cell>Soft-Soft</cell></row><row><cell>Task</cell><cell>Run</cell><cell cols="3">Ranking ICM-Hard ICM-Hard Norm</cell><cell>F1</cell><cell cols="3">Ranking ICM-Soft ICM-Soft Norm</cell></row><row><cell cols="2">Task1 MMICI_1</cell><cell>31</cell><cell>0.4705</cell><cell>0.7365</cell><cell>0.7455</cell><cell>29</cell><cell>-0.3394</cell><cell>0.4456</cell></row><row><cell cols="2">Task1 MMICI_2</cell><cell>28</cell><cell>0.4780</cell><cell>0.7402</cell><cell>0.7460</cell><cell>30</cell><cell>-0.3622</cell><cell>0.4419</cell></row><row><cell cols="2">Task1 MMICI_3</cell><cell>17</cell><cell>0.5324</cell><cell>0.7676</cell><cell>0.7637</cell><cell>21</cell><cell>0.4589</cell><cell>0.5736</cell></row><row><cell cols="2">Task2 MMICI_1</cell><cell>27</cell><cell>-0.0987</cell><cell>0.4679</cell><cell>0.4548</cell><cell>24</cell><cell>-4.5753</cell><cell>0.1314</cell></row><row><cell cols="2">Task2 MMICI_2</cell><cell>32</cell><cell>-0.2406</cell><cell>0.4218</cell><cell>0.4383</cell><cell>25</cell><cell>-4.6285</cell><cell>0.1271</cell></row><row><cell cols="2">Task2 MMICI_3</cell><cell>28</cell><cell>-0.1076</cell><cell>0.4650</cell><cell>0.4525</cell><cell>20</cell><cell>-3.6350</cell><cell>0.2071</cell></row><row><cell cols="2">Task3 MMICI_1</cell><cell>27</cell><cell>-1.4509</cell><cell>0.1631</cell><cell>0.4026</cell><cell>22</cell><cell>-7.9356</cell><cell>0.0809</cell></row><row><cell cols="2">Task3 MMICI_2</cell><cell>28</cell><cell>-1.5003</cell><cell>0.1516</cell><cell>0.4017</cell><cell>23</cell><cell>-7.9380</cell><cell>0.0808</cell></row><row><cell cols="2">Task3 
MMICI_3</cell><cell>23</cell><cell>-0.8105</cell><cell>0.3118</cell><cell>0.4805</cell><cell>20</cell><cell>-7.6413</cell><cell>0.0965</cell></row><row><cell cols="2">Task4 MMICI_1</cell><cell>12</cell><cell>0.0751</cell><cell>0.5382</cell><cell>0.7202</cell><cell>17</cell><cell>-0.6189</cell><cell>0.4005</cell></row><row><cell cols="2">Task4 MMICI_2</cell><cell>8</cell><cell>0.1014</cell><cell>0.5515</cell><cell>0.7261</cell><cell>16</cell><cell>-0.6183</cell><cell>0.4006</cell></row><row><cell cols="2">Task4 MMICI_3</cell><cell>24</cell><cell>-0.0361</cell><cell>0.4816</cell><cell>0.6781</cell><cell>19</cell><cell>-0.6410</cell><cell>0.3970</cell></row><row><cell cols="2">Task5 MMICI_1</cell><cell>7</cell><cell>-0.3066</cell><cell>0.3934</cell><cell>0.4179</cell><cell>2</cell><cell>-1.2660</cell><cell>0.3654</cell></row><row><cell cols="2">Task5 MMICI_2</cell><cell>10</cell><cell>-0.3868</cell><cell>0.3655</cell><cell>0.3770</cell><cell>3</cell><cell>-1.3738</cell><cell>0.3539</cell></row><row><cell cols="2">Task5 MMICI_3</cell><cell>8</cell><cell>-0.3297</cell><cell>0.3854</cell><cell>0.3814</cell><cell>13</cell><cell>-3.4751</cell><cell>0.1304</cell></row><row><cell cols="2">Task6 MMICI_1</cell><cell>3</cell><cell>-0.9863</cell><cell>0.2954</cell><cell>0.4342</cell><cell>19</cell><cell>-16.1248</cell><cell>0.0000</cell></row><row><cell cols="2">Task6 MMICI_2</cell><cell>7</cell><cell>-1.3446</cell><cell>0.2210</cell><cell>0.4453</cell><cell>20</cell><cell>-19.3246</cell><cell>0.0000</cell></row><row><cell cols="2">Task6 MMICI_3</cell><cell>24</cell><cell>-3.8341</cell><cell>0.0000</cell><cell>0.2347</cell><cell>21</cell><cell>-45.0237</cell><cell>0.0000</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 5</head><label>5</label><figDesc>Results of Submission on Leaderboard for Spanish</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell cols="2">Hard-Hard</cell><cell></cell><cell></cell><cell>Soft-Soft</cell></row><row><cell>Task</cell><cell>Run</cell><cell cols="3">Ranking ICM-Hard ICM-Hard Norm</cell><cell>F1</cell><cell cols="3">Ranking ICM-Soft ICM-Soft Norm</cell></row><row><cell cols="2">Task1 MMICI_1</cell><cell>16</cell><cell>0.5323</cell><cell>0.7662</cell><cell>0.7817</cell><cell>24</cell><cell>0.0894</cell><cell>0.5143</cell></row><row><cell cols="2">Task1 MMICI_2</cell><cell>22</cell><cell>0.5007</cell><cell>0.7504</cell><cell>0.7705</cell><cell>25</cell><cell>0.0170</cell><cell>0.5027</cell></row><row><cell cols="2">Task1 MMICI_3</cell><cell>10</cell><cell>0.5603</cell><cell>0.7802</cell><cell>0.7892</cell><cell>15</cell><cell>0.6706</cell><cell>0.6076</cell></row><row><cell cols="2">Task2 MMICI_1</cell><cell>15</cell><cell>0.1670</cell><cell>0.5522</cell><cell>0.5133</cell><cell>23</cell><cell>-4.1728</cell><cell>0.1658</cell></row><row><cell cols="2">Task2 MMICI_2</cell><cell>26</cell><cell>0.0064</cell><cell>0.5020</cell><cell>0.4933</cell><cell>24</cell><cell>-4.2127</cell><cell>0.1626</cell></row><row><cell cols="2">Task2 MMICI_3</cell><cell>29</cell><cell>-0.1146</cell><cell>0.4642</cell><cell>0.4779</cell><cell>20</cell><cell>-3.4962</cell><cell>0.2200</cell></row><row><cell cols="2">Task3 MMICI_1</cell><cell>13</cell><cell>-0.1853</cell><cell>0.4586</cell><cell>0.5486</cell><cell>24</cell><cell>-7.8261</cell><cell>0.0927</cell></row><row><cell cols="2">Task3 MMICI_2</cell><cell>14</cell><cell>-0.2269</cell><cell>0.4493</cell><cell>0.5446</cell><cell>25</cell><cell>-7.8356</cell><cell>0.0922</cell></row><row><cell cols="2">Task3 
MMICI_3</cell><cell>22</cell><cell>-0.5870</cell><cell>0.3689</cell><cell>0.5165</cell><cell>22</cell><cell>-7.4291</cell><cell>0.1134</cell></row><row><cell cols="2">Task4 MMICI_1</cell><cell>17</cell><cell>-0.0591</cell><cell>0.4699</cell><cell>0.6906</cell><cell>14</cell><cell>-0.6655</cell><cell>0.3939</cell></row><row><cell cols="2">Task4 MMICI_2</cell><cell>14</cell><cell>-0.0196</cell><cell>0.4900</cell><cell>0.6997</cell><cell>15</cell><cell>-0.6689</cell><cell>0.3933</cell></row><row><cell cols="2">Task4 MMICI_3</cell><cell>26</cell><cell>-0.1848</cell><cell>0.4059</cell><cell>0.6470</cell><cell>18</cell><cell>-0.8361</cell><cell>0.3667</cell></row><row><cell cols="2">Task5 MMICI_1</cell><cell>7</cell><cell>-0.3028</cell><cell>0.3945</cell><cell>0.4198</cell><cell>1</cell><cell>-1.4813</cell><cell>0.3461</cell></row><row><cell cols="2">Task5 MMICI_2</cell><cell>9</cell><cell>-0.4077</cell><cell>0.3580</cell><cell>0.3728</cell><cell>2</cell><cell>-1.5486</cell><cell>0.3392</cell></row><row><cell cols="2">Task5 MMICI_3</cell><cell>10</cell><cell>-0.4875</cell><cell>0.3302</cell><cell>0.3545</cell><cell>13</cell><cell>-4.0400</cell><cell>0.0804</cell></row><row><cell cols="2">Task6 MMICI_1</cell><cell>4</cell><cell>-1.2346</cell><cell>0.2473</cell><cell>0.3868</cell><cell>18</cell><cell>-14.9495</cell><cell>0.0000</cell></row><row><cell cols="2">Task6 MMICI_2</cell><cell>13</cell><cell>-1.6925</cell><cell>0.1536</cell><cell>0.4141</cell><cell>20</cell><cell>-18.0902</cell><cell>0.0000</cell></row><row><cell cols="2">Task6 MMICI_3</cell><cell>24</cell><cell>-3.8686</cell><cell>0.0000</cell><cell>0.2225</cell><cell>21</cell><cell>-42.6540</cell><cell>0.0000</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 6</head><label>6</label><figDesc>Results of Submission on Leaderboard for English</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell cols="2">Hard-Hard</cell><cell></cell><cell></cell><cell>Soft-Soft</cell></row><row><cell>Task</cell><cell>Run</cell><cell cols="3">Ranking ICM-Hard ICM-Hard Norm</cell><cell>F1</cell><cell cols="3">Ranking ICM-Soft ICM-Soft Norm</cell></row><row><cell cols="2">Task1 MMICI_1</cell><cell>40</cell><cell>0.3840</cell><cell>0.6960</cell><cell>0.6971</cell><cell>32</cell><cell>-0.8805</cell><cell>0.3586</cell></row><row><cell cols="2">Task1 MMICI_2</cell><cell>33</cell><cell>0.4402</cell><cell>0.7246</cell><cell>0.7141</cell><cell>31</cell><cell>-0.8349</cell><cell>0.3659</cell></row><row><cell cols="2">Task1 MMICI_3</cell><cell>25</cell><cell>0.4912</cell><cell>0.7507</cell><cell>0.7315</cell><cell>21</cell><cell>0.1413</cell><cell>0.5227</cell></row><row><cell cols="2">Task2 MMICI_1</cell><cell>33</cell><cell>-0.4572</cell><cell>0.3418</cell><cell>0.3680</cell><cell>23</cell><cell>-5.0641</cell><cell>0.0861</cell></row><row><cell cols="2">Task2 MMICI_2</cell><cell>36</cell><cell>-0.5728</cell><cell>0.3018</cell><cell>0.3570</cell><cell>24</cell><cell>-5.1264</cell><cell>0.0810</cell></row><row><cell cols="2">Task2 MMICI_3</cell><cell>30</cell><cell>-0.1384</cell><cell>0.4521</cell><cell>0.4087</cell><cell>19</cell><cell>-3.8024</cell><cell>0.1892</cell></row><row><cell cols="2">Task3 MMICI_1</cell><cell>32</cell><cell>-2.8962</cell><cell>0.0000</cell><cell>0.2357</cell><cell>22</cell><cell>-7.9094</cell><cell>0.0666</cell></row><row><cell cols="2">Task3 MMICI_2</cell><cell>34</cell><cell>-2.9573</cell><cell>0.0000</cell><cell>0.2373</cell><cell>21</cell><cell>-7.9059</cell><cell>0.0668</cell></row><row><cell cols="2">Task3 
MMICI_3</cell><cell>26</cell><cell>-1.1024</cell><cell>0.2298</cell><cell>0.4287</cell><cell>19</cell><cell>-7.7476</cell><cell>0.0755</cell></row><row><cell cols="2">Task4 MMICI_1</cell><cell>5</cell><cell>0.2094</cell><cell>0.6063</cell><cell>0.7538</cell><cell>20</cell><cell>-0.5779</cell><cell>0.4062</cell></row><row><cell cols="2">Task4 MMICI_2</cell><cell>3</cell><cell>0.2224</cell><cell>0.6129</cell><cell>0.7559</cell><cell>19</cell><cell>-0.5735</cell><cell>0.4069</cell></row><row><cell cols="2">Task4 MMICI_3</cell><cell>18</cell><cell>0.1131</cell><cell>0.5574</cell><cell>0.7122</cell><cell>17</cell><cell>-0.4621</cell><cell>0.4250</cell></row><row><cell cols="2">Task5 MMICI_1</cell><cell>6</cell><cell>-0.3112</cell><cell>0.3920</cell><cell>0.4156</cell><cell>2</cell><cell>-1.1089</cell><cell>0.3790</cell></row><row><cell cols="2">Task5 MMICI_2</cell><cell>8</cell><cell>-0.3657</cell><cell>0.3731</cell><cell>0.3815</cell><cell>3</cell><cell>-1.2447</cell><cell>0.3642</cell></row><row><cell cols="2">Task5 MMICI_3</cell><cell>1</cell><cell>-0.1691</cell><cell>0.4413</cell><cell>0.4094</cell><cell>13</cell><cell>-2.9704</cell><cell>0.1760</cell></row><row><cell cols="2">Task6 MMICI_1</cell><cell>2</cell><cell>-0.7441</cell><cell>0.3419</cell><cell>0.4726</cell><cell>15</cell><cell>-18.3643</cell><cell>0.0000</cell></row><row><cell cols="2">Task6 MMICI_2</cell><cell>7</cell><cell>-1.0095</cell><cell>0.2855</cell><cell>0.4752</cell><cell>16</cell><cell>-21.6764</cell><cell>0.0000</cell></row><row><cell cols="2">Task6 MMICI_3</cell><cell>20</cell><cell>-3.8687</cell><cell>0.0000</cell><cell>0.2447</cell><cell>17</cell><cell>-49.2040</cell><cell>0.0000</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been partially supported by CONAHCYT (The National Council of Humanities, Sciences, and Technologies of Mexico), which promotes scientific and technological development in the country.</p><p>Additionally, we acknowledge the support provided through the following scholarships: Martha Paola Jimenez-Martinez (scholarship number 828539) and Joan Manuel Raygoza-Romero (scholarship number 806073).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Comisión Nacional para Prevenir y Erradicar la Violencia Contra las Mujeres , ¿qué es el lenguaje sexista y por qué es importante visibilizarlo?</title>
		<ptr target="https://www.gob.mx/conavim/articulos/que-es-el-lenguaje-sexista-y-por-que-es-importante-visibilizarlo?idiom=es" />
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Ambivalent sexism</title>
		<author>
			<persName><forename type="first">P</forename><surname>Glick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Fiske</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in experimental social psychology</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="115" to="188" />
			<date type="published" when="2001">2001</date>
			<publisher>Elsevier</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Overview of EXIST 2024 - Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes</title>
		<author>
			<persName><forename type="first">L</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carrillo-De-Albornoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maeso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Morante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Spina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association</title>
				<meeting><address><addrLine>CLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024. 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>García-Díaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cánovas-García</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Colomo-Palacios</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Valencia-García</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">114</biblScope>
			<biblScope unit="page" from="506" to="518" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Akhtar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2106.15896</idno>
		<title level="m">Whose opinions matter? perspective-aware models to identify opinions of hate speech victims in abusive language detection</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Factorization machines for data with implicit feedback</title>
		<author>
			<persName><forename type="first">B</forename><surname>Loni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Larson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hanjalic</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:56517380" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Factorization machines</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rendle</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICDM.2010.127</idno>
	</analytic>
	<monogr>
		<title level="m">2010 IEEE International Conference on Data Mining</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="995" to="1000" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
