=Paper=
{{Paper
|id=Vol-3740/paper-97
|storemode=property
|title=Enhancing Sexism Detection in Tweets with Annotator-Integrated Ensemble Methods and
Multimodal Embeddings for Memes
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-97.pdf
|volume=Vol-3740
|authors=Martha Paola Jimenez-Martinez,Joan Manuel Raygoza-Romero,Carlos Eduardo Sánchez-Torres,Irvin Hussein Lopez-Nava,Manuel Montes-Y-Gómez
|dblpUrl=https://dblp.org/rec/conf/clef/Jimenez-Martinez24
}}
==Enhancing Sexism Detection in Tweets with Annotator-Integrated Ensemble Methods and Multimodal Embeddings for Memes==
Notebook for the EXIST Lab at CLEF 2024
Martha Paola Jimenez-Martinez1,* , Joan Manuel Raygoza-Romero1 ,
Carlos Eduardo Sánchez-Torres3 , Irvin Hussein Lopez-Nava1,3 and Manuel Montes-y-Gómez2
1 Centro de Investigación Científica y de Educación Superior de Ensenada, Mexico
2 Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico
3 Universidad Autónoma de Baja California, Ensenada, Baja California, Mexico
Abstract
This paper details MMICI’s participation in the EXIST challenge at CLEF 2024, focusing on the identification
and categorization of sexism in social media and memes. For tweets, we employed pre-trained transformer
models and ensemble voting approaches. For memes, we utilized CLIP embeddings using a Vision Transformer
(ViT) model and two types of classifiers: feed-forward neural networks and factorization machines. The tasks
encompassed detecting sexism in tweets and memes, as well as categorizing their type and the author’s intention.
Our methodology for tweets integrates annotator profiles, such as gender and age, to enhance the accuracy of
sexism identification, source intention, and sexism categorization. For memes, we utilized all annotator features
(gender, age, ethnicity, study level, and country) for the same tasks. The results demonstrate the effectiveness
of our models across various tasks, emphasizing the integration of diverse perspectives. Notably, our best
performances include a 10th place ranking in Task 1, a 15th place ranking in Task 2, and a 13th place ranking in
Task 3 for Spanish tweets. For memes, we achieved a 3rd place ranking in Task 4 for English memes, two 1st
place rankings in Task 5 for both English and Spanish memes, and a 2nd place ranking in Task 6 for English
memes. These results underscore the importance of incorporating the demographic factors of annotators and
taking advantage of multimodal embeddings for robust performance in sexism detection.
Keywords
Sexism detection, Sexism identification, Sexism classification, Social media, Transformer models
1. Introduction
According to the Cambridge Dictionary, sexism is defined as “(actions based on) the belief that the members of one sex are less intelligent, able, skilful, etc. than the members of the other sex, especially that women are less able than men” [1]. In contrast, the Royal Spanish Academy defines sexism as “discrimination against individuals based on their sex” (in Spanish: discriminación de las personas por razón de sexo) [2]. Both definitions, grounded in the meaning and usage of each language, agree that sexism not only reflects but also communicates and perpetuates the stereotypes and roles historically assigned to women and men in society. This perpetuation of stereotypes is a significant factor in the struggle for gender equity [3].
Research on gender ideologies employs the Ambivalent Sexism Inventory and the Ambivalence toward Men Inventory. The Ambivalent Sexism Inventory measures hostile sexism, which reflects antagonistic attitudes towards women, and benevolent sexism, which consists of subjectively favorable but patronizing beliefs about women. The Ambivalence toward Men Inventory assesses hostility toward men, rooted in the resentment of men’s perceived greater power, and benevolence toward men, which involves favorable views of men as protectors and providers. Ambivalent sexism theory posits that hostile sexism and benevolent sexism arise due to social and biological factors common across cultures, such as patriarchy, gender differentiation, and heterosexuality. Systemically, hostile sexism and benevolent sexism function as complementary ideologies that justify and perpetuate gender inequality, showing a strong correlation across cultures. This underscores the necessity of addressing both hostile and benevolent forms of sexism in the pursuit of gender equality [4].
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
jimenezmp@cicese.edu.mx (M. P. Jimenez-Martinez); jraygoza@cicese.edu.mx (J. M. Raygoza-Romero); a361075@uabc.edu.mx (C. E. Sánchez-Torres); hussein@cicese.edu.mx (I. H. Lopez-Nava); mmontesg@inaoep.mx (M. Montes-y-Gómez)
ORCID: 0009-0005-8701-9875 (M. P. Jimenez-Martinez); 0000-0003-3085-5678 (J. M. Raygoza-Romero); 0000-0001-5799-4067 (C. E. Sánchez-Torres); 0000-0003-3979-9465 (I. H. Lopez-Nava); 0000-0002-7601-501X (M. Montes-y-Gómez)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
This paper details MMICI’s participation in the "sEXism Identification in Social neTworks" (EXIST)
shared task at CLEF 2024. EXIST aims to broadly capture instances of sexism, ranging from overt
misogyny to subtler expressions of implicit sexist behavior, a task it has been undertaking since 2021.
The goal of utilizing automatic tools is not only to detect and alert against sexist behaviors and discourses
but also to estimate the prevalence of sexist and abusive situations on social media platforms, identify
the most common forms of sexism, and understand how sexism manifests in these media [3].
Over the years, EXIST has evolved significantly. In 2021 and 2022, it provided a dataset with definitive
(hard) labels for each tweet. However, starting from 2023 and continuing into 2024, the task expanded to
generate six different labels per tweet, each derived from six distinct annotator profiles. These profiles
include three women and three men from distinct age groups: 18-22, 23-45, and 46+. Furthermore, the
most recent edition incorporates the demographic parameters of the annotators, such as gender, age,
level of education, ethnicity, and country of residence.
2. Dataset EXIST 2024
In its fourth edition [5], the task has incorporated new challenges involving images, specifically memes.
The six tasks are as follows:
• Task 1: Sexism Identification in Tweets involves identifying whether a tweet is sexist or not.
• Task 2: Source Intention in Tweets follows, where once a tweet is classified as sexist, it involves
categorizing the intention of the author—whether the intention is direct, reported, or judgmental.
• Task 3: Sexism Categorization in Tweets focuses on classifying sexist tweets into specific categories
such as ideological and inequality, stereotyping and dominance, objectification, sexual violence, misogyny,
and non-sexual violence.
• Task 4: Sexism Identification in Memes is similar to Task 1 but applied to memes, determining whether
a meme is sexist.
• Task 5: Source Intention in Memes mirrors Task 2 but for memes, categorizing them based on the
author’s intention, either direct or judgmental.
• Task 6: Sexism Categorization in Memes parallels Task 3, classifying sexist memes into the same
categories as tweets.
These tasks aim to enhance the understanding and detection of sexism across various forms of social
media content in both English and Spanish, ultimately supporting efforts to combat sexism online.
Since the data contains expressions in different languages, it cannot be assumed that models for detecting sexism in one language can be applied directly to another, owing to syntactic and semantic differences in how sexism manifests across countries and contexts [6]. To better understand the differences between the two languages, Table 1 provides examples from the different dataset tasks, restricted to instances where all annotators reached a consensus on the label.
Table 1: Examples of tweets and memes from the EXIST 2024 dataset (instances where all annotators agreed on the label)

TASK 1: Sexism Identification in Tweets
- Sexist. ES: “Mujer al volante, tenga cuidado!” EN: “People really try to convince women with little to no ass that they should go out and buy a body. Like bih, I don’t need a fat ass to get a man. Never have.”
- Not Sexist. ES: “Alguien me explica que zorra hace la gente en el cajero que se demora tanto.” EN: “@messyworldorder it’s honestly so embarrassing to watch and they’ll be like ‘not all white women are like that’”

TASK 2: Source Intention in Tweets
- Direct. ES: “Una mujer necesita amor, llenar la nevera, si un hombre puede darle esto a cambio de sus servicios (tareas domésticas, cocinar, etc.), no veo qué más necesita.” EN: “Women shouldn’t code. . . perhaps be influencer/creator instead. . . it’s their natural strength.”
- Reported. ES: “Me duermo en el metro, abro los ojos sintiendo algo raro: la mano del hombre sentado a mi lado en mi pierna #SquealOnYourPig.” EN: “Today, one of my year 1 class pupils could not believe he’d lost a race against a girl.”
- Judgemental. ES: “Como de costumbre, la mujer fue la que dejó su trabajo por el bienestar de la familia. . .” EN: “21st century and we are still earning 25% less than men #Idonotrenounce.”

TASK 3: Sexism Categorization in Tweets
- Ideological and Inequality. ES: “Mi hermana y mi madre se burlan de mí por defender todo el tiempo los derechos de todos y me acaban de decir feminazi, la completaron.” EN: “I think the whole equality thing is getting out of hand. We are different, thats how were made!”
- Stereotyping and Dominance. ES: “@Paula2R @faber_acuria A las mujeres hay que amarlas. . . solo eso. . . Nunca las entenderás.” EN: “Most women no longer have the desire or the knowledge to develop a high quality character, even if they wanted to.”
- Objectification. ES: “‘Pareces una puta con ese pantalón’ - Mi hermano de 13 cuando me vio con un pantalón de cuero.” EN: “Don’t get married than blame all woman for your poor investment. You should of got a hooker but instead you choose to go get a wedding ring.”
- Sexual Violence. ES: “#MeToo Estas 4 no han conseguido su objetivo. El juez estima que se abrieron de patas” EN: “Fuck that cunt, I would with my fist.”
- Misogyny and Non-Sexual Violence. ES: “Las mujeres de hoy en dia te enseñar a querer. . . estar soltero” EN: “Some woman are so toxic they don’t even know they are draining everyone around them in poison. If you lack self awareness you won’t even notice how toxic you really are.”

TASKS 4–6 (Sexism Identification, Source Intention, and Sexism Categorization in Memes) use the same label sets (Sexist / Not Sexist; Direct / Judgemental; the five sexism categories); their examples are meme images and are not reproducible here.
3. Overview of the proposal
Previous research has developed methods to model annotators in subjective tasks, allowing for the
prediction of personalized labels for each annotator. For instance, Akhtar et al. [7] conducted an
exhaustive search to classify annotators into two groups based on their annotation patterns. Their study
demonstrated that an ensemble model, composed of two distinct classifiers representing the perspectives
of each group, outperformed the traditional single-task model that only considers aggregated labels.
Additionally, traditional classification methods typically aggregate labels through majority voting or
averaging before training. However, this approach has been found to potentially “silence the voices” of
socio-demographic minority groups [7]. One of the objectives of this study is to leverage the individual
opinions of annotators, or group them based on specific demographic characteristics, to ensure that
their “voices” are effectively integrated into the sexism detection models.
Building on previous concepts, our approach to addressing the EXIST task encompasses multiple
strategies across various runs and tasks:
Figure 1: Overview of the proposal for Sexism Detection in EXIST 2024.
• Run 1 for Tasks 1, 2, and 3: The model predicts labels by employing an ensemble method that combines
outputs based on different age groups and gender.
• Run 2 for Tasks 1, 2, and 3: The model predicts labels by employing an ensemble method that integrates
outputs from various profiles of the six annotators.
• Run 3 for Tasks 1, 2, and 3: The model predicts labels using a majority vote approach, where the final
prediction is based on the consensus among all annotators.
• Runs 1 and 2 for Tasks 4 and 5: For these tasks, our approach involves using embeddings for both the
text and the image of each meme. These embeddings represent deep features of the meme. Additionally,
annotator attributes are incorporated to develop a model capable of predicting labels for each annotator.
The final label is determined by a voting mechanism among the predictions of the annotators.
• Runs 1 and 2 for Task 6: A specialized model is trained for each label using only sexism data, with
the data balanced for each class. Embeddings for the meme text and image are utilized. The final output
combines the model’s prediction for non-sexist cases (from Task 4) with the outputs of the specialized
models for each sexism category to produce a single prediction.
• Run 3 for Tasks 4, 5, and 6: The system predicts labels by concatenating the annotator’s profile embedding with an image embedding in the same space. A multimodal embedding model assesses the relationship between annotators and items, and a voting mechanism is then applied to determine the final score.
Our general approach is presented in Figure 1 and integrates text and visual processing using
transformer models to extract features and perform classifications. Texts (tweets) are preprocessed
and fed into a transformer model to generate text embeddings, while images (memes) are processed
through a vision transformer model to produce visual embeddings. Annotator features are extracted
from the text embeddings, and a classifier is trained using these features along with the text and visual
embeddings. An ensemble technique is applied to combine the outputs of the models, enhancing the
accuracy of the classifier. The performance is then evaluated across several specific tasks to ensure
comprehensive assessment and optimization of the results.
For text analysis in Spanish, our dataset comprised 2526 samples for training, 639 for validation, and 490 for local testing. For English, it comprised 1832 samples for training, 574 for validation, and a local test set of 978 samples, which served as a benchmark for evaluating the generalization capabilities of our models. The metrics used during development were as follows: for Task 1 and Task 4, the F1-score of the positive class (sexist); for Tasks 2, 3, 5, and 6, the macro F1-score, i.e., the average of the F1-scores over all classes.
Figure 2: Leveraging Annotator Consensus and Profiles for Sexism Detection in Tweets.
4. Sexism Detection in Tweets
Firstly, for the detection of sexism in tweets, we focus on integrating annotator information, particularly
considering their profiles such as gender and age, as summarized in Figure 2.
• Text Preprocessing: Mentions within the tweets were substituted with ‘@USER’, and any URLs were replaced with ‘HTTPURL’.
• Transformer Model: We decided to use pre-trained models specifically for tweets:
“cardiffnlp/twitter-roberta-base-sentiment” for English and “pysentimiento/robertuito-base-
uncased” for Spanish, both from Hugging Face, since these models were trained with data
in the respective languages.
• Ensemble: We employed two different ensembles. The first ensemble used a majority vote from
the outputs of six different models, one for each annotator. The second ensemble used a majority
vote from the outputs of five different models, focusing on gender and age (females, males, 18-22,
23-45, and 46+).
As mentioned in the previous Section, our runs for the first three text-focused tasks were:
1. Run 1: An ensemble was created from the outputs of five different models, focusing on gender
and age. A majority vote was taken from the outputs of these five models, with the label being
assigned if three or more groups agreed.
2. Run 2: An ensemble was created from the outputs of six different models, one for each annotator.
A majority vote was taken from the outputs of these six models, with the label being assigned if
four or more annotators agreed.
3. Run 3: A majority vote was taken initially from the six annotators’ inputs, serving as the baseline.
Similarly, the label was assigned if four or more annotators agreed.
To ensure a decision was always made in each ensemble without ties, we used probabilistic voting
rather than hard voting from each model. This means that even if three models classify a tweet as
sexist and three do not, the probabilities are compared, and the decision is made based on the highest
probability, ensuring a definitive decision for all predictions in the ensembles.
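This probabilistic voting can be sketched as follows, with `probs` standing for the positive-class probabilities produced by the individual models of an ensemble (a simplified illustration, not the exact implementation):

```python
import numpy as np

def soft_vote(probs):
    """Probabilistic voting: average the per-model probabilities of the
    positive class instead of counting hard votes, so a 3-3 split in hard
    labels still yields a definitive decision."""
    probs = np.asarray(probs, dtype=float)   # shape: (n_models,)
    p_sexist = float(probs.mean())
    return ("sexist" if p_sexist > 0.5 else "not sexist"), p_sexist

# Three models say sexist and three say not, but the confident positive
# scores dominate, so the ensemble still decides: p ≈ 0.65, label "sexist".
label, p = soft_vote([0.95, 0.90, 0.85, 0.45, 0.40, 0.35])
```

With hard voting this example would be a tie; comparing averaged probabilities resolves it.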
For Task 2, which requires determining the intention of the tweets (single label), the label was assigned
based on the highest probability prediction among the types of intentions if the tweet was sexist. To
achieve this, a binary model was trained for each label. This approach ensures that the classification is
both precise and comprehensive, taking into account the nuanced nature of the intentions expressed in
the tweets.
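A minimal sketch of this two-stage decision, where `p_sexist` stands for the Task 1 ensemble output and `intent_probs` for the outputs of the three per-label binary models (names are illustrative):

```python
def predict_intention(p_sexist, intent_probs, threshold=0.5):
    """Assign an intention label only when the tweet is deemed sexist;
    otherwise return the negative class. The intention is the label whose
    binary model produced the highest probability."""
    if p_sexist <= threshold:
        return "NO"
    labels = ("DIRECT", "REPORTED", "JUDGEMENTAL")
    return max(zip(labels, intent_probs), key=lambda kv: kv[1])[0]

predict_intention(0.8, [0.2, 0.7, 0.4])   # → "REPORTED"
predict_intention(0.3, [0.9, 0.1, 0.2])   # → "NO"
```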
For Task 3, which involves identifying the types of sexism in a tweet (multi-label), the ensemble takes into account all types of sexism indicated by the annotators. For example, if one annotator labels a tweet as objectification and another labels it as misogynistic and sexual violence, all three types of sexism are included in our ensemble prediction. Furthermore, we employ a binary classification model for each label, allowing us to address each facet of identified sexism with specificity and precision.
To analyze in depth the impact of considering individualized annotators’ opinions (A1-A6), grouped
opinions before the classification models (All), opinions by demographic group (Females, Males, 18-22,
23-45, 46+), or assembled at the end (Ensemble Annotators, Ensemble Groups), Figure 3 presents the
results based on the performance of the sexism identification, intention, and categorization models,
respectively.
The selection of group ensemble and annotator ensemble approaches as Run 1 and Run 2, respectively,
is grounded in their ability to integrate a wide range of perspectives and individual judgments. The
group ensemble, by combining different demographics, offers an enriched and balanced overview, which
is crucial for tackling the complexity of the tasks at hand. On the other hand, the annotator ensemble
capitalizes on the diversity of individual judgments, ensuring a robust and competitive performance.
Finally, the direct majority vote of annotators is established as the baseline (Run 3) due to its simplicity
and effectiveness, providing a clear reference for evaluating ensemble methods. These choices are
backed by the best results obtained in each task, where the group ensemble consistently outperforms
others in terms of performance and ability to capture the inherent complexity in the datasets.
5. Sexism Detection in Memes
We chose a different path for Tasks 4, 5, and 6, as shown in Figure 4. Although we leveraged the annotator data from the dataset, the text preprocessing steps depicted in Figure 1 were not applied. Instead, embeddings were extracted directly from the raw data using different approaches: we utilized CLIP embeddings for the meme images and text, along with annotator features. We decided to approach the tasks primarily from the textual domain due to the high variability in the representation and graphic styles of the memes (see examples in Table 1).
In Runs 1 and 2, annotator features were represented using one-hot encoding for gender, age range, ethnicity, study level, and country; this approach is described in Subsection 5.1. In Run 3, a descriptive text was created from the annotator features, and embeddings were extracted from it. For Runs 1 and 2, the classifier was a feed-forward neural network (FNN) with two hidden layers of 4096 and 512 neurons, respectively, each followed by a dropout layer with a rate of 0.1. In Run 3, we further leverage the annotator-meme relationship and propose a Factorization Machine model, a collaborative filtering technique, to predict the annotation from the annotator and meme CLIP embeddings; we explain this approach in Subsection 5.2.
5.1. Feed-forward neural network with CLIP embeddings
For Task 4, the output layer in the FNN consisted of a single neuron to produce the probability of the
sexism class. We evaluated various approaches for the model: using only text embeddings, using only
image embeddings, using both text and image embeddings, utilizing a general model (without annotator
characteristics), and some combinations of the outputs of some of these models. Table 2 outlines the
features of each evaluated model.
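The one-hot annotator encoding and the assembly of the model input can be sketched as follows; for brevity only two of the five annotator attributes are shown, and the vocabularies and 512-dimensional CLIP embeddings are illustrative assumptions:

```python
import numpy as np

# Hypothetical category vocabularies for the one-hot encoding; the real
# dataset defines the exact values of each annotator attribute (the full
# model also encodes ethnicity, study level, and country the same way).
GENDER = ["F", "M"]
AGE = ["18-22", "23-45", "46+"]

def one_hot(value, vocabulary):
    """Return a one-hot vector for `value` over `vocabulary`."""
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0
    return vec

def early_fusion_input(annotator, text_emb, image_emb):
    """Concatenate one-hot annotator features with the CLIP text and image
    embeddings, as in the 'Text+Image' configuration."""
    feats = np.concatenate([one_hot(annotator["gender"], GENDER),
                            one_hot(annotator["age"], AGE)])
    return np.concatenate([feats, text_emb, image_emb])

x = early_fusion_input({"gender": "F", "age": "18-22"},
                       np.zeros(512), np.zeros(512))
x.shape  # → (1029,): 5 one-hot dimensions + 512 text + 512 image
```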
It is essential to define two concepts: early fusion and late fusion. In early fusion, the model
simultaneously receives both text and image embeddings, meaning the model’s input includes annotator
features, text embeddings, and image embeddings (as seen in the “Text+Image“ and “Text+Image General“
models). In late fusion, the outputs of two models are combined. For example, in the “Text|Image“ model,
the outputs of the “Text“ model (trained only with text embeddings) and the “Image“ model (trained
only with image embeddings) are combined by averaging their outputs. Similarly, the “Text|Image &
Text|Image General“ model combines the outputs of the “Text|Image“ and “Text+Image General“ models
by averaging their outputs.
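The late-fusion combination described above can be sketched as a simple average of the two models’ output probabilities (a minimal illustration with hypothetical values):

```python
import numpy as np

def late_fusion(p_a, p_b):
    """Late fusion: average the output probabilities of two independently
    trained models, e.g. the 'Text' and 'Image' models for 'Text|Image'."""
    return (np.asarray(p_a, dtype=float) + np.asarray(p_b, dtype=float)) / 2.0

# Averaging 'Text|Image' with 'Text+Image General' gives the
# 'Text|Image & Text|Image General' model in the same way.
p = late_fusion([0.8, 0.4], [0.6, 0.2])   # ≈ [0.7, 0.3]
```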
(a) Task 1: Sexism Identification in Tweets. (b) Task 2: Source Intention in Tweets. (c) Task 3: Sexism Categorization in Tweets.
Figure 3: Classification results for Sexism Detection in Tweets.

Figure 4: Leveraging Annotator Consensus and Profiles for Sexism Detection in Memes.

Table 2: Features of the different models for Task 4

Model Name                          Annotator Features   Text Embeddings   Image Embeddings   Early Fusion   Late Fusion
Text                                Yes                  Yes               No                 N/A            N/A
Image                               Yes                  No                Yes                N/A            N/A
Text+Image                          Yes                  Yes               Yes                Yes            No
Text General                        No                   Yes               No                 N/A            N/A
Image General                       No                   No                Yes                N/A            N/A
Text+Image General                  No                   Yes               Yes                Yes            No
Text|Image                          Yes                  Yes               Yes                No             Yes
Text|Image & Text|Image General     Yes & No             Yes               Yes                No             Yes

Figure 5 displays the F1 score for the positive class (sexism) across 10 runs for each model in Task 4. The results indicate that the “Text|Image” model and the “Text|Image & Text|Image General” model achieve higher mean F1 scores with low variance in performance. These models correspond to Run 1 and Run 2, respectively.

Figure 5: Classification results of different approaches for Task 4.
For Task 5, the output layer of the FNN consisted of three neurons, yielding the probability of each label. Similar to Task 4, we evaluated several approaches for the model: using only text embeddings, using only image embeddings, using both text and image embeddings, and a combination of the outputs of the “Text” and “Image” models by averaging. Figure 6 displays the macro F1 score across 10 runs for each model in Task 5.
The results in Figure 6 indicate that the “Text+Image” model achieves the highest mean F1 score with low variance in performance; this model corresponds to Run 1. For Run 2, the “Text|Image” model was selected: although it did not achieve the highest F1 score, it demonstrated an MSE comparable to that of the “Text+Image” model.
Figure 6: Classification results of different approaches for Task 5.

For Task 6, the output layer of the FNN consisted of a single neuron yielding the probability of one label of the sexism categorization. We created 5 such models, each trained exclusively on sexist memes, with a random subset of negative training examples equal in size to the positive ones, so that each model was trained on a balanced dataset. The probability output of the Task 4 model was used to determine the probability of the not sexist label and was then combined with the outputs of these 5 models to produce a final prediction.
There are two exceptional cases to consider: i) If the probability of not sexism is higher than 0.5, as
well as one of the 5 categories of sexism, the final prediction is always not sexism. ii) If the probability
of not sexism is lower than 0.5, as well as one of the 5 categories of sexism, the meme is classified as
sexist, and the category of sexism with the highest probability is selected. Similar to Tasks 4 and 5, we
evaluated various approaches for the model. Figure 7 presents the macro F1 scores for each model.
Figure 7: Classification results of different approaches for Task 6.
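The two decision rules above can be sketched as a single function; this simplified version returns a hard, single-category prediction (the actual task is multi-label), and all names are illustrative:

```python
def combine_task6(p_not_sexist, category_probs):
    """Combine the Task 4 output with the five per-category models:
    'not sexist' takes precedence whenever its probability exceeds 0.5;
    otherwise the meme is sexist and the most probable category wins."""
    if p_not_sexist > 0.5:
        return ["NO"]
    best = max(category_probs, key=category_probs.get)
    return ["YES", best]

combine_task6(0.7, {"OBJECTIFICATION": 0.8, "MISOGYNY": 0.3})  # → ["NO"]
combine_task6(0.2, {"OBJECTIFICATION": 0.8, "MISOGYNY": 0.3})
# → ["YES", "OBJECTIFICATION"]
```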
We observed similar performance among the “Text”, “Text+Image”, and “Text|Image” models. Based
on these results, we selected the “Text|Image” model for Run 1 and the “Text+Image” model for Run 2.
The “Text” model was not chosen, as we believe that the combination of text and image embeddings
yields better results.
5.2. Multimodal Collaborative Filtering employing CLIP embeddings and
Factorization Machines
In this approach, we model the problem in the same way as predicting a score in a recommendation system, or predicting links between the nodes of a bipartite graph, leveraging the fact that we have both the annotator and the item features: given known subject-item preferences, we predict new subject-item preferences. Formally, let U be the set of all subjects and V the set of all items; our core task is to find a real-valued scalar function score(u, v), where u ∈ U and v ∈ V. To provide a hard label or multi-label, k subjects vote with their encoded scores; hence, we have reduced our problem to a score-prediction problem. Each user u ∈ U is represented by a D-dimensional embedding u ∈ R^D, and each item v ∈ V by an embedding v ∈ R^D, so score(u, v) ≡ f : R^D × R^D → R. In this approach, memes and annotators are transformed into the same embedding space using CLIP. Specifically, user demographics such as age, gender, and interests are encoded with a phrase such as “A female aged 18-22, of Hispanic or Latino ethnicity, with a high school degree or equivalent, and located in Mexico” into one CLIP embedding. In contrast, the meme, which may include both image and text components, is encoded into another CLIP embedding. These embeddings capture the nuanced features of both the user and the meme content. We then concatenate these two embeddings into a single vector that represents the combined features of the user and the meme.
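The phrase construction and user-item concatenation can be sketched as follows; `clip_encode` is a stand-in placeholder for a real CLIP encoder (not the actual implementation), and the 512-dimensional embedding size is an illustrative assumption:

```python
import numpy as np

def annotator_phrase(a):
    """Render annotator demographics as the descriptive phrase fed to CLIP."""
    return (f"A {a['gender']} aged {a['age']}, of {a['ethnicity']} ethnicity, "
            f"with a {a['study_level']}, and located in {a['country']}")

def clip_encode(_content, dim=512):
    """Placeholder for a CLIP text/image encoder; returns a dummy embedding."""
    return np.zeros(dim)

def user_item_embedding(annotator, meme):
    """Concatenate the annotator and meme CLIP embeddings into the single
    vector consumed by the score model."""
    u = clip_encode(annotator_phrase(annotator))
    v = clip_encode(meme)
    return np.concatenate([u, v])

a = {"gender": "female", "age": "18-22", "ethnicity": "Hispanic or Latino",
     "study_level": "high school degree or equivalent", "country": "Mexico"}
user_item_embedding(a, "meme.png").shape  # → (1024,)
```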
For instance, Table 3 illustrates a complete utility matrix for Task 4 with known score entries f(u, v), where 0 represents the label “NO” and 1 represents the label “YES”. Our encoding method ignores "UNKNOWN" labels, but other encodings are possible. In this case, the voting policy selects the class annotated by more than 3 subjects.

Table 3: An example of a utility matrix for Task 4

         V1    V2    V3
U1       1     0     1
U2       0     1     1
U3       1     0     0
U4       1     1     1
U5       0     0     0
U6       1     0     0
Voting   1     0     Undefined
Label    YES   NO    -
For Task 5, the score function is encoded similarly to Task 4, with the addition of a voting policy and a method to map the scores to hard labels. The voting policy is the arithmetic mean of the votes' scores, which leads to the following decoding of the hard label:
𝑠𝑐𝑜𝑟𝑒 ∈ [0, 0.67] =⇒ No
𝑠𝑐𝑜𝑟𝑒 ∈ (0.67, 1.34] =⇒ Direct
𝑠𝑐𝑜𝑟𝑒 ∈ (1.34, 2] =⇒ Judgemental
We apply softmax over the votes to find the probabilities, thus solving the soft-soft task.
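This decoding can be sketched as a small function, assuming the votes are encoded as NO = 0, DIRECT = 1, JUDGEMENTAL = 2 (the encoding implied by the thresholds above):

```python
def decode_task5(votes):
    """Map the arithmetic mean of the annotators' encoded votes
    (NO=0, DIRECT=1, JUDGEMENTAL=2) onto a hard label."""
    mean = sum(votes) / len(votes)
    if mean <= 0.67:
        return "NO"
    if mean <= 1.34:
        return "DIRECT"
    return "JUDGEMENTAL"

decode_task5([0, 1, 0, 0, 1, 0])   # mean ≈ 0.33 → "NO"
decode_task5([2, 2, 1, 2, 1, 2])   # mean ≈ 1.67 → "JUDGEMENTAL"
```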
For Task 6, the different label combinations are encoded into a compact bit set as follows: each label_i corresponds to the bit 2^i, with i ≥ 0, and the union of the bits encodes a combination of labels. We provide an example below:
score(u, v) = 0b000001 ⟹ -
score(u, v) = 0b000010 ⟹ IDEOLOGICAL-INEQUALITY
score(u, v) = 0b000100 ⟹ MISOGYNY-NON-SEXUAL-VIOLENCE
score(u, v) = 0b000010 | 0b000001 = 0b000011 ⟹ -, IDEOLOGICAL-INEQUALITY
Similarly to Task 5, we count the number of common bits and apply softmax to find the probability
distribution.
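A sketch of this bit-set encoding; the label order (and hence the bit assignment) is an assumption for illustration:

```python
# Illustrative label order; label i maps to bit 2**i.
LABELS = ["-", "IDEOLOGICAL-INEQUALITY", "STEREOTYPING-DOMINANCE",
          "OBJECTIFICATION", "SEXUAL-VIOLENCE", "MISOGYNY-NON-SEXUAL-VIOLENCE"]

def encode(labels):
    """Encode a label combination as a bit set."""
    mask = 0
    for lab in labels:
        mask |= 1 << LABELS.index(lab)
    return mask

def decode(mask):
    """Recover the label combination from a bit set."""
    return [lab for i, lab in enumerate(LABELS) if mask & (1 << i)]

def common_bits(a, b):
    """Number of labels shared by two combinations (counted before softmax)."""
    return bin(a & b).count("1")

m = encode(["-", "IDEOLOGICAL-INEQUALITY"])   # → 0b000011
decode(m)                                     # → ["-", "IDEOLOGICAL-INEQUALITY"]
```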
We have defined how to decode score to solve the tasks, but how can we learn score from the annotator and meme CLIP embeddings? Embedding-based options include memory-based collaborative filtering (CF), model-based CF, neighborhood methods, Neural Graph Collaborative Filtering, Factorization Machines [8], and GCN-based CF. Among these, the Factorization Machine stands out for being efficient and accurate, enabling it to effectively predict the score [9] from the concatenated embedding. Figure 8 shows how well this approach performs on our validation dataset over 10 runs.
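A minimal sketch of a second-order Factorization Machine scoring function over the concatenated embedding x, using the standard O(kD) reformulation of the pairwise interaction term; the dimensions and random weights below are purely illustrative:

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Second-order FM: w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j,
    with the pairwise term computed efficiently as
    0.5 * sum_f ((sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2)."""
    linear = w0 + w @ x
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions

rng = np.random.default_rng(0)
D, k = 1024, 16                    # embedding size, number of latent factors
x = rng.normal(size=D)             # concatenated annotator+meme embedding
score = fm_score(x, 0.0, rng.normal(size=D), rng.normal(size=(D, k)))
```

In practice w0, w, and V would be learned from the known utility-matrix entries; here they are random to keep the sketch self-contained.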
6. Outcomes of the Evaluation Phase
Table 4 presents the combined results for both English and Spanish submissions in the sexism detection
challenge across six different tasks. Each task involves several runs evaluated using two metrics:
Hard-Hard and Soft-Soft. Below, we describe the results, focusing on the best runs for each task.
Figure 8: F1 score for the hard-hard evaluation of Tasks 4, 5, and 6 employing Collaborative Filtering.
For Task 1 (Tweets), the best performance was achieved by run MMICI_3, which ranked 17th in the Hard-Hard metric with an ICM-Hard Norm of 0.7676 and an F1 score of 0.7637. In the Soft-Soft metric, this run ranked 21st with an ICM-Soft Norm of 0.5736, making it the most effective in both metrics for this task.
For Task 4 (Memes), run MMICI_2 excelled, ranking 8th in the Hard-Hard metric with an ICM-Hard
Norm of 0.5515, and an F1 score of 0.7261. For Task 5, the top run was MMICI_1, which ranked 7th in
the Hard-Hard metric with an ICM-Hard Norm of 0.3934, and an F1 score of 0.4179. In the Soft-Soft
metric, this run performed even better, ranking 2nd with an ICM-Soft Norm of 0.3654, making it the
most effective in both categories. Lastly, for Task 6, the best run was MMICI_1, which ranked 3rd in the
Hard-Hard metric with an ICM-Hard Norm of 0.2954, and an F1 score of 0.4342.
Table 4
Results of Submission on Leaderboard for both Spanish and English (ALL)
Hard-Hard Soft-Soft
Task Run Ranking ICM-Hard ICM-Hard Norm F1 Ranking ICM-Soft ICM-Soft Norm
Task1 MMICI_1 31 0.4705 0.7365 0.7455 29 -0.3394 0.4456
Task1 MMICI_2 28 0.4780 0.7402 0.7460 30 -0.3622 0.4419
Task1 MMICI_3 17 0.5324 0.7676 0.7637 21 0.4589 0.5736
Task2 MMICI_1 27 -0.0987 0.4679 0.4548 24 -4.5753 0.1314
Task2 MMICI_2 32 -0.2406 0.4218 0.4383 25 -4.6285 0.1271
Task2 MMICI_3 28 -0.1076 0.4650 0.4525 20 -3.6350 0.2071
Task3 MMICI_1 27 -1.4509 0.1631 0.4026 22 -7.9356 0.0809
Task3 MMICI_2 28 -1.5003 0.1516 0.4017 23 -7.9380 0.0808
Task3 MMICI_3 23 -0.8105 0.3118 0.4805 20 -7.6413 0.0965
Task4 MMICI_1 12 0.0751 0.5382 0.7202 17 -0.6189 0.4005
Task4 MMICI_2 8 0.1014 0.5515 0.7261 16 -0.6183 0.4006
Task4 MMICI_3 24 -0.0361 0.4816 0.6781 19 -0.6410 0.3970
Task5 MMICI_1 7 -0.3066 0.3934 0.4179 2 -1.2660 0.3654
Task5 MMICI_2 10 -0.3868 0.3655 0.3770 3 -1.3738 0.3539
Task5 MMICI_3 8 -0.3297 0.3854 0.3814 13 -3.4751 0.1304
Task6 MMICI_1 3 -0.9863 0.2954 0.4342 19 -16.1248 0.0000
Task6 MMICI_2 7 -1.3446 0.2210 0.4453 20 -19.3246 0.0000
Task6 MMICI_3 24 -3.8341 0.0000 0.2347 21 -45.0237 0.0000
The results for the Spanish submissions are presented in Table 5. Below, we discuss these
outcomes, focusing on the best runs for each task. For Task 1, the best
performance was achieved by run MMICI_3, which ranked 10th in the Hard-Hard metric with an
ICM-Hard Norm of 0.7802, and an F1 score of 0.7892. In Task 2, the best run was MMICI_1, ranking
15th in the Hard-Hard metric with an ICM-Hard Norm of 0.5522, and an F1 score of 0.5133. For Task 3,
the top run was MMICI_1, ranking 13th in the Hard-Hard metric with an ICM-Hard Norm of 0.4586,
and an F1 score of 0.5486. In Task 4, run MMICI_2 excelled, ranking 14th in the Hard-Hard metric with
an ICM-Hard Norm of 0.4900, and an F1 score of 0.6997. For Task 5, the top run was MMICI_1, which
ranked 7th in the Hard-Hard metric with an ICM-Hard Norm of 0.3945, and an F1 score of 0.4198. In the
Soft-Soft metric, this run performed even better, ranking 1st with an ICM-Soft Norm of 0.3461, making
it the best-performing in both categories. Lastly, in Task 6, the best run was MMICI_1, which ranked
4th in the Hard-Hard metric with ICM-Hard Norm of 0.2473, and an F1 score of 0.3868.
Table 5
Results of Submission on Leaderboard for Spanish
Task   Run      | Hard-Hard: Ranking   ICM-Hard   ICM-Hard Norm   F1 | Soft-Soft: Ranking   ICM-Soft   ICM-Soft Norm
Task1 MMICI_1 16 0.5323 0.7662 0.7817 24 0.0894 0.5143
Task1 MMICI_2 22 0.5007 0.7504 0.7705 25 0.0170 0.5027
Task1 MMICI_3 10 0.5603 0.7802 0.7892 15 0.6706 0.6076
Task2 MMICI_1 15 0.1670 0.5522 0.5133 23 -4.1728 0.1658
Task2 MMICI_2 26 0.0064 0.5020 0.4933 24 -4.2127 0.1626
Task2 MMICI_3 29 -0.1146 0.4642 0.4779 20 -3.4962 0.2200
Task3 MMICI_1 13 -0.1853 0.4586 0.5486 24 -7.8261 0.0927
Task3 MMICI_2 14 -0.2269 0.4493 0.5446 25 -7.8356 0.0922
Task3 MMICI_3 22 -0.5870 0.3689 0.5165 22 -7.4291 0.1134
Task4 MMICI_1 17 -0.0591 0.4699 0.6906 14 -0.6655 0.3939
Task4 MMICI_2 14 -0.0196 0.4900 0.6997 15 -0.6689 0.3933
Task4 MMICI_3 26 -0.1848 0.4059 0.6470 18 -0.8361 0.3667
Task5 MMICI_1 7 -0.3028 0.3945 0.4198 1 -1.4813 0.3461
Task5 MMICI_2 9 -0.4077 0.3580 0.3728 2 -1.5486 0.3392
Task5 MMICI_3 10 -0.4875 0.3302 0.3545 13 -4.0400 0.0804
Task6 MMICI_1 4 -1.2346 0.2473 0.3868 18 -14.9495 0.0000
Task6 MMICI_2 13 -1.6925 0.1536 0.4141 20 -18.0902 0.0000
Task6 MMICI_3 24 -3.8686 0.0000 0.2225 21 -42.6540 0.0000
The results for the English submissions are presented in Table 6. Below, we highlight the top
runs for the meme tasks (Tasks 4-6). In Task 4, run MMICI_2 excelled, ranking
3rd in the Hard-Hard metric with an ICM-Hard Norm of 0.6129, and an F1 score of 0.7559. For Task
5, the top run was MMICI_3, which ranked 1st in the Hard-Hard metric with an ICM-Hard Norm of
0.4413, and an F1 score of 0.4094. Lastly, in Task 6, the best run was MMICI_1, which ranked 2nd in the
Hard-Hard metric with an ICM-Hard Norm of 0.3419, and an F1 score of 0.4726.
The strong results on memes can be attributed to the use of CLIP (Contrastive Language-Image
Pre-training) embeddings. CLIP learns visual concepts from natural-language supervision: it is
trained on a vast dataset of images paired with their textual descriptions, aligning images and
text within a shared embedding space. A Vision Transformer encodes the image data while a Text
Transformer encodes the text data, and both sets of embeddings are projected into a common space
where their similarities can be measured and aligned, allowing the model to leverage the strengths
of visual and textual information jointly. This approach enabled the extraction of sexist
expressions from memes in the dataset across both languages; by transferring the representation
to the textual domain, it became possible to adopt state-of-the-art techniques for the classification tasks.
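The shared-space idea described above can be illustrated with a minimal sketch. The random matrices below are stand-ins for CLIP's learned encoders and projection heads, and the dimensions are illustrative assumptions only; this is not the actual model, just the geometry of projecting two modalities into one comparable space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs: each modality has its own dimensionality
# (e.g. ViT image features vs. Text Transformer features).
image_features = rng.normal(size=(4, 768))  # 4 memes
text_features = rng.normal(size=(4, 512))   # 4 overlaid texts

# Stand-ins for the learned projection heads that map both modalities
# into one shared embedding space.
W_image = rng.normal(size=(768, 256))
W_text = rng.normal(size=(512, 256))

def project(features, W):
    """Project features into the shared space and L2-normalize them."""
    z = features @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

z_img = project(image_features, W_image)
z_txt = project(text_features, W_text)

# Cosine similarity between every image and every text embedding; with
# trained (rather than random) projections, matching image-text pairs
# would score highest along the diagonal.
similarity = z_img @ z_txt.T
print(similarity.shape)  # (4, 4)
```

With the two modalities normalized in the same space, a single downstream classifier can consume image and text evidence through one similarity-comparable representation.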
In summary, the combined analysis of the English and Spanish submissions in the sexism detection
challenge reveals diverse approaches and performances across tasks. Each language cohort showed
distinct strengths, with runs such as MMICI_1 and MMICI_3 consistently demonstrating
effectiveness across multiple tasks. These results underscore the complexity of sexism detection and
highlight the importance of multilingual evaluation frameworks. Further exploration and refinement of
these methodologies promise continued advances in combating bias and fostering inclusivity in
online content.
Table 6
Results of Submission on Leaderboard for English
Task   Run      | Hard-Hard: Ranking   ICM-Hard   ICM-Hard Norm   F1 | Soft-Soft: Ranking   ICM-Soft   ICM-Soft Norm
Task1 MMICI_1 40 0.3840 0.6960 0.6971 32 -0.8805 0.3586
Task1 MMICI_2 33 0.4402 0.7246 0.7141 31 -0.8349 0.3659
Task1 MMICI_3 25 0.4912 0.7507 0.7315 21 0.1413 0.5227
Task2 MMICI_1 33 -0.4572 0.3418 0.3680 23 -5.0641 0.0861
Task2 MMICI_2 36 -0.5728 0.3018 0.3570 24 -5.1264 0.0810
Task2 MMICI_3 30 -0.1384 0.4521 0.4087 19 -3.8024 0.1892
Task3 MMICI_1 32 -2.8962 0.0000 0.2357 22 -7.9094 0.0666
Task3 MMICI_2 34 -2.9573 0.0000 0.2373 21 -7.9059 0.0668
Task3 MMICI_3 26 -1.1024 0.2298 0.4287 19 -7.7476 0.0755
Task4 MMICI_1 5 0.2094 0.6063 0.7538 20 -0.5779 0.4062
Task4 MMICI_2 3 0.2224 0.6129 0.7559 19 -0.5735 0.4069
Task4 MMICI_3 18 0.1131 0.5574 0.7122 17 -0.4621 0.4250
Task5 MMICI_1 6 -0.3112 0.3920 0.4156 2 -1.1089 0.3790
Task5 MMICI_2 8 -0.3657 0.3731 0.3815 3 -1.2447 0.3642
Task5 MMICI_3 1 -0.1691 0.4413 0.4094 13 -2.9704 0.1760
Task6 MMICI_1 2 -0.7441 0.3419 0.4726 15 -18.3643 0.0000
Task6 MMICI_2 7 -1.0095 0.2855 0.4752 16 -21.6764 0.0000
Task6 MMICI_3 20 -3.8687 0.0000 0.2447 17 -49.2040 0.0000
7. Conclusion
This paper has detailed MMICI’s participation in the EXIST shared task at CLEF 2024, focusing on
the detection and categorization of sexism in social media content. By leveraging various innovative
methodologies, including ensemble approaches that incorporate diverse annotator profiles and multi-
modal embeddings, our models have demonstrated substantial efficacy in identifying and understanding
sexism in both tweets and memes.
The results of our evaluation phase reveal that our ensemble methods, particularly those combining
annotator profiles with text and image embeddings, achieve robust performance across multiple tasks.
Specifically, our runs have shown competitive results in detecting sexism, discerning the intent behind
sexist content, and categorizing different types of sexism. For instance, the ensemble approaches
used in Runs 1 and 2 consistently outperformed traditional majority voting methods, highlighting the
value of integrating diverse perspectives in addressing complex subjective tasks like sexism detection.
Our approach emphasizes the importance of considering individual annotator characteristics, such
as gender and age, to ensure that our models capture a wide range of viewpoints and avoid silencing
minority voices. In most tasks, our baseline strategy performed best. However, for Tasks 2 and 3
in Spanish, our ensembles surpassed the baseline by capturing a broader range of perspectives. This
nuanced understanding of sexism, facilitated by advanced machine learning techniques and diverse
data representation, is crucial for effectively combating sexist behaviors and discourses online.
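To make the contrast with traditional majority voting concrete, the sketch below averages predictions within each annotator-profile group before combining groups, so that a small demographic group is not drowned out. It is a simplified illustration of the idea, not our exact implementation; the group names and label strings are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Plain majority vote over all annotator predictions."""
    return Counter(predictions).most_common(1)[0][0]

def profile_soft_vote(predictions_by_group):
    """Average the positive rate within each profile group first, then
    across groups, giving every group equal weight regardless of size."""
    group_rates = [
        sum(1 for p in preds if p == "YES") / len(preds)
        for preds in predictions_by_group.values()
    ]
    score = sum(group_rates) / len(group_rates)
    return ("YES" if score >= 0.5 else "NO"), score

# Five annotators from a majority profile group say NO; the single
# annotator from a minority group says YES.
preds = ["NO", "NO", "NO", "NO", "NO", "YES"]
groups = {"F_18-22": ["YES"], "M_23-45": ["NO"] * 5}

print(majority_vote(preds))       # NO  -- minority view is silenced
print(profile_soft_vote(groups))  # ('YES', 0.5) -- groups weighted equally
```

The soft score can also be reported directly as a soft label, which is how disagreement-aware (Soft-Soft) evaluation consumes predictions.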
As future work, there is significant potential in exploring additional data collected on annotators in
the EXIST 2024 dataset, including their ethnicities, study levels, and countries of origin, to enhance the
cross-lingual and cross-cultural analysis capabilities of sexism detection systems. Developing models
that effectively handle multiple languages and cultural contexts, possibly through cross-lingual transfer
learning and the creation of culturally nuanced models, would improve global applicability. Addi-
tionally, further exploration of Transformer-based models and the creation of ensembles can leverage
their strengths to improve detection accuracy. Expanding the dataset to include more diverse and
underrepresented demographic groups would also contribute to building more robust and generalizable
models. This could involve collecting additional annotated data from various social media platforms
and cultural contexts. Moreover, improving multimodal techniques by leveraging advanced neural
network architectures and incorporating additional features can further enhance model performance in
detecting sexism.
Overall, our participation in the EXIST task underscores the potential of advanced ensemble methods
and multimodal analysis in improving the detection and categorization of sexism in social media. These
methods not only enhance the accuracy of automatic tools but also contribute to a deeper understanding
of how sexism manifests in various forms, thereby supporting broader efforts to promote gender equity
and reduce discrimination in digital spaces.
Acknowledgments
This work has been partially supported by CONAHCYT (The National Council of Humanities, Sciences,
and Technologies of Mexico), which promotes scientific and technological development in the country.
Additionally, we acknowledge the support provided through the following scholarships: Martha
Paola Jimenez-Martinez (scholarship number 828539) and Joan Manuel Raygoza-Romero (scholarship
number 806073).