Enhancing Sexism Detection in Tweets with Annotator-Integrated Ensemble Methods and Multimodal Embeddings for Memes

Notebook for the EXIST Lab at CLEF 2024

Martha Paola Jimenez-Martinez1,*, Joan Manuel Raygoza-Romero1, Carlos Eduardo Sánchez-Torres3, Irvin Hussein Lopez-Nava1,3 and Manuel Montes-y-Gómez2

1 Centro de Investigación Científica y de Educación Superior de Ensenada, Mexico
2 Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico
3 Universidad Autónoma de Baja California, Ensenada, Baja California, Mexico

Abstract

This paper details MMICI's participation in the EXIST challenge at CLEF 2024, focusing on the identification and categorization of sexism in social media posts and memes. For tweets, we employed pre-trained transformer models and ensemble voting approaches. For memes, we utilized CLIP embeddings obtained with a Vision Transformer (ViT) model and two types of classifiers: feed-forward neural networks and factorization machines. The tasks encompassed detecting sexism in tweets and memes, as well as categorizing its type and the author's intention. Our methodology for tweets integrates annotator profiles, such as gender and age, to improve sexism identification, source-intention detection, and sexism categorization. For memes, we used all annotator features (gender, age, ethnicity, study level, and country) for the same tasks. The results demonstrate the effectiveness of our models across the tasks and underline the value of integrating diverse perspectives. Notably, our best performances include a 10th place in Task 1, a 15th place in Task 2, and a 13th place in Task 3 for Spanish tweets. For memes, we achieved a 3rd place in Task 4 for English memes, two 1st places in Task 5 for English and Spanish memes, and a 2nd place in Task 6 for English memes.
These results underscore the importance of incorporating the demographic factors of annotators and of leveraging multimodal embeddings for robust sexism detection.

Keywords: Sexism detection, Sexism identification, Sexism classification, Social media, Transformer models

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
jimenezmp@cicese.edu.mx (M. P. Jimenez-Martinez); jraygoza@cicese.edu.mx (J. M. Raygoza-Romero); a361075@uabc.edu.mx (C. E. Sánchez-Torres); hussein@cicese.edu.mx (I. H. Lopez-Nava); mmontesg@inaoep.mx (M. Montes-y-Gómez)
ORCID: 0009-0005-8701-9875 (M. P. Jimenez-Martinez); 0000-0003-3085-5678 (J. M. Raygoza-Romero); 0000-0001-5799-4067 (C. E. Sánchez-Torres); 0000-0003-3979-9465 (I. H. Lopez-Nava); 0000-0002-7601-501X (M. Montes-y-Gómez)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

According to the Cambridge Dictionary, sexism is defined as "(actions based on) the belief that the members of one sex are less intelligent, able, skilful, etc. than the members of the other sex, especially that women are less able than men" [1]. The Royal Spanish Academy, in turn, defines sexism as "discrimination against individuals based on their sex" (in Spanish: discriminación de las personas por razón de sexo) [2]. Both interpretations agree that sexism not only reflects but also communicates and perpetuates the stereotypes and roles historically assigned to women and men in society. This perpetuation of stereotypes is a significant obstacle in the struggle for gender equity [3].

Research on gender ideologies employs the Ambivalent Sexism Inventory and the Ambivalence toward Men Inventory. The Ambivalent Sexism Inventory measures hostile sexism, which reflects antagonistic attitudes towards women, and benevolent sexism, which consists of subjectively favorable but patronizing beliefs about women. The Ambivalence toward Men Inventory assesses hostility toward men, rooted in resentment of men's perceived greater power, and benevolence toward men, which involves favorable views of men as protectors and providers. Ambivalent sexism theory posits that hostile and benevolent sexism arise from social and biological factors common across cultures, such as patriarchy, gender differentiation, and heterosexuality. Systemically, the two function as complementary ideologies that justify and perpetuate gender inequality, showing a strong correlation across cultures. This underscores the necessity of addressing both hostile and benevolent forms of sexism in the pursuit of gender equality [4].

This paper details MMICI's participation in the "sEXism Identification in Social neTworks" (EXIST) shared task at CLEF 2024. EXIST, running since 2021, aims to broadly capture instances of sexism, ranging from overt misogyny to subtler expressions of implicit sexist behavior. The goal of such automatic tools is not only to detect and warn against sexist behaviors and discourses, but also to estimate the prevalence of sexist and abusive situations on social media platforms, identify the most common forms of sexism, and understand how sexism manifests in these media [3].

Over the years, EXIST has evolved significantly. In 2021 and 2022, it provided a dataset with definitive (hard) labels for each tweet. Starting in 2023 and continuing into 2024, each tweet instead carries six labels, one from each of six distinct annotator profiles: three women and three men from the age groups 18-22, 23-45, and 46+.
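A per-tweet record in this setup can be pictured roughly as follows (a hypothetical sketch of the structure; field names are illustrative and do not reproduce the dataset's actual schema):

```python
# Hypothetical structure of one annotated tweet: six labels, one per
# annotator profile (gender x age group). Field names are illustrative.
tweet_record = {
    "text": "example tweet",
    "annotators": [
        {"gender": "F", "age": "18-22", "label": "YES"},
        {"gender": "F", "age": "23-45", "label": "YES"},
        {"gender": "F", "age": "46+",   "label": "NO"},
        {"gender": "M", "age": "18-22", "label": "NO"},
        {"gender": "M", "age": "23-45", "label": "YES"},
        {"gender": "M", "age": "46+",   "label": "YES"},
    ],
}

# A hard label can be derived by majority vote over the six annotations.
votes = [a["label"] for a in tweet_record["annotators"]]
hard_label = "YES" if votes.count("YES") >= 4 else "NO"
print(hard_label)  # YES
```

This six-label structure is what the later sections exploit, either per annotator or grouped by demographic attributes.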
Furthermore, the most recent edition incorporates demographic parameters of the annotators: gender, age, level of education, ethnicity, and country of residence.

2. The EXIST 2024 Dataset

In its fourth edition [5], the task has incorporated new challenges involving images, specifically memes. The six tasks are as follows:

• Task 1: Sexism Identification in Tweets. Identifying whether a tweet is sexist or not.
• Task 2: Source Intention in Tweets. Once a tweet is classified as sexist, categorizing the author's intention as direct, reported, or judgemental.
• Task 3: Sexism Categorization in Tweets. Classifying sexist tweets into specific categories: ideological and inequality, stereotyping and dominance, objectification, sexual violence, and misogyny and non-sexual violence.
• Task 4: Sexism Identification in Memes. The analogue of Task 1 for memes: determining whether a meme is sexist.
• Task 5: Source Intention in Memes. The analogue of Task 2 for memes: categorizing memes by the author's intention, either direct or judgemental.
• Task 6: Sexism Categorization in Memes. The analogue of Task 3: classifying sexist memes into the same categories as tweets.

These tasks aim to improve the understanding and detection of sexism across various forms of social media content in both English and Spanish, ultimately supporting efforts to combat sexism online. Since the data covers expressions in different languages, it cannot be assumed that models for detecting sexism in one language transfer directly to another: the manifestations of sexism differ syntactically and semantically across countries and contexts [6]. To illustrate the differences between the two languages, Table 1 provides examples for the labels of the different tasks, using instances on which all annotators reached a consensus.
Table 1: Examples of tweets and memes from the EXIST 2024 dataset (meme examples are images and are not reproduced here).

TASK 1: Sexism Identification in Tweets
  Sexist
    ES: "Mujer al volante, tenga cuidado!"
    EN: "People really try to convince women with little to no ass that they should go out and buy a body. Like bih, I don't need a fat ass to get a man. Never have."
  Not Sexist
    ES: "Alguien me explica que zorra hace la gente en el cajero que se demora tanto."
    EN: "@messyworldorder it's honestly so embarrassing to watch and they'll be like 'not all white women are like that'"

TASK 2: Source Intention in Tweets
  Direct
    ES: "Una mujer necesita amor, llenar la nevera, si un hombre puede darle esto a cambio de sus servicios (tareas domésticas, cocinar, etc.), no veo qué más necesita."
    EN: "Women shouldn't code... perhaps be influencer/creator instead... it's their natural strength."
  Reported
    ES: "Me duermo en el metro, abro los ojos sintiendo algo raro: la mano del hombre sentado a mi lado en mi pierna #SquealOnYourPig."
    EN: "Today, one of my year 1 class pupils could not believe he'd lost a race against a girl."
  Judgemental
    ES: "Como de costumbre, la mujer fue la que dejó su trabajo por el bienestar de la familia..."
    EN: "21st century and we are still earning 25% less than men #Idonotrenounce."

TASK 3: Sexism Categorization in Tweets
  Ideological and Inequality
    ES: "Mi hermana y mi madre se burlan de mí por defender todo el tiempo los derechos de todos y me acaban de decir feminazi, la completaron."
    EN: "I think the whole equality thing is getting out of hand. We are different, thats how were made!"
  Stereotyping and Dominance
    ES: "@Paula2R @faber_acuria A las mujeres hay que amarlas... solo eso... Nunca las entenderás."
    EN: "Most women no longer have the desire or the knowledge to develop a high quality character, even if they wanted to."
  Objectification
    ES: "'Pareces una puta con ese pantalón' - Mi hermano de 13 cuando me vio con un pantalón de cuero."
    EN: "Don't get married than blame all woman for your poor investment. You should of got a hooker but instead you choose to go get a wedding ring. #MeToo"
  Sexual Violence
    ES: "Estas 4 no han conseguido su objetivo. El juez estima que se abrieron de patas"
    EN: "Fuck that cunt, I would with my fist."
  Misogyny and Non-Sexual Violence
    ES: "Las mujeres de hoy en dia te enseñar a querer... estar soltero"
    EN: "Some woman are so toxic they don't even know they are draining everyone around them in poison. If you lack self awareness you won't even notice how toxic you really are."

TASK 4: Sexism Identification in Memes — labels: Sexist, Not Sexist (image examples omitted).
TASK 5: Source Intention in Memes — labels: Direct, Judgemental (image examples omitted).
TASK 6: Sexism Categorization in Memes — labels: Ideological and Inequality, Stereotyping and Dominance, Objectification, Sexual Violence, Misogyny and Non-Sexual Violence (image examples omitted).

3. Overview of the proposal

Previous research has developed methods to model annotators in subjective tasks, allowing the prediction of personalized labels for each annotator. For instance, Akhtar et al. [7] conducted an exhaustive search to classify annotators into two groups based on their annotation patterns. Their study demonstrated that an ensemble composed of two distinct classifiers, one representing the perspective of each group, outperformed the traditional single-task model that only considers aggregated labels. Additionally, traditional classification methods typically aggregate labels through majority voting or averaging before training; this approach has been found to potentially "silence the voices" of socio-demographic minority groups [7].
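The "silencing" effect of pre-training aggregation can be seen in a toy example (hypothetical votes, not EXIST data): pooling all votes before deriving a label can flip the outcome against the unanimous view of one demographic group.

```python
from collections import Counter

# Hypothetical example (not EXIST data): six annotators from two
# demographic groups label two tweets (1 = sexist, 0 = not sexist).
annotations = {
    "tweet_1": {"group_a": [1, 1, 1], "group_b": [0, 0, 1]},
    "tweet_2": {"group_a": [0, 0, 0], "group_b": [1, 1, 0]},
}

results = {}
for tweet, groups in annotations.items():
    pooled = groups["group_a"] + groups["group_b"]
    aggregated = Counter(pooled).most_common(1)[0][0]
    per_group = {g: Counter(v).most_common(1)[0][0] for g, v in groups.items()}
    results[tweet] = (aggregated, per_group)

# For tweet_2, the pooled majority (0) overrides group_b's majority view (1).
print(results["tweet_2"])  # → (0, {'group_a': 0, 'group_b': 1})
```

Keeping per-group (or per-annotator) models and combining them afterwards, as done below, preserves this minority signal until the final vote.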
One of the objectives of this study is to leverage the individual opinions of annotators, or to group them by demographic characteristics, so that their "voices" are effectively integrated into the sexism detection models. Building on these ideas, our approach to the EXIST tasks encompasses multiple strategies across runs and tasks:

Figure 1: Overview of the proposal for Sexism Detection in EXIST 2024.

• Run 1 for Tasks 1, 2, and 3: The model predicts labels through an ensemble that combines outputs from models trained per age group and per gender.
• Run 2 for Tasks 1, 2, and 3: The model predicts labels through an ensemble that integrates outputs from models trained for each of the six annotator profiles.
• Run 3 for Tasks 1, 2, and 3: The model predicts labels using a majority vote, where the final prediction is the consensus among all annotators.
• Runs 1 and 2 for Tasks 4 and 5: We use embeddings for both the text and the image of each meme, representing its deep features. Annotator attributes are additionally incorporated to develop a model capable of predicting a label for each annotator; the final label is determined by voting among the annotators' predictions.
• Runs 1 and 2 for Task 6: A specialized model is trained for each label using only sexist examples, with the data balanced per class. Embeddings for the meme text and image are utilized. The final output combines the prediction for non-sexist cases (from Task 4) with the outputs of the specialized models for each sexism category to produce a single prediction.
• Run 3 for Tasks 4, 5, and 6: The system predicts labels by concatenating the annotator's profile embedding with an image embedding in the same space.
A multimodal embedding model assesses the relationship between annotators and items, and a voting mechanism is then applied to determine the final label.

Our general approach, shown in Figure 1, integrates text and visual processing with transformer models to extract features and perform classification. Tweets are preprocessed and fed into a transformer model to generate text embeddings, while memes are processed by a vision transformer to produce visual embeddings. Annotator features are combined with the text and visual embeddings to train classifiers, and an ensemble technique combines the model outputs to improve accuracy. Performance is then evaluated on each task.

For Spanish, the dataset comprises 2526 samples for training, 639 for validation, and 490 for local testing. For English, it comprises 1832 samples for training, 574 for validation, and 978 for local testing; the local test sets served as our benchmark for estimating the generalization of the models. For local evaluation, we used the F1-score of the positive class (sexist) for Tasks 1 and 4, and the macro F1-score (the average of the per-class F1-scores) for Tasks 2, 3, 5, and 6.

Figure 2: Leveraging Annotator Consensus and Profiles for Sexism Detection in Tweets.

4. Sexism Detection in Tweets

For the detection of sexism in tweets, we focus on integrating annotator information, particularly profiles such as gender and age, as summarized in Figure 2.

• Text preprocessing: Mentions within tweets were substituted with '@USER', and URLs were replaced with 'HTTPURL'.
• Transformer models: We used models pre-trained specifically on tweets: "cardiffnlp/twitter-roberta-base-sentiment" for English and "pysentimiento/robertuito-base-uncased" for Spanish, both from Hugging Face, since these models were trained on data in the respective languages.
• Ensembles: We employed two ensembles. The first takes a majority vote over the outputs of six models, one per annotator. The second takes a majority vote over the outputs of five models grouped by gender and age (females, males, 18-22, 23-45, and 46+).

As mentioned in the previous section, our runs for the three text-focused tasks were:

1. Run 1: An ensemble of the five gender and age group models. A majority vote was taken, with the label assigned if three or more groups agreed.
2. Run 2: An ensemble of the six per-annotator models. A majority vote was taken, with the label assigned if four or more annotators agreed.
3. Run 3: A majority vote taken directly over the six annotators' labels, serving as the baseline; as in Run 2, the label was assigned if four or more annotators agreed.

To guarantee a decision in each ensemble without ties, we used probabilistic voting rather than hard voting from each model.
This means that even if three models classify a tweet as sexist and three do not, the probabilities are compared and the decision is made based on the highest probability, ensuring a definitive decision for every prediction in the ensembles.

For Task 2, which requires determining the intention of a tweet (single label), the label was assigned based on the highest-probability prediction among the intention types whenever the tweet was classified as sexist. To achieve this, a binary model was trained for each label. This keeps the classification precise while accounting for the nuanced nature of the intentions expressed in tweets.

For Task 3, which involves identifying the types of sexism in a tweet (multi-label), the ensemble takes into account all types of sexism indicated by the annotators. For example, if one annotator labels a tweet as objectification and another labels it as misogyny and sexual violence, all three types are included in our ensemble prediction. Furthermore, we employ a binary classification model per label, allowing us to address each facet of identified sexism with specificity.

To analyze in depth the impact of considering individual annotators' opinions (A1-A6), opinions aggregated before training (All), opinions by demographic group (Females, Males, 18-22, 23-45, 46+), and opinions assembled at the end (Ensemble Annotators, Ensemble Groups), Figure 3 presents the results of the sexism identification, intention, and categorization models, respectively.

The selection of the group ensemble and the annotator ensemble as Run 1 and Run 2, respectively, is grounded in their ability to integrate a wide range of perspectives and individual judgments. The group ensemble, by combining different demographics, offers an enriched and balanced overview, which is crucial for tackling the complexity of the tasks at hand.
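The tie-free probabilistic voting used in these ensembles can be sketched as follows (a minimal illustration; the per-model probabilities are stubbed with fixed values, whereas in practice each annotator or group model produces them per tweet):

```python
from statistics import mean

def soft_vote(pos_probs, threshold=0.5):
    """Probabilistic (soft) voting: average the positive-class probability
    of each model instead of counting hard votes, so even a 3-3 split in
    hard votes yields a definitive decision."""
    score = mean(pos_probs)
    return ("sexist" if score >= threshold else "not sexist"), score

# Illustrative probabilities from six per-annotator models: hard votes
# would tie 3-3, but the positive models are more confident.
probs = [0.92, 0.88, 0.75, 0.40, 0.45, 0.35]
label, score = soft_vote(probs)
print(label, round(score, 3))  # sexist 0.625
```

Averaging probabilities rather than counting labels is what guarantees that the comparison against the threshold always resolves the 3-3 case.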
On the other hand, the annotator ensemble capitalizes on the diversity of individual judgments, ensuring robust and competitive performance. Finally, the direct majority vote of annotators is established as the baseline (Run 3) due to its simplicity and effectiveness, providing a clear reference for evaluating the ensemble methods. These choices are backed by the best results obtained in each task, where the group ensemble consistently outperforms the others in performance and in its ability to capture the inherent complexity of the datasets.

Figure 3: Classification results for Sexism Detection in Tweets. (a) Task 1: Sexism Identification; (b) Task 2: Source Intention; (c) Task 3: Sexism Categorization.

5. Sexism Detection in Memes

We chose a different path for Tasks 4, 5, and 6, as shown in Figure 4. Although we leveraged the annotator data from the dataset, the text preprocessing steps depicted in Figure 1 were not applied; instead, embeddings were extracted directly from the raw data. We utilized CLIP embeddings for the meme images and text, along with annotator features. We decided to address the tasks from the textual domain due to the high variability in the representation and graphic styles of the memes (see examples in Table 1).

Figure 4: Leveraging Annotator Consensus and Profiles for Sexism Detection in Memes.

In Runs 1 and 2, annotator features were represented using one-hot encodings of gender, age range, ethnicity, study level, and country; this approach is detailed in Subsection 5.1. In Run 3, a descriptive text was created from the annotator features, from which embeddings were extracted. For Runs 1 and 2, the classifier was a feed-forward neural network (FNN) with two hidden layers of 4096 and 512 neurons, followed by a dropout layer with a rate of 0.1. In Run 3, we further leverage the annotator-meme relationship and propose a Factorization Machine model, a collaborative filtering technique, to predict the annotation from the annotator and meme CLIP embeddings; we explain this approach in Subsection 5.2.

5.1. Feed-forward neural network with CLIP embeddings

For Task 4, the output layer of the FNN consisted of a single neuron producing the probability of the sexist class. We evaluated several configurations: using only text embeddings, only image embeddings, both text and image embeddings, a general model (without annotator characteristics), and combinations of the outputs of some of these models. Table 2 outlines the features of each evaluated model.

It is essential to define two concepts: early fusion and late fusion. In early fusion, the model simultaneously receives both text and image embeddings, i.e., its input concatenates the annotator features, text embeddings, and image embeddings (as in the "Text+Image" and "Text+Image General" models). In late fusion, the outputs of two models are combined. For example, in the "Text|Image" model, the outputs of the "Text" model (trained only with text embeddings) and the "Image" model (trained only with image embeddings) are averaged. Similarly, the "Text|Image & Text|Image General" model averages the outputs of the "Text|Image" and "Text+Image General" models.

Table 2: Features of the different models for Task 4.

Model                            | Annotator Features | Text Emb. | Image Emb. | Early Fusion | Late Fusion
Text                             | Yes      | Yes | No  | N/A | N/A
Image                            | Yes      | No  | Yes | N/A | N/A
Text+Image                       | Yes      | Yes | Yes | Yes | No
Text General                     | No       | Yes | No  | N/A | N/A
Image General                    | No       | No  | Yes | N/A | N/A
Text+Image General               | No       | Yes | Yes | Yes | No
Text|Image                       | Yes      | Yes | Yes | No  | Yes
Text|Image & Text|Image General  | Yes & No | Yes | Yes | No  | Yes

Figure 5 displays the F1 score of the positive class (sexist) across 10 runs for each model in Task 4. The results indicate that the "Text|Image" model and the "Text|Image & Text|Image General" model achieve higher mean F1 scores with low variance; these models correspond to Run 1 and Run 2, respectively.

Figure 5: Classification results of different approaches for Task 4.

For Task 5, the output layer of the FNN consisted of 3 neurons, yielding a probability for each label. As in Task 4, we evaluated several configurations: only text embeddings, only image embeddings, both text and image embeddings, and an average of the outputs of the "Text" and "Image" models. Figure 6 displays the macro F1 score across 10 runs for each model in Task 5. The results indicate that the "Text+Image" model achieves a higher mean F1 score with low variance; this model corresponds to Run 1. For Run 2, the "Text|Image" model was selected: although it did not achieve the highest F1 score, it showed a strong MSE comparable to the "Text+Image" model.

Figure 6: Classification results of different approaches for Task 5.

For Task 6, the output layer of the FNN consisted of one neuron yielding the probability of each sexism category. We created 5 models, each trained exclusively on sexist memes, with a random subset of training data for negative cases equal in size to the positive cases, so that each model was trained on a balanced dataset. The probability output of the Task 4 model was used as the probability of the not-sexist label and combined with the outputs of these 5 models to produce a final prediction.
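The FNN classifier used throughout this section can be sketched in PyTorch as follows. The 4096/512 hidden layers and the 0.1 dropout follow the description above; the input dimensions (a CLIP embedding size of 512 and an annotator one-hot size of 20) are illustrative assumptions, as is the class name.

```python
import torch
import torch.nn as nn

class EarlyFusionFNN(nn.Module):
    """Early fusion: annotator one-hot features and CLIP text/image
    embeddings are concatenated into a single input vector."""
    def __init__(self, annotator_dim=20, clip_dim=512, n_out=1):
        super().__init__()
        in_dim = annotator_dim + 2 * clip_dim  # annotator + text + image
        self.net = nn.Sequential(
            nn.Linear(in_dim, 4096), nn.ReLU(),
            nn.Linear(4096, 512), nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, n_out),
        )

    def forward(self, annotator, text_emb, image_emb):
        x = torch.cat([annotator, text_emb, image_emb], dim=-1)
        return torch.sigmoid(self.net(x))  # per-class probability

model = EarlyFusionFNN()
probs = model(torch.zeros(4, 20), torch.zeros(4, 512), torch.zeros(4, 512))
print(probs.shape)  # torch.Size([4, 1])
```

A late-fusion variant ("Text|Image") would instead train two such networks, one per modality, and average their output probabilities.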
There are two exceptional cases to consider: i) if the probability of not sexist is higher than 0.5, as well as that of one of the 5 sexism categories, the final prediction is always not sexist; ii) if the probability of not sexist is lower than 0.5, as well as that of one of the 5 sexism categories, the meme is classified as sexist and the sexism category with the highest probability is selected. As in Tasks 4 and 5, we evaluated various configurations of the model. Figure 7 presents the macro F1 scores for each model.

Figure 7: Classification results of different approaches for Task 6.

We observed similar performance among the "Text", "Text+Image", and "Text|Image" models. Based on these results, we selected the "Text|Image" model for Run 1 and the "Text+Image" model for Run 2. The "Text" model was not chosen, as we believe the combination of text and image embeddings yields better results.

5.2. Multimodal Collaborative Filtering employing CLIP embeddings and Factorization Machines

In this approach, we model the problem similarly to assigning a score in a recommendation system, or to predicting links between nodes in a bipartite graph, leveraging the fact that we have both the annotator and the item features: given known subject-item preferences, predict new subject-item preferences. Formally, let U be the set of all subjects and V the set of all items; our core task is to find a real-valued scalar function score(u, v), where u ∈ U and v ∈ V. To produce a hard label or multi-label, k subjects vote with their encoded scores. Hence, we have reduced our problem to a score prediction problem. For each user u ∈ U, let u ∈ ℝ^D be its D-dimensional embedding, and for each item v ∈ V, let v ∈ ℝ^D be its D-dimensional embedding; thus score(u, v) ≡ f : ℝ^D × ℝ^D → ℝ. In this approach, memes and annotators are transformed into the same embedding space using CLIP.
Specifically, user demographics such as age, gender, and ethnicity are encoded as a phrase, e.g., "A female aged 18-22, of Hispanic or Latino ethnicity, with a high school degree or equivalent, and located in Mexico", which is mapped to one CLIP embedding. The meme, which may include both image and text components, is encoded into another CLIP embedding. These embeddings capture the nuanced features of both the user and the meme content, and we concatenate them into a single vector representing the combined features of the user and the meme.

As an illustration, Table 3 shows a complete utility matrix for Task 4 with known score entries f(u, v), where 0 represents the label "NO" and 1 represents the label "YES". Our encoding ignores "UNKNOWN" labels, although other encodings are possible. In this case, the voting policy selects the class annotated by more than 3 subjects.

Table 3: An example of a utility matrix for Task 4.

         V1    V2    V3
U1        1     0     1
U2        0     1     1
U3        1     0     0
U4        1     1     1
U5        0     0     0
U6        1     0     0
Voting    1     0     Undefined
Label    YES   NO

For Task 5, the score function is encoded similarly to Task 4, with the addition of a voting policy and a method to map scores to hard labels. The voting policy is the arithmetic mean of the votes (with labels encoded as ordinal scores, No = 0, Direct = 1, Judgemental = 2), so the hard label is decoded from the mean score as follows:

score ∈ [0, 0.67]    ⟹ No
score ∈ (0.67, 1.34] ⟹ Direct
score ∈ (1.34, 2]    ⟹ Judgemental

We apply a softmax over the votes to obtain probabilities, thus solving the soft-soft task. For Task 6, the different label combinations are encoded into a compact bit set: each label_i is assigned the bit 2^i, with i ≥ 0, and the union of bits encodes a combination of labels.
We provide an example below:

score(u, v) = 0b000001 ⟹ -
score(u, v) = 0b000010 ⟹ IDEOLOGICAL-INEQUALITY
score(u, v) = 0b000100 ⟹ MISOGYNY-NON-SEXUAL-VIOLENCE
score(u, v) = 0b000010 | 0b000001 = 0b000011 ⟹ -, IDEOLOGICAL-INEQUALITY

Similarly to Task 5, we count the number of common bits and apply a softmax to obtain the probability distribution.

We have defined how to decode score to solve the tasks, but how can we learn score from the annotator and meme CLIP embeddings? Embedding-based options include memory-based collaborative filtering (CF), model-based CF, neighborhood methods, Neural Graph Collaborative Filtering, Factorization Machines [8], and GCN-based CF. Among these, the Factorization Machine stands out for being efficient and accurate, enabling it to effectively predict the score [9] from the concatenated embedding. Figure 8 shows how well this approach performs on our validation dataset over 10 runs.

6. Outcomes of the Evaluation Phase

Table 4 presents the combined results for the English and Spanish submissions in the sexism detection challenge across the six tasks. Each task involves several runs evaluated with two metrics: Hard-Hard and Soft-Soft. Below, we describe the results, focusing on the best runs for each task.

Figure 8: F1 score of the hard-hard Tasks 4, 5, and 6 employing Collaborative Filtering.

For Task 1 (tweets), the best performance was achieved by run MMICI_3, which ranked 17th in the Hard-Hard metric with an ICM-Hard Norm of 0.7676 and an F1 score of 0.7637. In the Soft-Soft metric, this run ranked 21st with an ICM-Soft Norm of 0.5736, making it the most effective in both metrics for this task. For Task 4 (memes), run MMICI_2 excelled, ranking 8th in the Hard-Hard metric with an ICM-Hard Norm of 0.5515 and an F1 score of 0.7261. For Task 5, the top run was MMICI_1, which ranked 7th in the Hard-Hard metric with an ICM-Hard Norm of 0.3934 and an F1 score of 0.4179.
In the Soft-Soft metric, this run performed even better, ranking 2nd with an ICM-Soft Norm of 0.3654, making it the most effective in both categories. Lastly, for Task 6, the best run was MMICI_1, which ranked 3rd in the Hard-Hard metric with an ICM-Hard Norm of 0.2954 and an F1 score of 0.4342.

Table 4: Results of the submissions on the leaderboard for Spanish and English combined (ALL).

                      Hard-Hard                               Soft-Soft
Task   Run       Rank  ICM-Hard  ICM-Hard Norm   F1      Rank  ICM-Soft   ICM-Soft Norm
Task1  MMICI_1    31    0.4705    0.7365        0.7455    29    -0.3394    0.4456
Task1  MMICI_2    28    0.4780    0.7402        0.7460    30    -0.3622    0.4419
Task1  MMICI_3    17    0.5324    0.7676        0.7637    21     0.4589    0.5736
Task2  MMICI_1    27   -0.0987    0.4679        0.4548    24    -4.5753    0.1314
Task2  MMICI_2    32   -0.2406    0.4218        0.4383    25    -4.6285    0.1271
Task2  MMICI_3    28   -0.1076    0.4650        0.4525    20    -3.6350    0.2071
Task3  MMICI_1    27   -1.4509    0.1631        0.4026    22    -7.9356    0.0809
Task3  MMICI_2    28   -1.5003    0.1516        0.4017    23    -7.9380    0.0808
Task3  MMICI_3    23   -0.8105    0.3118        0.4805    20    -7.6413    0.0965
Task4  MMICI_1    12    0.0751    0.5382        0.7202    17    -0.6189    0.4005
Task4  MMICI_2     8    0.1014    0.5515        0.7261    16    -0.6183    0.4006
Task4  MMICI_3    24   -0.0361    0.4816        0.6781    19    -0.6410    0.3970
Task5  MMICI_1     7   -0.3066    0.3934        0.4179     2    -1.2660    0.3654
Task5  MMICI_2    10   -0.3868    0.3655        0.3770     3    -1.3738    0.3539
Task5  MMICI_3     8   -0.3297    0.3854        0.3814    13    -3.4751    0.1304
Task6  MMICI_1     3   -0.9863    0.2954        0.4342    19   -16.1248    0.0000
Task6  MMICI_2     7   -1.3446    0.2210        0.4453    20   -19.3246    0.0000
Task6  MMICI_3    24   -3.8341    0.0000        0.2347    21   -45.0237    0.0000

The results for the Spanish submissions are shown in Table 5; below, we focus on the most successful runs for each task. For Task 1, the best performance was achieved by run MMICI_3, which ranked 10th in the Hard-Hard metric with an ICM-Hard Norm of 0.7802 and an F1 score of 0.7892. In Task 2, the best run was MMICI_1, ranking 15th in the Hard-Hard metric with an ICM-Hard Norm of 0.5522 and an F1 score of 0.5133.
For Task 3, the top run was MMICI_1, ranking 13th in the Hard-Hard metric with an ICM-Hard Norm of 0.4586 and an F1 score of 0.5486. In Task 4, run MMICI_2 excelled, ranking 14th in the Hard-Hard metric with an ICM-Hard Norm of 0.4900 and an F1 score of 0.6997. For Task 5, the top run was MMICI_1, which ranked 7th in the Hard-Hard metric with an ICM-Hard Norm of 0.3945 and an F1 score of 0.4198. In the Soft-Soft metric, this run performed even better, ranking 1st with an ICM-Soft Norm of 0.3461, making it the best-performing run in both categories. Lastly, in Task 6, the best run was MMICI_1, which ranked 4th in the Hard-Hard metric with an ICM-Hard Norm of 0.2473 and an F1 score of 0.3868.

Table 5: Results of submissions on the leaderboard for Spanish

                          Hard-Hard                                Soft-Soft
Task    Run      Ranking  ICM-Hard  ICM-Hard Norm  F1      Ranking  ICM-Soft  ICM-Soft Norm
Task1   MMICI_1  16        0.5323   0.7662         0.7817  24         0.0894  0.5143
Task1   MMICI_2  22        0.5007   0.7504         0.7705  25         0.0170  0.5027
Task1   MMICI_3  10        0.5603   0.7802         0.7892  15         0.6706  0.6076
Task2   MMICI_1  15        0.1670   0.5522         0.5133  23        -4.1728  0.1658
Task2   MMICI_2  26        0.0064   0.5020         0.4933  24        -4.2127  0.1626
Task2   MMICI_3  29       -0.1146   0.4642         0.4779  20        -3.4962  0.2200
Task3   MMICI_1  13       -0.1853   0.4586         0.5486  24        -7.8261  0.0927
Task3   MMICI_2  14       -0.2269   0.4493         0.5446  25        -7.8356  0.0922
Task3   MMICI_3  22       -0.5870   0.3689         0.5165  22        -7.4291  0.1134
Task4   MMICI_1  17       -0.0591   0.4699         0.6906  14        -0.6655  0.3939
Task4   MMICI_2  14       -0.0196   0.4900         0.6997  15        -0.6689  0.3933
Task4   MMICI_3  26       -0.1848   0.4059         0.6470  18        -0.8361  0.3667
Task5   MMICI_1   7       -0.3028   0.3945         0.4198   1        -1.4813  0.3461
Task5   MMICI_2   9       -0.4077   0.3580         0.3728   2        -1.5486  0.3392
Task5   MMICI_3  10       -0.4875   0.3302         0.3545  13        -4.0400  0.0804
Task6   MMICI_1   4       -1.2346   0.2473         0.3868  18       -14.9495  0.0000
Task6   MMICI_2  13       -1.6925   0.1536         0.4141  20       -18.0902  0.0000
Task6   MMICI_3  24       -3.8686   0.0000         0.2225  21       -42.6540  0.0000

The outcomes for the English submissions are outlined in Table 6.
We elaborate on these results, highlighting the top performance for each task. In Task 4, run MMICI_2 excelled, ranking 3rd in the Hard-Hard metric with an ICM-Hard Norm of 0.6129 and an F1 score of 0.7559. For Task 5, the top run was MMICI_3, which ranked 1st in the Hard-Hard metric with an ICM-Hard Norm of 0.4413 and an F1 score of 0.4094. Lastly, in Task 6, the best run was MMICI_1, which ranked 2nd in the Hard-Hard metric with an ICM-Hard Norm of 0.3419 and an F1 score of 0.4726. The strong results achieved with memes can be attributed to the use of CLIP (Contrastive Language–Image Pre-training) embeddings. CLIP learns visual concepts from natural-language descriptions, aligning images and text within a shared embedding space. This alignment is achieved by training on a vast dataset of images paired with their corresponding textual descriptions, enabling the model to relate visual and textual information seamlessly. In CLIP, a Vision Transformer encodes the images and a Text Transformer encodes the text, resulting in a unified model that excels at multimodal tasks. Both sets of embeddings are projected into a common space where their similarities can be measured and aligned, allowing the model to leverage the strengths of both visual and textual information. This approach enabled the extraction of sexist expressions from the memes in the dataset across both languages. By transferring the representation to the textual domain, it became possible to adopt state-of-the-art techniques for the classification tasks. In summary, the combined analysis of the English and Spanish submissions in the sexism detection challenge illuminates diverse approaches and performances across tasks.
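The CLIP projection-and-alignment mechanism described above can be illustrated with a toy NumPy sketch. Random linear maps stand in for the actual Vision and Text Transformers, and all dimensions are arbitrary assumptions; the sketch shows only the shared-space geometry, not the real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for CLIP's two encoders: in the real model a Vision Transformer
# encodes images and a Text Transformer encodes captions; here random linear
# projections (with assumed, arbitrary dimensions) illustrate the mechanism only.
D_IMG, D_TXT, D_SHARED = 512, 384, 64
W_img = rng.standard_normal((D_IMG, D_SHARED)) / np.sqrt(D_IMG)
W_txt = rng.standard_normal((D_TXT, D_SHARED)) / np.sqrt(D_TXT)

def encode(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project encoder features into the shared space and L2-normalize."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# A batch of 3 "images" and 3 "captions" (random features for illustration).
image_emb = encode(rng.standard_normal((3, D_IMG)), W_img)
text_emb = encode(rng.standard_normal((3, D_TXT)), W_txt)

# Cosine-similarity matrix: entry (i, j) scores image i against caption j.
# CLIP's contrastive training pushes matching pairs toward the diagonal.
similarity = image_emb @ text_emb.T
```

Because both modalities end up as unit vectors in the same space, a single dot product compares an image with a caption, which is what makes the concatenated meme embeddings usable by downstream text classifiers.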
Each language cohort showcased distinct strengths, with notable runs such as MMICI_1 and MMICI_3 consistently demonstrating effectiveness across multiple tasks. These results underscore the complexity of sexism detection and highlight the importance of multilingual evaluation frameworks. Further exploration and refinement of these methodologies promise continued advancements in combating bias and fostering inclusivity in online content.

Table 6: Results of submissions on the leaderboard for English

                          Hard-Hard                                Soft-Soft
Task    Run      Ranking  ICM-Hard  ICM-Hard Norm  F1      Ranking  ICM-Soft  ICM-Soft Norm
Task1   MMICI_1  40        0.3840   0.6960         0.6971  32        -0.8805  0.3586
Task1   MMICI_2  33        0.4402   0.7246         0.7141  31        -0.8349  0.3659
Task1   MMICI_3  25        0.4912   0.7507         0.7315  21         0.1413  0.5227
Task2   MMICI_1  33       -0.4572   0.3418         0.3680  23        -5.0641  0.0861
Task2   MMICI_2  36       -0.5728   0.3018         0.3570  24        -5.1264  0.0810
Task2   MMICI_3  30       -0.1384   0.4521         0.4087  19        -3.8024  0.1892
Task3   MMICI_1  32       -2.8962   0.0000         0.2357  22        -7.9094  0.0666
Task3   MMICI_2  34       -2.9573   0.0000         0.2373  21        -7.9059  0.0668
Task3   MMICI_3  26       -1.1024   0.2298         0.4287  19        -7.7476  0.0755
Task4   MMICI_1   5        0.2094   0.6063         0.7538  20        -0.5779  0.4062
Task4   MMICI_2   3        0.2224   0.6129         0.7559  19        -0.5735  0.4069
Task4   MMICI_3  18        0.1131   0.5574         0.7122  17        -0.4621  0.4250
Task5   MMICI_1   6       -0.3112   0.3920         0.4156   2        -1.1089  0.3790
Task5   MMICI_2   8       -0.3657   0.3731         0.3815   3        -1.2447  0.3642
Task5   MMICI_3   1       -0.1691   0.4413         0.4094  13        -2.9704  0.1760
Task6   MMICI_1   2       -0.7441   0.3419         0.4726  15       -18.3643  0.0000
Task6   MMICI_2   7       -1.0095   0.2855         0.4752  16       -21.6764  0.0000
Task6   MMICI_3  20       -3.8687   0.0000         0.2447  17       -49.2040  0.0000

7. Conclusion

This paper has detailed MMICI’s participation in the EXIST shared task at CLEF 2024, focusing on the detection and categorization of sexism in social media content.
By leveraging various innovative methodologies, including ensemble approaches that incorporate diverse annotator profiles and multimodal embeddings, our models have demonstrated substantial efficacy in identifying and understanding sexism in both tweets and memes. The results of our evaluation phase reveal that our ensemble methods, particularly those combining annotator profiles with text and image embeddings, achieve robust performance across multiple tasks. Specifically, our runs showed competitive results in detecting sexism, discerning the intent behind sexist content, and categorizing different types of sexism. For instance, the ensemble approaches used in Runs 1 and 2 consistently outperformed traditional majority voting, highlighting the value of integrating diverse perspectives in complex subjective tasks such as sexism detection. Our approach emphasizes the importance of considering individual annotator characteristics, such as gender and age, to ensure that our models capture a wide range of viewpoints and avoid silencing minority voices. In most tasks, our baseline strategy performed best; however, for Tasks 2 and 3 in Spanish, our ensembles surpassed the baseline by capturing a broader range of perspectives. This nuanced understanding of sexism, facilitated by advanced machine learning techniques and diverse data representation, is crucial for effectively combating sexist behaviors and discourses online. As future work, there is significant potential in exploring the additional data collected on annotators in the EXIST 2024 dataset, including their ethnicities, study levels, and countries of origin, to enhance the cross-lingual and cross-cultural analysis capabilities of sexism detection systems. Developing models that effectively handle multiple languages and cultural contexts, possibly through cross-lingual transfer learning and the creation of culturally nuanced models, would improve global applicability.
Additionally, further exploration of Transformer-based models and the creation of ensembles can leverage their strengths to improve detection accuracy. Expanding the dataset to include more diverse and underrepresented demographic groups would also contribute to building more robust and generalizable models. This could involve collecting additional annotated data from various social media platforms and cultural contexts. Moreover, improving multimodal techniques by leveraging advanced neural network architectures and incorporating additional features can further enhance model performance in detecting sexism. Overall, our participation in the EXIST task underscores the potential of advanced ensemble methods and multimodal analysis in improving the detection and categorization of sexism in social media. These methods not only enhance the accuracy of automatic tools but also contribute to a deeper understanding of how sexism manifests in various forms, thereby supporting broader efforts to promote gender equity and reduce discrimination in digital spaces.

Acknowledgments

This work has been partially supported by CONAHCYT (The National Council of Humanities, Sciences, and Technologies of Mexico), which promotes scientific and technological development in the country. Additionally, we acknowledge the support provided through the following scholarships: Martha Paola Jimenez-Martinez (scholarship number 828539) and Joan Manuel Raygoza-Romero (scholarship number 806073).

References

[1] Cambridge Dictionary, Sexism, 2024. https://dictionary.cambridge.org/dictionary/english/sexism.
[2] Real Academia Española, Sexismo, 2024. https://dle.rae.es/sexismo.
[3] Comisión Nacional para Prevenir y Erradicar la Violencia Contra las Mujeres, ¿Qué es el lenguaje sexista y por qué es importante visibilizarlo?, 2016. https://www.gob.mx/conavim/articulos/que-es-el-lenguaje-sexista-y-por-que-es-importante-visibilizarlo?idiom=es.
[4] P. Glick, S. T.
Fiske, Ambivalent sexism, in: Advances in Experimental Social Psychology, volume 33, Elsevier, 2001, pp. 115–188.
[5] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[6] J. A. García-Díaz, M. Cánovas-García, R. Colomo-Palacios, R. Valencia-García, Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings, Future Generation Computer Systems 114 (2021) 506–518.
[7] S. Akhtar, V. Basile, V. Patti, Whose opinions matter? Perspective-aware models to identify opinions of hate speech victims in abusive language detection, arXiv preprint arXiv:2106.15896 (2021).
[8] B. Loni, M. Larson, A. Hanjalic, Factorization machines for data with implicit feedback, 2018. URL: https://api.semanticscholar.org/CorpusID:56517380.
[9] S. Rendle, Factorization machines, in: 2010 IEEE International Conference on Data Mining, 2010, pp. 995–1000. doi:10.1109/ICDM.2010.127.