The Wisdom of Weighing: Stacking Ensembles for a More Balanced Sexism Detector

Abhay Shanbhag1,†, Suramya Jadhav1,†, Atharva Date1,†, Sumedh Joshi1,† and Sheetal Sonawane1,†

1 SCTR's Pune Institute of Computer Technology

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
† These authors contributed equally.
abhayshanbhag0110@gmail.com (A. Shanbhag); 2018suramyajadhav@gmail.com (S. Jadhav); atharvad931@gmail.com (A. Date); sumedhjoshi463@gmail.com (S. Joshi); sssonawane@pict.edu (S. Sonawane)

Abstract
Sexism has become increasingly prevalent online as more and more people take to social media with little regard for what they say behind a screen of anonymity. Sexism in the form of derogatory, biased, violent, and presumptuous remarks not only perpetuates gender inequality but also creates a hostile environment for women that calls for immediate attention. EXIST 2024, the fourth edition of the sEXism Identification in Social Networks task at CLEF 2024, aims not only to detect sexism but also to capture its types, from explicit misogyny to subtler expressions involving implicit sexist behaviors. We provided solutions for three tasks: the first was the identification of sexism in both English and Spanish texts, whereas the second and third identified the more subtle categories and aspects of sexism. In this study, we introduce a robust classification system built around a stacking classifier composed of four LLMs, whose output probabilities feed into a LightGBM model to produce a consolidated prediction. Additionally, five supplementary models contribute to the final decision by providing weighted predictions based on their respective accuracies. This ensemble approach, leveraging both stacking and weighted averaging, ensures enhanced accuracy and reliability in classifying text as sexist or non-sexist.

Keywords
Sexism Identification, Source Intention, Sexism Categorization, BERT, RoBERTa, Ensemble Approach

1. Introduction

Sexism on social media is a widespread problem that reinforces prejudices and discrimination based on gender. Social media sites such as Facebook, Instagram, and Twitter function as forums for the discussion and expression of sexist beliefs. The problem encompasses a variety of actions, ranging from blatantly sexist statements to subtly discriminatory small talk that reinforces stereotypical gender norms. Such content spreads easily thanks to social media's anonymity and wide reach, which frequently makes it challenging to monitor successfully. Women are disproportionately the victims of cyberbullying, facing a variety of sexist acts such as body shaming, threats of violence, and disparaging remarks. The fight against online misogyny employs both technological measures, such as better content filtering algorithms, and social interventions, such as victim support groups and public awareness campaigns. To create safer and more inclusive digital spaces, it is imperative to recognize and address sexism on social media.

In this shared task, EXIST 2024, organised by Plaza et al. [1, 2], our work first aims to classify whether a given tweet contains sexist expressions or behaviours (i.e., it is sexist itself, describes a sexist situation, or criticizes a sexist behaviour), assigning one of two categories: YES or NO. The second subtask is a multi-class classification: for the tweets predicted as sexist, it aims to classify each tweet according to the intention of the person who wrote it, assigning one of three categories: DIRECT, REPORTED, or JUDGEMENTAL. The third subtask is a multi-label classification.
For the tweets predicted as sexist, the third task categorizes them according to the type of sexism. This is a multi-label task, so more than one of the following labels may be assigned to each tweet: IDEOLOGICAL-INEQUALITY, STEREOTYPING-DOMINANCE, OBJECTIFICATION, SEXUAL-VIOLENCE, MISOGYNY-NON-SEXUAL-VIOLENCE.

2. Background and Dataset

We participated in Tasks 1, 2, and 3. A total of 3,660 of the tweets are in Spanish, and the rest are in English. The annotation metadata helps in understanding how various demographic factors of the annotators might influence the interpretation and classification of sexist content on social media, thus contributing to the development of more nuanced and effective detection algorithms.

    "100001": {
        "id_EXIST": "100001",
        "lang": "es",
        "tweet": "@TheChiflis Ignora al otro, es un capullo.El problema con este youtuber denuncia el acoso... cuando no afecta a la gente de izquierdas. Por ejemplo,en su video sobre el gamergate presenta como \"normal\" el acoso que reciben Fisher, Anita o Zöey cuando hubo hasta amenazas de bomba.",
        "number_annotators": 6,
        "annotators": ["Annotator_1", "Annotator_2", "Annotator_3", "Annotator_4", "Annotator_5", "Annotator_6"],
        "gender_annotators": ["F", "F", "F", "M", "M", "M"],
        "age_annotators": ["18-22", "23-45", "46+", "46+", "23-45", "18-22"],
        "ethnicities_annotators": ["White or Caucasian", "Hispano or Latino", "White or Caucasian", "White or Caucasian", "White or Caucasian", "Hispano or Latino"],
        "study_levels_annotators": ["Bachelor's degree", "Bachelor's degree", "High school degree or equivalent", "Master's degree", "Master's degree", "High school degree or equivalent"],
        "countries_annotators": ["Italy", "Mexico", "United States", "Spain", "Spain", "Chile"],
        "labels_task1": ["YES", "YES", "NO", "YES", "YES", "YES"],
        "labels_task2": ["REPORTED", "JUDGEMENTAL", "-", "REPORTED", "JUDGEMENTAL", "REPORTED"],
        "labels_task3": [["OBJECTIFICATION"], ["OBJECTIFICATION", "SEXUAL-VIOLENCE"], ["-"], ["STEREOTYPING-DOMINANCE"], ["SEXUAL-VIOLENCE"], ["IDEOLOGICAL-INEQUALITY", "MISOGYNY-NON-SEXUAL-VIOLENCE"]]
    }

Above is a glimpse of the dataset provided by the organisers.
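Each entry carries one label per annotator rather than a single gold label. As a minimal illustrative sketch (the filename EXIST2024_training.json is hypothetical, and the organizers also distribute official hard labels), a majority-vote hard label for Task 1 can be derived from such an entry as follows:

```python
import json
from collections import Counter

def task1_hard_label(entry: dict) -> str:
    """Majority vote over the per-annotator Task 1 labels ('YES'/'NO')."""
    return Counter(entry["labels_task1"]).most_common(1)[0][0]

# Hypothetical filename; the example entry above would be one value in this dict.
with open("EXIST2024_training.json", encoding="utf-8") as f:
    data = json.load(f)

print(task1_hard_label(data["100001"]))  # -> "YES" (5 of 6 annotators voted YES)
```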
3. Related Work

Angel et al. [3] present an innovative approach to the EXIST 2023 dataset, a collection comprising both English and Spanish tweets, with a focus on the informal language typical of social media platforms like Twitter, including emojis and hashtags. Their methodology introduces contrastive learning into the traditional language-model fine-tuning pipeline, marking a departure from conventional approaches. Unlike standard fine-tuning methods aimed at learning an embedding space where similar samples cluster together, their contrastive learning technique adopts a regression setting: labels are determined by the fraction of annotators who agree with a specific classification, thereby accommodating the subjectivity and diversity of opinions inherent in the dataset.

Moreover, they leverage the diverse annotations from different annotators to enhance the prediction process, framing sexism identification as a regression problem that predicts the fraction of annotators labeling a tweet as sexist; a threshold rule then converts these predictions into binary labels (i.e., "YES" or "NO"). Their study includes three submitted variations: FT (Fine-Tuning), standard fine-tuning of the language model; FreezeCL, where contrastive learning precedes fine-tuning, after which the model is frozen and only the classifier head is trained; and UnfreezeCL, similar to FreezeCL but allowing updates to all model parameters during fine-tuning. For contrastive learning, they train for 10 epochs with a learning rate of 5e-5 and a batch size of 32, followed by fine-tuning for up to 20 epochs (matching the 30-epoch setting of FT) with early stopping, a learning rate of 1e-5, and a batch size of 128. They employ the AdamW optimizer and the transformers library, training on an NVIDIA V100 GPU with 32 GB of memory. The model with the lowest root mean square error (RMSE) on the validation set is saved for further evaluation.

In EXIST 2023, the participating teams employed various approaches. Some utilized multilingual models such as XLM-RoBERTa and BERT for the classification tasks, demonstrating the versatility of these models across languages; they incorporated both hard and soft labels, leveraging data augmentation techniques and fine-tuning on task-specific datasets to improve performance. In particular, Ersoy et al. [4] employed a cascade model using the output from Task 1 for Task 3, showcasing an effective transfer learning strategy. Vetagiri et al. [5] focused on training models exclusively on the provided dataset, generating both hard and soft labels for Task 1. Erbani et al. [6] fine-tuned separate BERT models for each task, incorporating manual features and concatenating representations for improved classification. Cordon et al. [7] experimented with ensemble models, including bert-large-uncased and distilbert, to enhance performance in Task 1. Chaudhary and Kumar [8] utilized a Bi-LSTM architecture for hard predictions in Task 1, while de Paula and da Silva [9] employed a variety of transformer models for both the English and Spanish tasks. Hatekar et al. [10] optimized various models from HuggingFace and employed multilingual models with data augmentation techniques. Finally, Buzzell et al. [11] explored approaches including SVM with TF-IDF and CNN models for Task 1, showcasing the methodological diversity among the participating teams.

Rodríguez-Sánchez et al. [12] explored a variety of machine learning methods for sexism detection, comparing logistic regression, SVM, random forest, bi-LSTMs, and mBERT on Spanish tweets. They found the neural models to be slightly better than the non-neural algorithms at detecting sexism in the dataset, although random forest achieved the highest precision; the bi-LSTM models were on par with mBERT in terms of F1, accuracy, precision, and recall. Rizvi and Jamatia [13] participated in the 2022 EXIST shared task (Rodríguez-Sánchez et al. [14]).
They experimented with logistic regression, Naive Bayes, and SVM systems and found that the logistic regression model worked best for both Spanish and English on both tasks, using TF-IDF unigram and bigram representations as features for all three models. While their submission ultimately ranked 17th out of 19 in the competition, with an official F1-score of 70.65% overall, their approach showed promise among the few submissions that did not use pretrained transformer-based models.

Moldovan et al. [15] addressed sexism in Romanian, using logistic regression, SVM, random forests, Ro-BERT, and mBERT to classify Romanian tweets as sexist or non-sexist. They used BOW-based representations, TF-IDF word representations, and sentence representations generated by mBERT and Ro-BERT as features for the non-neural models. The best performance was achieved with a fine-tuned Ro-BERT model; however, the best recall for non-sexist tweets was achieved, by a significant margin, by the random forest classifier using TF-IDF-based word representations.

Related to sexism detection is abusive language detection. Steimel et al. [16] investigated abusive language detection in English and German tweets using topic modeling and a number of neural and non-neural classifiers. They found that SGBoost performed best on the English data, while SVMs performed best on the German data, and that different sampling methods for addressing class imbalance led to drastically different outcomes on the two data sets. Their work provides evidence that the best classifier and techniques for one language cannot be assumed to perform well for another, even if the data sets share similarities; it is therefore important to experiment with a variety of methods when handling multilingual data.

4. System Description

4.1. Text Preprocessing

A range of datasets, including the EDOS dataset of Kirk et al. [17], the misogyny dataset of Guest et al. [18], the MeTwo dataset of Rodríguez-Sánchez et al. [12], and the training data supplied by the organizers, were used to fine-tune our models. While some of the models in our work are designed for Spanish text, others are limited to English. Because our approach makes use of language-specific models, we separately translated all of the data into English and into Spanish. Siino et al. [19] highlight the importance of preprocessing, showing that with a proper preprocessing strategy, simple models can outperform transformers in text classification tasks. The preprocessing methods employed in our work were lowercasing, eliminating mentions (like @username), eliminating hashtags, eliminating links, eliminating numerals, eliminating punctuation, and eliminating non-ASCII symbols. Hickman et al. [20] provide empirically grounded recommendations for preprocessing decisions in text mining, considering the type of text mining, the research question, and dataset characteristics to enhance the validity and transparency of insights derived from natural language text data. Further preprocessing steps such as stemming, lemmatization, and stopword removal are redundant for the contextualized transformer models used here, and applying them might negatively impact their performance. Following this preprocessing, two datasets were produced, one with all English text and the other with all Spanish text, each labeled as sexist or non-sexist to allow for fine-tuning.
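As a minimal sketch of this pipeline (the helper name clean_tweet is ours; the exact scripts used are not published), the listed steps can be expressed as:

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the preprocessing steps listed above to a single tweet."""
    text = text.lower()                                   # lowercasing
    text = re.sub(r"@\w+", " ", text)                     # remove mentions (@username)
    text = re.sub(r"#\w+", " ", text)                     # remove hashtags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove links
    text = re.sub(r"\d+", " ", text)                      # remove numerals
    text = re.sub(r"[^\w\s]", " ", text)                  # remove punctuation
    text = text.encode("ascii", "ignore").decode()        # remove non-ASCII symbols
    return re.sub(r"\s+", " ", text).strip()              # collapse leftover whitespace

print(clean_tweet("@user100 Check this out: https://t.co/abc #sexism!!"))
# -> "check this out"
```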
4.2. Task 1: Identification of Sexism

4.2.1. Approach-1: Fine-tuning an English Model

Identifying subtle, context-specific instances of discriminatory language is the initial stage of sexism detection. Sanh et al. [21] proposed DistilBERT, an excellent choice for the sexism detection task since it exhibits a high degree of proficiency in recognizing subtle verbal patterns. DistilBERT is a condensed version of BERT that retains 97% of BERT's language comprehension while being 40% smaller and 60% faster. Through knowledge distillation, the smaller model (DistilBERT) learns to replicate the behavior of the larger model (BERT) by focusing on its essential components. DistilBERT's compact and efficient architecture is especially helpful for tasks like sexism detection, where language must be parsed subtly to identify discriminatory and biased remarks, and its transformer design effectively captures context and dependencies in text, enabling it to distinguish between benign and harmful wording.

To create an English dataset for our study, we preprocessed and integrated the existing datasets as described in Section 4.1. We then used this English dataset to fine-tune the DistilBERT model with the hyperparameters shown in Table 1. As a result, the model interpreted the particular subtleties and variances associated with sexism better than generic models.

Table 1: Parameters for fine-tuning DistilBERT

Parameter                      Value
learning_rate                  2e-5
train_batch_size               16
eval_batch_size                16
seed                           42
optimizer (Adam with betas)    0.9, 0.999
epsilon                        1e-08
weight_decay                   0.01
num_train_epochs               5

4.2.2. Approach-2: Fine-tuning a Spanish Model

Liu et al. [22] indicate that RoBERTa is a powerful option for sexism detection because of its exceptional comprehension and analysis of delicate and complicated content. RoBERTa, a robustly optimized BERT variant, enhances the transformer design with a strong encoder mechanism that effectively extracts contextual information from text. Its extensive pretraining on a large corpus makes it highly adept at identifying and evaluating minute biases and differentiating linguistic patterns, and its use of dynamic masking and larger mini-batches during training improves the sensitivity and accuracy of its sexist-content detection and classification. Because RoBERTa can represent complicated syntactic and semantic relationships and handle a wide range of linguistic terminology, it is a useful tool for detecting sexism.

We opted to use RoBERTuito, which is based on RoBERTa (Liu et al. [22]). The model is accessible through HuggingFace and is fine-tuned on the EXIST 2021 dataset. We further optimized the RoBERTuito model with the hyperparameters indicated in Table 2 on all of the Spanish data obtained after preprocessing and translating the external datasets described in Section 4.1 into Spanish. As a result, the model outperformed generic models in comprehending the dataset's particular complexities and variations, which helped us handle sexism expressed in Spanish and increased detection accuracy.

Table 2: Parameters for fine-tuning RoBERTuito

Parameter                      Value
learning_rate                  1e-5
train_batch_size               32
eval_batch_size                32
seed                           123
optimizer (Adam with betas)    0.9, 0.999
epsilon                        1e-08
weight_decay                   0.001
num_train_epochs               10
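Both fine-tuning runs can be set up with the Hugging Face transformers Trainer. The following is a minimal sketch, not our exact training script, using the DistilBERT settings from Table 1 (the RoBERTuito run would swap in the Spanish checkpoint and the Table 2 values); the two-example dataset is a placeholder for the merged corpus of Section 4.1:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # 0 = non-sexist, 1 = sexist

# Tiny placeholder dataset; the real run uses the merged corpus of Section 4.1.
ds = Dataset.from_dict({
    "text": ["women belong in the kitchen", "great match yesterday"],
    "label": [1, 0],
}).map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length",
                           max_length=64), batched=True)

args = TrainingArguments(
    output_dir="distilbert-sexism",
    learning_rate=2e-5,                 # Table 1 values below
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8,
    seed=42,
)
Trainer(model=model, args=args, train_dataset=ds, eval_dataset=ds).train()
```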
4.2.3. Proposed Approach: Ensembling Multiple Models

In our proposed methodology, we fine-tuned three different models, cardiffnlp/twitter-roberta-base-hate, distilbert/distilbert-base-uncased, and somosnlp-hackathon-2022/twitter-sexismo-finetuned-robertuito-exist2021, and also directly used two pretrained models from Hugging Face, annahaz/xlm-roberta-base-misogyny-sexism-indomain-mix-bal and somosnlp-hackathon-2022/twitter-sexismo-finetuned-exist2021-metwo, for sexism identification by ensembling the classifiers. The idea of creating a voting ensemble from neural classifiers has been explored by Siino et al. [23]. We describe each model below.

We fine-tuned the pretrained twitter-roberta-base-hate model by Barbieri et al. [24], originally trained for hate speech detection, on our dataset using the parameters given in Table 3. Since this model was initially trained for hate speech detection, it had a better tendency to correctly classify sexist tweets that were aggressive in nature; our fine-tuned version is published as thranduil2/results. Next, we fine-tuned the distilbert-base-uncased model on the existing dataset; DistilBERT was chosen for its high reliability and lightweight architecture for text classification, and the resulting model is thranduil2/sexismDistilbert. Finally, for handling Spanish instances, we used somosnlp-hackathon-2022/twitter-sexismo-finetuned-robertuito-exist2021, which was already fine-tuned on sexism-related tasks, obtaining Suramya/NewSpanishFinetunedtrainepoch10.

Table 3: Training parameters

Parameter                      Value
Learning Rate                  2e-5
Number of Training Epochs      10
Weight Decay                   0.1
Per Device Train Batch Size    16

Croce et al. [25] employed a text vectorization layer to create Bag-of-Words sequences, which were then utilized by three distinct text classifiers (Decision Tree, Convolutional Neural Network, and Naive Bayes), culminating in an SVM as the final classifier. Kang et al. [26] proposed an ensemble of text-based hidden Markov models using boosting and clusters of words produced by latent semantic analysis. We created an ensemble of the four models with the highest complementary error correction and trained a stacking classifier on their output scores. For the stacking classifier, we used LightGBM, a lightweight and powerful gradient boosting model, with the best parameters obtained by hyperparameter tuning, as listed in Table 4. For this hyperparameter tuning we used the Optuna framework to determine the correct set of hyperparameters.

Table 4: Best hyperparameters for LightGBM

Parameter            Value
lambda_L1            6.191822954187258e-8
lambda_L2            0.3564434
Number of Leaves     253
Feature Fraction     0.9053014
Bagging Fraction     0.7948063
Bagging Frequency    1
Min Child Samples    5

Figure 1: Architecture diagram. The models "Distilbert finetuned" and "Roberta finetuned" are the fine-tuned models thranduil2/sexismDistilbert and thranduil2/results, respectively; Annahaz and SomosNLP are the pretrained Hugging Face models listed above, and the "Spanish" model is the fine-tuned Suramya/NewSpanishFinetunedtrainepoch10. The numbers on the arrows represent the weights of the corresponding votes.

To illustrate the functionality of our proposed architecture, we present the following operational workflow. A text sample is input into the system, where it is analyzed by a stacking classifier comprising four distinct models, each of which independently classifies the text as either sexist (1) or non-sexist (0). The output probabilities from these four models serve as input features for the LightGBM (LGBM) classification model, which synthesizes these inputs into a single prediction of 1 (sexist) or 0 (non-sexist). In addition to the stacking classifier, the architecture integrates five supplementary models, each of which also classifies the same text as 1 or 0. To ensure a balanced and accurate final prediction, each of the six models, including the LGBM stacking classifier, is assigned a weight based on its accuracy. The predictions (0 or 1) from each model are multiplied by their respective weights, and the resulting weighted outputs are averaged. The decision rule applied to the averaged output is straightforward: if the mean value exceeds 0.5, the text is classified as sexist (1); otherwise, it is classified as non-sexist (0). This methodology ensures that the final classification leverages the strengths of multiple models, enhancing the overall accuracy and reliability of the system's predictions. Concretely, we ensembled the stacking classifier itself with our best-performing fine-tuned DistilBERT model, the fine-tuned Roberta-Twitter-Hate model, and the fine-tuned Spanish RoBERTuito model, and passed the six models to a voting classifier, taking a weighted average of their predictions as the final output.
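The following condensed sketch illustrates the two stages under stated assumptions: the base-model probabilities and accuracy weights are synthetic placeholders, the Table 4 values are mapped onto scikit-learn-style LightGBM parameter names, and we read the multiply-then-average rule as a normalized weighted mean:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
# Placeholder stand-ins for the sexist-class probabilities emitted by the four
# base models (real values would come from the fine-tuned transformers).
base_probs_train = rng.random((1000, 4))
y_train = (base_probs_train.mean(axis=1) > 0.5).astype(int)

# Stacking stage: LightGBM over the base-model probabilities (Table 4 values).
stacker = lgb.LGBMClassifier(
    reg_alpha=6.191822954187258e-8,   # lambda_L1
    reg_lambda=0.3564434,             # lambda_L2
    num_leaves=253,
    colsample_bytree=0.9053014,       # feature fraction
    subsample=0.7948063,              # bagging fraction
    subsample_freq=1,                 # bagging frequency
    min_child_samples=5,
)
stacker.fit(base_probs_train, y_train)

# Voting stage: six 0/1 votes (five supplementary models plus the stacker),
# weighted by each model's validation accuracy. np.average divides by the
# weight sum, so the weighted mean of 0/1 votes stays in [0, 1].
def weighted_vote(votes, weights) -> int:
    return int(np.average(votes, weights=weights) > 0.5)  # > 0.5 -> sexist (1)

sample = rng.random((1, 4))                        # base-model probs for one text
votes = [1, 0, 1, 1, 0, int(stacker.predict(sample)[0])]
accuracy_weights = [0.78, 0.74, 0.81, 0.79, 0.72, 0.85]  # illustrative values
print(weighted_vote(votes, accuracy_weights))
```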
4.3. Task 2: Source Intention of Sexism

BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. [27], is a state-of-the-art natural language processing (NLP) model known for its exceptional performance across various language understanding tasks. Its key strength lies in capturing bidirectional contextual information, enabling a nuanced understanding of word meaning by considering both preceding and succeeding context. Pre-trained on extensive text corpora with self-supervised objectives, namely masked language modeling and next sentence prediction, BERT learns rich semantic representations of language. This pre-training allows fine-tuning on specific tasks, making BERT highly adaptable and effective for multi-class classification. Its ability to comprehend complex relationships within text, coupled with its contextual understanding, makes it an ideal choice for tasks requiring nuanced classification of textual data. The BERT model was fine-tuned on the existing dataset for Task 2, with the parameters given in Table 5.

Table 5: Task 2 parameters

Parameter                      Value
learning_rate                  5e-05
train_batch_size               16
eval_batch_size                32
seed                           42
optimizer (Adam with betas)    0.9, 0.999
epsilon                        1e-08
lr_scheduler_type              linear
num_epochs                     5

4.4. Task 3: Categorisation of Sexism in Tweets

For Task 3, the BERT model of Devlin et al. [27] was fine-tuned, using the BERT tokenizer, on the EXIST Task 3 data with the parameters listed in Table 6.

Table 6: Task 3 parameters

Parameter                      Value
learning_rate                  5e-05
train_batch_size               16
eval_batch_size                16
seed                           42
optimizer (Adam with betas)    0.9, 0.999
epsilon                        1e-08
lr_scheduler_type              linear
num_epochs                     7

All of the data for this task was first preprocessed and translated to English using the Google Cloud API.
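Since Task 3 is multi-label, the fine-tuning setup differs mainly in the loss. Below is a minimal sketch under our assumptions (bert-base-uncased as the checkpoint and a one-example placeholder dataset with multi-hot float labels; the exact training script is not published):

```python
from datasets import Dataset, Sequence, Value
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

LABELS = ["IDEOLOGICAL-INEQUALITY", "STEREOTYPING-DOMINANCE", "OBJECTIFICATION",
          "SEXUAL-VIOLENCE", "MISOGYNY-NON-SEXUAL-VIOLENCE"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# problem_type switches the loss to BCEWithLogitsLoss for multi-hot targets.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS),
    problem_type="multi_label_classification")

# Tiny placeholder dataset; labels are multi-hot float vectors over LABELS.
ds = Dataset.from_dict({
    "text": ["an objectifying and violent remark"],
    "labels": [[0.0, 0.0, 1.0, 1.0, 0.0]],   # OBJECTIFICATION + SEXUAL-VIOLENCE
}).map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length",
                           max_length=64), batched=True)
ds = ds.cast_column("labels", Sequence(Value("float32")))  # BCE needs float32

args = TrainingArguments(
    output_dir="bert-task3",
    learning_rate=5e-5,                 # Table 6 values
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    lr_scheduler_type="linear",
    adam_epsilon=1e-8, seed=42,
)
Trainer(model=model, args=args, train_dataset=ds).train()
```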
5. Metrics Used

ICM is a similarity function that generalizes point-wise mutual information (PMI) and can be used to evaluate system outputs in classification problems by computing their similarity to the ground-truth categories. The general definition of ICM is:

ICM(A, B) = α1 IC(A) + α2 IC(B) − β IC(A ∪ B)    (1)

where IC(A) is the information content of the item represented by the set of features A. ICM maps into PMI when all parameters take the value 1. In Amigó and Delgado, the general ICM definition is applied to cases where categories have a hierarchical structure and items may belong to more than one category; the resulting evaluation metric is proven to be analytically superior to the alternatives in the state of the art. The definition of ICM in this context is:

ICM(s(d), g(d)) = 2 I(s(d)) + 2 I(g(d)) − 3 I(s(d) ∪ g(d))    (2)

where I(·) stands for information content, s(d) is the set of categories assigned to document d by system s, and g(d) is the set of categories assigned to document d in the gold standard. The information content of a set of category assignments is estimated from the gold standard as:

I({⟨c, v⟩}) = −log2(P({d ∈ D : g_c(d) ≥ v}))    (3)
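As a toy illustration of Eq. (2) for the flat multi-label case (assuming label independence, so the information content of a label set is the sum of per-label contents; the official evaluation uses the organizers' implementation instead):

```python
import math
from collections import Counter

def icm(sys_labels: set, gold_labels: set, gold_corpus: list) -> float:
    """Toy flat ICM following Eq. (2); label independence is assumed, so the
    information content of a label set is the sum of per-label contents."""
    n = len(gold_corpus)
    freq = Counter(c for labels in gold_corpus for c in labels)

    def ic(labels: set) -> float:
        # -log2 of each label's empirical probability in the gold standard;
        # every scored label is assumed to occur at least once in gold_corpus.
        return sum(-math.log2(freq[c] / n) for c in labels)

    return 2 * ic(sys_labels) + 2 * ic(gold_labels) - 3 * ic(sys_labels | gold_labels)

gold = [{"OBJECTIFICATION"}, {"SEXUAL-VIOLENCE"},
        {"OBJECTIFICATION", "SEXUAL-VIOLENCE"}, set()]
print(icm({"OBJECTIFICATION"}, {"OBJECTIFICATION"}, gold))   # exact match: +1.0
print(icm({"SEXUAL-VIOLENCE"}, {"OBJECTIFICATION"}, gold))   # disjoint: -2.0
```

Correct assignments score positive and disjoint ones negative, which matches the sign behavior of the ICM-Hard values reported below.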
6. Results

In our study, we employed a series of approaches aimed at enhancing the accuracy of our model. The different approaches and their rankings, with the corresponding scores, can be seen in Table 7. In this table, 1a represents the model explained in Section 4.2.1, 1b the model presented in Section 4.2.2, and 1c the proposed model using the ensembling approach explained in Section 4.2.3 for Task 1. 2a and 3a are the models obtained by fine-tuning BERT, whose details are given in Sections 4.3 and 4.4, respectively. EN and ES denote the English and Spanish languages, indicating our models' performance on the corresponding texts in the test dataset.

Table 7: Results of our team, maven

Lang  Task    Teamwise Rank  Approach  ICM-Hard  ICM-Hard Norm  F1 Score
EN    Task 1  12             1a         0.3395   0.6732         0.6747
EN    Task 1  12             1b         0.1926   0.5983         0.6512
EN    Task 1  12             1c         0.5107   0.7606         0.7359
EN    Task 2  15             2a        -0.1333   0.4539         0.4123
EN    Task 3  8              3a        -0.3093   0.4242         0.4245
ES    Task 1  14             1a         0.4129   0.7065         0.7394
ES    Task 1  14             1b         0.4129   0.7065         0.7394
ES    Task 1  14             1c         0.4857   0.7429         0.7784
ES    Task 2  14             2a        -0.0033   0.499          0.4859
ES    Task 3  6              3a        -0.2401   0.4464         0.4662

7. Limitations and Future Work

7.1. Limitations

Pre-trained transformer-based models often struggle with sarcasm detection, leading to misclassification of such instances; specialized handling of sarcastic comments is necessary to improve model accuracy. In Tasks 2 and 3, where the data available for discerning source intention is limited, the model tends to exhibit lower accuracy; increasing the volume of fine-tuning data would enable the model to grasp finer nuances and glean hidden insights more effectively. Additionally, the model's performance is heavily contingent on the quality of its training data: contextual ambiguity and the presence of code-mixed sentences, where multiple languages are used within the same sentence, diminish the model's efficiency. Ensuring clear context and providing diverse, high-quality data are therefore crucial steps in enhancing overall performance.

7.2. Future Work

In future work, employing Large Language Models (LLMs) for detection could significantly enhance the performance of sarcasm detection systems. Another direction is exploring further ensembling methods, such as the bagging classifiers employed by Chen et al. [28]; these ensembling techniques can also be extended to downstream multi-class and multi-label classification tasks, as demonstrated by Miri et al. [29]. Moreover, advanced aggregation methods, such as incorporating probabilistic means from multiple models and employing dedicated algorithms for assigning weights, hold promise for improving overall accuracy. It is important to select models that complement each other's false positive (FP) and false negative (FN) rates, ensuring comprehensive coverage of cases without bias towards any particular outcome. Additionally, data augmentation techniques, such as introducing noise into the text or scraping data from diverse social media platforms and video transcripts, can substantially enlarge the dataset, enriching the model's training corpus and potentially enhancing its robustness.

8. Conclusion

In this work, we addressed the problem of detecting sexism, categorizing it, and identifying its source intention in tweets with both English and Spanish content. As social media grows ever more ingrained in our daily lives, sexism continues to be a significant social issue drawing increasing attention, and it is imperative to address and minimize misogyny on these platforms. Given this, our study aimed to develop effective multilingual sexism identification systems by fine-tuning models and utilizing ensembles of several models. Quality external datasets proved beneficial, improving model performance. This outcome highlights the value of integrating ensemble-based methods into traditional pipelines to produce more accurate outputs. Oversampling techniques helped prevent model bias toward particular class groups, and a simple aggregation of the ensemble-based and transformer models gave better results than their individual performances.

References

[1] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.

[2] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 – Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview), in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.

[3] J. Angel, S. T. Aroyehun, A. F. Gelbukh, Multilingual sexism identification using contrastive learning, in: Conference and Labs of the Evaluation Forum, 2023. URL: https://api.semanticscholar.org/CorpusID:264441422.

[4] B. I. Ersoy, G. Radler, S. Carpentieri, Classifiers at EXIST 2023: Detecting sexism in Spanish and English tweets with XLM-T, in: Conference and Labs of the Evaluation Forum, 2023. URL: https://api.semanticscholar.org/CorpusID:264441369.

[5] A. Vetagiri, P. K. Adhikary, D. P. Pakray, A. Das, Leveraging GPT-2 for automated classification of online sexist content, 2023.
[6] J. Erbani, E. Egyed-Zsigmond, D. Nurbakova, P.-E. Portier, When multiple perspectives and an optimization process lead to better performance: an automatic sexism identification on social media with pretrained transformers in a soft label context, in: Conference and Labs of the Evaluation Forum, 2023. URL: https://api.semanticscholar.org/CorpusID:264441461.

[7] P. Cordon, J. Mata, V. Pachón, J. L. Domínguez, I2C-UHU at CLEF-2023 EXIST task: Leveraging ensembling language models to detect multilingual sexism in social media, in: Conference and Labs of the Evaluation Forum, 2023. URL: https://api.semanticscholar.org/CorpusID:264441322.

[8] A. Chaudhary, R. Kumar, Sexism identification in social networks, in: Conference and Labs of the Evaluation Forum, 2023. URL: https://api.semanticscholar.org/CorpusID:264441505.

[9] A. F. M. de Paula, R. F. da Silva, Detection and classification of sexism on social media using multiple languages, transformers, and ensemble models, in: IberLEF@SEPLN, 2022. URL: https://api.semanticscholar.org/CorpusID:252015736.

[10] Y. A. Hatekar, M. S. Abdo, S. Khanna, S. Kübler, IUEXIST: Multilingual pre-trained language models for sexism detection on Twitter in EXIST 2023, in: Conference and Labs of the Evaluation Forum, 2023. URL: https://api.semanticscholar.org/CorpusID:264441455.

[11] M. Buzzell, J. Dickinson, N. Singh, S. Kübler, IU-NLP-JeDI: Investigating sexism detection in English and Spanish, in: Conference and Labs of the Evaluation Forum, 2023. URL: https://api.semanticscholar.org/CorpusID:264441789.

[12] F. Rodríguez-Sánchez, J. Carrillo-de-Albornoz, L. Plaza, Automatic classification of sexism in social networks: An empirical study on Twitter data, IEEE Access 8 (2020) 219563–219576. doi:10.1109/ACCESS.2020.3042604.

[13] A. Rizvi, A. Jamatia, NIT-Agartala-NLP-Team at EXIST 2022: Sexism identification in social networks, in: IberLEF@SEPLN, 2022. URL: https://api.semanticscholar.org/CorpusID:252015919.

[14] F. Rodríguez-Sánchez, J. Carrillo-de-Albornoz, L. Plaza, J. Gonzalo, P. Rosso, M. Comet, T. Donoso, Overview of EXIST 2022: Sexism identification in social networks, Procesamiento del Lenguaje Natural 69 (2022) 229–240. URL: https://api.semanticscholar.org/CorpusID:239690039.

[15] A. Moldovan, K. Csuros, A.-M. Bucur, L. Bercuci, Users hate blondes: Detecting sexism in user comments on online Romanian news, 2022, pp. 230–230. doi:10.18653/v1/2022.woah-1.21.

[16] K. Steimel, D. Dakota, Y. E. Chen, S. Kübler, Investigating multilingual abusive language detection: A cautionary tale, in: Recent Advances in Natural Language Processing, 2019. URL: https://api.semanticscholar.org/CorpusID:210063047.

[17] H. R. Kirk, W. Yin, B. Vidgen, P. Röttger, SemEval-2023 Task 10: Explainable detection of online sexism, ArXiv abs/2303.04222 (2023). URL: https://api.semanticscholar.org/CorpusID:257405434.

[18] E. Guest, B. Vidgen, A. Mittos, N. Sastry, G. Tyson, H. Margetts, An expert annotated dataset for the detection of online misogyny, 2021, pp. 1336–1350. doi:10.18653/v1/2021.eacl-main.114.

[19] M. Siino, I. Tinnirello, M. La Cascia, Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on transformers and traditional classifiers, Information Systems 121 (2023) 102342. doi:10.1016/j.is.2023.102342.

[20] L. Hickman, S. Thapa, L. Tay, M. Cao, P. Srinivasan, Text preprocessing for text mining in organizational research: Review and recommendations, Organizational Research Methods 25 (2020) 114–146. URL: https://api.semanticscholar.org/CorpusID:229501282.
[21] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, ArXiv abs/1910.01108 (2019). URL: https://api.semanticscholar.org/CorpusID:203626972.

[22] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, ArXiv abs/1907.11692 (2019). URL: https://api.semanticscholar.org/CorpusID:198953378.

[23] M. Siino, I. Tinnirello, M. La Cascia, T100: A modern classic ensemble to profile irony and stereotype spreaders, in: Conference and Labs of the Evaluation Forum, 2022. URL: https://api.semanticscholar.org/CorpusID:262324480.

[24] F. Barbieri, J. Camacho-Collados, L. Neves, L. Espinosa-Anke, TweetEval: Unified benchmark and comparative evaluation for tweet classification, ArXiv abs/2010.12421 (2020). URL: https://api.semanticscholar.org/CorpusID:225062026.

[25] D. Croce, D. Garlisi, M. Siino, An SVM ensemble approach to detect irony and stereotype spreaders on Twitter, in: Conference and Labs of the Evaluation Forum, 2022. URL: https://api.semanticscholar.org/CorpusID:251471680.

[26] M. Kang, J. Ahn, K. Lee, Opinion mining using ensemble text hidden Markov models for text classification, Expert Systems with Applications 94 (2018) 218–227. URL: https://www.sciencedirect.com/science/article/pii/S0957417417304979. doi:10.1016/j.eswa.2017.07.019.

[27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: North American Chapter of the Association for Computational Linguistics, 2019. URL: https://api.semanticscholar.org/CorpusID:52967399.

[28] H. Chen, Z. Zhang, S. Huang, J. Hu, W. Ni, J. Liu, TextCNN-based ensemble learning model for Japanese text multi-classification, Comput. Electr. Eng. 109 (2023) 108751. URL: https://api.semanticscholar.org/CorpusID:258900728.

[29] M. Miri, M. B. Dowlatshahi, A. Hashemi, M. K. Rafsanjani, B. B. Gupta, W. S. Alhalabi, Ensemble feature selection for multi-label text classification: An intelligent order statistics approach, International Journal of Intelligent Systems 37 (2022) 11319–11341. URL: https://api.semanticscholar.org/CorpusID:252056607.