<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sexism Identification in Social Networks with Generation-based Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Le Minh Quan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dang Van Thin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Information Technology-VNUHCM</institution>
          ,
          <addr-line>Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
<p>This paper describes our participation in the EXIST (sEXism Identification in Social neTworks) shared task at CLEF 2024. We participated in three tasks: sexism identification (Task 1), sexism intention detection (Task 2), and sexism categorization (Task 3). We proposed an ensemble architecture using Large Language Models (LLMs) to tackle these tasks. Our approach aimed to emulate the human annotation process and gain more insight into fine-tuning LLMs for classification tasks. Under hard-label evaluation, our best-performing models ranked 2nd on Task 1, 1st on Task 2, and 1st on Task 3, achieving F1 scores of 0.7826, 0.5677, and 0.6004, respectively.</p>
      </abstract>
      <kwd-group>
<kwd>Llama 2</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>LoRA</kwd>
        <kwd>Fine-tuning LLM</kwd>
<kwd>Prompt Engineering</kwd>
        <kwd>Social media</kwd>
        <kwd>CLEF 2024</kwd>
        <kwd>EXIST 2024</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Sexism is prejudice, stereotyping, or discrimination based on one’s sex or gender, typically against women
and girls. This inequality and discrimination against women remains a pervasive issue in modern
society and manifests in online spaces. Sexism affects women in many facets of their lives, including
domestic and parenting roles, career opportunities, body image, and life expectations. Moreover, online
sexism can influence social media users, particularly teenagers, to develop sexist attitudes or to view them
as normal and acceptable. The growth of social media platforms like Twitter and Facebook has led to a
significant rise in online sexism. This makes the need for automatic tools to identify online sexism even
more critical.</p>
      <p>
Identifying sexism on social media is a challenging problem. Sexist messages can be overtly offensive
and hateful, but they can also be subtle, disguised as humor or friendly posts. To address this problem,
EXIST (sEXism Identification in Social neTworks) was established. EXIST 2024 at CLEF 2024 is the
fourth edition of the EXIST challenge, which focuses on combating sexism on social media. EXIST aims
to capture sexism in a broad sense, from explicit misogyny to other subtle expressions that involve
implicit sexist behaviours [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
]. EXIST 2024 provides tasks to detect sexist behaviours and discourses,
identify the intention of the author behind a sexist social media post, and categorize the forms of sexism.
These tasks are divided into tweet classification and meme classification, focusing on textual messages
and image content (particularly memes) on Twitter. Labelling sexist content can also be contentious
due to differences in perspectives among annotators. To account for this, EXIST 2024 adopts
the Learning With Disagreement (LeWiDi) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] paradigm for both the development of the dataset and
the evaluation of the systems. In the LeWiDi paradigm, models are trained with conflicting or diverse
annotations instead of relying on a single "correct" label per sample. This results in two types of output:
hard outputs and soft outputs.
      </p>
<p>In this paper, we employ Large Language Models to address the first three tasks of EXIST 2024. Below
is an overview of the proposed tasks:
• Task 1 is a binary classification task. The objective of this task is to decide whether or not a
given tweet contains sexist expressions or behaviours.
• Task 2 is a hierarchical multi-class classification task. For the tweets that have been predicted as
sexist, the objective of this task is to classify each tweet according to the intention of the person
who wrote it.
• Task 3 is a hierarchical multi-label classification task. For the tweets that have been predicted as
sexist, the objective of this task is to categorize them according to the types of sexism.</p>
<p>This paper presents our approach to the first three tasks of the EXIST 2024 shared task. In Section 4,
we introduce our proposed architecture and the models employed in our experiments. Section 5 details
the experimental setup, including the evaluation metrics and system settings. In Section 6, we
present and discuss the experimental results from both the development and evaluation phases. Finally,
Section 7 presents our conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
The scientific community has established numerous academic events and shared tasks on sexism
and hate speech. HatEval 2019 focused on the detection of hate speech against immigrants and
women on Twitter [
        <xref ref-type="bibr" rid="ref4">4</xref>
]. Homo-Mex 2024 focused on hate speech detection towards the Mexican Spanish-speaking
LGBT+ population [
        <xref ref-type="bibr" rid="ref5">5</xref>
]. EDOS 2023 addressed the limitations of binary sexism
detection and emphasized the need to provide clear explanations for why something is sexist [6].
DETESTS-Dis 2024 at IberLEF 2024 follows the Learning with Disagreement paradigm to detect and classify
racial stereotypes in Spanish social media [7]. These efforts show that sexism and hate speech detection
remains an active and challenging area of research.
      </p>
<p>The previous EXIST shared task (EXIST 2023) attracted many research teams. Participants
achieved good results and provided insightful research. Below are some works that achieved
top results in the EXIST 2023 shared task.</p>
<p>• Kelkar et al. [8] evaluated multiple different approaches, ranging from classic classification
methods like Multinomial Naive Bayes and Linear Support Vector Classifiers to deep learning
methods like Multi-Layer Perceptrons, XGBoost, and LSTMs with attention. They also explored
multiple embedding methods. Their research concluded that training models on both languages
yielded better results compared to training them on separate languages.
• Erbani et al. [9] used three separate BERT models, fine-tuned for Tasks 1, 2, and 3. After fine-tuning,
the models were "frozen" and then connected together, with fully connected layers added at the
beginning. This allows the models to share the "different views" of the data gained from
their individual training and thus achieve better performance than single models.
• Paula et al. [10] employed a combination of two BERT models: Multilingual BERT (mBERT)
and Cross-lingual Language Model RoBERTa (XLM-RoBERTa). Each model generated its own
prediction, and the final label was chosen based on the prediction probability. Their
experiments demonstrated that this ensemble approach significantly improved model performance.
This research highlights the potential of ensemble architectures with even larger base
models for tackling text classification problems.
• Tian et al. [11] introduced a novel cascade architecture using two large language models (LLMs)
based on the GPT architecture: GPT-NeoX and BERTIN-GPT-J-6B. The first model is fine-tuned on
the competition data and handles simpler classifications. The second, larger model is sequentially
fine-tuned on various "hate speech" datasets and then on the competition data, and tackles more
complex samples. A confidence-checker system identifies data points that are challenging for the
smaller model and sends them to the larger one. This architecture reduces computational cost and
increases classification speed while still outperforming conventional ensemble models.
• Vallecillo-Rodríguez et al. [12] explored two methods to improve model performance. The first was
tested on transformer models including mDeBERTa and XLM-RoBERTa, with data augmentation
by repeating each tweet six times, once for each of the six annotators who labeled it.
The second method studied how to integrate annotator information into the training process.
The team applied a multi-modal architecture called "Transformer With Tabular": text
features (tweets) are combined with the encoded annotator metadata to create a single tensor,
which is passed through a classification model to produce the final prediction. Their findings
revealed that while annotator information did not significantly impact performance on Task
1 (likely due to its binary nature), it showed promise for Tasks 2 and 3, which involve more
complex classifications.</p>
<p>The approaches shown above span from traditional methods such as Support Vector Machines (SVMs)
and Multi-Layer Perceptrons (MLPs) to deep learning architectures like Convolutional Neural Networks (CNNs)
and Long Short-Term Memory (LSTM) networks. Transformer-based architectures like BERT and XLM-RoBERTa
are also widely used. Notably, all of these proposed methods rely on text classification. Meanwhile, generative
models are growing significantly in both computer vision (e.g., Diffusion Models) and natural language
processing (e.g., Large Language Models). In this paper, we not only employ a medium-sized language
model, XLM-RoBERTa, for the classification tasks, but also leverage the power of generative language models
such as mT5 and Llama 2.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset overview</title>
      <p>
The dataset is provided by the EXIST shared task and contains Spanish and English tweets from Twitter.
The dataset structure follows the Learning with Disagreement (LeWiDi) paradigm [
        <xref ref-type="bibr" rid="ref3">3</xref>
]: each tweet sample is
annotated by six annotators with diverse socio-demographic characteristics. Consequently, instead of a
single gold label, each sample carries six labels, one per annotator. The information of each
annotator is provided, including gender, age, ethnicity, study level, and country of origin. Table 1 shows
the statistics for the tweet data, combining both English and Spanish text.
      </p>
<p>Table 2 shows the statistics for the labels of each task, using the official gold labels provided by the
shared task. Task 1 categorizes tweets as sexist or not sexist (YES, NO). Task 2 classifies the intention of
the author who wrote the sexist tweet (direct, reported, judgemental). Task 3 categorizes the types of
a sexist tweet, which can have one or more labels (ideological inequality, sexual violence, objectification,
stereotyping dominance, misogyny and non-sexual violence). Task 2 and Task 3 follow a two-level hierarchical
structure, as illustrated in Figure 1.</p>
      <p>[Figure 1: The two-level label hierarchy: sexist / not sexist at the first level; the Task 2 intentions (direct, reported, judgemental) and Task 3 categories at the second level.]</p>
    </sec>
    <sec id="sec-4">
      <title>4. Approach</title>
      <sec id="sec-4-1">
        <title>4.1. Architecture</title>
<p>Base architecture: Our architecture, illustrated in Figure 2, consists of six independent models, referred
to as Component Models. These models share the same model type and training settings. This ensemble
approach emulates the human annotation process, where each model represents a different annotator
with potentially varying biases. Our proposed framework comprises five main components:
1. Preparing the dataset: Each data sample includes metadata for six different annotators. We split the
dataset by annotator, resulting in six separate datasets. Each dataset contains the same tweets but
with metadata specific to the corresponding annotator.
2. Data processing: The sub-samples undergo pre-processing. For LLMs, prompt engineering
is applied.
3. Fine-tuning the Component Models: Each Component Model undergoes a separate fine-tuning
process using one of the six separate datasets. The Component Models must share the same type,
whether XLM-RoBERTa, mT5, or Llama 2, and the same system settings.
4. Post-processing: The prediction outputs from the six Component Models are collected and
post-processed into the required submission format.
5. Final output: The Component Predictions are converted into two types of outputs, soft labels and
hard labels:
• Soft labels: To generate soft labels, we calculate a probability distribution over the possible
classes. In Tasks 1 and 2, the sum of these probabilities must be 1.0. For Task 3, the sum of the
probabilities need not be 1.0 because of the multi-label structure.
• Hard labels: To generate hard labels from the predicted outputs, we follow the
probabilistic thresholds used in the official gold labels. For Task 1 (mono-label), the class annotated
by more than 3 annotators is selected. For Task 2 (mono-label), the class annotated by more
than 2 annotators is selected. Finally, for Task 3 (multi-label), the classes annotated by more
than 1 annotator are selected. In cases where no class meets the threshold, a random class
is chosen.</p>
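The aggregation in step 5 can be sketched as follows. This is our own illustration of the rules above (function and variable names are ours); the thresholds follow the quoted gold-label rule of more than 3, 2, and 1 votes for Tasks 1, 2, and 3.

```python
from collections import Counter
import random

def aggregate(component_preds, classes, threshold, multi_label=False):
    """Turn six Component Predictions into a soft and a hard label.

    `component_preds` holds one prediction per Component Model: a single
    class for the mono-label tasks, or a set of classes for Task 3.
    """
    if multi_label:
        votes = Counter(label for pred in component_preds for label in pred)
    else:
        votes = Counter(component_preds)
    n = len(component_preds)
    # Soft label: per-class vote proportion (sums to 1.0 only for mono-label tasks).
    soft = {c: votes.get(c, 0) / n for c in classes}
    # Hard label: every class whose vote count exceeds the threshold.
    hard = [c for c in classes if votes.get(c, 0) > threshold]
    if not hard:
        hard = [random.choice(classes)]  # no class met the threshold
    return soft, hard if multi_label else hard[0]

# Task 1 example: four of six "annotators" predict YES, threshold is 3.
soft, hard = aggregate(["YES", "YES", "NO", "YES", "YES", "NO"],
                       ["YES", "NO"], threshold=3)
```

Here `soft` is `{"YES": 4/6, "NO": 2/6}` and `hard` is `"YES"`; the random fallback only triggers when no class reaches the threshold.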
<p>Integrating a hierarchical architecture: Task 2 and Task 3 follow a hierarchical structure. To address
these tasks, we modified our architecture so that the Component Models for Tasks 2 and 3 only make
predictions for sub-samples classified as sexist. We rely on the predictions made by the Component Models
of Task 1 to obtain the first-level labels (YES, NO). Our base architecture incorporating this hierarchical
structure is shown in Figure 3.</p>
<p>[Figure: six Component Models produce Component Predictions, which are post-processed into the final output.]</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Overview</title>
<p>In this paper, we leverage two types of language models: large language models (LLMs), namely
Multilingual T5 (mT5) and Llama 2, and a smaller transformer-based language model, XLM-RoBERTa.
XLM-RoBERTa [13]: XLM-RoBERTa (XLM-R) is a multilingual language model based on RoBERTa. It
leverages the Transformer architecture and is trained with a multilingual masked language modeling
objective. XLM-R has been shown to outperform multilingual BERT (mBERT) on various cross-lingual
benchmarks. Our approach using XLM-RoBERTa is described below:
• Pre-processing: For XLM-RoBERTa, we apply multiple pre-processing steps to the tweet
dataset, as follows.</p>
<p>1. Link conversion: We use regex to match URLs in tweets and convert them to "URL"
tokens.
2. Mention conversion: We convert user mentions (@username) in tweets to "USER" tokens.
3. Hashtag conversion: We convert hashtags into separate words (for example,
"#DeepLearning" becomes "Deep Learning").
4. Label conversion: For Tasks 1 and 2, labels are converted to numerical representations. For
the multi-label Task 3, labels are converted to binary representations.
• Integrating metadata: To integrate the annotator metadata into the input, we add a [SEP] token
between the tweet and each metadata field. For example:
– Original data: Tweet: "Lo sentimos, el meme aún está en construcción."; Annotator’s
information: gender: female, age: 23-45, ethnicity: White or Caucasian, education level:
Bachelor’s degree, country: Spain.
– Processed data: "Lo sentimos, el meme aún está en
construcción.[SEP]female[SEP]23-45[SEP]White or Caucasian[SEP]Bachelor’s degree[SEP]Spain"
• Fine-tuning: We fine-tune the pre-trained XLM-RoBERTa using the Trainer API from
Hugging Face’s library. We fine-tune each task separately with different parameter settings.</p>
        <p>• Post-processing: Post-processing involves converting the output labels to the required submission format.
This step transforms the numerical or binary labels output by the models back to the original natural-language
format.</p>
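The pre-processing and metadata-integration steps above can be sketched as follows; the regular expressions and the annotator field names (`gender`, `age`, etc.) are illustrative assumptions, not the exact implementation.

```python
import re

def preprocess_tweet(text: str) -> str:
    """XLM-RoBERTa pre-processing: link, mention, and hashtag conversion."""
    text = re.sub(r"https?://\S+", "URL", text)   # 1. link conversion
    text = re.sub(r"@\w+", "USER", text)          # 2. mention conversion
    # 3. hashtag conversion: "#DeepLearning" -> "Deep Learning"
    return re.sub(r"#(\w+)",
                  lambda m: re.sub(r"(?<=[a-z])(?=[A-Z])", " ", m.group(1)),
                  text)

def build_input(tweet: str, annotator: dict) -> str:
    """Join the processed tweet and the annotator fields with [SEP]."""
    fields = [annotator[k] for k in ("gender", "age", "ethnicity",
                                     "study_level", "country")]
    return "[SEP]".join([preprocess_tweet(tweet)] + fields)
```

For example, `build_input` applied to the tweet and annotator above reproduces the processed string "Lo sentimos, el meme aún está en construcción.[SEP]female[SEP]23-45[SEP]…".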
<p>Multilingual T5 [14]: Multilingual T5 (mT5) is an LLM and a multilingual variant of T5. It was
pre-trained on a new Common Crawl dataset covering 101 languages. T5 uses a standard encoder-decoder
Transformer architecture [15] and is pre-trained with a masked language modeling "span-corruption"
objective, in which entire spans of tokens are selected for corruption at once. T5 excels at multiple tasks
such as question answering, summarization, and translation. Our approach using mT5 is described below:
• Pre-processing: Our experiments show that, in this work, mT5 performs better with tweets
in their original format. This means keeping elements like user mentions, links, and emojis rather
than converting them to specific tokens or removing them. We only convert hashtags to separate
words.
• Prompting: We format the input by adding a task prefix before the tweet and a section indicator
before the annotator’s metadata. For the task prefix, we use short phrases such as "Classify"
or "Multi-label Classify". To indicate the section containing the annotator’s information, we use
phrases like "Context" or "Information". Here are example prompts for each task:
1. Prompt: Classify: Lo sentimos, el meme aún está en construcción Information: male, 46+,</p>
<p>White or Caucasian, Master’s degree Response: YES
2. Prompt: Multiclass Classify: Lo sentimos, el meme aún está en construcción Context: male,
46+, White or Caucasian, Master’s degree Response: JUDGEMENTAL
• Fine-tuning: We leverage the Hugging Face library’s Trainer API to fine-tune the pre-trained
mT5 model for each task separately. Each task uses different parameter settings to optimize
performance.</p>
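A minimal sketch of the prompt construction described above; the helper name, the fixed task-to-prefix mapping, and the use of "Context" as the only section indicator are our assumptions based on the examples.

```python
def build_mt5_prompt(tweet: str, annotator_info: list[str], task: int) -> str:
    """Prepend a task prefix and append the annotator metadata section."""
    prefixes = {1: "Classify", 2: "Multiclass Classify", 3: "Multi-label Classify"}
    return f"{prefixes[task]}: {tweet} Context: {', '.join(annotator_info)}"

# Task 2 example from the text:
prompt = build_mt5_prompt(
    "Lo sentimos, el meme aún está en construcción",
    ["male", "46+", "White or Caucasian", "Master’s degree"],
    task=2,
)
```

The resulting string matches the second example prompt above, minus the expected "Response:" suffix, which is the training target rather than part of the input.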
<p>Example prompt for Task 2:</p>
      <p>[INST]
Imagine you are a person with the following characteristics: female, 23-45 years
old, White or Caucasian ethnicity, Bachelor’s degree, and residing in Spain. Now
classify the sentiment of a tweet: "Lo sentimos, el meme aún está en construcción"
## If the intention of the tweet is to write a message that is sexist by itself, classify as "DIRECT".
## If the intention of the tweet is to report or describe a sexist situation or event suffered
by a woman or women in first or third person, classify as "REPORTED".
## If the intention of the tweet is to be judgemental and the tweet describes sexist situations
or behaviors with the aim of condemning them, classify as "JUDGEMENTAL". Answer only
DIRECT, REPORTED, or JUDGEMENTAL.
Answer: [/INST]</p>
<p>Response: REPORTED</p>
      <p>
LLaMA 2 [16]: Llama 2 is a large language model developed by Meta AI and the successor to their
original Llama model. Llama 2 uses the standard Transformer architecture and applies pre-normalization
using RMSNorm. The model employs the SwiGLU activation function, grouped-query attention (GQA),
and rotary positional embeddings. Both the context length and the pre-training corpus size have been
increased compared to the previous model. Unlike the other models in this paper, Llama 2 was primarily
trained on English data. Llama 2 surpasses its predecessor in various tasks, including reasoning, coding,
and knowledge. Our approach using Llama 2 is described below:
• Pre-processing: Like mT5, Llama 2 performs better with tweets in their original format. Therefore,
we only convert hashtags to separate words.
• Prompting: Our observations show that providing more information in the prompt, such as label
descriptions, leads to better model performance. Additionally, distinct separators
between each part of the prompt clarify the instructions and make them easier for the
model to follow. The special tokens [INST] and [/INST] delimit the instruction segment of the prompt.
The prompts used for each task are shown in Tables 3, 4, and 5.
• Fine-tuning: We fine-tune the pre-trained Llama 2 with the LoRA method from Hugging Face’s
library. Each task is fine-tuned separately, with different parameter settings.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>5.1. Evaluation Metric</title>
<p>Example prompt for Task 3:</p>
      <p>[INST]
Imagine you are a person with the following characteristics: female, 23-45 years
old, White or Caucasian ethnicity, Bachelor’s degree, and residing in Spain. Now
classify the sentiment of a tweet: "Lo sentimos, el meme aún está en construcción"
## If the tweet rejects equality between men and women, classify as "IDEOLOGICAL-INEQUALITY".
## If the tweet implies that men are superior to women, classify as "STEREOTYPING-DOMINANCE".
## If the tweet objectifies women, classify it as "OBJECTIFICATION".
## If the tweet contains sexual suggestions, or harassment of a sexual nature, classify
as "SEXUAL-VIOLENCE".
## If the tweet expresses hatred and violence towards women without being sexual in
nature, classify as "MISOGYNY-NON-SEXUAL-VIOLENCE".
## Answer only these categories: IDEOLOGICAL-INEQUALITY, STEREOTYPING-DOMINANCE,
OBJECTIFICATION, SEXUAL-VIOLENCE, MISOGYNY-NON-SEXUAL-VIOLENCE. You can choose
more than one category if applicable.</p>
<p>Answer: [/INST]
Response: IDEOLOGICAL-INEQUALITY, STEREOTYPING-DOMINANCE</p>
      <p>• Low-Rank Adaptation: To fine-tune Llama 2, we used a Parameter-Efficient Fine-Tuning (PEFT)
method called LoRA. LoRA (Low-Rank Adaptation of Large Language Models) is a popular
technique for fine-tuning large pre-trained models such as LLMs and diffusion models. LoRA
trains some dense layers in a neural network indirectly, by optimizing low-rank decomposition
matrices of those layers’ changes during adaptation while keeping the pre-trained
weights frozen [17]. In short, LoRA reduces the number of trainable parameters, making the
training process faster and less computationally costly, while maintaining strong performance on
downstream tasks.</p>
<p>We evaluate our models’ performance using a combination of metrics for all tasks and both types of
evaluation (soft label and hard label):
– ICM (Information Contrast Measure): ICM is a similarity function that generalizes
Pointwise Mutual Information (PMI) and can be used to evaluate system outputs in classification
problems by computing their similarity to the ground-truth categories [18].
– F1-score: The F1-score is the harmonic mean of precision and recall.
– ICM Soft: A modified ICM metric that accepts both soft system outputs and soft ground-truth
assignments, used for soft label evaluation.
– Cross Entropy: Cross Entropy measures the difference between the true label distribution
and the predicted probability distribution. Cross Entropy is only applicable to Tasks 1 and 2.</p>
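For illustration, simplified stand-alone versions of two of these metrics (binary F1 and Cross Entropy); these are our own sketches, not the official EXIST evaluation scripts.

```python
import math

def f1_score(y_true, y_pred, positive="YES"):
    """Binary F1: harmonic mean of precision and recall (Task 1 style)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """Cross entropy between the true label distribution and predicted
    probabilities; `eps` guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(true_dist, pred_dist))
```

For instance, with one true positive, one false positive, and one false negative, precision and recall are both 0.5, giving an F1 of 0.5.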
      </sec>
      <sec id="sec-5-2">
        <title>5.2. System Setting</title>
<p>We use the PyTorch framework and Hugging Face’s Transformers library [19] for our system. We
conducted fine-tuning on various language models and their variants for each task:
• Llama 2
– Model Variants: We use the Llama 2 7B Chat model. This is the fastest and lightest model in
the Llama 2 family and has been fine-tuned specifically for dialogue.
– Training setting: We used a learning rate of 2e-4 and a batch size of 4. We used the
Parameter-Efficient Fine-Tuning (PEFT) method LoRA and the AdamW optimizer [20]. We
fine-tuned Task 1 for only 1 epoch; Tasks 2 and 3 were fine-tuned for 5 epochs.
– LoRA setting: For causal language modeling, we configured LoRA with an attention
dimension "r" of 8, an alpha parameter "lora_alpha" of 16, and a dropout probability
"lora_dropout" of 0.05. For target modules, all trainable modules of Llama 2 were included:
gate_proj, up_proj, down_proj, q_proj, k_proj, v_proj, and o_proj. With LoRA, the required
training time and resources are significantly reduced: the number of trainable parameters drops
from approximately 7 billion to approximately 20 million.</p>
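The reduction in trainable parameters follows from LoRA's low-rank factorization: for a frozen weight of shape (d_out, d_in), only the factors A of shape (r, d_in) and B of shape (d_out, r) are trained. A back-of-the-envelope sketch with illustrative dimensions (not the exact per-module shapes of Llama 2 7B):

```python
def lora_param_counts(d_in: int, d_out: int, r: int):
    """Trainable parameters of a full dense layer vs. its LoRA adapter.

    LoRA freezes the d_out x d_in weight W and learns the update
    Delta_W = B @ A, so only r * (d_in + d_out) parameters are trained
    per adapted layer."""
    return d_in * d_out, r * (d_in + d_out)

# One hypothetical 4096x4096 projection with r=8:
full, lora = lora_param_counts(4096, 4096, 8)
ratio = full // lora  # ~256x fewer trainable parameters for this layer
```

Summed over all adapted projection modules, this is what brings the trainable parameter count down to the tens of millions.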
<p>– Processing unit: A100 80G GPU
• XLM-RoBERTa
– Model Variants: We conduct experiments on two XLM-RoBERTa model sizes: Base and Large.</p>
        <p>– Training setting: We used a learning rate of 2e-5 and a batch size of 8. For the optimizer,
we used AdamW. For regularization, we applied weight decay (L2 regularization) of 0.01. We
set the warm-up steps to 100, allowing the model to adjust slowly to the data and mitigating
the risk of divergence in the initial stages. We fine-tuned the model for 10 epochs for Task
1, and 20 epochs for Task 2 and Task 3.</p>
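The warm-up behaviour described above can be sketched as follows; we assume a simple linear ramp to the base learning rate, since the exact scheduler shape after warm-up is not specified here.

```python
def lr_with_warmup(step: int, base_lr: float = 2e-5, warmup_steps: int = 100) -> float:
    """Learning rate with linear warm-up: ramps from 0 to base_lr over the
    first `warmup_steps` optimizer steps, then holds at base_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```

At step 0 the rate is 0, at step 50 it is half the base rate, and from step 100 onward it stays at 2e-5, which is what lets the model adjust slowly in the initial stages.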
        <p>– Processing unit: A100 80G GPU for XLM-R Large, P100 16G GPU for XLM-R Base
• Multilingual T5
– Model Variants: We conduct experiment on three mT5 model sizes: Small, Base and Large
– Training setting: We used a learning rate of 3e-4, a batch size of 16, and the AdamW
optimizer for fine-tuning. We fine-tuned all three Tasks for 15 epochs.
– Processing unit: A100 80G GPU for mT5 Large, P100 16G GPU for mT5 Base and mT5</p>
        <p>Small.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Result and Discussion</title>
      <sec id="sec-6-1">
        <title>6.1. Development Phase</title>
<p>This section presents the models’ performance on the development dataset. We compare the performance
of different models and variants. Table 6 details our results on hard labels, while Table 7 details our results
on soft labels. To ensure a fair comparison of Task 2 and Task 3 results, we used the true labels from the
dataset for the first hierarchical level instead of the predictions from Task 1. Therefore, these results
serve only to compare model performance and should not be taken as the actual end-to-end performance.</p>
<p>On hard label evaluation, Llama 2 achieves the best results for both Task 1 (ICM Norm: 0.8119,
F1-score: 0.8746) and Task 2 (ICM Norm: 0.7996, F1-score: 0.7471). This suggests its strong text
understanding capabilities translate effectively to classification tasks. For Task 3, Llama 2 performed on
par with XLM-R large, with scores differing by only about 1% on both ICM Norm and F1-score. mT5
models consistently underperformed across all three tasks; notably, even mT5 large’s results were
significantly lower than those of the other models. This suggests that mT5 might not be well suited for these tasks,
possibly due to the complexity of the input sequences, which combine tweet and annotator information.</p>
<p>Llama 2 achieved the best results on Task 1 for soft labels. However, XLM-R large outperformed all
models on Tasks 2 and 3. Llama 2’s high performance on hard labels but not on soft labels suggests
it might not be learning the annotator information or, more likely, is ignoring it due to the prompt
construction. XLM-R, on the other hand, does not rely on prompt engineering and likely processes the
entire input, leading to more diverse soft outputs. Notably, all three models achieved low ICM Soft
Norm scores, ranging from 40% to 66%. This indicates that capturing the complexity of the LeWiDi
paradigm distribution remains a challenging problem.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Evaluation Phase</title>
<p>Task 1: As shown in Table 8, under the Hard-hard evaluation method, Llama 2 achieved our best result
(2nd place) with an ICM Norm of 0.7994 and an F1-score of 0.7826 for the positive class. XLM-R large
came in 7th place with an ICM Norm of 0.7898, and mT5 large achieved the lowest score, with an ICM Norm
of 0.6952. Under the Soft-soft evaluation method, XLM-R large achieved our best result, ranking 4th
with an ICM Soft Norm of 0.6490.
Task 2: Table 9 shows our models’ performance under the two evaluation methods. In the Hard-hard
evaluation, our method using Llama 2 achieves 1st rank with an ICM Norm of 0.6320 and an F1-score of 0.5677.
XLM-R large ranks 6th with an ICM Norm of 0.5926, and mT5 large falls short with an ICM Norm of 0.4555.
While Llama 2 excelled in the Hard-hard evaluation, its performance faltered in the Soft-soft evaluation,
where XLM-R large emerged as our leader, ranking 7th with an ICM Soft Norm of 0.3513.
Task 3: Our results are presented in Table 10. In the Hard-hard evaluation, XLM-R large comes in
a close second to Llama 2’s 1st place, with ICM Norms of 0.5822 and 0.5862, respectively. XLM-R
maintains its lead among our models in the Soft-soft evaluation, ranking 7th with an ICM Soft Norm of 0.3143.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
<p>This paper describes our approach utilizing pre-trained Large Language Models (LLMs) for the
classification tasks of the EXIST 2024 shared task at CLEF 2024. We employed ensemble architectures
to generate both soft and hard predictions for all three tasks. Our experiments highlight the strong
text understanding capabilities of Llama 2, allowing it to tackle various classification tasks with high
accuracy under the hard-hard evaluation method. However, the complexities of the LeWiDi paradigm
distribution, which involves understanding diverse cultural and social perspectives, presented a
challenge. In future work, we will delve deeper into prompting techniques, such as Chain-of-Thought
prompting, to encourage LLMs to consider the diversity of human perspectives.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This research was supported by The VNUHCM-University of Information Technology’s Scientific
Research Support Fund.</p>
      <p>[6] H. Kirk, W. Yin, B. Vidgen, P. Röttger, SemEval-2023 task 10: Explainable detection of online sexism,
in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), 2023,
pp. 2193–2210.
[7] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language
Processing Challenges for Spanish and other Iberian Languages, Procesamiento del Lenguaje Natural 73
(2024).
[8] S. Ravi, S. Kelkar, A. K. Madasamy, Lstm-attention architecture for online bilingual sexism detection
(2023).
[9] J. Erbani, E. Egyed-Zsigmond, D. Nurbakova, P.-E. Portier, When multiple perspectives and an
optimization process lead to better performance, an automatic sexism identification on social
media with pretrained transformers in a soft label context, Working Notes of CLEF (2023).
[10] A. F. M. de Paula, G. Rizzi, E. Fersini, D. Spina, Ai-upv at exist 2023–sexism characterization using
large language models under the learning with disagreements regime (2023).
[11] L. Tian, N. Huang, X. Zhang, Efficient multilingual sexism detection via large language models
cascades, Working Notes of CLEF (2023).
[12] M. E. Vallecillo-Rodríguez, F. del Arco, L. A. Ureña-López, M. T. Martín-Valdivia, A. Montejo-Ráez,
Integrating annotator information in transformer fine-tuning for sexism detection, Working Notes
of CLEF (2023).
[13] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association
for Computational Linguistics, 2020.
[14] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A
massively multilingual pre-trained text-to-text transformer, in: Proceedings of the 2021 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, 2021, pp. 483–498.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
Attention is all you need, Advances in neural information processing systems 30 (2017).
[16] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P.
Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv e-prints
(2023) arXiv–2307.
[17] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank
adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[18] E. Amigó, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: Proceedings
of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), 2022, pp. 5809–5819.
[19] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M.
Funtowicz, et al., Huggingface’s transformers: State-of-the-art natural language processing, arXiv
preprint arXiv:1910.03771 (2019).
[20] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101
(2017).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          , Overview of EXIST 2024 -
          <article-title>Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ),
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          , Overview of EXIST 2024 -
          <article-title>Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview)</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Leonardelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Abercrombie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almanea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fornaciari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rieser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Uma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Poesio</surname>
          </string-name>
          , Semeval-2023 task 11:
          <article-title>Learning with disagreements (lewidi)</article-title>
          , in:
          <source>The 61st Annual Meeting Of The Association For Computational Linguistics</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M. R.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <article-title>Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter</article-title>
          , in:
          <source>Proceedings of the 13th international workshop on semantic evaluation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gómez-Adorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bel-Enguix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sierra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-T.</given-names>
            <surname>Andersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ojeda-Trueba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Alcántara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Soto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Calvo</surname>
          </string-name>
          ,
          <article-title>Overview of homo-mex at iberlef 2024: Homo-mex: Hate speech detection in online messages directed towards the mexican spanish speaking lgbtq+ population</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>73</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>