<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-031-71736-9_2</article-id>
      <title-group>
        <article-title>Tackling Sexism in Multimodal Social Media: Exploring Hybrid Generative-Transformer Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Moiz Ali</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lakshmi Yendapalli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bishoy Tawfik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matt Winzenried</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>139</volume>
      <issue>11</issue>
      <fpage>8748</fpage>
      <lpage>8763</lpage>
      <abstract>
        <p>Sexist content on social media platforms like TikTok poses a serious challenge to online safety and equitable discourse. This paper presents a multimodal framework for automated sexism detection in short-form videos, incorporating audio, visual, and textual signals. We explore the use of transformer-based models including RoBERTa for text, VideoMAE for video, and CNN-MFCC pipelines for audio. Furthermore, we introduce a generative AI-enhanced pipeline using Gemini to produce video summaries and analyses, which are then combined with traditional modalities. Experimental results demonstrate that combining generative outputs with RoBERTa significantly improves classification performance over unimodal baselines. Our findings support the effectiveness of hybrid generative-transformer models in moderating nuanced harmful content in multimodal social media.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Sexism Detection</kwd>
        <kwd>Classification</kwd>
        <kwd>Multimodal</kwd>
        <kwd>Gemini</kwd>
        <kwd>RoBERTa-Large</kwd>
        <kwd>Hybrid Generative Transformer</kwd>
        <kwd>Prompt engineering</kwd>
        <kwd>VideoMAE</kwd>
        <kwd>r3d-18</kwd>
        <kwd>CNN</kwd>
        <kwd>Late-Fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the past decade, social media use has grown from 970 million in 2010 to 5.24 billion users in early
2025 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While social media has the power to connect people across the world, it also fuels the spread
of harmful content. Sexist content in particular perpetuates stereotypes and normalizes discrimination and
violence against women and non-binary groups. In recent years, negative content on social media has
proven harmful not just to consumers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], but also to content moderators [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These issues highlight
the urgent need for automated content moderation, especially on video platforms such as TikTok, where
multimodal content makes manual review impractical.
      </p>
      <p>In this project, we investigate the application of multi-modal deep learning to detect sexism in TikTok
videos. This problem is both socially significant and technically complex. Automated identification of
harmful content, such as sexist behavior, can play a key role in supporting content moderation efforts
at scale. While sexism is one example of harmful online behavior, our project can be extended to other
harmful online behaviors that also need moderation.</p>
      <p>
        From a technical standpoint, the task presents a rich set of challenges: it requires the effective fusion
of text, audio, and video modalities, each of which carries different types of signals and noise. Text
may be ambiguous or use slang and emojis; audio can be noisy or low quality; and visual content often
contains subtle cues such as gestures and expressions. The temporal aspect of videos adds further
complexity. Earlier approaches primarily used CNNs and LSTMs, but recent research has shown promising
performance by transformer-based models such as VideoBERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], vision transformers
such as ViT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], video masked autoencoders such as VideoMAE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and UniViLM [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For text content, widely used representations
include Word2Vec embeddings and transformer models such as RoBERTa [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Research is also being done on early versus late fusion for
combining information from different modalities.
      </p>
      <p>For our paper, we have explored multiple approaches including analyzing text, audio and video
modalities individually, combining them using late fusion methodologies, as well as using generative
artificial intelligence tools such as Gemini to generate descriptions of videos which were then analyzed
using text models. We have explored VideoMAE for video analysis, RoBERTa-Large and DeBERTa for
text analysis, and CNN+MFCC for audio feature analysis. The results of the different experiments are
further described in detail in the rest of the paper.</p>
      <p>Our work was conducted as part of the EXIST 2025 challenge at CLEF 2025, which focuses on detecting
sexism in social media content. This paper presents our methodology and findings for Task 3.1 (Sexism
Identification in Videos) for the English language dataset from our team DS@GT. Additional details
about the EXIST challenge framework, evaluation setup, and data labeling are provided in Section 3.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Hate and Sexism Detection on Social Media</title>
        <p>
          Research in hate speech and sexism detection has been shaped significantly by benchmark tasks and
shared evaluation datasets. Several SemEval tasks have played a central role in advancing the field.
SemEval-2019 Task 6, known as OffensEval, focused on offensive language detection in English tweets
using the OLID dataset [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. This task was later extended to multiple languages and fine-grained
categories in SemEval-2020 Task 12 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. A more recent contribution, SemEval-2023 Task 10 (EDOS),
introduced a large dataset for explainable sexism detection, with fine-grained annotations for both
source intention and types of sexist behavior [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Other important shared tasks include HateEval,
which addressed hate against immigrants and women [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], and the TRAC workshop tasks, which
focused on aggression and trolling detection across multiple languages [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. These challenges have
helped standardize evaluation for abusive content detection, making them cornerstones of the research
landscape.
        </p>
        <p>
          Key datasets have also emerged alongside these tasks. The Waseem and Hovy dataset labeled sexism
and racism in tweets and was among the first large-scale resources in this area [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The Stormfront
dataset provided annotated posts from white supremacist forums, offering insight into more extreme
hate speech [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The Gab Hate Corpus compiled hate speech from the Gab platform, known for its
lack of moderation [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>
          Methodologically, early approaches to hate and sexism detection relied on bag-of-words and n-gram
models. These were eventually outperformed by transformer-based architectures like BERT, RoBERTa,
and multilingual models such as XLM-RoBERTa [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Top systems in shared tasks typically fine-tuned
these models and incorporated linguistic priors or handcrafted features to improve performance. Recent
work has also focused on emotional and contextual cues. For example, emotion-aware embeddings
extracted from audio have been used to improve hate speech detection on speech-based platforms
[18]. In multilingual settings, prompting-based models like T5 and LLaMA have demonstrated strong
performance under zero- and few-shot scenarios [19]. These models reduce the need for extensive
retraining and perform well across English and Spanish datasets.
        </p>
        <p>Studies have also highlighted fairness concerns in model embeddings. It has been shown that
commonly used embeddings like ELMo can carry gender biases, reinforcing the need for debiasing
strategies during training [20]. Overall, the evolution of hate speech and sexism detection reflects a
growing emphasis not only on model accuracy, but also on robustness, explainability, and fairness—all
of which are highly relevant to our own work in multimodal video-based sexism detection.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multimodal Approaches for Video Platforms</title>
        <p>Text-only approaches struggle with the complexity of video and meme content. With the rise of
short-form video platforms like TikTok, there is increasing interest in multimodal approaches that
integrate audio, visual, and textual signals. Arcos and Rosso (2024) were among the first to introduce
a multimodal sexism detection pipeline specifically for TikTok videos [21]. Their model extracted
and fused linguistic, acoustic, visual, and emotional features to detect sexist content, and significantly
outperformed unimodal baselines. Complementing this, De Grazia et al. (2025) introduced the MuSeD
dataset—a manually annotated, Spanish-language corpus comprising approximately 11 hours of video
content from TikTok and BitChute [22]. Their multimodal annotation strategy revealed that visual cues
often play a decisive role in recognizing implicit or indirect sexism that might not be evident in text
alone.</p>
        <p>Beyond TikTok, studies have leveraged cross-domain multimodal data. For instance, Maity et al. (2021)
proposed architectures that jointly learn from speech transcripts and acoustic features, highlighting the
underexplored value of tone and prosody in moderation tasks [23]. Wang et al. (2025) advanced this by
using multimodal meme datasets to transfer learned representations into video-based hate detection
tasks via domain adaptation [24]. These efforts confirm that multimodal fusion—especially involving
pre-trained backbones like CLIP and ViT—can dramatically improve performance in domains where
textual information alone may be insufficient.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Generative AI for Content Moderation</title>
        <p>Generative models have become a cornerstone in enhancing content moderation pipelines, especially
where labeled data is scarce. Wullach et al. (2020) employed GAN-based text generation to create
synthetic hate speech examples, which helped improve recall in low-resource classification settings [?].
More recently, Pendzel et al. (2023) evaluated the effectiveness of large language models such as GPT-3.5
in augmenting training corpora and producing adversarial examples to probe model robustness [?].
These approaches not only increased overall detection accuracy but also mitigated dataset imbalance
and annotation bottlenecks.</p>
        <p>Generative techniques have also been used within the classification pipeline itself. The RoJiNG-CL
system (2024) for the EXIST shared task employed GPT-4 to produce image captions for memes, which
were then fed into a multimodal classifier integrating CLIP and ViT embeddings [25]. Their system
ranked among the top performers, particularly on examples requiring nuanced semantic understanding.
Additionally, the HARE framework (2023) introduced reasoning chains generated by large models
to provide interpretability in hate speech decisions, showing how generative AI can support both
performance and transparency [26]. While promising, these methods also pose risks—especially around
bias amplification and misuse for generating toxic content, as seen in controversial examples like
GPT-4Chan.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Background and Data</title>
      <p>This work is conducted as part of the CLEF 2025 EXIST challenge (sEXism Identification in Social
neTworks), which aims to benchmark automated systems for detecting and characterizing sexist content
across different media platforms. The challenge provides multilingual, multimodal datasets and is
divided into three tasks based on content modality:
• Task 1 – Tweets: Focused on detecting and analyzing sexism in Twitter posts.
• Task 2 – Memes: Centered on image-based content (memes), combining textual and visual
elements.
• Task 3 – TikTok Videos: Focused on short-form video content from TikTok, involving
multimodal data (text, audio, visual).</p>
      <p>Each task consists of the following three subtasks:
• Sexism Identification (Subtask 1.1, 2.1, 3.1): A binary classification task to determine whether
a given datapoint (text, meme, or video) contains or refers to sexist content.
• Source Intention (Subtask 1.2, 2.2, 3.2): Classifies the intention of the author in sexist content,
such as direct expression, judgmental commentary, or reporting (reporting is only used for subtask
1.2).
• Sexism Categorization (Subtask 1.3, 2.3, 3.3): Categorizes sexist content into types, including:
(i) ideological and inequality, (ii) stereotyping and dominance, (iii) objectification, (iv) sexual
violence, and (v) misogyny and non-sexual violence.</p>
      <p>Our work focuses exclusively on Task 3, Subtask 3.1 – Sexism Identification in TikTok Videos, which
is a binary classification problem. The objective is to determine whether a TikTok video contains or
describes sexist expressions or behaviors — either directly, through depiction, or via commentary. We
restrict our analysis to the English-language portion of the dataset, excluding all Spanish-language
content. Each sample is human-annotated using the Learning with Disagreement (LeWiDi) paradigm,
capturing multiple annotator opinions. While LeWiDi supports both hard and soft labels, our study
utilizes only the hard labels derived via majority vote. Our work focuses on the hard-hard evaluation,
for which the official evaluation metric is the ICM score.</p>
      <p>The EXIST TikTok dataset for 2025 comprises over 3,000 videos in English and Spanish collected via
hashtag-based scraping. After removing corrupted files (using FFmpeg to detect playback or encoding
errors) and non-English samples, we obtained a final dataset of 973 videos, of which 446 are labeled
sexist and 527 non-sexist. Labels are determined using majority vote from multiple annotators. The
dataset was split into 60% for training, 20% for validation, and 20% for testing to ensure robust model
evaluation. Each video includes associated metadata such as automatic transcriptions, audio features,
and visual frames, enabling multimodal analysis. Figure 1 shows the distribution of transcript lengths.
Most videos fall below 500 tokens and run under 1 minute, indicating that the TikTok videos were
mostly short; this informed our maximum input length as well as the number of frames to extract
during preprocessing.</p>
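      <p>The FFmpeg-based integrity check described above can be sketched as follows. This is an illustrative reconstruction rather than our exact preprocessing script, and the "videos" directory path is hypothetical.</p>

```python
import subprocess
from pathlib import Path

def ffmpeg_error_cmd(video_path):
    # "-v error" prints only error messages; "-f null -" decodes the whole
    # stream without writing output, surfacing playback/encoding problems.
    return ["ffmpeg", "-v", "error", "-i", str(video_path), "-f", "null", "-"]

def is_corrupted(video_path):
    """Return True if FFmpeg reports any decoding error for the file."""
    result = subprocess.run(ffmpeg_error_cmd(video_path),
                            capture_output=True, text=True)
    return result.returncode != 0 or bool(result.stderr.strip())

# Keep only videos that decode cleanly (directory name is illustrative).
clean = [p for p in Path("videos").glob("*.mp4") if not is_corrupted(p)]
```

      <p>Decoding to the null muxer exercises every frame, so truncated downloads and broken encodings are caught even when the container header looks valid.</p>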
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <sec id="sec-4-1">
        <title>4.0. Overview of Experiments</title>
        <p>In this study, we explored multiple approaches to detect sexism in TikTok videos, evaluating both
unimodal and multimodal strategies. We experimented with:
• RoBERTa-large (Text-only): a strong baseline transformer model trained on video transcripts.
• Gemini + RoBERTa Hybrid: a novel pipeline that generates video descriptions and analyses
via Gemini, then classifies these textual features using RoBERTa-large.
• Video-based Models: including VideoMAE and r3d-18, to capture visual signals.
• Audio-based Models: leveraging CNNs with MFCC features extracted via OpenSMILE.
• Late Fusion Models: combining outputs from text, video, and audio modalities to enhance
prediction.</p>
        <p>While unimodal models and the late fusion model provided valuable baselines, our best results were
achieved with the hybrid Gemini + RoBERTa approach, which integrated generative video summaries
and analyses to capture nuanced contextual signals. This pipeline ultimately formed our top-ranked
submission for the EXIST 2025 challenge. All of our experiments are further detailed in the following
sections. All experiments were conducted on the PACE HPC clusters at Georgia Tech, utilizing a range
of NVIDIA GPUs including A100, H100, RTX 6000, V100, and L40S models.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.1. RoBERTa</title>
        <p>In this approach, we extracted the provided text from the dataset to partially retrain a model. After
some research, we decided to experiment with three models: DeBERTa, RoBERTa, and RoBERTa-large.
DeBERTa required more computational resources to train, as it is a larger model than RoBERTa-large,
but it produced similar results in our case. Additionally, RoBERTa-large outperformed RoBERTa, so we
chose to use RoBERTa-large from Hugging Face for text processing to predict whether the text was
sexist or not. After conducting several experiments, we achieved a text classification accuracy ranging
from 74% to 77% on our testing dataset. We selected RoBERTa-large because its 24 transformer layers
allowed us to capture deeper patterns in the text. It was also pretrained on 160GB of data with dynamic
masking, compared to only 16GB for BERT, making it a strong fit for our task.</p>
        <p>Given the model’s substantial GPU memory requirements, we anticipated that training with large or
even medium batch sizes could pose challenges. To address this, we experimented with different batch
sizes to strike a balance between model performance and generalization, while staying within memory
limitations.</p>
        <p>Another issue we encountered was related to overriding the pretrained weights. Without freezing
any layers of the pretrained model, performance was poor—most of the weights were overwritten
during training, and with a relatively small dataset, the model achieved only around 40% accuracy. To
address this, we conducted several experiments to determine the optimal number of layers to freeze
versus fine-tune. Ultimately, freezing the first 20 out of 24 layers significantly improved performance,
raising the accuracy average to 75% on our testing dataset.</p>
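        <p>As a minimal sketch, the freezing strategy above amounts to disabling gradients on the first 20 encoder layers. The helper below is generic; the commented Hugging Face usage is illustrative and assumes the standard transformers attribute layout for RoBERTa.</p>

```python
def freeze_prefix(layers, n_frozen):
    """Freeze the first `n_frozen` modules in an ordered list of layers
    by disabling gradients on their parameters."""
    for layer in layers[:n_frozen]:
        for param in layer.parameters():
            param.requires_grad = False

# Illustrative Hugging Face usage (downloads the full roberta-large weights):
#   from transformers import RobertaForSequenceClassification
#   model = RobertaForSequenceClassification.from_pretrained(
#       "roberta-large", num_labels=2)
#   freeze_prefix(model.roberta.encoder.layer, 20)  # freeze 20 of 24 layers
```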
      </sec>
      <sec id="sec-4-3">
        <title>4.2. Gemini + RoBERTa</title>
        <p>Our approach was inspired by the prompt design from the previous winning competition solution, the
RoJiNG-CL system (2024) [25]. The outline of this approach is shown in Figure 2. We started with a
similar prompt baseline but extended the prompt by adding instructions to provide an “analysis” of
the video, with the goal of enriching the signal provided to the model. This prompt was then applied
to the train-validation split of the dataset.</p>
        <p>During evaluation, we observed that the model outputs were heavily skewed towards false positives
(FP). To address this, we implemented an iterative loop in which the output of the initial prompt was
recursively fed back into the prompt itself, instructing the model to refine and generate a more balanced
version of the prompt. Interestingly, the new prompt version began skewing predictions toward false
negatives (FN).</p>
        <p>To mitigate the biases introduced by each prompt variant, we experimented with combining outputs
from both prompts. Specifically, we tested various combinations of model-generated descriptions and
analyses produced under prompts skewed toward false positives (FP) (See Appendix Section A.1) and
false negatives (FN)(See Appendix Section A.2). To evaluate model performance and identify the optimal
combination of generated textual features, we split the dataset into an 80/20 train/test split and conducted
5-fold cross-validation with combinations of transcribed text, descriptionFP (a description generated
from an FP-skewed prompt), analysisFP (an analysis generated from an FP-skewed prompt), descriptionFN
(a description generated from an FN-skewed prompt) and analysisFN (an analysis generated from an
FN-skewed prompt). After training, we evaluated each model on the same held-out test set, collecting
predictions to compute F1 scores. The combination of descriptionFP, analysisFP, and analysisFN yielded
the best overall results, with a mean test F1 score of approximately 82% across folds.</p>
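        <p>A minimal sketch of how the generated fields can be enumerated and combined into candidate inputs for RoBERTa-Large. The field names follow the text above; the dictionary-based rows and the choice of "&lt;/s&gt;" (RoBERTa's separator token) as delimiter are illustrative assumptions.</p>

```python
from itertools import combinations

FIELDS = ["text", "descriptionFP", "analysisFP", "descriptionFN", "analysisFN"]

def build_input(row, fields):
    """Concatenate the selected fields into one input string,
    skipping fields missing from this row."""
    return " </s> ".join(row[f] for f in fields if row.get(f))

def candidate_combinations(fields=FIELDS):
    """All non-empty subsets of fields, each scored with 5-fold CV."""
    return [c for r in range(1, len(fields) + 1)
            for c in combinations(fields, r)]
```

        <p>Each subset defines one input representation; in our experiments the subset (descriptionFP, analysisFP, analysisFN) scored best by mean test F1.</p>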
      </sec>
      <sec id="sec-4-4">
        <title>4.3. Other Models</title>
        <p>We also evaluated HuggingFace’s VideoMAEForVideoClassification model, r3d-18, OpenSMILE, and a
Late Fusion model.</p>
        <p>
          VideoMAEForVideoClassification is HuggingFace’s pre-trained video masked auto-encoder
implementation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. By unfreezing and fine-tuning the last 6 of its 12 layers, we allowed the model to adapt to a
new domain successfully in spite of the limited data available.
        </p>
        <p>R3D-18 is a pre-trained, 18-layer 3D Residual Network (ResNet) model [27]. We used it in conjunction
with a Long Short-Term Memory (LSTM) model on sequences of video frames in an effort to capture
changes and transitions at 16 evenly spaced time points across the video.</p>
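        <p>Selecting 16 evenly spaced frames, as described above, reduces to picking evenly spaced indices over the decoded frame count; a minimal sketch (the frame-decoding step itself is omitted):</p>

```python
def evenly_spaced_indices(n_total, n_sample=16):
    """Indices of `n_sample` frames spread evenly across a video
    with `n_total` decoded frames; short clips repeat frames."""
    if n_total <= 0:
        return []
    step = n_total / n_sample
    return [min(int(i * step), n_total - 1) for i in range(n_sample)]
```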
        <p>OpenSMILE stands for open-source Speech and Music Interpretation by Large-space Extraction [28].
We used it to extract MFCC (Mel-frequency cepstral coefficient) features at the frame level, allowing
us to capture information about the emotional content of speech, which we then passed through a CNN.</p>
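        <p>A CNN over frame-level MFCC features can be sketched as below. The specific architecture (channel widths, kernel sizes) is an illustrative assumption; only the input shape, MFCC coefficients over time, comes from the pipeline described above.</p>

```python
import torch
import torch.nn as nn

class MFCCClassifier(nn.Module):
    """Small 1D CNN over frame-level MFCC features (illustrative sketch)."""

    def __init__(self, n_mfcc=13, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed-size clip vector
        )
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, n_mfcc, n_frames)
        return self.fc(self.conv(x).squeeze(-1))
```

        <p>The adaptive pooling layer makes the classifier independent of clip length, which matters when videos of very different durations share one batch pipeline.</p>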
        <p>Finally, we also evaluated a Late Fusion model using the models listed above and the RoBERTa-Large
model as inputs. This was done to see whether the Late Fusion model could combine these multiple sources
to make better overall predictions.</p>
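        <p>At its simplest, late fusion combines per-modality scores after each model has made its own prediction. The weighted average below is a minimal sketch of the idea (our actual fusion model is a learned combiner), and the modality names are illustrative.</p>

```python
def late_fusion(prob_by_modality, weights=None):
    """Combine per-modality P(sexist) scores by a (weighted) average.

    `prob_by_modality` maps modality name -> probability in [0, 1];
    `weights` optionally maps modality name -> importance.
    """
    mods = list(prob_by_modality)
    if weights is None:
        weights = {m: 1.0 for m in mods}
    total = sum(weights[m] for m in mods)
    return sum(weights[m] * prob_by_modality[m] for m in mods) / total
```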
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimentation</title>
      <sec id="sec-5-1">
        <title>5.1. RoBERTa</title>
        <p>The primary goal of this experiment was to identify a set of weights that leveraged the benefits of
large-scale pretraining while fine-tuning just enough to improve performance on our specific task. To
fine-tune the RoBERTa-large model on our dataset, we explored various strategies, including freezing
layers and using LoRA during training. Our best results came from freezing the first 20 layers and
allowing the final 4 transformer layers to learn. This approach struck a balance between preserving
the general-purpose language understanding from pretraining and adapting the model to our
domain-specific task, effectively achieving our objective.</p>
        <p>We also experimented extensively with optimization hyperparameters. A grid search revealed that
a learning rate of 0.00002 provided the most stable and consistent training. We used the AdamW optimizer
and found that omitting weight decay (setting it to 0.0) worked best in our case, possibly because the
small dataset already posed a strong regularization constraint. GPU memory limitations constrained us
to smaller batch sizes; among the values we tested, a batch size of 16 struck the best balance between
efficiency and training stability. Finally, we ran the model for up to 10 epochs and observed signs of
overfitting beyond epoch 6, at which point validation performance began to deteriorate. Thus, we
selected the model checkpoint from epoch 6 for evaluation and inference. The final values of the
hyperparameters are mentioned in Table 1.</p>
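        <p>The final training configuration above can be summarized in code; the small nn.Linear module is a placeholder standing in for the partially frozen RoBERTa-large classifier, so this is a sketch of the optimizer setup rather than the full training script.</p>

```python
import torch
import torch.nn as nn

LEARNING_RATE = 2e-5   # most stable value found by grid search
WEIGHT_DECAY = 0.0     # omitting decay worked best on our small dataset
BATCH_SIZE = 16        # best balance within GPU memory limits
NUM_EPOCHS = 6         # trained up to 10; validation degraded after epoch 6

model = nn.Linear(8, 2)  # placeholder for the fine-tuned layers
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # trainable only
    lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY,
)
```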
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Gemini + RoBERTa</title>
        <p>One of the central goals of our experiments was to determine whether a well-engineered prompt could
directly produce accurate classification labels, specifically for the hard–hard subset, without relying
on the RoBERTa model. In this setup, the language model was prompted to output the classification
label itself. However, this approach yielded performance that was comparable to our RoBERTa-based
pipeline, with no observed improvement in overall F1 score.</p>
        <p>Initial experimentation also revealed that our base prompt tended to bias predictions toward false
positives (FP). To address this, we manually tested several prompt variations designed to encourage
a more balanced classification. These included explicitly assigning the model a “male” or “female”
perspective, instructing it to adopt a “more lenient” or “less critical stance when labeling”,
and directing it to “assume a non-sexist interpretation in cases of uncertainty”.
Despite these adjustments, we observed minimal impact on model behavior.</p>
        <p>As a more systematic alternative, we implemented a self-refining prompt loop. In this process, the
output of each prompt iteration was used to iteratively improve the next prompt, with the goal of
optimizing classification balance. This loop was run for 5 iterations using a subset of 10 samples (a balanced
random sample of 10 rows equally split between the two target classes). The final prompt generated
from this process was significantly more comprehensive and, when evaluated on the train-validation
set, demonstrated a skewed tendency towards false negatives (FN).</p>
        <p>Since the two prompts erred in opposite directions, we tried to craft a prompt that would
use the description and analysis from each of the two prompts, in hopes of synthesizing a more
neutral perspective. However, this approach also resulted in skewed predictions, suggesting that finding
a prompt to balance the two would be tricky.</p>
        <p>We thus incorporated the outputs from both the FP- and FN-skewed prompts, including
transcribed text, descriptionFP, analysisFP, descriptionFN, and analysisFN, into the RoBERTa-Large architecture
using the same model parameters. These combinations were tested to identify the most effective input
representation for final classification.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Other Models</title>
        <p>Each of the other models we evaluated was tuned on its own, with various hyperparameters explored
per model. The Late Fusion model then used input from the other models:
VideoMAE, r3d-18, CNN, and RoBERTa-Large. Table 1 lists some of these hyperparameters and settings.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <sec id="sec-6-1">
        <title>6.1. RoBERTa</title>
        <p>We conducted a series of experiments to tune various hyperparameters, as the default values did not
produce promising results. The hyperparameters we optimized included the learning rate, batch size,
number of training epochs, and the number of layers to freeze. Through careful tuning, we achieved
an accuracy between 74% and 77% on our testing dataset. This confirmed that we had successfully
developed a strong text-based model that performs optimally given the training data, making it a
suitable candidate for use in our subsequent approach.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Gemini + RoBERTa</title>
        <p>Figure 4 (a) shows the results from the initial prompt (shown in the Appendix under
“Prompt for Sexism Detection Model - FP skewed”), which exhibited a
strong false positive (FP) tendency, with a False Positive Rate (FPR) of 43.95% and a
False Negative Rate (FNR) of 4.41%. We then systematically refined the prompt using
“Prompt to refine the sexism prompt leading to FN prompt”, as given in the
Appendix. The resulting prompt is considerably more exhaustive, as shown in the Appendix
under “Prompt for Sexism Detection Model - FN skewed”. With this more
exhaustive prompt, performance shifted significantly, yielding an FPR of 14.39% and a
much higher FNR of 62.79%, as illustrated in Figure 4 (b). This contrast motivated us to use both sets of
outputs as input features to the RoBERTa-Large model to capture complementary error patterns.</p>
        <p>The results of the RoBERTa-Large classifier using different feature combinations are summarized in
Table 2 for test accuracy and average epoch time. The highest test accuracy of 0.8308 was achieved
when combining descriptionFP, analysisFP, and analysisFN, with a moderate average epoch time of
3.75 seconds. Removing the descriptionFP component resulted in a slight performance drop to 0.8205,
while reducing inputs further led to continued accuracy decline. The model using only the text input
had the lowest accuracy of 0.7436 and one of the highest epoch durations (5.92 seconds). Interestingly,
input combinations excluding the transcribed text consistently trained significantly faster (roughly half
the time per epoch) while achieving higher or comparable accuracy. This suggests that the additional
processing cost of textual features may not be justified in this context.</p>
        <p>Table 2 presents the official leaderboard rankings for Task 3, Subtask 3.1 from EXIST 2025. Three
submissions were made under the DS@GT team name, achieving positions of 1st, 2nd, and 30th place.
The best-performing system, DS@GT EXIST_3, which combined Gemini with RoBERTa-Large using
the inputs descriptionFP, analysisFP, and analysisFN, achieved the highest scores across all metrics,
including an F1 YES of 0.7969. Closely following was DS@GT EXIST_2, which substituted analysisFN
with descriptionFN, reaching an F1 YES of 0.7714. In contrast, DS@GT EXIST_1, which relied solely on
RoBERTa-Large with the raw text input, ranked 30th, with significantly lower scores, including an F1
YES of only 0.6541.</p>
        <p>Figure 4: (a) confusion matrix using the FP-skewed prompt;
(b) confusion matrix using the FN-skewed prompt.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Other Models</title>
        <p>We found that the VideoMAE, r3d-18, and CNN models did not perform as well as RoBERTa or RoBERTa +
Gemini (see Table 3 for a summary of performance metrics). The Late Fusion model’s accuracy is
only as good as RoBERTa-Large’s, and its F1 score only slightly better. This leads to the conclusion that the other
models are not meaningfully contributing to the success of the Late Fusion model.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <p>We initially explored a multi-modal late fusion approach and observed that the model’s performance
was similar to that of the text-only RoBERTa-Large model, and hence that it disproportionately relied on the
raw text modality, as seen in Table 3. This outcome suggests that the text input contained the most
predictive information among all modalities, which led us to focus on enhancing the text modality
itself.</p>
      <p>We therefore moved to an approach that extracts more meaningful textual features using Gemini.
The results of the Gemini and RoBERTa-Large approach indicate that the DS@GT EXIST_3 model
outperformed other configurations, suggesting that its combination of descriptionFP, analysisFP, and
analysisFN provides a more comprehensive and balanced representation of the video content. This
combination appears to effectively address biases present in prompts that individually skew towards
false positives or false negatives, thereby enabling the model to better distinguish nuanced cases of
sexism. The integration of both description and analysis inputs likely captures different facets of the
video’s framing and intent, enriching the contextual understanding beyond what is possible with
single-source inputs.</p>
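      <p>One plausible way to combine such fields is to concatenate them into a single text input for the downstream classifier. The field names below follow the DS@GT EXIST_3 configuration; the [SEP] separator token and the ordering are illustrative assumptions, not necessarily the authors’ exact format:

```python
def build_model_input(fields,
                      order=("descriptionFP", "analysisFP", "analysisFN"),
                      sep=" [SEP] "):
    """Join the selected Gemini-generated fields into one classifier input,
    skipping any field that is absent."""
    return sep.join(fields[name].strip() for name in order if name in fields)

# Hypothetical Gemini outputs for one video (for illustration only):
sample = {
    "descriptionFP": "A creator reacts to a viral clip about household roles.",
    "analysisFP": "The video critiques the stereotype rather than endorsing it.",
    "analysisFN": "The video itself does not endorse the sexist framing.",
}
text_input = build_model_input(sample)
```
</p>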
      <p>Moreover, the complementary nature of Gemini’s prompt generation and RoBERTa’s classification
capabilities may have contributed to the model’s success. Gemini’s diverse prompt outputs offer varied
perspectives, while RoBERTa’s deep language understanding synthesizes this information effectively.
This synergy likely enhances the robustness of the classification, particularly in identifying subtle
endorsements or critiques of sexist content.</p>
      <p>It is important to recognize that the model was trained on annotations from only five individuals, which
may not fully represent the diverse perspectives within society. Such a limited annotator pool can introduce
biases, making the model’s predictions reflect a narrow viewpoint rather than a broader consensus on
sexism. Therefore, maintaining a human-in-the-loop approach is crucial to ensure careful review and
context-aware decisions. Additionally, improving the annotation process by involving a larger, more
diverse group of annotators would help create more representative training data, leading to fairer and
more accurate models.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>This work contributes a robust multimodal framework for detecting sexism in TikTok videos, combining
traditional deep learning with generative AI. By integrating textual, visual, and acoustic modalities
alongside generative descriptions and analyses, we capture complex signals often missed in unimodal
pipelines. Our experiments show that combining prompt-engineered generative features with RoBERTa
achieves superior performance, particularly on hard evaluation subsets. Notably, incorporating both
false-positive- and false-negative-skewed prompts leads to a richer feature space and better
generalization.</p>
      <p>Despite promising results, challenges remain. The annotation set was limited to a small group, which
may introduce social or cultural biases into the model’s predictions. This highlights the importance
of involving diverse annotators and maintaining a human-in-the-loop moderation process. Future
work should investigate real-time inference, continual learning from user feedback, and extending the
framework to detect other forms of online harm such as racism or homophobia. Overall, our approach
demonstrates that hybrid generative-transformer models are a viable path forward for nuanced, scalable
content moderation in video-centric social media.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>We thank the DS@GT CLEF team for providing valuable inputs and support throughout the project. This
research was supported in part through research cyberinfrastructure resources and services provided by
the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology,
Atlanta, Georgia, USA.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used:
• Gemini: generating descriptions of video datasets, which were then used for the
classification tasks.</p>
      <p>• OpenAI-GPT-4o: Grammar and spelling check.</p>
      <p>After using generative AI tools, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
      <p>[18] A. Rana, A. Jha, Emotion-based hate speech detection in videos, in: Conference on Empirical
Methods in Natural Language Processing Workshops (EMNLP), 2022.</p>
      <p>[19] J. A. García-Díaz, R. Pan, R. Valencia-García, Leveraging zero and few-shot learning for enhanced
model generality in hate speech detection in Spanish and English, Mathematics 11 (2023) 5004.
URL: https://doi.org/10.3390/math11245004. doi:10.3390/math11245004.</p>
      <p>[20] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, K.-W. Chang, Gender bias in contextualized word
embeddings, in: Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, volume 1, Association
for Computational Linguistics, 2019, pp. 629–634. URL: https://aclanthology.org/N19-1063/.</p>
      <p>[21] I. Arcos, P. Rosso, Sexism identification on TikTok: A multimodal AI approach with text, audio, and
video, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, volume XXXX
of Lecture Notes in Computer Science, Springer, 2024, pp. XX–XX.
URL: https://link.springer.com/chapter/10.1007/978-3-031-71736-9_2. doi:10.1007/978-3-031-71736-9_2.</p>
      <p>[22] L. D. Grazia, P. Pastells, M. V. Chas, D. Elliott, D. S. Villegas, M. Farrús, M. Taulé, MuSeD: A
multimodal Spanish dataset for sexism detection in social media videos, arXiv preprint
arXiv:2504.11169 (2025). URL: https://arxiv.org/abs/2504.11169.</p>
      <p>[23] K. Maity, et al., Multimodal video-based hate speech detection: The role of transcripts and audio,
in: Workshop on Social Media Safety, 2021.</p>
      <p>[24] A. Wang, Cross-modal transfer learning from meme to video for hate detection, Journal of
Multimedia Intelligence (2025).</p>
      <p>[25] J. Ma, R. Li, Rojing-CL at EXIST 2024: Leveraging large language models for multimodal sexism
detection in memes, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction,
volume 3740 of CEUR Workshop Proceedings, 2024, pp. 1080–1090.
URL: https://ceur-ws.org/Vol-3740/paper-100.pdf.</p>
      <p>[26] A. HARE, HARE: Harnessing AI reasoning explanations in hate speech detection, in: NeurIPS 2023
Workshop on Trustworthy AI, 2023.</p>
      <p>[27] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, M. Paluri, A Closer Look at Spatiotemporal
Convolutions for Action Recognition, in: 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, IEEE, Salt Lake City, UT, 2018, pp. 6450–6459.
URL: https://ieeexplore.ieee.org/document/8578773/. doi:10.1109/CVPR.2018.00675.</p>
      <p>[28] F. Eyben, M. Wöllmer, B. Schuller, openSMILE: the Munich versatile and fast open-source audio
feature extractor, in: Proceedings of the 18th ACM International Conference on Multimedia,
MM ’10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 1459–1462.
URL: https://doi.org/10.1145/1873951.1874246. doi:10.1145/1873951.1874246.</p>
      <p>[29] P. K. A. Vasu, F. Faghri, C.-L. Li, C. Koc, N. True, A. Antony, G. Santhanam, J. Gabriel, P. Grasch,
O. Tuzel, H. Pouransari, FastVLM: Efficient Vision Encoding for Vision Language Models, 2024.
URL: http://arxiv.org/abs/2412.13303. doi:10.48550/arXiv.2412.13303. arXiv:2412.13303 [cs].</p>
      <p>[30] Y. Ouali, A. Bulat, A. Xenos, A. Zaganidis, I. M. Metaxas, G. Tzimiropoulos, B. Martínez,
Discriminative fine-tuning of LVLMs, arXiv abs/2412.04378 (2024).
URL: https://api.semanticscholar.org/CorpusID:274514426.</p>
      <p>[31] A. Miyaguchi, A. Cheung, M. Gustineli, A. Kim, Transfer Learning with Pseudo Multi-Label
Birdcall Classification for DS@GT BirdCLEF 2024, 2024. URL: http://arxiv.org/abs/2407.06291.
doi:10.48550/arXiv.2407.06291. arXiv:2407.06291 [cs].</p>
      <p>[32] V. Sharma, M. Gupta, A. Kumar, D. Mishra, Video Processing Using Deep Learning Techniques: A
Systematic Literature Review, IEEE Access 9 (2021) 139489–139507.
URL: https://ieeexplore.ieee.org/abstract/document/9563948. doi:10.1109/ACCESS.2021.3118541.</p>
      <p>[33] Guide to Vision-Language Models (VLMs), 2024. URL: https://encord.com/blog/vision-language-models-guide/.</p>
      <p>[34] J. Ma, R. Li, Notebook for the EXIST Lab at CLEF 2024 (2024).</p>
      <p>[35] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language
supervision, CoRR abs/2006.03654 (2020).
URL: http://dblp.uni-trier.de/db/journals/corr/corr2006.html#abs-2006-03654.</p>
    </sec>
    <sec id="sec-11">
      <title>A. Prompt</title>
      <sec id="sec-11-1">
        <title>A.1. False Positive Skewed Prompt</title>
        <p>Prompt for Sexism Detection Model - FP skewed
"You are an expert in identifying the presence of sexism in video content."
"Your task is to assess whether the content of the video criticizes through marginalization,"
"prejudice against women based on gender stereotypes, beliefs in male superiority, or misogynistic views."
"Briefly describe and analyze the content of the video then label it as ’YES’ if the video embodies or criticizes sexism."
"Please respond strictly in the following JSON format:\n\n"
'{\n'
' "description": "[Your one-sentence description of the video]",\n'
' "label": "YES" or "NO",\n'
' "analysis": "[One or two sentences explaining why the video is or isnt sexist]"\n'
'}\n\n'
"Only return valid JSON. Do not include any explanations or extra text."</p>
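        <p>Responses in the requested JSON format then need parsing before the label and analysis fields can be used downstream. A minimal sketch; the code-fence stripping is a defensive assumption for models that wrap output in fences, not a detail taken from our pipeline:

```python
import json

def parse_gemini_response(raw):
    """Parse the JSON object the prompt requests, normalizing the label
    and tolerating surrounding code fences in the model output."""
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        body = lines[1:-1] if lines[-1].strip().startswith("```") else lines[1:]
        text = "\n".join(body)
    obj = json.loads(text)
    label = obj.get("label", "").strip().upper()
    if label not in {"YES", "NO"}:
        raise ValueError("unexpected label: " + repr(label))
    return obj.get("description", ""), label, obj.get("analysis", "")

# Illustrative raw response:
raw = '{"description": "A short skit.", "label": "no", "analysis": "Critiques the trope."}'
desc, label, analysis = parse_gemini_response(raw)
```
</p>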
      </sec>
      <sec id="sec-11-2">
        <title>A.2. False Negative Skewed Prompt</title>
        <p>Prompt for Sexism Detection Model - FN skewed
"""
You are an expert in identifying sexism in video content. Your primary task is to distinguish between videos that *are themselves sexist*
(i.e., they actively promote, endorse, or normalize sexism through their own message, framing, or overall presentation) and videos
that merely *depict, discuss, or incidentally include* sexist content (e.g., for critique, satire, factual reporting, artistic
expression, or as part of a narrative where the video *itself* does not endorse the sexism). Focus on what the *video creator* is
trying to convey as their own viewpoint or message, not just what is shown or said within the video.
**The core question is: Does the video *itself* actively promote, endorse, or normalize sexism, intending to persuade the viewer of
sexist ideas?**
Sexism is defined as the *video itself* communicating one or more of the following, in a way that the video *actively promotes or clearly
endorses as acceptable, normal, or true*, thereby contributing to inequality or harm:
* **Marginalization:** The video *itself* presents a gender (typically women) as inherently less important, capable, or deserving, or
unfairly limits their opportunities or voice *due to their gender*, in a manner the video appears to support or normalize.
* **Harmful Stereotypes:** The video *itself* actively promotes demeaning, limiting, or broadly negative generalizations about a gender
*as if they are inherent truths*. For this to be sexist, the video’s discernible primary intent in promoting such stereotypes must
be to reinforce inequality, belittle the gender, or restrict their opportunities. Observations or humor based on gender are *not*
sexist under this rule *unless the video’s overall framing and primary intent clearly weaponize these stereotypes to achieve such
harmful outcomes as its central message.* (e.g., simply listing stereotypical "wants" of women is not sexist unless the video uses
this to argue for their inferiority or to justify discriminatory treatment).
* **Male Superiority:** The video *itself* asserts or implies men’s inherent dominance, greater value, or rightful control over women,
framing this as natural, desirable, or justified.
* **Misogyny:** The video *itself*, through its own primary message or creative choices, expresses or encourages hatred, contempt, or
ingrained prejudice against women. The mere inclusion of misogynistic language (e.g., in song lyrics, character dialogue) does *not*
automatically make the video misogynistic *unless the video’s own framing and primary intent clearly center on endorsing,
celebrating, or amplifying that misogynistic sentiment as its own message.*
**Crucial Decision Point (Labeling):**
* **Label "YES"**: If the video’s *own discernible primary message, tone, narrative voice, or overall presentation* clearly and
actively promotes, endorses, normalizes, or celebrates any of the sexist elements defined above. The video itself is the source of
sexist advocacy or validation.
* **Label "NO"**:
* If the video depicts sexist acts, language, or ideas *primarily to critique, condemn, satirize, or factually report on them*,
where the video’s own stance is clearly against the depicted sexism or unendorsing of it.
* If sexist ideas/actions are expressed by characters or are part of a narrative, but the *video itself does not demonstrably
endorse or promote these as valid, acceptable, or desirable as part of its own primary message*. The video might be exploring
complex themes or showing flawed characters without its own voice condoning the sexism.
* If the video depicts common or lighthearted gender stereotypes without the video’s *own discernible primary intent* being to use
these stereotypes to demean, restrict, or advocate for unequal treatment of a gender. The video isn’t *weaponizing* the
stereotype to push a harmful sexist agenda.
* If the video incorporates material containing sexist language or ideas (e.g., song lyrics, dialogue from a film) but the *video’s
own primary focus, message, and creative intent* are not to endorse or amplify the sexism within that material. The presence
of such material is incidental to, or serves a different purpose within, the video’s overall non-sexist message (e.g., used for
its beat, a non-sexist thematic element, or artistic quotation).
* If sexist elements are merely incidental background elements not central to any message actively endorsed by the video, and are
not the focus of the video’s *own* active promotion or endorsement.</p>
        <p>Respond strictly in the following JSON format:
{
"description": "[One sentence describing the video’s relevant content AND, crucially, the video’s *own apparent stance or framing* of
that content, focusing on whether the video *itself* promotes, endorses, or normalizes sexism.]",
"label": "YES" or "NO"
"analysis": "[One sentence describing the reason why it was labelled as YES or NO, referencing the specific definitions if applicable]"
}
Only return valid JSON. Do not include any explanations or extra text.</p>
        <p>"""</p>
        <p>Prompt To Refine The Sexism Prompt Leading To FN Prompt
example_text = "\n\n".join([
    f"Video transcript: {e['text']}\nExpected label: {e['expected']}\nPredicted: {e['predicted']}\n"
    f"Gemini's description: {e['description']}\nGemini's analysis: {e['analysis']}\n"
    f"Gemini's label probability: {e['probability']}"
    for e in errors[:10]
])
refinement_prompt = f"""
You are helping refine a prompt that instructs an AI to classify whether a video is sexist or not.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Social Network Usage &amp; Growth Statistics (2025), 2023. URL: https://backlinko.com/social-media-users.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Woodward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>McGettrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. G.</given-names>
            <surname>Dick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Teeters</surname>
          </string-name>
          ,
          <article-title>Time Spent on Social Media and Associations with Mental Health in Young Adults: Examining TikTok</article-title>
          , Twitter, Instagram, Facebook, Youtube, Snapchat, and Reddit,
          <source>Journal of Technology in Behavioral Science</source>
          (
          <year>2025</year>
          ). URL: https://doi.org/10.1007/s41347-024-00474-y. doi:
          <volume>10</volume>
          .1007/s41347-024-00474-y.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Newton</surname>
          </string-name>
          ,
          <article-title>Facebook will pay $52 million in settlement with moderators who developed PTSD on the job</article-title>
          ,
          <year>2020</year>
          . URL: https://www.theverge.com/2020/5/12/21255870/facebook-content-moderator-settlement-scola-ptsd-mental-health.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Myers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vondrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Murphy</surname>
          </string-name>
          , C. Schmid,
          <article-title>VideoBERT: A Joint Model for Video and Language Representation Learning</article-title>
          , in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, Seoul, Korea (South),
          <year>2019</year>
          , pp.
          <fpage>7463</fpage>
          -
          <lpage>7472</lpage>
          . URL: https://ieeexplore.ieee.org/ document/9009570/. doi:
          <volume>10</volume>
          .1109/ICCV.
          <year>2019</year>
          .
          <volume>00756</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arnab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lučić</surname>
          </string-name>
          , C. Schmid,
          <article-title>ViViT: A Video Vision Transformer</article-title>
          , in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV),
          <year>2021</year>
          , pp.
          <fpage>6816</fpage>
          -
          <lpage>6826</lpage>
          . URL: https://ieeexplore.ieee.org/document/9710415. doi:
          <volume>10</volume>
          .1109/ICCV48922.
          <year>2021</year>
          .
          <volume>00676</volume>
          , iSSN:
          <fpage>2380</fpage>
          -
          <lpage>7504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. Wang,</surname>
          </string-name>
          <article-title>VideoMAE: Masked Autoencoders are Data-Eficient Learners for Self-Supervised Video Pre-Training</article-title>
          , in: S. Koyejo,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Oh (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2022</year>
          , pp.
          <fpage>10078</fpage>
          -
          <lpage>10093</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/ 2022/file/416f9cb3276121c42eebb86352a4354a-Paper-Conference.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bharti</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Zhou, UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding</article-title>
          and Generation,
          <year>2020</year>
          . URL: https://ui. adsabs.harvard.edu/abs/2020arXiv200206353L. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2002</year>
          .
          <volume>06353</volume>
          , aDS Bibcode: 2020arXiv200206353L.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Iraqi</surname>
          </string-name>
          ,
          <article-title>LoRA for sequence classification with RoBERTa, Llama, and Mistral</article-title>
          ,
          <year>2024</year>
          . URL: https://huggingface.co/blog/Lora-for-sequence-classification-with-Roberta-Llama-Mistral, accessed: 2025-06-26.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Predicting the type and target of ofensive posts in social media</article-title>
          ,
          <source>in: Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          , et al.,
          <source>Semeval-2020</source>
          task 12:
          <article-title>Multilingual ofensive language identification in social media</article-title>
          (ofenseval
          <year>2020</year>
          ),
          <source>in: Proceedings of the Fourteenth Workshop on Semantic Evaluation</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1425</fpage>
          -
          <lpage>1447</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirk</surname>
          </string-name>
          ,
          <article-title>Explainable detection of online sexism (edos) task at semeval-2023</article-title>
          , in: Proceedings of SemEval-2023,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          , et al.,
          <article-title>Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter (hateval)</article-title>
          ,
          <source>in: Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , et al.,
          <article-title>Benchmarking aggression identification in social media</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Waseem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>Hateful symbols or hateful people? predictive features for hate speech detection on twitter</article-title>
          ,
          <source>in: Proceedings of the NAACL Student Research Workshop</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>O. de Gibert</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García-Pablos</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Cuadros</surname>
          </string-name>
          ,
          <article-title>Hate speech dataset from a white supremacy forum</article-title>
          ,
          <source>in: Proceedings of the 2nd Workshop on Abusive Language Online</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Qian</surname>
          </string-name>
          , et al.,
          <article-title>A benchmark dataset for learning to intervene in online hate speech</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          , et al.,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          , arXiv preprint arXiv:
          <year>1907</year>
          .
          <volume>11692</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viégas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <article-title>AttentionViz: A global view of transformer attention</article-title>
          ,
          <source>IEEE Transactions on Visualization and Computer Graphics</source>
          <volume>30</volume>
          (
          <year>2024</year>
          )
          <fpage>262</fpage>
          -
          <lpage>272</lpage>
          . URL: https://ieeexplore.ieee.org/abstract/document/10297591. doi:10.1109/TVCG.2023.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wullach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Minkov</surname>
          </string-name>
          ,
          <article-title>Towards hate speech detection at large via deep generative modeling</article-title>
          ,
          <source>IEEE Internet Computing</source>
          <volume>25</volume>
          (
          <year>2021</year>
          )
          <fpage>48</fpage>
          -
          <lpage>57</lpage>
          . doi:10.1109/MIC.2020.3033161.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pendzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wullach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <article-title>Generative AI for hate speech detection: Evaluation and findings</article-title>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>