<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ECA-SIMM-UVa at EXIST 2025: A Segmentation Oriented Approach to Sexism Detection in Tik Tok Videos Based on a "One Is Enough" Paradigm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Fernández García</string-name>
          <email>david.fernandez@uva.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrique Amigó Cabrera</string-name>
          <email>enrique@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentín Cardeñoso Payo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Nacional de Educación a Distancia (UNED)</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>ECA-SIMM Research group, University of Valladolid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper details the ECA-SIMM-UVa team's participation in Task 7: Sexism Identification in TikToks, part of the EXIST 2025 challenge. The focus is on the automatic detection of potentially harmful sexist behaviours on social platforms. We adopted a segmentation-oriented approach, splitting TikTok videos into textual, audio, and video channels, on the hypothesis that sexism can manifest in spoken words, embedded text, speaker tone, or visual content (text, pictures or other images). We trained individual deep learning classifiers for each channel and explored several prediction fusion mechanisms to combine their outputs: One Is Enough (OIE), Majority Voting, and Probabilistic OIQ for hard evaluation, and Logistic Regression and Weighted Sum for soft evaluation. As a significant finding, models using the textual channel showed superior performance, especially when using the original text provided with each sample in the dataset. They consistently outperformed the audio and video channels, indicating that textual information is the most informative source for sexism detection in this context. Although the fusion mechanisms achieved good estimated performance, this was almost exclusively due to the decisions of the original-text model being fused with the others; high decision thresholds effectively discarded the contributions of the audio and video channels. Our systems ranked 1st, 3rd, and 7th out of 41 submissions in the hard evaluation category, and 15th, 17th, and 18th out of 35 submissions in the soft evaluation category, considering instances of any language in both cases. Our results emphasize the challenges that multimodal sexism detection still faces and the need to further improve pre-trained audio and video models.</p>
      </abstract>
      <kwd-group>
        <kwd>Segmentation</kwd>
        <kwd>Fusion Mechanism</kwd>
        <kwd>TikToks</kwd>
        <kwd>Multimodal</kwd>
        <kwd>Sexism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Nowadays there are many different social platforms, such as Twitter, Instagram or TikTok, where people
can share a huge variety of multimedia and hypermedia publications, delivered to users in a multimodal
fashion. This is a great opportunity to encourage positive social interaction and discussion, but it also
opens the way for potentially dangerous and harmful behaviours, like sexism or misogyny, since these
platforms become huge loudspeakers for many kinds of discriminatory content. Consequently, one of the most important
problems today is how to deploy realistic regulations and mechanisms to detect, control and
mitigate these types of behaviour. The vast amount of content uploaded to these platforms makes it
impossible to apply such controls manually. Thus, the development of automatic tools
that help to control and mitigate harmful information and behaviours remains an open
challenge for the coming years.</p>
      <p>The EXIST challenge (sEXism Identification in Social neTworks) is a group of tasks that aims to promote
research on designing, implementing and evaluating automatic sexism detection systems for
social network content. This year’s challenge includes three different global tasks:
• Global Task 1: a binary classification task, where systems have to decide whether a
post is sexist or not.
• Global Task 2: a multi-class classification task, where systems have to identify the
author’s intention in the posts classified as sexist in Task 1. There are three different source
intention classes: direct, reported or judgemental.
• Global Task 3: a multi-label classification task, where systems have to categorize sexist
posts according to a set of defined types: ideological-inequality, stereotyping-dominance, objectification,
sexual-violence, misogyny-non-sexual-violence.</p>
      <p>
        Each of these global tasks can be approached from three different perspectives, depending on the kind of
input media considered as the source: textual posts, meme posts and video posts. For a more
complete description of the challenge, see the overview documents [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ].
      </p>
      <p>This paper describes the participation of the ECA-SIMM-UVa research team in this challenge. We focus
on Task 7: Sexism Identification in TikToks, which deals with one of the most important social media
platforms of our time. We built systems for both the soft and hard evaluation modalities. Our approach
is based on an initial segmentation of videos into three different source channels: text, audio and
image sequences (images). The textual channel includes both the transcription of the words spoken in
the video and any text that can be recognized as embedded in the video itself. The audio
channel includes the soundtrack extracted from the video. The images channel includes the sequence of
image frames in the video, without audio. For each channel, a deep learning classifier is trained
by fine-tuning a specialized pre-trained model on that type of data. Then, we explore different
fusion mechanisms to merge the individual predictions for text, audio and
images into a final decision. The fusion mechanisms applied depend on the
evaluation type. For hard evaluation we tried Probabilistic OIQ, One Is Enough (OIE) and Majority Voting. For soft
evaluation we tried Logistic Regression and a Weighted Sum of channels.</p>
      <p>The study therefore aims not only to obtain good competition results on the classification task, but
also to provide a comparative study of classification models for different information channels, analysing
the relative importance of each channel for the sexism identification task. More specifically, we seek to
answer the following research questions:
• How do pre-trained text, audio and image models compare in terms of performance when classifying
content as sexist or not?
• What is the relative impact of the three information channels (text, audio and image sequence)
on classification performance, both in isolation and in combination?</p>
      <p>The paper is organised as follows. Section 2 investigates related studies for sexism identification on
video. Section 3 describes the ECA-SIMM-UVa approach for Task 7 of EXIST 2025. Section 4 presents
results and rankings. Finally, in Section 5, we include discussion, conclusions, and suggestions for
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The importance of automated detection of sexism on digital platforms has recently increased as a
consequence of the enormous amount of multimedia content delivered every hour through social networks
and other distribution channels. Researchers have made serious efforts to develop automated systems based
on ML and AI, promoting and participating in initiatives like EXIST [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ] or SemEval
2023 challenges [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and through the publication of relevant studies [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. Research has mainly focused
on textual data, which is easily obtained from social networks like X (Twitter) or Gab, and also from
other audio and video sources by means of automatic speech recognizers of ever increasing quality.
      </p>
      <p>
        In the realm of sexism identification based on text, as addressed in EXIST 2024 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] Tasks 1, 2, and 3, and
the textual component of Tasks 4, 5, and 6, the state of the art is primarily characterized by the dominant
use of encoding-based transformer models fine-tuned on the EXIST dataset, frequently enhanced with
additional components. Meticulous data preprocessing was a key factor in improving performance,
involving removal of irrelevant elements and the application of data augmentation techniques like AEDA
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and automatic English-Spanish translation. Ensembles of encoding-based transformer models such
as BERT [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], RoBERTa [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and DeBERTa [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] (including multilingual versions or those pre-trained on
domains like tweets or hate speech) proved highly effective, particularly in the soft evaluation setting,
which was linked to their training with soft labels. Ensemble strategies varied, from assigning a higher
weight to the best-performing model when performance differences were significant, to using a proportion
of votes when differences were smaller. A significant distinguishing factor, especially successful in
the hard evaluation setting, was the incorporation of Large Language Models (LLMs) like Llama [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
Mistral [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and GPT, primarily used for zero-shot or few-shot learning due to computational costs,
relying heavily on prompt engineering. While encoding-based transformers generally excelled in soft
evaluation, LLMs showed superior performance in hard evaluation. For multimodal tasks (4, 5, 6),
top performances were unexpectedly achieved by models focusing solely on text, with systems using
encoding-based transformers for text analysis often outranking truly multimodal approaches.
      </p>
      <p>
        Automated detection of sexism in video has attracted much less interest from researchers, probably due to
the difficulties associated with collecting and processing video data, extracting information from it and training
models on it. Thus, few works specifically deal with sexism. A novel corpus of 11 hours of
video extracted from TikTok and BitChute was presented in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]; the dataset is annotated
at three different levels: text, audio and image sequences. In [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], a segmentation and multimodal
approach is explored to face the problem. They use a wide variety of models such as RoBERTa [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
(textual model), Wav2Vec [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] (audio model) and ViT [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] (video model).
      </p>
      <p>
        As the field of interest broadens, a higher number of works using video sources is found, as in,
for example, hate speech detection [
        <xref ref-type="bibr" rid="ref21 ref22">21, 22, 23, 24</xref>
        ]. A common practice is to obtain or extract text
transcriptions from videos and train classifiers on that textual data alone. Other approaches take the
multimodal route, combining text, audio and video features [25, 26]. Among these,
we find multimodal deep learning systems [27, 28, 29] and model ensemble approaches
[30, 31], which mainly use majority voting to make their decisions.
      </p>
      <p>The shortage of works centred on detecting sexism in video is a clear sign that research
must pay more attention to this field, especially given its increasing importance. The development of
new corpora, the improvement of multimodal models and a greater awareness of the importance
of this kind of research are crucial factors for advancing the field.</p>
    </sec>
    <sec id="sec-3">
      <title>3. ECA-SIMM-UVa Approach</title>
      <p>
        3.1. Data
Table 1 shows the composition and distribution of the EXIST 2025 TikTok dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] both in English
and Spanish. This dataset was specifically developed for the challenge, extending sexism detection
tasks to TikTok videos. TikTok’s recommendation algorithm could reinforce sexism and normalize
misogynistic attitudes, significantly impacting adolescents’ self-esteem and gender perceptions. Apify’s
TikTok Hashtag Scraper was used for data collection, and, as a crucial feature of the dataset, annotation
was performed by trained annotators from Servipoli, organized into mixed-gender pairs to avoid biases.
A Learning With Disagreement paradigm was adopted, incorporating diverse human perspectives and
disagreements to foster human-centric AI, thereby reducing bias and promoting inclusive
decision-making. The dataset supports three main tasks: Sexism Identification, Source Intention Detection, and
Sexism Categorization.
      </p>
      <sec id="sec-3-1">
        <title>3.2. Segmentation</title>
        <p>We adopted a segmentation-based approach to detecting sexism in video (see Figure 1). This decision
was based on the hypothesis that a TikTok video can be perceived as sexist for four distinct
reasons:
1. The semantic content of the spoken words is sexist. This involves using audio processing and
speech transcription as a source.
2. The embedded text within the video conveys sexist content. This includes any on-screen textual
elements, such as real posters or comments added on the video.
3. The tone or speech intention of the speaker carries a sexist attitude. This involves audio signal
processing and analysis.
4. The visual content of the video is sexist. This involves visual scene analysis.</p>
        <p>Based on these four paths to sexism detection in videos, we segment TikTok videos
into three input channels: text, audio and video.</p>
        <p>In our experiments, we considered three different sources for the text channel. The first was the text
provided with the dataset, which combines the speech transcription and the
embedded text. We then obtained two additional textual sources. First, we used Whisper-X [32] to get a detailed,
time-aligned transcription of each video, which allowed us to identify what was said and when it was
said. Second, we used DeepSeek-VL [33] to extract text messages
embedded as images in the video frames. After experimenting with various prompts, we opted for a
zero-shot prompting strategy with this tool (see Listing 1). As the output of this processing, we get three
text tiers in the text channel: original (which mixes transcription and embedded text), transcription,
and embedded text.</p>
        <p>Listing 1: Prompt for zero-shot video text extraction with DeepSeek-VL [33]
"role": "User",
"content": "&lt;image_placeholder&gt;Extract ONLY the main visible
text in this image and give it back. Ignore TikTok interface
elements such as: hashtags, user IDs, counters, buttons,
mentions, tags, or any text overlaid by the platform. Focus
exclusively on text that is part of the original video
content. The text may be in English or Spanish. Preserve the
original format (uppercase/lowercase) and organize the text
in natural reading order. Do not add interpretations, context,
or explanations. Return ONLY the extracted text. In order
to give back the text extracted use this form: The text
extracted in the image is: &lt;here put the text you have
extracted&gt;.",
"images": [image_path]
"role": "Assistant",
"content": ""</p>
        <p>To extract the audio and video channels, we used FFmpeg, a framework commonly used
for audio, video and multimedia file and stream processing.</p>
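        <p>As an illustration of this splitting step, the following minimal Python sketch builds two FFmpeg command lines, one that keeps only the audio stream and one that keeps only the video stream. The function names, the output file names and the 16 kHz mono WAV target are assumptions made for this example rather than a description of the exact commands we ran; only standard FFmpeg options (-i, -y, -vn, -an, -acodec, -ar, -ac, -c:v) are used.</p>
        <preformat>
```python
import subprocess
from pathlib import Path


def build_ffmpeg_commands(video_path: str, out_dir: str) -> dict:
    """Build ffmpeg command lines that split one video into audio and silent video."""
    stem = Path(video_path).stem
    out = Path(out_dir)
    audio_path = str(out / f"{stem}.wav")
    frames_path = str(out / f"{stem}_noaudio.mp4")
    return {
        # -vn drops the video stream; 16 kHz mono PCM is an assumed target format.
        "audio": ["ffmpeg", "-y", "-i", video_path, "-vn",
                  "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", audio_path],
        # -an drops the audio stream; the video stream is copied without re-encoding.
        "video": ["ffmpeg", "-y", "-i", video_path, "-an",
                  "-c:v", "copy", frames_path],
    }


def split_channels(video_path: str, out_dir: str) -> None:
    """Run both commands; requires the ffmpeg binary to be available on PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_ffmpeg_commands(video_path, out_dir).values():
        subprocess.run(cmd, check=True)
```
        </preformat>
        <p>Keeping command construction separate from execution makes it possible to inspect or log the commands before running them on a large collection of videos.</p>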
      </sec>
      <sec id="sec-3-2">
        <title>3.3. General Approach</title>
        <p>We trained three classifiers, one for each input channel (audio, text, video). To ensure comparability
across models, we applied a common architecture and training strategy for each of them (see Figure 2).
All classifiers consist of a channel-specific encoder followed by a Multi-Layer Perceptron (MLP). Given
that our task is binary classification (sexist vs. non-sexist), we used the Binary Cross-Entropy (BCE) loss
function during training.</p>
        <p>We conducted hyper-parameter tuning using the Optuna library, with an 80/20 train-validation split.
The original design of this work follows the hypothesis shown in Figure 1, which assumes that
a video is classified as sexist as soon as one of the models classifies it as such, regardless of the
decisions of the other models. This implies that the precision of each classifier is critical, since a single
false positive leads to a global misclassification. In this scenario, we need a metric
that gives greater weight to precision in order to decide which hyper-parameter set is
optimal. We chose F-beta, with β = 0.5, as the primary optimization metric, in order to give twice as
much importance to Precision as to Recall. For each fixed set of hyper-parameters, we also
performed threshold calibration through exhaustive threshold search, in order to maximize our primary
metric. For completeness, we also monitored alternative metrics: ICM [34], ICM_norm and F1 Score.</p>
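        <p>To make the metric and the threshold calibration concrete, the sketch below computes F-beta from binary predictions, using the standard definition F_β = (1 + β²)·TP / ((1 + β²)·TP + β²·FN + FP), and scans an evenly spaced grid of thresholds for the one that maximizes it. The function names and the 101-point grid are assumptions made for illustration; the granularity of the search we actually ran is not implied by this example.</p>
        <preformat>
```python
def f_beta(y_true, y_pred, beta=0.5):
    """F-beta for the positive class; beta below 1 favours precision over recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else 0.0


def best_threshold(y_true, scores, beta=0.5, steps=101):
    """Exhaustive search over an evenly spaced grid of thresholds in [0, 1]."""
    best_t, best_f = 0.0, -1.0
    for i in range(steps):
        t = i / (steps - 1)
        preds = [1 if s >= t else 0 for s in scores]
        f = f_beta(y_true, preds, beta)
        # Strict comparison keeps the lowest threshold among ties.
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```
        </preformat>
        <p>For example, with the toy labels [0, 0, 1, 1, 1, 0] and scores [0.10, 0.40, 0.35, 0.80, 0.90, 0.60], the search selects a threshold of 0.61, which removes every false positive at the cost of one missed positive and yields an F-beta of about 0.909.</p>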
        <p>Once the hyper-parameter search was complete, we needed to estimate the real error of our system.
To get a global estimate over the entire dataset, we applied 5-fold cross-validation, which allowed us
to obtain a prediction for each individual sample. As in the tuning phase, threshold adjustment was
performed after each fold.</p>
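        <p>The way cross-validation yields one prediction per sample can be sketched as follows. The sketch deals shuffled indices into k folds and scores every sample with a model that did not see it during training. Here, fit is a hypothetical callable that stands in for fine-tuning a channel-specific classifier and returns a scoring function; the per-fold threshold adjustment mentioned above is omitted for brevity.</p>
        <preformat>
```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def out_of_fold_scores(samples, labels, fit, k=5, seed=0):
    """Score every sample with a model trained without that sample.

    `fit(train_x, train_y)` is a hypothetical callable returning a scoring function.
    """
    folds = k_fold_indices(len(samples), k, seed)
    scores = [None] * len(samples)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(samples)) if i not in held]
        score = fit([samples[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            scores[i] = score(samples[i])
    return scores
```
        </preformat>
        <p>Because each sample is held out exactly once, the resulting list of scores covers the whole dataset and can be fed directly to any fusion mechanism that needs per-sample predictions.</p>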
        <p>Finally, after selecting the optimal hyper-parameters, we retrained each model on the full dataset
to obtain the final classifiers used for downstream analysis.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Models</title>
        <p>The procedure described in Figure 2 was followed for every channel-specific system we trained. However,
due to the unique characteristics of each channel, certain differences emerged in the training processes
of the three models. We trained three distinct textual models, one for each of the three
textual representations described in Section 3.2. All three were trained using an identical set
of hyper-parameters to ensure comparability. As the pre-trained model for the textual channel, we
used XLM-RoBERTa-Large [35]; Wav2Vec-Large-XLSR-53 [36] was used for the audio channel and
ViViT-b-16x2-Kinetics400 [37] for the video channel. Specific hyper-parameter values for each model
are given in Table 2.</p>
        <p>In our experiments, we consider three configurations based on the source of textual input. The Original
configuration uses the model trained with the textual input provided directly by the dataset. The Own
configuration includes two separate textual models, one trained on automatically extracted transcriptions
and the other on automatically extracted embedded text. The All configuration includes all three textual
models.</p>
        <p>All experiments were carried out using an NVIDIA A40 GPU with 48 GB of memory. Due to high memory
requirements, mainly when processing video and audio, we had to apply gradient accumulation
in order to simulate larger batch sizes during training. It should be noted that the available hardware
was a limitation when training the audio and video models, since these require a large amount
of resources.</p>
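        <p>The idea behind gradient accumulation can be checked on a toy problem. The sketch below uses a one-parameter linear model with a mean squared error loss (both chosen only for illustration, not our classifiers) and shows that summing suitably scaled micro-batch gradients reproduces the gradient of the full batch, which is why small micro-batches can simulate a larger batch size.</p>
        <preformat>
```python
def grad_mse_linear(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch, w.r.t. w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, data, micro_batch_size):
    """Accumulate micro-batch gradients so that they match one large-batch gradient."""
    total = 0.0
    n_micro = 0
    for start in range(0, len(data), micro_batch_size):
        micro = data[start:start + micro_batch_size]
        # Each micro-batch gradient is scaled by its share of the large batch.
        total += grad_mse_linear(w, micro) * len(micro) / len(data)
        n_micro += 1
    return total, n_micro
```
        </preformat>
        <p>In a deep learning framework the same effect is usually obtained by scaling the loss of each micro-batch and applying the optimizer update only after several backward passes, so that peak memory is bounded by the micro-batch size.</p>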
      </sec>
      <sec id="sec-3-4">
        <title>3.5. Fusion Mechanisms</title>
        <p>As pointed out in Section 3.3, our first attempt followed a One Is Enough (OIE) approach, so the whole
training process was focused on it. However, we also explored alternative fusion mechanisms to
combine model outputs. Different fusion alternatives were chosen depending on the type of evaluation:
in hard evaluation we have to produce a final label, whereas in soft evaluation we should produce an
accurate probability distribution over labels. We implemented three fusion mechanisms for hard evaluation:
1. One Is Enough (OIE): A video is classified as sexist if any individual model detects it as such.
This mechanism follows an idea similar to the screening of critical patients in a
hospital: if the objective is to determine whether the patient is sick or not, it is enough for one of
the specialists to affirm it, without the need for further opinions from other doctors.
2. Majority Voting: A simple majority rule is applied, where at least half plus one of the total
number of models must classify the video as sexist for the final label to be positive.
3. Probabilistic OIQ: This method considers all possible combinations of binary outputs from the
models. For each pattern (e.g., [1, 0, 1]), we estimate the empirical probability that the video is
sexist, based on the distribution of labels observed for each pattern. Note that before applying this
method we have to adjust an individual threshold for each model, chosen to maximize that
model’s performance. At inference time, the pattern of model outputs is matched against
these empirical probabilities. A final label is assigned based on whether this probability exceeds a
tuned decision threshold.</p>
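        <p>The three hard fusion rules described above can be sketched in a few lines of Python. This is an illustrative reading of the descriptions, not our exact implementation: the function names are hypothetical, majority voting is interpreted as a strict majority of the models, and the pattern-based rule falls back to an assumed default probability for output patterns never seen when the table was estimated.</p>
        <preformat>
```python
from collections import defaultdict


def one_is_enough(votes):
    """Positive if any channel-specific model votes positive."""
    return 1 if any(v == 1 for v in votes) else 0


def majority_voting(votes):
    """Positive only if strictly more than half of the models vote positive."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def fit_pattern_table(vote_patterns, labels):
    """Empirical P(sexist | pattern) estimated from binary vote patterns."""
    counts = defaultdict(lambda: [0, 0])  # pattern -> [positives, total]
    for pattern, label in zip(vote_patterns, labels):
        counts[tuple(pattern)][0] += label
        counts[tuple(pattern)][1] += 1
    return {p: pos / total for p, (pos, total) in counts.items()}


def pattern_fusion(votes, table, threshold=0.5, default=0.0):
    """Look up the pattern's empirical probability and compare it with a threshold."""
    return 1 if table.get(tuple(votes), default) >= threshold else 0
```
        </preformat>
        <p>With three models, the votes [1, 0, 0] are positive under One Is Enough but negative under majority voting, whereas the pattern-based rule decides according to how often that exact pattern co-occurred with a sexist label in the data used to build the table.</p>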
        <p>Regarding soft fusion mechanisms, we explored two fusion strategies:
1. Logistic Regression Fusion: A meta-model is trained to map the predicted probabilities from
each channel to a final soft label. The model is optimized to approximate the ground truth
probability distribution.
2. Weighted Sum of Predictions: Fixed weights are assigned to each channel’s output. These
weights are optimized using the SLSQP algorithm [38], minimizing the cross-entropy loss between
the weighted prediction and the soft ground truth.</p>
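        <p>As a rough illustration of the weighted-sum strategy, the sketch below combines per-channel probabilities with fixed weights and selects the weights that minimize the cross-entropy against soft targets. Two simplifications are assumed here and are not claims about our setup: the weights are constrained to be non-negative and to sum to one, and a coarse grid search replaces the SLSQP optimizer so that the example needs only the standard library.</p>
        <preformat>
```python
import math
from itertools import product


def weighted_sum(probs, weights):
    """Weighted combination of per-channel probabilities for one video."""
    return sum(w * p for w, p in zip(weights, probs))


def soft_cross_entropy(pred, target, eps=1e-12):
    """Binary cross-entropy against a soft target in [0, 1]."""
    pred = min(max(pred, eps), 1 - eps)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))


def fit_weights(channel_probs, soft_targets, steps=10):
    """Coarse grid search over non-negative weights that sum to one.

    This stands in for a constrained optimiser such as SLSQP.
    """
    n_channels = len(channel_probs[0])
    best_w, best_loss = None, float("inf")
    for combo in product(range(steps + 1), repeat=n_channels):
        if sum(combo) != steps:
            continue
        w = [c / steps for c in combo]
        loss = sum(soft_cross_entropy(weighted_sum(p, w), t)
                   for p, t in zip(channel_probs, soft_targets)) / len(soft_targets)
        if best_loss > loss:
            best_w, best_loss = w, loss
    return best_w, best_loss
```
        </preformat>
        <p>When one channel already matches the soft targets, the search assigns it all the weight, which mirrors the situation in which a single strong channel dominates the fused prediction. In practice, a constrained optimizer such as the SLSQP method offered by scipy.optimize.minimize would replace the grid.</p>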
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Table 3 shows the best estimated performance achieved for each individual channel using 5-fold
cross-validation. All individual channels perform better than the baselines. The results for the
audio and video channels show only small differences between them. The textual channel, however,
clearly outperforms the others, as the three best-performing models are all based on text. The transcription
and embedded text channels also show very similar performance, while the original text channel stands
out as the best-performing channel overall.</p>
      <p>These findings reveal that the textual channel is the best choice for prediction, either due to its own
information content or because the available pre-trained models for this channel are better. Nevertheless,
all channels contribute meaningful information, as each surpasses the baseline performance. The baselines
are the naive all-negative classification (majority class) and all-positive classification (minority class).</p>
      <p>Tables 4 and 5 show the performance of the runs submitted for hard and soft evaluation,
respectively. Probabilistic OIQ, considering all possible channels, achieves the best estimated performance for
hard evaluation. Even so, the differences with the other two fusion mechanisms (Majority Voting and OIE) are
very small, which indicates that the algorithms behave very similarly. Something similar happens with
soft evaluation, where Logistic Regression fusion with all possible channels obtains the best estimated
performance but, again, the results of the other methods are very close to it.</p>
      <p>Focusing on hard evaluation, which was the main emphasis of this work, we observed that the results
of the different fusion mechanisms were very close to those of the original-text-specific model. In short, we
found that the decisions made by the fusion mechanisms were mainly based on original-text-specific
information, frequently ignoring the predictions from the audio and video channels. A clear example of
this behaviour can be seen in Figure 3, where the OIE fusion mechanism relies almost exclusively on
the original-text-specific model for decision-making. This happens because, during threshold adjustment,
the audio and video thresholds are set so high that few, if any, examples exceed them, effectively excluding
these channels from the final decision.</p>
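      <p>One simple way to quantify this effect is to count, for each channel, how often its score exceeds its calibrated threshold and how often it is the only channel doing so. The sketch below performs this count; the function name, the channel names and any scores or thresholds passed to it are hypothetical and serve only to illustrate the analysis.</p>
      <preformat>
```python
def oie_trigger_counts(scores_by_channel, thresholds):
    """Count, per channel, how often it fires and how often it fires alone."""
    fired = {name: 0 for name in thresholds}
    sole = {name: 0 for name in thresholds}
    n_videos = len(next(iter(scores_by_channel.values())))
    for i in range(n_videos):
        # Channels whose score reaches their own threshold for video i.
        active = [name for name, t in thresholds.items()
                  if scores_by_channel[name][i] >= t]
        for name in active:
            fired[name] += 1
        if len(active) == 1:
            sole[active[0]] += 1
    return fired, sole
```
      </preformat>
      <p>If the audio and video thresholds are close to one, their counts stay near zero and every positive decision of an OIE fusion is attributable to the text channel, which is the pattern described above.</p>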
      <sec id="sec-4-1">
        <title>4.1. Rankings</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusions</title>
      <p>
        This work presented the ECA-SIMM-UVa team’s participation in Task 7 of the EXIST 2025 challenge,
which focuses on sexism identification in TikTok videos. The overarching goal of the EXIST challenge
is to develop automatic detection systems for harmful behaviours like sexism on social platforms, a
crucial task given the vast amount of content uploaded. Our approach involved a segmentation-based
strategy, splitting TikTok videos into textual, audio, and video channels, based on the hypothesis that
sexism can manifest through spoken words, embedded text, speaker’s tone, and/or visual content. A
significant finding from our experiments is the clear performance superiority of the textual channel as
compared to audio and video channels. The original text channel, which combines textual transcription
and embedded text provided with the dataset, outperformed all other individual channels, including
automatically extracted transcriptions and embedded text. This highlights that while all channels
contribute meaningful information and individually surpass baseline performance, the textual content
appears to be the most informative modality for sexism detection in this context, possibly due to better
available models or the inherent nature of the textual information. The same conclusion was reached in
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], which shows that there has not been great progress in terms of the development of multimodal
approaches in the last year. This directly answers one of our research questions, confirming a notable
performance gap between pre-trained text, audio, and video models, with text being significantly
stronger.
      </p>
      <p>Regarding the fusion mechanisms, for hard evaluation, the Probabilistic OIQ method, considering
all possible channels, yielded the best estimated performance, though only marginally better than
Majority Voting and One Is Enough (OIE). Similarly, for soft evaluation, Logistic Regression fusion
with all channels showed the best performance. However, the final results did not follow the same order as
our estimates, which suggests that some bias was introduced into the estimation during the
training phase. We also found that the performance of the fusion mechanisms was very close to that of the
original-text-specific model. This shows that the fusion mechanisms frequently relied almost
exclusively on the original-text-specific model’s decisions, effectively ignoring predictions from the
audio and video channels. This behaviour implicitly addresses our second research question, indicating
that the textual channel carried disproportionate weight in the decision-making process
of the fusion models, rather than all three channels contributing equally.</p>
      <p>Despite these observations regarding channel contributions within the fusion mechanisms, our team
achieved remarkable results in the official EXIST 2025 challenge. Our competition ranking results
demonstrate the effectiveness of our segmentation-based approach and of the fine-tuning of specialized
deep learning models.</p>
      <p>The findings underscore the challenges in multimodal sexism detection, particularly the comparatively
underdeveloped state of research on automated detection of sexism in video, due to data collection difficulties and
high resource requirements. While our textual models leveraged robust pre-trained architectures like
XLM-RoBERTa-Large, the performance and integration issues with the audio (Wav2Vec-Large-XLSR-53)
and video (ViViT-b-16x2-Kinetics400) models suggest areas for future improvement. Future work should
focus on improving audio and video models under a mutual reinforcement learning strategy. Such
models are crucial for advancing this field, especially as platforms like TikTok continue to be significant
vectors for potentially harmful content.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was carried out in the Project PID2021-126315OB-I00 that was supported by MCIN / AEI
/ 10.13039/501100011033 / FEDER, EU. Also, this work is partially funded by the Spanish Ministry
of Science, Innovation and Universities (project FairTransNLP PID2021-124361OB-C32) funded by
MCIN/AEI/10.13039/501100011033.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used NotebookLM in order to: Drafting content and
Abstract drafting. Further, the author(s) used GPT-4 in order to: Improve writing style and Grammar
and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.
</p>
      <p>[23] H. Wang, T. R. Yang, U. Naseem, R. K.-W. Lee, Multihateclip: A multilingual benchmark dataset
for hateful video detection on youtube and bilibili, in: Proceedings of the 32nd ACM International
Conference on Multimedia, 2024, pp. 7493–7502.</p>
      <p>[24] F. T. Boishakhi, P. C. Shill, M. G. R. Alam, Multi-modal hate speech detection using machine
learning, in: 2021 IEEE International Conference on Big Data (Big Data), 2021, pp. 4496–4499.
doi:10.1109/BigData52589.2021.9671955.</p>
      <p>[25] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, D. Testuggine, The hateful
memes challenge: Detecting hate speech in multimodal memes, Advances in neural information
processing systems 33 (2020) 2611–2624.</p>
      <p>[26] R. Velioglu, J. Rose, Detecting hate speech in memes using multimodal deep learning approaches:
Prize-winning solution to hateful memes challenge, arXiv preprint arXiv:2012.12975 (2020).</p>
      <p>[27] J. Lu, D. Batra, D. Parikh, S. Lee, Vilbert: Pretraining task-agnostic visiolinguistic representations
for vision-and-language tasks, Advances in neural information processing systems 32 (2019).</p>
      <p>[28] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, K.-W. Chang, Visualbert: A simple and performant baseline
for vision and language, arXiv preprint arXiv:1908.03557 (2019).</p>
      <p>[29] J. Lu, C. Clark, R. Zellers, R. Mottaghi, A. Kembhavi, Unified-io: A unified model for vision,
language, and multi-modal tasks, arXiv preprint arXiv:2206.08916 (2022).</p>
      <p>[30] Y. Li, J. Duan, Z. Qu, Tri-robust learning: Robust multi-neural networks against extremely noisy
labels, Available at SSRN 4911734 (????).</p>
      <p>[31] A. Shrotriya, A. K. Sharma, A. K. Bairwa, R. Manoj, Hybrid ensemble learning with cnn and rnn
for multimodal cotton plant disease detection, IEEE Access (2024).</p>
      <p>[32] M. Bain, J. Huh, T. Han, A. Zisserman, Whisperx: Time-accurate speech transcription of long-form
audio, arXiv preprint arXiv:2303.00747 (2023).</p>
      <p>[33] H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang, et al., Deepseek-vl:
towards real-world vision-language understanding, arXiv preprint arXiv:2403.05525 (2024).</p>
      <p>[34] E. Amigó, F. Giner, J. Gonzalo, F. Verdejo, On the foundations of similarity in information access,
Information Retrieval Journal 23 (2020) 216–254.</p>
      <p>[35] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv
preprint arXiv:1911.02116 (2019).</p>
      <p>[36] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, M. Auli, Unsupervised cross-lingual
representation learning for speech recognition, arXiv preprint arXiv:2006.13979 (2020).</p>
      <p>[37] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, C. Schmid, Vivit: A video vision transformer,
in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 6836–6846.</p>
      <p>[38] D. Kraft, A Software Package for Sequential Quadratic Programming, Technical Report DFVLR-FB
88-28, Institut für Dynamik der Flugsysteme, Deutsche Forschungs- und Versuchsanstalt für
Luft- und Raumfahrt (DLR), Cologne, Germany, 1988.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          , I. Arcos,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <article-title>Exist 2025: Learning with disagreement for sexism identification and characterization in tweets, memes, and tiktok videos</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>442</fpage>
          -
          <lpage>449</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          J. C. de Albornoz, I. Arcos,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <article-title>Overview of exist 2025: Learning with disagreement for sexism identification and characterization in tweets, memes, and tiktok videos</article-title>
          , in: J. C. de Albornoz, J. Gonzalo, L. Plaza, A. G. S. de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025)</source>
          , Lecture Notes in Computer Science, Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          J. C. de Albornoz, I. Arcos,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <article-title>Overview of exist 2025: Learning with disagreement for sexism identification and characterization in tweets, memes, and tiktok videos (extended overview)</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodríguez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          , L. Plaza,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Comet</surname>
          </string-name>
          , T. Donoso,
          <article-title>Overview of exist 2021: sexism identification in social networks</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <fpage>195</fpage>
          -
          <lpage>207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodríguez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mendieta-Aragón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marco-Remón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Makeienko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of exist 2022: sexism identification in social networks</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>69</volume>
          (
          <year>2022</year>
          )
          <fpage>229</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of exist 2023-learning with disagreement for sexism identification and characterization</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>316</fpage>
          -
          <lpage>342</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <article-title>Overview of exist 2024-learning with disagreement for sexism identification and characterization in tweets and memes</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kirk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vidgen</surname>
          </string-name>
          , P. Röttger,
          <article-title>SemEval-2023 task 10: Explainable detection of online sexism</article-title>
          , in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.),
          <source>Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</source>
          , Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>2193</fpage>
          -
          <lpage>2210</lpage>
          . URL: https://aclanthology.org/2023.semeval-1.305/. doi:10.18653/v1/2023.semeval-1.305.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodríguez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          , L. Plaza,
          <article-title>Automatic classification of sexism in social networks: An empirical study on twitter data</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>219563</fpage>
          -
          <lpage>219576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chiril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Moriceau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Benamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mari</surname>
          </string-name>
          , G. Origgi,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coulomb-Gully</surname>
          </string-name>
          ,
          <article-title>An annotated corpus for sexism detection in french tweets</article-title>
          ,
          in:
          <source>12th Conference on Language Resources and Evaluation (LREC 2020)</source>
          , ELRA: European Language Resources Association
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Karimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prati</surname>
          </string-name>
          ,
          <article-title>AEDA: An easier data augmentation technique for text classification</article-title>
          , in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2021</source>
          , Association for Computational Linguistics, Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>2748</fpage>
          -
          <lpage>2754</lpage>
          . URL: https://aclanthology.org/2021.findings-emnlp.234/. doi:10.18653/v1/2021.findings-emnlp.234.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in:
          <source>Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          , arXiv preprint arXiv:1907.11692 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , W. Chen,
          <article-title>Deberta: Decoding-enhanced bert with disentangled attention</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2006.03654. arXiv:2006.03654.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave, G. Lample,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. de las Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Lavaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Sayed</surname>
          </string-name>
          ,
          <article-title>Mistral 7b</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Grazia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pastells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Chas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Elliott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Farrús</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <article-title>Mused: A multimodal spanish dataset for sexism detection in social media videos</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2504.11169. arXiv:2504.11169.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>I.</given-names>
            <surname>Arcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Sexism identification on tiktok: a multimodal ai approach with text, audio, and video</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>wav2vec 2.0: A framework for self-supervised learning of speech representations</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , arXiv preprint arXiv:2010.11929 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
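          <string-name>
            <given-names>C. S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Bhandary</surname>
          </string-name>
          ,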
          <article-title>Detection of hate speech in videos using machine learning</article-title>
          ,
          <source>in: 2020 International Conference on Computational Science and Computational Intelligence (CSCI)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>585</fpage>
          -
          <lpage>590</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mathew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <article-title>HateMM: A multi-modal dataset for hate video classification</article-title>
          ,
          <source>in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume
          <volume>17</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>1014</fpage>
          -
          <lpage>1023</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>