<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Leveraging Model Confidence and Diversity: A Multi-Stage Framework for Sexism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anwar Alajmi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Pergola</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, College of Business Studies, Public Authority of Applied Education and Training (PAAET)</institution>
          ,
          <country country="KW">Kuwait</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Warwick</institution>
          ,
          <addr-line>Coventry CV4 7AL</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Sexism on social media often manifests in implicit, sarcastic, or context-dependent forms, making it difficult to detect reliably with existing systems. What is perceived as sexist language can at times vary across individuals, cultures, and platforms, highlighting its inherently multifaceted nature. Inspired by this diversity of human interpretation, we propose a hybrid detection framework that integrates the outputs of multiple neural language models, each encoding different perspectives on the task. Our system combines fine-tuned monolingual transformers (e.g., BERTweet for English, RoBERTuito for Spanish) with instruction-tuned large language models (LLMs) such as Claude 3 Sonnet and LLaMA3-70B-Instruct. These models are combined within a confidence-based multi-stage pipeline: high-confidence predictions from task-specialized models are preserved, while uncertain instances are routed to general-purpose LLMs for zero-shot classification. This dynamic strategy combines high-confidence predictions from specialized models with broader judgments from instruction-tuned LLMs, enabling the system to better manage expressions of sexist language across different linguistic contexts. Evaluated on the EXIST 2025 shared task, our approach ranked 2nd on the Spanish test set, 4th overall, and 9th on English, demonstrating the effectiveness of confidence-guided ensemble learning in multilingual sexism detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Sexism Detection</kwd>
        <kwd>Tweets Classification</kwd>
        <kwd>Multilingual Models</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Ensemble</kwd>
        <kwd>EXIST 2025</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Sexism can be described as gender-based stereotyping or discrimination. Sexist language can appear in
both explicit and implicit forms, ranging from overtly hateful comments to subtle sexist jokes. With the
widespread use of social media, such expressions increasingly occur online, posing serious psychological
and societal risks. As a result, automatic detection of sexist language has become a vital task in Natural
Language Processing (NLP), aimed at fostering safer and more inclusive digital spaces. Despite notable
progress in sexism detection using deep learning methods, several core challenges remain. Sexist content
can be highly implicit, relying on contextual cues, sarcasm, or coded language, which makes it difficult
for models to detect reliably. Moreover, due to differences in cultural, social, and individual norms, the
interpretation of what constitutes sexist content can vary widely. These challenges underscore the need
for detection systems that are not only accurate but also robust to linguistic variation and ambiguity.</p>
      <p>
        To address these issues, this study proposes a hybrid sexism detection framework that integrates
the outputs of diverse NLP models. Different neural language models, whether simply pre-trained,
instruction-tuned, or trained with reinforcement learning from human feedback (RLHF), may already
encode different assumptions and interpretations about what constitutes sexist language. These
differences can arise from the datasets and training objectives used during their development, which
implicitly shape the models’ sensitivity to linguistic nuances and social cues. By aggregating the outputs
of such diverse systems, our approach seeks to synthesize these pre-encoded interpretive frames to
enhance detection robustness. Therefore, rather than relying on a single classifier, we introduce a
new approach that leverages a confidence-based ensemble of fine-tuned monolingual transformers
(such as BERTweet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and RoBERTuito [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) and large language models (LLMs) including
Llama-3-70B-Instruct [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Claude 3 Sonnet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We posit that each model offers a distinct perspective
on the task, contributing to a more comprehensive understanding of the data, and that by combining
their high-confidence predictions, the system improves reliability and generalization. Our
main contributions are as follows:
• We introduce a robust framework for sexism detection that combines fine-tuned language-specific
models and general-purpose large language models through a confidence-based ensemble method.
• We conduct a thorough experimental evaluation, including ablation studies and multiple ensemble
combinations, to quantify the individual and combined impact of each model component.
      </p>
      <p>
        This work was conducted as part of the EXIST (sEXism Identification in Social Networks) 2025 shared
task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which evaluates systems for detecting sexist content in multilingual social media data under
realistic conditions. Our system demonstrates competitive performance on the EXIST 2025 shared task,
ranking 2nd on the Spanish test set, 4th overall, and 9th on the English portion, highlighting the value of
synthesizing multiple model perspectives for multifaceted classification tasks such as online
sexism.
      </p>
      <p>The rest of the paper is structured as follows: Section 2 reviews and summarizes related works in the
field of sexism detection. Section 3 presents the proposed methodology, while Section 4 discusses the
experimental results. Finally, the conclusion and future work are presented in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Hateful and toxic speech detection has been an ongoing challenge in NLP. This area has received
considerable attention from machine learning (ML) researchers, with early work employing traditional
supervised ML techniques [
        <xref ref-type="bibr" rid="ref10 ref11 ref6 ref7 ref8 ref9">6, 7, 8, 9, 10, 11, 12, 13</xref>
        ]. More recent approaches have leveraged Transformer
models [14], such as BERT, which has significantly improved toxicity classification performance by
enabling deeper contextual understanding as demonstrated by the works in [15, 16, 17, 18, 19, 20, 21].
More recently, several methodologies have focused on employing LLMs, and in particular prompt-based
approaches, for various hate speech detection tasks, including sexism classification [22, 23]. For example,
the study by [24] demonstrates that zero-shot sexism detection can be enhanced through expert-guided
prompting, even when experts are not familiar with LLMs. Their findings emphasize the advantage of
combining human knowledge with model capabilities to detect subtle sexist language.
      </p>
      <p>An ongoing challenge in the detection of sexism and other forms of hateful speech is the presence of
annotation disagreements, which often originate from social and cultural biases [25]. The EXIST shared
task aims to tackle this issue by promoting systems capable of accurately identifying sexist content
regardless of inconsistent labeling. Submissions to EXIST 2024 included a range of approaches, such as
the utilization of BERT-based models (DistilBERT [26], DeBERTa [27], and RoBERTa [28]), LLMs [29],
ensemble methods [30] and knowledge-based techniques [31]. One of the best performing systems was
proposed by [32], where the authors used data augmentation to train DeBERTa with hard parameter
sharing [33]. Their approach also incorporated annotators' information and Round to Closed Value [34].</p>
      <p>This work leverages a variety of classification systems to emulate annotators and model their
disagreements. Moreover, it proposes a combination of monolingual and multilingual NLP models
within a confidence-based ensemble technique to produce robust and accurate final predictions.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        In this section, we describe the classification framework in detail. The system follows a multi-stage
ensemble pipeline that combines predictions from fine-tuned discriminative transformers and
instruction-following large language models (LLMs). Each input tweet is first processed by one or more fine-tuned
transformer models. These include language-specific models, such as BERTweet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for English,
RoBERTuito [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for Spanish, as well as a multilingual model, LLaMA-3.2-1B [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], fine-tuned jointly on both
languages. If the model’s confidence in its prediction exceeds a predefined threshold, the prediction is
accepted directly and included in the final output. If confidence falls below this threshold, the input is
rerouted to a secondary stage involving zero-shot classification by general-purpose LLMs (Claude 3
Sonnet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Llama-3-70B-Instruct [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). This dynamic routing mechanism allows the system to retain
high-confidence transformer outputs while deferring uncertain cases to more general, instruction-tuned
models. As shown in Figure 1, the final predictions consist of both retained transformer outputs and
LLM-based decisions, combined to enhance robustness and coverage across a wide range of linguistic
inputs.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Confidence-based Routing and Decision Integration</title>
        <p>The central mechanism driving our ensemble strategy is the use of model confidence to control routing
and the final predictions. Rather than aggregating predictions uniformly, we adopt a selective approach
that promotes reliable outputs from specialised models, while using LLMs as fallback experts in cases of
uncertainty.</p>
        <p>For each fine-tuned model, we compute confidence scores based on the logit margin: the difference
between the top two predicted class logits. A threshold τ is determined individually for each model
using its training set distribution of logit margins. In our experiments, the threshold typically falls
between the 25th and 35th percentiles, providing a conservative cutoff for what constitutes a confident
prediction. Predictions with logit margins above this threshold are retained as-is. Instances that fall
below the threshold are rerouted to the LLM layer. These tweets are classified in a zero-shot manner
using both Claude 3 Sonnet and Llama-3-70B-Instruct, prompted with the following instruction: “You
are an accurate sexism classification system. Classify the given tweets as
sexist (YES) or not sexist (NO).”. The final decision for each instance is determined as
follows: if both LLMs return the same label, that label is used. If they disagree, we select the output
of Claude 3 Sonnet, which demonstrated higher reliability in our preliminary experiments on the
development set.</p>
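<p>The routing and decision rule above can be sketched as follows. This is an illustrative sketch only: the percentile choice, function names, and the way LLM labels are obtained are stand-ins, not the authors' exact implementation.</p>

```python
# Sketch of confidence-based routing: keep high-margin fine-tuned
# predictions, defer low-margin ones to two zero-shot LLMs.
from statistics import quantiles

def logit_margin(logits):
    """Difference between the top two predicted class logits."""
    top2 = sorted(logits, reverse=True)[:2]
    return top2[0] - top2[1]

def calibrate_threshold(train_margins, percentile=30):
    """Pick the threshold from the training-set margin distribution
    (the paper reports cutoffs around the 25th-35th percentile)."""
    # quantiles(..., n=100) returns the 1st..99th percentiles
    return quantiles(train_margins, n=100)[percentile - 1]

def route(tweet, ft_logits, threshold, claude_label, llama_label):
    """Stage 1: retain the fine-tuned prediction if confident.
    Stage 2: otherwise use the two LLM labels; if they agree, use the
    shared label, and on disagreement the paper prefers Claude 3 Sonnet,
    so its label decides either way."""
    if logit_margin(ft_logits) >= threshold:
        return ft_logits.index(max(ft_logits))  # retained prediction
    return claude_label
```

<p>Calibrating the threshold per model from its own training-margin distribution keeps the deferral rate comparable across models whose logits are scaled differently.</p>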
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results &amp; Analysis</title>
      <p>This section discusses the dataset, the experimental setup, the evaluation metrics, and the results of the
proposed methods.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The EXIST 2025 Tweets dataset was used to classify sexist content. It contains 10,034 samples labeled
’YES’ or ’NO’ in both English and Spanish, distributed across training, development, and test sets, as
shown in Table 1.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Baselines</title>
        <p>We compare our proposed approach against several strong baselines commonly used in multilingual
and monolingual text classification tasks:
• XLM-RoBERTa-large: A multilingual transformer model introduced by [35], that is pre-trained
on 100 languages and acts as a strong fine-tuned baseline for multilingual text classification tasks.
• LLaMA3.2-1B: A 1B-parameter model that is fine-tuned for sexism classification of tweets.
• Merged Monolingual Transformers (MMT): This setup combines BERTweet-large solely
fine-tuned on English data, and RoBERTuito-base exclusively trained on Spanish samples for the
detection of sexist tweets.</p>
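<p>The MMT baseline amounts to a simple language dispatch; a minimal sketch, where the predict callables stand in for the fine-tuned BERTweet-large and RoBERTuito-base classifiers:</p>

```python
# Illustrative sketch of the Merged Monolingual Transformers (MMT) setup:
# each tweet is routed to the classifier fine-tuned on its language.
def mmt_predict(tweet, lang, predict_en, predict_es):
    if lang == "en":
        return predict_en(tweet)   # BERTweet-large, fine-tuned on English
    if lang == "es":
        return predict_es(tweet)   # RoBERTuito-base, fine-tuned on Spanish
    raise ValueError(f"unsupported language: {lang}")
```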
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Setting</title>
        <p>
          The baseline models are fine-tuned on the entire training set of 6,920 samples. The
hyperparameter settings were 0.1 for warmup_ratio, 4 for both per_device_train_batch_size and
per_device_eval_batch_size, 3 for num_train_epochs, and 2e-5 for the learning rate. The
pretrained versions of the baseline models were obtained from Hugging Face [36]. For LLM-related
experiments, LangChain [37] and Ollama [38] were used to download and run the models locally.
Hyperparameter optimization was conducted during the training phase using Optuna [39], a framework for
efficiently searching optimal training configurations for each model. We additionally employed a data
augmentation strategy, where Claude 3.7 Sonnet [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] was used to generate additional balanced synthetic
examples in English and Spanish. Using the data augmentation prompt: "Generate a paraphrased
(augmented) version of the tweet that preserves its original meaning, tone,
and any sexist or non-sexist elements", additional 1,000 English and 1,000 Spanish tweets
(for a total of 2,000 augmented examples) were appended to the original training dataset.
        </p>
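<p>The augmentation step above can be sketched as follows. The prompt string is the one reported in the text; the helper that pairs it with each tweet is our assumption, not the paper's exact API call to Claude 3.7 Sonnet.</p>

```python
# Sketch of the data-augmentation request for one tweet.
# AUGMENT_PROMPT is quoted from the paper; build_augmentation_request
# is a hypothetical helper illustrating how prompt and tweet combine.
AUGMENT_PROMPT = (
    "Generate a paraphrased (augmented) version of the tweet that preserves "
    "its original meaning, tone, and any sexist or non-sexist elements"
)

def build_augmentation_request(tweet: str) -> str:
    """Compose the instruction sent to the LLM for a single tweet."""
    return f"{AUGMENT_PROMPT}\n\nTweet: {tweet}"
```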
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Evaluation Metrics</title>
        <p>The following are the official metrics used to evaluate the system under the EXIST 2025 hard–hard
evaluation scheme:
1. ICM: The Information Contrast Measure (ICM) [40], shown in Equation 1, measures the alignment
between two sets of predictions, A and B.</p>
        <p>ICM(A, B) = α₁ IC(A) + α₂ IC(B) − β IC(A ∪ B) (1)
2. ICM-Norm: A normalized ICM score that is rescaled to the [0, 1] range.
3. F1: The harmonic mean of precision and recall, as shown in Equation 2.</p>
        <p>F1 = 2 · (Precision · Recall) / (Precision + Recall) (2)</p>
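<p>A small worked example of the F1 metric, computed from hard YES/NO predictions against gold labels (the function name and label convention are illustrative):</p>

```python
# F1 as the harmonic mean of precision and recall over binary labels.
def f1_score(gold, pred, positive="YES"):
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```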
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Results</title>
        <sec id="sec-4-5-1">
          <title>4.5.1. Development Set</title>
        </sec>
        <sec id="sec-4-5-2">
          <title>4.5.2. Test Set (Leaderboard)</title>
          <p>The three submitted runs for the EXIST 2025 competition Subtask 1.1 (hard setting) were as follows:
• warwick_1: Represents the full confidence-based ensembling approach combining high-confidence
predictions from Claude 3 Sonnet and LLaMA-3-70B-Instruct with the predictions of MMT and
LLaMA-3-1B (Claude 3 Sonnet and LLaMA-3-70B-Instruct (s) + MMT and LLaMA-3-1B (w)).
• warwick_2: Combines the high-confidence predictions of MMT with fallback predictions from
Claude 3 Sonnet (MMT (s) + Claude 3 Sonnet (w)).
• warwick_3: Uses confident predictions from LLaMA-3-1B and replaces low-confidence outputs
with those from Claude 3 Sonnet (LLaMA-3-1B (s) + Claude 3 Sonnet (w)).</p>
          <p>Across all instances (English and Spanish), warwick_1 ranked 4th, achieving an ICM score of 0.6249, a
normalized ICM of 0.8141, and an F1 score of 0.7991. warwick_2 ranked 7th (ICM: 0.5834, ICM Norm:
0.7932, F1: 0.7892), while warwick_3 ranked 12th out of 160 submissions, scoring 0.5793 on ICM, 0.7912
on ICM Norm, and 0.7888 on F1.</p>
          <p>For Spanish tweets, warwick_1 ranked 2nd, with an ICM of 0.6441. warwick_2 followed in 5th, scoring
0.6312 on ICM. warwick_3 ranked 6th, with ICM: 0.6126.</p>
          <p>[Table: ICM Norm scores for the baseline models (XLM-RoBERTa-large, Llama3.2-1B, MMT), the zero-shot LLMs (LLaMA-3-70B-Instruct, Claude 3 Sonnet), and the confidence-based ensemble combinations, e.g., MMT (s) + Claude 3 Sonnet (w); numeric scores not preserved in the source.]</p>
          <p>In contrast, performance on English samples was lower, with warwick_1 ranking 9th, warwick_2
ranking 55th, and warwick_3 ranking 43rd. Nevertheless, the top-performing submissions on the English
test set achieved lower scores compared to their Spanish counterparts. This performance gap may be
influenced by several factors. One possibility is the proportion and frequency of forms of sarcasm,
humor, or subtle implication, which can be harder to detect. Another potential factor is variation in
annotation consistency across languages, though further analysis would be needed to confirm this.</p>
          <p>The leaderboard results demonstrate the effectiveness of confidence-based ensembling,
particularly when integrating predictions from both LLMs and fine-tuned models. Among the submissions,
warwick_1, which employs the full confidence-based ensemble, consistently outperforms the other
approaches. These findings also highlight the advantages of fine-tuning monolingual models separately
for each language. Specifically, the ensemble approach using BERTweet and RoBERTuito in the MMT
setup (warwick_2) outperforms the LLaMA3.2-1B model fine-tuned on both languages ( warwick_3) on
the Spanish test set (ranked 5th) and across all instances (ranked 7th).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work introduced a multi-stage, confidence-based ensemble framework for sexism detection in
multilingual social media data. The proposed pipeline dynamically integrates the outputs of fine-tuned
monolingual and multilingual transformer models with zero-shot predictions from instruction-tuned
LLMs, using model confidence to guide the decision process. This mechanism enabled the system
to balance precision and generalization by leveraging each model type where it is most reliable.
Through extensive evaluation on the EXIST 2025 shared task, our system demonstrated competitive
performance, ranking 2nd on the Spanish test set and 4th overall. The experimental results confirm that
the confidence-based routing not only enhances robustness to ambiguous or borderline cases, but also
improves performance over individual models and naïve ensembles. A promising direction for future
research is exploring additional prompting techniques, for example, based on multi-expert prompting
[41] to model annotators as experts and improve the reliability of the predictions.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Association for Computational Linguistics, Online, 2021, pp. 2870–2883.
[12] A. S. Saksesi, M. Nasrun, C. Setianingsih, Analysis text of hate speech detection using recurrent
neural network, in: 2018 international conference on control, electronics, renewable energy and
communications (ICCEREC), IEEE, 2018, pp. 242–248.
[13] X. Tan, Y. Zhou, G. Pergola, Y. He, Cascading large language models for salient event graph
generation, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings of the 2025 Conference of
the Nations of the Americas Chapter of the Association for Computational Linguistics: Human
Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics,
Albuquerque, New Mexico, 2025, pp. 2223–2245.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,</p>
      <p>Attention is all you need, Advances in neural information processing systems 30 (2017).
[15] M. Mozafari, R. Farahbakhsh, N. Crespi, A bert-based transfer learning approach for hate speech
detection in online social media, in: International conference on complex networks and their
applications, Springer, 2019, pp. 928–940.
[16] R. T. Mutanga, N. Naicker, O. O. Olugbara, Hate speech detection in twitter using transformer
methods, International Journal of Advanced Computer Science and Applications 11 (2020).
[17] Z. Sun, G. Pergola, B. Wallace, Y. He, Leveraging ChatGPT in pharmacovigilance event extraction:
An empirical study, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the
European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers),
Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp. 344–357.
[18] F. M. Plaza-del Arco, M. D. Molina-González, L. A. Urena-López, M. T. Martín-Valdivia, Comparing
pre-trained language models for spanish hate speech detection, Expert Systems with Applications
166 (2021) 114120.
[19] C. Lyu, G. Pergola, SciGisPy: a novel metric for biomedical text simplification via gist inference
score, in: M. Shardlow, H. Saggion, F. Alva-Manchego, M. Zampieri, K. North, S. Štajner, R. Stodden
(Eds.), Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability
(TSAR 2024), Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 95–106.</p>
      <p>URL: https://aclanthology.org/2024.tsar-1.10/.
[20] Anjum, R. Katarya, Hate speech, toxicity detection in online social media: a recent survey of state
of the art and opportunities, International Journal of Information Security 23 (2024) 577–608.
[21] S. Khan, G. Pergola, A. Jhumka, Multilingual sexism identification via fusion of large language
models, in: Conference and Labs of the Evaluation Forum (CLEF 2024), 2024.
[22] S. Khan, A. Jhumka, G. Pergola, Explaining matters: Leveraging definitions and semantic
expansion for sexism detection, in: Proceedings of the 63rd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics,
2025.
[23] X. Tan, C. Lyu, H. M. Umer, S. Khan, M. Parvatham, L. Arthurs, S. Cullen, S. Wilson, A. Jhumka,
G. Pergola, SafeSpeech: A comprehensive and interactive tool for analysing sexist and abusive
language in conversations, in: N. Dziri, S. X. Ren, S. Diao (Eds.), Proceedings of the 2025 Conference
of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human
Language Technologies (System Demonstrations), Association for Computational Linguistics,
Albuquerque, New Mexico, 2025, pp. 361–382.
[24] M. Reuver, I. Sen, M. Melis, G. Lapesa, Tell me what you know about sexism: Expert-llm interaction
strategies and co-created definitions for zero-shot sexism detection, in: Findings of the Association
for Computational Linguistics: NAACL 2025, 2025, pp. 8438–8467.
[25] A. M. Davani, V. Prabhakaran, M. Diaz, Dealing with disagreements: Looking beyond the majority
vote in subjective annotations, in: ACL, 2022.
[26] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[27] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention,
arXiv preprint arXiv:2006.03654 (2020).
[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,</p>
      <p>Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[29] J. Tavarez-Rodríguez, F. Sánchez-Vega, A. Rosales-Pérez, A. P. López-Monroy, Better together: Llm
and neural classification transformers to detect sexism, Working Notes of CLEF (2024).
[30] S. Khan, G. Pergola, A. Jhumka, Multilingual sexism identification via fusion of large language
models, Working Notes of CLEF (2024).
[31] L. Plaza, J. Carrillo-de Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo,
R. Morante, D. Spina, Overview of exist 2024—learning with disagreement for sexism identification
and characterization in tweets and memes, in: International Conference of the Cross-Language
Evaluation Forum for European Languages, Springer, 2024, pp. 93–117.
[32] Y.-Z. Fang, L.-H. Lee, J.-D. Huang, Nycu-nlp at exist 2024–leveraging transformers with diverse
annotations for sexism identification in social networks, Working Notes of CLEF (2024).
[33] S. Ruder, An overview of multi-task learning in deep neural networks, arXiv preprint
arXiv:1706.05098 (2017).
[34] A. F. M. de Paula, G. Rizzi, E. Fersini, D. Spina, Ai-upv at exist 2023–sexism characterization
using large language models under the learning with disagreements regime, arXiv preprint
arXiv:2307.03385 (2023).
[35] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR
abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.
[36] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M.
Funtowicz, J. Davison, S. Shleifer, P. von Platen, Hugging face: The ai community building the future.,
https://huggingface.co, 2020. Accessed: 2025-06-02.
[37] H. Chase, Langchain: Building applications with llms through composability, https://www.</p>
      <p>langchain.com/, 2022. Accessed: 2025-06-02.
[38] O. Team, Ollama: Run and deploy large language models locally, https://ollama.com/, 2023.
Accessed: 2025-06-02.
[39] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter
optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference
on Knowledge Discovery &amp; Data Mining, KDD ’19, Association for Computing Machinery, New
York, NY, USA, 2019, p. 2623–2631. URL: https://doi.org/10.1145/3292500.3330701. doi:10.1145/
3292500.3330701.
[40] E. Amigo, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: Proceedings
of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), 2022, pp. 5809–5819.
[41] D. X. Long, D. N. Yen, A. T. Luu, K. Kawaguchi, M.-Y. Kan, N. F. Chen, Multi-expert prompting
improves reliability, safety, and usefulness of large language models, arXiv preprint arXiv:2411.00492
(2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. Q.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vu</surname>
          </string-name>
          , A.-T. Nguyen,
          <article-title>Bertweet: A pre-trained language model for english tweets</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Furman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Alemany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Luque</surname>
          </string-name>
          ,
          <article-title>Robertuito: a pre-trained language model for social media text in spanish</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>7235</fpage>
          -
          <lpage>7243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <source>The llama 3 herd of models, arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Anthropic</surname>
          </string-name>
          ,
          <article-title>Claude 3.7 sonnet: a hybrid reasoning model with adjustable “thinking” mode</article-title>
, https://www.anthropic.com/news/claude-3-7-sonnet,
<year>2025</year>
. (accessed June 2025).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
<string-name>
  <given-names>J.</given-names>
  <surname>C. de Albornoz</surname>
</string-name>
,
<string-name>
  <given-names>I.</given-names>
  <surname>Arcos</surname>
</string-name>
,
<string-name>
  <given-names>P.</given-names>
  <surname>Rosso</surname>
</string-name>
,
<string-name>
  <given-names>D.</given-names>
  <surname>Spina</surname>
</string-name>
,
<string-name>
  <given-names>E.</given-names>
  <surname>Amigó</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Gonzalo</surname>
</string-name>
,
<string-name>
  <given-names>R.</given-names>
  <surname>Morante</surname>
</string-name>
,
<article-title>Overview of EXIST 2025: learning with disagreement for sexism identification and characterization in tweets, memes, and TikTok videos</article-title>
, in:
<string-name>
  <given-names>J.</given-names>
  <surname>C. de Albornoz</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Gonzalo</surname>
</string-name>
,
<string-name>
  <given-names>L.</given-names>
  <surname>Plaza</surname>
</string-name>
,
<string-name>
  <given-names>A. G. S.</given-names>
  <surname>de Herrera</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Mothe</surname>
</string-name>
,
<string-name>
  <given-names>F.</given-names>
  <surname>Piroi</surname>
</string-name>
,
<string-name>
  <given-names>P.</given-names>
  <surname>Rosso</surname>
</string-name>
,
<string-name>
  <given-names>D.</given-names>
  <surname>Spina</surname>
</string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Faggioli</surname>
</string-name>
,
<string-name>
  <given-names>N.</given-names>
  <surname>Ferro</surname>
</string-name>
(Eds.),
<source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025)</source>
,
<year>2025</year>
.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Sutejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Lestari</surname>
          </string-name>
          ,
          <article-title>Indonesia hate speech detection using deep learning</article-title>
          ,
          <source>in: 2018 International Conference on Asian Language Processing (IALP)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gröndahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pajola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Juuti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Conti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Asokan</surname>
          </string-name>
          ,
<article-title>All you need is "love": evading hate speech detection</article-title>
          ,
          <source>in: Proceedings of the 11th ACM workshop on artificial intelligence and security</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
,
<string-name>
  <given-names>H.-R.</given-names>
  <surname>Yao</surname>
</string-name>
,
<string-name>
  <given-names>E.</given-names>
  <surname>Yang</surname>
</string-name>
,
<string-name>
  <given-names>K.</given-names>
  <surname>Russell</surname>
</string-name>
,
<string-name>
  <given-names>N.</given-names>
  <surname>Goharian</surname>
</string-name>
,
<string-name>
  <given-names>O.</given-names>
  <surname>Frieder</surname>
</string-name>
          ,
          <article-title>Hate speech detection: Challenges and solutions</article-title>
          ,
<source>PLoS ONE</source>
<volume>14</volume>
(
<year>2019</year>
)
<elocation-id>e0221152</elocation-id>
.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pergola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
<article-title>TDAM: a topic-dependent attention model for sentiment analysis</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>56</volume>
          (
          <year>2019</year>
          )
          <fpage>102084</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Albadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kurdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
<article-title>Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere</article-title>
          ,
          <source>in: 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pergola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>A disentangled adversarial neural topic model for separating opinions from plots in user reviews</article-title>
, in:
<string-name>
  <given-names>K.</given-names>
  <surname>Toutanova</surname>
</string-name>
,
<string-name>
  <given-names>A.</given-names>
  <surname>Rumshisky</surname>
</string-name>
,
<string-name>
  <given-names>L.</given-names>
  <surname>Zettlemoyer</surname>
</string-name>
,
<string-name>
  <given-names>D.</given-names>
  <surname>Hakkani-Tur</surname>
</string-name>
,
<string-name>
  <given-names>I.</given-names>
  <surname>Beltagy</surname>
</string-name>
,
<string-name>
  <given-names>S.</given-names>
  <surname>Bethard</surname>
</string-name>
,
<string-name>
  <given-names>R.</given-names>
  <surname>Cotterell</surname>
</string-name>
,
<string-name>
  <given-names>T.</given-names>
  <surname>Chakraborty</surname>
</string-name>
,
<string-name>
  <given-names>Y.</given-names>
  <surname>Zhou</surname>
</string-name>
(Eds.),
<source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>