<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual Sexism Detection through Domain Adaptation and Label-Augmented Translation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sahrish Khan</string-name>
          <email>sahrish.khan@warwick.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arshad Jhumka</string-name>
          <email>h.a.jhumka@leeds.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Pergola</string-name>
          <email>gabriele.pergola.1@warwick.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Warwick</institution>
          ,
          <addr-line>Coventry CV4 7AL</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computing, University of Leeds</institution>
          ,
          <addr-line>Leeds LS2 9JT</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Detecting sexism in online discourse presents persistent challenges for multilingual NLP systems, especially when dealing with culturally specific expressions and subtle contextual cues. In our submission to the EXIST 2025 Task 1.1, we propose a pipeline that combines label-aware translation, domain-adaptive pre-training, and ensemble learning to address these challenges. A central component of our system is a prompt-based Spanish-to-English translation step, designed to preserve the tone and task-relevant semantics of the original message, selectively incorporating label cues during training. This enables the use of high-performance monolingual models while maintaining semantic fidelity across languages. We further adapt DeBERTa-v3-Large and RoBERTa-Large using 2 million unlabeled posts from the EDOS dataset and fine-tune them individually and in a fused configuration (DTFN). Final predictions are generated via majority voting, with a tie-handling rule that improves robustness. In the experimental assessment, our system achieved 4th place in English and ranked in the top 10 across all evaluation tracks.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Sexism Detection</kwd>
        <kwd>Multilingual NLP</kwd>
        <kwd>Domain-adaptive pre-training</kwd>
        <kwd>Prompt-based translation</kwd>
        <kwd>Ensemble learning</kwd>
        <kwd>EXIST Task 1</kwd>
        <kwd>1</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sexism or discrimination based on gender continues to manifest in diverse ways across online discourse.
In social media, sexist remarks can spread rapidly and take many forms, from casual stereotyping to overt
hostility [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Detecting such content reliably across languages and platforms remains a key challenge
in natural language processing (NLP), particularly when subtle contextual cues or culturally-specific
expressions are involved. Multilingual sexism detection presents an added layer of complexity:
languagespecific expressions of bias can difer in tone, terminology, and structure. While large multilingual
models ofer a direct solution, they often struggle to balance cross-lingual generalization with semantic
precision. Our system, submitted to EXIST 2025 Task 1.1, takes a diferent approach. We combine
domain-adaptive pre-training, label-aware translation, and ensemble learning to construct a pipeline
that preserves the semantics of bias-related content while maximizing classification robustness.
      </p>
      <p>A central component of our pipeline is a prompt-based Spanish-to-English translation step. The
Large Language Model’s prompt for translation is carefully designed to preserve the tone and intent
of the original message and, when training labels are available, selectively incorporates task-relevant
cues aligned with the sexism taxonomy. This translation strategy therefore acts as augmentation
strategy and allows the models to operate fully within an English-language modeling space, taking
advantage of the wider development of monolingual neural models, while mitigating semantic drift. We
complement this approach with model-level adaptation and aggregation. Specifically, we domain-adapt</p>
      <p>
        DeBERTa-v3-Large and RoBERTa-Large using 2 million unlabeled tweets drawn from the Explainable
Detection of Online Sexism (EDOS) dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and fine-tune these models—individually and in a fused
configuration (DTFN) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Final predictions are produced via ensemble learning, using majority voting
and a tie-handling rule to increase consistency across prediction variants. Together, the label-informed
translation design and ensemble-based decision mechanism enable strong cross-lingual performance
while maintaining conceptual and architectural clarity. Our system ranked 4th in English and achieved
top-10 scores across all tracks in the EXIST 2025 shared task.
      </p>
      <p>The contributions of this work can be summarized as follows:
• We introduce a label-aware, prompt-based translation strategy that preserves semantics relevant
to sexism detection across languages.
• We combine domain-adaptive pre-training with a majority-vote ensembling for robust multilingual
classification.
• We provide detailed empirical results and analysis on the EXIST 2025 benchmark, achieving high
rankings in all evaluation tracks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Detecting sexism in online content has become a growing focus in the NLP community, driven by the
widespread use of social media and the urgent need to mitigate harmful language. As interest in the
task has grown, a series of shared tasks have emerged to benchmark progress and foster collaboration
across the research community [
        <xref ref-type="bibr" rid="ref10 ref11 ref4 ref5 ref6 ref7 ref8 ref9">4, 5, 6, 7, 8, 9, 10, 11</xref>
        ]. A key initiative in this space is the EXIST shared
task, launched in 2021 as part of IberLEF. Running annually through 2024, EXIST has established itself
as a central benchmark for multilingual sexism detection, focusing on English and Spanish social media
content. The datasets span diverse platforms such as Gab and X (formerly Twitter), reflecting real-world
linguistic variation and user behavior [12, 13, 14].
      </p>
      <p>
        Building on this foundation, other shared tasks have expanded the scope of the problem. SemEval-2022
introduced Task 5, Multimedia Automatic Misogyny Identification (MAMI) , which explored multimodal
misogyny detection in memes, combining image and text analysis [15]. This was followed by
SemEval2023’s Task 10, Explainable Detection of Online Sexism (EDOS), which emphasized explainability by
introducing fine-grained sexist categories, ranging from broad groupings to eleven specific subtypes.
EDOS also emphasized source diversity, drawing data from Reddit and Gab to challenge generalization
across platforms [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Although these tasks contributed valuable insights into modality and taxonomy expansion, the EXIST
2024 shared task returned its focus to multilingual detection, this time with greater methodological
variety and evaluation depth. In this edition, top-performing systems explored a diverse range of
strategies to address the challenges of multilingual sexism detection [16]. Several teams focused on
leveraging ensemble learning, combining outputs from transformer-based models such as DeBERTa,
RoBERTa, Mistral, and XLM-R, often enhanced through domain-adaptive pre-training [
        <xref ref-type="bibr" rid="ref3">3, 17, 16</xref>
        ].
      </p>
      <p>
        In parallel, a growing number of studies explored translation-based approaches to better manage
multilingual input. By translating all content into a single language, researchers could apply monolingual
models across languages, thus simplifying the learning process [18, 19, 20]. Recent ensemble methods
include adaptive model selection, where systems dynamically choose which models to include based
on the input and confidence weighted fusion, which assigns weights to each model’s output based on
its predicted certainty. This kind of aggregation brings complementary perspectives together, helping
capture the varied and nuanced expressions of online sexism. Together, these strategies reflect a wider
trend in the field: towards flexible layered systems that integrate translation, ensemble learning, and
model tuning to better address the complexity of online sexism detection [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">5, 17, 21, 22, 3, 4, 23, 24, 25</xref>
        ].
      </p>
      <p>Our system adopts a focused and reproducible strategy for multilingual sexism detection. We unify
Spanish and English inputs through prompt-guided translation using GPT-4o, carefully preserving tone
and domain-specific language. This allowed us to operate entirely within an English-language setup,
leveraging powerful monolingual models.</p>
      <p>Exist2025 Task 1
Labeled Tweets</p>
      <p>English
Spanish
Dual-Transformer Fusion</p>
      <p>Network</p>
      <p>EDOS
Unlabeled Dataset
2M Documents</p>
      <p>Pre-training</p>
      <p>DeBERTa-v3-Large
RoBERTa-Large</p>
      <p>Ensemble Learning</p>
      <p>Results</p>
    </sec>
    <sec id="sec-3">
      <title>3. Pipeline Overview</title>
      <p>
        To tackle multilingual sexism detection efectively, we designed a simple yet powerful pipeline built
around four key stages (Figure 1). First, we adapt transformer models to the particular style of online
messages through large-scale pre-training. Second, we unify the input space by translating all Spanish
tweets into English using GPT-4o and a task-aware prompt. Third, we fine-tune two high-performing
models—DeBERTa-v3-Large and our Dual-Transformer Fusion Network (DTFN) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], on the labeled
dataset, Finally, we apply ensemble learning using majority voting to combine the predictions from
models.
      </p>
      <p>These four stages are detailed in the following sections, covering pre-training, translation, model
design, and ensemble learning.</p>
      <sec id="sec-3-1">
        <title>3.1. Domain-Adaptive Pre-training</title>
        <p>
          As the initial step in our pipeline, we performed domain-adaptive pre-training on DeBERTa-v3-Large
and RoBERTa-Large using a large corpus of 2 million unlabeled posts from the EDOS dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which
includes user-generated content from platforms such as Gab and Reddit. These pre-trained models were
then used in the next stages of the pipeline: or as a standalone DeBERTa-v3-Large, or by combining
them, as in the case of DeBERTa-v3-Large and RoBERTa-Large combined into the Dual-Transformer
Fusion Network (DTFN), a dual-encoder architecture introduced and discussed in earlier work [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
3.2. Label-Augmented Prompt-Based ES→EN Translation
To keep the the multilingual processing pipeline simple, rather than introducing dedicated components,
we translated all Spanish tweets into English using GPT-4o [26] and a carefully designed prompt.
This allowed us to work entirely within an English-language setup, enabling the use of monolingual
transformer models, such as DeBERTa-v3-Large and DTFN.This strategy was guided by both practical
considerations, as English transformer models are typically more robust and better optimized due to
pre-training on larger, more diverse corpora.
        </p>
        <p>To ensure consistency, we applied the same translation approach and prompt uniformly across the
training, development, and test sets. For the training data, the label was selectively used to guide the
translation process, particularly to preserve task-relevant language and semantic cues associated with
sexism. However, for fair comparison and evaluation, at test time no label information was used during
translation.</p>
        <p>
          The prompt was specifically crafted to preserve the tone and intent of each tweet. If the content was
sexist, GPT-4o was guided to retain that tone and, when appropriate, use terminology aligned with the
sexism typology (e.g., misogyny, objectification, mansplaining), aligned with the EDOS [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] taxonomy
(see Figure 2).
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Model Fine-tuning</title>
        <p>Following domain-adaptive pre-training and translation, we fine-tuned DeBERTa-v3-Large and our
Dual-Transformer Fusion Network (DTFN) on the labeled dataset provided in EXIST 2025. The training
objective was to minimize the binary cross-entropy loss, formulated as:</p>
        <p>1 ∑︁ [ log(^) + (1 − ) log(1 − ^)]
ℒ = −</p>
        <p>
          =1
where  is the number of training samples,  ∈ {0, 1} is the ground-truth label for the -th sample,
and ^ ∈ [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ] is the predicted probability of the sample being classified as sexist.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Ensemble Learning with Majority Voting</title>
        <p>
          To further improve the consistency and variety of predictions, we employed a majority voting ensemble
over the best fine-tuned models. Specifically, we combined outputs from DeBERTa-v3-Large and DTFN.
As in our prior work [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we adopted the same tie-handling rule, defaulting to the “sexist” label in
ambiguous cases.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Assessment</title>
      <sec id="sec-4-1">
        <title>Dataset</title>
        <p>The experimental assessment was conducted on the EXIST 2025 Tweets Dataset, designed for Task 1.1
on sexism detection. The dataset comprises over 10,000 tweets in English and Spanish, each annotated
for binary classification, labeling posts as either sexist ("YES") or not sexist ("NO").</p>
        <p>
          The dataset is split into three parts: training, development, and test sets. All models were trained
using gold-standard (hard) labels from both English and Spanish subsets in the training. The full
distribution of the dataset is summarized in Table 1.
In the following, we briefly describe the baselines evaluated:
• RoBERTa-Large [27]: An optimized variant of BERT that removes the Next Sentence Prediction
(NSP) objective and leverages larger training corpora and longer training durations.
• DeBERTa-V3-Large [28]: An advancement over BERT and RoBERTa that introduces a
disentangled attention mechanism and an enhanced masked language modeling objective, with strong
performance in low-resource or noisy data settings.
• DTFN (Dual-Transformer Fusion Network) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]: Our previously proposed architecture that
integrates representations from both RoBERTa-Large and DeBERTa-V3-Large using a
dualtransformer fusion strategy.
        </p>
        <p>We adopted the publicly available pre-trained versions of all models from the HuggingFace Transformers
library1, and further performed domain-adaptive pre-training on 2 million unlabeled samples from the
EDOS dataset. These adapted models were then fine-tuned on the EXIST 2025 Task 1.1 dataset under
both baseline and translation-based setups.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Parameter Settings</title>
        <p>For each model used in our experiments, we identified a set of optimal hyperparameters through
preliminary testing. The selected hyperparameters include the number of training epochs, learning rate
( ), batch size, and weight decay ( ). Table 2 presents the optimal hyperparameters for each model.
In Task 1.1 of EXIST 2025, system performance was evaluated using three oficial metrics: ICM-Hard,
ICM-Hard Norm, and F1-score. The ICM metric, introduced by Amigo and Delgado [29], is grounded in
information theory and quantifies the similarity between a system’s predictions and the gold-standard
labels. To account for class imbalance in the dataset, the organizers also reported a normalized variant,
ICM-Hard Norm, enabling fairer comparisons across systems. For all three metrics, higher scores
indicate better alignment with the reference annotations, with ICM-Hard Norm ofering particular
robustness in uneven label distributions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental Results on the Development set</title>
        <p>We evaluated both individual model variants and ensemble configurations on the development set. As
shown in Table 3, our experiments involved DeBERTa-v3-Large and DTFN, each domain-adaptively
pre-trained on 2 million unlabeled tweets from the EDOS dataset (for 7 or 9 epochs) and subsequently
ifne-tuned for 30 epochs on the labeled EXIST 2025 training set. These evaluations were conducted
under two setups: the baseline setting, which used the original bilingual data (English and Spanish
tweets), and the translation-based setting, where all Spanish tweets were first translated into English
using our GPT-4o prompt.</p>
        <p>In the baseline setup (Index 1–4), models were fine-tuned on both language input. DTFN slightly
outperformed DeBERTa-v3-Large at 9 epochs (ICM-Hard: 0.6019 vs. 0.6012), while F1 scores remained
closely aligned. Ensembling these models resulted in a noticeable improvement (ICM-Hard: 0.6533, F1:
0.8800), illustrating the value of combining model representations even in a multilingual context. The
translation-based setup (Index 6–9) led to further performance gains. Notably, DTFN—pre-trained for
7 epochs and fine-tuned on translated Spanish tweets achieved strong standalone results (ICM-Hard:
0.6214, F1: 0.8736; see Index 8). The best overall scores were obtained by ensembling all translated
variants via majority voting (Index 10), reaching an ICM-Hard of 0.6864 and an F1 score of 0.8950.
These findings highlight the efectiveness of our lightweight multilingual strategy, achieving strong
performance without resorting to complex multilingual systems.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results of the Oficial Leaderboard</title>
        <p>Table 4 presents our oficial results on the EXIST 2025 Task 1.1 test set, as submitted under the team
name EquityExplorer-2.0. We submitted two system variants for each language-specific and multilingual
evaluation track.</p>
        <p>In Run 1, we used our translation-based setup: all Spanish tweets were first translated into English
using our GPT-4o prompt, and models were trained on this unified English dataset. In Run 2, we
submitted results from our baseline configuration, where models were trained on the original bilingual
data (English + Spanish).</p>
        <p>Across both runs, our system achieved strong, consistent performance. In the English-only track,
Run 1 ranked 4th out of 158 teams with an ICM-Hard score of 0.6018 and an F1 score of 0.7685. Even
in Run 2, without translation, our model remained in the top 10 (8th/158), demonstrating the strength
of our pre-training and ensemble components. In the Spanish track, performance was slightly lower
but still competitive. Run 1 achieved an ICM-Hard of 0.5490 and ranked 24th/153, while Run 2 slightly
improved in F1 score (0.7967) and ranked 21th.</p>
        <p>For the combined multilingual track, our translation-based pipeline (Run 1) outperformed the baseline,
placing 10th out of 160 teams with an ICM-Hard of 0.5806 and an F1 score of 0.7837. The baseline
configuration (Run 2), which trained directly on English and Spanish tweets without translation, also</p>
        <p>Evaluation Language Team (EquityExplorer-2.0) ICM-Hard ICM-Hard Norm
performed competitively—ranking 13th with an ICM-Hard of 0.5779 and F1 score of 0.7827. This small
performance gap further reinforces the efectiveness of our prompt-based ES →EN translation strategy
in multilingual settings.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we introduced a compact yet efective system for multilingual sexism detection in the
EXIST 2025 shared task 1.1. Our approach combines domain adaptive pre-training, prompt-based
GPT-4o translation, and lightweight ensemble learning, all designed to maximize performance while
keeping the pipeline simple and reproducible.</p>
      <p>By translating Spanish tweets into English using a carefully crafted prompt, we were able to work
within an English-only setup and apply monolingual transformers efectively. We fine-tuned
DeBERTav3-Large and our previously proposed DTFN, and combined their predictions through majority voting
with a simple tie-handling rule.</p>
      <p>This minimal design achieved competitive results across all language tracks, including 4th place in
English and top-10 rankings overall. Our findings reinforce that thoughtful reuse of existing components
combined with strategic translation and fine-tuning can match or even exceed the performance of more
complex architectures in real world tasks like online sexism detection.</p>
      <p>In future work, we aim to explore more adaptive translation strategies, and investigate the model’s
behavior on emerging forms of online bias.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>Sahrish Khan, Gabriele Pergola, and Arshad Jhumka, were partially supported by the Police Science,
Technology, Analysis, and Research (STAR) Fund 2022–23 and 2025-26, funded by the National Police
Chiefs’ Council (NPCC), in collaboration with the Forensic Capability Network (FCN). Gabriele Pergola
was partially supported by the ESRC-funded project Digitising Identity: Navigating the Digital
Immigration System and Migrant Experiences, as part of the Digital Good Network. This work was conducted
on the Sulis Tier-2 HPC platform hosted by the Scientific Computing Research Technology Platform
at the University of Warwick. Sulis is funded by EPSRC Grant EP/T022108/1 and the HPC Midlands+
consortium.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors used generative AI tools (GPT-4o) in two ways: (1) as part of the experimental pipeline,
to translate Spanish tweets into English while preserving task-relevant semantics, and (2) for minor
language editing, including grammar correction and paraphrasing, during manuscript preparation. All
research ideas, analyses, results, and conclusions are the authors’ own.
Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics,
Albuquerque, New Mexico, 2025, pp. 2223–2245. URL: https://aclanthology.org/2025.naacl-long.
112/.
[12] F. J. Rodríguez-Sanchez, J. C. de Albornoz, L. Plaza, J. Gonzalo, P. Rosso, M. Comet, T. Donoso,
Overview of exist 2021: sexism identification in social networks, Proces. del Leng. Natural 67
(2021) 195–207.
[13] L. Plaza, J. C. de Albornoz, I. Arcos, P. Rosso, D. Spina, E. Amigó, J. Gonzalo, R. Morante, Overview
of exist 2025: Learning with disagreement for sexism identification and characterization in tweets,
memes, and tiktok videos (extended overview), in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.),
CLEF 2025 Working Notes, 2025, pp. –. To appear.
[14] L. Plaza, J. C. de Albornoz, I. Arcos, P. Rosso, D. Spina, E. Amigó, J. Gonzalo, R. Morante, Overview
of exist 2025: Learning with disagreement for sexism identification and characterization in tweets,
memes, and tiktok videos, in: J. C. de Albornoz, J. Gonzalo, L. Plaza, A. G. S. de Herrera, J. Mothe,
F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF
Association (CLEF 2025), Lecture Notes in Computer Science, Springer, 2025, pp. –. To appear.
[15] E. Fersini, F. Gasparini, G. Rizzi, A. Saibene, B. Chulvi, P. Rosso, A. Lees, J. Sorensen, SemEval-2022
task 5: Multimedia automatic misogyny identification, in: G. Emerson, N. Schluter, G. Stanovsky,
R. Kumar, A. Palmer, N. Schneider, S. Singh, S. Ratan (Eds.), Proceedings of the 16th International
Workshop on Semantic Evaluation (SemEval-2022), Association for Computational Linguistics,
Seattle, United States, 2022, pp. 533–549.
[16] Y.-Z. Fang, L.-H. Lee, J.-D. Huang, Nycu-nlp at exist 2024: Leveraging transformers with diverse
annotations for sexism identification in social networks, in: CLEF (Working Notes), 2024, pp.
1003–1011. URL: https://ceur-ws.org/Vol-3740/paper-93.pdf.
[17] J. Tavarez-Rodríguez, F. Sánchez-Vega, A. Rosales-Pérez, A. P. López-Monroy, Better together:
Llm and neural classification transformers to detect sexism, in: Working Notes of CLEF 2024 –
Conference and Labs of the Evaluation Forum, 2024.
[18] F. Manzi, L. Weber-Genzel, B. Plank, Fine-grained sexism detection in Italian newspapers, in:
F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian
Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy,
2024, pp. 556–583. URL: https://aclanthology.org/2024.clicit-1.66/.
[19] M. Chen, P.-A. Duquenne, P. Andrews, J. Kao, A. Mourachko, H. Schwenk, M. R. Costa-jussà,
BLASER: A text-free speech-to-speech translation evaluation metric, in: A. Rogers, J. Boyd-Graber,
N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada,
2023, pp. 9064–9079. URL: https://aclanthology.org/2023.acl-long.504/. doi:10.18653/v1/2023.
acl-long.504.
[20] B. Ding, C. Qin, R. Zhao, T. Luo, X. Li, G. Chen, W. Xia, J. Hu, A. T. Luu, S. Joty, Data augmentation
using large language models: Data perspectives, learning paradigms and challenges, 2024. URL:
https://arxiv.org/abs/2403.02990. arXiv:2403.02990.
[21] M. Zhou, PingAnLifeInsurance at SemEval-2023 task 10: Using multi-task learning to better
detect online sexism, in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi,
R. Kumar, E. Sartori (Eds.), Proceedings of the 17th International Workshop on Semantic Evaluation
(SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2188–2192.</p>
      <p>URL: https://aclanthology.org/2023.semeval-1.304. doi:10.18653/v1/2023.semeval-1.304.
[22] F. Aldabbas, S. Ashraf, R. Sifa, L. Flek, MultiProp framework: Ensemble models for enhanced
crosslingual propaganda detection in social media and news using data augmentation, text segmentation,
and meta-learning, in: M. El-Haj (Ed.), Proceedings of the 1st Workshop on NLP for Languages
Using Arabic Script, Association for Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 7–22.</p>
      <p>URL: https://aclanthology.org/2025.abjadnlp-1.2/.
[23] H. Yan, L. Gui, G. Pergola, Y. He, Position bias mitigation: A knowledge-aware graph model
for emotion cause extraction, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the
59th Annual Meeting of the Association for Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for
Computational Linguistics, Online, 2021, pp. 3364–3375.
[24] G. Pergola, L. Gui, Y. He, A disentangled adversarial neural topic model for separating opinions from
plots in user reviews, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy,
S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Association for Computational Linguistics, Online, 2021, pp. 2870–2883.
[25] J. Lu, J. Li, B. Wallace, Y. He, G. Pergola, NapSS: Paragraph-level medical text simplification
via narrative prompting and sentence-matching summarization, in: A. Vlachos, I. Augenstein
(Eds.), Findings of the Association for Computational Linguistics: EACL 2023, Association for
Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 1079–1091.
[26] OpenAI, Gpt-4o model card, https://openai.com/index/gpt-4o-system-card/, 2024. Accessed: today.
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,</p>
      <p>Roberta: A robustly optimized bert pretraining approach, 2019. arXiv:1907.11692.
[28] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention, 2021.</p>
      <p>arXiv:2006.03654.
[29] E. Amigo, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: S. Muresan,
P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics,
Dublin, Ireland, 2022, pp. 5809–5819.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A literature survey on multimodal and multilingual sexism detection</article-title>
          ,
          <source>IEEE Transactions on Computational Social Systems</source>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          . doi:
          <volume>10</volume>
          .1109/TCSS.
          <year>2025</year>
          .
          <volume>3561921</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kirk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vidgen</surname>
          </string-name>
          , P. Röttger, SemEval-2023 task 10:
          <article-title>Explainable detection of online sexism</article-title>
          , in: A.
          <string-name>
            <surname>K. Ojha</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          <string-name>
            <surname>Doğruöz</surname>
            , G. Da San Martino, H. Tayyar Madabushi,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , E. Sartori (Eds.),
          <source>Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>2193</fpage>
          -
          <lpage>2210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pergola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jhumka</surname>
          </string-name>
          ,
          <article-title>Multilingual sexism identification via fusion of large language models</article-title>
          , in: E. names (Ed.),
          <source>Joint Proceedings of the ACM IUI 2024 Workshops</source>
          , volume
          <volume>3740</volume>
          <source>of CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Grenoble, France,
          <year>2024</year>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3740</volume>
          /paper-99.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Umer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Parvatham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Arthurs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cullen</surname>
          </string-name>
          , S. Wilson,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jhumka</surname>
          </string-name>
          , G. Pergola,
          <article-title>SafeSpeech: A comprehensive and interactive tool for analysing sexist and abusive language in conversations</article-title>
          , in: N.
          <string-name>
            <surname>Dziri</surname>
            ,
            <given-names>S. X.</given-names>
          </string-name>
          <string-name>
            <surname>Ren</surname>
          </string-name>
          , S. Diao (Eds.),
          <source>Proceedings of the</source>
          <year>2025</year>
          <article-title>Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), Association for Computational Linguistics</article-title>
          , Albuquerque, New Mexico,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jhumka</surname>
          </string-name>
          , G. Pergola,
          <article-title>Explaining matters: Leveraging definitions and semantic expansion for sexism detection</article-title>
          ,
          <source>in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pergola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Topic-driven and knowledge-aware transformer for dialogue emotion detection</article-title>
          , in: C.
          <string-name>
            <surname>Zong</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Navigli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>1571</fpage>
          -
          <lpage>1582</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pergola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Event-centric question answering via contrastive learning and invertible event transformation</article-title>
          , in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          , Y. Zhang (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2022</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>2377</fpage>
          -
          <lpage>2389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pergola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Leveraging ChatGPT in pharmacovigilance event extraction: An empirical study</article-title>
          , in: Y. Graham, M. Purver (Eds.),
          <source>Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume</source>
          <volume>2</volume>
          :
          <string-name>
            <surname>Short</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics, St. Julian's, Malta</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>344</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lyu</surname>
          </string-name>
          , , G. Pergola,
          <article-title>SciGisPy: a novel metric for biomedical text simplification via gist inference score</article-title>
          , in: M.
          <string-name>
            <surname>Shardlow</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Saggion</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Alva-Manchego</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Zampieri</surname>
          </string-name>
          , K. North, S. Štajner, R. Stodden (Eds.),
          <source>Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR</source>
          <year>2024</year>
          ),
          <article-title>Association for Computational Linguistics</article-title>
          , Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>95</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , G. Pergola,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Set-aligning framework for auto-regressive event temporal graph generation</article-title>
          , in: K. Duh,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , S. Bethard (Eds.),
          <source>Proceedings of the</source>
          <year>2024</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics</article-title>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>3872</fpage>
          -
          <lpage>3892</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , G. Pergola,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Cascading large language models for salient event graph generation</article-title>
          , in: L.
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ritter</surname>
          </string-name>
          , L. Wang (Eds.),
          <source>Proceedings of the</source>
          <year>2025</year>
          <article-title>Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>