<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>DUTH at EXIST 2025: Multilingual Sexism Detection with Soft Labels and Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Georgios Arampatzis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasileios Perifanis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Symeonidis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Avi Arampatzis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Democritus University of Thrace, Department of Electrical and Computer Engineering</institution>
          ,
          <addr-line>Xanthi</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Democritus University of Thrace, Department of Production and Management Engineering</institution>
          ,
          <addr-line>Xanthi</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents the DUTH system for the EXIST 2025 shared task on multilingual sexism detection. The task comprises three subtasks applied to a multilingual tweet corpus annotated with both hard and soft labels: (i) binary classification of sexist vs. non-sexist content, (ii) single-label classification of the type of sexism, and (iii) multi-label classification of the intended sexism category. The proposed system employs a transformer-based multilingual architecture, fine-tuned using techniques such as oversampling, class weighting, and soft-label learning to address class imbalance and annotator disagreement. Our system demonstrates robust performance in binary sexism detection, particularly on Spanish data, achieving competitive results under both hard and soft evaluation metrics. However, performance on the more nuanced subtasks of classifying the type and intent of sexist speech remains limited, underscoring the difficulty of modeling implicit and context-sensitive expressions of sexism. We analyze these challenges and propose future directions, including discourse-aware modeling, hierarchical label representations, and multimodal learning.</p>
      </abstract>
      <kwd-group>
        <kwd>Sexism Detection</kwd>
        <kwd>Transformer Models</kwd>
        <kwd>Soft Labels</kwd>
        <kwd>Multi-label Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sexism remains prevalent in online discourse, often disguised through implicit or veiled expressions,
which complicates automated detection efforts [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Social media platforms frequently exacerbate this
issue by amplifying such content [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Consequently, effective computational approaches are essential to
meet the growing need for identifying gender-based discrimination.
      </p>
      <p>
        Detecting sexism automatically is inherently challenging due to linguistic ambiguity, annotator
subjectivity, and cultural variation in its expression [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. The EXIST 2025 shared task tackles these
challenges by providing a multilingual benchmark dataset consisting of tweets, memes, and TikToks
annotated along multiple sexism-related dimensions [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>In this paper, we describe our participation in Task 1, which focuses on tweets and includes three
subtasks. We adopt a multi-model architecture based on transformer models fine-tuned for multilingual
and multi-label classification.</p>
      <p>Previous work in offensive language and toxic comment detection has evolved from rule-based systems
to deep learning architectures [7]. Transformer models such as BERT and its variants have shown
strong performance across NLP classification tasks, including sentiment analysis, stance detection, and
toxicity recognition [8].</p>
      <p>
        Multilingual transformers like mBERT and XLM-R are particularly effective in cross-lingual scenarios
with limited annotated data [9]. The EXIST series highlights the complexities of modeling sexism,
especially in the presence of annotator disagreement [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Recent approaches to address this include soft
labeling, uncertainty modeling, and disagreement-aware learning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Our prior work on multilingual affective analysis [10] informed our modeling strategy in this
task, underlining the effectiveness of ensemble and hybrid models in tackling nuanced cross-lingual
phenomena such as sexism. Building on these insights, we structure our study as follows: Section 2
describes the dataset, including annotation methodology and statistical distributions, followed by our
implementation environment and modeling approach. Section 3 presents the experimental results,
evaluation metrics, and a detailed subtask analysis. Finally, Section 4 summarizes our findings and
outlines future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <sec id="sec-2-1">
        <title>2.1. Dataset</title>
        <p>The EXIST 2025 dataset is a multilingual corpus of tweets annotated for three subtasks: binary sexism
detection, intention classification, and multi-label sexism type categorization. Annotations were
collected from multiple annotators per instance, with soft labels derived from aggregated votes. Tweets
are annotated in both English and Spanish, providing a realistic and culturally diverse corpus.</p>
        <p>Table 3 summarizes the label counts for Task 1.3, which involves multi-label classification of sexism
intentions. Each tweet can be annotated with one or more of the following categories: Intentional,
Unintentional, Ideological, and Non-Sexist. In the training set, Intentional sexism is the most common
category (10,587 tweets), closely followed by Ideological (8,778) and Unintentional (8,391). The Non-Sexist
class corresponds to tweets without sexist content (22,767 instances) and can co-occur with the others
due to the soft-label nature of the task. This distribution suggests that intentionality and ideology are
prominent aspects of sexist expression in the dataset.</p>
        <p>The test set consists of 2,076 tweets annotated with soft labels for all three subtasks. For Task 1.1,
each instance includes a probability distribution over the binary classes (Sexist vs. Non-Sexist). For
Tasks 1.2 and 1.3, soft multi-label annotations are provided for each of the respective categories. These
probabilistic labels capture annotator disagreement and are intended to support evaluation methods
beyond traditional classification metrics.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Implementation and Environment</title>
        <p>All experiments were conducted in Python 3.10, using the HuggingFace Transformers library and
the PyTorch framework.</p>
        <p>The core software stack included transformers (v4.38.1) for model loading and fine-tuning,
datasets (v2.18.0) for data handling, scikit-learn (v1.4.2) for evaluation metrics, and pandas
(v2.2.1) and numpy (v1.26.4) for data manipulation. We used torch (v2.2.0) as the main backend for deep
learning operations. Auxiliary libraries such as accelerate, evaluate, tqdm, json, and argparse
supported training and evaluation.</p>
        <p>The implementation supports soft-label training, class reweighting, and multi-label classification
where applicable. Annotator labels were manually preprocessed into hard or probabilistic targets
according to the task requirements.</p>
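As an illustration of this preprocessing step, the sketch below derives both target types from raw annotator votes. The helper and the class names are hypothetical placeholders, not the authors' exact pipeline:

```python
from collections import Counter

def to_targets(votes, classes=("NO", "YES")):
    """Turn per-annotator votes into a hard (majority-vote) label and a
    soft (vote-proportion) distribution. Hypothetical helper; class names
    are placeholders, not the authors' exact code."""
    counts = Counter(votes)
    soft = [counts[c] / len(votes) for c in classes]   # probabilistic target
    hard = max(classes, key=lambda c: counts[c])       # majority-vote target
    return hard, soft

hard, soft = to_targets(["YES", "YES", "NO"])  # hard="YES", soft=[1/3, 2/3]
```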
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Methodology</title>
        <sec id="sec-2-3-1">
          <title>Task 1.1 – Binary Sexism Detection</title>
          <p>For Task 1.1, we formulated the problem as a binary classification task, aiming to distinguish between
sexist and non-sexist tweets. We filtered the training instances to retain only those with annotations
from at least three annotators, and assigned hard labels based on majority vote. To address class
imbalance, we applied oversampling to the minority class (sexist instances). We employed the
multilingual xlm-roberta-large transformer model and fine-tuned it using a custom training routine with
class-weighted cross-entropy loss to mitigate bias toward the majority class. Hyperparameters were
tuned using stratified training-validation splits and early stopping based on F1-score.</p>
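The class-weighted loss can be written out in a framework-independent way. This minimal sketch mirrors the normalization of PyTorch's CrossEntropyLoss with a weight tensor and mean reduction; the logits and weights shown are illustrative, not from our runs:

```python
import math

def weighted_ce(logits, labels, weights):
    """Class-weighted cross-entropy over raw logits. Mirrors PyTorch's
    CrossEntropyLoss(weight=..., reduction="mean"): each sample's loss is
    scaled by the weight of its true class, and the batch loss is divided
    by the summed weights. Values below are illustrative."""
    total, wsum = 0.0, 0.0
    for z, y in zip(logits, labels):
        m = max(z)  # stabilized log-sum-exp for the softmax denominator
        lse = m + math.log(sum(math.exp(v - m) for v in z))
        total += weights[y] * (lse - z[y])  # weighted negative log-likelihood
        wsum += weights[y]
    return total / wsum

# Up-weighting the minority (sexist) class makes errors on it cost more.
loss = weighted_ce([[2.0, 0.5], [0.2, 1.1]], [0, 1], [1.0, 1.3])
```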
        </sec>
        <sec id="sec-2-3-2">
          <title>Task 1.2 – Sexism Type Classification</title>
          <p>Task 1.2 involves single-label classification of sexist tweets into three categories: Reported Speech,
Judgmental, and Direct. Instances were labeled according to the most frequently selected category
among annotators. Due to skewed class distributions, we balanced the training data via oversampling
to ensure equal representation across categories. The cardiffnlp/twitter-xlm-roberta-base
model was fine-tuned using a custom training pipeline that dynamically computed class weights based
on the frequency of each label in the training set. Optimization was guided by macro-averaged F1-score,
and early stopping was applied to prevent overfitting.</p>
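A plausible sketch of the dynamically computed, frequency-based class weights described above; the rescaling to mean 1 is our assumption, since the exact normalization is not specified:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Inverse-label-frequency class weights, rescaled to mean 1 (the
    normalization is our assumption; the exact scheme is unspecified).
    Rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    classes = sorted(counts)
    inv = [1.0 / counts[c] for c in classes]
    mean = sum(inv) / len(inv)
    return {c: w / mean for c, w in zip(classes, inv)}

# A skewed Task 1.2-style label set (counts are made up for illustration):
weights = inverse_frequency_weights(["Direct"] * 6 + ["Judgmental"] * 3 + ["Reported"])
```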
        </sec>
        <sec id="sec-2-3-3">
          <title>Task 1.3 – Multi-label Sexism Intention Classification</title>
          <p>For Task 1.3, we treated the classification of sexism intentions as a multi-label problem, where tweets
could be associated with one or more categories from a predefined set. We performed label normalization
to unify semantically overlapping tags and filtered out inconsistently annotated or ambiguous instances.
To alleviate class imbalance, we applied targeted data augmentation using paraphrased versions of
underrepresented instances. We fine-tuned a multilingual xlm-roberta-base model with sigmoid
activation on the output layer and binary cross-entropy loss. The model was trained using stratified
sampling and evaluated with micro-averaged F1-score.</p>
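The sigmoid-plus-binary-cross-entropy head can be made explicit with a small self-contained sketch; the logits and targets are illustrative, not model outputs:

```python
import math

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent sigmoid outputs: the
    standard multi-label head, one probability per intention category.
    Targets may be soft (annotator vote proportions). Illustrative sketch."""
    loss = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid: independent per category
        loss += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return loss / len(logits)

# One tweet, three intention categories; several can be active at once.
loss = multilabel_bce([2.0, -1.0, 0.3], [1.0, 0.0, 0.5])
```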
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Training Details</title>
        <p>All systems were developed using the Hugging Face Transformers library with a PyTorch backend.
Stratified training-validation splits (typically 80/20 or 90/10) were used to preserve label distributions.
Early stopping was employed to prevent overfitting, with patience values ranging from 2 to 3 epochs.
Below, we outline the specific hyperparameter settings and preprocessing strategies adopted per subtask.
Task 1.1 – Binary Sexism Detection. We employed the xlm-roberta-large transformer model,
fine-tuned using the AdamW optimizer. Training data was filtered to retain examples with at least
three annotators and binarized via majority vote. Minority class oversampling and a class-weighted
cross-entropy loss were used to address label imbalance. Hyperparameters: learning rate 1e-5; batch size 4;
up to 10 epochs (early stopping patience 2); max sequence length 128; class weights [1.0, 1.3].
Task 1.2 – Sexism Type Classification. The cardiffnlp/twitter-xlm-roberta-base model
was fine-tuned using class-balanced oversampling and a dynamically computed class-weighted
cross-entropy loss. Hyperparameters: learning rate 1e-5; batch size 4; 6 epochs (early stopping
patience 2); max sequence length 128; loss weighting by normalized inverse label frequency.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Task 1.3 – Multi-label Intention Classification</title>
        <p>We used xlm-roberta-base in a multi-label setup with sigmoid activation and binary cross-entropy
loss. Label normalization and label-aware paraphrasing were applied to address semantic overlap and
class imbalance. Hyperparameters: learning rate 2e-5; batch size 8; 3 epochs; max sequence length 128;
augmentation strategy: paraphrasing underrepresented categories to at least 300 examples per class.</p>
        <p>All models were evaluated using macro- or micro-averaged F1-score depending on the task.
Mixed-precision (FP16) training was enabled when supported by the hardware.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <sec id="sec-3-1">
        <title>3.1. Evaluation Metrics</title>
        <p>The evaluation of submitted systems in EXIST 2025 relies on a diverse set of metrics tailored to the
nature of each subtask.</p>
        <p>Information Contrast Measure (ICM): ICM is a hierarchical-aware metric that compares
predicted and gold labels by incorporating the semantic distances between hierarchical classes [11]. It is
particularly suitable for hard-label classification tasks involving taxonomies.</p>
        <p>
          ICM-soft extends ICM to the soft-label setting by evaluating predicted probability distributions
against annotator consensus distributions. It rewards models that capture annotator uncertainty and
disagreement, aligning with recent trends in disagreement-aware learning and probabilistic labeling
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>F1-score is used in subtasks with binary or imbalanced classification. It is defined as the harmonic
mean of precision and recall, and may be reported per class or for the positive class (YES) depending on
the evaluation protocol [12].</p>
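With precision P and recall R for the class under evaluation, the F1-score referenced above is the harmonic mean:

```latex
\mathrm{F1} = \frac{2\,P\,R}{P + R}
```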
        <p>Cross-Entropy measures the divergence between predicted and reference probability distributions,
offering insight into the probabilistic calibration of classifiers. It is particularly relevant for soft-label
and uncertainty-based modeling [13].</p>
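For a reference soft-label distribution p and a predicted distribution q over classes c, this cross-entropy takes the standard form (lower is better):

```latex
H(p, q) = -\sum_{c} p(c)\,\log q(c)
```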
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Results</title>
        <p>To assess system performance, the EXIST 2025 organizers adopted metrics that reflect both accuracy
and agreement with annotator uncertainty.</p>
        <p>ICM-Soft (Information Contrast Measure – Soft) is a divergence-based metric that compares predicted
distributions with soft gold labels representing annotator consensus. It rewards systems that approximate
the degree of disagreement among annotators rather than enforcing a single hard label [14, 15].</p>
        <p>Normalized ICM-Soft rescales ICM-Soft relative to a random baseline, producing values between 0
and 1 for easier interpretation. A higher score indicates stronger alignment with annotators.</p>
        <p>Cross Entropy measures the average divergence between predicted and true soft label distributions.
Lower values signify better probabilistic calibration and alignment with annotator judgments.</p>
        <p>These metrics, drawn from recent research in learning with disagreement [14, 15], are particularly
suited to tasks involving subjective or multi-annotator data such as sexism detection.</p>
        <p>Our system exhibited limited effectiveness on Task 1.2, achieving a Macro F1 score of only 0.1967,
which reflects poor balance across the three target classes. This result suggests that the model struggled
particularly with identifying minority categories, especially those involving subtle or non-explicit
expressions of sexist intent. Additionally, the high Cross Entropy value (7.3212) indicates substantial
uncertainty and miscalibration in the model’s probabilistic outputs.</p>
        <p>These outcomes are not unexpected given the inherent complexity of Task 1.2. Unlike binary
classification, this task requires the ability to distinguish between nuanced forms of sexism—such as
ideological versus unintentional intent—and to interpret implicit language cues embedded in varying
cultural and social contexts. Such subtleties often challenge general-purpose text encoders, which may
lack the inductive bias needed to generalize over pragmatic and contextual features.</p>
        <p>In Task 1.3, our system obtained a Macro F1 score of 0.3897, indicating moderate performance in
distinguishing between the multiple categories associated with sexist intent. While the model captures
certain patterns in the data, it struggles to generalize across all intent types in a balanced manner.</p>
        <p>This relatively low score is not unexpected, given the inherent difficulty of Task 1.3, which involves
not only identifying the presence of sexist content but also inferring the underlying intention—a
subjective and highly context-sensitive construct. Distinguishing between intentional, unintentional,
ideological, and non-sexist statements requires sensitivity to pragmatic cues, cultural nuances, and
discourse-level features that go beyond surface-level lexical signals.</p>
        <p>Our models showed robustness in binary classification but underperformed in the nuanced distinctions
required by Tasks 1.2 and 1.3. The use of standard transformers, without explicit modeling of label
hierarchy or annotator disagreement, likely contributed to the poor handling of ambiguous or
multi-intent tweets. Performance was particularly limited in cases with rare label co-occurrence or high
inter-annotator variance.</p>
        <p>We hypothesize that incorporating hierarchical label modeling, disagreement-aware loss functions,
and graph-based representation learning could substantially improve performance. Error analysis also
highlighted the need for pragmatic and discourse-level features, which were lacking in our current
token-level input representations.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Results Analysis</title>
        <p>The experimental results across the three subtasks reveal several insights regarding the capabilities and
limitations of our system.</p>
        <p>In Task 1.1 (Binary Sexism Detection), the system demonstrated robust performance, particularly
on the Spanish dataset. The highest normalized ICM-Hard and F1-YES scores across all language subsets
suggest stronger alignment with the linguistic characteristics of Spanish sexist content. We hypothesize
that greater consistency of lexical cues in Spanish tweets, coupled with the model’s cross-lingual
generalization capabilities, contributed to this outcome. In contrast, the relatively lower performance
on English may reflect increased linguistic ambiguity or higher annotation noise.</p>
        <p>In Task 1.2 (Sexism Type Classification), the system performed considerably worse. The
macro-averaged F1 score of 0.1967 indicates substantial class imbalance and difficulty in distinguishing
between closely related categories such as Judgmental and Reported Speech. The high cross-entropy
further suggests that the model was poorly calibrated, frequently producing overconfident but incorrect
predictions. The lack of explicit contextual signals in short tweet texts likely impeded its ability to
disambiguate intent-related expressions.</p>
        <sec id="sec-3-3-1">
          <title>Task 1.3 (Sexism Intention Classification)</title>
          <p>In Task 1.3, the system achieved moderate performance (Macro F1 = 0.3897) but struggled
with overlapping labels and fine-grained distinctions. The task’s multi-label
nature, which requires handling interdependent and co-occurring classes, posed significant challenges.
The absence of structured label modeling may have further constrained performance. Moreover, the
extremely low or negative ICM-Soft scores highlight a misalignment with annotator disagreement,
underlining the complexity of learning from soft-label distributions in subjective contexts.</p>
          <p>Overall, while our system was effective at identifying explicit forms of sexism, it underperformed
when required to infer nuanced, context-dependent phenomena such as intention or ideological framing.
These findings are consistent with prior observations that transformer-based models, though strong in
binary classification, benefit from extensions such as discourse-aware architectures, hierarchical label
modeling, and pragmatic signal integration when applied to subjective or multi-dimensional annotation
schemes.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>
        In this paper, we presented our approach for the EXIST 2025 shared task on multilingual sexism
detection, addressing three subtasks involving binary, single-label, and multi-label classification. Our
architecture leveraged transformer-based multilingual models trained with both hard and soft labels to
accommodate the subjectivity and annotator disagreement inherent in the dataset [
        <xref ref-type="bibr" rid="ref3 ref16">3, 16</xref>
        ].
      </p>
      <p>Our system achieved robust performance in Task 1.1, particularly on Spanish instances, demonstrating
strong alignment with annotator labels in both hard and soft evaluation metrics. However, results on
Tasks 1.2 and 1.3 revealed significant challenges in modeling subtle forms of intent and distinguishing
fine-grained classes under conditions of low inter-annotator agreement. These tasks require more than
lexical matching: understanding pragmatic cues, intent, and socio-linguistic context is essential [17].</p>
      <p>For future work, we aim to enhance the modeling of subjective and ambiguous instances by integrating
hierarchical and graph-based label representations [18]. We also plan to incorporate discourse-aware
and pragmatics-driven features, possibly through large language models with conversational grounding
or attention to speaker roles and framing. Agreement-aware loss functions and uncertainty modeling
will be further explored to better align model behavior with the soft-label structure of the dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgements</title>
      <p>We thank the organizers of the EXIST 2025 shared task for providing multilingual data and a reliable,
transparent evaluation framework that enables fair, meaningful, and reproducible comparisons. Their
contribution has been instrumental in advancing research on computational modeling of sexism,
subjectivity, and annotator agreement.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gasparini</surname>
          </string-name>
          , et al.,
          <article-title>Memes as carriers of sexist ideologies</article-title>
          ,
          <source>in: Digital Platforms and Feminist Politics</source>
          , Palgrave Macmillan,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , et al.,
          <article-title>Predicting the type and target of offensive posts in social media</article-title>
          ,
          <source>in: Proceedings of NAACL</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Davani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Díaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Prabhakaran</surname>
          </string-name>
          ,
          <article-title>Dealing with disagreements: Looking beyond the majority vote in subjective annotations</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>92</fpage>
          -
          <lpage>110</lpage>
          . doi:10.1162/tacl_a_00454.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E. W.</given-names>
            <surname>Pamungkas</surname>
          </string-name>
          , et al.,
          <article-title>Exist 2023 at clef: Incorporating author's intention and learning with disagreements for sexism detection</article-title>
          ,
          <source>in: CLEF Working Notes</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Arcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          , Overview of EXIST 2025:
          <article-title>Learning with disagreement for sexism identification and characterization in tweets, memes, and tiktok videos</article-title>
          , in: J. C. de Albornoz, J. Gonzalo, L. Plaza, A. G. S. de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Arcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morante</surname>
          </string-name>
          , Overview of EXIST 2025:
          <article-title>Learning with disagreement for sexism identification and characterization in tweets, memes, and tiktok videos (extended overview)</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Pavlopoulos, P. Malakasiotis, I. Androutsopoulos, Toxicity detection: Does context really matter?, in: Proceedings of the 59th ACL (Volume 1: Long Papers), 2021, pp. 3341–3353. doi:10.18653/v1/2021.acl-long.264.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT 2019, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Conneau, et al., Unsupervised cross-lingual representation learning at scale (XLM-R), in: ACL 2020, 2020, pp. 8447–8461. doi:10.18653/v1/2020.acl-main.747.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] G. Arampatzis, V. Perifanis, S. Symeonidis, A. Arampatzis, DUTH at SemEval-2023 Task 9: An Ensemble Approach for Twitter Intimacy Analysis, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, 2023, pp. 1225–1230. URL: https://aclanthology.org/2023.semeval-1.170. doi:10.18653/v1/2023.semeval-1.170.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] G. Angulo, et al., Hierarchical evaluation of classifiers with the information contrast model, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, 2021.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Information Processing &amp; Management 45 (2009) 427–437.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: Proceedings of the 34th International Conference on Machine Learning, 2017.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Uma, J. Baan, B. Plank, Learning from disagreement: A survey, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. Baan, A. Uma, B. Plank, Learning from label disagreement in natural language processing, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Forbes, A. Zhang, Y. Choi, Limitations of modeling annotator disagreement: A case study in hate speech detection, in: Findings of EMNLP, 2021.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, Y. Choi, Social bias frames: Reasoning about social and power implications of language, in: Proceedings of ACL, 2020.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Y. Choi, J. Lee, J. Choi, S.-g. Lee, Deepertag: Disentangling hierarchical classification with discriminative learning of label structures, in: Proceedings of AAAI, 2021.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>