<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>FairTransNLP: Fairness and Transparency for Equitable NLP Applications in Social Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariona Taulé</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Plaza</string-name>
          <email>plaza@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jorge Carrillo-de-Albornoz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre de Llenguatge i Computació (CLiC), Universitat de Barcelona</institution>
          ,
          <addr-line>Gran Via 585, 08029, Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Universidad Nacional de Educación a Distancia</institution>
          ,
          <addr-line>Juan del Rosal 16, 28040 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universitat Politècnica de València</institution>
          ,
          <addr-line>Camino de Vera s/n, 46022 Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>ValgrAI Valencian Graduate School and Research Network of Artificial Intelligence</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Artificial Intelligence (AI) applications often perpetuate and accentuate unfair biases that can originate from multiple sources, such as data sampling, labelling and training data. Biased outputs can negatively affect certain social groups of users and even lead to discrimination. The ability of AI systems to provide transparent and understandable explanations for their decisions is crucial both for developers, to better understand the systems' behaviour, and for users, to gain trust in AI systems. In this coordinated project (UPV, UB, UNED), we have addressed problems such as the detection and classification of racial stereotypes and sexism in social networks, considering the multiple perspectives of annotators in data with "conflicting" labels. This was made possible by employing the Learning with Disagreements paradigm, with the aim of fostering the development of more equitable AI models, i.e. models that are fair and inclusive towards multiple viewpoints rather than representing only the majority view.</p>
      </abstract>
      <kwd-group>
<kwd>Fairness</kwd>
        <kwd>transparency</kwd>
        <kwd>equitable NLP</kwd>
        <kwd>learning with disagreements</kwd>
        <kwd>racial stereotypes</kwd>
        <kwd>sexism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Motivation and Related Work</title>
      <p>Biased systems are those that systematically and unfairly discriminate against individuals or social
groups. If a biased system becomes widely adopted, the social biases it perpetuates may have
serious consequences. Another major issue is biases in the design and annotation of datasets. For
example, in [1], the authors point out how annotated data may carry racial biases, and how models
can learn such biases. Topic bias is another factor to consider when developing datasets. Recent
studies have shown how the volatile nature of topics, especially on social media, can hinder the
predictive capability of models trained on data collected with keyword sets [2] or within restricted
time spans. In [3], the authors analyse the topic bias in datasets used in shared tasks on the detection
of toxic/abusive language, hate speech and offensive language, misogyny and sexism.</p>
      <p>In the framework of the FairTransNLP project, we have analysed bias in several domains: (i)
media bias, (ii) hate speech and (iii) multimodal sexism. The problem of media bias was reviewed in
[5]. In that study, the authors concluded that the current methods for automatic media bias
detection are still in their infancy and that there is considerable room for improvement in terms of
accuracy and robustness. In another work [6], the same authors employed LIME and SHAP
explainability techniques to determine to what extent lexical-based AI models could identify bias.
In [7], bias was studied in pre-trained models for hate speech detection, showing that they are
biased towards hateful keywords: fine-tuning these models with hateful texts that do not contain
the hateful keywords makes it possible to reduce this bias. Finally, in [8], a bias estimation technique
was proposed to identify the specific elements of a meme that could lead to unfair models,
together with a bias mitigation strategy based on Bayesian Optimization.</p>
      <p>In the next sections, after describing Learning with Disagreements, we will comment on two
problems we addressed employing this new paradigm: the detection and classification of racial
stereotypes and sexism in social networks. The aim was to foster the development of fairer and
more inclusive AI models in the framework of two shared tasks we organized at the IberLEF (racial
stereotypes) and CLEF (sexism) forums.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Learning with Disagreements</title>
      <p>The classic approach for dealing with disagreement among annotators assumes the existence of a
single, objective label, known as the gold label, which can be extracted through a majority voting
scheme. Although this methodology is simple and effective, it disregards the opinion of the
minority in favour of the majority, hence neglecting other viewpoints. In fact, disagreement can be
considered a signal rather than noise [9], as it provides useful information for learning [10].</p>
      <p>For instance, consider the tweet “Women should stop trying to understand football and focus on
what they're good at, like fashion.” This example illustrates the use of soft labels, in which
annotator disagreement is preserved rather than collapsed into a single category. Out of 6
annotators, 4 labeled the tweet as sexist and 2 as non-sexist, resulting in a soft label distribution of
sexist: 0.67, non-sexist: 0.33.</p>
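      <p>As a minimal sketch (our own illustration, not code from any task system), the following Python
snippet contrasts the majority-vote hard label with the soft-label distribution for this tweet; the
function names are hypothetical:</p>
      <preformat>
from collections import Counter

def hard_label(votes):
    """Aggregate annotator votes into a single gold label by majority vote."""
    return Counter(votes).most_common(1)[0][0]

def soft_label(votes, classes=("sexist", "non-sexist")):
    """Preserve disagreement as a probability distribution over the classes."""
    counts = Counter(votes)
    return {c: counts.get(c, 0) / len(votes) for c in classes}

votes = ["sexist"] * 4 + ["non-sexist"] * 2
print(hard_label(votes))  # sexist
print(soft_label(votes))  # {'sexist': 0.67, 'non-sexist': 0.33} (rounded)
      </preformat>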
      <p>Learning With Disagreements (LeWiDi) usually applies a soft loss approach. This means that,
although it also aggregates the individual annotations, it aggregates them into a probability
distribution rather than a single label. The main goal of this approach is to optimize a model
distribution to resemble the original one produced by disagreement among annotators [11]. In
contrast, the perspectivist approach disregards aggregation and proposes working directly with
individual annotations [12]. Modelling strategies for LeWiDi include using soft labels that reflect
the distribution of annotator responses, modelling individual annotators to capture their biases and
reliability, and applying multi-task or probabilistic approaches to jointly infer true labels and
annotator behaviour.</p>
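      <p>As a concrete illustration of the soft-loss idea, the following PyTorch sketch (our own minimal
example with made-up logits and targets, not the setup of any participant system) computes the
cross-entropy between the model's predicted distribution and the annotator distribution, so the
model is optimized to match the disagreement itself rather than a single aggregated label:</p>
      <preformat>
import torch
import torch.nn.functional as F

# Model logits for two instances over two classes (sexist, non-sexist).
logits = torch.tensor([[1.2, -0.3],
                       [0.1,  0.9]])

# Soft targets: per-class proportions of annotator votes, not one-hot labels.
soft_targets = torch.tensor([[0.67, 0.33],
                             [0.17, 0.83]])

# Soft loss: cross-entropy of the predicted distribution w.r.t. the annotator one.
log_probs = F.log_softmax(logits, dim=-1)
soft_loss = -(soft_targets * log_probs).sum(dim=-1).mean()
      </preformat>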
      <p>One of the key conclusions drawn from the LeWiDi-inspired tasks EXIST and DETESTS is the
clear performance improvement achieved when training systems with soft labels instead of hard
labels. This finding, which aligns with previous literature, highlights how converting multiple
annotator judgments into a single ground truth label leads to a loss of valuable information, both
in terms of discarded instances (to resolve ties or disagreements) and in the reduction of nuanced
perspectives within each instance. Moreover, commonly used disagreement resolution methods
(e.g., majority vote, adding an extra annotator) are often applied without considering the specifics
of the task, use case, or annotator profiles, which can unintentionally introduce biases into the final
dataset.</p>
    </sec>
    <sec id="sec-3">
      <title>3. DETESTS: DETEction and classification of racial STereotypes in</title>
    </sec>
    <sec id="sec-4">
      <title>Spanish</title>
      <p>The DETESTS-Dis corpus was created with the aim of analyzing and detecting stereotypes related
to immigration on social media. We adopted the LeWiDi paradigm given the inherent subjectivity
of this task. The identification of stereotypes on social media is crucial because their presence
reinforces toxic and hate speech against vulnerable social groups such as immigrants. Furthermore,
these types of messages are rapidly disseminated on social media such as Twitter (currently X).
Stereotype detection is a complex task both because of its subjective nature (the same message can
be interpreted differently depending on the culture, beliefs, age or gender of the reader), and
because of the way in which stereotypes can be expressed, i.e. explicitly or implicitly. The different
human perspectives and stereotypes implicit in the messages (i.e., when a certain inference process
is required to understand them) are possibly the main difficulties for detection systems. Based on
these assumptions, we created the DETESTS-Dis dataset, which consists of two corpora from
different social media sources containing two types of texts:</p>
      <p>The StereoCom corpus consists of 6,762 sentences extracted from 3,054 comments posted in
response to articles related to immigration, manually selected from 12 Spanish online newspapers,
including El País, La Vanguardia and ABC, and from discussion forums such as Menéame and
ForoCoches. This selection was carried out considering news articles published between August
2017 and November 2021 containing controversial content, potential toxicity and a minimum of 50
published comments per article, following the methodology applied to the NewsCom-TOX corpus
[13]. We used a keyword-based approach to search for articles mainly related to xenophobia and
likely to include ethnic stereotypes about immigrants.</p>
      <p>The StereoHoax-ES corpus consists of 5,349 Twitter messages retrieved in 2021 from 449
conversational heads (i.e., the tweet starting the conversation) responding to 72 racist hoaxes
related to immigration in Spain, manually extracted from the fact-checking websites Maldita.es and
Newtral, which verify or refute claims made on social networks. The tweets were retrieved with
Twitter API v2 for Academic Research, using keywords and the contents of the hoaxes. We also
collected the conversational threads. This corpus corresponds to the Spanish subset of the
StereoHoax multilingual dataset [14]. Both corpora were annotated with the following categories:</p>
      <list list-type="bullet">
        <list-item>
          <p>Stereotype: a binary category indicating the presence or absence of stereotypes.</p>
        </list-item>
        <list-item>
          <p>Stereotype classification: a multilabel category in which immigrants are presented as: (1)
‘victims of xenophobia’, (2) ‘suffering victims’, (3) ‘economic resources’, (4) a problem of
‘migration control’, (5) people with ‘cultural and religious differences’, (6) people who receive the
‘benefits’ of our social policy, (7) a problem for ‘public health’, (8) a threat to ‘security’, (9)
‘dehumanization’ and (10) ‘other’ types of stereotypes. This classification is based on the
proposal of [15].</p>
        </list-item>
        <list-item>
          <p>Implicitness: a binary category indicating whether the stereotype is expressed explicitly or
implicitly.</p>
        </list-item>
        <list-item>
          <p>Type of implicitness: a multilabel category including different linguistic strategies used to
convey implicit stereotypes: (1) ‘world knowledge’; (2) ‘figures of speech’ (such as metaphor,
rhetorical questions, euphemisms and reported speech); (3) ‘humor/jokes’; (4) ‘irony/sarcasm’;
(5) ‘extrapolation’; (6) ‘imperative/exhortative calls’ for action related to immigrants; (7)
‘entailment/evaluation’; and (8) ‘others’ for types of implicitness not covered by the previous
categories. We also included the (9) ‘context’ label to indicate that it is necessary to consider
previous messages (sentences or tweets) to understand the implicit stereotype.</p>
        </list-item>
      </list>
      <p>All these corpora were annotated by three linguists of different ages and genders (one expert
linguist and two trained students). The DETESTS-Dis dataset was released both in its aggregated
form (applying the majority vote, hard labels) and in its disaggregated form (soft labels) in the
DETESTS-Dis task [16], which took place as part of IberLEF 2024. The conversational threads of
comments and tweets were also provided to the participants in the task, but not the information
included in the stereotype classification and type of implicitness labels. The DETESTS-Dis task
(DETEction and classification of racial STereotypes in Spanish - Learning with Disagreement) was
designed hierarchically by chaining two binary-classification subtasks:</p>
      <list list-type="order">
        <list-item>
          <p>The stereotype detection subtask aimed to determine whether a tweet or sentence contains
any stereotype, considering the full distribution of labels provided by the annotators: (1)
Stereotype: […] Illegal immigrants have more rights than Spaniards and we are FED UP! (2) Not
stereotype: In fact, all Muslim-majority countries have Sharia as a source of law to a greater or
lesser degree; they are all theocratic.</p>
        </list-item>
        <list-item>
          <p>The implicitness identification subtask introduces a hierarchical binary classification
problem to identify whether the stereotypes in the text are explicit or implicit: (1) Implicit:
Quality immigrants. (2) Explicit: What is dangerous is the brainless immigrants, who
misinterpret Islam and who kill innocent people.</p>
        </list-item>
      </list>
      <p>The first subtask was evaluated using the binary F1 metric for the models that output hard
labels, while for soft labels the cross-entropy metric was applied between the system soft label
values and the soft labels generated from the average votes of the annotators. The ICM metric [17],
an information theoretic-based metric that considers both the hierarchical structure and the class
specificity, was used to evaluate the second subtask. The ICM metric was the official metric for the
ranking of both hard labels (ICM) and soft labels (ICM-Soft). Fifteen teams signed up to participate;
six of them submitted runs and three submitted a working paper. The models used by participants
were BETO, RoBERTa, XLM-RoBERTa, Twitter-RoBERTa and Twitter-XLM-RoBERTa, while one
team used GPT-3.5 Turbo. Several teams employed data augmentation techniques, mainly
back-translation. Only a few teams used contextual information. In Subtask 1, the best system with hard
labels achieved an F1 of 0.724, while the best system with soft labels achieved a cross-entropy score
of 0.841. No team beat the BETO baseline with hard labels, and only one team achieved a better
result with soft labels (normalized ICM-Soft of 0.403).</p>
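      <p>For concreteness, the soft evaluation described above can be sketched as follows (our own
minimal illustration of a per-instance cross-entropy score, not the official evaluation script): each
system's soft label is scored against the distribution derived from the annotators' votes, and scores
are averaged over the test set:</p>
      <preformat>
import math

def cross_entropy(gold_dist, sys_dist, eps=1e-12):
    """Cross-entropy of a system soft label w.r.t. the annotator-derived one."""
    return -sum(p * math.log(max(sys_dist[c], eps)) for c, p in gold_dist.items())

gold = {"stereotype": 2 / 3, "non-stereotype": 1 / 3}   # from annotator votes
system = {"stereotype": 0.7, "non-stereotype": 0.3}     # system soft output
print(round(cross_entropy(gold, system), 3))            # lower is better
      </preformat>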
    </sec>
    <sec id="sec-5">
      <title>4. EXIST: sEXism Identification in Social neTworks</title>
      <p>EXIST is a series of scientific events and shared tasks aimed at capturing sexism in a broad sense,
from explicit misogyny to subtle, implicit sexist behaviours. The last two editions were held as labs
at CLEF 2023 and CLEF 2024, while the first two took place at IberLEF.</p>
      <p>The EXIST shared task focuses on identifying and classifying sexism in social networks. While
the 2023 edition focused on tweets, the 2024 edition expanded the task scope by incorporating
memes, recognizing their increasing use in spreading harmful messages disguised as humour. Both
editions, however, included the following three main tasks:</p>
      <list list-type="bullet">
        <list-item>
          <p>Sexism Detection: Binary classification to determine whether the content is sexist or not.</p>
        </list-item>
        <list-item>
          <p>Source Intention Classification: Ternary classification distinguishing between:</p>
          <list list-type="simple">
            <list-item>
              <p>Direct messages: Women shouldn’t code... perhaps be an influencer instead... it’s their
natural strength.</p>
            </list-item>
            <list-item>
              <p>Reported messages: Today, one of my year 1 class pupils could not believe he’d lost a race
against a girl.</p>
            </list-item>
            <list-item>
              <p>Judgmental messages: As usual, the woman was the one quitting her job for the family’s
welfare…</p>
            </list-item>
          </list>
        </list-item>
        <list-item>
          <p>Sexism Categorization: Multi-label classification assigning sexist content to one or more
among five categories (a minimal sketch of multi-label soft labelling follows this list):</p>
          <list list-type="simple">
            <list-item>
              <p>Ideological and inequality: #Feminism is a war on men, but it’s also a war on women.</p>
            </list-item>
            <list-item>
              <p>Role stereotyping and dominance: I feel like everytime I flirt with a girl they start to
imagine all the ways they can utilize me.</p>
            </list-item>
            <list-item>
              <p>Objectification: No offense but I’ve never seen an attractive african american hooker. Not a
single one.</p>
            </list-item>
            <list-item>
              <p>Sexual violence: Fuck that cunt, I would with my fist.</p>
            </list-item>
            <list-item>
              <p>Misogyny and non-sexual violence: Domestic abuse is never okay.... Unless your wife is a
bitch.</p>
            </list-item>
          </list>
        </list-item>
      </list>
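      <p>For the categorization task, soft labels generalize naturally to the multi-label setting: each
category receives the share of annotators who selected it, so the probabilities need not sum to one.
A minimal Python sketch follows (hypothetical annotator sets and our own function name; the
category strings are illustrative, not the official label vocabulary):</p>
      <preformat>
CATEGORIES = [
    "IDEOLOGICAL-INEQUALITY",
    "STEREOTYPING-DOMINANCE",
    "OBJECTIFICATION",
    "SEXUAL-VIOLENCE",
    "MISOGYNY-NON-SEXUAL-VIOLENCE",
]

def multilabel_soft_label(annotations):
    """Per-category probability = share of annotators selecting that category."""
    return {c: sum(c in a for a in annotations) / len(annotations)
            for c in CATEGORIES}

# Hypothetical labels from six annotators for a single tweet.
annotations = [
    {"OBJECTIFICATION"},
    {"OBJECTIFICATION", "SEXUAL-VIOLENCE"},
    {"OBJECTIFICATION"},
    set(),  # this annotator judged the tweet non-sexist
    {"OBJECTIFICATION"},
    {"SEXUAL-VIOLENCE"},
]
print(multilabel_soft_label(annotations))
      </preformat>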
      <p>Both editions adopted the LeWiDi paradigm and conducted both hard and soft evaluations
[18]. The hard evaluation assigned discrete labels to instances and used ICM [17] and F1 as metrics,
while the soft evaluation assessed the model’s ability to capture label disagreement by comparing
probability distributions using ICM-Soft and Cross Entropy.</p>
      <p>The 2023 edition of EXIST focused on textual data from microblogs. The dataset consisted of
over 5,000 tweets in English and Spanish, collected using more than 200 potentially sexist phrases
sourced from academic studies, tweets from journalists and activists reporting sexist incidents,
expressions from Everyday Sexism (https://everydaysexism.com/), and feminist dictionaries. Rather
than relying on majority voting, each tweet was labeled by multiple annotators from diverse
socio-demographic backgrounds, ensuring a broad representation across gender, age, country and
education levels. In both EXIST 2023 and EXIST 2024, all social media data was anonymized to
prevent the identification of individuals, and data collection adhered strictly to the terms of service
of the respective platforms. During annotation, annotators were explicitly warned that the content
might include sensitive material and that they were free to withdraw at any time. Furthermore, we
relied on platforms such as Prolific, which require annotators to comply with ethical guidelines.</p>
      <p>A total of 28 teams from 29 countries submitted 232 runs. For a comprehensive description,
please refer to the overview of the task [18]. Approximately 90% of the systems relied on LLMs.
Only a few teams explored traditional machine learning methods. Several participants applied data
augmentation techniques, such as tweet translation, external datasets, and instance duplication.
Twitter-specific models and transfer learning from related domains, such as hate speech, toxicity,
and sentiment analysis, were also used. For systems submitting soft labels, the best ICM-soft
normalized scores were 0.6421, 0.8072, and 0.7879 for the three tasks, respectively. For systems
submitting hard labels, the best F1 scores were 0.8109, 0.5715, and 0.6296, respectively.</p>
      <p>The 2024 edition of EXIST [19] expanded the task to multimodal content by introducing
memes. The dataset includes both the EXIST 2023 tweet collection and a newly curated set of
memes. To retrieve relevant memes, 250 sexism-related terms were used as search queries on Google
Images, obtaining the top 100 images per term. A manual cleaning process was applied to remove
irrelevant content such as ads, duplicates, textless images, and text-only images. The final dataset
contains 2,000 memes per language for training and 500 memes per language for testing.</p>
      <p>A total of 57 teams submitted 412 runs, marking a significant increase in participation. For a
comprehensive description, please refer to the overview of the task [19]. Most teams relied on
monolingual and multilingual LLMs (BERT, DistilBERT, MarIA, mDeBERTa, RoBERTa, DeBERTa,
LLaMA, and GPT-4). For meme analysis, vision models such as CLIP, BEiT, and ViT were
employed. Some teams integrated linguistic features, while others used data augmentation and
prompt engineering. A small number of participants explored deep learning architectures or
traditional ML methods. As in EXIST 2023, Twitter-specific models were used.</p>
      <p>For systems submitting soft labels, the best ICM-soft normalized scores were 0.6755 (tweets) -
0.4530 (memes), 0.4795 (tweets) - 0.3676 (memes), and 0.4379 (tweets) - 0.2462 (memes) for the three
tasks, respectively. For systems submitting hard labels, the best F1 scores were 0.7944 (tweets) -
0.7642 (memes), 0.5677 (tweets) - 0.3873 (memes), and 0.6004 (tweets) - 0.4319 (memes). As can be
seen, identifying and characterizing sexism in memes seems more difficult than in text.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions and Future Work</title>
      <p>This work shows that incorporating annotation disagreement improves NLP models on subjective
tasks like detecting sexism and stereotypes. Results from DETESTS and EXIST confirm that soft
labeling using annotator distributions yields better performance and richer representations than
majority voting, which can mask interpretive differences and reinforce bias.</p>
      <p>As future work, in EXIST 2025 [20], we will analyze sexist content in TikTok videos, addressing
the influence of short-form platforms on stereotype diffusion. This will introduce new challenges in
annotation, multimodal modeling, and fairness-aware evaluation.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work has been funded by MCIN/AEI/10.13039/501100011033 and ERDF/EU
(PID2021-124361OB-C31/C32/C33).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT in order to check grammar and
spelling. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Card</surname>
          </string-name>
          , S. Gabriel,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>The Risk of Racial Bias in Hate Speech Detection</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the ACL</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1668</fpage>
          -
          <lpage>1678</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          , T. Kleinbauer,
          <article-title>Detection of Abusive Language: The Problem of Biased Datasets</article-title>
          ,
          <source>in: Proceedings of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          (Vol.
          <volume>1</volume>
          ),
          <year>2019</year>
          , pp.
          <fpage>602</fpage>
          -
          <lpage>608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Florio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <article-title>Time of Your Hate: The Challenge of Time in Hate Speech Detection on Social Media</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>10</volume>
          (
          <issue>12</issue>
          ) (
          <year>2020</year>
          )
          <fpage>4180</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Poletto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <article-title>Resources and Benchmark Corpora for Hate Speech Detection: A Systematic Review</article-title>
          ,
          <source>Lang. Resour. Eval</source>
          .
          <volume>55</volume>
          (
          <issue>2</issue>
          ) (
          <year>2021</year>
          )
          <fpage>477</fpage>
          -
          <lpage>523</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F-J.</given-names>
            <surname>Rodrigo-Ginés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <article-title>A Systematic Review on Media Bias Detection: What is Media Bias, How it is Expressed, and How to Detect it</article-title>
          ,
          <source>Expert Syst. Appl</source>
          .
          <volume>237</volume>
          (
          <year>2024</year>
          )
          <fpage>121641</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F-J.</given-names>
            <surname>Rodrigo-Ginés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <article-title>Identifying Media Bias beyond Words: Using Automatic Identification of Persuasive Techniques for Media Bias Detection</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>71</volume>
          (
          <year>2023</year>
          )
          <fpage>179</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>De La Peña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Systematic Keyword and Bias Analyses in Hate Speech Detection</article-title>
          ,
          <source>Inf. Process. Manag.</source>
          <volume>60</volume>
          (
          <issue>5</issue>
          ) (
          <year>2023</year>
          )
          <fpage>103433</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gasparini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saibene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <article-title>Recognizing Misogynous Memes: Biased Models and Tricky Archetypes</article-title>
          ,
          <source>Inf. Process. Manag.</source>
          <volume>60</volume>
          (
          <issue>5</issue>
          ) (
          <year>2023</year>
          )
          <fpage>103474</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Aroyo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Welty</surname>
          </string-name>
          ,
          <article-title>Truth is a Lie: Crowd Truth and the Seven Myths of Human Annotation</article-title>
          ,
          <source>AI Magazine</source>
          <volume>36</volume>
          (
          <issue>1</issue>
          ) (
          <year>2015</year>
          )
          <fpage>15</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Uma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fornaciari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Poesio</surname>
          </string-name>
          ,
          <article-title>Learning from Disagreement: A Survey</article-title>
          .
          <source>J. Artif. Intell. Res</source>
          .
          <volume>72</volume>
          (
          <year>2021</year>
          )
          <fpage>1385</fpage>
          -
          <lpage>1470</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Uma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fornaciari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Poesio</surname>
          </string-name>
          ,
          <article-title>A Case for Soft Loss Functions</article-title>
          ,
          <source>in: Proceedings of the AAAI Conf. on Human Computation and Crowdsourcing</source>
          (Vol.
          <volume>8</volume>
          ),
          <year>2020</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cabitza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Campagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <article-title>Toward a Perspectivist Turn in Ground Truthing for Predictive Computing</article-title>
          ,
          <source>in: Proceedings of the 37th AAAI Conf. on Artificial Intelligence</source>
          , AAAI'23,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nofre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bargiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bonet</surname>
          </string-name>
          ,
          <string-name>
            <surname>NewsCom-TOX</surname>
          </string-name>
          :
          <article-title>A Corpus of Comments on News Articles Annotated for Toxicity in Spanish</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>58</volume>
          (
          <year>2024</year>
          )
          <fpage>1115</fpage>
          -
          <lpage>1155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.S.</given-names>
            <surname>Schmeisser-Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.T.</given-names>
            <surname>Cignarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bourgeade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Frenda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ariza-Casabona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Laurent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.G.</given-names>
            <surname>Cicirelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corbelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Benamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Moriceau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paciello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>D'Errico</surname>
          </string-name>
          ,
          <article-title>StereoHoax: A Multilingual Corpus of Racial Hoaxes and Social Media Reactions Annotated for Stereotypes</article-title>
          ,
          <source>Lang. Resour. Eval.</source>
          <volume>59</volume>
          (
          <year>2025</year>
          )
          <fpage>2031</fpage>
          -
          <lpage>2069</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. Sánchez-Junquera, B. Chulvi, P. Rosso, S.P. Ponzetto, How Do You Speak about Immigrants? Taxonomy and StereoImmigrants Dataset for Identifying Stereotypes about Immigrants, Applied Sciences 11(8) (2021) 3610.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] W.S. Schmeisser-Nieto, P. Pastells, S. Frenda, A. Ariza-Casabona, M. Farrús, P. Rosso, M. Taulé, Overview of DETESTS-Dis at IberLEF 2024: DETEction and classification of racial STereotypes in Spanish - Learning with Disagreement, Procesamiento del Lenguaje Natural 73 (2024) 323-333.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] E. Amigó, A. Delgado, Evaluating Extreme Hierarchical Multi-label Classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), 2022, pp. 5809-5819.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] L. Plaza, J. Carrillo-de-Albornoz, R. Morante, E. Amigó, J. Gonzalo, D. Spina, P. Rosso, Overview of EXIST 2023 - Learning with Disagreement for Sexism Identification and Characterization (Extended Overview), Working Notes of the Conference and Labs of the Evaluation Forum (2023).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] L. Plaza, J. Carrillo-de-Albornoz, V. Ruiz, A. Maeso, B. Chulvi, P. Rosso, E. Amigó, J. Gonzalo, R. Morante, D. Spina, Overview of EXIST 2024 - Learning with Disagreement for Sexism Identification and Characterization in Social Networks and Memes (Extended Overview), Working Notes of the Conference and Labs of the Evaluation Forum (2024).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] L. Plaza, J. Carrillo-de-Albornoz, I. Arcos, P. Rosso, D. Spina, E. Amigó, J. Gonzalo, R. Morante, EXIST 2025: Learning with Disagreement for Sexism Identification and Characterization in Tweets, Memes, and TikTok Videos, in: Proceedings of the European Conference on Information Retrieval, ECIR, 2025, pp. 442-449.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>