<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DH-FBK at HODI: Multi-Task Learning with Classifier Ensemble Agreement, Oversampling and Synthetic Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elisa Leonardelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camilla Casula</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We describe the systems submitted by the DH-FBK team to the HODI shared task, dealing with homotransphobia detection in Italian tweets (Subtask A) and prediction of the textual spans carrying the homotransphobic content (Explainability, Subtask B). We adopt a multi-task approach, developing a model able to solve both tasks at once and learn from different types of information. In our architecture, we fine-tuned an Italian BERT model for detecting homotransphobic content as a classification task and, simultaneously, for locating the homotransphobic spans as a sequence labeling task. We also took into account the subjective nature of the task by artificially estimating the level of agreement among the annotators using a 5-classifier ensemble and incorporating this information in the multi-task setup. Moreover, we experimented with extending the initial training data via oversampling (Run 1) and via generation of synthetic data (Run 2). Our runs achieve competitive results in both tasks. Finally, we conducted a series of additional experiments and a qualitative error analysis.</p>
      </abstract>
      <kwd-group>
<kwd>Multi-task learning</kwd>
        <kwd>data augmentation</kwd>
        <kwd>agreement</kwd>
        <kwd>subjective tasks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Warning: This paper contains examples of potentially offensive content.¹</title>
      <sec id="sec-1-1">
        <title>1. Introduction</title>
          <p>In recent years, social media use has increased globally, with platforms enabling users to post, share and comment about any topic at any time. With the increase of online communication, the proliferation of online hateful comments has become a major problem. Natural language processing (NLP) research is essential for the mitigation of online hate speech, as it can help in understanding the phenomenon and assist in automating the process at a large scale.</p>
          <p>The NLP community has been tackling this problem through the creation of datasets and models, especially focusing on some of the most vulnerable communities, such as migrants [2] or women [3]. The application of automatic methods for detecting hate speech targeting LGBTQIA+ people specifically is a recent development, having been addressed for the first time in English and Tamil [4] and more recently in Locatelli et al. [5].</p>
          <p>The evaluation task for Homotransphobia Detection in Italian (HODI) [6], proposed at <xref ref-type="bibr" rid="ref3">Evalita 2023</xref> [7], aims to explore homotransphobia on Twitter in Italian, taking a deeper look into an issue that has not been adequately addressed in either the global or Italian NLP communities. To this end, the task organizers released a dataset of 6,000 tweets annotated for homophobic and transphobic content (Subtask A) and highlighting the span range expressing it within the sentence (Subtask B), encouraging the development of models able to detect homotransphobic content in an accurate and explainable way.</p>
          <p>In this paper, we present the systems submitted by the DH-FBK team for the two HODI subtasks. Based on the hypothesis that the two layers of annotations provided are highly correlated, and that knowledge sharing will thus help with the completion of each task, we implemented a multi-task architecture, similarly to the ones proposed in Ramponi and Leonardelli [8] and Leonardelli and Casula [9]. This setup allows leveraging training signals of related tasks at the same time by exploiting a shared representation in the model. Specifically, we simultaneously train a model on the two HODI subtasks, addressing Subtask A as a classification task, and the extraction of the spans containing homotransphobic language (Subtask B) as a Sequential Labeling (SL) problem, locating the spans with BIO tags [10]. Importantly, this multi-task approach allows us to develop a unique model for addressing both tasks, and we are one of the two teams who participated in both tasks. Moreover, given the subjectivity of the task, we add an auxiliary task to the multi-task configuration to incorporate information related to annotator agreement. Previous studies have shown that training on data with low agreement between annotators can lead to a decrease in model performance [11]. However, more recent research has shown that this depends on the source of the disagreement and that the level of agreement should still be taken into account when training [12].</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Sep 7 – 8, Parma, IT</title>
      <p>eleonardelli@fbk.eu (E. Leonardelli); ccasula@fbk.eu (C. Casula)</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)</p>
      <p>¹ Profanities have been obfuscated with PrOf (https://github.com/dnozza/profanity-obfuscation) [1].</p>
      <p>Since disaggregated annotations are not accessible to participants in the task, we estimate agreement levels through the use of an ensemble of 5 classifiers, to imitate annotator judgments, similarly to the work conducted in Leonardelli and Casula [9]. Additionally, following the organizers' suggestion to increase the train size of the data, we experimented with different methods for augmenting the training size, i.e. oversampling [13] and data generation [14].</p>
      <p>Our best performing run (Run 1) achieved competitive results, ranking 4th for Subtask A and 2nd for Subtask B. Finally, we discuss the impact of the different elements we combined in our models by conducting a series of additional experiments in Section 5.2, showing the benefits of augmenting training data, especially using oversampling, and showing the relative beneficial impact of the auxiliary task on agreement, which is effective only in combination with oversampling and not with the synthetic data. We then also conduct a qualitative analysis to discover the most difficult cases.</p>
      <p>The metric used for evaluation of Subtask A is macro-F1, while character-based F1 is used for evaluating Subtask B, similarly to Pavlopoulos et al. [15].</p>
      <sec id="sec-4">
        <title>4. Methods</title>
        <sec id="sec-4-1">
          <title>4.1. Multi-task setup</title>
          <p>To exploit the strong correlations between the annotations of Subtasks A and B, we used a multi-task learning setup [16], shown in Figure 1. Our model is trained simultaneously on tasks relative to both levels of annotation and, by utilizing a shared representation, all the available information is accessible to the model. Moreover, the tasks under scrutiny are highly subjective. For instance, we observed some inconsistencies across sentences (for example, articles being included/excluded in spans). To leverage the uncertainty around words that are potentially ambiguous, and given that no information about agreement among annotators was released, we 'artificially' created an agreement label by using the agreement of an ensemble of 5 classifiers. This procedure is described in more detail in Appendix A. In summary, we use three tasks for our multi-task model: two main tasks corresponding to the two annotation levels (and subtasks) of HODI, and an additional auxiliary task relative to synthetic agreement.</p>
        </sec>
      </sec>
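The character-based F1 used for Subtask B can be illustrated with a minimal sketch. This follows our reading of the metric of Pavlopoulos et al. [15] (per-post F1 over sets of annotated character offsets, with empty-vs-empty counting as a perfect match); the function and variable names are ours, not the official HODI scorer.

```python
def char_f1(pred_offsets, gold_offsets):
    """Character-level F1 between predicted and gold span offsets.

    Both arguments are collections of integer character positions
    marked as homotransphobic. Convention assumed from Pavlopoulos
    et al. (2021): if both sets are empty the score is 1, if only
    one is empty the score is 0.
    """
    pred, gold = set(pred_offsets), set(gold_offsets)
    if not pred and not gold:
        return 1.0  # both empty: perfect agreement
    if not pred or not gold:
        return 0.0  # one empty, the other not: no overlap
    overlap = len(pred & gold)
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: gold span covers characters 10-19, prediction covers 10-14,
# so precision = 1.0 and recall = 0.5
print(char_f1(range(10, 15), range(10, 20)))
```

The corpus-level score would then be the average of `char_f1` over all test posts.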
      <sec id="sec-2-1">
        <title>2. The HODI dataset</title>
        <p>The HODI dataset is composed of 6,000 Italian tweets. The tweets have been collected from the 1st of May 2022 to the 31st of August 2022 using a set of 21 keywords associated with language that might potentially target minority groups victims of homotransphobia. Entries are annotated following a two-layer scheme:
1. Homotransphobia detection: whether a tweet contains homotransphobic language or not (binary).
2. Rationales detection (explainability): when a tweet is considered homotransphobic, the span of text that contains the homotransphobic part is highlighted (list of character positions).</p>
        <sec id="sec-2-1-1">
          <title>3. Task description</title>
          <p>The organizers provided participants with the HODI dataset, described in Section 2. 5,000 annotated tweets were released during the first phase of the competition, out of which 2,008 were labeled as homotransphobic. In a second phase, the remaining 1,000 tweets were released unlabeled as test data. The task is divided into two subtasks, reflecting the layers of annotations of the dataset:
• Subtask A - Homotransphobia detection: binary classification, the goal is to predict whether a tweet is homotransphobic or not.
• Subtask B - Explainability: participants are required to predict the spans of a homotransphobic tweet that were responsible for the homotransphobic label of the tweet.</p>
          <p>The three tasks of our multi-task model can be summarized as:
• Homotransphobia (Subtask A): binary classification of homotransphobia.
• Explainability (Subtask B): annotations for this task are released at character level. We convert each sentence from character- to word-level annotation, and associate each word with a label for whether it belongs to the homotransphobic span (Explainability). Moreover, since spans are often comprised of entire phrases, annotations followed sequence labelling, using a BIO tagging scheme [17] in which each word can be at the beginning, inside or outside of a span. After converting the data into this format, Subtask B can be carried out as a sequence labelling prediction task.
• Agreement on Subtask B: it is addressed as sequence labelling at word level. The label can assume values between [0-5], reflecting how many classifiers of the 5-classifier ensemble, described in Appendix A, predict a specific word in agreement with the gold label of Subtask B.</p>
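The character-to-word conversion into BIO tags can be sketched as follows. This is a simplified illustration assuming whitespace tokenization; the function name and the convention that a word is in-span if any of its characters is annotated are ours.

```python
def to_bio(text, char_offsets):
    """Convert character-level span annotations to word-level BIO tags.

    `char_offsets` is the set of character positions annotated as part
    of a homotransphobic span. A word is treated as inside a span if
    any of its characters is annotated; a word starting a span gets
    'B', continuations get 'I', and all other words get 'O'.
    """
    offsets = set(char_offsets)
    tags, pos, prev_in_span = [], 0, False
    words = text.split()
    for word in words:
        start = text.index(word, pos)  # locate the word in the raw text
        end = start + len(word)
        pos = end
        in_span = any(i in offsets for i in range(start, end))
        if not in_span:
            tags.append("O")
        elif prev_in_span:
            tags.append("I")
        else:
            tags.append("B")
        prev_in_span = in_span
    return list(zip(words, tags))

# Characters 8-18 cover "awful words" in this toy example
print(to_bio("this is awful words here", set(range(8, 19))))
```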
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Synthetic Data</title>
        <p>The use of synthetic data has been proposed as a method to increase the amount of available training data for hate speech and offensive language detection tasks, especially when relying on machine-generated data [18, 19]. Although data augmentation using generative models has been found to not always be reliable in improving models [14], we aim at exploring whether it can help the performance of our models for the HODI task.</p>
        <p>A widely used method of generating synthetic data consists in fine-tuning a generative model on annotated data and then using it for generating new sequences. These generated sequences are then passed through a classifier in order to confirm the label assignment made by the generator, since generative models are not always reliable in their label assignment [20].</p>
        <p>While the majority of works that exploit model-generated data for the detection of offensive language have no particular focus on any target category or phenomenon, our experiments are focused on specifically detecting homotransphobia. Because of this, the generated texts should be correct regarding both the label and the focus on the phenomenon. In part due to this, and in part due to the limited availability of generative large language models for Italian, we decided to generate new data using an encoder-decoder model trained on Italian, IT5 [21], in its 738M-parameter (large) configuration. The details of our data augmentation process can be found in Appendix B.</p>
        <p>Given that the augmentation process provides us with synthetic examples annotated for Subtask A (Homotransphobia detection), but not for Subtask B (detection of rationales), we additionally estimate Subtask B labels for the generated instances, using the model of the first submitted run (generated data were used only in Run 2), while the agreement for the auxiliary task was estimated using the ensemble classifier described in Appendix A.</p>
        <sec id="sec-4-3">
          <title>4.3. Experimental Setup</title>
          <p>The models developed for the two runs submitted by our team are both based on a pre-trained Italian BERT². For fine-tuning the models in the multi-task setup described in Figure 1, we employed the MaChAmp v0.2 toolkit [22], a tool that supports a variety of standard NLP tasks out-of-the-box, also in a multi-task setup. We employed the pre-trained BERT as our shared encoder for all tasks, while a separate decoder is utilized by each task. We fine-tuned the model (110M parameters) for 15 epochs on a single GPU³, using default MaChAmp hyperparameters⁴. For the training process, we assign each class equal weight to guarantee minority classes are not underrepresented. We also introduced loss weights for the multi-task learning loss, calculated as L = Σ_t λ_t L_t, where L_t is the loss for task t and λ_t the respective weighting parameter. We set λ_t = 0.8 for the primary tasks, and λ_t = 0.5 for the auxiliary task.</p>
        </sec>
        <sec id="sec-4-4">
          <title>4.4. Submitted Systems Description</title>
          <p>For the competition, we submitted two different runs with predictions by models created using the same setup described in Section 4.3 and Figure 1, but trained on different sets of data. Starting from the suggestion from the organizers to augment the size of the training set, we experimented with oversampling and data generation in the following way:
• Run 1: the data made available from the organizers are oversampled by repeating them twice.</p>
        </sec>
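The weighted multi-task loss described above (a weighted sum of per-task losses, with weight 0.8 for the two main tasks and 0.5 for the auxiliary agreement task) can be illustrated with a small sketch. The task names and loss values here are illustrative stand-ins, not MaChAmp internals.

```python
# Illustrative recombination of per-task losses into one training loss,
# mirroring L = sum_t (lambda_t * L_t) with the weights from the paper.
TASK_WEIGHTS = {
    "subtask_a_classification": 0.8,     # main task: homotransphobia detection
    "subtask_b_sequence_labeling": 0.8,  # main task: span BIO tagging
    "agreement_auxiliary": 0.5,          # auxiliary task: ensemble agreement
}

def multitask_loss(task_losses):
    """Weighted sum of per-task losses, L = sum_t lambda_t * L_t."""
    return sum(TASK_WEIGHTS[task] * loss for task, loss in task_losses.items())

# Toy per-batch loss values for the three tasks
losses = {
    "subtask_a_classification": 0.40,
    "subtask_b_sequence_labeling": 0.90,
    "agreement_auxiliary": 0.60,
}
print(round(multitask_loss(losses), 2))  # 0.8*0.4 + 0.8*0.9 + 0.5*0.6 = 1.34
```

Down-weighting the auxiliary task keeps the agreement signal from dominating the gradients of the two main objectives.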
      </sec>
      <sec id="sec-2-3">
        <title>Notes</title>
        <p>² dbmdz/bert-base-italian-cased. ³ NVIDIA Titan Xp. ⁴ Default MaChAmp hyperparameter settings were used for all our experiments: optimizer AdamW; β1, β2 = 0.9; dropout = 0.2; batch size = 32; learning rate = 0.0001.</p>
        <p>• Run 2: in addition to oversampling the HODI data, similarly to Run 1, we add 4,000 synthetically generated examples (see Section 4.2).</p>
        <p>We split the HODI data into a 90% training set and a 10% development set. For Run 2, the synthetic examples were added to the training set.</p>
        <sec id="sec-5">
          <title>5. Evaluation</title>
          <sec id="sec-5-1">
            <title>5.1. Results</title>
            <p>Table 1 shows the official results of our submissions for Subtasks A and B. All runs for both tasks beat the organizers' baseline.</p>
            <p>For Subtask A we report the macro-averaged F1 score and overall rank of our runs, as well as those of the teams who performed better than us and the baseline. Our best performance (Run 1) obtained a macro F1 score of 0.795, ranking 4th out of 18 submitted runs (3rd out of 8 teams), while Run 2 ranked 7th out of 18 submissions (4th out of 8 teams).</p>
            <p>For Subtask B we report the overall ranking, given that the leaderboard is short and only one other team participated in Subtask B. One run of the other team participating in this task beat our result, while our best scoring run (Run 1) ranked 2nd.</p>
          </sec>
          <sec id="sec-5-2">
            <title>5.2. Additional experiments</title>
            <p>Regarding the impact of generated data, when adding the synthetic data in the training (Run 2) performance decreases in both tasks, showing that the augmentation with generated data does not improve the generalization of models compared to oversampling. In fact, we hypothesize that the addition of synthetic data might push models to be over-reliant on specific identity terms or profanities, hurting their generalization capabilities, a phenomenon that has been observed in data augmentation using generative models [14]. Moreover, to dissect the impact of oversampling and the impact of the auxiliary task, we run a series of additional experiments⁵. Results are shown in Table 2. To evaluate the role of oversampling, we replicate the setup of the two submitted runs but omit the oversampling of the HODI data from the training (Exp 1 and Exp 2). By comparing results (Run 1 vs Exp 1 and Run 2 vs Exp 2), we can observe that oversampling the data is generally beneficial, especially if no synthetic data are used. Moreover, in Exp 3 and Exp 4 we replicate the submitted experiments but exclude the auxiliary task. By comparing Run 1 and Exp 3, we can observe that in this case the auxiliary task is indeed beneficial, while it is not when comparing Run 2 and Exp 4, where synthetic examples are part of the training data. This suggests that the estimation of agreement for generated data might not be informative.</p>
          </sec>
          <sec id="sec-5-3">
            <title>5.3. Qualitative Error Analysis - Subtask A</title>
            <p>To perform a qualitative analysis on the most problematic tweets, we isolated the tweets that were incorrectly classified by all models in Table 2. The most consistent false negative regards the missed detection of tweets containing a specific offensive slang word (f*mminiello). One possible reason is that this word is not generally common (as it belongs to a local language variety), and it was not present in the training set. Observing the posts incorrectly classified as homotransphobic by the models, we identified (doubtful) sense of humour or metaphorical expressions (andare a fare in culo, essere fr*cio col culo degli altri) as possible reasons. Another possible reason could also be over-reliance on specific terms.</p>
          </sec>
        </sec>
        <sec id="sec-2-3-1">
          <title>6. Conclusions</title>
          <p>We described our participation in the HODI evaluation task at <xref ref-type="bibr" rid="ref3">Evalita 2023</xref>. We used a multi-task learning approach to share representations between the two tasks involved and, additionally, considering the subjectivity of the task, we incorporated inter-annotator agreement information into our framework, estimating it with a 5-classifier ensemble. We experimented with augmenting the available training data by oversampling and via generated data. We were one of the few teams who participated in both tasks, and our systems performed competitively.</p>
          <p>Moreover, we conducted an analysis on the role and impact of the various aspects we combined. Our results show oversampling is generally beneficial, especially when combined with the auxiliary task on agreement.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
      <p>⁵ The organizers released the labels for the test set after the closing of the evaluation phase.</p>
        <sec id="sec-2-4-1">
          <title>Acknowledgments</title>
          <p>The usage of generated data instead has limited benefits compared to oversampling or additional auxiliary tasks. Finally, performing a qualitative analysis on the most frequent causes of error, we identified specific homotransphobic slang terms that were problematic for our models to identify.</p>
          <p>The work of Elisa Leonardelli was partially funded by the StandByMe European project (REC-RDAP-GBV-AG2020) on "Stop online violence against women and girls by changing attitudes and behaviour of young people through human rights education" (GA 101005641). Her research was also supported by the StandByMe 2.0 project (CERV-2021-DAPHNE) on "Stop gender-based violence by addressing masculinities and changing behaviour of young people through human rights education" (GA 101049386).</p>
          <p>[11] E. Leonardelli, S. Menini, A. Palmero Aprosio, M. Guerini, S. Tonelli, Agreeing to disagree: Annotating offensive language datasets with annotators' disagreement, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 10528–10539. URL: https://aclanthology.org/2021.emnlp-main.822. doi:10.18653/v1/2021.emnlp-main.822.
[12] M. Sandri, E. Leonardelli, S. Tonelli, E. Jezek, Why don't you do it right? Analysing annotators' disagreement in subjective tasks, in: Proceedings of the 2023 Conference of the European Chapter of the Association for Computational Linguistics, 2023.
[13] E. Leonardelli, S. Menini, S. Tonelli, DH-FBK@HaSpeeDe2: Italian hate speech detection via self-training and oversampling, in: Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), volume 2765, 2020.
[14] C. Casula, S. Tonelli, Generation-based data augmentation for offensive language detection: Is it worth it?, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 3359–3377. URL: https://aclanthology.org/2023.eacl-main.244.
[15] J. Pavlopoulos, J. Sorensen, L. Laugier, I. Androutsopoulos, SemEval-2021 task 5: Toxic spans detection, in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), 2021, pp. 59–69.
[16] R. Caruana, Multitask learning, Machine Learning 28 (1997) 41–75.
[17] J. Lafferty, A. McCallum, F. C. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001).
[18] T. Wullach, A. Adler, E. Minkov, Fight Fire with Fire: Fine-tuning Hate Detectors using Large Samples of Generated Hate Speech, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 4699–4705. URL: https://aclanthology.org/2021.findings-emnlp.402.
[19] T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, E. Kamar, ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3309–3326. URL: https://aclanthology.org/2022.acl-long.234. doi:10.18653/v1/2022.acl-long.234.
[20] A. Anaby-Tavor, B. Carmeli, E. Goldbraich, A. Kantor, G. Kour, S. Shlomov, N. Tepper, N. Zwerdling, Do Not Have Enough Data? Deep Learning to the Rescue!, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 7383–7390. doi:10.1609/aaai.v34i05.6233.
[21] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, arXiv preprint 2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.
[22] R. van der Goot, A. Üstün, A. Ramponi, I. Sharaf, B. Plank, Massive choice, ample tasks (MaChAmp): A toolkit for multi-task learning in NLP, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2021, pp. 176–197. URL: https://aclanthology.org/2021.eacl-demos.22. doi:10.18653/v1/2021.eacl-demos.22.</p>
          <sec id="sec-app">
            <title>Appendix</title>
            <sec id="sec-app-a">
              <title>A. Ensemble agreement</title>
              <p>For posts (of the training set) that were annotated as homotransphobic, we aim at obtaining an approximation of the agreement level on each word of the post, as being considered part of the span is correlated with labeling the post as homotransphobic. This information is then exploited as additional information in our multi-task training setup, specifically as an extension to the sequence labelling prediction of Subtask B.</p>
              <p>We split the training data D provided by the HODI organizers in 5 folds f1, f2, ..., f5, creating 5 separate train/validation splits, being careful that each item of the training data appears in the validation set of exactly one fold. We employ an ensemble of classifiers, a method first suggested by Leonardelli et al. [11], where each classifier of the ensemble is trained using slightly different configurations by varying the initial conditions, such as the initial seed and the number of epochs, so that the 5 classifiers produce similar but not identical predictions. The classifiers are produced in the multi-task setup shown in Figure 1, but without the auxiliary task on agreement. In this manner, we have ensemble predictions for each of the entries of the training data. Based on the predictions of the classifiers, we assign ensemble agreement labels to the validation set (at a word level) of the current fold, based on how many classifiers agree with the actual gold annotation. The ensemble agreement label is thus a number between 0 and 5. We consider this information a proxy for an item's difficulty and annotators' disagreement.</p>
            </sec>
            <sec id="sec-app-b">
              <title>B. Data augmentation</title>
              <p>The pipeline we follow for augmenting the available data for the task is as follows:
1. We fine-tune a classifier (in our case a BERT-base model trained on Italian²) on the HODI training data.
2. We fine-tune IT5-Large on the same training data, formatting the task so that the input is 'Scrivi un tweet:' or 'Scrivi un tweet omotransfobico:' ('Write a tweet:' or 'Write a homotransphobic tweet:') depending on the gold label of each example, and the output is the actual post.
3. We use the fine-tuned IT5 model to generate new data, using the same type of input we use in Step 2.
4. We filter the generated data using the fine-tuned classifier from Step 1, keeping only the examples for which the label assignment is the same for the classifier and the generator [20, 14]. We additionally remove duplicates and normalize URLs as URL.
5. We rank generated examples based on the confidence of the classification model we used for filtering, retaining the top 2,000 examples for each class. This number is chosen in order to ideally double the size of the dataset, and we use generated examples that are equally split among the labels so as to artificially mitigate the class imbalance.</p>
            </sec>
          </sec>
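The word-level ensemble agreement labels of Appendix A (how many of the 5 classifiers agree with the gold BIO label at each word) can be sketched as follows; the function and variable names are ours.

```python
def agreement_labels(gold_tags, ensemble_predictions):
    """Word-level ensemble agreement labels in [0, 5].

    `gold_tags` is the gold BIO sequence for a post;
    `ensemble_predictions` is a list of 5 BIO sequences, one per
    classifier in the ensemble. Each word receives the number of
    classifiers whose prediction matches the gold tag at that position.
    """
    labels = []
    for i, gold in enumerate(gold_tags):
        agree = sum(1 for preds in ensemble_predictions if preds[i] == gold)
        labels.append(agree)
    return labels

gold = ["O", "B", "I", "O"]
ensemble = [
    ["O", "B", "I", "O"],  # fully agrees with gold
    ["O", "B", "O", "O"],  # disagrees on the third word
    ["O", "B", "I", "O"],
    ["B", "B", "I", "O"],  # disagrees on the first word
    ["O", "O", "O", "O"],  # misses the span entirely
]
print(agreement_labels(gold, ensemble))  # [4, 4, 3, 5]
```

Low values flag the ambiguous words; the labels then feed the auxiliary sequence-labelling task of the multi-task setup.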
        </sec>
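The generate-then-filter pipeline of Appendix B can be summarized in a runnable sketch. The `generator` and `classifier` callables are stand-ins for the fine-tuned IT5 and BERT filtering models (they are not actual model APIs), and URL normalization is omitted.

```python
def augment(generator, classifier, n_to_generate, keep_per_class=2000):
    """Sketch of the Appendix B pipeline: generate candidates with a
    prompt matching the desired label, drop duplicates, keep only
    candidates whose label the filtering classifier confirms, and
    retain the most confident examples per class."""
    prompts = {
        0: "Scrivi un tweet:",                 # non-homotransphobic
        1: "Scrivi un tweet omotransfobico:",  # homotransphobic
    }
    kept = {0: [], 1: []}
    seen = set()
    for label, prompt in prompts.items():
        for _ in range(n_to_generate):
            text = generator(prompt)      # stand-in for the fine-tuned IT5
            if text in seen:              # remove duplicates
                continue
            seen.add(text)
            pred_label, confidence = classifier(text)  # stand-in for the BERT filter
            if pred_label == label:       # generator and classifier agree
                kept[label].append((confidence, text))
    # rank by classifier confidence, retain the top examples per class
    return {
        label: [text for _, text in sorted(pairs, reverse=True)[:keep_per_class]]
        for label, pairs in kept.items()
    }
```

Keeping an equal number of examples per class mirrors the paper's choice of using generated data to mitigate the class imbalance.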
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>preprint arXiv:2109.00227</source>
          (
          <year>2021</year>
          ). [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Locatelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Damo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <article-title>A cross-lingual</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>siderations in NLP (C3NLP)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>24</lpage>
          . [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Cignarella</surname>
          </string-name>
          , G. Damo, T. Caselli,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          , HODI at EVALITA 2023:
          <article-title>Overview of</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Italian</surname>
          </string-name>
          . Final Workshop (EVALITA
          <year>2023</year>
          ), CEUR.org,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Parma</surname>
          </string-name>
          , Italy,
          <year>2023</year>
          . [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          , R. Sprug-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>noli</surname>
          </string-name>
          , G. Venturi,
          <year>Evalita 2023</year>
          :
          <article-title>Overview of the 8th</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Workshop</surname>
          </string-name>
          (EVALITA
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <year>2023</year>
          . [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          , E. Leonardelli, Dh-fbk at semeval[1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>The state of profanity obfusca- 2022 task 4: leveraging annotators' disagreement</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>lications, in: Findings of the Association for Com- detection</article-title>
          , in
          <source>: Proceedings of the 16th International</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>putational Linguistics: ACL</source>
          <year>2023</year>
          , Association for Workshop on Semantic Evaluation (SemEval-2022),
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Computational</given-names>
            <surname>Linguistics</surname>
          </string-name>
          ,
          <year>2023</year>
          .
          <year>2022</year>
          , pp.
          <fpage>324</fpage>
          -
          <lpage>334</lpage>
          . [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bourgeade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Cignarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Frenda</surname>
          </string-name>
          , M. Lau- [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Leonardelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Casula</surname>
          </string-name>
          , Dh-fbk at semeval-2023
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>rent</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Schmeisser-Nieto</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Benamara</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Bosco, task 10: Multi-task learning with classifier ensem-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>dataset of racial stereotypes in social media conver- of the 17th</article-title>
          <source>International Workshop on Semantic</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Computational</surname>
            <given-names>Linguistics: EACL</given-names>
          </string-name>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>tics</fpage>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>1894</fpage>
          -
          <lpage>1905</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          674-
          <fpage>684</fpage>
          . https://aclanthology.org/
          <year>2023</year>
          .semeval-
          <volume>1</volume>
          .
          <fpage>261</fpage>
          . [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Anzovino</surname>
          </string-name>
          , Overview of [10]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>the task on automatic misogyny identification at</article-title>
          R. Xu, Hitsz-hlt at semeval-
          <source>2021 task 5:</source>
          Ensem-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>ibereval</surname>
          </string-name>
          <year>2018</year>
          ., Ibereval@ sepln
          <volume>2150</volume>
          (
          <year>2018</year>
          ) 214
          <article-title>- ble sequence labeling and span boundary detection</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          228.
          <article-title>for toxic span detection</article-title>
          , in: Proceedings of the [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          , R. Pon- 15th international workshop on semantic evalua-
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>nusamy</surname>
            ,
            <given-names>P. K.</given-names>
          </string-name>
          <string-name>
            <surname>Kumaresan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Sampath</surname>
          </string-name>
          , D. Then- tion (
          <issue>SemEval-2021</issue>
          ),
          <year>2021</year>
          , pp.
          <fpage>521</fpage>
          -
          <lpage>526</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>