<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DH-FBK at HODI: Multi-Task Learning with Classifier Ensemble Agreement, Oversampling and Synthetic Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elisa Leonardelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camilla Casula</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We describe the systems submitted by the DH-FBK team to the HODI shared task, dealing with homotransphobia detection in Italian tweets (Subtask A) and prediction of the textual spans carrying the homotransphobic content (Explainability, Subtask B). We adopt a multi-task approach, developing a model able to solve both tasks at once and learn from different types of information. In our architecture, we fine-tuned an Italian BERT model for detecting homotransphobic content as a classification task and, simultaneously, for locating the homotransphobic spans as a sequence labeling task. We also took into account the subjective nature of the task by artificially estimating the level of agreement among the annotators using a 5-classifier ensemble and incorporating this information in the multi-task setup. Moreover, we experimented with extending the initial training data via oversampling (Run 1) and via generation of synthetic data (Run 2). Our runs achieve competitive results in both tasks. Finally, we conducted a series of additional experiments and a qualitative error analysis.</p>
      </abstract>
      <kwd-group>
<kwd>Multi-task learning</kwd>
        <kwd>data augmentation</kwd>
        <kwd>agreement</kwd>
        <kwd>subjective tasks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Warning: This paper contains examples of potentially offensive content.¹</title>
      <sec id="sec-1-1">
        <title>1. Introduction</title>
          <p>In recent years, social media use has increased globally, with platforms enabling users to post, share and comment about any topic at any time. With the increase of online communication, the proliferation of online hateful comments has become a major problem. Natural language processing (NLP) research is essential for the mitigation of online hate speech, as it can help in understanding the phenomenon and assist in automating the process at a large scale.</p>
          <p>The NLP community has been tackling this problem through the creation of datasets and models, especially focusing on some of the most vulnerable communities, such as migrants [2] or women [3]. The application of automatic methods for detecting hate speech targeting LGBTQIA+ people specifically is a recent development, having been addressed for the first time in English and Tamil [4] and more recently in Locatelli et al. [5].</p>
          <p>The evaluation task for Homotransphobia Detection in Italian (HODI) [6], proposed at <xref ref-type="bibr" rid="ref3">Evalita 2023</xref> [7], aims to explore homotransphobia on Twitter in Italian, taking a deeper look into an issue that has not been adequately addressed in either the global or Italian NLP communities. To this end, the task organizers released a dataset of 6,000 tweets annotated for homophobic and transphobic content (Subtask A) and highlighting the span range expressing it within the sentence (Subtask B), encouraging the development of models able to detect homotransphobic content in an accurate and explainable way.</p>
          <p>In this paper, we present the systems submitted by the DH-FBK team for the two HODI subtasks. Based on the hypothesis that the two layers of annotations provided are highly correlated, and that knowledge sharing will thus help with the completion of each task, we implemented a multi-task architecture, similarly to the ones proposed in Ramponi and Leonardelli [8] and Leonardelli and Casula [9]. This setup allows leveraging training signals of related tasks at the same time by exploiting a shared representation in the model. Specifically, we simultaneously train a model on the two HODI subtasks, addressing Subtask A as a classification task, and the extraction of the spans containing homotransphobic language (Subtask B) as a Sequential Labeling (SL) problem, locating the spans with BIO tags [10]. Importantly, this multi-task approach allows us to develop a unique model for addressing both tasks, and we are one of the two teams who participated in both tasks. Moreover, given the subjectivity of the task, we add an auxiliary task to the multi-task configuration to incorporate information related to annotator agreement. Previous studies have shown that training on data with low agreement between annotators can lead to a decrease in model performance [11]. However, more recent research has shown that this depends on the source of the disagreement and that the level of agreement should still be taken into account when training [12].</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Sep 7 – 8, Parma, IT</title>
      <p>eleonardelli@fbk.eu (E. Leonardelli); ccasula@fbk.eu (C. Casula)</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)</p>
      <p>¹ Profanities have been obfuscated with PrOf (https://github.com/dnozza/profanity-obfuscation) [1].</p>
      <p>Since disaggregated annotations are not accessible to participants in the task, we estimate agreement levels through the use of an ensemble of 5 classifiers, to imitate annotator judgments, similarly to the work conducted in Leonardelli and Casula [9]. Additionally, following the organizers' suggestion to increase the train size of the data, we experimented with different methods for augmenting the training size, i.e. oversampling [13] and data generation [14].</p>
      <p>Our best performing run (Run 1) achieved competitive results, ranking 4th for Subtask A and 2nd for Subtask B. Finally, we discuss the impact of the different elements we combined in our models by conducting a series of additional experiments in Section 5.2, showing the benefits of augmenting training data, especially using oversampling, and showing the relative beneficial impact of the auxiliary task on agreement, which is effective only in combination with oversampling and not with the synthetic data. We then also conduct a qualitative analysis to discover the most difficult cases.</p>
      <p>The metric used for evaluation of Subtask A is macro-F1, while character-based F1 is used for evaluating Subtask B, similarly to Pavlopoulos et al. [15].</p>
      <sec id="sec-4">
        <title>4. Methods</title>
        <sec id="sec-4-1">
          <title>4.1. Multi-task setup</title>
          <p>To exploit the strong correlations between the annotations of Subtasks A and B, we used a multi-task learning setup [16], shown in Figure 1. Our model is trained simultaneously on tasks relative to both levels of annotation and, by utilizing a shared representation, all the available information is accessible to the model. Moreover, the tasks under scrutiny are highly subjective. For instance, we observed some inconsistencies across sentences (for example, articles being included/excluded in spans). To leverage the uncertainty around words that are potentially ambiguous, and given that no information about agreement among annotators was released, we 'artificially' created an agreement label by using the agreement of an ensemble of 5 classifiers. This procedure is described in more detail in Appendix A. In summary, we use three tasks for our multi-task model: two main tasks corresponding to the two annotation levels (and subtasks) of HODI, and an additional auxiliary task relative to synthetic agreement.</p>
        </sec>
      </sec>
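The character-based F1 used for Subtask B can be illustrated with a minimal sketch. This follows our reading of the metric of Pavlopoulos et al. [15] (per-post F1 over sets of annotated character offsets, with empty-vs-empty counting as a perfect match); the function and variable names are ours, not the official HODI scorer.

```python
def char_f1(pred_offsets, gold_offsets):
    """Character-level F1 between predicted and gold span offsets.

    Both arguments are collections of integer character positions
    marked as homotransphobic. Convention assumed from Pavlopoulos
    et al. (2021): if both sets are empty the score is 1, if only
    one is empty the score is 0.
    """
    pred, gold = set(pred_offsets), set(gold_offsets)
    if not pred and not gold:
        return 1.0  # both empty: perfect agreement
    if not pred or not gold:
        return 0.0  # one empty, the other not: no overlap
    overlap = len(pred & gold)
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: gold span covers characters 10-19, prediction covers 10-14,
# so precision = 1.0 and recall = 0.5
print(char_f1(range(10, 15), range(10, 20)))
```

The corpus-level score would then be the average of `char_f1` over all test posts.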
      <sec id="sec-2-1">
        <title>2. The HODI dataset</title>
        <p>The HODI dataset is composed of 6,000 Italian tweets. The tweets have been collected from the 1st of May 2022 to the 31st of August 2022 using a set of 21 keywords associated with language that might potentially target minority groups victims of homotransphobia. Entries are annotated following a two-layer scheme:
1. Homotransphobia detection: whether a tweet contains homotransphobic language or not (binary).
2. Rationales detection (explainability): when a tweet is considered homotransphobic, the span of text that contains the homotransphobic part is highlighted (list of character positions).</p>
        <sec id="sec-2-1-1">
          <title>3. Task description</title>
          <p>The organizers provided participants with the HODI dataset, described in Section 2. 5,000 annotated tweets were released during the first phase of the competition, out of which 2,008 were labeled as homotransphobic. In a second phase, the remaining 1,000 tweets were released unlabeled as test data. The task is divided into two subtasks, reflecting the layers of annotations of the dataset:
• Subtask A - Homotransphobia detection: binary classification, the goal is to predict whether a tweet is homotransphobic or not.
• Subtask B - Explainability: participants are required to predict the spans of a homotransphobic tweet that were responsible for the homotransphobic label of the tweet.</p>
          <p>The three tasks of our multi-task model can be summarized as:
• Homotransphobia (Subtask A): binary classification of homotransphobia.
• Explainability (Subtask B): annotations for this task are released at character level. We convert each sentence from character- to word-level annotation, and associate each word with a label for whether it belongs to the homotransphobic span (Explainability). Moreover, since spans are often comprised of entire phrases, annotations followed sequence labelling, using a BIO tagging scheme [17] in which each word can be at the beginning, inside or outside of a span. After converting the data into this format, Subtask B can be carried out as a sequence labelling prediction task.
• Agreement on Subtask B: it is addressed as sequence labelling at word level. The label can assume values between [0-5], reflecting how many classifiers of the 5-classifier ensemble, described in Appendix A, predict a specific word in agreement with the gold label of Subtask B.</p>
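The character-to-word conversion into BIO tags can be sketched as follows. This is a simplified illustration assuming whitespace tokenization; the function name and the convention that a word is in-span if any of its characters is annotated are ours.

```python
def to_bio(text, char_offsets):
    """Convert character-level span annotations to word-level BIO tags.

    `char_offsets` is the set of character positions annotated as part
    of a homotransphobic span. A word is treated as inside a span if
    any of its characters is annotated; a word starting a span gets
    'B', continuations get 'I', and all other words get 'O'.
    """
    offsets = set(char_offsets)
    tags, pos, prev_in_span = [], 0, False
    words = text.split()
    for word in words:
        start = text.index(word, pos)  # locate the word in the raw text
        end = start + len(word)
        pos = end
        in_span = any(i in offsets for i in range(start, end))
        if not in_span:
            tags.append("O")
        elif prev_in_span:
            tags.append("I")
        else:
            tags.append("B")
        prev_in_span = in_span
    return list(zip(words, tags))

# Characters 8-18 cover "awful words" in this toy example
print(to_bio("this is awful words here", set(range(8, 19))))
```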
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Synthetic Data</title>
        <p>The use of synthetic data has been proposed as a method to increase the amount of available training data for hate speech and offensive language detection tasks, especially when relying on machine-generated data [18, 19]. Although data augmentation using generative models has been found to not always be reliable in improving models [14], we aim at exploring whether it can help the performance of our models for the HODI task.</p>
        <p>A widely used method of generating synthetic data consists in fine-tuning a generative model on annotated data and then using it for generating new sequences. These generated sequences are then passed through a classifier in order to confirm the label assignment made by the generator, since generative models are not always reliable in their label assignment [20].</p>
        <p>While the majority of works that exploit model-generated data for the detection of offensive language have no particular focus on any target category or phenomenon, our experiments are focused on specifically detecting homotransphobia. Because of this, the generated texts should be correct regarding both the label and the focus on the phenomenon. In part due to this, and in part due to the limited availability of generative large language models for Italian, we decided to generate new data using an encoder-decoder model trained on Italian, IT5 [21], in its 738M-parameter (large) configuration. The details of our data augmentation process can be found in Appendix B.</p>
        <p>Given that the augmentation process provides us with synthetic examples annotated for Subtask A (Homotransphobia detection), but not for Subtask B (detection of rationales), we additionally estimate Subtask B labels for the generated instances, using the model of the first submitted run (generated data were used only in Run 2), while the agreement for the auxiliary task was estimated using the ensemble classifier described in Appendix A.</p>
        <sec id="sec-4-3">
          <title>4.3. Experimental Setup</title>
          <p>The models developed for the two runs submitted by our team are both based on a pre-trained Italian BERT². For fine-tuning the models in the multi-task setup described in Figure 1, we employed the MaChAmp v0.2 toolkit [22], a tool that supports a variety of standard NLP tasks out-of-the-box, also in a multi-task setup. We employed the pre-trained BERT as our shared encoder for all tasks, while a separate decoder is utilized by each task. We fine-tuned the model (110M parameters) for 15 epochs on a single GPU³, using default MaChAmp hyperparameters⁴. For the training process, we assign each class equal weight to guarantee minority classes are not underrepresented. We also introduced loss weights for the multi-task learning loss, calculated as L = Σ_t λ_t L_t, where L_t is the loss for task t and λ_t the respective weighting parameter. We set λ_t = 0.8 for the primary tasks, and λ_t = 0.5 for the auxiliary task.</p>
        </sec>
        <sec id="sec-4-4">
          <title>4.4. Submitted Systems Description</title>
          <p>For the competition, we submitted two different runs with predictions by models created using the same setup described in Section 4.3 and Figure 1, but trained on different sets of data. Starting from the suggestion from the organizers to augment the size of the training set, we experimented with oversampling and data generation in the following way:
• Run 1: the data made available from the organizers are oversampled by repeating them twice.</p>
        </sec>
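The weighted multi-task loss described above (a weighted sum of per-task losses, with weight 0.8 for the two main tasks and 0.5 for the auxiliary agreement task) can be illustrated with a small sketch. The task names and loss values here are illustrative stand-ins, not MaChAmp internals.

```python
# Illustrative recombination of per-task losses into one training loss,
# mirroring L = sum_t (lambda_t * L_t) with the weights from the paper.
TASK_WEIGHTS = {
    "subtask_a_classification": 0.8,     # main task: homotransphobia detection
    "subtask_b_sequence_labeling": 0.8,  # main task: span BIO tagging
    "agreement_auxiliary": 0.5,          # auxiliary task: ensemble agreement
}

def multitask_loss(task_losses):
    """Weighted sum of per-task losses, L = sum_t lambda_t * L_t."""
    return sum(TASK_WEIGHTS[task] * loss for task, loss in task_losses.items())

# Toy per-batch loss values for the three tasks
losses = {
    "subtask_a_classification": 0.40,
    "subtask_b_sequence_labeling": 0.90,
    "agreement_auxiliary": 0.60,
}
print(round(multitask_loss(losses), 2))  # 0.8*0.4 + 0.8*0.9 + 0.5*0.6 = 1.34
```

Down-weighting the auxiliary task keeps the agreement signal from dominating the gradients of the two main objectives.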
      </sec>
      <sec id="sec-2-3">
        <title>Notes</title>
        <p>² dbmdz/bert-base-italian-cased. ³ NVIDIA Titan Xp. ⁴ Default MaChAmp hyperparameter settings were used for all our experiments: optimizer AdamW; β1, β2 = 0.9; dropout = 0.2; batch size = 32; learning rate = 0.0001.</p>
        <p>• Run 2: in addition to oversampling the HODI data, similarly to Run 1, we add 4,000 synthetically generated examples (see Section 4.2).</p>
        <p>We split the HODI data into a 90% training set and a 10% development set. For Run 2, the synthetic examples were added to the training set.</p>
        <sec id="sec-5">
          <title>5. Evaluation</title>
          <sec id="sec-5-1">
            <title>5.1. Results</title>
            <p>Table 1 shows the official results of our submissions for Subtasks A and B. All runs for both tasks beat the organizers' baseline.</p>
            <p>For Subtask A we report the macro-averaged F1 score and overall rank of our runs, as well as those of the teams who performed better than us and the baseline. Our best performance (Run 1) obtained a macro F1 score of 0.795, ranking 4th out of 18 submitted runs (3rd out of 8 teams), while Run 2 ranked 7th out of 18 submissions (4th out of 8 teams).</p>
            <p>For Subtask B we report the overall ranking, given that the leaderboard is short and only one other team participated in Subtask B. One run of the other team participating in this task beat our result, while our best scoring run (Run 1) ranked 2nd.</p>
          </sec>
          <sec id="sec-5-2">
            <title>5.2. Additional experiments</title>
            <p>Regarding the impact of generated data, when adding the synthetic data in the training (Run 2) performance decreases in both tasks, showing that the augmentation with generated data does not improve the generalization of models compared to oversampling. In fact, we hypothesize that the addition of synthetic data might push models to be over-reliant on specific identity terms or profanities, hurting their generalization capabilities, a phenomenon that has been observed in data augmentation using generative models [14]. Moreover, to dissect the impact of oversampling and the impact of the auxiliary task, we run a series of additional experiments⁵. Results are shown in Table 2. To evaluate the role of oversampling, we replicate the setup of the two submitted runs but omit the oversampling of the HODI data from the training (Exp 1 and Exp 2). By comparing results (Run 1 vs Exp 1 and Run 2 vs Exp 2), we can observe that oversampling the data is generally beneficial, especially if no synthetic data are used. Moreover, in Exp 3 and Exp 4 we replicate the submitted experiments but exclude the auxiliary task. By comparing Run 1 and Exp 3, we can observe that in this case the auxiliary task is indeed beneficial, while it is not when comparing Run 2 and Exp 4, where synthetic examples are part of the training data. This suggests that the estimation of agreement for generated data might not be informative.</p>
          </sec>
          <sec id="sec-5-3">
            <title>5.3. Qualitative Error Analysis - Subtask A</title>
            <p>To perform a qualitative analysis on the most problematic tweets, we isolated the tweets that were incorrectly classified by all models in Table 2. The most consistent false negative regards the missed detection of tweets containing a specific offensive slang word (f*mminiello). One possible reason is that this word is not generally common (as it belongs to a local language variety), and it was not present in the training set. Observing the posts incorrectly classified as homotransphobic by the models, we identified (doubtful) sense of humour or metaphorical expressions (andare a fare in culo, essere fr*cio col culo degli altri) as possible reasons. Another possible reason could also be over-reliance on specific terms.</p>
          </sec>
        </sec>
        <sec id="sec-2-3-1">
          <title>6. Conclusions</title>
          <p>We described our participation in the HODI evaluation task at <xref ref-type="bibr" rid="ref3">Evalita 2023</xref>. We used a multi-task learning approach to share representations between the two tasks involved and, additionally, considering the subjectivity of the task, we incorporated inter-annotator agreement information into our framework, estimating it with a 5-classifier ensemble. We experimented with augmenting the available training data by oversampling and via generated data. We were one of the few teams who participated in both tasks, and our systems performed competitively.</p>
          <p>Moreover, we conducted an analysis on the role and impact of the various aspects we combined. Our results show oversampling is generally beneficial, especially when combined with the auxiliary task on agreement.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
      <p>⁵ The organizers released the labels for the test set after the closing of the evaluation phase.</p>
        <sec id="sec-2-4-1">
          <title>Acknowledgments</title>
          <p>The usage of generated data instead has limited benefits compared to oversampling or additional auxiliary tasks. Finally, performing a qualitative analysis on the most frequent causes of error, we identified specific homotransphobic slang terms that were problematic for our models to identify.</p>
          <p>The work of Elisa Leonardelli was partially funded by the StandByMe European project (REC-RDAP-GBV-AG2020) on "Stop online violence against women and girls by changing attitudes and behaviour of young people through human rights education" (GA 101005641). Her research was also supported by the StandByMe 2.0 project (CERV-2021-DAPHNE) on "Stop gender-based violence by addressing masculinities and changing behaviour of young people through human rights education" (GA 101049386).</p>
          <p>[11] E. Leonardelli, S. Menini, A. Palmero Aprosio, M. Guerini, S. Tonelli, Agreeing to disagree: Annotating offensive language datasets with annotators' disagreement, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 10528–10539. URL: https://aclanthology.org/2021.emnlp-main.822. doi:10.18653/v1/2021.emnlp-main.822.
[12] M. Sandri, E. Leonardelli, S. Tonelli, E. Jezek, Why don't you do it right? Analysing annotators' disagreement in subjective tasks, in: Proceedings of the 2023 Conference of the European Chapter of the Association for Computational Linguistics, 2023.
[13] E. Leonardelli, S. Menini, S. Tonelli, DH-FBK@HaSpeeDe2: Italian hate speech detection via self-training and oversampling, in: Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), volume 2765, 2020.
[14] C. Casula, S. Tonelli, Generation-based data augmentation for offensive language detection: Is it worth it?, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 3359–3377. URL: https://aclanthology.org/2023.eacl-main.244.
[15] J. Pavlopoulos, J. Sorensen, L. Laugier, I. Androutsopoulos, SemEval-2021 task 5: Toxic spans detection, in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), 2021, pp. 59–69.
[16] R. Caruana, Multitask learning, Machine Learning 28 (1997) 41–75.
[17] J. Lafferty, A. McCallum, F. C. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001).
[18] T. Wullach, A. Adler, E. Minkov, Fight Fire with Fire: Fine-tuning Hate Detectors using Large Samples of Generated Hate Speech, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 4699–4705. URL: https://aclanthology.org/2021.findings-emnlp.402.
[19] T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, E. Kamar, ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3309–3326. URL: https://aclanthology.org/2022.acl-long.234. doi:10.18653/v1/2022.acl-long.234.
[20] A. Anaby-Tavor, B. Carmeli, E. Goldbraich, A. Kantor, G. Kour, S. Shlomov, N. Tepper, N. Zwerdling, Do Not Have Enough Data? Deep Learning to the Rescue!, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 7383–7390. doi:10.1609/aaai.v34i05.6233.
[21] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, arXiv preprint 2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.
[22] R. van der Goot, A. Üstün, A. Ramponi, I. Sharaf, B. Plank, Massive choice, ample tasks (MaChAmp): A toolkit for multi-task learning in NLP, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2021, pp. 176–197. URL: https://aclanthology.org/2021.eacl-demos.22. doi:10.18653/v1/2021.eacl-demos.22.</p>
          <sec id="sec-app">
            <title>Appendix</title>
            <sec id="sec-app-a">
              <title>A. Ensemble agreement</title>
              <p>For posts (of the training set) that were annotated as homotransphobic, we aim at obtaining an approximation of the agreement level on each word of the post, as being considered part of the span is correlated with labeling the post as homotransphobic. This information is then exploited as additional information in our multi-task training setup, specifically as an extension to the sequence labelling prediction of Subtask B.</p>
              <p>We split the training data D provided by the HODI organizers in 5 folds f1, f2, ..., f5, creating 5 separate train/validation splits, being careful that each item of the training data appears in the validation set of exactly one fold. We employ an ensemble of classifiers, a method first suggested by Leonardelli et al. [11], where each classifier of the ensemble is trained using slightly different configurations by varying the initial conditions, such as the initial seed and the number of epochs, so that the 5 classifiers produce similar but not identical predictions. The classifiers are produced in the multi-task setup shown in Figure 1, but without the auxiliary task on agreement. In this manner, we have ensemble predictions for each of the entries of the training data. Based on the predictions of the classifiers, we assign ensemble agreement labels to the validation set (at a word level) of the current fold, based on how many classifiers agree with the actual gold annotation. The ensemble agreement label is thus a number between 0 and 5. We consider this information a proxy for an item's difficulty and annotators' disagreement.</p>
            </sec>
            <sec id="sec-app-b">
              <title>B. Data augmentation</title>
              <p>The pipeline we follow for augmenting the available data for the task is as follows:
1. We fine-tune a classifier (in our case a BERT-base model trained on Italian²) on the HODI training data.
2. We fine-tune IT5-Large on the same training data, formatting the task so that the input is 'Scrivi un tweet:' or 'Scrivi un tweet omotransfobico:' ('Write a tweet:' or 'Write a homotransphobic tweet:') depending on the gold label of each example, and the output is the actual post.
3. We use the fine-tuned IT5 model to generate new data, using the same type of input we use in Step 2.
4. We filter the generated data using the fine-tuned classifier from Step 1, keeping only the examples for which the label assignment is the same for the classifier and the generator [20, 14]. We additionally remove duplicates and normalize URLs as URL.
5. We rank generated examples based on the confidence of the classification model we used for filtering, retaining the top 2,000 examples for each class. This number is chosen in order to ideally double the size of the dataset, and we use generated examples that are equally split among the labels so as to artificially mitigate the class imbalance.</p>
            </sec>
          </sec>
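The word-level ensemble agreement labels of Appendix A (how many of the 5 classifiers agree with the gold BIO label at each word) can be sketched as follows; the function and variable names are ours.

```python
def agreement_labels(gold_tags, ensemble_predictions):
    """Word-level ensemble agreement labels in [0, 5].

    `gold_tags` is the gold BIO sequence for a post;
    `ensemble_predictions` is a list of 5 BIO sequences, one per
    classifier in the ensemble. Each word receives the number of
    classifiers whose prediction matches the gold tag at that position.
    """
    labels = []
    for i, gold in enumerate(gold_tags):
        agree = sum(1 for preds in ensemble_predictions if preds[i] == gold)
        labels.append(agree)
    return labels

gold = ["O", "B", "I", "O"]
ensemble = [
    ["O", "B", "I", "O"],  # fully agrees with gold
    ["O", "B", "O", "O"],  # disagrees on the third word
    ["O", "B", "I", "O"],
    ["B", "B", "I", "O"],  # disagrees on the first word
    ["O", "O", "O", "O"],  # misses the span entirely
]
print(agreement_labels(gold, ensemble))  # [4, 4, 3, 5]
```

Low values flag the ambiguous words; the labels then feed the auxiliary sequence-labelling task of the multi-task setup.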
        </sec>
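The generate-then-filter pipeline of Appendix B can be summarized in a runnable sketch. The `generator` and `classifier` callables are stand-ins for the fine-tuned IT5 and BERT filtering models (they are not actual model APIs), and URL normalization is omitted.

```python
def augment(generator, classifier, n_to_generate, keep_per_class=2000):
    """Sketch of the Appendix B pipeline: generate candidates with a
    prompt matching the desired label, drop duplicates, keep only
    candidates whose label the filtering classifier confirms, and
    retain the most confident examples per class."""
    prompts = {
        0: "Scrivi un tweet:",                 # non-homotransphobic
        1: "Scrivi un tweet omotransfobico:",  # homotransphobic
    }
    kept = {0: [], 1: []}
    seen = set()
    for label, prompt in prompts.items():
        for _ in range(n_to_generate):
            text = generator(prompt)      # stand-in for the fine-tuned IT5
            if text in seen:              # remove duplicates
                continue
            seen.add(text)
            pred_label, confidence = classifier(text)  # stand-in for the BERT filter
            if pred_label == label:       # generator and classifier agree
                kept[label].append((confidence, text))
    # rank by classifier confidence, retain the top examples per class
    return {
        label: [text for _, text in sorted(pairs, reverse=True)[:keep_per_class]]
        for label, pairs in kept.items()
    }
```

Keeping an equal number of examples per class mirrors the paper's choice of using generated data to mitigate the class imbalance.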
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>preprint arXiv:2109.00227</source>
          (
          <year>2021</year>
          ). [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Locatelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Damo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <article-title>A cross-lingual</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>siderations in NLP (C3NLP)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>24</lpage>
          . [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Cignarella</surname>
          </string-name>
          , G. Damo, T. Caselli,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          , HODI at EVALITA 2023:
          <article-title>Overview of</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Italian</surname>
          </string-name>
          . Final Workshop (EVALITA
          <year>2023</year>
          ), CEUR.org,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Parma</surname>
          </string-name>
          , Italy,
          <year>2023</year>
          . [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          , R. Sprug-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>noli</surname>
          </string-name>
          , G. Venturi,
          <year>Evalita 2023</year>
          :
          <article-title>Overview of the 8th</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Workshop</surname>
          </string-name>
          (EVALITA
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <year>2023</year>
          . [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          , E. Leonardelli, Dh-fbk at semeval[1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>The state of profanity obfusca- 2022 task 4: leveraging annotators' disagreement</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>lications, in: Findings of the Association for Com- detection</article-title>
          , in
          <source>: Proceedings of the 16th International</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>putational Linguistics: ACL</source>
          <year>2023</year>
          , Association for Workshop on Semantic Evaluation (SemEval-2022),
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Computational</given-names>
            <surname>Linguistics</surname>
          </string-name>
          ,
          <year>2023</year>
          .
          <year>2022</year>
          , pp.
          <fpage>324</fpage>
          -
          <lpage>334</lpage>
          . [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bourgeade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Cignarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Frenda</surname>
          </string-name>
          , M. Lau- [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Leonardelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Casula</surname>
          </string-name>
          , Dh-fbk at semeval-2023
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>rent</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Schmeisser-Nieto</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Benamara</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Bosco, task 10: Multi-task learning with classifier ensem-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>dataset of racial stereotypes in social media conver- of the 17th</article-title>
          <source>International Workshop on Semantic</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Computational</surname>
            <given-names>Linguistics: EACL</given-names>
          </string-name>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>tics</fpage>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>1894</fpage>
          -
          <lpage>1905</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          674-
          <fpage>684</fpage>
          . https://aclanthology.org/
          <year>2023</year>
          .semeval-
          <volume>1</volume>
          .
          <fpage>261</fpage>
          . [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Anzovino</surname>
          </string-name>
          , Overview of [10]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>the task on automatic misogyny identification at</article-title>
          R. Xu, Hitsz-hlt at semeval-
          <source>2021 task 5:</source>
          Ensem-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>ibereval</surname>
          </string-name>
          <year>2018</year>
          ., Ibereval@ sepln
          <volume>2150</volume>
          (
          <year>2018</year>
          ) 214
          <article-title>- ble sequence labeling and span boundary detection</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          228.
          <article-title>for toxic span detection</article-title>
          , in: Proceedings of the [4]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          , R. Pon- 15th international workshop on semantic evalua-
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>nusamy</surname>
            ,
            <given-names>P. K.</given-names>
          </string-name>
          <string-name>
            <surname>Kumaresan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Sampath</surname>
          </string-name>
          , D. Then- tion (
          <issue>SemEval-2021</issue>
          ),
          <year>2021</year>
          , pp.
          <fpage>521</fpage>
          -
          <lpage>526</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>