<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Augmentation through Back-Translation for Stereotypes and Irony Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tom Bourgeade</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Casola</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adel Mahmoud Wizani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Bosco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica, Università di Torino</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LORIA, University of Lorraine</institution>
          ,
          <addr-line>Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>MaiNLP &amp; MCML, LMU Munich</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Complex linguistic phenomena such as stereotypes or irony are still challenging to detect, particularly due to the lower availability of annotated data. In this paper, we explore Back-Translation (BT) as a data augmentation method to enhance such datasets by artificially introducing semantics-preserving variations. We investigate French and Italian as source languages on two multilingual datasets annotated for the presence of stereotypes or irony, and evaluate French/Italian, English, and Arabic as pivot languages for the BT process. We also investigate cross-translation, i.e., augmenting one language subset of a multilingual dataset with translated instances from the other languages. We conduct an intrinsic evaluation of the quality of back-translated instances, identifying linguistic or translation model-specific errors that may occur with BT. We also perform an extrinsic evaluation of different data augmentation configurations to train a multilingual Transformer-based classifier for stereotype or irony detection on monolingual data. Warning: This paper may contain potentially offensive example messages.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Augmentation</kwd>
        <kwd>Back Translation</kwd>
        <kwd>Irony Detection</kwd>
        <kwd>Stereotypes Detection</kwd>
        <kwd>Low-Resource NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy. * Corresponding author.</p>
      <p>The work of T. Bourgeade and S. Casola was performed while at Dipartimento di Informatica, Università di Torino, Turin, Italy. tom.bourgeade@loria.fr (T. Bourgeade); s.casola@lmu.de (S. Casola); adel.mahmoudwizani@edu.unito.it (A. M. Wizani); cristina.bosco@unito.it (C. Bosco)</p>
      <p>0000-0002-0247-3130 (T. Bourgeade); 0000-0002-0017-2975
(S. Casola); 0000-0002-8857-4484 (C. Bosco)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License source languages — leveraging two multilingual datasets
Attribution 4.0 International (CC BY 4.0).
with subsets for these languages — and various languages
as pivots for the BT process (French/Italian, English, and
Arabic). We compare BT with an alternative process
for data augmentation, specific for multilingual datasets,
which we refer to as “cross-translation”, where the data
from one language subset is translated and then used as
a data augmentation source for another language subset.</p>
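The two augmentation processes described above can be sketched as follows; `translate` is a hypothetical stand-in for an MT system call (the paper uses the Google Translate API), and all names are illustrative, not the authors' actual code:

```python
# Minimal sketch of Back-Translation (BT) and cross-translation (XT).
# `translate` is a hypothetical placeholder for an MT system call.

def translate(text: str, src: str, tgt: str) -> str:
    # Stand-in: a real implementation would call an MT API here.
    return f"[{src}->{tgt}] {text}"

def back_translate(text: str, source: str, pivot: str) -> str:
    """Source -> pivot -> source: yields a meaning-preserving variant."""
    return translate(translate(text, source, pivot), pivot, source)

def cross_translate(texts, src_lang, tgt_lang):
    """Translate one language subset to use as augmented data for another."""
    return [translate(t, src_lang, tgt_lang) for t in texts]

variant = back_translate("Questo è un esempio", source="it", pivot="en")
```

With a real MT backend, `variant` would be an Italian paraphrase of the input obtained through the English pivot.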
      <p>Our contributions are (1) an intrinsic qualitative human evaluation of translations and back-translations for stereotypes detection and irony detection datasets in various combinations of source and pivot languages, followed by (2) an extrinsic evaluation of machine learning model performance on these datasets, using these various data augmentation sources.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>BT as a data augmentation method was originally proposed by Sennrich et al. [4], in the context of Neural Machine Translation (NMT), to allow using monolingual data to improve translation quality, particularly when parallel (source and target) training data is scarce.</p>
      <p>Since then, several works have explored BT, either as a baseline to evaluate other data augmentation methods against, or as the primary augmentation method for low-resource tasks. For example, Kumar et al. [5] evaluated pre-trained conditional generative Transformer models as data augmentation sources and used BT as a baseline.</p>
      <p>They found that BT achieves relatively high extrinsic
performance against simpler approaches such as Easy Data
Augmentation (EDA) [7] but also against some
Transformer models; it also obtains most of the best scores for
semantic fidelity and data diversity.</p>
      <p>Xie et al. [6] make use of BT as an augmentation
strategy in their semi-supervised Consistency Training
approach, in which a model is trained with a loss function
combining traditional supervised learning on a limited
amount of labeled data, with an unsupervised consistency
loss. The latter consists of minimizing a divergence
metric between the output distributions for an unlabeled
input and a noised version of it, the noise function being
the chosen data augmentation method, i.e., for text, BT.</p>
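The consistency loss described above can be sketched as follows; this is a minimal scalar illustration of the idea (supervised loss plus a divergence between output distributions for an unlabeled input and its BT-noised version), not the actual implementation of [6]:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete output distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_training_loss(supervised_loss, p_original, p_backtranslated,
                              weight=1.0):
    """Supervised loss on labeled data plus an unsupervised consistency
    term: the divergence between the model's output distribution on an
    unlabeled input and on its back-translated (noised) version."""
    return supervised_loss + weight * kl_divergence(p_original, p_backtranslated)

# Identical predictions on the original input and its back-translation
# add no consistency penalty.
loss = consistency_training_loss(0.5, [0.7, 0.3], [0.7, 0.3])
```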
      <p>As for the challenges related to the application of translation to texts with irony or sarcasm, a few papers discussing this task were recently published, among which we can cite [8] and [9].</p>
    </sec>
    <sec id="sec-2b">
      <title>3. Datasets</title>
      <p>We focus on the tasks of stereotypes and irony detection with relevant multilingual datasets. Table 1 summarizes the characteristics of their French and Italian splits, the chosen languages for this study:</p>
      <p>• StereoHoax [10] is a contextualized multilingual dataset of tweets annotated primarily for the presence of anti-migrant stereotypes. It consists of replies to tweets containing racial hoaxes (RH), with each message having a "conversation head" (the message containing the source RH) and a direct parent message (if applicable).</p>
      <p>• MultiPICo [11] is a disaggregated multilingual dataset of short social media conversations annotated for irony detection through crowdsourcing. Each instance is a (post, reply) pair, where the post is a starting message in a thread, and the reply is either a direct reply or a second-level reply.</p>
    </sec>
    <sec id="sec-2c">
      <title>4. Translation Model</title>
      <p>To use BT as a data augmentation method, one crucial decision to make is that of the translation system(s). Machine Translation (MT) models are in fact not explicitly designed to inject relevant noise into texts to increase the variety of data available. Therefore, a significant part of this beneficial noise will be linked to the idiosyncrasies of the chosen model(s).</p>
      <p>In this work, due to the number of different configurations (and thus source-target language pairs) we wished to investigate, we decided to limit our selection to intrinsically multilingual models. In a preliminary phase, we thus experimented with the locally runnable Transformer-based multimodal Neural MT model SeamlessM4T v2 [12] proposed by Meta AI. However, after early evaluations of the obtained translations and back-translations, we observed too many issues and violations of important criteria (see section 5). As such, we eventually selected the Google Translate API for our evaluation and experiments, as it seemed to offer the best tradeoffs between translation and back-translation quality, as well as ease of access to the languages chosen for this work (French, Italian, English, and Arabic). It is important to note, however, that the models used by Google Translate themselves make use of BT as a data augmentation technique, as well as M4 Modelling (see https://research.google/blog/recent-advances-in-google-translate/): in practice, this may cause some issues for use in BT, as undesirable artifacts of BT and Massively Multilingual Massive NMT, possibly caused by parameter bottlenecks or language interferences [13], may have detrimental effects on the quality of the augmented data.</p>
    </sec>
    <sec id="sec-2d">
      <title>5. Intrinsic Evaluation</title>
      <p>To judge the viability of BT for these two datasets and languages, we perform a human qualitative evaluation of the produced back-translations, using the following protocol.</p>
      <p>First, we collect a set of data for both datasets and languages: we randomly sample 50 instances each for the French and Italian subsets, 25 from the positive class and 25 from the negative class, for a total of 200 instances. For all the cases examined, we consider the text of the messages and the associated conversational context, which can consist of one or two other messages (an optional direct parent, and the conversation head/original post).</p>
      <p>In addition to French and Italian as source and pivot languages, American English and Modern Standard Arabic were also selected, on account of the linguistic expertise of the authors. Thus, for the 100 instances in Italian, we apply the following BT settings (&lt;source&gt; - &lt;pivot&gt; - &lt;target=source&gt;): Italian - English - Italian; Italian - French - Italian; Italian - Arabic - Italian. Similarly, for the 100 French instances, we apply the following BT settings: French - English - French; French - Italian - French; French - Arabic - French. We use the Google Translate API due to its ease of use and the availability of the chosen source and target languages.</p>
      <p>A manual qualitative approach is used for the evaluation of the BT results: 4 language experts (co-authors of this paper) evaluate the quality of the produced back-translations (and intermediate translations, though in a less quantitative capacity). All evaluators are native speakers of one of the source languages (French and Italian), as well as sufficiently proficient (or a native speaker) in the pivot languages (French, Italian, English, and Arabic). They are tasked with comparing the original and back-translated instances, also considering the pivot translation to help understand potential artifacts or errors introduced in the process. Evaluators could assign one label to problematic instances containing a violation of one of the following associated quality criteria:</p>
      <p>• faithfulness: a faithful translation accurately conveys the meaning of the original text without introducing errors, omissions, or distortions. Since we focus on texts featuring expressions of stereotype or irony, faithful instances must also preserve these phenomena;</p>
      <p>• preservation of non-translatables: this criterion refers to the translation of numbers, units, measurements, and, in general, non-translatable terms such as proper nouns, brands, trademarks, hashtags, user mentions, emojis, acronyms, and specific cultural references, for maintaining clarity, consistency, and legal compliance. This category also includes idiomatic expressions, which are especially difficult to translate;</p>
      <p>• fluency: a text is fluent when it is perceived by a native speaker as reading "natural", in the way they would be expected to have structured it;</p>
      <p>• other: this last criterion is used to report less frequent violations that cannot be encoded by the other criteria, including incomplete translations, word tokenization, or sentence segmentation.</p>
      <sec id="sec-2-3">
        <title>5.1. Back-Translation Examples</title>
        <p>To illustrate violations of these criteria, this section presents example parts of instances in their original (Og), translated (Tr), and back-translated (BT) forms.</p>
        <p>In the following example from the Italian subset of MultiPICo, the fluency criterion is violated because of the inadequate and unnatural back-translation of the plural expression "per i primi tempi" ("for the initial period") into the singular "per la prima volta" ("for the first time"):</p>
        <p>Og: "Se rimani impiegato a 1400 euro è il tuo obiettivo ok, altrimenti è solo per i primi tempi"</p>
        <p>Tr: "If staying employed at 1400 euros is your goal, ok, otherwise it's only for the first time"</p>
        <p>BT: "Se restare impiegato a 1400 euro è il tuo obiettivo, ok, altrimenti è solo per la prima volta"</p>
        <p>This example from French StereoHoax illustrates breaking the faithfulness criterion, with Arabic as the pivot language. In this message, the informal vulgar expression "n'avoir rien à foutre" (vulgar, "to have nothing to do"), which conveys an implied judgment of laziness towards the described target, cannot be properly translated into Arabic, like most vulgar expressions (a common issue with this pivot language), and loses its proper meaning in the back-translation, "n'avoir rien à se soucier", meaning "to have nothing to worry/care about":</p>
        <p>Og: "Elle n'a rien à foutre"</p>
        <p>Tr: "ليس لديها ما تهتم به"</p>
        <p>BT: "Elle n'a rien à se soucier"</p>
        <p>In this example from Italian MultiPICo, the violation concerns a non-translatable, in the form of the colloquial expression "&lt;X&gt; della Madonna", intended as an idiomatic intensifier (similar to "a hell of a &lt;X&gt;" in American English). In the pivot translation, the idiom fails to be transposed, and "Madonna" is interpreted as part of the proper noun of a non-existent virus ("Madonna virus"), which is then transposed into the back-translation:</p>
        <p>Og: "... Gli asiatici stanno tramando qualcosa di losco... prima gli spaghetti al microonde con ketchup e adesso un virus della madonna?"</p>
        <p>Tr: "... The Asians are up to something shady... first microwaved spaghetti with ketchup and now a Madonna virus?"</p>
        <p>BT: "... Gli asiatici stanno tramando qualcosa di losco... prima spaghetti al microonde con ketchup e ora un virus Madonna?"</p>
        <p>Another example of a non-translatable failing to be preserved is the following, taken from the French subset of StereoHoax. Here, the idiomatic expression "se tuer/mourir à la tâche" (lit. "to kill oneself/die doing a task"), used in its informal variant with "[se] crever" (lit. "to burst", informal "to kill [oneself]/die"), was translated incorrectly, changing the meaning of the message:</p>
        <p>Og: "Oui mais est ce que c'est normal ? Quand y en a un qui a rien foutu et que l'autre s'est crever à la tache ? Non la logique c'est qu'il peuvent cumuler pour arriver à une retraite vivable et qui dépasse le seuil de pauvreté !"</p>
        <p>Tr: "Yes but is this normal? When one has done nothing and the other has died? No, the logic is that they can accumulate to achieve a livable retirement that exceeds the poverty line!"</p>
        <p>BT: "Oui mais est-ce normal ? Quand l'un n'a rien fait et que l'autre est mort ? Non, la logique est qu'ils peuvent accumuler pour obtenir une retraite viable qui dépasse le seuil de pauvreté !"</p>
        <p>One French instance containing the idiomatic discourse marker/connector "du coup" (equivalent to the connector "so" in English) as a quoted expression has this expression consistently mis-backtranslated to "tout d'un coup" ("all of a sudden/suddenly"), despite it not making sense in the context of the message. The use of the expression in quotation marks in this case may have confused the MT model, which otherwise does not struggle with this expression when manually tested.</p>
        <p>Overall, English appears to perform best across all the pivot languages in all settings. This is not surprising considering that, for most MT models, English is the most represented language in the training data (both as source and target language), as well as the language typically used as a pivot to generate augmented instances for lower-resource languages. When using Arabic as a pivot language in our evaluations, we observed some unnatural expressions and constructs that appear "borrowed" from English: for example, in a MultiPICo Italian instance, the word "gratis" ("free [of charge/cost]") is mistranslated into an Arabic word meaning "freedom/liberty"; we thus hypothesize that the MT model used English as a pivot language for the Italian-Arabic language pair, as both terms would indeed likely be mapped to the polysemic, and thus ambiguous, term "free" in English.</p>
      </sec>
      <sec id="sec-2-4">
        <title>5.2. Samples Evaluation</title>
      </sec>
    </sec>
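The class-balanced sampling step of the evaluation protocol above can be sketched as follows; the data representation and function names are illustrative assumptions, not the authors' code:

```python
import random

def sample_for_human_evaluation(subset, n_per_class=25, seed=0):
    """Draw a class-balanced evaluation sample from one language subset:
    25 positive and 25 negative instances, i.e., 50 per subset."""
    rng = random.Random(seed)
    positives = [x for x in subset if x["label"] == 1]
    negatives = [x for x in subset if x["label"] == 0]
    return rng.sample(positives, n_per_class) + rng.sample(negatives, n_per_class)

# Toy subset with 30 positive and 30 negative instances.
subset = [{"id": i, "label": i % 2} for i in range(60)]
sample = sample_for_human_evaluation(subset)
```

Applied to the French and Italian subsets of both datasets, this yields the 4 × 50 = 200 instances evaluated in this study.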
    <sec id="sec-3">
      <title>6. Extrinsic Evaluation</title>
      <p>[Table 3: results per BT setting (Ita-Eng-Ita, Ita-Fra-Ita, Ita-Arb-Ita, and mean; Fra-Eng-Fra, Fra-Ita-Fra, Fra-Arb-Fra, and mean) on StereoHoax and MultiPICo.]</p>
      <p>We evaluate the data augmentation strategies in class-balancing configurations, in which augmented samples are added to the positive class until it is the same size as the negative class. We evaluated the following configurations:</p>
      <p>• baseline: the model is trained on the original, unmodified training set (with no balancing of the classes).</p>
      <p>• oversampling (OV): oversampling was shown to be a strong baseline in various previous works [16, 17], and we thus evaluate it as an alternative or complement to BT.</p>
      <p>• back-translation from &lt;language&gt; (BT[&lt;language&gt;]): augmented instances are sampled from back-translations of the original data using &lt;language&gt; as a pivot.</p>
      <p>• cross-translation (XT): as the datasets used are multilingual and contain subsets in both French and Italian, one language's subset can be translated and used as augmented data for the other.</p>
      <p>• mixed back/cross-translation with oversampling (BT[&lt;language&gt;]/XT|OV): as the positive classes are, for both phenomena and all languages, less than half the size of the negative class, balancing the two requires sampling more instances from the data augmentation source than there are original positive instances, which could result in injecting translation-related biases into the training set. To attempt to mitigate this, we also evaluate sampling 50% from back- or cross-translation strategies, with 50% from oversampling the positive class. Note that, given the number of potential configurations, we only evaluate BT[Eng]|OV and XT|OV due to time and resource constraints.</p>
      <p>Table 3 displays the results of our experiments in terms of macro F1-scores, as well as positive class F1-scores. Except for StereoHoax French, at least one of the data augmentation configurations outperforms the baseline, though not necessarily BT. Indeed, for both StereoHoax Italian and MultiPICo French, the mixed cross-translation with oversampling (XT|OV) configuration achieves the highest macro F1-score, though not the best positive class score. This seems to indicate that the variety of data intrinsic to using a separate language subset of a multilingual dataset can be beneficial, when possible, over that artificially created by a data augmentation technique like BT. Additionally, we only experimented with cross-translation within one linguistic typology (Romance languages). As such, future investigations on whether this extends to cross-typology XT would be worth pursuing.</p>
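The class-balancing scheme behind these configurations can be sketched as follows; this is an illustrative sketch under assumed names, where `mix_oversampling=0.5` corresponds to the mixed |OV variants:

```python
import random

def balance_with_augmentation(positives, negatives, augmented_pool,
                              mix_oversampling=0.0, seed=0):
    """Grow the positive class until it matches the negative class size,
    drawing from augmented data (BT or XT instances) and, for the mixed
    configurations, partly from oversampling the original positives."""
    rng = random.Random(seed)
    needed = len(negatives) - len(positives)
    n_oversample = int(needed * mix_oversampling)
    n_augmented = needed - n_oversample
    extra = [rng.choice(positives) for _ in range(n_oversample)]
    extra += rng.sample(augmented_pool, n_augmented)
    return positives + extra, negatives

# BT[Eng]|OV-style mix: half oversampled originals, half augmented instances.
pos, neg = ["p1", "p2"], ["n1", "n2", "n3", "n4", "n5", "n6"]
pool = [f"bt{i}" for i in range(10)]
new_pos, _ = balance_with_augmentation(pos, neg, pool, mix_oversampling=0.5)
```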
      <p>Interestingly, we find that the mixture of oversampling and back/cross-translation outperforms the equivalent non-mixed configuration for all datasets and languages except MultiPICo Italian. However, due to its small size (see Table 1), the results on this particular subset may be less significant, given the overall protocol for these experiments, and a protocol that can inject greater amounts of augmented data might be preferable. During initial experiments, however, we found that injecting larger quantities of augmented data (preserving or not the initial label distributions) seemed to consistently negatively impact test-set performance, most likely due to overfitting, but also possibly due to the models fitting on the translation model's detrimental idiosyncrasies, instead of the characteristics of the phenomena to detect.</p>
      <p>Moreover, the performance on the positive class (Table 3b) is not necessarily improved correspondingly with the overall macro F1-score (Table 3a), even when the augmentation is applied solely to this class. In other works on similar phenomena, it is shown that data augmentation and related methods can boost the Out-of-Domain performance of such detection models [17]. The addition of variety in the occurrences of the phenomenon to detect would indeed help in generalizing its detection to other sources of data. Though, as the example of StereoHoax Italian in the cross-translation (XT) configuration shows, care should be taken not to overly shift the data distribution; otherwise, models may fail to learn the particular dataset's positive class entirely. The mixed data augmentation with oversampling configurations seems, however, successful in addressing this potential issue, though more variations in the proportions should be experimented with.</p>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusions</title>
      <sec id="sec-4-1">
        <p>In this work, we have investigated using Back-Translation as a data augmentation technique for challenging low-resource tasks like stereotypes and irony detection, in a multilingual context.</p>
        <p>Through an intrinsic evaluation of the quality of the augmented instances, we identified modes of failure of Machine Translation, which could negatively impact the data augmentation process. These errors stem from the intrinsic differences between typologies and specific languages, or from translation model idiosyncrasies themselves, potentially learned from methods like BT. Through a preliminary extrinsic evaluation of two multilingual datasets, we found that cross-translation can outperform Back-Translation, allowing us to augment one language subset by leveraging the variety of inputs present in the others. In future work, we aim to expand this study to more numerous and varied source and pivot languages, and different data augmentation configurations, namely, different proportions and selections of injected augmented data. We may also compare Back- and Cross-Translation against or alongside other related techniques, such as multitask learning or Active Learning. We also expect that some improvements can be obtained by mitigating translation failures; this can be done, for example, by leveraging an external LLM to check each step and remove or correct the errors from the final augmented dataset. Finally, it could also be interesting to perform tests with different model types on top of RoBERTa.</p>
        <p>Acknowledgments. The work of T. Bourgeade was funded by the project StereotypHate, funded by the Compagnia di San Paolo for the call 'Progetti di Ateneo - Compagnia di San Paolo 2019/2021 - Mission 1.1 - Finanziamento ex-post'. The work of C. Bosco was partially funded by this same project.</p>
        <p>Long Papers), Association for Computational Lin- D. Licht, J. Maillard, R. Mavlyutov, A.
Rakotoariguistics, Berlin, Germany, 2016, pp. 86–96. URL: son, K. R. Sadagopan, A. Ramakrishnan, T. Tran,
https://aclanthology.org/P16-1009. doi:10.18653/ G. Wenzek, Y. Yang, E. Ye, I. Evtimov, P.
Fernanv1/P16-1009. dez, C. Gao, P. Hansanti, E. Kalbassi, A. Kallet,
[5] V. Kumar, A. Choudhary, E. Cho, Data Augmen- A. Kozhevnikov, G. M. Gonzalez, R. S. Roman,
tation using Pre-trained Transformer Models, in: C. Touret, C. Wong, C. Wood, B. Yu, P. Andrews,
Proceedings of the 2nd Workshop on Life-long C. Balioglu, P.-J. Chen, M. R. Costa-jussà, M.
ElLearning for Spoken Language Systems, Associa- bayad, H. Gong, F. Guzmán, K. Hefernan, S. Jain,
tion for Computational Linguistics, Suzhou, China, J. Kao, A. Lee, X. Ma, A. Mourachko, B. Peloquin,
2020, pp. 18–26. URL: https://aclanthology.org/2020. J. Pino, S. Popuri, C. Ropers, S. Saleem, H. Schwenk,
lifelongnlp-1.3. A. Sun, P. Tomasello, C. Wang, J. Wang, S. Wang,
[6] Q. Xie, Z. Dai, E. Hovy, T. Luong, Q. Le, Unsu- M. Williamson, Seamless: Multilingual Expressive
pervised Data Augmentation for Consistency and Streaming Speech Translation, 2023. URL: http:
Training, in: Advances in Neural Informa- //arxiv.org/abs/2312.05187. doi:10.48550/arXiv.
tion Processing Systems, volume 33, Curran 2312.05187. arXiv:2312.05187.
Associates, Inc., 2020, pp. 6256–6268. URL: [13] A. Mueller, G. Nicolai, A. D. McCarthy, D. Lewis,
https://proceedings.neurips.cc/paper/2020/hash/ W. Wu, D. Yarowsky, An Analysis of Massively
44feb0096faa8326192570788b38c1d1-Abstract. Multilingual Neural Machine Translation for
Lowhtml. Resource Languages, in: N. Calzolari, F. Béchet,
[7] J. Wei, K. Zou, EDA: Easy Data Augmentation Tech- P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi,
niques for Boosting Performance on Text Classifica- H. Isahara, B. Maegaard, J. Mariani, H. Mazo,
tion Tasks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings
Proceedings of the 2019 Conference on Empirical of the Twelfth Language Resources and Evaluation
Methods in Natural Language Processing and the Conference, European Language Resources
Associ9th International Joint Conference on Natural Lan- ation, Marseille, France, 2020, pp. 3710–3718. URL:
guage Processing (EMNLP-IJCNLP), Association https://aclanthology.org/2020.lrec-1.458.
for Computational Linguistics, Hong Kong, China, [14] E. Rabinovich, S. Mirkin, R. Patel, L. Specia,
2019, pp. 6382–6388. URL: https://aclanthology.org/ S. Winther, Personalized machine translation:
PreD19-1670. doi:10.18653/v1/D19-1670. serving original author traits, in: Proceedings of
[8] H. Ardi, M. Al Hafizh, I. Rezqy, R. Tuzzikriah, Can the EACL 2017 vol. 1 long papers, 2017.
machine translations translate humorous texts?, [15] A. Conneau, K. Khandelwal, N. Goyal, V.
ChaudHumanus 21 (2022) 99–112. hary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
[9] Initial exploration into sarcasm and irony through L. Zettlemoyer, V. Stoyanov, Unsupervised
Crossmachine translation, Natural Language Processing lingual Representation Learning at Scale, in:
ProJournal 9 (2024) 100106. ceedings of the 58th Annual Meeting of the
Associa[10] T. Bourgeade, A. T. Cignarella, S. Frenda, M. Lau- tion for Computational Linguistics, Association for
rent, W. Schmeisser-Nieto, F. Benamara, C. Bosco, Computational Linguistics, Online, 2020, pp. 8440–
V. Moriceau, V. Patti, M. Taulé, A Multilingual 8451. URL: https://aclanthology.org/2020.acl-main.
Dataset of Racial Stereotypes in Social Media Con- 747. doi:10.18653/v1/2020.acl-main.747.
versational Threads, in: Findings of the As- [16] M. Juuti, T. Gröndahl, A. Flanagan, N. Asokan, A
sociation for Computational Linguistics: EACL little goes a long way: Improving toxic language
2023, Association for Computational Linguistics, classification despite data scarcity, in: Findings
Dubrovnik, Croatia, 2023, pp. 686–696. URL: https: of the Association for Computational Linguistics:
//aclanthology.org/2023.findings-eacl.51. EMNLP 2020, Association for Computational
Lin[11] S. Casola, S. Frenda, S. M. Lo, E. Sezerer, A. Uva, guistics, Online, 2020, pp. 2991–3009. URL: https://
V. Basile, C. Bosco, A. Pedrani, C. Rubagotti, V. Patti, aclanthology.org/2020.findings-emnlp.269. doi: 10.
D. Bernardi, MultiPICo: Multilingual Perspectivist 18653/v1/2020.findings-emnlp.269.
Irony Corpus, in: Proceedings of the 62th Annual [17] C. Casula, S. Tonelli, Generation-Based Data
AugMeeting of the Association for Computational Lin- mentation for Ofensive Language Detection: Is
guistics, Association for Computational Linguistics, It Worth It?, in: Proceedings of the 17th
ConOnline, 2024. ference of the European Chapter of the
Associ[12] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, ation for Computational Linguistics, Association
N. Dong, M. Duppenthaler, P.-A. Duquenne, B. El- for Computational Linguistics, Dubrovnik,
Croalis, H. Elsahar, J. Haaheim, J. Hofman, M.-J. tia, 2023, pp. 3359–3377. URL: https://aclanthology.
Hwang, H. Inaguma, C. Klaiber, I. Kulikov, P. Li, org/2023.eacl-main.244. doi:10.18653/v1/2023.
eacl-main.244.
[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C.
Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun- For all experiments, we used the XLM-RoBERTa-base as
towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, provided by the the HuggingFace transformers [18]
Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gug- ecosystem (including the datasets library for data
proger, M. Drame, Q. Lhoest, A. Rush, Transform- cessing).
ers: State-of-the-Art Natural Language Process- Automatic hyperparameters fine-tuning was
accoming, in: Proceedings of the 2020 Conference on plished using the Weights &amp; Biases [19] AI platform’s
Empirical Methods in Natural Language Process- Bayesian hyperparameters optimization system, with the
ing: System Demonstrations, Association for Com- Hyperband early-stopping algorithm [20]. As mentioned
putational Linguistics, Online, 2020, pp. 38–45. in section 6, only 4 such optimizations were executed, one
URL: https://aclanthology.org/2020.emnlp-demos.6. for each language subset of each dataset, in the baseline
doi:10.18653/v1/2020.emnlp-demos.6. configuration (no data augmentation).
[19] L. Biewald, Experiment tracking with weights and The learning rate (), the hardware training batch
biases, 2020. URL: https://www.wandb.com/, soft- size (), and the number of gradient accumulation steps
ware available from wandb.com. (ga), were automatically fine-tuned, and their final values
[20] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, are listed in Table A1. These models were trained for a
A. Talwalkar, Hyperband: A Novel Bandit-Based maximum of 10 epochs, with the best performing epoch
Approach to Hyperparameter Optimization, Jour- checkpoint kept at the end (measured by macro F1-score),
nal of Machine Learning Research 18 (2018) 1–52. with a warm-up ratio of 0.2 (linear warm-up from 0 to
URL: http://jmlr.org/papers/v18/16-558.html. the initial learning rate over 20% of the training set), both
determined during initial experiments.</p>
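The epoch-selection rule described above (training for up to 10 epochs and keeping the checkpoint with the best validation macro F1-score) can be sketched as follows; both helper functions are toy placeholders, not the actual training code:

```python
# Sketch of best-checkpoint selection by validation macro F1-score.
# `train_one_epoch` and `evaluate_macro_f1` are hypothetical stand-ins.

def train_one_epoch(model, epoch):
    return model  # stand-in for one fine-tuning pass over the training set

def evaluate_macro_f1(model, epoch):
    return 0.5 + 0.04 * epoch - 0.005 * epoch ** 2  # toy validation curve

def train_with_best_checkpoint(model, max_epochs=10):
    """Train up to `max_epochs` and record the best-scoring epoch."""
    best_score, best_epoch = float("-inf"), None
    for epoch in range(max_epochs):
        model = train_one_epoch(model, epoch)
        score = evaluate_macro_f1(model, epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
    return best_epoch, best_score

best_epoch, best_score = train_with_best_checkpoint(model=None)
```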
        <p>Automatic fine-tuning and training of the models was performed on the Google Colab platform, using high-RAM T4 GPU instances, for an approximate total of 50 GPU-hours.</p>
        <p>[Table A1: final fine-tuned hyperparameter values (learning rate, batch size, gradient accumulation steps) per dataset and language subset.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] S. Menini, A. P. Aprosio, S. Tonelli, Abuse is Contextual, What about NLP? The Role of Context in Abusive Language Annotation and Detection, 2021. URL: http://arxiv.org/abs/2103.14916. doi:10.48550/arXiv.2103.14916. arXiv:2103.14916.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M. Bayer, M.-A. Kaufhold, C. Reuter, A Survey on Data Augmentation for Text Classification, ACM Computing Surveys 55 (2022) 146:1–146:39. URL: https://dl.acm.org/doi/10.1145/3544558. doi:10.1145/3544558.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, E. Hovy, A Survey of Data Augmentation Approaches for NLP, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 968–988. URL: https://aclanthology.org/2021.findings-acl.84. doi:10.18653/v1/2021.findings-acl.84.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] R. Sennrich, B. Haddow, A. Birch, Improving Neural Machine Translation Models with Monolingual Data, in: K. Erk, N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 86–96. URL: https://aclanthology.org/P16-1009. doi:10.18653/v1/P16-1009.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>