Data Augmentation through Back-Translation for Stereotypes and Irony Detection

Tom Bourgeade¹,*, Silvia Casola², Adel Mahmoud Wizani³ and Cristina Bosco³

¹ LORIA, University of Lorraine, Nancy, France
² MaiNLP & MCML, LMU Munich, Germany
³ Dipartimento di Informatica, Università di Torino, Turin, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author. The work of T. Bourgeade and S. Casola was performed while at Dipartimento di Informatica, Università di Torino, Turin, Italy.
tom.bourgeade@loria.fr (T. Bourgeade); s.casola@lmu.de (S. Casola); adel.mahmoudwizani@edu.unito.it (A. M. Wizani); cristina.bosco@unito.it (C. Bosco)
ORCID: 0000-0002-0247-3130 (T. Bourgeade); 0000-0002-0017-2975 (S. Casola); 0000-0002-8857-4484 (C. Bosco)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Complex linguistic phenomena such as stereotypes or irony are still challenging to detect, particularly due to the lower availability of annotated data. In this paper, we explore Back-Translation (BT) as a data augmentation method to enhance such datasets by artificially introducing semantics-preserving variations. We investigate French and Italian as source languages on two multilingual datasets annotated for the presence of stereotypes or irony and evaluate French/Italian, English, and Arabic as pivot languages for the BT process. We also investigate cross-translation, i.e., augmenting one language subset of a multilingual dataset with translated instances from the other languages. We conduct an intrinsic evaluation of the quality of back-translated instances, identifying linguistic or translation model-specific errors that may occur with BT. We also perform an extrinsic evaluation of different data augmentation configurations to train a multilingual Transformer-based classifier for stereotype or irony detection on monolingual data.

Warning: This paper may contain potentially offensive example messages.

Keywords: Data Augmentation, Back Translation, Irony Detection, Stereotypes Detection, Low-Resource NLP

1. Introduction

Equipping systems with linguistics-grounded capabilities can be complex. Despite the advancements brought by Large Language Models (LLMs), the availability of annotated corpora remains crucial. State-of-the-art systems still exhibit shortcomings, for example, when access to context or pragmatics is required for a true comprehension of the features of the involved phenomena [1].

Unfortunately, the development of large datasets annotated for specifically complex phenomena can be very time-consuming. When only small corpora are available, data augmentation techniques can be applied [2, 3]. Given a small set of original sample data, data augmentation artificially generates new instances that are similar and comparable to the existing data and can, therefore, be used to train and test systems with an extended dataset.

In this paper, we present experiments for augmenting two small datasets annotated for two diverse, challenging phenomena, namely stereotypes and irony detection. In several works exploring data augmentation, Back-Translation (BT) [4] was shown to be a strong and relatively easy-to-implement baseline [5, 6]. A BT process generally consists of two steps: given one or multiple translation systems, a text in a source language is first translated into a chosen pivot language, and the resulting text is then translated back into the source language. The expected output of the BT process is a text that is similar but not the same as the original input, accounting for the linguistic differences intrinsic to the language pair, but also for the idiosyncrasies of the chosen translation model(s). This relies on the fact that translation is only partially deterministic: while the expected output should have the same meaning as the input, outputs that differ morphologically or syntactically may still be considered correct translations of the input. In BT, the application of (at least) two translations increases the variability between the input and the output text.
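To make the process concrete, the following is a minimal sketch of this two-step procedure. The `translate` callable stands in for any MT system (an API client or a local model); its name and signature are illustrative assumptions, not an interface used in this work.

```python
# Minimal sketch of the two-step Back-Translation (BT) process.
# `translate` is a hypothetical stand-in for any MT system.
from typing import Callable

# (text, source_lang, target_lang) -> translated text
TranslateFn = Callable[[str, str, str], str]

def back_translate(text: str, source_lang: str, pivot_lang: str,
                   translate: TranslateFn) -> str:
    """Translate into the pivot language, then back into the source."""
    pivot_text = translate(text, source_lang, pivot_lang)   # step 1: src -> pivot
    return translate(pivot_text, pivot_lang, source_lang)   # step 2: pivot -> src

# e.g., augmented = back_translate(message, "it", "en", my_mt_system)
```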
Outputs lenging phenomena, namely stereotypes and irony de- too similar to the inputs can cause overfitting when used tection. In several works exploring data augmentation, for training, while with too different outputs, there is a risk of a shift in distribution that is too large, which CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, may negatively impact performance, at least in intra- Dec 04 — 06, 2024, Pisa, Italy dataset evaluations. A compromise between these two * Corresponding author. alternatives must be found. Therefore, an evaluation The work of T. Bourgeade and S. Casola was performed while at Dipartimento di Informatica, Università di Torino, Turin, Italy. of the quality of translations and back-translations is $ tom.bourgeade@loria.fr (T. Bourgeade); s.casola@lmu.de important to assess the benefits. (S. Casola); adel.mahmoudwizani@edu.unito.it (A. M. Wizani); In this paper, we want to investigate the viability of BT cristina.bosco@unito.it (C. Bosco) as a data augmentation technique for low-resource tasks  0000-0002-0247-3130 (T. Bourgeade); 0000-0002-0017-2975 in various configurations. We use French and Italian as (S. Casola); 0000-0002-8857-4484 (C. Bosco) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License source languages — leveraging two multilingual datasets Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings with subsets for these languages — and various languages Dataset Lang. Size (train; test; val) Positive as pivots for the BT process (French/Italian, English, and Class Arabic). We compare BT with an alternative process Italian 3123 (1841; 1185; 97) 15.11% for data augmentation, specific for multilingual datasets, StereoHoax French 9342 (6981; 1993; 368) 12.07% which we refer to as “cross-translation”, where the data Italian 967 (619; 193; 155) 25.34% from one language subset is translated and then used as MultiPICo a data augmentation source for another language subset. French 1724 (1104; 345; 275) 25.17% Our contributions are (1) an intrinsic qualitative hu- Table 1 man evaluation of translations and back-translations for Statistics for the datasets used in this work. stereotypes detection and irony detection datasets in vari- ous combinations of source and pivot languages, followed by (2) an extrinsic evaluation of machine learning model • StereoHoax [10] is a contextualized multilingual performance on these datasets, using these various data dataset of tweets annotated primarily for the presence of augmentation sources. anti-migrant stereotypes. It consists of replies to tweets containing racial hoaxes (RH), with each message having 2. Related Work a “conversation head” (the message containing the source RH) and a direct parent message (if applicable). BT as a data augmentation method was originally pro- • MultiPICo [11] is a disaggregated multilingual dataset posed by Sennrich et al. [4], in the context of Neural of short social media conversations annotated for irony Machine Translation (NMT), to allow using monolingual detection through crowdsourcing. Each instance is a data to improve translation quality, particularly when (post, reply) pair, where the post is a starting message parallel (source and target) training data is scarce. in a thread, and the reply is either a direct reply or a Since then, several works have explored BT, either as second-level reply. 
Our contributions are (1) an intrinsic qualitative human evaluation of translations and back-translations for stereotypes detection and irony detection datasets in various combinations of source and pivot languages, followed by (2) an extrinsic evaluation of machine learning model performance on these datasets, using these various data augmentation sources.

2. Related Work

BT as a data augmentation method was originally proposed by Sennrich et al. [4], in the context of Neural Machine Translation (NMT), to allow using monolingual data to improve translation quality, particularly when parallel (source and target) training data is scarce.

Since then, several works have explored BT, either as a baseline to evaluate other data augmentation methods against, or as the primary augmentation method for low-resource tasks. For example, Kumar et al. [5] evaluated pre-trained conditional generative Transformer models as data augmentation sources and used BT as a baseline. They found that BT achieves relatively high extrinsic performance against simpler approaches such as Easy Data Augmentation (EDA) [7], but also against some Transformer models; it also obtains most of the best scores for semantic fidelity and data diversity.

Xie et al. [6] make use of BT as an augmentation strategy in their semi-supervised Consistency Training approach, in which a model is trained with a loss function combining traditional supervised learning on a limited amount of labeled data with an unsupervised consistency loss. The latter consists of minimizing a divergence metric between the output distributions for an unlabeled input and a noised version of it, the noise function being the chosen data augmentation method, i.e., for text, BT.

As for the challenges related to the application of translation to texts with irony or sarcasm, a few papers discussing this task were recently published, among which we can cite [8] and [9].

3. Datasets

We focus on the tasks of stereotypes and irony detection with relevant multilingual datasets. Table 1 summarizes the characteristics of their French and Italian splits, the chosen languages for this study:

Table 1: Statistics for the datasets used in this work.

Dataset      Lang.     Size (train; test; val)    Positive Class
StereoHoax   Italian   3123 (1841; 1185; 97)      15.11%
StereoHoax   French    9342 (6981; 1993; 368)     12.07%
MultiPICo    Italian   967 (619; 193; 155)        25.34%
MultiPICo    French    1724 (1104; 345; 275)      25.17%

• StereoHoax [10] is a contextualized multilingual dataset of tweets annotated primarily for the presence of anti-migrant stereotypes. It consists of replies to tweets containing racial hoaxes (RH), with each message having a "conversation head" (the message containing the source RH) and a direct parent message (if applicable).
• MultiPICo [11] is a disaggregated multilingual dataset of short social media conversations annotated for irony detection through crowdsourcing. Each instance is a (post, reply) pair, where the post is a starting message in a thread, and the reply is either a direct reply or a second-level reply.

4. Translation Model

To use BT as a data augmentation method, one crucial decision to make is that of the translation system(s). Machine Translation (MT) models are in fact not explicitly designed to inject relevant noise into texts to increase the variety of the data available. Therefore, a significant part of this beneficial noise will be linked to the idiosyncrasies of the chosen model(s).

In this work, due to the number of different configurations (and thus source-target language pairs) we wished to investigate, we decided to limit our selection to intrinsically multilingual models. In a preliminary phase, we thus experimented with the locally runnable Transformer-based multimodal Neural MT model SeamlessM4T v2 [12], proposed by Meta AI. However, after early evaluations of the obtained translations and back-translations, we observed too many issues and violations of important criteria (see section 5). As such, we eventually selected the Google Translate API for our evaluation and experiments, as it seemed to offer the best tradeoffs between translation and back-translation quality, as well as ease of access to the languages chosen for this work (French, Italian, English, and Arabic). It is important to note, however, that the models used by Google Translate themselves make use of BT as a data augmentation technique, as well as M4 Modelling¹: in practice, this may cause some issues for use in BT, as undesirable artifacts of BT and Massively Multilingual Massive NMT (possibly caused by parameter bottlenecks or language interferences [13]) may have detrimental effects on the quality of the augmented data.

¹ https://research.google/blog/recent-advances-in-google-translate/
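For illustration, one translation step could be issued as below. Note that this work only presupposes the Google Translate API in general; the specific client library shown here (the `google-cloud-translate` v2 Python client) and the helper function are assumptions for the sketch, not the exact tooling used in the experiments.

```python
# Hedged sketch: one way to implement the `translate` callable from the
# earlier snippets with the google-cloud-translate v2 client.
from google.cloud import translate_v2 as translate

_client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set

def gt_translate(text: str, src_lang: str, tgt_lang: str) -> str:
    """Translate `text` from `src_lang` to `tgt_lang` via Google Translate."""
    result = _client.translate(text, source_language=src_lang,
                               target_language=tgt_lang)
    return result["translatedText"]

# e.g., back_translate("per i primi tempi", "it", "ar", gt_translate)
```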
5. Intrinsic Evaluation

To judge the viability of BT for these two datasets and languages, we perform a human qualitative evaluation of the produced back-translations using the following protocol. First, we collect a set of data for both datasets and languages: we randomly sample 50 instances each for the French and Italian subsets, 25 from the positive class and 25 from the negative class, for a total of 200 instances. For all the cases examined, we consider the text of the messages and the associated conversational context, which can consist of one or two other messages (an optional direct parent, and the conversation head/original post).

In addition to French and Italian as source and pivot languages, American English and Modern Standard Arabic were also selected on account of the linguistic expertise of the authors. Thus, for the 100 instances in Italian, we apply the following BT settings (source - pivot - source): Italian - English - Italian; Italian - French - Italian; Italian - Arabic - Italian. Similarly, for the 100 French instances, we apply the following BT settings: French - English - French; French - Italian - French; French - Arabic - French. We use the Google Translate API due to its ease of use and the availability of the chosen source and target languages.

A manual qualitative approach is used for the evaluation of the BT results: 4 language experts (co-authors of this paper) evaluate the quality of the produced back-translations (and intermediate translations, though in a less quantitative capacity). All evaluators are native speakers of one of the source languages (French and Italian), as well as sufficiently proficient (or native speakers) in the pivot languages (French, Italian, English, and Arabic). They are tasked with comparing the original and back-translated instances, also considering the pivot translation to help understand potential artifacts or errors introduced in the process.
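A minimal sketch of the class-balanced sampling described at the start of this section follows; the DataFrame layout, the `label` column name, and the seed are illustrative assumptions rather than details from our protocol.

```python
# Sketch of the evaluation sampling step: for each dataset/language subset,
# draw 25 positive and 25 negative instances for manual evaluation.
import pandas as pd

def sample_for_evaluation(subset: pd.DataFrame, per_class: int = 25,
                          seed: int = 0) -> pd.DataFrame:
    """Draw a class-balanced evaluation sample (2 * per_class instances)."""
    return (subset.groupby("label", group_keys=False)
                  .sample(n=per_class, random_state=seed))

# Applied to the 4 subsets (2 datasets x 2 languages): 4 x 50 = 200 instances.
```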
Evaluators could assign one label to problematic instances containing a violation of the following associated quality criteria:

• faithfulness: a faithful translation accurately conveys the meaning of the original text without introducing errors, omissions, or distortions. Since we focus on texts featuring expressions of stereotype or irony, faithful instances must also preserve these phenomena;
• preservation of non-translatables: this criterion refers to the translation of numbers, units, measurements and, in general, non-translatable terms such as proper nouns, brands, trademarks, hashtags, user mentions, emojis, acronyms, and specific cultural references, whose preservation matters for maintaining clarity, consistency, and legal compliance. This category also includes idiomatic expressions, which are especially difficult to translate;
• fluency: a text is fluent when it is perceived by a native speaker as reading "natural", in the way they would be expected to have structured it;
• other: this last criterion is used to report less frequent violations that cannot be encoded by the other criteria, including incomplete translations, word tokenization, or sentence segmentation.

5.1. Back-Translation Examples

To illustrate violations of these criteria, this section presents example parts of instances in their original (Og), translated (Tr), and back-translated (BT) forms, underlining the relevant spans when applicable.

In the following example from the Italian subset of MultiPICo, the fluency criterion is violated because of the inadequate and unnatural back-translation of the plural expression "per i primi tempi" ("for the initial period") into the singular "per la prima volta" ("for the first time"):

Og: "Se rimanere impiegato a 1400 euro è il tuo obiettivo ok, altrimenti è solo per i primi tempi"
Tr: "If staying employed at 1400 euros is your goal, ok, otherwise it's only for the first time"
BT: "Se restare impiegato a 1400 euro è il tuo obiettivo, ok, altrimenti è solo per la prima volta"

This example from French StereoHoax illustrates breaking the faithfulness criterion, with Arabic as the pivot language. In this message, the informal vulgar expression "n'avoir rien à foutre" (vulgar, "to have nothing to do"), which conveys an implied judgment of laziness towards the described target, cannot be properly translated into Arabic, like most vulgar expressions (a common issue with this pivot language), and loses its proper meaning in the back-translation, "n'avoir rien à se soucier" meaning "to have nothing to worry/care about":

Og: "Elle n'a rien à foutre"
Tr: "ليس لديها ما تهتم به"
BT: "Elle n'a rien à se soucier"

In this example from Italian MultiPICo, the violation concerns a non-translatable, in the form of the colloquial expression "⟨X⟩ della Madonna", intended as an idiomatic intensifier (similar to "A hell of a ⟨X⟩" in American English). In the pivot translation, the idiom fails to be transposed, and "Madonna" is interpreted as part of the proper noun of a non-existent virus ("Madonna virus") and transposed into the back-translation:

Og: "... Gli asiatici stanno tramando qualcosa di losco.... prima gli spaghetti al microonde con ketchup e adesso un virus della madonna ?"
Tr: "... The Asians are up to something shady... first microwaved spaghetti with ketchup and now a Madonna virus?"
BT: "... Gli asiatici stanno tramando qualcosa di losco... prima spaghetti al microonde con ketchup e ora un virus Madonna?"

Another example of a non-translatable failing to be preserved is the following, taken from the French subset of StereoHoax. Here, the idiomatic expression "se tuer/mourir à la tâche" (lit. "to kill oneself/die doing a task"), used in its informal variant with "[se] crever" (lit. "to burst", informal "to kill [oneself]/die"), was translated incorrectly, changing the meaning of the message:

Og: "Oui mais est ce que c'est normal ? Quand yen a un qui a rien foutu et que l'autre s'est crever à la tache ? Non la logique c'est qu'il peuvent cumuler pour arriver à une retraite vivable et qui dépasse le seuil de pauvreté !"
Tr: "Yes but is this normal? When one has done nothing and the other has died? No, the logic is that they can accumulate to achieve a livable retirement that exceeds the poverty line!"
BT: "Oui mais est-ce normal ? Quand l'un n'a rien fait et que l'autre est mort ? Non, la logique est qu'ils peuvent accumuler pour obtenir une retraite viable qui dépasse le seuil de pauvreté !"
5.2. Samples Evaluation

Table 2 presents the quantitative results of this quality evaluation on 200 instances (see section 5). Cases that fall outside the selected criteria (classified under "other") include erroneous translations of grammatical gender, especially when using English as a pivot language, which has been extensively discussed in the literature [14]. Other errors refer to segmentation or punctuation. The preservation of proper punctuation and of the distinction between different sentences, text chunks, and segments ensures clarity and readability, and can impact the quality of translation when using Machine Translation models. Unfortunately, due to the nature of the texts in question, i.e., social media messages, proper content segmentation is difficult to achieve because of the overall poor structure and formatting of the content (among many other forms of typographical artifacts and errors).

Regardless of the pivot language, some instances seem to be systematic sources of errors, which can be explained by the particularities of the MT model used. For example, in MultiPICo Italian, one instance is "Non la chiudono tranquillo", which should be interpreted as "They won't close it, don't worry" (speaking of the Italian Stock Exchange); however, for all pivot languages, and possibly due to the absence of a comma before "tranquillo", it is misinterpreted as an adverb and thus incorrectly back-translated to "silenziosamente" ("quietly"). Similarly, in MultiPICo French, a message discussing the increasing use of the idiomatic discourse marker/connector "du coup" (equivalent to the connector "so" in English) has this quoted expression consistently mis-back-translated to "tout d'un coup" ("all of a sudden/suddenly"), despite it not making sense in the context of the message. The use of the expression in quotation marks in this case may have confused the MT model, which otherwise does not struggle with this expression when tested manually.

Overall, English appears to perform best across all the pivot languages in all settings. This is not surprising considering that, for most MT models, English is the most represented language in the training data (both as source and target language), as well as the language typically used as a pivot to generate augmented instances for lower-resource languages. When using Arabic as a pivot language in our evaluations, we observed some unnatural expressions and constructs that appear "borrowed" from English: for example, in a MultiPICo Italian instance, the word "gratis" ("free [of charge/cost]") is mistranslated to "حر" ("freedom/liberty"); we thus hypothesize that the MT model used English as a pivot language for the Italian-Arabic language pair, as both terms would indeed likely be mapped to the polysemic and thus ambiguous term "free" in English.

Table 2: Distribution of translation-related errors (faith: faithfulness, n-trs: non-translatables; see section 5) in 50 sample instances (25 of each class) of each dataset, for all combinations of source and pivot languages (BT-setting).

(a) MultiPICo Back-Translation errors
BT-setting    faith   n-trs   fluency   other
Ita-Eng-Ita   16%     8%      4%        2%
Ita-Fra-Ita   26%     6%      4%        4%
Ita-Arb-Ita   36%     8%      4%        2%
mean          27%     7%      4%        3%
Fra-Eng-Fra   18%     14%     0%        0%
Fra-Ita-Fra   28%     14%     0%        0%
Fra-Arb-Fra   36%     12%     0%        2%
mean          27%     13%     0%        1%

(b) StereoHoax Back-Translation errors
BT-setting    faith   n-trs   fluency   other
Ita-Eng-Ita   22%     4%      8%        0%
Ita-Fra-Ita   24%     4%      12%       2%
Ita-Arb-Ita   44%     12%     8%        0%
mean          30%     7%      9%        1%
Fra-Eng-Fra   18%     6%      4%        0%
Fra-Ita-Fra   36%     4%      6%        0%
Fra-Arb-Fra   18%     20%     10%       0%
mean          24%     10%     7%        0%

6. Extrinsic Evaluation

To evaluate the effectiveness of BT as a data augmentation method for stereotypes or irony detection, we performed some preliminary experiments with varying configurations. For these experiments, we used the XLM-RoBERTa [15] multilingual Transformer classifier: while for smaller models, monolingual Transformers are generally preferable to multilingual ones, we preferred to use a single model in all configurations. For similar reasons, and due to time and resource constraints, for all experiments, we only automatically fine-tuned the hyperparameters of the models once for each dataset and source language combination (with a total of 4 starting configurations), on the baseline training set, that is, without any data augmentation. For more technical details, see Appendix A.

As the positive class (stereotype or irony present) is often the minority class for these and related tasks (see Table 1), we evaluate "balanced" data augmentation configurations, in which augmented samples are added to the positive class until it is the same size as the negative class (a sketch of this balancing procedure follows the configuration list below). We evaluated the following configurations:
• baseline: the model is trained on the original, unmodified training set (with no balancing of the classes).
• oversampling (OV): oversampling was shown to be a strong baseline in various previous works [16, 17], and we thus evaluate it as an alternative or complement to BT.
• back-translation from ⟨pivot⟩ (BT[⟨pivot⟩]): augmented instances are sampled from back-translations of the original data using ⟨pivot⟩ as a pivot.
• cross-translation (XT): as the datasets used are multilingual and contain subsets in both French and Italian, one language's subset can be translated and used as augmented data for the other.
• mixed back/cross-translation with oversampling (BT[⟨pivot⟩]|OV, XT|OV): as the positive classes are, for both phenomena and all languages, less than half the size of the negative class, balancing the two requires sampling more instances from the data augmentation source than there are original positive instances, which could result in injecting translation-related biases into the training set. To attempt to mitigate this, we also evaluate sampling 50% from the back- or cross-translation strategies, and 50% from oversampling the positive class.

Note that, given the number of potential configurations, we only evaluate BT[Eng]|OV and XT|OV, due to time and resource constraints.
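The following sketch illustrates the balancing procedure shared by the non-baseline configurations above. The helper and column names are hypothetical; depending on the configuration, the augmentation pool would hold oversampled, back-translated, or cross-translated positive instances.

```python
# Sketch of "balanced" augmentation: augmented positives are added until the
# positive class matches the negative class in size. `label` is assumed binary.
import pandas as pd

def balance_with_augmentation(train: pd.DataFrame,
                              aug_pool: pd.DataFrame,
                              seed: int = 0) -> pd.DataFrame:
    """Top up the positive class with augmented positives until balanced."""
    n_neg = int((train["label"] == 0).sum())
    n_pos = int((train["label"] == 1).sum())
    needed = n_neg - n_pos
    if needed <= 0:
        return train
    positives = aug_pool[aug_pool["label"] == 1]
    # Sample with replacement when the pool is smaller than what is needed
    # (always the case when oversampling the original positives).
    extra = positives.sample(n=needed,
                             replace=len(positives) < needed,
                             random_state=seed)
    return pd.concat([train, extra], ignore_index=True)
```

For the mixed configurations, half of the missing instances would be drawn from the back- or cross-translated pool and half from oversampling the original positives.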
Table 3 displays the results of our experiments in terms of macro F1-scores, as well as positive class F1-scores.

Table 3: Results of our experiments for various data augmentation configurations (see section 6). The best score in each row is marked with *.

(a) Results in terms of Macro F1-score.
Dataset      Source   baseline   OV      BT[Eng]   BT[Fra/Ita]   BT[Arb]   XT      BT[Eng]|OV   XT|OV
StereoHoax   Ita      75.44      74.98   74.29     74.34         75.96     46.55   74.58        76.18*
StereoHoax   Fra      68.05*     67.36   55.73     64.12         60.80     64.43   65.68        65.85
MultiPICo    Ita      68.21      65.23   65.71     63.56         68.49*    65.79   61.86        63.48
MultiPICo    Fra      59.73      64.70   64.01     61.24         63.28     64.91   64.09        65.17*

(b) Results in terms of Positive class F1-score.
Dataset      Source   baseline   OV      BT[Eng]   BT[Fra/Ita]   BT[Arb]   XT      BT[Eng]|OV   XT|OV
StereoHoax   Ita      56.13      56.06   54.55     54.48         57.55*    0.00    55.36        57.14
StereoHoax   Fra      43.48*     42.89   34.43     39.75         36.09     39.74   39.84        42.63
MultiPICo    Ita      54.55      46.67   55.22*    47.71         53.47     48.42   44.86        42.86
MultiPICo    Fra      37.09      45.57   49.51*    47.53         48.80     48.94   49.00        48.62

Except for StereoHoax French, at least one of the data augmentation configurations outperforms the baseline, though not necessarily BT. Indeed, for both StereoHoax Italian and MultiPICo French, the mixed cross-translation with oversampling (XT|OV) configuration achieves the highest macro F1-score, though not the best positive class score. This seems to indicate that the variety of data intrinsic to using a separate language subset of a multilingual dataset can be beneficial, when possible, over that artificially created by a data augmentation technique like BT. Additionally, we only experimented with cross-translation within one linguistic typology (Romance languages). As such, future investigations on whether this extends to cross-typology XT would be worth pursuing.

Interestingly, we find that the mixture of oversampling and back/cross-translation outperforms the equivalent non-mixed configuration for all datasets and languages except MultiPICo Italian. However, due to its small size (see Table 1), the results on this particular subset may be less significant, given the overall protocol for these experiments, and a protocol that can inject greater amounts of augmented data might be preferable. During initial experiments, however, we found that injecting larger quantities of augmented data (whether preserving the initial label distributions or not) seemed to consistently degrade test-set performance, most likely due to overfitting, but possibly also due to the models fitting the detrimental idiosyncrasies of the translation model instead of the characteristics of the phenomena to detect.

Moreover, the performance on the positive class (Table 3b) does not necessarily improve in step with the overall macro F1-score (Table 3a), even when the augmentation is applied solely to this class. Other works on similar phenomena show that data augmentation and related methods can boost the out-of-domain performance of such detection models [17]. The added variety in the occurrences of the phenomenon to detect would indeed help in generalizing its detection to other sources of data. Though, as the example of StereoHoax Italian in the cross-translation (XT) configuration shows, care should be taken not to overly shift the data distribution; otherwise, models may fail to learn the particular dataset's positive class entirely. The mixed data augmentation with oversampling configurations seems, however, successful in addressing this potential issue, though more variations in the proportions should be experimented with.

7. Conclusions

In this work, we have investigated using Back-Translation as a data augmentation technique for challenging low-resource tasks like stereotypes and irony detection, in a multilingual context.

Through an intrinsic evaluation of the quality of the augmented instances, we identified modes of failure of Machine Translation, which could negatively impact the data augmentation process. These errors stem from the intrinsic differences between typologies and specific languages, or from translation model idiosyncrasies, themselves potentially learned from methods like BT. Through a preliminary extrinsic evaluation on two multilingual datasets, we found that cross-translation can outperform Back-Translation, allowing us to augment one language subset by leveraging the variety of inputs present in the others.

In future work, we aim to expand this study to more numerous and varied source and pivot languages, and to different data augmentation configurations, namely, different proportions and selections of injected augmented data. We may also compare Back- and Cross-Translation against or alongside other related techniques, such as multitask learning or Active Learning. We also expect that some improvements can be obtained by mitigating translation failures; this can be done, for example, by leveraging an external LLM to check each step and remove or correct the errors from the final augmented dataset. Finally, it could also be interesting to perform tests with different model types on top of RoBERTa.

Acknowledgment

The work of T. Bourgeade was funded by the project StereotypHate, financed by the Compagnia di San Paolo for the call 'Progetti di Ateneo - Compagnia di San Paolo 2019/2021 - Mission 1.1 - Finanziamento ex-post'. The work of C. Bosco was partially funded by this same project.
References

[1] S. Menini, A. P. Aprosio, S. Tonelli, Abuse is Contextual, What about NLP? The Role of Context in Abusive Language Annotation and Detection, 2021. URL: http://arxiv.org/abs/2103.14916. doi:10.48550/arXiv.2103.14916. arXiv:2103.14916.
[2] M. Bayer, M.-A. Kaufhold, C. Reuter, A Survey on Data Augmentation for Text Classification, ACM Computing Surveys 55 (2022) 146:1-146:39. URL: https://dl.acm.org/doi/10.1145/3544558. doi:10.1145/3544558.
[3] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, E. Hovy, A Survey of Data Augmentation Approaches for NLP, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 968-988. URL: https://aclanthology.org/2021.findings-acl.84. doi:10.18653/v1/2021.findings-acl.84.
[4] R. Sennrich, B. Haddow, A. Birch, Improving Neural Machine Translation Models with Monolingual Data, in: K. Erk, N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 86-96. URL: https://aclanthology.org/P16-1009. doi:10.18653/v1/P16-1009.
[5] V. Kumar, A. Choudhary, E. Cho, Data Augmentation using Pre-trained Transformer Models, in: Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, Association for Computational Linguistics, Suzhou, China, 2020, pp. 18-26. URL: https://aclanthology.org/2020.lifelongnlp-1.3.
[6] Q. Xie, Z. Dai, E. Hovy, T. Luong, Q. Le, Unsupervised Data Augmentation for Consistency Training, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 6256-6268. URL: https://proceedings.neurips.cc/paper/2020/hash/44feb0096faa8326192570788b38c1d1-Abstract.html.
[7] J. Wei, K. Zou, EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6382-6388. URL: https://aclanthology.org/D19-1670. doi:10.18653/v1/D19-1670.
[8] H. Ardi, M. Al Hafizh, I. Rezqy, R. Tuzzikriah, Can machine translations translate humorous texts?, Humanus 21 (2022) 99-112.
[9] Initial exploration into sarcasm and irony through machine translation, Natural Language Processing Journal 9 (2024) 100106.
[10] T. Bourgeade, A. T. Cignarella, S. Frenda, M. Laurent, W. Schmeisser-Nieto, F. Benamara, C. Bosco, V. Moriceau, V. Patti, M. Taulé, A Multilingual Dataset of Racial Stereotypes in Social Media Conversational Threads, in: Findings of the Association for Computational Linguistics: EACL 2023, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 686-696. URL: https://aclanthology.org/2023.findings-eacl.51.
[11] S. Casola, S. Frenda, S. M. Lo, E. Sezerer, A. Uva, V. Basile, C. Bosco, A. Pedrani, C. Rubagotti, V. Patti, D. Bernardi, MultiPICo: Multilingual Perspectivist Irony Corpus, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2024.
[12] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, M. Duppenthaler, P.-A. Duquenne, B. Ellis, H. Elsahar, J. Haaheim, J. Hoffman, M.-J. Hwang, H. Inaguma, C. Klaiber, I. Kulikov, P. Li, D. Licht, J. Maillard, R. Mavlyutov, A. Rakotoarison, K. R. Sadagopan, A. Ramakrishnan, T. Tran, G. Wenzek, Y. Yang, E. Ye, I. Evtimov, P. Fernandez, C. Gao, P. Hansanti, E. Kalbassi, A. Kallet, A. Kozhevnikov, G. M. Gonzalez, R. S. Roman, C. Touret, C. Wong, C. Wood, B. Yu, P. Andrews, C. Balioglu, P.-J. Chen, M. R. Costa-jussà, M. Elbayad, H. Gong, F. Guzmán, K. Heffernan, S. Jain, J. Kao, A. Lee, X. Ma, A. Mourachko, B. Peloquin, J. Pino, S. Popuri, C. Ropers, S. Saleem, H. Schwenk, A. Sun, P. Tomasello, C. Wang, J. Wang, S. Wang, M. Williamson, Seamless: Multilingual Expressive and Streaming Speech Translation, 2023. URL: http://arxiv.org/abs/2312.05187. doi:10.48550/arXiv.2312.05187. arXiv:2312.05187.
[13] A. Mueller, G. Nicolai, A. D. McCarthy, D. Lewis, W. Wu, D. Yarowsky, An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 3710-3718. URL: https://aclanthology.org/2020.lrec-1.458.
[14] E. Rabinovich, S. Mirkin, R. Patel, L. Specia, S. Wintner, Personalized machine translation: Preserving original author traits, in: Proceedings of EACL 2017, Volume 1: Long Papers, 2017.
[15] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440-8451. URL: https://aclanthology.org/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.
[16] M. Juuti, T. Gröndahl, A. Flanagan, N. Asokan, A little goes a long way: Improving toxic language classification despite data scarcity, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2991-3009. URL: https://aclanthology.org/2020.findings-emnlp.269. doi:10.18653/v1/2020.findings-emnlp.269.
[17] C. Casula, S. Tonelli, Generation-Based Data Augmentation for Offensive Language Detection: Is It Worth It?, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 3359-3377. URL: https://aclanthology.org/2023.eacl-main.244. doi:10.18653/v1/2023.eacl-main.244.
[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-Art Natural Language Processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38-45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[19] L. Biewald, Experiment tracking with weights and biases, 2020. URL: https://www.wandb.com/, software available from wandb.com.
[20] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization, Journal of Machine Learning Research 18 (2018) 1-52. URL: http://jmlr.org/papers/v18/16-558.html.

A. Technical Details

For all experiments, we used XLM-RoBERTa-base as provided by the HuggingFace transformers [18] ecosystem (including the datasets library for data processing).

Automatic hyperparameter fine-tuning was accomplished using the Weights & Biases [19] AI platform's Bayesian hyperparameter optimization system, with the Hyperband early-stopping algorithm [20]. As mentioned in section 6, only 4 such optimizations were executed, one for each language subset of each dataset, in the baseline configuration (no data augmentation).

The learning rate (lr), the hardware training batch size (bs), and the number of gradient accumulation steps (ga) were automatically fine-tuned, and their final values are listed in Table A1. These models were trained for a maximum of 10 epochs, with the best-performing epoch checkpoint kept at the end (measured by macro F1-score), and with a warm-up ratio of 0.2 (linear warm-up from 0 to the initial learning rate over the first 20% of training steps), both determined during initial experiments.

Automatic fine-tuning and training of the models was performed on the Google Colab platform, using high-RAM T4 GPU instances, for an approximate total of 50 GPU-hours.

Table A1: Automatically fine-tuned hyperparameters (lr: learning rate; bs: batch size; ga: gradient accumulation steps).

Dataset      Lang.     lr          bs   ga
StereoHoax   French    2.963E-05   16   4
StereoHoax   Italian   1.000E-06   16   1
MultiPICo    French    2.963E-05   16   4
MultiPICo    Italian   2.920E-05   8    1
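For reference, the following is a minimal sketch of this setup using the transformers Trainer. The hyperparameter values are the StereoHoax French ones from Table A1; the two-example dataset and the metric callback are illustrative stand-ins, not the actual training script used in this work.

```python
# Sketch of the Appendix A setup: fine-tuning XLM-RoBERTa-base with the
# HuggingFace transformers Trainer, keeping the best epoch by macro F1.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)

# Toy stand-in for a (train, validation) split of one dataset/language subset.
raw = Dataset.from_dict({"text": ["a positive example", "a negative example"],
                         "label": [1, 0]})

def encode(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_ds = val_ds = raw.map(encode, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="xlmr-stereohoax-fra",
    learning_rate=2.963e-5,            # lr (Table A1)
    per_device_train_batch_size=16,    # bs (Table A1)
    gradient_accumulation_steps=4,     # ga (Table A1)
    num_train_epochs=10,
    warmup_ratio=0.2,                  # linear warm-up over 20% of steps
    eval_strategy="epoch",             # `evaluation_strategy` in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,       # keep best epoch checkpoint
    metric_for_best_model="macro_f1",
)
Trainer(model=model, args=args, train_dataset=train_ds,
        eval_dataset=val_ds, compute_metrics=compute_metrics).train()
```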