Data Augmentation through Back-Translation for Stereotypes and Irony Detection

Tom Bourgeade¹,*, Silvia Casola², Adel Mahmoud Wizani³ and Cristina Bosco³

¹ LORIA, University of Lorraine, Nancy, France
² MaiNLP & MCML, LMU Munich, Germany
³ Dipartimento di Informatica, Università di Torino, Turin, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author. The work of T. Bourgeade and S. Casola was performed while at Dipartimento di Informatica, Università di Torino, Turin, Italy.
tom.bourgeade@loria.fr (T. Bourgeade); s.casola@lmu.de (S. Casola); adel.mahmoudwizani@edu.unito.it (A. M. Wizani); cristina.bosco@unito.it (C. Bosco)
ORCID: 0000-0002-0247-3130 (T. Bourgeade); 0000-0002-0017-2975 (S. Casola); 0000-0002-8857-4484 (C. Bosco)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Complex linguistic phenomena such as stereotypes or irony are still challenging to detect, particularly due to the lower availability of annotated data. In this paper, we explore Back-Translation (BT) as a data augmentation method to enhance such datasets by artificially introducing semantics-preserving variations. We investigate French and Italian as source languages on two multilingual datasets annotated for the presence of stereotypes or irony and evaluate French/Italian, English, and Arabic as pivot languages for the BT process. We also investigate cross-translation, i.e., augmenting one language subset of a multilingual dataset with translated instances from the other languages. We conduct an intrinsic evaluation of the quality of back-translated instances, identifying linguistic or translation model-specific errors that may occur with BT. We also perform an extrinsic evaluation of different data augmentation configurations to train a multilingual Transformer-based classifier for stereotype or irony detection on monolingual data.

Warning: This paper may contain potentially offensive example messages.

Keywords: Data Augmentation, Back Translation, Irony Detection, Stereotypes Detection, Low-Resource NLP

1. Introduction

Equipping systems with linguistics-grounded capabilities can be complex. Despite the advancements brought by Large Language Models (LLMs), the availability of annotated corpora remains crucial. State-of-the-art systems still exhibit shortcomings, for example, when access to context or pragmatics is required for a true comprehension of the features of the involved phenomena [1].

Unfortunately, the development of large datasets annotated for specifically complex phenomena can be very time-consuming. When only small corpora are available, data augmentation techniques can be applied [2, 3]. Given a small set of original sample data, data augmentation artificially generates new instances that are similar and comparable to the existing data and can, therefore, be used to train and test systems with an extended dataset.

In this paper, we present experiments for augmenting two small datasets annotated for two diverse, challenging phenomena, namely stereotypes and irony detection. In several works exploring data augmentation, Back-Translation (BT) [4] was shown to be a strong and relatively easy-to-implement baseline [5, 6]. A BT process generally consists of two steps: given one or multiple translation systems, a text in a source language is first translated into a chosen pivot language, and the resulting text is then translated back into the source language. The expected output of the BT process is a text that is similar but not the same as the original input, accounting for the linguistic differences intrinsic to the language pair, but also for the idiosyncrasies of the chosen translation model(s). This relies on the fact that translation is only partially deterministic: while the expected output should have the same meaning as the input, outputs that differ morphologically or syntactically may still be considered correct translations of the input. In BT, the application of (at least) two translations increases the variability between the input and the output text.
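To make the process concrete, the following is a minimal sketch of this two-step procedure. The `translate` callable stands in for any MT system (an API client or a local model); its name and signature are illustrative assumptions, not an interface used in this work.

```python
# Minimal sketch of the two-step Back-Translation (BT) process.
# `translate` is a hypothetical stand-in for any MT system.
from typing import Callable

# (text, source_lang, target_lang) -> translated text
TranslateFn = Callable[[str, str, str], str]

def back_translate(text: str, source_lang: str, pivot_lang: str,
                   translate: TranslateFn) -> str:
    """Translate into the pivot language, then back into the source."""
    pivot_text = translate(text, source_lang, pivot_lang)   # step 1: src -> pivot
    return translate(pivot_text, pivot_lang, source_lang)   # step 2: pivot -> src

# e.g., augmented = back_translate(message, "it", "en", my_mt_system)
```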
Outputs lenging phenomena, namely stereotypes and irony de- too similar to the inputs can cause overfitting when used tection. In several works exploring data augmentation, for training, while with too different outputs, there is a risk of a shift in distribution that is too large, which CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, may negatively impact performance, at least in intra- Dec 04 — 06, 2024, Pisa, Italy dataset evaluations. A compromise between these two * Corresponding author. alternatives must be found. Therefore, an evaluation The work of T. Bourgeade and S. Casola was performed while at Dipartimento di Informatica, Università di Torino, Turin, Italy. of the quality of translations and back-translations is $ tom.bourgeade@loria.fr (T. Bourgeade); s.casola@lmu.de important to assess the benefits. (S. Casola); adel.mahmoudwizani@edu.unito.it (A. M. Wizani); In this paper, we want to investigate the viability of BT cristina.bosco@unito.it (C. Bosco) as a data augmentation technique for low-resource tasks  0000-0002-0247-3130 (T. Bourgeade); 0000-0002-0017-2975 in various configurations. We use French and Italian as (S. Casola); 0000-0002-8857-4484 (C. Bosco) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License source languages — leveraging two multilingual datasets Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings with subsets for these languages — and various languages Dataset Lang. Size (train; test; val) Positive as pivots for the BT process (French/Italian, English, and Class Arabic). We compare BT with an alternative process Italian 3123 (1841; 1185; 97) 15.11% for data augmentation, specific for multilingual datasets, StereoHoax French 9342 (6981; 1993; 368) 12.07% which we refer to as “cross-translation”, where the data Italian 967 (619; 193; 155) 25.34% from one language subset is translated and then used as MultiPICo a data augmentation source for another language subset. French 1724 (1104; 345; 275) 25.17% Our contributions are (1) an intrinsic qualitative hu- Table 1 man evaluation of translations and back-translations for Statistics for the datasets used in this work. stereotypes detection and irony detection datasets in vari- ous combinations of source and pivot languages, followed by (2) an extrinsic evaluation of machine learning model • StereoHoax [10] is a contextualized multilingual performance on these datasets, using these various data dataset of tweets annotated primarily for the presence of augmentation sources. anti-migrant stereotypes. It consists of replies to tweets containing racial hoaxes (RH), with each message having 2. Related Work a “conversation head” (the message containing the source RH) and a direct parent message (if applicable). BT as a data augmentation method was originally pro- • MultiPICo [11] is a disaggregated multilingual dataset posed by Sennrich et al. [4], in the context of Neural of short social media conversations annotated for irony Machine Translation (NMT), to allow using monolingual detection through crowdsourcing. Each instance is a data to improve translation quality, particularly when (post, reply) pair, where the post is a starting message parallel (source and target) training data is scarce. in a thread, and the reply is either a direct reply or a Since then, several works have explored BT, either as second-level reply. 
Our contributions are (1) an intrinsic qualitative human evaluation of translations and back-translations for stereotypes detection and irony detection datasets in various combinations of source and pivot languages, followed by (2) an extrinsic evaluation of machine learning model performance on these datasets, using these various data augmentation sources.

2. Related Work

BT as a data augmentation method was originally proposed by Sennrich et al. [4], in the context of Neural Machine Translation (NMT), to allow using monolingual data to improve translation quality, particularly when parallel (source and target) training data is scarce.

Since then, several works have explored BT, either as a baseline to evaluate other data augmentation methods against, or as the primary augmentation method for low-resource tasks. For example, Kumar et al. [5] evaluated pre-trained conditional generative Transformer models as data augmentation sources and used BT as a baseline. They found that BT achieves relatively high extrinsic performance against simpler approaches such as Easy Data Augmentation (EDA) [7], but also against some Transformer models; it also obtains most of the best scores for semantic fidelity and data diversity.

Xie et al. [6] make use of BT as an augmentation strategy in their semi-supervised Consistency Training approach, in which a model is trained with a loss function combining traditional supervised learning on a limited amount of labeled data with an unsupervised consistency loss. The latter consists of minimizing a divergence metric between the output distributions for an unlabeled input and a noised version of it, the noise function being the chosen data augmentation method, i.e., for text, BT.

As for the challenges related to the application of translation to texts with irony or sarcasm, a few papers discussing this task were recently published, among which we can cite [8] and [9].

3. Datasets

We focus on the tasks of stereotypes and irony detection with relevant multilingual datasets. Table 1 summarizes the characteristics of their French and Italian splits, the chosen languages for this study:

Table 1: Statistics for the datasets used in this work.

Dataset      Lang.     Size (train; test; val)    Positive Class
StereoHoax   Italian   3123 (1841; 1185; 97)      15.11%
StereoHoax   French    9342 (6981; 1993; 368)     12.07%
MultiPICo    Italian   967 (619; 193; 155)        25.34%
MultiPICo    French    1724 (1104; 345; 275)      25.17%

• StereoHoax [10] is a contextualized multilingual dataset of tweets annotated primarily for the presence of anti-migrant stereotypes. It consists of replies to tweets containing racial hoaxes (RH), with each message having a "conversation head" (the message containing the source RH) and a direct parent message (if applicable).
• MultiPICo [11] is a disaggregated multilingual dataset of short social media conversations annotated for irony detection through crowdsourcing. Each instance is a (post, reply) pair, where the post is a starting message in a thread, and the reply is either a direct reply or a second-level reply.

4. Translation Model

To use BT as a data augmentation method, one crucial decision to make is that of the translation system(s). Machine Translation (MT) models are in fact not explicitly designed to inject relevant noise into texts to increase the variety of the data available. Therefore, a significant part of this beneficial noise will be linked to the idiosyncrasies of the chosen model(s).

In this work, due to the number of different configurations (and thus source-target language pairs) we wished to investigate, we decided to limit our selection to intrinsically multilingual models. In a preliminary phase, we thus experimented with the locally runnable Transformer-based multimodal Neural MT model SeamlessM4T v2 [12], proposed by Meta AI. However, after early evaluations of the obtained translations and back-translations, we observed too many issues and violations of important criteria (see section 5). As such, we eventually selected the Google Translate API for our evaluation and experiments, as it seemed to offer the best tradeoffs between translation and back-translation quality, as well as ease of access to the languages chosen for this work (French, Italian, English, and Arabic). It is important to note, however, that the models used by Google Translate themselves make use of BT as a data augmentation technique, as well as M4 Modelling¹: in practice, this may cause some issues for use in BT, as undesirable artifacts of BT and Massively Multilingual Massive NMT (possibly caused by parameter bottlenecks or language interferences [13]) may have detrimental effects on the quality of the augmented data.

¹ https://research.google/blog/recent-advances-in-google-translate/
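For illustration, one translation step could be issued as below. Note that this work only presupposes the Google Translate API in general; the specific client library shown here (the `google-cloud-translate` v2 Python client) and the helper function are assumptions for the sketch, not the exact tooling used in the experiments.

```python
# Hedged sketch: one way to implement the `translate` callable from the
# earlier snippets with the google-cloud-translate v2 client.
from google.cloud import translate_v2 as translate

_client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set

def gt_translate(text: str, src_lang: str, tgt_lang: str) -> str:
    """Translate `text` from `src_lang` to `tgt_lang` via Google Translate."""
    result = _client.translate(text, source_language=src_lang,
                               target_language=tgt_lang)
    return result["translatedText"]

# e.g., back_translate("per i primi tempi", "it", "ar", gt_translate)
```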
5. Intrinsic Evaluation

To judge the viability of BT for these two datasets and languages, we perform a human qualitative evaluation of the produced back-translations using the following protocol. First, we collect a set of data for both datasets and languages: we randomly sample 50 instances each for the French and Italian subsets, 25 from the positive class and 25 from the negative class, for a total of 200 instances. For all the cases examined, we consider the text of the messages and the associated conversational context, which can consist of one or two other messages (an optional direct parent, and the conversation head/original post).

In addition to French and Italian as source and pivot languages, American English and Modern Standard Arabic were also selected on account of the linguistic expertise of the authors. Thus, for the 100 instances in Italian, we apply the following BT settings (source - pivot - source): Italian - English - Italian; Italian - French - Italian; Italian - Arabic - Italian. Similarly, for the 100 French instances, we apply the following BT settings: French - English - French; French - Italian - French; French - Arabic - French. We use the Google Translate API due to its ease of use and the availability of the chosen source and target languages.

A manual qualitative approach is used for the evaluation of the BT results: 4 language experts (co-authors of this paper) evaluate the quality of the produced back-translations (and intermediate translations, though in a less quantitative capacity). All evaluators are native speakers of one of the source languages (French and Italian), as well as sufficiently proficient (or native speakers) in the pivot languages (French, Italian, English, and Arabic). They are tasked with comparing the original and back-translated instances, also considering the pivot translation to help understand potential artifacts or errors introduced in the process.
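A minimal sketch of the class-balanced sampling described at the start of this section follows; the DataFrame layout, the `label` column name, and the seed are illustrative assumptions rather than details from our protocol.

```python
# Sketch of the evaluation sampling step: for each dataset/language subset,
# draw 25 positive and 25 negative instances for manual evaluation.
import pandas as pd

def sample_for_evaluation(subset: pd.DataFrame, per_class: int = 25,
                          seed: int = 0) -> pd.DataFrame:
    """Draw a class-balanced evaluation sample (2 * per_class instances)."""
    return (subset.groupby("label", group_keys=False)
                  .sample(n=per_class, random_state=seed))

# Applied to the 4 subsets (2 datasets x 2 languages): 4 x 50 = 200 instances.
```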
Evaluators could assign one label to problematic instances containing a violation of the following associated quality criteria:

• faithfulness: a faithful translation accurately conveys the meaning of the original text without introducing errors, omissions, or distortions. Since we focus on texts featuring expressions of stereotype or irony, faithful instances must also preserve these phenomena;
• preservation of non-translatables: this criterion refers to the translation of numbers, units, measurements and, in general, non-translatable terms such as proper nouns, brands, trademarks, hashtags, user mentions, emojis, acronyms, and specific cultural references, whose preservation matters for maintaining clarity, consistency, and legal compliance. This category also includes idiomatic expressions, which are especially difficult to translate;
• fluency: a text is fluent when it is perceived by a native speaker as reading "natural", in the way they would be expected to have structured it;
• other: this last criterion is used to report less frequent violations that cannot be encoded by the other criteria, including incomplete translations, word tokenization, or sentence segmentation.

5.1. Back-Translation Examples

To illustrate violations of these criteria, this section presents example parts of instances in their original (Og), translated (Tr), and back-translated (BT) forms, underlining the relevant spans when applicable.

In the following example from the Italian subset of MultiPICo, the fluency criterion is violated because of the inadequate and unnatural back-translation of the plural expression "per i primi tempi" ("for the initial period") into the singular "per la prima volta" ("for the first time"):

Og: "Se rimanere impiegato a 1400 euro è il tuo obiettivo ok, altrimenti è solo per i primi tempi"
Tr: "If staying employed at 1400 euros is your goal, ok, otherwise it's only for the first time"
BT: "Se restare impiegato a 1400 euro è il tuo obiettivo, ok, altrimenti è solo per la prima volta"

This example from French StereoHoax illustrates breaking the faithfulness criterion, with Arabic as the pivot language. In this message, the informal vulgar expression "n'avoir rien à foutre" (vulgar, "to have nothing to do"), which conveys an implied judgment of laziness towards the described target, cannot be properly translated into Arabic, like most vulgar expressions (a common issue with this pivot language), and loses its proper meaning in the back-translation, "n'avoir rien à se soucier" meaning "to have nothing to worry/care about":

Og: "Elle n'a rien à foutre"
Tr: "ليس لديها ما تهتم به"
BT: "Elle n'a rien à se soucier"

In this example from Italian MultiPICo, the violation concerns a non-translatable, in the form of the colloquial expression "⟨X⟩ della Madonna", intended as an idiomatic intensifier (similar to "A hell of a ⟨X⟩" in American English). In the pivot translation, the idiom fails to be transposed, and "Madonna" is interpreted as part of the proper noun of a non-existent virus ("Madonna virus") and transposed into the back-translation:

Og: "... Gli asiatici stanno tramando qualcosa di losco.... prima gli spaghetti al microonde con ketchup e adesso un virus della madonna ?"
Tr: "... The Asians are up to something shady... first microwaved spaghetti with ketchup and now a Madonna virus?"
BT: "... Gli asiatici stanno tramando qualcosa di losco... prima spaghetti al microonde con ketchup e ora un virus Madonna?"

Another example of a non-translatable failing to be preserved is the following, taken from the French subset of StereoHoax. Here, the idiomatic expression "se tuer/mourir à la tâche" (lit. "to kill oneself/die doing a task"), used in its informal variant with "[se] crever" (lit. "to burst", informal "to kill [oneself]/die"), was translated incorrectly, changing the meaning of the message:

Og: "Oui mais est ce que c'est normal ? Quand yen a un qui a rien foutu et que l'autre s'est crever à la tache ? Non la logique c'est qu'il peuvent cumuler pour arriver à une retraite vivable et qui dépasse le seuil de pauvreté !"
Tr: "Yes but is this normal? When one has done nothing and the other has died? No, the logic is that they can accumulate to achieve a livable retirement that exceeds the poverty line!"
BT: "Oui mais est-ce normal ? Quand l'un n'a rien fait et que l'autre est mort ? Non, la logique est qu'ils peuvent accumuler pour obtenir une retraite viable qui dépasse le seuil de pauvreté !"
5.2. Samples Evaluation

Table 2 presents the quantitative results of this quality evaluation on 200 instances (see section 5). Cases that fall outside the selected criteria (classified under "other") include erroneous translations of grammatical gender, especially when using English as a pivot language, which has been extensively discussed in the literature [14]. Other errors refer to segmentation or punctuation. The preservation of proper punctuation and of the distinction between different sentences, text chunks, and segments ensures clarity and readability, and can impact the quality of translation when using Machine Translation models. Unfortunately, due to the nature of the texts in question, i.e., social media messages, proper content segmentation is difficult to achieve because of the overall poor structure and formatting of the content (among many other forms of typographical artifacts and errors).

Regardless of the pivot language, some instances seem to be systematic sources of errors, which can be explained by the particularities of the MT model used. For example, in MultiPICo Italian, one instance is "Non la chiudono tranquillo", which should be interpreted as "They won't close it, don't worry" (speaking of the Italian Stock Exchange); however, for all pivot languages, and possibly due to the absence of a comma before "tranquillo", it is misinterpreted as an adverb and thus incorrectly back-translated to "silenziosamente" ("quietly"). Similarly, in MultiPICo French, a message discussing the increasing use of the idiomatic discourse marker/connector "du coup" (equivalent to the connector "so" in English) has this quoted expression consistently mis-back-translated to "tout d'un coup" ("all of a sudden/suddenly"), despite it not making sense in the context of the message. The use of the expression in quotation marks in this case may have confused the MT model, which otherwise does not struggle with this expression when tested manually.

Overall, English appears to perform best across all the pivot languages in all settings. This is not surprising considering that, for most MT models, English is the most represented language in the training data (both as source and target language), as well as the language typically used as a pivot to generate augmented instances for lower-resource languages. When using Arabic as a pivot language in our evaluations, we observed some unnatural expressions and constructs that appear "borrowed" from English: for example, in a MultiPICo Italian instance, the word "gratis" ("free [of charge/cost]") is mistranslated to "حر" ("freedom/liberty"); we thus hypothesize that the MT model used English as a pivot language for the Italian-Arabic language pair, as both terms would indeed likely be mapped to the polysemic and thus ambiguous term "free" in English.

Table 2: Distribution of translation-related errors (faith: faithfulness, n-trs: non-translatables; see section 5) in 50 sample instances (25 of each class) of each dataset, for all combinations of source and pivot languages (BT-setting).

(a) MultiPICo Back-Translation errors
BT-setting    faith   n-trs   fluency   other
Ita-Eng-Ita   16%     8%      4%        2%
Ita-Fra-Ita   26%     6%      4%        4%
Ita-Arb-Ita   36%     8%      4%        2%
mean          27%     7%      4%        3%
Fra-Eng-Fra   18%     14%     0%        0%
Fra-Ita-Fra   28%     14%     0%        0%
Fra-Arb-Fra   36%     12%     0%        2%
mean          27%     13%     0%        1%

(b) StereoHoax Back-Translation errors
BT-setting    faith   n-trs   fluency   other
Ita-Eng-Ita   22%     4%      8%        0%
Ita-Fra-Ita   24%     4%      12%       2%
Ita-Arb-Ita   44%     12%     8%        0%
mean          30%     7%      9%        1%
Fra-Eng-Fra   18%     6%      4%        0%
Fra-Ita-Fra   36%     4%      6%        0%
Fra-Arb-Fra   18%     20%     10%       0%
mean          24%     10%     7%        0%

6. Extrinsic Evaluation

To evaluate the effectiveness of BT as a data augmentation method for stereotypes or irony detection, we performed some preliminary experiments with varying configurations. For these experiments, we used the XLM-RoBERTa [15] multilingual Transformer classifier: while for smaller models, monolingual Transformers are generally preferable to multilingual ones, we preferred to use a single model in all configurations. For similar reasons, and due to time and resource constraints, for all experiments, we only automatically fine-tuned the hyperparameters of the models once for each dataset and source language combination (with a total of 4 starting configurations), on the baseline training set, that is, without any data augmentation. For more technical details, see Appendix A.

As the positive class (stereotype or irony present) is often the minority class for these and related tasks (see Table 1), we evaluate "balanced" data augmentation configurations, in which augmented samples are added to the positive class until it is the same size as the negative class (a sketch of this balancing procedure follows the configuration list below). We evaluated the following configurations:
• baseline: the model is trained on the original, unmodified training set (with no balancing of the classes).
• oversampling (OV): oversampling was shown to be a strong baseline in various previous works [16, 17], and we thus evaluate it as an alternative or complement to BT.
• back-translation from ⟨pivot⟩ (BT[⟨pivot⟩]): augmented instances are sampled from back-translations of the original data using ⟨pivot⟩ as a pivot.
• cross-translation (XT): as the datasets used are multilingual and contain subsets in both French and Italian, one language's subset can be translated and used as augmented data for the other.
• mixed back/cross-translation with oversampling (BT[⟨pivot⟩]|OV, XT|OV): as the positive classes are, for both phenomena and all languages, less than half the size of the negative class, balancing the two requires sampling more instances from the data augmentation source than there are original positive instances, which could result in injecting translation-related biases into the training set. To attempt to mitigate this, we also evaluate sampling 50% from the back- or cross-translation strategies, and 50% from oversampling the positive class.

Note that, given the number of potential configurations, we only evaluate BT[Eng]|OV and XT|OV, due to time and resource constraints.
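The following sketch illustrates the balancing procedure shared by the non-baseline configurations above. The helper and column names are hypothetical; depending on the configuration, the augmentation pool would hold oversampled, back-translated, or cross-translated positive instances.

```python
# Sketch of "balanced" augmentation: augmented positives are added until the
# positive class matches the negative class in size. `label` is assumed binary.
import pandas as pd

def balance_with_augmentation(train: pd.DataFrame,
                              aug_pool: pd.DataFrame,
                              seed: int = 0) -> pd.DataFrame:
    """Top up the positive class with augmented positives until balanced."""
    n_neg = int((train["label"] == 0).sum())
    n_pos = int((train["label"] == 1).sum())
    needed = n_neg - n_pos
    if needed <= 0:
        return train
    positives = aug_pool[aug_pool["label"] == 1]
    # Sample with replacement when the pool is smaller than what is needed
    # (always the case when oversampling the original positives).
    extra = positives.sample(n=needed,
                             replace=len(positives) < needed,
                             random_state=seed)
    return pd.concat([train, extra], ignore_index=True)
```

For the mixed configurations, half of the missing instances would be drawn from the back- or cross-translated pool and half from oversampling the original positives.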
Table 3 displays the results of our experiments in terms of macro F1-scores, as well as positive class F1-scores.

Table 3: Results of our experiments for various data augmentation configurations (see section 6). The best score in each row is marked with *.

(a) Results in terms of Macro F1-score.
Dataset      Source   baseline   OV      BT[Eng]   BT[Fra/Ita]   BT[Arb]   XT      BT[Eng]|OV   XT|OV
StereoHoax   Ita      75.44      74.98   74.29     74.34         75.96     46.55   74.58        76.18*
StereoHoax   Fra      68.05*     67.36   55.73     64.12         60.80     64.43   65.68        65.85
MultiPICo    Ita      68.21      65.23   65.71     63.56         68.49*    65.79   61.86        63.48
MultiPICo    Fra      59.73      64.70   64.01     61.24         63.28     64.91   64.09        65.17*

(b) Results in terms of Positive class F1-score.
Dataset      Source   baseline   OV      BT[Eng]   BT[Fra/Ita]   BT[Arb]   XT      BT[Eng]|OV   XT|OV
StereoHoax   Ita      56.13      56.06   54.55     54.48         57.55*    0.00    55.36        57.14
StereoHoax   Fra      43.48*     42.89   34.43     39.75         36.09     39.74   39.84        42.63
MultiPICo    Ita      54.55      46.67   55.22*    47.71         53.47     48.42   44.86        42.86
MultiPICo    Fra      37.09      45.57   49.51*    47.53         48.80     48.94   49.00        48.62

Except for StereoHoax French, at least one of the data augmentation configurations outperforms the baseline, though not necessarily BT. Indeed, for both StereoHoax Italian and MultiPICo French, the mixed cross-translation with oversampling (XT|OV) configuration achieves the highest macro F1-score, though not the best positive class score. This seems to indicate that the variety of data intrinsic to using a separate language subset of a multilingual dataset can be beneficial, when possible, over that artificially created by a data augmentation technique like BT. Additionally, we only experimented with cross-translation within one linguistic typology (Romance languages). As such, future investigations on whether this extends to cross-typology XT would be worth pursuing.

Interestingly, we find that the mixture of oversampling and back/cross-translation outperforms the equivalent non-mixed configuration for all datasets and languages except MultiPICo Italian. However, due to its small size (see Table 1), the results on this particular subset may be less significant, given the overall protocol for these experiments, and a protocol that can inject greater amounts of augmented data might be preferable. During initial experiments, however, we found that injecting larger quantities of augmented data (whether preserving the initial label distributions or not) seemed to consistently degrade test-set performance, most likely due to overfitting, but possibly also due to the models fitting the detrimental idiosyncrasies of the translation model instead of the characteristics of the phenomena to detect.

Moreover, the performance on the positive class (Table 3b) does not necessarily improve in step with the overall macro F1-score (Table 3a), even when the augmentation is applied solely to this class. Other works on similar phenomena show that data augmentation and related methods can boost the out-of-domain performance of such detection models [17]. The added variety in the occurrences of the phenomenon to detect would indeed help in generalizing its detection to other sources of data. Though, as the example of StereoHoax Italian in the cross-translation (XT) configuration shows, care should be taken not to overly shift the data distribution; otherwise, models may fail to learn the particular dataset's positive class entirely. The mixed data augmentation with oversampling configurations seems, however, successful in addressing this potential issue, though more variations in the proportions should be experimented with.

7. Conclusions

In this work, we have investigated using Back-Translation as a data augmentation technique for challenging low-resource tasks like stereotypes and irony detection, in a multilingual context.

Through an intrinsic evaluation of the quality of the augmented instances, we identified modes of failure of Machine Translation, which could negatively impact the data augmentation process. These errors stem from the intrinsic differences between typologies and specific languages, or from translation model idiosyncrasies, themselves potentially learned from methods like BT. Through a preliminary extrinsic evaluation on two multilingual datasets, we found that cross-translation can outperform Back-Translation, allowing us to augment one language subset by leveraging the variety of inputs present in the others.

In future work, we aim to expand this study to more numerous and varied source and pivot languages, and to different data augmentation configurations, namely, different proportions and selections of injected augmented data. We may also compare Back- and Cross-Translation against or alongside other related techniques, such as multitask learning or Active Learning. We also expect that some improvements can be obtained by mitigating translation failures; this can be done, for example, by leveraging an external LLM to check each step and remove or correct the errors from the final augmented dataset. Finally, it could also be interesting to perform tests with different model types on top of RoBERTa.

Acknowledgment

The work of T. Bourgeade was funded by the project StereotypHate, financed by the Compagnia di San Paolo for the call 'Progetti di Ateneo - Compagnia di San Paolo 2019/2021 - Mission 1.1 - Finanziamento ex-post'. The work of C. Bosco was partially funded by this same project.
References

[1] S. Menini, A. P. Aprosio, S. Tonelli, Abuse is Contextual, What about NLP? The Role of Context in Abusive Language Annotation and Detection, 2021. URL: http://arxiv.org/abs/2103.14916. doi:10.48550/arXiv.2103.14916. arXiv:2103.14916.
[2] M. Bayer, M.-A. Kaufhold, C. Reuter, A Survey on Data Augmentation for Text Classification, ACM Computing Surveys 55 (2022) 146:1-146:39. URL: https://dl.acm.org/doi/10.1145/3544558. doi:10.1145/3544558.
[3] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, E. Hovy, A Survey of Data Augmentation Approaches for NLP, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 968-988. URL: https://aclanthology.org/2021.findings-acl.84. doi:10.18653/v1/2021.findings-acl.84.
[4] R. Sennrich, B. Haddow, A. Birch, Improving Neural Machine Translation Models with Monolingual Data, in: K. Erk, N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 86-96. URL: https://aclanthology.org/P16-1009. doi:10.18653/v1/P16-1009.
[5] V. Kumar, A. Choudhary, E. Cho, Data Augmentation using Pre-trained Transformer Models, in: Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, Association for Computational Linguistics, Suzhou, China, 2020, pp. 18-26. URL: https://aclanthology.org/2020.lifelongnlp-1.3.
[6] Q. Xie, Z. Dai, E. Hovy, T. Luong, Q. Le, Unsupervised Data Augmentation for Consistency Training, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 6256-6268. URL: https://proceedings.neurips.cc/paper/2020/hash/44feb0096faa8326192570788b38c1d1-Abstract.html.
[7] J. Wei, K. Zou, EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6382-6388. URL: https://aclanthology.org/D19-1670. doi:10.18653/v1/D19-1670.
[8] H. Ardi, M. Al Hafizh, I. Rezqy, R. Tuzzikriah, Can machine translations translate humorous texts?, Humanus 21 (2022) 99-112.
[9] Initial exploration into sarcasm and irony through machine translation, Natural Language Processing Journal 9 (2024) 100106.
[10] T. Bourgeade, A. T. Cignarella, S. Frenda, M. Laurent, W. Schmeisser-Nieto, F. Benamara, C. Bosco, V. Moriceau, V. Patti, M. Taulé, A Multilingual Dataset of Racial Stereotypes in Social Media Conversational Threads, in: Findings of the Association for Computational Linguistics: EACL 2023, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 686-696. URL: https://aclanthology.org/2023.findings-eacl.51.
[11] S. Casola, S. Frenda, S. M. Lo, E. Sezerer, A. Uva, V. Basile, C. Bosco, A. Pedrani, C. Rubagotti, V. Patti, D. Bernardi, MultiPICo: Multilingual Perspectivist Irony Corpus, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2024.
[12] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, M. Duppenthaler, P.-A. Duquenne, B. Ellis, H. Elsahar, J. Haaheim, J. Hoffman, M.-J. Hwang, H. Inaguma, C. Klaiber, I. Kulikov, P. Li, D. Licht, J. Maillard, R. Mavlyutov, A. Rakotoarison, K. R. Sadagopan, A. Ramakrishnan, T. Tran, G. Wenzek, Y. Yang, E. Ye, I. Evtimov, P. Fernandez, C. Gao, P. Hansanti, E. Kalbassi, A. Kallet, A. Kozhevnikov, G. M. Gonzalez, R. S. Roman, C. Touret, C. Wong, C. Wood, B. Yu, P. Andrews, C. Balioglu, P.-J. Chen, M. R. Costa-jussà, M. Elbayad, H. Gong, F. Guzmán, K. Heffernan, S. Jain, J. Kao, A. Lee, X. Ma, A. Mourachko, B. Peloquin, J. Pino, S. Popuri, C. Ropers, S. Saleem, H. Schwenk, A. Sun, P. Tomasello, C. Wang, J. Wang, S. Wang, M. Williamson, Seamless: Multilingual Expressive and Streaming Speech Translation, 2023. URL: http://arxiv.org/abs/2312.05187. doi:10.48550/arXiv.2312.05187. arXiv:2312.05187.
[13] A. Mueller, G. Nicolai, A. D. McCarthy, D. Lewis, W. Wu, D. Yarowsky, An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 3710-3718. URL: https://aclanthology.org/2020.lrec-1.458.
[14] E. Rabinovich, S. Mirkin, R. Patel, L. Specia, S. Wintner, Personalized machine translation: Preserving original author traits, in: Proceedings of EACL 2017, Volume 1: Long Papers, 2017.
[15] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440-8451. URL: https://aclanthology.org/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.
[16] M. Juuti, T. Gröndahl, A. Flanagan, N. Asokan, A little goes a long way: Improving toxic language classification despite data scarcity, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2991-3009. URL: https://aclanthology.org/2020.findings-emnlp.269. doi:10.18653/v1/2020.findings-emnlp.269.
[17] C. Casula, S. Tonelli, Generation-Based Data Augmentation for Offensive Language Detection: Is It Worth It?, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 3359-3377. URL: https://aclanthology.org/2023.eacl-main.244. doi:10.18653/v1/2023.eacl-main.244.
[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-Art Natural Language Processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38-45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[19] L. Biewald, Experiment tracking with weights and biases, 2020. URL: https://www.wandb.com/, software available from wandb.com.
[20] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization, Journal of Machine Learning Research 18 (2018) 1-52. URL: http://jmlr.org/papers/v18/16-558.html.

A. Technical Details

For all experiments, we used XLM-RoBERTa-base as provided by the HuggingFace transformers [18] ecosystem (including the datasets library for data processing).

Automatic hyperparameter fine-tuning was accomplished using the Weights & Biases [19] AI platform's Bayesian hyperparameter optimization system, with the Hyperband early-stopping algorithm [20]. As mentioned in section 6, only 4 such optimizations were executed, one for each language subset of each dataset, in the baseline configuration (no data augmentation).

The learning rate (lr), the hardware training batch size (bs), and the number of gradient accumulation steps (ga) were automatically fine-tuned, and their final values are listed in Table A1. These models were trained for a maximum of 10 epochs, with the best-performing epoch checkpoint kept at the end (measured by macro F1-score), and with a warm-up ratio of 0.2 (linear warm-up from 0 to the initial learning rate over the first 20% of training steps), both determined during initial experiments.

Automatic fine-tuning and training of the models was performed on the Google Colab platform, using high-RAM T4 GPU instances, for an approximate total of 50 GPU-hours.

Table A1: Automatically fine-tuned hyperparameters (lr: learning rate; bs: batch size; ga: gradient accumulation steps).

Dataset      Lang.     lr          bs   ga
StereoHoax   French    2.963E-05   16   4
StereoHoax   Italian   1.000E-06   16   1
MultiPICo    French    2.963E-05   16   4
MultiPICo    Italian   2.920E-05   8    1
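For reference, the following is a minimal sketch of this setup using the transformers Trainer. The hyperparameter values are the StereoHoax French ones from Table A1; the two-example dataset and the metric callback are illustrative stand-ins, not the actual training script used in this work.

```python
# Sketch of the Appendix A setup: fine-tuning XLM-RoBERTa-base with the
# HuggingFace transformers Trainer, keeping the best epoch by macro F1.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)

# Toy stand-in for a (train, validation) split of one dataset/language subset.
raw = Dataset.from_dict({"text": ["a positive example", "a negative example"],
                         "label": [1, 0]})

def encode(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_ds = val_ds = raw.map(encode, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="xlmr-stereohoax-fra",
    learning_rate=2.963e-5,            # lr (Table A1)
    per_device_train_batch_size=16,    # bs (Table A1)
    gradient_accumulation_steps=4,     # ga (Table A1)
    num_train_epochs=10,
    warmup_ratio=0.2,                  # linear warm-up over 20% of steps
    eval_strategy="epoch",             # `evaluation_strategy` in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,       # keep best epoch checkpoint
    metric_for_best_model="macro_f1",
)
Trainer(model=model, args=args, train_dataset=train_ds,
        eval_dataset=val_ds, compute_metrics=compute_metrics).train()
```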