<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Augmentation through Back-Translation for Stereotypes and Irony Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tom Bourgeade</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Casola</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adel Mahmoud Wizani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Bosco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica, Università di Torino</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LORIA, University of Lorraine</institution>
          ,
          <addr-line>Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>MaiNLP &amp; MCML, LMU Munich</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Complex linguistic phenomena such as stereotypes or irony are still challenging to detect, particularly due to the lower availability of annotated data. In this paper, we explore Back-Translation (BT) as a data augmentation method to enhance such datasets by artificially introducing semantics-preserving variations. We investigate French and Italian as source languages on two multilingual datasets annotated for the presence of stereotypes or irony, and evaluate French/Italian, English, and Arabic as pivot languages for the BT process. We also investigate cross-translation, i.e., augmenting one language subset of a multilingual dataset with translated instances from the other languages. We conduct an intrinsic evaluation of the quality of back-translated instances, identifying linguistic or translation model-specific errors that may occur with BT. We also perform an extrinsic evaluation of different data augmentation configurations to train a multilingual Transformer-based classifier for stereotype or irony detection on monolingual data. Warning: This paper may contain potentially offensive example messages.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Augmentation</kwd>
        <kwd>Back Translation</kwd>
        <kwd>Irony Detection</kwd>
        <kwd>Stereotypes Detection</kwd>
        <kwd>Low-Resource NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy. * Corresponding author.</p>
      <p>The work of T. Bourgeade and S. Casola was performed while at Dipartimento di Informatica, Università di Torino, Turin, Italy. tom.bourgeade@loria.fr (T. Bourgeade); s.casola@lmu.de (S. Casola); adel.mahmoudwizani@edu.unito.it (A. M. Wizani); cristina.bosco@unito.it (C. Bosco)</p>
      <p>0000-0002-0247-3130 (T. Bourgeade); 0000-0002-0017-2975
(S. Casola); 0000-0002-8857-4484 (C. Bosco)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License source languages — leveraging two multilingual datasets
Attribution 4.0 International (CC BY 4.0).
with subsets for these languages — and various languages
as pivots for the BT process (French/Italian, English, and
Arabic). We compare BT with an alternative process
for data augmentation, specific for multilingual datasets,
which we refer to as “cross-translation”, where the data
from one language subset is translated and then used as
a data augmentation source for another language subset.</p>
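The two augmentation processes described above can be sketched as follows; `translate` is a hypothetical stand-in for an MT system call (the paper uses the Google Translate API), and all names are illustrative, not the authors' actual code:

```python
# Minimal sketch of Back-Translation (BT) and cross-translation (XT).
# `translate` is a hypothetical placeholder for an MT system call.

def translate(text: str, src: str, tgt: str) -> str:
    # Stand-in: a real implementation would call an MT API here.
    return f"[{src}->{tgt}] {text}"

def back_translate(text: str, source: str, pivot: str) -> str:
    """Source -> pivot -> source: yields a meaning-preserving variant."""
    return translate(translate(text, source, pivot), pivot, source)

def cross_translate(texts, src_lang, tgt_lang):
    """Translate one language subset to use as augmented data for another."""
    return [translate(t, src_lang, tgt_lang) for t in texts]

variant = back_translate("Questo è un esempio", source="it", pivot="en")
```

With a real MT backend, `variant` would be an Italian paraphrase of the input obtained through the English pivot.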
      <p>Our contributions are (1) an intrinsic qualitative human evaluation of translations and back-translations for stereotypes detection and irony detection datasets in various combinations of source and pivot languages, followed by (2) an extrinsic evaluation of machine learning model performance on these datasets, using these various data augmentation sources.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>BT as a data augmentation method was originally proposed by Sennrich et al. [4], in the context of Neural Machine Translation (NMT), to allow using monolingual data to improve translation quality, particularly when parallel (source and target) training data is scarce.</p>
      <p>Since then, several works have explored BT, either as a baseline to evaluate other data augmentation methods against, or as the primary augmentation method for low-resource tasks. For example, Kumar et al. [5] evaluated pre-trained conditional generative Transformer models as data augmentation sources and used BT as a baseline.</p>
      <p>They found that BT achieves relatively high extrinsic
performance against simpler approaches such as Easy Data
Augmentation (EDA) [7] but also against some
Transformer models; it also obtains most of the best scores for
semantic fidelity and data diversity.</p>
      <p>Xie et al. [6] make use of BT as an augmentation
strategy in their semi-supervised Consistency Training
approach, in which a model is trained with a loss function
combining traditional supervised learning on a limited
amount of labeled data, with an unsupervised consistency
loss. The latter consists of minimizing a divergence
metric between the output distributions for an unlabeled
input and a noised version of it, the noise function being
the chosen data augmentation method, i.e., for text, BT.</p>
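The consistency loss described above can be sketched as follows; this is a minimal scalar illustration of the idea (supervised loss plus a divergence between output distributions for an unlabeled input and its BT-noised version), not the actual implementation of [6]:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete output distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_training_loss(supervised_loss, p_original, p_backtranslated,
                              weight=1.0):
    """Supervised loss on labeled data plus an unsupervised consistency
    term: the divergence between the model's output distribution on an
    unlabeled input and on its back-translated (noised) version."""
    return supervised_loss + weight * kl_divergence(p_original, p_backtranslated)

# Identical predictions on the original input and its back-translation
# add no consistency penalty.
loss = consistency_training_loss(0.5, [0.7, 0.3], [0.7, 0.3])
```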
      <p>As for the challenges related to the application of translation to texts with irony or sarcasm, a few papers discussing this task were recently published, among which we can cite [8] and [9].</p>
    </sec>
    <sec id="sec-2b">
      <title>3. Datasets</title>
      <p>We focus on the tasks of stereotypes and irony detection with relevant multilingual datasets. Table 1 summarizes the characteristics of their French and Italian splits, the chosen languages for this study:</p>
      <p>• StereoHoax [10] is a contextualized multilingual dataset of tweets annotated primarily for the presence of anti-migrant stereotypes. It consists of replies to tweets containing racial hoaxes (RH), with each message having a "conversation head" (the message containing the source RH) and a direct parent message (if applicable).</p>
      <p>• MultiPICo [11] is a disaggregated multilingual dataset of short social media conversations annotated for irony detection through crowdsourcing. Each instance is a (post, reply) pair, where the post is a starting message in a thread, and the reply is either a direct reply or a second-level reply.</p>
    </sec>
    <sec id="sec-2c">
      <title>4. Translation Model</title>
      <p>To use BT as a data augmentation method, one crucial decision to make is that of the translation system(s). Machine Translation (MT) models are in fact not explicitly designed to inject relevant noise into texts to increase the variety of data available. Therefore, a significant part of this beneficial noise will be linked to the idiosyncrasies of the chosen model(s).</p>
      <p>In this work, due to the number of different configurations (and thus source-target language pairs) we wished to investigate, we decided to limit our selection to intrinsically multilingual models. In a preliminary phase, we thus experimented with the locally runnable Transformer-based multimodal Neural MT model SeamlessM4T v2 [12] proposed by Meta AI. However, after early evaluations of the obtained translations and back-translations, we observed too many issues and violations of important criteria (see section 5). As such, we eventually selected the Google Translate API for our evaluation and experiments, as it seemed to offer the best tradeoffs between translation and back-translation quality, as well as ease of access to the languages chosen for this work (French, Italian, English, and Arabic). It is important to note, however, that the models used by Google Translate themselves make use of BT as a data augmentation technique, as well as M4 Modelling (see https://research.google/blog/recent-advances-in-google-translate/): in practice, this may cause some issues for use in BT, as undesirable artifacts of BT and Massively Multilingual Massive NMT, possibly caused by parameter bottlenecks or language interferences [13], may have detrimental effects on the quality of the augmented data.</p>
    </sec>
    <sec id="sec-2d">
      <title>5. Intrinsic Evaluation</title>
      <p>To judge the viability of BT for these two datasets and languages, we perform a human qualitative evaluation of the produced back-translations, using the following protocol.</p>
      <p>First, we collect a set of data for both datasets and languages: we randomly sample 50 instances each for the French and Italian subsets, 25 from the positive class and 25 from the negative class, for a total of 200 instances. For all the cases examined, we consider the text of the messages and the associated conversational context, which can consist of one or two other messages (an optional direct parent, and the conversation head/original post).</p>
      <p>In addition to French and Italian as source and pivot languages, American English and Modern Standard Arabic were also selected, on account of the linguistic expertise of the authors. Thus, for the 100 instances in Italian, we apply the following BT settings (&lt;source&gt; - &lt;pivot&gt; - &lt;target=source&gt;): Italian - English - Italian; Italian - French - Italian; Italian - Arabic - Italian. Similarly, for the 100 French instances, we apply the following BT settings: French - English - French; French - Italian - French; French - Arabic - French. We use the Google Translate API due to its ease of use and the availability of the chosen source and target languages.</p>
      <p>A manual qualitative approach is used for the evaluation of the BT results: 4 language experts (co-authors of this paper) evaluate the quality of the produced back-translations (and intermediate translations, though in a less quantitative capacity). All evaluators are native speakers of one of the source languages (French and Italian), as well as sufficiently proficient (or a native speaker) in the pivot languages (French, Italian, English, and Arabic). They are tasked with comparing the original and back-translated instances, also considering the pivot translation to help understand potential artifacts or errors introduced in the process. Evaluators could assign one label to problematic instances containing a violation of one of the following associated quality criteria:</p>
      <p>• faithfulness: a faithful translation accurately conveys the meaning of the original text without introducing errors, omissions, or distortions. Since we focus on texts featuring expressions of stereotype or irony, faithful instances must also preserve these phenomena;</p>
      <p>• preservation of non-translatables: this criterion refers to the translation of numbers, units, measurements, and, in general, non-translatable terms such as proper nouns, brands, trademarks, hashtags, user mentions, emojis, acronyms, and specific cultural references, for maintaining clarity, consistency, and legal compliance. This category also includes idiomatic expressions, which are especially difficult to translate;</p>
      <p>• fluency: a text is fluent when it is perceived by a native speaker as reading "natural", in the way they would be expected to have structured it;</p>
      <p>• other: this last criterion is used to report less frequent violations that cannot be encoded by the other criteria, including incomplete translations, word tokenization, or sentence segmentation.</p>
      <sec id="sec-2-3">
        <title>5.1. Back-Translation Examples</title>
        <p>To illustrate violations of these criteria, this section presents example parts of instances in their original (Og), translated (Tr), and back-translated (BT) forms.</p>
        <p>In the following example from the Italian subset of MultiPICo, the fluency criterion is violated because of the inadequate and unnatural back-translation of the plural expression "per i primi tempi" ("for the initial period") into the singular "per la prima volta" ("for the first time"):</p>
        <p>Og: "Se rimani impiegato a 1400 euro è il tuo obiettivo ok, altrimenti è solo per i primi tempi"</p>
        <p>Tr: "If staying employed at 1400 euros is your goal, ok, otherwise it's only for the first time"</p>
        <p>BT: "Se restare impiegato a 1400 euro è il tuo obiettivo, ok, altrimenti è solo per la prima volta"</p>
        <p>This example from French StereoHoax illustrates breaking the faithfulness criterion, with Arabic as the pivot language. In this message, the informal vulgar expression "n'avoir rien à foutre" (vulgar, "to have nothing to do"), which conveys an implied judgment of laziness towards the described target, cannot be properly translated into Arabic, like most vulgar expressions (a common issue with this pivot language), and loses its proper meaning in the back-translation, "n'avoir rien à se soucier", meaning "to have nothing to worry/care about":</p>
        <p>Og: "Elle n'a rien à foutre"</p>
        <p>Tr: "ليس لديها ما تهتم به"</p>
        <p>BT: "Elle n'a rien à se soucier"</p>
        <p>In this example from Italian MultiPICo, the violation concerns a non-translatable, in the form of the colloquial expression "&lt;X&gt; della Madonna", intended as an idiomatic intensifier (similar to "a hell of a &lt;X&gt;" in American English). In the pivot translation, the idiom fails to be transposed, and "Madonna" is interpreted as part of the proper noun of a non-existent virus ("Madonna virus"), which is then transposed into the back-translation:</p>
        <p>Og: "... Gli asiatici stanno tramando qualcosa di losco... prima gli spaghetti al microonde con ketchup e adesso un virus della madonna?"</p>
        <p>Tr: "... The Asians are up to something shady... first microwaved spaghetti with ketchup and now a Madonna virus?"</p>
        <p>BT: "... Gli asiatici stanno tramando qualcosa di losco... prima spaghetti al microonde con ketchup e ora un virus Madonna?"</p>
        <p>Another example of a non-translatable failing to be preserved is the following, taken from the French subset of StereoHoax. Here, the idiomatic expression "se tuer/mourir à la tâche" (lit. "to kill oneself/die doing a task"), used in its informal variant with "[se] crever" (lit. "to burst", informal "to kill [oneself]/die"), was translated incorrectly, changing the meaning of the message:</p>
        <p>Og: "Oui mais est ce que c'est normal ? Quand y en a un qui a rien foutu et que l'autre s'est crever à la tache ? Non la logique c'est qu'il peuvent cumuler pour arriver à une retraite vivable et qui dépasse le seuil de pauvreté !"</p>
        <p>Tr: "Yes but is this normal? When one has done nothing and the other has died? No, the logic is that they can accumulate to achieve a livable retirement that exceeds the poverty line!"</p>
        <p>BT: "Oui mais est-ce normal ? Quand l'un n'a rien fait et que l'autre est mort ? Non, la logique est qu'ils peuvent accumuler pour obtenir une retraite viable qui dépasse le seuil de pauvreté !"</p>
        <p>One French instance containing the idiomatic discourse marker/connector "du coup" (equivalent to the connector "so" in English) as a quoted expression has this expression consistently mis-backtranslated to "tout d'un coup" ("all of a sudden/suddenly"), despite it not making sense in the context of the message. The use of the expression in quotation marks in this case may have confused the MT model, which otherwise does not struggle with this expression when manually tested.</p>
        <p>Overall, English appears to perform best across all the pivot languages in all settings. This is not surprising considering that, for most MT models, English is the most represented language in the training data (both as source and target language), as well as the language typically used as a pivot to generate augmented instances for lower-resource languages. When using Arabic as a pivot language in our evaluations, we observed some unnatural expressions and constructs that appear "borrowed" from English: for example, in a MultiPICo Italian instance, the word "gratis" ("free [of charge/cost]") is mistranslated into an Arabic word meaning "freedom/liberty"; we thus hypothesize that the MT model used English as a pivot language for the Italian-Arabic language pair, as both terms would indeed likely be mapped to the polysemic, and thus ambiguous, term "free" in English.</p>
      </sec>
      <sec id="sec-2-4">
        <title>5.2. Samples Evaluation</title>
      </sec>
    </sec>
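The class-balanced sampling step of the evaluation protocol above can be sketched as follows; the data representation and function names are illustrative assumptions, not the authors' code:

```python
import random

def sample_for_human_evaluation(subset, n_per_class=25, seed=0):
    """Draw a class-balanced evaluation sample from one language subset:
    25 positive and 25 negative instances, i.e., 50 per subset."""
    rng = random.Random(seed)
    positives = [x for x in subset if x["label"] == 1]
    negatives = [x for x in subset if x["label"] == 0]
    return rng.sample(positives, n_per_class) + rng.sample(negatives, n_per_class)

# Toy subset with 30 positive and 30 negative instances.
subset = [{"id": i, "label": i % 2} for i in range(60)]
sample = sample_for_human_evaluation(subset)
```

Applied to the French and Italian subsets of both datasets, this yields the 4 × 50 = 200 instances evaluated in this study.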
    <sec id="sec-3">
      <title>6. Extrinsic Evaluation</title>
      <p>[Table 3: results per BT setting (Ita-Eng-Ita, Ita-Fra-Ita, Ita-Arb-Ita, and mean; Fra-Eng-Fra, Fra-Ita-Fra, Fra-Arb-Fra, and mean) on StereoHoax and MultiPICo.]</p>
      <p>We evaluate the data augmentation strategies in class-balancing configurations, in which augmented samples are added to the positive class until it is the same size as the negative class. We evaluated the following configurations:</p>
      <p>• baseline: the model is trained on the original, unmodified training set (with no balancing of the classes).</p>
      <p>• oversampling (OV): oversampling was shown to be a strong baseline in various previous works [16, 17], and we thus evaluate it as an alternative or complement to BT.</p>
      <p>• back-translation from &lt;language&gt; (BT[&lt;language&gt;]): augmented instances are sampled from back-translations of the original data using &lt;language&gt; as a pivot.</p>
      <p>• cross-translation (XT): as the datasets used are multilingual and contain subsets in both French and Italian, one language's subset can be translated and used as augmented data for the other.</p>
      <p>• mixed back/cross-translation with oversampling (BT[&lt;language&gt;]/XT|OV): as the positive classes are, for both phenomena and all languages, less than half the size of the negative class, balancing the two requires sampling more instances from the data augmentation source than there are original positive instances, which could result in injecting translation-related biases into the training set. To attempt to mitigate this, we also evaluate sampling 50% from back- or cross-translation strategies, with 50% from oversampling the positive class. Note that, given the number of potential configurations, we only evaluate BT[Eng]|OV and XT|OV due to time and resource constraints.</p>
      <p>Table 3 displays the results of our experiments in terms of macro F1-scores, as well as positive class F1-scores. Except for StereoHoax French, at least one of the data augmentation configurations outperforms the baseline, though not necessarily BT. Indeed, for both StereoHoax Italian and MultiPICo French, the mixed cross-translation with oversampling (XT|OV) configuration achieves the highest macro F1-score, though not the best positive class score. This seems to indicate that the variety of data intrinsic to using a separate language subset of a multilingual dataset can be beneficial, when possible, over that artificially created by a data augmentation technique like BT. Additionally, we only experimented with cross-translation within one linguistic typology (Romance languages). As such, future investigations on whether this extends to cross-typology XT would be worth pursuing.</p>
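The class-balancing scheme behind these configurations can be sketched as follows; this is an illustrative sketch under assumed names, where `mix_oversampling=0.5` corresponds to the mixed |OV variants:

```python
import random

def balance_with_augmentation(positives, negatives, augmented_pool,
                              mix_oversampling=0.0, seed=0):
    """Grow the positive class until it matches the negative class size,
    drawing from augmented data (BT or XT instances) and, for the mixed
    configurations, partly from oversampling the original positives."""
    rng = random.Random(seed)
    needed = len(negatives) - len(positives)
    n_oversample = int(needed * mix_oversampling)
    n_augmented = needed - n_oversample
    extra = [rng.choice(positives) for _ in range(n_oversample)]
    extra += rng.sample(augmented_pool, n_augmented)
    return positives + extra, negatives

# BT[Eng]|OV-style mix: half oversampled originals, half augmented instances.
pos, neg = ["p1", "p2"], ["n1", "n2", "n3", "n4", "n5", "n6"]
pool = [f"bt{i}" for i in range(10)]
new_pos, _ = balance_with_augmentation(pos, neg, pool, mix_oversampling=0.5)
```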
      <p>Interestingly, we find that the mixture of oversampling and back/cross-translation outperforms the equivalent non-mixed configuration for all datasets and languages except MultiPICo Italian. However, due to its small size (see Table 1), the results on this particular subset may be less significant, given the overall protocol for these experiments, and a protocol that can inject greater amounts of augmented data might be preferable. During initial experiments, however, we found that injecting larger quantities of augmented data (preserving or not the initial label distributions) seemed to consistently negatively impact test-set performance, most likely due to overfitting, but also possibly due to the models fitting on the translation model's detrimental idiosyncrasies, instead of the characteristics of the phenomena to detect.</p>
      <p>Moreover, the performance on the positive class (Table 3b) is not necessarily improved correspondingly with the overall macro F1-score (Table 3a), even when the augmentation is applied solely to this class. In other works on similar phenomena, it is shown that data augmentation and related methods can boost the Out-of-Domain performance of such detection models [17]. The addition of variety in the occurrences of the phenomenon to detect would indeed help in generalizing its detection to other sources of data. Though, as the example of StereoHoax Italian in the cross-translation (XT) configuration shows, care should be taken not to overly shift the data distribution; otherwise, models may fail to learn the particular dataset's positive class entirely. The mixed data augmentation with oversampling configurations seems, however, successful in addressing this potential issue, though more variations in the proportions should be experimented with.</p>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusions</title>
      <sec id="sec-4-1">
        <p>In this work, we have investigated using Back-Translation as a data augmentation technique for challenging low-resource tasks like stereotypes and irony detection, in a multilingual context.</p>
        <p>Through an intrinsic evaluation of the quality of the augmented instances, we identified modes of failure of Machine Translation, which could negatively impact the data augmentation process. These errors stem from the intrinsic differences between typologies and specific languages, or from translation model idiosyncrasies themselves, potentially learned from methods like BT. Through a preliminary extrinsic evaluation of two multilingual datasets, we found that cross-translation can outperform Back-Translation, allowing us to augment one language subset by leveraging the variety of inputs present in the others. In future work, we aim to expand this study to more numerous and varied source and pivot languages, and different data augmentation configurations, namely, different proportions and selections of injected augmented data. We may also compare Back- and Cross-Translation against or alongside other related techniques, such as multitask learning or Active Learning. We also expect that some improvements can be obtained by mitigating translation failures; this can be done, for example, by leveraging an external LLM to check each step and remove or correct the errors from the final augmented dataset. Finally, it could also be interesting to perform tests with different model types on top of RoBERTa.</p>
        <p>Acknowledgments. The work of T. Bourgeade was funded by the project StereotypHate, funded by the Compagnia di San Paolo for the call 'Progetti di Ateneo - Compagnia di San Paolo 2019/2021 - Mission 1.1 - Finanziamento ex-post'. The work of C. Bosco was partially funded by this same project.</p>
        <p>Long Papers), Association for Computational Lin- D. Licht, J. Maillard, R. Mavlyutov, A.
Rakotoariguistics, Berlin, Germany, 2016, pp. 86–96. URL: son, K. R. Sadagopan, A. Ramakrishnan, T. Tran,
https://aclanthology.org/P16-1009. doi:10.18653/ G. Wenzek, Y. Yang, E. Ye, I. Evtimov, P.
Fernanv1/P16-1009. dez, C. Gao, P. Hansanti, E. Kalbassi, A. Kallet,
[5] V. Kumar, A. Choudhary, E. Cho, Data Augmen- A. Kozhevnikov, G. M. Gonzalez, R. S. Roman,
tation using Pre-trained Transformer Models, in: C. Touret, C. Wong, C. Wood, B. Yu, P. Andrews,
Proceedings of the 2nd Workshop on Life-long C. Balioglu, P.-J. Chen, M. R. Costa-jussà, M.
ElLearning for Spoken Language Systems, Associa- bayad, H. Gong, F. Guzmán, K. Hefernan, S. Jain,
tion for Computational Linguistics, Suzhou, China, J. Kao, A. Lee, X. Ma, A. Mourachko, B. Peloquin,
2020, pp. 18–26. URL: https://aclanthology.org/2020. J. Pino, S. Popuri, C. Ropers, S. Saleem, H. Schwenk,
lifelongnlp-1.3. A. Sun, P. Tomasello, C. Wang, J. Wang, S. Wang,
[6] Q. Xie, Z. Dai, E. Hovy, T. Luong, Q. Le, Unsu- M. Williamson, Seamless: Multilingual Expressive
pervised Data Augmentation for Consistency and Streaming Speech Translation, 2023. URL: http:
Training, in: Advances in Neural Informa- //arxiv.org/abs/2312.05187. doi:10.48550/arXiv.
tion Processing Systems, volume 33, Curran 2312.05187. arXiv:2312.05187.
Associates, Inc., 2020, pp. 6256–6268. URL: [13] A. Mueller, G. Nicolai, A. D. McCarthy, D. Lewis,
https://proceedings.neurips.cc/paper/2020/hash/ W. Wu, D. Yarowsky, An Analysis of Massively
44feb0096faa8326192570788b38c1d1-Abstract. Multilingual Neural Machine Translation for
Lowhtml. Resource Languages, in: N. Calzolari, F. Béchet,
[7] J. Wei, K. Zou, EDA: Easy Data Augmentation Tech- P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi,
niques for Boosting Performance on Text Classifica- H. Isahara, B. Maegaard, J. Mariani, H. Mazo,
tion Tasks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings
Proceedings of the 2019 Conference on Empirical of the Twelfth Language Resources and Evaluation
Methods in Natural Language Processing and the Conference, European Language Resources
Associ9th International Joint Conference on Natural Lan- ation, Marseille, France, 2020, pp. 3710–3718. URL:
guage Processing (EMNLP-IJCNLP), Association https://aclanthology.org/2020.lrec-1.458.
for Computational Linguistics, Hong Kong, China, [14] E. Rabinovich, S. Mirkin, R. Patel, L. Specia,
2019, pp. 6382–6388. URL: https://aclanthology.org/ S. Winther, Personalized machine translation:
PreD19-1670. doi:10.18653/v1/D19-1670. serving original author traits, in: Proceedings of
[8] H. Ardi, M. Al Hafizh, I. Rezqy, R. Tuzzikriah, Can the EACL 2017 vol. 1 long papers, 2017.
machine translations translate humorous texts?, [15] A. Conneau, K. Khandelwal, N. Goyal, V.
ChaudHumanus 21 (2022) 99–112. hary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
[9] Initial exploration into sarcasm and irony through L. Zettlemoyer, V. Stoyanov, Unsupervised
Crossmachine translation, Natural Language Processing lingual Representation Learning at Scale, in:
ProJournal 9 (2024) 100106. ceedings of the 58th Annual Meeting of the
Associa[10] T. Bourgeade, A. T. Cignarella, S. Frenda, M. Lau- tion for Computational Linguistics, Association for
rent, W. Schmeisser-Nieto, F. Benamara, C. Bosco, Computational Linguistics, Online, 2020, pp. 8440–
V. Moriceau, V. Patti, M. Taulé, A Multilingual 8451. URL: https://aclanthology.org/2020.acl-main.
Dataset of Racial Stereotypes in Social Media Con- 747. doi:10.18653/v1/2020.acl-main.747.
versational Threads, in: Findings of the As- [16] M. Juuti, T. Gröndahl, A. Flanagan, N. Asokan, A
sociation for Computational Linguistics: EACL little goes a long way: Improving toxic language
2023, Association for Computational Linguistics, classification despite data scarcity, in: Findings
Dubrovnik, Croatia, 2023, pp. 686–696. URL: https: of the Association for Computational Linguistics:
//aclanthology.org/2023.findings-eacl.51. EMNLP 2020, Association for Computational
Lin[11] S. Casola, S. Frenda, S. M. Lo, E. Sezerer, A. Uva, guistics, Online, 2020, pp. 2991–3009. URL: https://
V. Basile, C. Bosco, A. Pedrani, C. Rubagotti, V. Patti, aclanthology.org/2020.findings-emnlp.269. doi: 10.
D. Bernardi, MultiPICo: Multilingual Perspectivist 18653/v1/2020.findings-emnlp.269.
Irony Corpus, in: Proceedings of the 62th Annual [17] C. Casula, S. Tonelli, Generation-Based Data
AugMeeting of the Association for Computational Lin- mentation for Ofensive Language Detection: Is
guistics, Association for Computational Linguistics, It Worth It?, in: Proceedings of the 17th
ConOnline, 2024. ference of the European Chapter of the
Associ[12] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, ation for Computational Linguistics, Association
N. Dong, M. Duppenthaler, P.-A. Duquenne, B. El- for Computational Linguistics, Dubrovnik,
Croalis, H. Elsahar, J. Haaheim, J. Hofman, M.-J. tia, 2023, pp. 3359–3377. URL: https://aclanthology.
Hwang, H. Inaguma, C. Klaiber, I. Kulikov, P. Li, org/2023.eacl-main.244. doi:10.18653/v1/2023.
eacl-main.244.
[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C.
Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun- For all experiments, we used the XLM-RoBERTa-base as
towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, provided by the the HuggingFace transformers [18]
Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gug- ecosystem (including the datasets library for data
proger, M. Drame, Q. Lhoest, A. Rush, Transform- cessing).
ers: State-of-the-Art Natural Language Process- Automatic hyperparameters fine-tuning was
accoming, in: Proceedings of the 2020 Conference on plished using the Weights &amp; Biases [19] AI platform’s
Empirical Methods in Natural Language Process- Bayesian hyperparameters optimization system, with the
ing: System Demonstrations, Association for Com- Hyperband early-stopping algorithm [20]. As mentioned
putational Linguistics, Online, 2020, pp. 38–45. in section 6, only 4 such optimizations were executed, one
URL: https://aclanthology.org/2020.emnlp-demos.6. for each language subset of each dataset, in the baseline
doi:10.18653/v1/2020.emnlp-demos.6. configuration (no data augmentation).
[19] L. Biewald, Experiment tracking with weights and The learning rate (), the hardware training batch
biases, 2020. URL: https://www.wandb.com/, soft- size (), and the number of gradient accumulation steps
ware available from wandb.com. (ga), were automatically fine-tuned, and their final values
[20] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, are listed in Table A1. These models were trained for a
A. Talwalkar, Hyperband: A Novel Bandit-Based maximum of 10 epochs, with the best performing epoch
Approach to Hyperparameter Optimization, Jour- checkpoint kept at the end (measured by macro F1-score),
nal of Machine Learning Research 18 (2018) 1–52. with a warm-up ratio of 0.2 (linear warm-up from 0 to
URL: http://jmlr.org/papers/v18/16-558.html. the initial learning rate over 20% of the training set), both
determined during initial experiments.</p>
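The epoch-selection rule described above (training for up to 10 epochs and keeping the checkpoint with the best validation macro F1-score) can be sketched as follows; both helper functions are toy placeholders, not the actual training code:

```python
# Sketch of best-checkpoint selection by validation macro F1-score.
# `train_one_epoch` and `evaluate_macro_f1` are hypothetical stand-ins.

def train_one_epoch(model, epoch):
    return model  # stand-in for one fine-tuning pass over the training set

def evaluate_macro_f1(model, epoch):
    return 0.5 + 0.04 * epoch - 0.005 * epoch ** 2  # toy validation curve

def train_with_best_checkpoint(model, max_epochs=10):
    """Train up to `max_epochs` and record the best-scoring epoch."""
    best_score, best_epoch = float("-inf"), None
    for epoch in range(max_epochs):
        model = train_one_epoch(model, epoch)
        score = evaluate_macro_f1(model, epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
    return best_epoch, best_score

best_epoch, best_score = train_with_best_checkpoint(model=None)
```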
        <p>Automatic fine-tuning and training of the models was performed on the Google Colab platform, using high-RAM T4 GPU instances, for an approximate total of 50 GPU-hours.</p>
        <p>[Table A1: final fine-tuned hyperparameter values (learning rate, batch size, gradient accumulation steps) per dataset and language subset.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] S. Menini, A. P. Aprosio, S. Tonelli, Abuse is Contextual, What about NLP? The Role of Context in Abusive Language Annotation and Detection, 2021. URL: http://arxiv.org/abs/2103.14916. doi:10.48550/arXiv.2103.14916. arXiv:2103.14916.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M. Bayer, M.-A. Kaufhold, C. Reuter, A Survey on Data Augmentation for Text Classification, ACM Computing Surveys 55 (2022) 146:1–146:39. URL: https://dl.acm.org/doi/10.1145/3544558. doi:10.1145/3544558.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, E. Hovy, A Survey of Data Augmentation Approaches for NLP, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 968–988. URL: https://aclanthology.org/2021.findings-acl.84. doi:10.18653/v1/2021.findings-acl.84.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] R. Sennrich, B. Haddow, A. Birch, Improving Neural Machine Translation Models with Monolingual Data, in: K. Erk, N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 86–96. URL: https://aclanthology.org/P16-1009. doi:10.18653/v1/P16-1009.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>