<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Sexism Detection in Multilingual Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ahmed Gamal Ibrahim</string-name>
          <email>ahmed@ipb.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rui Pedro Lopes</string-name>
          <email>rlopes@ipb.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Center in Digitalization and Intelligent Robotics (CeDRI), Laboratório Associado para a Sustentabilidade e Tecnologia em Regiões de Montanha (SusTEC), Instituto Politécnico de Bragança</institution>
          ,
          <addr-line>Campus de Santa Apolónia, 5300-253 Bragança</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Online sexism is a widespread social issue, found across social media platforms and in multiple languages. In this paper, we present our submission to the EXIST 2025 task 1 on sexism detection in English and Spanish tweets. To enhance model reliability and deal with data sparsity, we used a combination of text augmentation strategies, including AEDA (punctuation-based), synonym replacement, back-translation, and light code-switching via round-trip translation. These augmentations were applied to diversify training samples and better capture linguistic patterns. Our architecture builds on XLM-RoBERTa-large, fine-tuned for three subtasks: binary sexism detection, source intention classification, and sexism categorization. We incorporated both soft and hard label strategies to account for annotation disagreement, and applied label smoothing and class-weighted loss functions to manage class imbalance. The system was trained and evaluated on the official splits, showing promising results in multi-label classification (mainly Task 1.3), especially under soft label settings, while its performance in binary classification (Task 1.1) revealed limitations in generalization. Overall, our approach highlights the importance of multilingual-aware data augmentation and training strategies in building fairer content moderation systems.</p>
      </abstract>
      <kwd-group>
<kwd>Sexism detection</kwd>
        <kwd>Data augmentation</kwd>
        <kwd>Back translation</kwd>
        <kwd>Code switching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The widespread use of social media has significantly altered the dynamics of communication,
introducing both opportunities and challenges. Among the latter, the proliferation of sexism
in digital discourse, ranging from clear hostility to subtle bias, has become a pressing concern. Detecting
and mitigating such content is essential to establishing inclusive digital environments and reducing the
real-world harm associated with online gender-based discrimination.</p>
      <p>
Natural Language Processing (NLP) has been a steadily growing field for many years,
driven primarily by transformer-based architectures such as BERT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. These models
have demonstrated impressive capabilities in handling difficult linguistic tasks, including multilingual
and low-resource scenarios, which are particularly relevant in the context of online social media, where
code-switching and informal language are prevalent.
      </p>
<p>The EXIST 2025 task 1 focuses on sexism detection in English and Spanish tweets through
three hierarchical subtasks: binary classification, source intention, and sexism categorization. These
subtasks present several challenges, including language diversity, class imbalance, annotation uncertainty,
and semantic nuances.</p>
<p>In this paper, we present a unified multilingual pipeline built upon XLM-RoBERTa-large to address
all three subtasks. Our contributions include an augmentation strategy for multilingual data,
soft-label modeling to incorporate annotator uncertainty, and task-specific loss function design to
handle label imbalance. By evaluating our models across Tasks 1.1, 1.2, and 1.3, we provide insights
into the effectiveness and limitations of our approach and outline pathways for improving performance
in the future.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the art</title>
      <p>
Recent advancements in NLP have been driven by the development and extensive use of
transformer-based models such as XLM-RoBERTa. Transformer architectures fundamentally reshaped NLP
tasks due to their capacity for modeling long-range dependencies through self-attention mechanisms
[
        <xref ref-type="bibr" rid="ref3">3</xref>
]. XLM-RoBERTa, specifically, displayed significant performance improvements in cross-lingual
representation learning, allowing for effective multilingual understanding, particularly beneficial for
low-resource languages and complex linguistic scenarios involving code-switching [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. This model
outperformed earlier multilingual models such as mBERT by taking advantage of massive multilingual
datasets and complex training techniques [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
Addressing class imbalance remains a difficult obstacle in machine learning and in NLP. Johnson
and Khoshgoftaar [
        <xref ref-type="bibr" rid="ref6">6</xref>
] conducted a survey of deep learning approaches for managing class imbalance,
indicating the necessity of developing effective classification techniques specifically designed for
the imbalanced datasets prevalent in real-world scenarios like fraud detection, anomaly detection, and
medical diagnosis. Buda et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
] examined the effects of class imbalance on Convolutional Neural
Networks (CNNs), concluding that oversampling consistently outperforms undersampling, as it helps
maintain balanced class representations while avoiding the risk of overfitting that often affects smaller
training sets. Their experimental analysis provided important guidelines for handling imbalance in
diverse neural network architectures.
      </p>
      <p>
        Data augmentation strategies have greatly amplified NLP models’ reliability and generalization
capabilities. Back-translation, first systematically introduced by Sennrich et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], capitalizes on
monolingual data to augment bilingual training sets, thereby improving translation quality and downstream
NLP task performance. This approach has become a cornerstone in augmentation for multilingual
models. Moreover, recent explorations into adversarial perturbations and contextual synonym replacement
illustrate the importance of data augmentation [
        <xref ref-type="bibr" rid="ref9">9</xref>
]. Gokhale et al. demonstrated how well-designed perturbations
can effectively bolster model resilience against adversarial attacks and improve generalization to
unseen data distributions, an important aspect for deployment in real-world scenarios.
      </p>
      <p>
        Furthermore, methods such as label smoothing and knowledge distillation have shown major
improvements in neural network training and generalization. Label smoothing, examined by Müller et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
reduces model overconfidence, leading to better-calibrated predictions and improved generalization
across different classification tasks. However, Müller et al. also noted potential drawbacks, indicating
that label smoothing might negatively affect the performance of subsequent knowledge distillation
processes. Knowledge distillation, detailed by Hinton et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
], simplifies transferring insights from
large-scale, computationally intensive models into smaller, more efficient architectures, achieving high performance
without high computational demands, which is valuable in practical deployment.
      </p>
      <p>
        Multilingual NLP tasks greatly benefit from translation frameworks, such as OPUS-MT, which was
developed by Tiedemann and Thottingal [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. OPUS-MT provides open-source translation models and
pretrained multilingual resources, allowing for cross-lingual communication and addressing linguistic
inequalities, particularly benefiting low-resource languages. Additionally, standardized benchmarks
such as LinCE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have been important in enabling rigorous evaluation and facilitating improvements
in multilingual NLP models. LinCE’s datasets and metrics target linguistic code-switching scenarios,
supporting the development and validation of multilingual models.
      </p>
      <p>Our research builds upon these foundations, integrating multilingual representation techniques, data
augmentation methods, and class imbalance handling to enhance sexism detection capabilities across
multilingual and code-switched datasets. The systematic application of these methodologies aims at
achieving better accuracy and generalization, especially within complex linguistic contexts.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset used in this work is released within the EXIST 2025 task 1, which addresses sexism detection
in multilingual and code-switched social media text. The dataset consists of Spanish and English tweets,
annotated for three hierarchical subtasks: binary sexism identification (Task 1.1), source intention
classification (Task 1.2), and fine-grained sexism categorization (Task 1.3). Each tweet is labeled by six annotators, and
annotations are complemented with detailed demographic metadata such as gender, age range, country,
study level, and ethnicity, ensuring representation diversity across perspectives.</p>
      <p>The three subtasks are organized as follows. In Task 1.1, each tweet is labeled as YES
(sexist) or NO. Task 1.2 assigns a label among DIRECT, REPORTED, JUDGEMENTAL, or UNKNOWN if
the tweet is sexist. Task 1.3 adds finer granularity with categories such as OBJECTIFICATION,
STEREOTYPING-DOMINANCE, SEXUAL-VIOLENCE, and IDEOLOGICAL-INEQUALITY.</p>
<p>The dataset is split into training, development, and test partitions, comprising over 6,000 tweets for
training, approximately 1,000 for development, and over 2,000 instances for testing (Table 1). While
labels are provided for both training and development sets, the test set is unlabeled and intended for
final system evaluation. The tweets are in English and Spanish, which allows for multilingual and
cross-lingual modeling.</p>
      <p>All tweets were preprocessed by removing user mentions, URLs, hashtags, emojis, special characters,
and numbers, and by converting text to lowercase. To address class imbalance and improve model
generalization, we used several data augmentation techniques including AEDA (insertion of
punctuation), contextual synonym replacement using masked language modeling (xlm-roberta-large),
and back-translation via OPUS-MT pipelines. We also simulated code-switching behavior based on the
tweet’s language label using round-trip translation. Augmented data was cached and reused across
training runs to reduce computational overhead.</p>
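      <p>For illustration, the following is a minimal sketch of such a cleaning step in Python; it approximates the rules described above and is not the exact preprocessing code of our pipeline (whether hashtag words are dropped entirely is an assumption):</p>
      <preformat><![CDATA[
import re

def clean_tweet(text: str) -> str:
    """Sketch of the preprocessing described above: strip user mentions,
    URLs, hashtags, numbers, emojis and special characters, then lowercase."""
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"#\w+", " ", text)          # hashtags (dropped whole, an assumption)
    text = re.sub(r"\d+", " ", text)           # numbers
    # keep only letters (incl. accented Spanish letters) and whitespace;
    # this also removes emojis and other special characters
    text = re.sub(r"[^a-zA-ZáéíóúüñÁÉÍÓÚÜÑ\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("@user Las MUJERES no saben conducir!! 😡 #tag https://t.co/x 123"))
# -> "las mujeres no saben conducir"
]]></preformat>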
      <p>Annotation reliability is addressed through majority voting for Tasks 1.1 and 1.2. For Task 1.3, we
retained the union of all non-null category labels provided by annotators, enabling training while
preserving label diversity. Incomplete or ambiguous labels (e.g., marked as "-") were excluded from
loss computation but retained in the dataset for potential analysis or augmentation.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology and Implementation</title>
      <p>Our approach to the EXIST 2025 task 1 is grounded in a modular training pipeline designed to handle
multilingual data, label noise, class imbalance, and cross-task generalization. We implement a unified
system that supports Tasks 1.1 (binary classification), 1.2 (multi-class classification), and 1.3
(multi-label classification) using XLM-RoBERTa-large as the backbone architecture. This section provides a
walkthrough of our training pipeline, model configuration, augmentation strategy, and optimization
procedure.</p>
      <sec id="sec-4-1">
        <title>4.1. Overall Architecture and Data Flow</title>
        <p>The core architecture employs XLM-RoBERTa-large, a multilingual transformer pretrained on over 100
languages. We use the HuggingFace Transformers API to load both the tokenizer and the model. A
custom PyTorch training loop wraps the forward and backward passes, optimizer steps, and evaluation
metrics.</p>
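        <p>A minimal sketch of this setup with the HuggingFace Transformers API is shown below. The single-logit head corresponds to Task 1.1 (Tasks 1.2 and 1.3 would use 3 and 7 output units), and the maximum sequence length is an assumed value:</p>
        <preformat><![CDATA[
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "xlm-roberta-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=1,  # single logit + sigmoid for Task 1.1
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

batch = tokenizer(
    ["las mujeres no saben conducir"],
    padding=True, truncation=True, max_length=128,  # max_length is assumed
    return_tensors="pt",
).to(device)
with torch.no_grad():
    logits = model(**batch).logits  # shape: (batch_size, num_labels)
]]></preformat>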
        <p>Each tweet is preprocessed to remove mentions, URLs, hashtags, digits, and punctuation. Cleaned
tweets are augmented with four variants using AEDA, contextual synonym replacement, code-switch
simulation, and back-translation. The augmented data is stored and cached for efficient reuse.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task-Specific Handling</title>
        <p>We process each task (1.1, 1.2, 1.3) separately, with both hard and soft labeling strategies:
• Task 1.1 (Binary): Labels are binarized using majority voting. For soft labels, the proportion of
"YES" votes among six annotators is used as a float target.
• Task 1.2 (Multi-class): Votes across classes (DIRECT, REPORTED, JUDGEMENTAL) are tallied and
converted into either a hard label or a soft distribution (normalized counts).
• Task 1.3 (Multi-label): We use a fixed taxonomy of 7 categories (including a 'NO' category for
unlabeled instances). Each category is assigned a binary indicator in hard mode or a proportional
weight in soft mode.</p>
        <p>All tasks share a unified dataset wrapper with dynamic label encoding based on task and mode.</p>
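        <p>The sketch below shows one plausible realization of this encoding. The Task 1.3 category list is abbreviated to the categories named in Section 3 plus the 'NO' category, so it is illustrative rather than the full 7-way taxonomy:</p>
        <preformat><![CDATA[
from collections import Counter
import torch

def encode_task11(votes, soft=True):
    """Task 1.1: votes is a list such as ["YES", "NO", ...] from six annotators."""
    p_yes = votes.count("YES") / len(votes)
    if soft:
        return torch.tensor([p_yes])                      # float target in [0, 1]
    return torch.tensor([1.0 if p_yes >= 0.5 else 0.0])  # majority vote

CLASSES_12 = ["DIRECT", "REPORTED", "JUDGEMENTAL"]

def encode_task12(votes, soft=True):
    """Task 1.2: normalized vote counts (soft) or the majority class index (hard)."""
    counts = Counter(v for v in votes if v in CLASSES_12)
    total = sum(counts.values()) or 1
    dist = torch.tensor([counts[c] / total for c in CLASSES_12])
    return dist if soft else torch.tensor(int(dist.argmax()))

# Partial, illustrative category list; the system uses a fixed 7-way taxonomy.
CATEGORIES_13 = ["OBJECTIFICATION", "STEREOTYPING-DOMINANCE",
                 "SEXUAL-VIOLENCE", "IDEOLOGICAL-INEQUALITY", "NO"]

def encode_task13(annotator_label_sets, soft=True):
    """Task 1.3: one set of category labels per annotator."""
    n = len(annotator_label_sets)
    props = torch.tensor([sum(c in s for s in annotator_label_sets) / n
                          for c in CATEGORIES_13])
    return props if soft else (props > 0).float()  # union of labels in hard mode
]]></preformat>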
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Augmentation Pipeline</title>
<p>We employed four data augmentation techniques:
1. AEDA (An Easier Data Augmentation): This technique injects punctuation marks (such as
commas, periods, and semicolons) at random positions within the token sequence. It was originally
proposed to provide syntactic variety with minimal semantic distortion. For instance, the sentence
"She thinks women should stay home" might be transformed into "She, thinks women. should
stay; home". AEDA is language-agnostic and efficient, and its simplicity helps diversify sentence
structure without requiring any additional language resources (see the sketch after this list).
2. Contextual Synonym Replacement: Tokens are randomly masked and replaced using
predictions from a masked language model (XLM-RoBERTa). This method provides contextual
paraphrasing that is semantically aligned with the original sentence. For example, "girls cannot
code" might yield "girls cannot program". Unlike naive synonym substitution, this method
considers sentence-wide context, leading to fluent and accurate rewrites. The top prediction is selected
to ensure semantic coherence.
3. Code-Switching Simulation: This strategy randomly selects tokens from a source language
(e.g., English) and translates them into a target language (e.g., Spanish), followed by a reverse
translation. The goal is to simulate real-world multilingual input where users switch languages
mid-sentence. For example, "He is such a macho man" could be transformed to "He is such a
hombre macho" and then retranslated to "He is such a macho man" or a variant. This approach
reflects authentic usage patterns in bilingual communities and trains the model to handle
intra-sentence code-switching.
4. Back-Translation: This method uses OPUS-MT neural translation models to perform round-trip
translation, i.e., translating from the original language to the other (English to Spanish or vice
versa), and then back. This process generates paraphrastic variants that preserve meaning while
altering lexical and syntactic structure. For example, “She should not talk so loud” could become
“Ella no debería hablar tan fuerte” and back to “She shouldn’t speak so loudly”. This augmentation
enriches the dataset with diverse linguistic constructions and improves generalization to unseen
phrasings.</p>
        <p>Each original training tweet produces five total versions (1 original + 4 augmentations).</p>
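        <p>To make two of these techniques concrete, the sketch below shows a minimal AEDA implementation and a round-trip back-translation with OPUS-MT models. The insertion ratio and the specific Helsinki-NLP checkpoints are assumptions, and in our pipeline the translated outputs are cached rather than recomputed:</p>
        <preformat><![CDATA[
import random
from transformers import MarianMTModel, MarianTokenizer

PUNCS = [".", ",", ";", ":", "!", "?"]

def aeda(text: str, ratio: float = 0.3) -> str:
    """AEDA: insert a few random punctuation marks at random positions."""
    words = text.split()
    for _ in range(random.randint(1, max(1, int(len(words) * ratio)))):
        words.insert(random.randint(0, len(words)), random.choice(PUNCS))
    return " ".join(words)

def load_mt(src: str, tgt: str):
    name = f"Helsinki-NLP/opus-mt-{src}-{tgt}"  # assumed checkpoint naming
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

def translate(texts, tok, mt):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    return tok.batch_decode(mt.generate(**batch), skip_special_tokens=True)

def back_translate(text: str, lang: str = "en") -> str:
    """Round trip: lang -> other language -> lang (en/es in our setup)."""
    other = "es" if lang == "en" else "en"
    tok_f, mt_f = load_mt(lang, other)  # in practice, load once and reuse
    tok_b, mt_b = load_mt(other, lang)
    return translate(translate([text], tok_f, mt_f), tok_b, mt_b)[0]

print(aeda("she thinks women should stay home"))
# e.g. "she thinks , women should stay home ;"
]]></preformat>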
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Model Configuration</title>
<p>Each task is trained using XLM-RoBERTa with a classification head. For Tasks 1.1 and 1.3, the output
layer is a single sigmoid unit or a multi-sigmoid layer, respectively, while Task 1.2 uses a softmax head with class
weights.</p>
        <p>The model uses dropout in both the attention and feed-forward layers with a probability of 0.1. We
adopt the AdamW optimizer with a learning rate of 3e-5, weight decay of 0.01, and a linear warmup
over 10% of total steps.</p>
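        <p>A sketch of this optimization setup is shown below, assuming the model from Section 4.1 and a PyTorch DataLoader named train_loader (a hypothetical name):</p>
        <preformat><![CDATA[
import torch
from transformers import get_linear_schedule_with_warmup

EPOCHS = 10
total_steps = EPOCHS * len(train_loader)  # train_loader: assumed DataLoader

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # linear warmup over 10% of steps
    num_training_steps=total_steps,
)
]]></preformat>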
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Handling Imbalance and Label Smoothing</title>
        <p>To mitigate label imbalance and improve generalization, we applied task-specific strategies for loss
functions and target smoothing:
• Task 1.1 (Binary Classification): In soft mode, labels were constructed by computing the
proportion of "YES" annotations among six annotators. We used SmoothBCEWithLogits, a
custom smoothed binary cross-entropy loss, which regularizes model confidence by slightly
adjusting soft targets. In hard mode, we applied BCEWithLogitsLoss with a positive class
weight inversely proportional to the frequency of the "YES" class to address the strong class
imbalance.
• Task 1.2 (Multi-Class Classification): For soft mode, we converted annotator votes into a
normalized probability distribution and trained using a KL-divergence loss (SoftKLDivLoss) to align
model outputs with partial annotator consensus. In hard mode, we used CrossEntropyLoss
with class weights computed from label frequencies to prevent dominant categories like DIRECT
from overshadowing less frequent ones such as REPORTED and JUDGEMENTAL.
• Task 1.3 (Multi-Label Classification): In soft mode, each category label was assigned a
real-valued target based on the proportion of annotators selecting it. Unlike Task 1.2, we did not apply
class weighting in the multi-label task and relied solely on smoothing to mitigate label imbalance.</p>
        <p>In hard mode, we used standard BCEWithLogitsLoss without additional class reweighting.</p>
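        <p>Since the exact definitions of the custom losses are not reproduced above, the following is one plausible formulation of SmoothBCEWithLogits and SoftKLDivLoss; the smoothing factor of 0.1 is an assumed value:</p>
        <preformat><![CDATA[
import torch.nn as nn
import torch.nn.functional as F

class SmoothBCEWithLogits(nn.Module):
    """Smoothed BCE: soft targets are pulled slightly toward 0.5 before
    the binary cross-entropy is computed, regularizing model confidence."""
    def __init__(self, smoothing: float = 0.1):  # assumed smoothing factor
        super().__init__()
        self.smoothing = smoothing

    def forward(self, logits, targets):
        targets = targets * (1.0 - self.smoothing) + 0.5 * self.smoothing
        return F.binary_cross_entropy_with_logits(logits, targets)

class SoftKLDivLoss(nn.Module):
    """KL divergence between the model distribution and the normalized
    annotator-vote distribution (Task 1.2, soft mode)."""
    def forward(self, logits, target_dist):
        log_probs = F.log_softmax(logits, dim=-1)
        return F.kl_div(log_probs, target_dist, reduction="batchmean")
]]></preformat>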
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Training and Evaluation</title>
        <p>Training was conducted for 10 epochs with early stopping based on the F1 score on the validation set.
We computed micro-F1 and accuracy after each epoch. The best checkpoint was saved and reloaded if
it surpassed previous performance.</p>
<p>Batch sizes were set to 16. A learning rate scheduler adjusted the rate linearly with warmup. All
computations were performed on a CUDA-enabled GPU when available.</p>
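        <p>A condensed sketch of this loop for Task 1.1 is given below; it reuses the objects from the previous sketches, and the patience value, dev_loader, and criterion names are assumptions:</p>
        <preformat><![CDATA[
import torch
from sklearn.metrics import f1_score

best_f1, patience, bad_epochs = 0.0, 2, 0  # patience of 2 epochs is assumed
for epoch in range(EPOCHS):
    model.train()
    for batch, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(**batch).logits, targets)
        loss.backward()
        optimizer.step()
        scheduler.step()

    model.eval()
    preds, golds = [], []
    with torch.no_grad():
        for batch, targets in dev_loader:
            probs = torch.sigmoid(model(**batch).logits)
            preds.extend((probs >= 0.5).long().flatten().tolist())
            golds.extend((targets >= 0.5).long().flatten().tolist())
    f1 = f1_score(golds, preds, average="micro")

    if f1 > best_f1:                # keep the best checkpoint
        best_f1, bad_epochs = f1, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping on validation F1
            break
]]></preformat>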
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Task 1.1: Binary Sexism Classification</title>
        <p>For Task 1.1, which involved classifying tweets as sexist or not, our approach integrated label smoothing,
class weighting, and five-version data augmentation (original + 4 augmented variants). Despite these
techniques, our model did not reach the top ranks of the leaderboard. Upon inspecting the results,
we confirmed that our system underperformed relative to other participants, who achieved
higher normalized scores.</p>
        <p>Given the training logic in our implementation, including soft-target binarization based on annotator
agreement and smoothed binary cross-entropy loss, this performance gap may be attributable to either:
• Limited generalization due to overfitting on augmented variants.</p>
        <p>• Conservative thresholding strategies during sigmoid post-processing.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Task 1.2: Multi-Class Sexism Type Classification</title>
        <p>In Task 1.2, which required identifying the type of sexism (DIRECT, REPORTED, JUDGEMENTAL), our
system performed better, especially in soft labeling. The leaderboard shows that exist@Cedri obtained
a Cross Entropy score of 3.7432 and an ICM-Soft Norm score of 0.3279, ranking better than in
Task 1.1.</p>
        <p>Our architecture used soft targets computed from annotator votes and employed KL-divergence
loss to model partial agreement. This allowed the model to maintain sensitivity toward ambiguous or
multi-class examples. However, performance might have been negatively affected by:
• A skewed class distribution, with a dominance of DIRECT labels.</p>
        <p>Class reweighting was applied during training to balance this skew, but additional techniques could
have further improved the performance.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Task 1.3: Multi-Label Fine-Grained Categorization</title>
        <p>Task 1.3 required the identification of multiple sexism-related labels per tweet. Our approach applied
sigmoid classification with soft targets reflecting label agreement across annotators. The system
performed well in soft labeling, getting a normalized score of 0.3193 in ICM-Soft Norm, placing within
the top ten of submissions.</p>
        <p>We attribute this relative success to:
• The application of SmoothBCEWithLogits to prevent overconfidence in rare category
predictions.</p>
        <p>However, the system still sufered from:
• Overfitting on majority labels due to their prevalence.</p>
        <p>• Underrepresentation of specific categories, which limited recall despite good precision.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Overall Assessment and Future Directions</title>
<p>Our architecture showed reliable performance in some cases (such as Task 1.3 with soft labels), while it
heavily underperformed in others (such as Task 1.1 with soft labels). This indicates a solid
foundation, but also substantial room for improvement.</p>
        <p>Nonetheless, future work could benefit from:
• Exploring transformer variants optimized for multi-label settings.
• Conducting more extensive hyperparameter sweeps and ablation studies
to quantify the marginal utility of each augmentation method.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper introduced a multilingual sexism detection system based on the XLM-RoBERTa-large
architecture, evaluated across all three subtasks of task 1 in the EXIST 2025 challenge. Our approach
incorporated a modular training pipeline with four different data augmentation techniques, soft-label
strategies to capture annotator uncertainty, and class imbalance-aware optimization procedures.</p>
      <p>While the system demonstrated promising results in multi-label classification (mainly Task 1.3),
especially under soft label settings, its performance in binary classification (Task 1.1) revealed limitations
in generalization. These outcomes show the complexity of building reliable classifiers for nuanced
social phenomena, especially when labels are inherently subjective or sparsely represented.</p>
      <p>Through the analysis of performance across tasks, we identified that methods such as SmoothBCE
and KL-divergence-based loss functions provided tangible benefits in modeling partial consensus among
annotators. However, further improvements may rely on more aggressive regularization, tuning of
augmentation techniques, and the exploration of task-specialized transformer variants.</p>
      <p>Future work will focus on incorporating uncertainty-aware threshold calibration, the use of
multilingual large language models in a few-shot or instruction-tuned setting, and conducting studies to isolate
the contributions of each augmentation and loss formulation. Our findings contribute to the broader
field of bias detection in NLP, offering insights into how multilingual architectures can be adapted to
address socially impactful classification tasks.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>The authors are grateful to the Foundation for Science and Technology (FCT, Portugal) for
financial support through national funds FCT/MCTES (PIDDAC) to CeDRI, UIDB/05757/2020 (DOI:
10.54499/UIDB/05757/2020) and UIDP/05757/2020 (DOI: 10.54499/UIDP/05757/2020) and SusTEC,
LA/P/0007/2020 (DOI: 10.54499/LA/P/0007/2020).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
<p>During the preparation of this work, the authors used AI tools for grammar and spelling checks. The
authors reviewed and edited the content as needed and take full responsibility for the publication’s
content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1810</year>
          .04805. arXiv:
          <year>1810</year>
          .04805.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1907</year>
          . 11692. arXiv:
          <year>1907</year>
          .11692.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention is all you need,
          <year>2017</year>
          . URL: http://arxiv.org/abs/1706.03762. doi:
          <volume>10</volume>
          .48550/arXiv.1706. 03762. arXiv:
          <volume>1706</volume>
          .03762 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          , in: D.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
          </string-name>
          , J. Tetreault (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .acl-main.
          <volume>747</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          . acl-main.
          <volume>747</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Aguilar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kar</surname>
          </string-name>
          , T. Solorio,
          <article-title>LinCE: A centralized benchmark for linguistic code-switching evaluation</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Béchet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Blache</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cieri</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goggi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Isahara</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Mazo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odijk</surname>
          </string-name>
          , S. Piperidis (Eds.),
          <source>Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1803</fpage>
          -
          <lpage>1813</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .lrec-
          <volume>1</volume>
          .223/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , T. M.
          <article-title>Khoshgoftaar, Survey on deep learning with class imbalance 6 (</article-title>
          <year>2019</year>
          )
          <article-title>27</article-title>
          . URL: https://doi.org/10.1186/s40537-019-0192-5. doi:
          <volume>10</volume>
          .1186/s40537-019-0192-5.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Buda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Mazurowski</surname>
          </string-name>
          ,
          <article-title>A systematic study of the class imbalance problem in convolutional neural networks 106 (</article-title>
          <year>2018</year>
          )
          <fpage>249</fpage>
          -
          <lpage>259</lpage>
          . URL: https://www.sciencedirect.com/science/ article/pii/S0893608018302107. doi:
          <volume>10</volume>
          .1016/j.neunet.
          <year>2018</year>
          .
          <volume>07</volume>
          .011.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <article-title>Improving neural machine translation models with monolingual data, 2016</article-title>
          . URL: http://arxiv.org/abs/1511.06709. doi:
          <volume>10</volume>
          .48550/arXiv.1511.06709. arXiv:
          <volume>1511</volume>
          .06709 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gokhale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sachdeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baral</surname>
          </string-name>
          ,
          <article-title>Generalized but not Robust? comparing the efects of data modification methods on out-of-domain generalization and adversarial robustness</article-title>
          , in: S. Muresan,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Villavicencio (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL</source>
          <year>2022</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          ,
          <year>2020</year>
          , pp.
          <fpage>2705</fpage>
          -
          <lpage>2718</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .findings-acl.
          <volume>213</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .findings-acl.
          <volume>213</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          , G. Hinton,
          <source>When does label smoothing help?</source>
          ,
          <year>2020</year>
          . URL: http://arxiv.org/ abs/
          <year>1906</year>
          .02629. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1906</year>
          .
          <volume>02629</volume>
          . arXiv:
          <year>1906</year>
          .02629 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the knowledge in a neural network, 2015</article-title>
          . URL: http: //arxiv.org/abs/1503.02531. doi:
          <volume>10</volume>
          .48550/arXiv.1503.02531. arXiv:
          <volume>1503</volume>
          .02531 [stat].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>