Evaluation of Intermediate Pre-training for the Detection of Offensive Language

Segun Taofeek Aroyehun and Alexander Gelbukh
CIC, Instituto Politécnico Nacional, Mexico City, Mexico
aroyehun.segun@gmail.com, gelbukh@gelbukh.com

Abstract. This paper presents an evaluation of intermediate pre-training for the task of offensive language identification. We leverage recent advances in multilingual contextual representations and the fine-tuning of pre-trained language models. We compare the performance of a pre-trained language model adapted to the social media domain with that of the same model further trained on multilingual sentiment analysis data. We find that intermediate pre-training prior to fine-tuning on the target task yields performance gains. The best submissions by our team, NLP-CIC, achieved first and second place on the non-contextual Spanish (Subtask 1) and Mexican Spanish (Subtask 3) subtasks of the MeOffendEs-IberLEF 2021 shared task, respectively.

Keywords: XLM-RoBERTa · Social Media · Spanish · Mexican Spanish · Offensive Language Identification · Sentiment Analysis

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Social media exists for the exchange of information through interactions among users on the various platforms. During these interactions, users often display unhealthy and anti-social behaviour such as insults and personal attacks. Such behaviour hampers meaningful conversation at the least and can cause harm to individuals, groups, and society at large. Natural language processing research can help identify offensive language and thereby reduce incidences of unacceptable behaviour. Research on this problem has gained attention especially for English, which is attributable to the availability of labeled data, pre-trained word embeddings, and language models. Recently, a number of shared tasks have focused on languages other than English. One example is the IberLEF shared task series on offensive language identification in Mexican Spanish. The 2021 edition [8] includes MeOffendEs [11], a track on offensive language identification across several social media platforms; the challenge provides datasets for Spanish and Mexican Spanish.

Multilingual language models (LMs) have become a popular area of focus with the transformer architecture [13], which makes it possible to combine text written in different languages to learn a single multilingual representation. Successful instances of this approach include multilingual BERT [6], XLM-RoBERTa [5], and, more recently, multilingual T5 [15]. However, these pre-trained multilingual LMs mainly cover domains with text written in a consistent and formal style, in contrast to social media text, which is noisy, irregular, and informal. To adapt LMs to a specific domain, the authors of [4, 7] showed that, starting from the transformer architecture and its pre-trained weights, a domain-specific model can be derived by continuing pre-training on text from the domain of interest. For the social media domain, [9] and [3] are examples of this adaptation in the monolingual English setting. For the multilingual case, the authors of [2] introduced an adaptation of XLM-RoBERTa to multilingual Twitter text: XLM-RoBERTa was further trained with the masked language modeling objective on Twitter text (about 12GB) in over 30 languages. This LM was then trained on a unified collection of sentiment analysis data in eight languages to demonstrate the effectiveness of the multilingual LM trained on Twitter text.
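To make this adaptation step concrete, the following is a minimal sketch of such continued pre-training (masked language modeling on in-domain text) with the huggingface transformers and datasets libraries. The corpus file, batch size, and other settings are illustrative assumptions, not the configuration used in [2].

```python
# Illustrative sketch of domain-adaptive continued pre-training via masked
# language modeling (MLM); the corpus path and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# In-domain corpus, e.g. one tweet per line (hypothetical file).
tweets = load_dataset("text", data_files={"train": "tweets.txt"})["train"]
tweets = tweets.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of subword tokens, the standard MLM recipe.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-domain",
                           per_device_train_batch_size=32),
    train_dataset=tweets,
    data_collator=collator,
)
trainer.train()  # continue pre-training from the released XLM-R weights
```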
In this paper, our focus is to evaluate the effectiveness of these pre-trained LMs for the identification of offensive language in text written in Spanish and Mexican Spanish. We examine the effect that intermediate pre-training on sentiment analysis, à la [10, 12], has on offensive language identification.

2 Methodology

Task. We address the non-contextual classification of offensive language in comments written in Spanish (Subtask 1) and tweets written in Mexican Spanish (Subtask 3). For Subtask 1, the task is to classify comments written in Spanish, using only the textual content, into one of four categories: offensive where the target is a person (OFP); offensive where the target is a group of people (OFG); non-offensive, but with inadequate language (NOM); and non-offensive (NO). This subtask also assesses the agreement between the confidence of model predictions and the confidence of human annotators. Subtask 3 is a binary classification of tweets written in Mexican Spanish: it requires predicting whether a comment is offensive or not.

Data. The MeOffendEs 2021 shared task [11] provides two corpora, OffendEs and OffendMEX, which are collections of messages from social media platforms in Spanish and Mexican Spanish, annotated with labels indicating offensiveness. The generic Spanish data consist of labeled comments about popular young Spanish influencers, collected from different social media platforms (YouTube, Instagram, and Twitter). The Mexican Spanish dataset was collected from Twitter and manually labeled for offensiveness. In addition, metadata for each comment is provided for classification in the contextual tracks of the competition. We participate only in the non-contextual tracks in both languages. Table 1 gives the details of the Mexican Spanish dataset, and Table 2 those of the generic Spanish dataset.

Table 1. Details of the dataset for Mexican Spanish.

Class    Train   Dev.   Test
0        3679    35     –
1        1381    41     –
Total    5060    76     2183

Table 2. Details of the dataset for generic Spanish.

Class    Train   Dev.   Test
NO       13212   64     –
NOM      1235    10     –
OFP      2051    22     –
OFG      212     4      –
Total    16710   100    13606

We perform minimal pre-processing of the data in our experiments, as it was reported in [1] that extensive pre-processing tends to hurt the performance of pre-trained LMs. We normalize the text by converting user mentions and web links to @USER and URL, respectively. We also replace multiple consecutive whitespace characters with a single one and surround punctuation marks with a single whitespace character on both sides. The text is then tokenized with the subword tokenizer provided with the XLM-RoBERTa model, a SentencePiece model (using a unigram language model) with a vocabulary size of 250K [5]. We set the maximum sequence length to 128 subword tokens.
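The sketch below illustrates these normalization and tokenization steps; the exact regular expressions (in particular the punctuation inventory) are our assumptions for illustration, not released code.

```python
# A minimal sketch of the pre-processing described above; the regular
# expressions, especially the punctuation set, are assumptions.
import re
from transformers import AutoTokenizer

PUNCT = r"([!?¿¡.,;:()\"'])"  # representative punctuation set

def normalize(text: str) -> str:
    text = re.sub(r"@\w+", "@USER", text)        # user mentions -> @USER
    text = re.sub(r"https?://\S+", "URL", text)  # web links -> URL
    text = re.sub(PUNCT, r" \1 ", text)          # pad punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace runs

# SentencePiece (unigram) subword tokenizer with a 250K vocabulary [5].
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoded = tokenizer(normalize("@usuario mira esto https://t.co/xyz!!"),
                    truncation=True, max_length=128)  # 128 subwords max
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```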
Fine-tuning. We use the huggingface transformers library [14] for our experiments. We add a linear prediction layer on top of the pooled output of the last transformer layer and optimize this layer jointly with the pre-trained layers. We optimize the model using Adam (without bias correction) with a batch size of 128 on a single Nvidia V100 GPU (32GB) and a maximum learning rate in the range 1e-5 to 5e-5. We use a warmup ratio of 0.1 and set the maximum number of epochs to 10, with early stopping on the validation performance metric (micro F1) and a patience of 2 evaluation runs. We evaluate the model on the validation set every 20 steps. Furthermore, we employ three regularization techniques: weight decay with a factor of 0.01, dropout applied to the pooled output of the last transformer layer with a probability of 0.2, and label smoothing with a factor of 0.1. Our submissions vary in their use of these regularization approaches; the settings for each submission are detailed in Table 3. The configurations are based on the XLM-twitter¹ and XLM-twitter-sentiment² pre-trained models introduced in [2]. For all experiments, we set the random seed to 42. On average, the fine-tuning procedure takes about 900 seconds (wall time) for the Spanish task and approximately 600 seconds (wall time) for the Mexican Spanish task.

¹ https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base
² https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment

Table 3. Model configuration of submissions made to the non-contextual Mexican Spanish (MX ES) and generic Spanish (ES) tracks. LR is the maximum learning rate and LSF is the label smoothing factor.

                 Language  Model                  LR    Dropout  LSF
Submission-I     MX ES     XLM-twitter-sentiment  1e-5  0.2      0.1
Submission-II    MX ES     XLM-twitter-sentiment  2e-5  0.2      0.0
Submission-III   MX ES     XLM-twitter            2e-5  0.2      0.1
Submission-IV    MX ES     XLM-twitter            2e-5  0.2      0.0
Submission-I     ES        XLM-twitter            4e-5  0.2      0.0
Submission-II    ES        XLM-twitter-sentiment  2e-5  0.2      0.0
Submission-III   ES        XLM-twitter            3e-5  0.2      0.1
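A minimal sketch of this fine-tuning recipe with the transformers Trainer follows, using the MX ES Submission-III hyperparameters as an example. The tokenized datasets (train_ds, dev_ds) are assumed prepared as described above, the classifier_dropout argument reflects our reading of the (XLM-)RoBERTa config, and the library's default AdamW optimizer is used (our runs disable bias correction).

```python
# Sketch of the fine-tuning setup described above; train_ds / dev_ds are
# assumed tokenized datasets with a "labels" column.
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

# Fresh linear prediction layer on top of the pooled output;
# classifier_dropout sets the dropout on that output (0.2 in our runs).
model = AutoModelForSequenceClassification.from_pretrained(
    "cardiffnlp/twitter-xlm-roberta-base", num_labels=2,
    classifier_dropout=0.2)

def micro_f1(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"micro_f1": f1_score(labels, preds, average="micro")}

args = TrainingArguments(
    output_dir="xlmr-offense",
    per_device_train_batch_size=128,
    learning_rate=2e-5,            # maximum LR; 1e-5 to 5e-5 across submissions
    warmup_ratio=0.1,
    num_train_epochs=10,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    evaluation_strategy="steps", eval_steps=20,  # evaluate every 20 steps
    save_strategy="steps", save_steps=20,
    load_best_model_at_end=True, metric_for_best_model="micro_f1",
    seed=42,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=dev_ds,
                  compute_metrics=micro_f1,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
trainer.train()
```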
3 Results

The scores of our submissions on the development and test sets for Subtask 1 (generic Spanish) are shown in Table 4. Three submissions were allowed for this subtask. Submission-II, based on the XLM-RoBERTa model trained on both multilingual Twitter text and the sentiment analysis datasets, achieved our best result of the three. This model also has the lowest mean squared error, an indication of greater agreement with the confidence of the human annotators. The overall ranking shows that this system is the best in the competition for non-contextual classification in Spanish.

Figure 1 shows the confusion matrix of the best model (Submission-II) on the validation dataset for Subtask 1. Most of the mistakes on the validation data occur where the model predicts non-offensive (NO) when the comment is actually offensive to a person (OFP). The model also performs poorly on the offensive-to-a-group category (OFG), making the correct prediction on only 1 out of 4 examples in the validation set.

Table 5 presents the results received by the NLP-CIC team on the leaderboard for the unseen test set of Subtask 3 (Mexican Spanish). The maximum number of submissions for this task was five. The XLM-RoBERTa model further trained on Twitter data and a collection of sentiment analysis datasets in eight languages (Submission-I), fine-tuned on the Subtask 3 dataset, has the highest score among our four submissions. The results also show that label smoothing was beneficial for this task. In the overall ranking for the competition, this system is in second place.

In Figure 2, the confusion matrix provides an overview of the best model's performance (Submission-I) across the two classes. The model predicts the non-offensive label (NO) on 8 examples whose true label is offensive (OFF), compared to the converse, where it predicts the offensive label on 1 example whose true label is non-offensive. The model is thus relatively better at identifying the non-offensive category on the validation dataset. This can be linked to the number of training examples for the non-offensive category, which is about three times that of the offensive category.

Table 4. Performance scores on the development and test sets for submissions to the generic Spanish non-contextual track. MSE is the mean squared error of the model prediction confidence on the test set against the confidence of the annotators. Best scores are in bold and second-best scores are underlined.

                 Dev.       Micro-averaged Test Scores
                 Micro F1   Precision  Recall   F1       MSE
Submission-I     0.8700     0.8430     0.8430   0.8430   0.0330
Submission-II    0.8500     0.8816     0.8816   0.8816   0.0231
Submission-III   0.8600     0.8493     0.8493   0.8493   0.0313

Table 5. Performance scores on the development and test sets for the Mexican Spanish non-contextual track. The highest scores are in bold and the second highest are underlined.

                 Dev.       Macro-averaged Test Scores
                 Micro F1   Precision  Recall   F1
Submission-I     0.8816     0.7550     0.6407   0.6932
Submission-II    0.8553     0.8183     0.5756   0.6758
Submission-III   0.8421     0.7800     0.5872   0.6700
Submission-IV    0.8816     0.7867     0.5834   0.6700

Fig. 1. Confusion matrix of the best model (Submission-II for ES) on the validation dataset for the generic Spanish task. NO: non-offensive; NOM: non-offensive, but with inadequate language; OFP: offensive where the target is a person; OFG: offensive where the target is a group of people.

Fig. 2. Confusion matrix of the best model (Submission-I for MX ES) on the validation dataset for the Mexican Spanish task. NO: non-offensive; OFF: offensive.

We observe that, overall, the scores on the Spanish dataset are far higher than those on the Mexican Spanish dataset, even though the latter is a binary classification task. We believe the larger amount of data available for the Spanish task is a factor in this difference in performance. The consistent performance across both tasks of the model that includes sentiment analysis as part of its pre-training supports our hypothesis that sentiment analysis can be beneficial for detecting offensiveness.
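For reference, the quantities reported above can be computed as in the following sketch; the variable names (y_true, y_pred, model_conf, annotator_conf) are placeholders for the gold labels, predicted labels, and the two confidence vectors.

```python
# Sketch of the reported metrics; all input arrays are placeholders.
from sklearn.metrics import confusion_matrix, f1_score, mean_squared_error

micro_f1 = f1_score(y_true, y_pred, average="micro")
cm = confusion_matrix(y_true, y_pred, labels=["NO", "NOM", "OFP", "OFG"])
# Lower MSE = closer agreement with the human annotators' confidence.
mse = mean_squared_error(annotator_conf, model_conf)
```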
4 Conclusion

We addressed the task of offensive language identification in Spanish and Mexican Spanish using a pre-trained language model adapted to the Twitter domain. We found that further training on multilingual sentiment analysis is beneficial to the task. In addition, label smoothing proved useful on the Mexican Spanish dataset. The best systems submitted by our team, NLP-CIC, achieved first place on the non-contextual Spanish task and second place on the non-contextual Mexican Spanish task. In future work, we would like to examine whether a model trained on Spanish data can be seamlessly transferred to Mexican Spanish for this task, and vice versa. Our models use only textual content; it is very likely that the addition of metadata can improve their performance.

Acknowledgements

Thanks to the competition organizers for their support. The authors thank CONACYT for the computer resources provided through the INAOE Supercomputing Laboratory's Deep Learning Platform for Language Technologies.

References

1. Aroyehun, S.T., Gelbukh, A.: NLP-CIC at HASOC 2020: Multilingual Offensive Language Detection using All-in-one Model. In: FIRE (Working Notes). pp. 331–335 (2020), http://ceur-ws.org/Vol-2826/T2-31.pdf

2. Barbieri, F., Anke, L.E., Camacho-Collados, J.: XLM-T: A Multilingual Language Model Toolkit for Twitter. arXiv preprint arXiv:2104.12250 (2021)

3. Barbieri, F., Camacho-Collados, J., Espinosa Anke, L., Neves, L.: TweetEval: Unified benchmark and comparative evaluation for tweet classification. In: Findings of the Association for Computational Linguistics: EMNLP 2020. pp. 1644–1650. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.findings-emnlp.148, https://www.aclweb.org/anthology/2020.findings-emnlp.148

4. Beltagy, I., Lo, K., Cohan, A.: SciBERT: A pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 3615–3620. Association for Computational Linguistics, Hong Kong, China (Nov 2019). https://doi.org/10.18653/v1/D19-1371, https://www.aclweb.org/anthology/D19-1371

5. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., Stoyanov, V.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 8440–8451. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.747, https://www.aclweb.org/anthology/2020.acl-main.747

6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423, https://www.aclweb.org/anthology/N19-1423

7. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don't stop pretraining: Adapt language models to domains and tasks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 8342–8360. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.740, https://www.aclweb.org/anthology/2020.acl-main.740

8. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez-Carmona, M.Á., Álvarez Mellado, E., Carrillo-de Albornoz, J., Chiruzzo, L., Freitas, L., Gómez Adorno, H., Gutiérrez, Y., Jiménez-Zafra, S.M., Lima, S., Plaza-de Arco, F.M., Taulé, M. (eds.): Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) (2021)

9. Nguyen, D.Q., Vu, T., Tuan Nguyen, A.: BERTweet: A pre-trained language model for English tweets. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 9–14. Association for Computational Linguistics, Online (Oct 2020). https://doi.org/10.18653/v1/2020.emnlp-demos.2, https://www.aclweb.org/anthology/2020.emnlp-demos.2

10. Phang, J., Calixto, I., Htut, P.M., Pruksachatkun, Y., Liu, H., Vania, C., Kann, K., Bowman, S.R.: English intermediate-task training improves zero-shot cross-lingual transfer too. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. pp. 557–575. Association for Computational Linguistics, Suzhou, China (Dec 2020), https://www.aclweb.org/anthology/2020.aacl-main.56
11. Plaza-del-Arco, F.M., Casavantes, M., Jair Escalante, H., Martín-Valdivia, M.T., Montejo-Ráez, A., Montes-y-Gómez, M., Jarquín-Vásquez, H., Villaseñor-Pineda, L.: Overview of the MeOffendEs task on offensive text detection at IberLEF 2021. Procesamiento del Lenguaje Natural 67(0) (2021)

12. Pruksachatkun, Y., Phang, J., Liu, H., Htut, P.M., Zhang, X., Pang, R.Y., Vania, C., Kann, K., Bowman, S.R.: Intermediate-task transfer learning with pretrained language models: When and why does it work? In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 5231–5247. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.467, https://www.aclweb.org/anthology/2020.acl-main.467

13. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

14. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A.: Transformers: State-of-the-Art Natural Language Processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. Association for Computational Linguistics, Online (Oct 2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6, https://www.aclweb.org/anthology/2020.emnlp-demos.6

15. Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., Raffel, C.: mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934 (2020)