YNU_OXZ at HASOC 2020: Multilingual Hate Speech and Offensive Content Identification based on XLM-RoBERTa

Xiaozhi Ou, Hongling Li*
School of Information Science and Engineering, Yunnan University, Kunming, 650500, Yunnan, P.R. China

Abstract
This article describes our submission to subtask A of the HASOC 2020 shared task in three languages (English, German, and Hindi), which targets the identification of hate speech and offensive language in multiple languages. To solve this task, we propose a system based on the multilingual model XLM-RoBERTa and the Ordered Neurons LSTM (ON-LSTM). When evaluated on the official test set, our system shows the effectiveness of this approach on subtask A in all three languages: the macro average F1 score is 0.5006 for English, 0.5177 for German, and 0.5200 for Hindi. The final leaderboard is computed on approximately 15% of the private test data.

Keywords
multilingual, hate speech, offensive language, identification, English, German, Hindi

FIRE '20, Forum for Information Retrieval Evaluation, December 16–20, 2020, Hyderabad, India
xiaozhiou88@gmail.com (X. Ou); honglingli66@126.com (H. Li*)
ORCID: 0000-0001-6043-2348 (X. Ou)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction
The existence and impact of hate speech and offensive language on social media platforms are becoming a major concern in modern society. Given the enormous amount of content created every day, automated methods are needed to detect and handle such content. So far, most studies have focused on solving the problem for English, while the problem is multilingual. Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC, https://hasocfire.github.io/hasoc/2020/) was inspired by two evaluation forums, OffensEval (https://competitions.codalab.org/competitions/20011) and GermEval 2018 (https://projects.fzai.h-da.de/iggsa/), and tries to leverage the synergies of both. HASOC provides a forum and a data challenge for multilingual research on the identification of problematic content [1]. This year, the organizers once again provided two subtasks for English, German, and Hindi, with altogether more than 10,000 annotated tweets from Twitter.

• Subtask A: Identifying hate, offensive, and profane content.
• Subtask B: Discrimination between hate, profane, and offensive posts.

We participate in subtask A for all three languages: this task focuses on hate speech and offensive language identification in English, German, and Hindi. Subtask A is a coarse-grained binary classification task in which participating systems are required to classify tweets into two classes: Hate and Offensive (HOF) and Non-Hate and Offensive (NOT). HOF refers to posts that contain hate, offensive, or profane content; NOT refers to posts that do not contain such content. To solve this task effectively and achieve better results in low-resource languages, we focus on an effective strategy of combining the multilingual model XLM-RoBERTa with the Ordered Neurons LSTM (ON-LSTM). First, we build on the pre-trained multilingual model XLM-RoBERTa, which not only inherits the training method of XLM but also draws on the ideas of RoBERTa.
Then, we use the ON-LSTM, which obtains hierarchical structure information by ordering its neurons and can therefore express richer semantic information. In the following, we present related work, the details of our approach, our results, and our conclusion.

2. Related Work
From an NLP perspective, the topics of hate speech and offensive language, all their possible facets and related phenomena (such as profane/abusive language), and their identification have attracted great attention. This is shown by the proliferation, especially in the last few years, of contributions on this matter ([2], [3], [4], [5], to name a few), corpora and lexica (e.g. [6], [7], [8]), dedicated workshops, and shared tasks within national (GermEval, https://projects.fzai.h-da.de/iggsa/germeval/; EVALITA, http://www.evalita.it/2020/tasks; IberLEF, http://hitz.eus/sepln2019/?q=node/21) and international (SemEval, http://alt.qcri.org/semeval2020/index.php?id=tasks) evaluation campaigns (see in particular [9]). In the literature on offensive and hate language detection, many different subtasks have been considered, ranging from general offensive language detection to more refined tasks such as hate speech detection [10] and cyberbullying detection [11]. Chen et al. applied NLP concepts to develop lexical and syntactic sentence features for offensive language detection [12]. Huang et al. integrated textual features with social network features, which significantly improved cyberbullying detection [13]. Unfortunately, other supervised approaches to hate speech classification have conflated hate speech with offensive language, which makes it difficult to determine to what extent they actually recognize hate speech [14]. Neural language models show promise in this task, but existing work has used training data with a similarly broad definition of hate speech [15]. Non-linguistic features such as the gender or ethnicity of the author can help improve hate speech classification, but this information is often unavailable or unreliable on social media [16]. Recently, Zampieri et al. provided an offensive language identification dataset, which aims to identify the type and target of offensive posts in social media [17]. This year they expanded the dataset to a multilingual version, thus promoting multilingual research in this field [9]. Pre-trained language models such as BERT [18] and ELMo [19] have achieved great performance on a variety of tasks, and many recent papers have applied basic methods of fine-tuning such pre-trained models to certain domains [20] or downstream tasks [21].

3. Approach
In this section, we describe the data, the system, and the experimental parameters we use.

3.1. Data description
The HASOC organizers provided complete training datasets for the three languages, with approximately 3,708 tweets in English, 2,373 in German, and 2,963 in Hindi. In our experiments, we first tried a multi-task learning approach that combines the training datasets of the three languages and trains the model to share representations between the related tasks, in order to solve subtask A for all three languages at once, but it did not achieve the expected effect. We therefore instead merge the HASOC 2019 dataset into the dataset of each language separately: for example, the final English training set is the combination of the HASOC 2019 English training dataset and the HASOC 2020 English training dataset (and likewise for the other languages). A minimal sketch of this step appears below.
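The following sketch illustrates this merging step together with the 5-fold stratified split described next. It is a minimal illustration rather than our exact pipeline: the file names and the text/task_1 column layout are assumptions for the sake of the example.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Merge the 2019 and 2020 training sets for one language
# (hypothetical file names and column layout).
train_2019 = pd.read_csv("hasoc_2019_en_train.tsv", sep="\t")
train_2020 = pd.read_csv("hasoc_2020_en_train.tsv", sep="\t")
train = pd.concat(
    [train_2019[["text", "task_1"]], train_2020[["text", "task_1"]]],
    ignore_index=True,
)

# 5-fold stratified sampling: each fold keeps the HOF/NOT label ratio
# of the combined training set, unlike plain KFold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(skf.split(train["text"], train["task_1"])):
    print(f"fold {fold}: {len(tr_idx)} train / {len(va_idx)} validation")
```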
In our experiments, we use stratified sampling (scikit-learn's StratifiedKFold) to randomly split each combined training dataset. As shown in Figure 1, we use StratifiedKFold cross-validation instead of ordinary k-fold cross-validation to evaluate the classifier. The reason is that StratifiedKFold divides the data by stratified sampling, which ensures that the proportion of each class in the generated training and validation sets is consistent with the original training set, so that no skew in the label distribution is introduced. In the experiment, we use 5-fold stratified sampling.

[Figure 1: 5-fold stratified sampling of the training set. The colors represent the label classes.]

3.2. System description
XLM-RoBERTa is pre-trained on 100 languages, using more than 2 TB of pre-processed CommonCrawl data to learn cross-lingual representations in a self-supervised manner. XLM-RoBERTa [22] shows that large-scale multilingual pre-trained models can significantly improve performance on cross-lingual transfer tasks. To solve subtask A for the three languages at the same time, we propose a system architecture based on the multilingual model XLM-RoBERTa, as shown in Figure 2. First, we take the pooler output (P_O) of XLM-RoBERTa, which is obtained from the last-layer hidden state of the first token of the sequence (the CLS token), further processed by a linear layer and a tanh activation function. Then we extract the hidden states of the last four layers of XLM-RoBERTa and feed them into an Ordered Neurons LSTM (ON-LSTM) [23]. Finally, we concatenate P_O with the output of the ON-LSTM and feed the result into the classifier for the final classification.

[Figure 2: System overall architecture diagram.]

Table 1: Test results of the three runs for subtask A in each language

  Language   System                  Macro average F1
  English    Run_1                   0.81
             Run_2                   0.88
             Run_3 (final system)    0.92
  German     Run_1                   0.66
             Run_2                   0.72
             Run_3 (final system)    0.77
  Hindi      Run_1                   0.64
             Run_2                   0.68
             Run_3 (final system)    0.73

Table 2: Results on the official private test set (the final leaderboard is computed on approximately 15% of the private test data)

  Task                Our score (macro average F1)   Best score   Rank
  English subtask A   0.5006                         0.5152       12
  German subtask A    0.5177                         0.5235       4
  Hindi subtask A     0.5200                         0.5337       6

3.3. Experimental parameters
In our experiments, we did not clean the data. We use the XLM-RoBERTa-base pre-trained model (https://huggingface.co/xlm-roberta-base). The batch size is set to 32 and the maximum sequence length to 150. We extract the hidden states of the last four layers of XLM-RoBERTa by setting output_hidden_states to true. For the ON-LSTM, we set the number of hidden units to 512 and the number of levels to 1. We use binary cross-entropy loss and the Adam optimizer with a learning rate of 5e-5. The model is trained for 10 epochs.

4. Results
This section presents the results for subtask A in all three languages, both on our test set and on the official 15% private test set. Subtask A is evaluated with the macro average F1 score of scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). The test set results for subtask A in all three languages are shown in Table 1. For each language, we performed three runs. Run_1 takes the P_O of XLM-RoBERTa as the final output. Run_2 extracts the last four hidden layers of XLM-RoBERTa and feeds them into a convolutional neural network (CNN) with k-max pooling. Run_3 extracts the last four hidden layers of XLM-RoBERTa and feeds them into the ON-LSTM; this is the system we finally submitted, and a minimal sketch of it follows.
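The sketch below shows one way the Run_3 architecture could be assembled in PyTorch. It is a hedged illustration, not our exact implementation: a plain nn.LSTM stands in for the ON-LSTM of Shen et al. [23] (which is not part of torch), the last four hidden layers are combined by simple summation since the combination is not spelled out above, and the class and variable names are hypothetical.

```python
import torch
import torch.nn as nn
from transformers import XLMRobertaModel

class XlmrOnLstmClassifier(nn.Module):  # hypothetical name
    def __init__(self, hidden_units: int = 512):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained(
            "xlm-roberta-base", output_hidden_states=True
        )
        dim = self.encoder.config.hidden_size  # 768 for the base model
        # Assumption: nn.LSTM as a stand-in for the ON-LSTM.
        self.lstm = nn.LSTM(dim, hidden_units, batch_first=True)
        self.classifier = nn.Linear(dim + hidden_units, 1)  # one HOF/NOT logit

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        p_o = out.pooler_output  # tanh(linear(h_CLS)) from the last layer
        # Assumption: sum the hidden states of the last four layers.
        last_four = torch.stack(out.hidden_states[-4:]).sum(dim=0)
        _, (h_n, _) = self.lstm(last_four)        # final recurrent state
        features = torch.cat([p_o, h_n[-1]], dim=-1)  # concatenate P_O + LSTM
        return self.classifier(features).squeeze(-1)
```

Training would then follow Section 3.3: nn.BCEWithLogitsLoss on the single logit, torch.optim.Adam with a learning rate of 5e-5, batch size 32, maximum sequence length 150, and 10 epochs.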
Table 2 reports the official results of our best run for subtask A in the three languages we participated in, together with the best score on the official leaderboard and our rank.

5. Conclusion
In our experiments, we tested the effect of using the external dataset versus not using it. Our conclusion is that training on data from the same language as the test data is a necessary condition for good performance. In addition, adding data from other languages can improve the results.

References
[1] T. Mandl, S. Modha, G. K. Shahi, A. K. Jaiswal, D. Nandini, D. Patel, P. Majumder, J. Schäfer, Overview of the HASOC track at FIRE 2020: Hate speech and offensive content identification in Indo-European languages, in: Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, CEUR, 2020.
[2] A. Bohra, D. Vijay, V. Singh, S. S. Akhtar, M. Shrivastava, A dataset of Hindi-English code-mixed social media text for hate speech detection, in: Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, 2018.
[3] D. Jurgens, E. Chandrasekharan, L. Hemphill, A just and comprehensive strategy for using NLP to address online abuse (2019).
[4] T. Caselli, V. Basile, J. Mitrović, I. Kartoziya, M. Granitzer, I feel offended, don't be abusive! Implicit/explicit messages in offensive and abusive language, in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 6193–6202.
[5] P. Fortuna, J. R. da Silva, L. Wanner, S. Nunes, et al., A hierarchically-labeled Portuguese hate speech dataset, in: Proceedings of the Third Workshop on Abusive Language Online, 2019, pp. 94–104.
[6] E. Bassignana, V. Basile, V. Patti, Hurtlex: A multilingual lexicon of words to hurt, in: 5th Italian Conference on Computational Linguistics, CLiC-it 2018, volume 2253, CEUR-WS, 2018, pp. 1–6.
[7] S. Rosenthal, P. Atanasova, G. Karadzhov, M. Zampieri, P. Nakov, A large-scale semi-supervised dataset for offensive language identification, arXiv preprint arXiv:2004.14454 (2020).
[8] M. Sanguinetti, F. Poletto, C. Bosco, V. Patti, M. Stranisci, An Italian Twitter corpus of hate speech against immigrants, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[9] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, Ç. Çöltekin, SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020), arXiv preprint arXiv:2006.07235 (2020).
[10] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, arXiv preprint arXiv:1703.04009 (2017).
[11] C. Van Hee, B. Verhoeven, E. Lefever, G. De Pauw, W. Daelemans, V. Hoste, Guidelines for the fine-grained analysis of cyberbullying, version 1.0, LT3 Technical Report Series (2015).
[12] Y. Chen, Y. Zhou, S. Zhu, H. Xu, Detecting offensive language in social media to protect adolescent online safety, in: 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, IEEE, 2012, pp. 71–80.
[13] Q. Huang, V. K. Singh, P. K. Atrey, Cyber bullying detection using social and textual analysis, in: Proceedings of the 3rd International Workshop on Socially-Aware Multimedia, 2014, pp. 3–6.
[14] P. Burnap, M. L. Williams, Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making, Policy & Internet 7 (2015) 223–242.
[15] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, Hate speech detection with comment embeddings, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 29–30.
[16] Z. Waseem, D. Hovy, Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter, in: Proceedings of the NAACL Student Research Workshop, 2016, pp. 88–93.
[17] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Predicting the type and target of offensive posts in social media, arXiv preprint arXiv:1902.09666 (2019).
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[19] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365 (2018).
[20] N. Azzouza, K. Akli-Astouati, R. Ibrahim, TwitterBERT: Framework for Twitter sentiment analysis based on pre-trained language model representations, in: International Conference of Reliable Information and Communication Technology, Springer, 2019, pp. 428–437.
[21] Z. Liu, G. I. Winata, Z. Lin, P. Xu, P. Fung, Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems, arXiv preprint arXiv:1911.09273 (2019).
[22] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[23] Y. Shen, S. Tan, A. Sordoni, A. Courville, Ordered neurons: Integrating tree structures into recurrent neural networks (2018).