yasuo at HASOC2020: Fine-tune XLM-RoBERTa for Hate Speech Identification

Li Xu, Jun Zeng and Shi Chen
School of Information Science and Engineering, Yunnan University, Kunming, P.R. China

Abstract
In recent years, hate speech and its identification have attracted more concern than ever. This paper describes our system for English and German Sub-Task A in HASOC2020. For these subtasks, we fine-tune the XLM-RoBERTa pre-trained model for sentence embedding and extract the best-performing layer for slicing and splicing. To make full use of both the English and German corpora, we propose a multi-task method that optimizes the two classification tasks at the same time. Our model achieves an F1 score of 0.9076 in English Sub-Task A and 0.8165 in German Sub-Task A.

Keywords
Hate speech, XLM-RoBERTa, fine-tune, slicing and splicing

1. Introduction
With the rapid development of the Internet, social media has made communication between people more convenient. Individuals can publish their own opinions freely on the Internet. But every coin has two sides: free speech often leads to sexism, racism or other aggressive behaviors [1] and to cyberbullying [2, 3]. Over the past decade, as global hatred and bigotry have spread through social media, ethnic minorities around the world have faced new and growing threats. Measures should be taken to reduce the spread of hate speech on social media [4]. However, social media platforms have encountered many difficulties in detecting hate speech because of its close association with other forms of abusive language [5]. The multiplicity of languages and slang adds to the complexity. With the increasing number of tweets posted online, manually monitoring hate speech is not a viable solution. HASOC 2020 provides a forum and a competition for multilingual research on the identification of hate speech, addressing the lack of scalability of human monitoring by automatically identifying hate speech content [6].
In this paper, we concentrate on detecting multilingual hate speech written in English and German and detail our solution. We fine-tune the XLM-RoBERTa pre-trained model for sentence embedding and extract the best-performing layer for slicing and splicing. To make full use of both the English and German corpora, we further propose a multi-task method that optimizes the two classification tasks at the same time.

FIRE '20, Forum for Information Retrieval Evaluation, December 16–20, 2020, Hyderabad, India.
Email: x619496775@gmail.com (L. Xu)
URL: https://619496775.github.io/ (L. Xu)
ORCID: 0000-0001-5130-1645 (L. Xu); 0000-0002-7599-9176 (J. Zeng); 0000-0002-8067-9142 (S. Chen)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Related Works
Detecting abusive language in the sea of data on social media is difficult and arduous work, and research on it has only been conducted in recent years. Some research shows that deep learning models with word embeddings can achieve better results in text classification tasks. As a result, Word2Vec is commonly used to obtain semantic information and attributes from the text through an unsupervised word embedding method. Common machine learning algorithms include Logistic Regression (LR), Support Vector Machine (SVM), Random Forest, etc. Most deep learning methods are based on the Long Short-Term Memory network (LSTM), the convolutional neural network (CNN) or the recurrent neural network (RNN). Some of the early work uses features such as bag of words and word and character n-grams with relatively simple machine learning classifiers for detection (Dinakar et al. [7]; Waseem and Hovy [8]; Nobata et al. [5]). Kim [9] uses CNNs for sentiment classification; the approach requires very little hyperparameter tuning and only static vectors to achieve good results on multiple benchmarks. MacAvaney et al. [10] propose a multi-view SVM approach that achieves near state-of-the-art performance while being simpler and producing more easily interpretable decisions than neural methods.

More recently, BERT, released by Google, has gained more attention in the research community, as it can capture both long-distance dependencies and true bidirectional context information, unlike traditional deep learning approaches. BERT [11] and similar models have achieved good results in monolingual NLP tasks, and for languages other than English researchers have trained more and more monolingual models. At the same time, there is an alternative approach that has received little attention: the multilingual model. XLM [12] is a cross-lingual pre-training model that extends the masked language model (MLM) training strategy proposed in BERT to multiple languages, and it has been shown experimentally to be effective. RoBERTa [13] is an upgraded version of BERT. Compared to BERT, it uses a larger number of model parameters, a larger batch size and more training data. It builds on BERT's language masking strategy and modifies key hyperparameters, including removing BERT's next sentence prediction task, which lets RoBERTa's representations generalize better to downstream tasks than BERT's. XLM-RoBERTa combines the advantages of both. We fine-tune XLM-RoBERTa and use multi-task training to improve prediction performance. This enables our model to achieve good results in classifying hateful text.

3. Dataset and Task description
The HASOC Sub-Task A dataset contains over 10,000 annotated tweets from Twitter, each composed of a user ID, the tweet text, and a label.
Our analysis of the dataset shows that the English data has a balanced class distribution, while the German data is imbalanced. We took part in English and German Sub-Task A, which involves building a coarse-grained binary classification model to decide whether a text is offensive or insulting (HOF). If a text contains any form of unacceptable language, aggression or profanity, it is considered HOF.

Table 1
Statistics of the training, validation and test data
Language  Type   Not Hate/Offensive/Profane  Hate/Offensive/Profane  Total
English   Train  3040                        2968                    6008
English   Valid  760                         742                     1502
English   Test   457                         423                     880
German    Train  3164                        1206                    4370
German    Valid  791                         302                     1093
German    Test   396                         145                     541

4. Our Solution
Our solution is divided into data preprocessing, feature extraction and model structure.

4.1. Data Preprocessing
Because the tweets in the official dataset contain noise that would affect model training, we cleaned the data before training. The steps are as follows:
• Keep the label
• Replace user names and @-mentions uniformly with "username"
• Separate conjunctions
• Turn emojis into the corresponding phrases
• Remove words that have no emotional meaning
• Remove all URLs
• Convert text to lowercase
• Normalize numbers to the string "number"

After data preprocessing, we combined the English and German datasets into one dataset. To improve the generalization ability of the model and let it learn the real distribution of the data, we use stratified sampling so that the training set and the validation set have the same class distribution. We use 80% of the data as the training set and 20% as the validation set. Table 1 shows the distribution of HOF and NOT in the dataset. The test set is provided by the organizers.

4.2. Feature vectors
XLM-RoBERTa embeddings: We utilized XLM-RoBERTa [14] for embedding, which is a multilingual model with the Transformer as its main structure.
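As a rough illustration of the cleaning steps listed in Section 4.1, the sketch below applies a subset of them (URL removal, user-name replacement, emoji-to-phrase conversion, number normalization and lowercasing) using only the Python standard library. The regular expressions, the function name `clean_tweet` and the two-entry `EMOJI_PHRASES` table are assumptions made for illustration; the paper does not specify its actual rules, emoji mapping or stop-word list.

```python
import re

# Hypothetical emoji-to-phrase table; the paper does not list its mapping.
EMOJI_PHRASES = {
    "😂": " face with tears of joy ",
    "😡": " pouting face ",
}

# Assumed patterns, not taken from the paper.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")


def clean_tweet(text: str) -> str:
    """Apply a subset of the Section 4.1 cleaning steps to one tweet."""
    text = URL_RE.sub(" ", text)             # remove all URLs
    text = USER_RE.sub(" username ", text)   # unify @-mentions
    for emoji, phrase in EMOJI_PHRASES.items():
        text = text.replace(emoji, phrase)   # emoji -> phrase
    text = NUM_RE.sub(" number ", text)      # normalize numbers
    text = text.lower()                      # lowercase
    return " ".join(text.split())            # collapse whitespace


print(clean_tweet("@Bob LOL 😂 see https://t.co/abc 123 times"))
```

URLs are removed before numbers are normalized so that digits inside a link are not rewritten first; the remaining steps (separating conjunctions, removing words without emotional meaning) are omitted because the paper gives no detail on them.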
The model has 12 layers and produces 768-dimensional outputs.

Figure 1: The architecture of our model. Input sentences ([CLS], Tok 1, ..., [SEP]) pass through the 12 layers of XLM-RoBERTa; the layer-12 output (32×64×768) is sliced to 32×768 and concatenated with T_c (32×768), then passed to the classifier and Softmax, with separate German (De) and English (En) labels.

4.3. Our Model
After obtaining the embedded vectors of the texts, we fine-tuned XLM-RoBERTa to make it more suitable for the downstream task of hate speech identification. XLM-RoBERTa has a total of 12 layers, which learn different semantic information. Generally speaking, the shallower layers learn more word-level semantic information, while the deeper layers learn more generalized semantic information. Since T_c (the [CLS] vector) contains the semantic information of the entire sentence for classification, we try to combine T_c with the vectors of a certain layer to improve training performance. The influence of each layer is given in Table 2. Global semantic information is more helpful for binary classification such as Sub-Task A. Because the shape of T_c is [32,768] and the shape of the 12th layer's hidden vectors is [32,60,768], we slice the 12th layer along dim=(0,2) and splice the result together with T_c, then pass it to the classifier and send the result to Softmax.

We trained both the English and German tasks by feeding the processed data into the model. Multi-task learning can learn useful information from similar tasks. All tasks share XLM-RoBERTa's layers. To take full advantage of each classification corpus, the two classification tasks are combined for multi-task training. In this way, the amount of training data is also expanded.
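The slicing and splicing step can be pictured with a small shape-level sketch. It assumes that "slicing" means reducing the [batch, seq_len, hidden] output of layer 12 to [batch, hidden] by picking one sequence position; the paper does not state how the sequence axis is reduced, so this is only one possible reading. The helper names `slice_position` and `splice` are hypothetical, and toy sizes stand in for batch 32, sequence length 60 and hidden size 768.

```python
from typing import List

Matrix = List[List[float]]          # shape [batch, hidden]
Tensor3 = List[List[List[float]]]   # shape [batch, seq_len, hidden]


def slice_position(hidden: Tensor3, position: int = 0) -> Matrix:
    """Keep the batch and hidden axes by picking one sequence position.

    Assumption: the paper does not say how the sequence axis is reduced,
    so selecting a single position is only one possible reading.
    """
    return [sample[position] for sample in hidden]


def splice(t_c: Matrix, sliced: Matrix) -> Matrix:
    """Concatenate two [batch, hidden] matrices along the feature axis."""
    return [cls_vec + tok_vec for cls_vec, tok_vec in zip(t_c, sliced)]


# Toy sizes standing in for batch=32, seq_len=60, hidden=768.
batch, seq_len, hidden_size = 2, 3, 4
layer12 = [[[float(b * 100 + s * 10 + h) for h in range(hidden_size)]
            for s in range(seq_len)] for b in range(batch)]
t_c = [[-1.0] * hidden_size for _ in range(batch)]

features = splice(t_c, slice_position(layer12, position=0))
print(len(features), len(features[0]))  # batch and 2 * hidden_size
```

With the sizes quoted in the paper, the spliced features would have shape [32, 1536] before entering the classifier.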
Table 2
Performance of each layer
Layer           Test error rate (%)
Layer-1         11.07
Layer-2         9.81
Layer-3         9.29
Layer-4         8.66
Layer-5         7.83
Layer-6         6.83
Layer-7         6.83
Layer-8         6.41
Layer-9         6.04
Layer-10        5.70
Layer-11        5.46
Layer-12        5.42
First 4 Layers  8.78
Last 4 Layers   5.43
All 12 Layers   6.88

5. Result

5.1. Baselines
LR is a machine learning method for binary (0 or 1) classification problems. We use it as a baseline classifier for both the English and German datasets. The configuration is as follows: we use L2 regularization with the hyperparameter C=1.2 (inverse of regularization strength) and TF-IDF features of word n-grams (1,6) to train the classifier.

SVM is a binary classification model [15]. The basic method is to find the separating hyperplane that correctly divides the training dataset with the maximum geometric margin. We use it as a baseline classifier for both the English and German datasets. The configuration is as follows: we use the 'linear' kernel, L2 regularization with the hyperparameter C=1.0 (inverse of regularization strength), and the same TF-IDF features of word n-grams (1,6) to train the classifier.

The BiLSTM [16] model is implemented with 100 units and sigmoid activation. For training, the binary cross-entropy loss function and the Adam optimizer are used. For regularization, 50% dropout is applied. For the English and German subtasks, we use 300-dimensional English and German fastText embeddings, respectively, to initialize the word vectors.

5.2. Comparison
The accuracy and macro-averaged F1 scores of the XLM-RoBERTa model and the other models on English and German are shown in Table 3 and Table 4. To verify the effectiveness of the fine-tuning strategies, ablation experiments are conducted. We also use the BERT base model and XLM-RoBERTa base as baselines.
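For reference, the two reported metrics can be computed as in the sketch below: accuracy is the fraction of correct predictions, and the macro-averaged F1 is the unweighted mean of the per-class F1 scores over the HOF and NOT classes. This is a minimal standard-library illustration with hypothetical function names and made-up toy labels, not the evaluation script used for the paper or by the HASOC organizers.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 score treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy labels, invented for illustration only.
gold = ["HOF", "HOF", "NOT", "NOT"]
pred = ["HOF", "NOT", "NOT", "NOT"]
print(accuracy(gold, pred), round(macro_f1(gold, pred), 4))
```

Because every class counts equally, the macro average penalizes a model that neglects the minority class, which matters for the imbalanced German data.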
Table 3
Results on the English test set
Approach                Features                Acc (%)  F1 (%)
LR                      Char n-grams (1,6)      63.65    63.25
SVM                     Char n-grams (1,6)      67.83    64.32
BiLSTM                  pre-trained fastText    69.79    67.36
BERT base               -                       88.96    88.52
XLM-RoBERTa base        -                       89.78    89.65
XLM-RoBERTa fine-tuned  -                       90.79    90.76

Table 4
Results on the German test set
Approach                Features                Acc (%)  F1 (%)
LR                      Char n-grams (1,6)      62.80    62.25
SVM                     Char n-grams (1,6)      66.79    63.13
BiLSTM                  pre-trained fastText    73.39    69.16
BERT base               -                       79.49    79.36
XLM-RoBERTa base        -                       80.93    80.61
XLM-RoBERTa fine-tuned  -                       81.67    81.65

As Tables 3 and 4 show, in the two subtasks the BERT and XLM-RoBERTa models perform better than the LR, SVM and BiLSTM models. This may be because deep learning has great advantages over traditional machine learning methods in document classification. For traditional machine learning methods, it is not easy to extract text features. Moreover, these features cannot represent the semantics and syntax of the document well, and a large part of the useful information is lost. Deep learning hands feature extraction over to a deep network, trading higher computational cost for more comprehensive and better text features. As a result, deep learning methods perform better in our hate speech identification tasks. These experiments indicate that our fine-tuned XLM-RoBERTa model is effective for both the German and English tasks.

Our system ranked 16th in the German subtask and 34th in the English subtask. On the official private dataset, the macro-averaged F1 score was 0.4968 for the German subtask (top team: 0.5235) and 0.4856 for the English subtask (top team: 0.5152).

6. Conclusions
We have proposed a neural solution that fine-tunes XLM-RoBERTa for hate speech identification. In particular, the output and hidden layers are sliced and spliced, which, trained in a multi-task way, alleviates data sparsity and increases the generalization ability of the model. Experiments have shown the competitiveness of our method.

Acknowledgments
This work is supported by the National Natural Science Foundation of China (61962061), and partially supported by the Yunnan Provincial Foundation for Leaders of Disciplines in Science and Technology, Top Young Talents of the "Ten Thousand Plan" in Yunnan Province, and the Program for Excellent Young Talents of Yunnan University.

References
[1] R. Kumar, A. K. Ojha, S. Malmasi, M. Zampieri, Benchmarking aggression identification in social media, in: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), 2018, pp. 1–11.
[2] C. Chelmis, D.-S. Zois, M. Yao, Mining patterns of cyberbullying on twitter, in: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, 2017, pp. 126–133.
[3] M. Yao, C. Chelmis, D.-S. Zois, Cyberbullying detection on instagram with optimal online feature selection, in: 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, 2018, pp. 401–408.
[4] J. Waldron, The harm in hate speech, Harvard University Press, 2012.
[5] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, Y. Chang, Abusive language detection in online user content, in: Proceedings of the 25th international conference on world wide web, 2016, pp. 145–153.
[6] T. Mandl, S. Modha, G. K. Shahi, A. K. Jaiswal, D. Nandini, D. Patel, P. Majumder, J. Schäfer, Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages, in: Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, CEUR, 2020.
[7] K. Dinakar, R. Reichart, H. Lieberman, Modeling the detection of textual cyberbullying, in: Proceedings of the Social Mobile Web, Citeseer, 2011.
[8] Z. Waseem, D.
Hovy, Hateful symbols or hateful people? Predictive features for hate speech detection on twitter, in: Proceedings of the NAACL student research workshop, 2016, pp. 88–93.
[9] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882 (2014).
[10] M. Sap, D. Card, S. Gabriel, Y. Choi, N. A. Smith, The risk of racial bias in hate speech detection, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1668–1678.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[12] G. Lample, A. Conneau, Cross-lingual language model pretraining, arXiv preprint arXiv:1901.07291 (2019).
[13] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: A lite bert for self-supervised learning of language representations, arXiv preprint arXiv:1909.11942 (2019).
[14] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[15] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, Hate speech detection with comment embeddings, in: Proceedings of the 24th international conference on world wide web, 2015, pp. 29–30.
[16] A. Baruah, F. Barbhuiya, K. Dey, Abaruah at semeval-2019 task 5: Bi-directional lstm for hate speech detection, in: Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp. 371–376.

A. Online Resources
• GitHub