<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Detect Hate and Offensive Content in English and Indo-Aryan Languages based on Transformer</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Yongyi</forename><surname>Kui</surname></persName>
							<email>3964438@qq.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Information Institute of Yunnan University</orgName>
								<address>
									<postCode>650504</postCode>
									<settlement>Yunnan</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Forum for Information Retrieval Evaluation</orgName>
								<address>
									<addrLine>December 13-17</addrLine>
									<postCode>2021</postCode>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Detect Hate and Offensive Content in English and Indo-Aryan Languages based on Transformer</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">DEC8D0DA56DAAAF769512E340A90EF28</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T01:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Text Classification</term>
					<term>Hate and Offensive Content Analysis</term>
					<term>pre-trained model</term>
					<term>Transformers</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes my submission to Subtask 1A and Subtask 1B of the HASOC 2021 Hate Speech and Offensive Content Identification Challenge. In the experiments, I applied different pre-trained models and common neural network models to these tasks and integrated them. According to the official evaluation results, the proposed solution ranked fourteenth and fifteenth on English Subtask A and English Subtask B, sixth and fifth on Hindi Subtask A and Hindi Subtask B, respectively, and eleventh on Marathi Subtask A. The source code for the evaluated models in this paper is shared openly.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent years, offensive language on social media platforms has surged. Because the Internet affords a degree of anonymity, people are more likely to publish hate speech <ref type="bibr" target="#b0">[1]</ref> on online platforms than in real life.</p><p>Hate speech challenges social civilization and harmony; likewise, insulting and offensive speech radicalizes communication. It is therefore necessary to find appropriate ways to recognize such content automatically and improve the public-opinion environment of social media. Humans are sensitive to hate speech and offensive content, so people can identify such speech easily. A computer, however, can only detect whether a text is hateful or offensive after learning via unsupervised, self-supervised, or supervised methods trained on large amounts of data.</p><p>The HASOC 2019 <ref type="bibr" target="#b1">[2]</ref> and HASOC 2020 <ref type="bibr" target="#b2">[3]</ref> challenges included a task of identifying hate speech and offensive content in English and Hindi. In addition, SemEval 2019 <ref type="bibr" target="#b3">[4]</ref> included a task of identifying whether English tweets were offensive or not, which participants addressed with convolutional neural networks and the BERT model. SemEval proposed the OffensEval 2020 task <ref type="bibr" target="#b4">[5]</ref> in 2020 to identify offensive content in multiple languages, including English. To identify offensive content, Risch et al. <ref type="bibr" target="#b5">[6]</ref> used BERT models with distinct random seeds, while Subhanshu et al. <ref type="bibr" target="#b6">[7]</ref> fine-tuned BERT-based network models.
Many existing text-content recognition systems are based on Transformer <ref type="bibr" target="#b7">[8]</ref> models.</p><p>The rest of the paper is structured as follows: the second section gives an overview of the tasks and datasets of this challenge; the third section describes the models used in this challenge; the fourth section describes the experimental process of Subtask A and Subtask B; the fifth section lists the official evaluation results of these two tasks. The last section summarizes the evaluation results and the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Task and Data Description</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Subtasks</head><p>HASOC 2021 <ref type="bibr" target="#b8">[9]</ref> Subtask 1 includes two subtasks, Subtask A and Subtask B, whose main purpose is to detect hate speech and offensive content in text. The two subtasks are defined as follows:</p><p>Subtask A: Tweets are classified according to whether they contain hate, offensive, or profane content (HOF) or not (NOT). It is therefore a binary classification problem.</p><p>Subtask B: Tweets in the English and Hindi corpora predicted to be hate speech or offensive speech are further identified into three categories: Hate speech, Offensive, and Profane. Subtask B is thus a multiclass text classification task.</p><p>The evaluation metrics for the prediction results of Subtask A and Subtask B are Macro F1 and Macro Precision.</p></div>
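The Macro F1 metric above averages per-class F1 scores with equal weight per class, which matters for imbalanced labels like these. A minimal pure-Python sketch of the computation (illustrative only, not the official evaluation script):

```python
def macro_f1(y_true, y_pred):
    """Macro F1: compute F1 per class, then average with equal class weight."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Because each class contributes equally regardless of its frequency, a model that ignores a rare class is penalized more heavily than under plain accuracy.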
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Dataset</head><p>The data for this challenge comes from comments on the Twitter platform; the corpus comprises Marathi data <ref type="bibr" target="#b9">[10]</ref> as well as English and Hindi data. Subtask A and Subtask B of the HASOC 2021 Hate Speech and Offensive Content Identification Challenge <ref type="bibr" target="#b10">[11]</ref> provide training and testing datasets.</p><p>To obtain more data to train the model better, improve its generalization, and reduce the risk of overfitting, I collected the English and Hindi corpora from the HASOC 2019 and HASOC 2020 challenges. After integrating the collected data, the amount of training data for the English and Hindi corpora is 12,035 and 10,215 examples, respectively. Next, I used the shuffle function in the Sklearn package to shuffle the order of the data and finally divided the integrated data into a training dataset and a validation dataset at a 4:1 ratio. Table <ref type="table">1</ref> lists the data volume of the three languages in Subtask A and Subtask B.</p></div>
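The shuffle-and-split step can be sketched as follows. The paper uses Sklearn's shuffle function; this standard-library version (with an illustrative fixed seed for reproducibility) mirrors the same 4:1 division:

```python
import random

def shuffle_and_split(examples, train_ratio=0.8, seed=42):
    """Shuffle the merged examples, then split into train/validation at 4:1."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    data = list(examples)
    rng.shuffle(data)
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]
```

Shuffling before splitting matters here because the merged HASOC 2019/2020/2021 data would otherwise be ordered by source, biasing the validation set.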
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Data pre-process</head><p>The datasets of Subtask A and Subtask B are sampled from Twitter, and the format of the data is informal. Tweets can be long or short, the texts contain many emojis and URL links, and some words are misspelled.</p><p>A pre-processing step is applied to the data so that the model can better extract the information carried by the text, which helps improve the accuracy of the classifier. In this challenge, the methods used include deleting URL links, emoticons, and punctuation marks from the text.</p></div>
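A minimal sketch of such a cleaning step, assuming simple regular expressions for URLs and a rough, deliberately incomplete set of emoji codepoint ranges (the paper does not specify its exact patterns):

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Illustrative emoji/pictograph ranges; real tweets may need a fuller list.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)

def clean_tweet(text):
    """Drop URLs, emojis, and punctuation; collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())
```

For example, `clean_tweet("Check https://t.co/abc now!!")` yields `"Check now"`.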
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">System Description</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Pre-trained Model</head><p>In this hate speech and offensive content detection challenge, I tried six pre-trained models: BERT, ALBERT, multilingual BERT (mBERT), DeBERTa, XLNet, and SqueezeBERT.</p><p>Here is a brief introduction to each pre-trained model.</p><p>The BERT <ref type="bibr" target="#b11">[12]</ref> model is a deep bidirectional model built on the Transformer encoder structure. Its training is divided into a pre-training stage and a fine-tuning stage, and its pre-training tasks are Masked Language Modeling and Next Sentence Prediction.</p><p>ALBERT <ref type="bibr" target="#b12">[13]</ref> uses word-embedding parameter factorization and hidden-layer parameter sharing to reduce the number of model parameters, and replaces the Next Sentence Prediction loss with a Sentence Order Prediction loss. Compared with the BERT model, it therefore has significantly fewer parameters with only a tiny loss in performance.</p><p>mBERT is a cross-lingual <ref type="bibr" target="#b13">[14]</ref> model. It is pre-trained on data in 104 languages, including English, Hindi, and Marathi; texts in different languages share some common word pieces or vocabulary (such as numbers and links).</p><p>The DeBERTa <ref type="bibr" target="#b14">[15]</ref> model enhances BERT in two ways. The first is disentangled attention: each word is encoded with two vectors, for content and position respectively, and the attention weights between words are computed from matrices of their contents and relative positions. The second technique is introducing absolute positions in the decoding layer to predict masked tokens.</p><p>Compared with BERT's masked-language-model objective, XLNet <ref type="bibr" target="#b15">[16]</ref> introduces a new pre-training objective, the Permutation Language Model. In addition, XLNet adopts the Transformer-XL mechanism, so it has an advantage over the BERT model on tasks with long input texts.</p><p>The SqueezeBERT <ref type="bibr" target="#b16">[17]</ref> model applies experience from the computer-vision field to natural language processing tasks: it replaces several operations in the self-attention layers with grouped convolutions. This model achieves high accuracy on the GLUE benchmark.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Common neural network</head><p>This part gives a brief overview of several common neural networks used in the paper.</p><p>A Recurrent Neural Network (RNN) <ref type="bibr" target="#b17">[18]</ref> is a neural network for processing sequence data. The RNN model has a memory function, so it can remember important words in the text.</p><p>Long Short-Term Memory (LSTM) <ref type="bibr" target="#b18">[19]</ref> has the same memory function as an RNN, but it uses a gate mechanism, which mitigates the vanishing-gradient problem to a certain extent. In this paper, the Bidirectional LSTM (BiLSTM) model is used, which can extract contextual information from the text in both directions.</p><p>TextCNN <ref type="bibr" target="#b19">[20]</ref> is a text classification model based on convolutional networks. It passes word vectors through convolution and pooling operations and finally sends the output to a softmax function for classification. The structure of the TextCNN model is relatively simple, it has few parameters, and good results can be achieved by introducing pre-trained word vectors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Integrated</head><p>In the integration process, I did not freeze the weights of the pre-trained model; instead, the pre-trained model was trained jointly with a smaller-scale model (RNN, BiLSTM, or TextCNN).</p><p>Because a pre-trained model is trained on a large amount of data, it produces high-quality word and sentence embedding vectors. Therefore, I add a model such as an RNN, BiLSTM, or TextCNN after the output layer of the pre-trained model to further extract high-dimensional features. The output of this relatively small-scale neural network is then sent to a fully connected layer for classification. The subsequent experimental results show that the accuracy of this method is similar to that of the pre-trained model alone, but the integration strategy increases the Macro F1 value in the official evaluation score by 0.3% to 1%. Many downstream tasks are in fact approached in this way.</p></div>
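The integration described above can be sketched in PyTorch. The class and parameter names below are illustrative, and a toy embedding layer stands in for the pre-trained encoder so the sketch stays self-contained; a real run would plug in a Hugging Face DeBERTa or mBERT model and use its hidden states:

```python
import torch
import torch.nn as nn

class EncoderBiLSTMClassifier(nn.Module):
    """Sketch: a (pre-trained) encoder's token representations feed a small
    BiLSTM head, whose pooled output goes to a fully connected classifier.
    `encoder` may be any module returning (batch, seq_len, hidden)."""
    def __init__(self, encoder, hidden_size, num_classes,
                 lstm_hidden=128, dropout=0.4):
        super().__init__()
        self.encoder = encoder            # NOT frozen: fine-tuned jointly
        self.bilstm = nn.LSTM(hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, x):
        hidden = self.encoder(x)               # (batch, seq, hidden)
        out, _ = self.bilstm(hidden)           # (batch, seq, 2*lstm_hidden)
        pooled = out[:, 0, :]                  # first-token representation
        return self.fc(self.dropout(pooled))   # (batch, num_classes)

# Stand-in encoder so the sketch runs without downloading pre-trained weights.
toy_encoder = nn.Embedding(1000, 64)
model = EncoderBiLSTMClassifier(toy_encoder, hidden_size=64, num_classes=2)
logits = model(torch.randint(0, 1000, (8, 96)))  # batch 8, max length 96
```

Setting `num_classes=2` matches the binary Subtask A head; Subtask B would use a wider final layer.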
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Subtask A Parameters Setting</head><p>In Subtask A, the optimizer is AdamW and the loss function is cross-entropy; the epoch, max length, and batch size parameters are set to 10, 96, and 32, respectively; the dropout, learning rate, and weight decay parameters are set to 0.4, 1e-5, and 1e-2, respectively.</p></div>
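Assuming the models are implemented in PyTorch, the configuration above might look like the following; the stand-in linear classifier and random batch are only there to make the snippet self-contained:

```python
import torch
import torch.nn as nn

# Subtask A hyper-parameters from the paper.
EPOCHS, MAX_LENGTH, BATCH_SIZE = 10, 96, 32
LR, WEIGHT_DECAY, DROPOUT = 1e-5, 1e-2, 0.4

# Stand-in classifier; the real model is the fine-tuned transformer network.
model = nn.Sequential(nn.Dropout(DROPOUT), nn.Linear(768, 2))
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=LR, weight_decay=WEIGHT_DECAY)
loss_fn = nn.CrossEntropyLoss()

# One illustrative optimization step on random features and labels.
features = torch.randn(BATCH_SIZE, 768)
labels = torch.randint(0, 2, (BATCH_SIZE,))
loss = loss_fn(model(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```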
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Subtask A</head><p>First, I used the six pre-trained models to conduct a binary text classification experiment under the parameters set above. These pre-trained models are BERT, ALBERT, mBERT, DeBERTa, XLNet, and SqueezeBERT. Table <ref type="table" target="#tab_0">2</ref> lists the performance of each pre-trained model on the validation datasets of the three language corpora.</p><p>The experimental results show that among the six pre-trained models, the DeBERTa model performs best on the English corpus, and the mBERT model achieves the highest accuracy on the binary classification task for the Hindi and Marathi corpora. In the classification experiments of Subtask A, all six models used a learning rate of 1e-5 in the training phase. Next, I trained the DeBERTa+BiLSTM model separately with the common learning rates 1e-6, 5e-6, 1e-5, 3e-5, and 5e-5, keeping the remaining parameters unchanged. Table <ref type="table" target="#tab_1">3</ref> lists the scores of the models trained with these five learning rates on the validation dataset. The results show that the DeBERTa+BiLSTM model obtains the best performance on this challenge's dataset with a 5e-5 learning rate, so in the subsequent experiments of Subtask A and Subtask B, I used the 5e-5 learning rate.</p><p>Next, I integrated 4-layer RNN, 2-layer BiLSTM, and TextCNN models after the pre-trained model with the highest score on each language's validation dataset. In these experiments, the learning rate is set to 5e-5, and the other parameters are unchanged. Finally, the output of the RNN, BiLSTM, or TextCNN model is sent to the fully connected layer for classification. The final models are obtained after DeBERTa and mBERT are integrated with the RNN, BiLSTM, or TextCNN models. 
Tables <ref type="table" target="#tab_2">4</ref> and 5 list the scores of the final models on their respective validation datasets.</p><p>From the experimental results listed in Table <ref type="table" target="#tab_2">4</ref>, it can be seen that the DeBERTa+BiLSTM model performs best on the validation dataset of the English corpus, with both accuracy and Macro F1 improved compared to the DeBERTa model alone. Therefore, I use the predictions of the DeBERTa+BiLSTM model as the final submission for English Subtask A. Similarly, Table <ref type="table" target="#tab_3">5</ref> shows that the mBERT+TextCNN model performs best on the validation datasets of the Hindi and Marathi corpora. So, I use the predictions of the mBERT+TextCNN model on the Hindi and Marathi test datasets as the final answers to Hindi Subtask A and Marathi Subtask A. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Subtask B</head><p>The experimental results of Subtask A show that the DeBERTa+BiLSTM and mBERT+TextCNN models perform best on the validation datasets of the English and Hindi corpora, respectively. Therefore, I also use these two models on the English and Hindi corpora in Subtask B. The difference from Subtask A is that the fully connected layer of Subtask B outputs a batch_size * 4 matrix, while that of Subtask A outputs a batch_size * 2 matrix. Tables 6 and 7 list the scores of the DeBERTa+BiLSTM and mBERT+TextCNN models on the validation datasets of English Subtask B and Hindi Subtask B, respectively. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>On Subtask A and Subtask B, among all teams, the solutions I put forward in the paper are ranked fourteenth and fifteenth on the English corpus, ranked sixth and fifth on the Hindi corpus, and ranked eleventh in the Marathi corpus.   </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this paper, I describe the solution I proposed for the HASOC 2021 challenge, including the pre-processing of the data before training, the selection of the learning rate, and the construction of the final models. This challenge mainly comprises text classification tasks in English, Hindi, and Marathi. I solved the classification problem on the English corpus by integrating the DeBERTa and BiLSTM models, and the classification problem on the Hindi and Marathi corpora by integrating the mBERT and TextCNN models. The difficulties of Subtask A and Subtask B in this challenge are as follows. First, the text content is informal, the texts are generally short, and they lack context, so it is difficult to obtain very high accuracy. Secondly, during the experiments, I found that the class distribution in Subtask A and Subtask B is not uniform, which biases the model toward predicting the more common categories. Finally, the amount of data in this challenge is not sufficient for large models like BERT, which leads to overfitting and makes the model perform poorly on the testing dataset. 
In future research work, I will try to use a variety of fine-tuning strategies <ref type="bibr" target="#b20">[21]</ref>, or the idea of transfer learning <ref type="bibr" target="#b21">[22]</ref> to continue to improve my solution.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>Evaluation results of six pre-trained models on subtask A's Validation datasets.</figDesc><table><row><cell>Model</cell><cell cols="3">Validation Accuracy English Hindi Marathi</cell></row><row><cell>ALBERT</cell><cell>0.8064</cell><cell>0.6754</cell><cell>0.6905</cell></row><row><cell>BERT</cell><cell>0.8092</cell><cell>0.7911</cell><cell>0.7872</cell></row><row><cell>DeBERTa</cell><cell cols="2">0.8212 0.7533</cell><cell>0.7083</cell></row><row><cell>mBERT</cell><cell cols="3">0.8122 0.8034 0.8908</cell></row><row><cell>XLNet</cell><cell>0.8094</cell><cell>0.6825</cell><cell>0.6899</cell></row><row><cell cols="2">SqueezeBERT 0.8055</cell><cell>0.7588</cell><cell>0.7542</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>The evaluation results of the DeBERTa + BiLSTM models trained with five common learning rates on the subtask A's Validation Dataset of the English corpus.</figDesc><table><row><cell>Learning Rate</cell><cell>1e-6</cell><cell>5e-6</cell><cell>1e-5</cell><cell>3e-5</cell><cell>5e-5</cell></row><row><cell>Accuracy</cell><cell cols="5">0.8143 0.8149 0.8201 0.8194 0.8221</cell></row><row><cell>Macro F1</cell><cell cols="5">0.8093 0.8088 0.8024 0.8019 0.8182</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 4</head><label>4</label><figDesc>The final models are obtained after the DeBERTa model integrates RNN, BiLSTM, and TextCNN models respectively, and the evaluation result of the final models on the English corpus Validation Dataset.</figDesc><table><row><cell>Model</cell><cell cols="2">English Validation Dataset Accuracy Macro F1</cell></row><row><cell>DeBERTa+RNN</cell><cell>0.8163</cell><cell>0.8020</cell></row><row><cell>DeBERTa+BiLSTM</cell><cell>0.8201</cell><cell>0.8024</cell></row><row><cell>DeBERTa+TextCNN</cell><cell>0.8188</cell><cell>0.8059</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 5</head><label>5</label><figDesc>The final models are obtained after the mBERT model integrates RNN, BiLSTM, and TextCNN models respectively, and the scores of the final models on the Validation datasets of Hindi and Marathi corpus.</figDesc><table><row><cell></cell><cell cols="4">Hindi Validation Dataset Marathi Validation Dataset</cell></row><row><cell>Model</cell><cell>Accuracy</cell><cell cols="2">Macro F1 Accuracy</cell><cell>Macro F1</cell></row><row><cell>mBERT+RNN</cell><cell>0.8101</cell><cell>0.7906</cell><cell>0.8866</cell><cell>0.8712</cell></row><row><cell>mBERT+BiLSTM</cell><cell>0.8112</cell><cell>0.7951</cell><cell>0.8916</cell><cell>0.8748</cell></row><row><cell>mBERT+TextCNN</cell><cell>0.8124</cell><cell>0.7964</cell><cell>0.8980</cell><cell>0.8831</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 6</head><label>6</label><figDesc>DeBERTa+ BiLSTM model's accuracy and Macro F1 score on the Validation Dataset of the English corpus of Subtask B.</figDesc><table><row><cell>Model</cell><cell cols="2">English Validation Dataset Accuracy Macro F1</cell></row><row><cell>DeBERTa</cell><cell>0.7291</cell><cell>0.5968</cell></row><row><cell>DeBERTa+BiLSTM</cell><cell>0.7356</cell><cell>0.6051</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head></head><label></label><figDesc>Table 8 lists the best scores on the corpus of each language in Subtask A and Subtask B in the HASOC (2021) Challenge, as well as the official evaluation results of the answers I finally submitted.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7</head><label>7</label><figDesc>mBERT+TextCNN model's accuracy and Macro F1 score on the Hindi Validation Dataset of Subtask B.</figDesc><table><row><cell>Model</cell><cell cols="2">Hindi Validation Dataset Accuracy Macro F1</cell></row><row><cell>mBERT</cell><cell>0.7266</cell><cell>0.5911</cell></row><row><cell>mBERT+TextCNN</cell><cell>0.7337</cell><cell>0.5954</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 8</head><label>8</label><figDesc>After the official evaluation, my final score on each task, my ranking, and the best result of each task.</figDesc><table><row><cell>Subtask</cell><cell cols="2">Macro F1 Best Score My Score</cell><cell>My Rank</cell></row><row><cell>English Subtask A</cell><cell>0.8305</cell><cell>0.8030</cell><cell>6 / 56</cell></row><row><cell>English Subtask B</cell><cell>0.6657</cell><cell>0.6116</cell><cell>15 / 37</cell></row><row><cell>Hindi Subtask A</cell><cell>0.7825</cell><cell>0.7725</cell><cell>6 / 34</cell></row><row><cell>Hindi Subtask B</cell><cell>0.5603</cell><cell>0.5509</cell><cell>5 / 24</cell></row><row><cell>Marathi Subtask A</cell><cell>0.9144</cell><cell>0.8611</cell><cell>11 / 25</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A survey on automatic detection of hate speech in text</title>
		<author>
			<persName><forename type="first">P</forename><surname>Fortuna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nunes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of the hasoc track at fire 2019: Hate speech and offensive content identification in indo-european languages</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Modha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mandlia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Patel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th forum for information retrieval evaluation</title>
				<meeting>the 11th forum for information retrieval evaluation</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="14" to="17" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Modha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Shahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Jaiswal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Nandini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schäfer</surname></persName>
		</author>
		<idno>CoRR abs/2108.05927</idno>
		<ptr target="https://arxiv.org/abs/2108.05927" />
		<title level="m">Overview of the HASOC track at FIRE 2020: Hate speech and offensive content identification in indo-european languages</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Zampieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Malmasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rosenthal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Farra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kumar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1903.08983</idno>
		<title level="m">Semeval-2019 task 6: Identifying and categorizing offensive language in social media (offenseval)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Zampieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rosenthal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Atanasova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Karadzhov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mubarak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Derczynski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Pitenis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ç</forename><surname>Çöltekin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.07235</idno>
		<title level="m">Semeval-2020 task 12: Multilingual offensive language identification in social media (offenseval 2020)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Bagging bert models for robust aggression identification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Risch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Krestel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying</title>
				<meeting>the Second Workshop on Trolling, Aggression and Cyberbullying</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="55" to="61" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">3idiots at hasoc 2019: Fine-tuning transformer neural networks for hate speech identification in indo-european languages</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mishra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">FIRE (Working Notes)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="208" to="213" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Improving arabic text categorization using transformer training diversification</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Chowdhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Abdelali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Darwish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Soon-Gyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Salminen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">J</forename><surname>Jansen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifth Arabic Natural Language Processing Workshop</title>
				<meeting>the Fifth Arabic Natural Language Processing Workshop</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="226" to="236" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Overview of the HASOC subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Modha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Shahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Madhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Satapara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ranasinghe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zampieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Nandini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename></persName>
		</author>
		<ptr target="http://ceur-ws.org/" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Cross-lingual offensive language identification for low resource languages: The case of marathi</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gaikwad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ranasinghe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zampieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Homan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of RANLP</title>
				<meeting>RANLP</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech</title>
		<author>
			<persName><forename type="first">S</forename><surname>Modha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Shahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Madhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Satapara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ranasinghe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zampieri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021-12">December 2021</date>
			<biblScope unit="page" from="13" to="17" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Goodman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gimpel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Soricut</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1909.11942</idno>
		<title level="m">ALBERT: A Lite BERT for self-supervised learning of language representations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Singhal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X.-L</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2007.07834</idno>
		<title level="m">InfoXLM: An information-theoretic framework for cross-lingual language model pre-training</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.03654</idno>
		<title level="m">DeBERTa: Decoding-enhanced BERT with disentangled attention</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">XLNet: Generalized autoregressive pretraining for language understanding</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">SqueezeBERT: What can computer vision teach NLP about efficient neural networks?</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">N</forename><surname>Iandola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Shaw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">W</forename><surname>Keutzer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.11316</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Van Merriënboer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gulcehre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bougares</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1406.1078</idno>
		<title level="m">Learning phrase representations using RNN encoder-decoder for statistical machine translation</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">LSTM: A search space odyssey</title>
		<author>
			<persName><forename type="first">K</forename><surname>Greff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Koutník</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Steunebrink</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks and Learning Systems</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="2222" to="2232" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Lei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Barzilay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Jaakkola</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1508.04112</idno>
		<title level="m">Molding CNNs for text: non-linear, non-consecutive convolutions</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">An autoregulated fine-tuning strategy for titer improvement of secondary metabolites using native promoters in Streptomyces</title>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACS Synthetic Biology</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="522" to="530" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Reducing overfitting in diabetic retinopathy detection using transfer learning</title>
		<author>
			<persName><forename type="first">N</forename><surname>Barhate</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bhise</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">G</forename><surname>Sutar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">C</forename><surname>Karia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2020 IEEE 5th International Conference on Computing, Communication and Automation (ICCCA)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="298" to="301" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
