=Paper=
{{Paper
|id=Vol-3159/T1-39
|storemode=property
|title=Hate Speech Detection using LIME guided Ensemble Method and DistilBERT
|pdfUrl=https://ceur-ws.org/Vol-3159/T1-39.pdf
|volume=Vol-3159
|authors=Deepakindresh N,Rohan Avireddy,Aakash Ambalavanan,B Radhika Selvamani
|dblpUrl=https://dblp.org/rec/conf/fire/NAAS21
}}
==Hate Speech Detection using LIME guided Ensemble Method and DistilBERT==
Hate Speech Detection using LIME guided Ensemble Method and DistilBERT N Deepakindresh1 , AviReddy Rohan1 , Aakash Ambalavanan1 and B. Radhika Selvamani2 1 Vellore Institute of Technology, Chennai, India 2 Center for Advanced Data Science, Vellore Institute of Technology, Chennai, India Abstract Hate Speech classification has crucial applications in the social media domain. We describe the perfor- mance of our classifiers in the Hate Speech and Offensive Content Identification Track (HASOC) of FIRE 2021 conference. The dataset provided is for Indo-European Languages. We chose English tweets and developed two main classifiers as part of HASOC Track 1, which had two Subtasks 1A and 1B. Subtask 1A is a binary Hate Speech identification task, and Subtask 1B is multi grained classification of hate, profane, offensive and neutral content. Our team ”Beware Haters” studied Support Vector Machine, Random Forest, Logistic Regression, Bidirectional Long Short Term Memory Model and an Ensemble of the listed models for the Subtask 1A and the highest Macro F1 score we achieved was 0.7722 by our Ensemble model which combined the advantages of SVM, Logistic Regression and Random Forest. We used a model interpretation tool LIME, before integrating the models in a weighted Ensemble approach. For Subtask 1B, we obtained better results using a DistilBERT model that achieved a Macro F1 score of 0.6311. We have compared the performance of the basic DistilBERT Model with a fine tuned version. Keywords Hate Speech Identification, TF-IDF, Ensemble, LSTM, SVM, Random Forest, Logistic Regression, LIME, BERT, DistilBERT 1. Introduction The initial research in hate speech dates back to 1993 and has been legally defined by John T. Nockleby to describe any communication that incites hatred against anyone or any group of people in the name of race, ethnicity, religion, sexual orientation etc [1]. Hate speech identification task caught up with the Machine learning community during the spurt of user created content in social media during the last decade. Huge efforts have been put forth in the task of hate speech identification on the internet by Facebook [2], Twitter and Youtube, to conform to the legal and social responsibilities posed on these sites through governmental policies. Meanwhile, the research community finds the task of hate speech identification to be challenging, due to the diversity of the hate speech statements and the skewed nature of the collected data in most websites. Hate speech also poses challenges concerning the language Forum for Information Retrieval Evaluation, December 13-17, 2021, India Envelope-Open deepakindresh.n2019@vitstudent.ac.in (N. Deepakindresh); avireddynvsrk.rohan2019@vitstudent.ac.in (A. Rohan); aakash.ambalavanan2019@vitstudent.ac.in (A. Ambalavanan); radhika.selvamani@vit.ac.in (B. R. Selvamani) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) used and the context in which it originates [3]. These challenges have made the hate speech identification task an interesting topic to be studied in the light of new machine learning algorithms and approaches which have been made available by cloud based libraries. The first HASOC Track of workshops created annotated hate speech in Indo-European languages to enable continued research in this direction. A detailed explanation about the HASOC Track at FIRE 2019 conference and the datasets have been discussed by Mandl et al.,2019 [4]. This paper is a summary of the efforts put forth by our team Beware Haters in the HASOC 2021 Track 1 [5]. There were multiple tracks analysing Twitter tweets in different code-mixed languages. We participated in the English hate speech identification subtasks of Track 1 [6]. There were two different subtasks in Track 1. The Subtask 1A requires the participants to identify hate speech in the given tweets and Subtask 1B is about classifying the tweets into multiple classes such as Hate, Profane, Offensive, Neutral etc. 2. Dataset Analysis The dataset that we chose for analysis consisted only of English tweets. We analysed two different Subtasks 1A and 1B of Track 1. Subtask 1A is a binary classification task on tweets belonging to two distinct categories namely HOF (Hate and Offensive) and NOT (Non Hate- Offensive). HOF consists of hateful, offensive and profane content whereas NOT represents neutral content. Subtask 1B is a fine-grained classification problem, with 4 distinct classes, namely hate, offensive, profane and neutral (usually represented as none). The size of the dataset including training and test data is limited to about 4000 tweets. To overcome the data limitation we collected additional data from other sources. A free and publicly available Twitter hate speech dataset from Kaggle1 was chosen for data augmentation and solving the class imbalance problem in the dataset provided by HASOC. On manual scrutiny, we found substantial similarity between the HASOC and Kaggle datasets for Subtask 1A. This Kaggle dataset has approximately six thousand tweets with labels comparable to HOF and NOT. The HASOC training dataset contains 65% HOF tweets whereas the Kaggle dataset consists of only 45% HOF tweets [Fig.1]. The combined dataset has 40% of HASOC data and 60% of Kaggle data and was used for the Random Forest model and Ensemble model for Subtask 1A. The dataset provided by HASOC for fine-grained classification of English tweets (Subtask 1B of HASOC 2021 Track 1) has 18% hate, 16% offensive, and 31% profane tweets [Fig.2]. We did not use any additional data for Subtask 1B other than the dataset provided by the HASOC 2021 organizers. 3. Feature Engineering We extensively analysed the tweets from the dataset to decide on the pre-processing step. Tweets have been either removed or have been transformed using pattern matching techniques to deem them fit to the classification models under consideration. We have filtered out non-informative features from the tweets like URLs, white spaces, usernames that start with @ and hashtags. Other features like emojis have been filtered out. The tweets have been decontracted. Words 1 https://www.kaggle.com/vkrahul/twitter-hate-speech Figure 1: Distribution of HOF and NOT classes within the HASOC and the Kaggle datasets used for the binary classification task of HASOC2021 Subtask 1A. Figure 2: The distribution of classes within the dataset for the multilabel classification Subtask 1B of HASOC 2021. like won’t, don’t, can’t, he’ll, I’ll etc have been converted to their complete forms. Stop words are those that appear very frequently in the tweets but don’t help in conveying any meaning. Common stop words like the, not, is and was have been removed. In addition, we have performed tokenization and lemmatization of the preprocessed tweets. Table 1 shows some examples of tweets before and after preprocessing. Table 1 Tweets before and after Preprocessing Before preprocessing After preprocessing @krtoprak_yigit Soldier of Japan Who has dick head soldier japan dick head @blueheartedly You’d be better off asking who DOESN’T would better ask think sleazy think he’s a sleazy shitbag lmao. shitbag lmao @wealth if you made it through this && were not only able to start making money for yourself but make not able start make money sustain sustain living that way all from home, fuck these live way home fuck company corporate pig companies & corporate pigs. power to the people, power people always always. technically still turn back clock dick head Technically that’s still turning back the clock, dick head https://t.co/jbKaPJmpt1 Figure 3: The Word Cloud of the TF-IDF Vector Space after Removing the Stop Words. A word embedding is a fixed length floating point vector whose length may be pre-specified. Word embedding approaches transform the uneven text statements to efficient dense numerical vector representation of fixed length. Our benchmark algorithm used feature engineering techniques of n-grams with tf-idf vectorization [7] to convert words into vectors of size 3843 [Fig.3]. We have compared the performance of tf-idf vectorization to the native word2vec embedding of TensorFlow, where words are represented in an n-dimensional vector space in such a way that words are clustered based on their semantic similarity. [Fig.3] shows the feature space of the tf-idf approach after removing the stop words. Since we were restricted with very little data, we used a 32-dimensional word2vec embedding for the Bidirectional LSTM, a sequence classifier that captures contextual information from a sequence. The vectors obtained were plotted using the embedding projector provided by TensorFlow. The visualizations were useful to validate the vectorization approach, that categorizes semantically similar words to the same cluster. The plot in [Fig.4] was obtained by applying the Principal Component Analysis method on the word2vec vector space for dimension reduction. The result is plotted in 3D by spherizing the data. We visualized that the spherized word vectors of similar words such as idiot and uneducated were closer to each other [Fig.4] in the 3D plot. By configuring the T-SNE (T-distributed Stochastic Neighbour Embedding) with a perplexity measure of 25 and a learning rate of 10 we obtained a 2d plot as shown in [Fig.5]. The [Fig.5] has a visualization of the T-SNE word embedding after 2500 iterations. The words rape and rapers are present in the top cluster but the word grape, which rhymes with rape has been correctly classified into the bottom cluster. The PCA plot and T-SNE plot have been visualized for the words fuck [Fig.6], ass [Fig.7] and bjp [Fig.8] respectively. The word ass seems to be associated with other semantically similar words such as harass, embarrass etc. But compassion seems to be wrongly grouped in along with the above words. The word bjp has been associated with most political phrases irrespective of the polarity of the sentiments. We observe that the word2vec approach used clusters semantically similar words in an acceptable manner. Figure 4: A 3D visualization of the Word2Vec embedding using Principle Component Analysis tech- niques for dimension reduction. 4. The Hate Speech Identification Task Hate speech identification has been perceived as a binary classification problem to determine whether a twitter content is hateful-offensive or not. We compared the performance of various binary text classifiers on the training data. We used an n-gram based tf-idf vector embedding to prepare the training data used for learning the models. Logistic Regression [8] is a well-known simple regression model that serves as a basic model for binary classification. Random Forest is a widely used meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting [9]. The number of estimators and the sub-sample size for each estimator are some of the parameters used to fine tune the approach. We used 2000 estimators and a minimum sample-size of 2 per estimator as per the default settings of sklearn’s Random Forest Classifier. The Support Vector Machine [10] is a state-of-the-art machine learning model with proven performance in countless machine learning applications with sparse high dimensional data. It uses different kernels, namely Linear, polynomial, Radial basis function, and sigmoid to transform the data to a lower dimension, which enables application of maximum margin classifier for obtaining the decision plane. Figure 5: T-SNE word embedding for the rhyming words rape and grape. 4.1. Sequence Classifiers In addition to the simple classifiers listed above, we have explored sequence classifiers using Long Short-Term Memory Models (LSTM) [11]. LSTM is a kind of Recurrent Neural Network (RNN) model, which has the added benefit of encoding contextual meaning among the words for a longer time step. Bidirectional LSTMs overcome the directional bias associated with traditional LSTM. Bidirectional LSTMs train two instead of one LSTMs on the input sequence. The first on the input sequence as-is and the second on a reversed copy of the input sequence. This can provide additional context to the network and result in faster and even fuller learning of the concepts. The same training data used for the simple classifiers has been used to train the LSTM models. A Bidirectional LSTM layer with 64 unit follows an input layer of 32 units. The output is then connected to two dense layers activated by two types of functions: relu and sigmoid (for binary prediction) with 64 and 1 unit respectively. In this pipeline we used binary cross-entropy as the loss function and the Adam optimizer [12] to optimize the model parameters. We identified the right stopping epoch by analyzing the validation accuracy and thus finalizing the model parameters. 4.2. Ensemble Methods Ensemble learning is a process where multiple diverse models are integrated in a way to ob- tain better predictive performance than what could be obtained by each constituent model independently. We created an Ensemble classifier of Random Forest, Logistic Regression and SVM with the soft voting method [13]. The models used for Ensemble include a Random Forest classifier with 2000 estimators, a Logistic Regression model and a Support Vector Machine using a radial bias function. We used a Local Agnostic Model Interpretation approach to understand the performance of the models before building the Ensemble. Figure 6: Visualization using Principle Component Analysis for the Word fuck. 4.3. Lime Guided Ensemble Approach To have a better understanding of the models, we decided to used model explanation strategies. LIME is a local agnostic model interpreter [14]. The advantage of using LIME is that it provides uniform explanations across different models since it is model agnostic. It has been recently been proposed by Sangani et.al.,2021 [15] to be used to compare the high performing models in the Kaggle platform. LIME supports explanations on both regression and classification models. We used LIME to compare the predictions of our models. The explanations are provided for each model by means of highlighted text and a relevance bar chart [Fig.9]. The words highlighted in orange denote a hate content, whereas those highlighted in blue support neutral decision. They also provide a score for the overall final decision of the interpreter summarizing over the highlighted words. We used LIME to analyse the shortcomings of each model and fine-tuned the weights accordingly for the Ensemble technique. The best performance by the Ensemble models was achieved by assigning a weight of 1,2 and 1 to the Logistic Regression, Random Forest, and Figure 7: T-SNE visualization for the word ass. Figure 8: T-SNE visualization for the word bjp. SVM respectively. We chose the soft voting method over other methods as it predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an Ensemble of well-calibrated classifiers [13]. The particular LIME implementation we used was time-consuming, hence we could only make a qualitative decision based on the manual analysis of the limited number of explanations obtained for each model. We also had issues trying to implement LIME in other models such as LSTM as a lot of processing was required before and after training and prediction. We couldn’t fit LSTM into the LIME pipeline to give explanations for predictions. The LIME explanations are provided in [Fig.9] and [Fig.10] for predictions made on a hateful content and a neutral content by Random Forest, Logistic Regression, SVM, and Ensemble model respectively. We used LIME for analysing 6 manually sampled instances from the dataset. From [Fig.9] we can infer that SVM has done better than Logistic Regression and Random Forest as it has assigned a higher weight to the word covidvaccine and other models have either made a mistake or have assigned lesser weights. In [Fig.10] Random Forest has done best since it has the highest prediction score for hate speech along with SVM. But SVM misses relevant words. Random forest has correctly classified mattancock as a word of Hate Speech, unlike SVM which has mistaken it for Non-Hate Speech. The Ensemble model’s prediction has also been explained and it is clearly visible that it has Figure 9: LIME Explanations for Non Hate Sentences for Random Forest, Logistic Regression, SVM and Ensemble model respectively used a combined result of all three models for its prediction making it unbiased and being able to perform better. 5. Fine Grained Classification for Subtask 1B For Subtask 1B which requires a multi-label classifier, we turned to DistilBERT [16]. Google provides pre-trained partial BERT [17] language models which may be fine tuned based on our Figure 10: LIME Explanations for Hate Sentences for Random Forest, Logistic Regression, SVM and Ensemble model respectively application. BERT is designed to pre-train deep bidirectional representations from the unlabeled text by jointly conditioning on both left and right context words. BERT uses the advantages of transfer learning, a method where a model developed for a task is reused as the starting point for a model on a second task. We have built DistilBERT, a model trained in a self-supervised fashion using the BERT base model as a teacher. It is smaller and faster than BERT. We have trained DistilBERT on the hate speech corpus [18]. The main reason for choosing this model, unlike other sequence classifiers like RNN is that, the DistilBERT model works on the entire sequence at once instead of reading the tokens sequentially. This process can be further accelerated using GPU support provided by Google Colab. We could adapt the pre-trained DistilBERT model by training it further on our relatively smaller dataset, further fine-tuning the parameters for better accuracy without much computational overhead. The tweets in the dataset were of different lengths. We used padding to normalize all the tweets to have the maximum sequence length. The DistilBERT model we chose was pre-trained on unlabelled Wikipedia corpus. It was then fine-tuned for our corpus by adding the output layer. The DistilBERT [18] model we chose facilitated in developing a multi label regression classifier. The output vector from the model is a 4-dimensional numeric vector. We used one-hot encoding on the class labels of the existing HASOC dataset to obtain a 4-dimensional boolean output vector for each tweet. The first value of the vector indicates hate class, the second profane, the third offensive and the fourth represents neutral. For eg., a tweet that is profane will be encoded [0 1 0 0]. While classifying a tweet using DistilBERT, on obtaining the 4 dimensional output vector, the tweet is assigned the class corresponding to the vector position which has the highest numerical value. 6. Results 6.1. Metrics For Comparison The models were compared using precision, recall, F-Measure and accuracy. • Precision: Precision is also known as the positive predicted value. It is the proportion of predictive positives which are actually positive (true positives). • Recall: It is the proportion of actual positives which are predicted positive. • F-Measure: It is the harmonic mean of precision and recall. The standard F-measure (F1) gives equal importance to precision and recall. • Accuracy: It is the number of correctly classified instances (true positives and true negatives). 6.2. Comparing the Models for Subtask 1A For Subtask 1A the combined training dataset from HASOC and Kaggle were used for training and the test dataset was randomly sampled from the HASOC training dataset. The split was 90:10 for training and testing. We have experimented with the HASOC dataset as well as the combined dataset. In both cases the test cases were sampled from the HASOC test dataset. The augmented kaggle dataset did play a significant role to improve the accuracy of Ensemble and Random Forest while it decreased the performance for the others such as Logistic Regression, Bidirectional LSTM and SVM. We have tabulated the best results obtained for the different models among the different training sets used. The performance of Random Forest and Ensemble models trained on the combined HASOC and Kaggle datasets and the other models trained on the HASOC dataset have been provided in Table 2. The data has been ordered based on accuracy. From the plots 11 and 12 we can infer that the Ensemble model performs much better than the independent models as it combines the advantages of all the other models and reduces the ill effects of each model. We have trained the LSTM model for 100 epochs, at which it gives the best performance. LSTM model does not fare well compared to other models with a Macro F1 of just 0.7198. Yet it provides valuable insights about the hate words through embedding vectors that we projected Figure 11: Accuracy of the models built for Task 1A Figure 12: Macro F1, Macro Precision and Macro Recall of all the models built for Task1A using embedding projectors explained in section 3. With more data the model could have performed better by learning improved hate word embeddings and understanding the context better. Although from the given LIME examples the significance of the Logistic Regression model in the Ensemble might not be visible, we could get the best Ensemble performance only when Logistic Regression was included. 6.3. Comparison between Fine tuned and Untuned DistilBERT Models for Subtask 1B For Subtask 1B we tested two versions of DistilBERT model on a 90:10 training and test data split of the HASOC dataset. The untuned initial version has 4 hidden layers each with 128, 64, 32 and 4 units. We removed the hidden layer with 128 units while fine tuning the DistilBERT model. To prevent underfitting we reduced dropout from 0.1 to 0.05. We trained the fine tuned DistilBERT model for 6 epochs, unlike the untuned version which had only 5 epochs. The major changes that led to a colossal increase in accuracy were basically with respect to simplifying the hidden layer, reducing the drop out and increasing the training epochs. The results are found in Table 3. Table 2 Submissions for Subtask 1A Classifier Macro F1 Macro prec Macro recall Accuracy (%) Logistic Regression 0.7462 0.7902 0.7344 78.064 Linear SVM 0.7386 0.7882 0.7270 77.596 Bi-directional LSTM 0.7198 0.7326 0.7138 74.629 Random Forest 0.7631 0.7775 0.7558 78.532 Ensemble(RF+SVM+LR) 0.7722 0.7859 0.7649 79.313 Table 3 Submissions for Subtask 1B Classifier Macro F1 Macro Prec Macro Recall Accuracy (%) Untuned DistilBERT 0.6106 0.6223 0.6098 66.667 Finetuned DIstilBERT 0.6311 0.6364 0.6303 67.681 7. Conclusion We classified the Hate Speech dataset provided by the HASOC Track 1 of FIRE 2021. Though Subtask 1A is a simple binary classifier we could not achieve the expected accuracy and stood 25th out of the total 56 submissions [Fig.13]. The reason is that Subtask 1A had a limited data and our decision to augment it with more data from Kaggle probably did not work. Meanwhile, we observed that an Ensemble method improved our accuracy over other approaches used. Work is under progress in this line to quantify the interpretations provided by LIME. The multigrain classification problem was the toughest with the benchmark performance as low as 0.5 (Precision and F1 score). With respect to the fine-grained classifier developed for Subtask 1B, DistilBERT gave superior performance compared to other approaches and scored the 8th rank among the total 37 submissions [Fig.14]. It is interesting to note that both Subtask 1A and Subtask 1B required entirely different approaches though they were all Hate Speech classifiers. Figure 13: Performance of the Ensemble Model based Hate Speech Classifier (Beware Haters) in HASOC 2021 Track 1 Subtask A Figure 14: Performance of the DistilBERT Model based Hate Speech Classifier (Beware Haters) in HASOC 2021 Track 1 Subtask B References [1] J. Nockleyby, Hate speech in encyclopedia of the american constitution, Electronic Journal of Academic and Special librarianship (2000). [2] T. Guardian, Zuckerberg on refugee crisis: ’hate speech has no place on facebook’, 2016. URL: https://www.theguardian.com/technology/2016/feb/26/ mark-zuckerberg-hate-speech-germany-facebook-refugee-crisis. [3] Z. Zhang, L. Luo, Hate speech detection: A solved problem? the challenging case of long tail on twitter, Semantic Web 10 (2019) 925–945. [4] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandalia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in indo- european languages, Proceedings of the 11th Forum for Information Retrieval Evaluation (2019). [5] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, M. Zampieri, Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech, in: FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December 2021, ACM, 2021. [6] T. Mandl, S. Modha, G. K. Shahi, H. Madhu, S. Satapara, P. Majumder, J. Schäfer, T. Ranas- inghe, M. Zampieri, D. Nandini, A. K. Jaiswal, Overview of the HASOC subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Lan- guages, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021. URL: http://ceur-ws.org/. [7] J. E. Ramos, Using tf-idf to determine word relevance in document queries, 2003. [8] O. Oriola, E. Kotzé, Evaluating machine learning techniques for detecting offensive and hate speech in south african tweets, IEEE Access 8 (2020) 21496–21509. [9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. [10] D. Robinson, Z. Zhang, J. Tepper, Hate speech detection on twitter: Feature engineering vs feature selection, in: European Semantic Web Conference, Springer, 2018, pp. 46–49. [11] M. Sundermeyer, R. Schlüter, H. Ney, Lstm neural networks for language modeling, in: Thirteenth annual conference of the international speech communication association, 2012. [12] D. Kingma, J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations (2014). [13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. [14] M. T. Ribeiro, S. Singh, C. Guestrin, ”why should i trust you?”: Explaining the predictions of any classifier, 2016. a r X i v : 1 6 0 2 . 0 4 9 3 8 . [15] R. B. Sangani, A. Shukla, R. B. Selvamani, Comparing deep sentiment models using quantified local explanations, in: Accepted for publication in Proceedings of IEEE-Smart Technologies, Communication Robotics 2021 Conference, IEEE, 2021. [16] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, ArXiv abs/1910.01108 (2019). [17] S. Yu, J. Su, D. Luo, Improving bert-based text classification with auxiliary sentence and domain knowledge, IEEE Access 7 (2019) 176600–176612. doi:1 0 . 1 1 0 9 / A C C E S S . 2 0 1 9 . 2953990. [18] R. Mutanga, N. Naicker, O. O. Olugbara, Hate speech detection in twitter using transformer methods, International Journal of Advanced Computer Science and Applications 11 (2020).