<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Hate Speech and Offensive Language Identification on Multilingual code-mixed Text using BERT</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Snehaan</forename><surname>Bhawal</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Kalinga Institute of Industrial Technology</orgName>
								<address>
									<settlement>Odisha</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pradeep</forename><surname>Kumar</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Indian Institute of Information Technology Surat</orgName>
								<address>
									<region>Gujarat</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Abhinav</forename><surname>Kumar</surname></persName>
							<affiliation key="aff2">
								<orgName type="department">Siksha &apos;O&apos; Anusandhan</orgName>
								<orgName type="institution">Deemed to be University</orgName>
								<address>
									<settlement>Bhubaneswar</settlement>
									<region>Odisha</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Hate Speech and Offensive Language Identification on Multilingual code-mixed Text using BERT</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D1D0129B586869DBCF5A307DD18DA577</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T01:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Multilingual Text</term>
					<term>Hate Speech</term>
					<term>Deep Learning</term>
					<term>Machine Learning</term>
					<term>BERT</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Hate Speech and Offensive Content detection in social media has been an active field of research for the last couple of years. For the majority of the world consisting of non-native English speakers, most of the time unofficial messages are written in code-mixed language in a combination of words in a native language with English text. The current study focuses on using Machine and Deep learning techniques for detection of Hate Speech and Offensive content in a Malayalam and Tamil code-mixed text collected from social media. The study showed that Deep learning models perform better than the machine learning models, specifically the implementation of BERT based transfer learning models performed best.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Hate Speech is generally defined as content that expresses hate or prejudice against a particular group, ethnicity, religion, nationality or sexual orientation <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>. Social network platforms consist of a large amount of user-generated content, and because much of it is unmoderated, there is widespread use of targeted hate speech against certain individuals, which has become a critical issue <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>.</p><p>Humans cannot continuously moderate social media networks to read, identify, and deal with the hateful text that these platforms generate at such high frequency, affecting users mentally. Thus, there is a need for automation, and it has already been established that automated detection of such content is successful to a certain extent. Davidson et al. <ref type="bibr" target="#b5">[6]</ref> used Logistic Regression with n-gram TF-IDF features to classify Offensive and Non-Offensive text. In another paper, a neural network-based approach was presented by Badjatiya et al. <ref type="bibr" target="#b6">[7]</ref>, who used GloVe embeddings with CNNs and LSTMs to provide better results.</p><p>However, most of the research on hate speech and offensive language detection is predominantly for the English language <ref type="bibr" target="#b0">[1]</ref>. In a country like India, home to numerous regional languages, people have adapted to using a mix of regional and English languages to express themselves on social media <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. The current research is done on bilingual texts that contain words from both languages written in one script, called code-mixed text. Another way of combining words, in which native-script writing is mixed with English text, is known as script-mixed text. Such texts are far more challenging to work with, as they require a different tokenization process from the one needed for English text. Examples of popular code-mixed languages in India are Hinglish (Hindi and English), Tanglish (Tamil and English), Manglish (Malayalam and English), and a mixed language of Kannada and English <ref type="bibr" target="#b9">[10]</ref>.</p><note place="foot">FIRE 2021, Forum for Information Retrieval Evaluation, December 13-17, 2021. snehaan@gmail.com (S. Bhawal); pradeep.roy@iiitsurat.ac.in (P. K. Roy); abhinavkumar@soa.ac.in (A. Kumar). ORCID: 0000-0002-1072-5326 (S. Bhawal); 0000-0001-5513-2834 (P. K. Roy); 0000-0001-9367-7069 (A. Kumar).</note><p>Identifying Hate Speech in such code-mixed languages is much more challenging than in English <ref type="bibr" target="#b10">[11]</ref> due to the absence of sufficient NLP resources. Models trained on a monolingual corpus may find it difficult to provide satisfactory results, because the system learns and recognizes only the words present in the vocabulary seen during training. In the case of code-mixed text, many new words are introduced that are not present in the training vocabulary. Such words are marked as out-of-vocabulary tokens that contribute nothing to the model's estimation, and the performance of the model therefore decreases.</p><p>The current study focuses on Offensive language identification in the code-mixed languages Tanglish and Manglish, using the data set provided in the HASOC-Dravidian-CodeMix-FIRE2021 challenge. 
An overview of the dataset can be found in <ref type="bibr" target="#b11">[12]</ref>. We have implemented a number of machine learning and deep learning models, including transfer learning models such as BERT, to distinguish between Offensive and Non-Offensive text.</p><p>The rest of the article is organized as follows: Section 2 discusses the related works. Sections 3, 3.1, and 4 provide the task description and the pre-processing steps taken, followed by an explanation of the proposed methodology. The experimental results and discussion are presented in Sections 5 and 6, respectively. Section 7 concludes the work by highlighting the limitations and future scope.</p></div>
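The out-of-vocabulary effect described in the Introduction can be illustrated with a minimal sketch. The vocabulary and the Tamil-English example sentence below are purely hypothetical, not taken from the paper's data:

```python
# A tokenizer built over an English-only training vocabulary maps every
# unseen code-mixed word to a single <unk> id, so those words carry no
# signal for the downstream classifier.
vocab = {"<unk>": 0, "this": 1, "movie": 2, "is": 3, "great": 4}

def encode(sentence):
    """Map each whitespace token to its vocabulary id, or <unk> if absent."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]

encode("this movie is great")      # [1, 2, 3, 4] -- fully in-vocabulary
encode("intha movie romba nalla")  # [0, 2, 0, 0] -- Tamil words collapse to <unk>
```

Monolingual models therefore see code-mixed sentences as mostly unknown tokens, which is the performance degradation the study sets out to address.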
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Literature Review</head><p>The use of Hate speech and Offensive language has become one of the major issues concerning social networking platforms and has hence received considerable attention from researchers worldwide <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>. Roy et al. <ref type="bibr" target="#b0">[1]</ref> developed a deep learning-based framework to address the hate speech issue on Twitter. They used a Convolutional Neural Network to process tweets and predict whether they were Hate or non-Hate. They considered only tweets written in the English language, and their framework is hence unable to handle multilingual texts such as Tamil-English, Kannada-English and others. Badjatiya et al. <ref type="bibr" target="#b6">[7]</ref> developed a deep learning model to classify tweets into racist, sexist or neither categories; their model was evaluated on 16k labelled samples and outperformed existing models. The main issue with the existing works is language coverage: most existing research uses English datasets. However, people currently prefer to post messages on social platforms in code-mixed languages such as Hindi-English, Tamil-English and others.</p><p>Recent work by Kumar et al. <ref type="bibr" target="#b3">[4]</ref> proposed a deep learning-based framework to classify Tamil and Malayalam code-mixed YouTube comments into offensive and non-offensive categories. Many machine and deep learning models were experimented with, and the best result was obtained when character n-gram TF-IDF features were passed to a dense neural network; their model achieved a weighted F1-score of 0.95. Suryawanshi et al. <ref type="bibr" target="#b12">[13]</ref> developed resources for Tamil meme detection. The developed dataset consisted of two labels: troll and not_troll. A total of ten models were submitted, and the model with an F1-score of 0.55 secured the first rank among them. Banerjee et al. <ref type="bibr" target="#b13">[14]</ref> compared the performance of pre-trained models on a Hinglish code-mixed dataset for predicting Hate and non-Hate posts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Task and Data Description</head><p>The current study is an implementation and comparison of different machine and deep learning models for a Hate Speech and Offensive Language detection system for Tamil and Malayalam code-mixed texts in English. The dataset consists of sentences collected from comments or posts on social media. Table <ref type="table" target="#tab_0">1</ref> shows an overview of the data used in this analysis. There are two sets of data, Malayalam code-mixed and Tamil code-mixed, each consisting of code-mixed sentences, most of which also contain various emojis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data Preprocessing</head><p>As the data was code-mixed, with Malayalam or Tamil mixed with English, no stop-word removal was done. The text, being informal in nature, contained emojis and emoticons, which were replaced with their respective textual meanings using data from the Unicode Consortium's emoji code repository via the demoji library. This was followed by the removal of punctuation, URLs, email ids, hyperlinks and numeric data from the text.</p></div>
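A minimal sketch of the cleaning steps described above, using only the standard library. The regular expressions and their order are our own illustrative choices, not the paper's exact implementation; the emoji-to-text step (via the demoji library, e.g. `demoji.replace_with_desc`) is assumed to have been applied beforehand:

```python
import re

def clean_text(text: str) -> str:
    """Remove URLs/hyperlinks, email ids, numeric data and punctuation,
    mirroring the preprocessing described above (emoji replacement via
    demoji is assumed to have run first)."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs and hyperlinks
    text = re.sub(r"\S+@\S+", " ", text)                # email ids
    text = re.sub(r"\d+", " ", text)                    # numeric data
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

clean_text("Check www.example.com now!!! 123 mail me at a@b.com")
# -> "Check now mail me at"
```

Note that email removal must precede punctuation removal, otherwise stripping `@` would leave the address fragments behind.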
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Methodology</head><p>This section discusses the working of the implemented models in detail; the code can be found in the GitHub repository <ref type="foot" target="#foot_0">1</ref>. In the current study, three different approaches were used, as shown in Figure <ref type="figure" target="#fig_0">1:</ref> (i) conventional machine learning based models, (ii) neural network based models, and (iii) transfer learning based models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Traditional ML Models</head><p>In traditional ML-based models, we used a 1- to 5-gram word TF-IDF feature set. The extracted features were then fed to classifiers such as Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), XGBoost (XGB), and Support Vector Machine (SVM). The performance of these models was evaluated in terms of precision, recall, and F1-score <ref type="bibr" target="#b14">[15]</ref>. A detailed performance report of these models is provided in Section 5.</p></div>
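The TF-IDF + classifier setup above can be sketched with scikit-learn. The toy sentences and labels below are invented placeholders for the code-mixed corpus, and the pipeline is our own minimal reconstruction rather than the authors' exact code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the code-mixed corpus.
texts = ["nalla video bro", "worst video waste", "super movie", "waste fellow worst"]
labels = [0, 1, 0, 1]  # 0 = Not Offensive, 1 = Offensive (illustrative only)

# 1- to 5-gram word TF-IDF features fed to Logistic Regression,
# mirroring the LR configuration described above.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 5))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)
pred = model.predict(["worst waste video"])  # array with one 0/1 label
```

Swapping the `clf` step for `MultinomialNB`, `RandomForestClassifier`, or an SVM reproduces the other classifiers in the comparison.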
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Neural Network based models</head><p>In neural network-based models, the 1- to 5-gram TF-IDF features extracted while working with the machine learning models were used again as input to a simple deep neural network (DNN). This model consisted of four fully connected layers in sequential order, with 512, 256, 128 and 1 neurons in the first, second, third and fourth (output) layers. As the classification is between two distinct labels, a single output neuron was used. The hidden neurons used the ReLU activation function, while the output neuron used a sigmoid activation function, with Adam and binary cross-entropy as the chosen optimizer and loss function, respectively.</p><p>The second neural model experimented with is a Convolutional Neural Network (CNN) <ref type="bibr" target="#b15">[16]</ref>. The CNN consisted of one Conv1D layer followed by a Global Max Pooling and a Dropout layer connected to a fully connected sequential network with two hidden layers of 128 and 64 neurons, respectively. The activation function for the hidden neurons was ReLU, and the output was a single neuron with sigmoid activation. An embedding layer was used as the input layer, with the embedding dimension set to 50 and the input length set to 120; a (120, 50)-dimensional embedding matrix was therefore given as input to the CNN. The convolutional layer consisted of 64 filters with a kernel size of three.</p><p>Our final neural network based model was a Bidirectional Long Short-Term Memory (Bi-LSTM) model, consisting of 256 memory units followed by Global Max Pooling and Batch Normalization. The input layer was an embedding layer with 50 dimensions and length padded to 120, as in the previous model. Two fully connected dense layers served as the hidden layers, comprising 20 and 10 neurons, respectively, with ReLU as the activation function; these were connected to a single output neuron with sigmoid activation. Hyper-parameter tuning was subsequently done for the described models to find the optimal performance by adjusting the optimizer, learning rate and embedding dimensions. Our experiments yielded the best result with a learning rate of 0.0001 and the Adam optimizer. The embedding dimension was set to 50, as it gave the best result. Given the binary classification task and the overall balanced nature of the data set, binary cross-entropy was kept as the loss function, with a sigmoid activation function for the output neuron.</p></div>
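The CNN configuration described above (embedding dimension 50, padded length 120, 64 filters of kernel size 3, 128/64 hidden neurons, sigmoid output, Adam at learning rate 0.0001) can be sketched in Keras. The vocabulary size and dropout rate below are assumptions, as the paper does not report them:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size = 20000  # assumption: the paper does not report the vocabulary size

# Sketch of the described CNN; inputs are integer token-id sequences
# padded to length 120.
model = models.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=50),  # (120,) -> (120, 50)
    layers.Conv1D(filters=64, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),  # dropout rate is an assumption
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary OFF/NOT output
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```

Replacing the `Conv1D`/pooling pair with `layers.Bidirectional(layers.LSTM(256, return_sequences=True))` followed by pooling and batch normalization yields the Bi-LSTM variant described above.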
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Transfer Learning</head><p>This study implemented BERT (Bidirectional Encoder Representations from Transformers) models to exploit their transfer learning capabilities. For these models, no preprocessing was done. Three different variants of BERT models were studied:</p><p>i BERT (multilingual) <ref type="bibr" target="#b16">[17]</ref>.</p><p>ii IndicBERT <ref type="bibr" target="#b17">[18]</ref>. iii Multilingual Representations for Indian Languages (MuRIL) <ref type="bibr" target="#b18">[19]</ref>.</p><p>The multilingual BERT <ref type="bibr" target="#b16">[17]</ref> model was trained on 102 languages with masked language modelling. The case-sensitive model was chosen, as no prior data pre-processing was done for the transfer learning models. IndicBERT <ref type="bibr" target="#b17">[18]</ref> is a multilingual ALBERT model, pretrained exclusively on a corpus of 12 major Indian languages. Compared to other such BERT-based models, IndicBERT is smaller and has far fewer parameters. We used the ktrain <ref type="bibr" target="#b19">[20]</ref> library to develop the IndicBERT model. The last model that we used is MuRIL (Multilingual Representations for Indian Languages) <ref type="bibr" target="#b18">[19]</ref>. MuRIL is a BERT model trained over a corpus of 17 Indian languages along with their translated and transliterated counterparts. The differentiating factor is that IndicBERT is trained only on the native Indian scripts, whereas MuRIL is trained on the traditional scripts as well as their transliterated counterparts in Roman script. The benefit of this is evident in our experiments, which deal with code-mixed data of Indian languages and English written strictly in Roman script.</p></div>
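The fine-tuning setup shared by all three BERT variants amounts to a pretrained encoder with a two-label classification head. As a hedged sketch: in the actual experiments a pretrained checkpoint (e.g. `google/muril-base-cased` on the Hugging Face hub) would be loaded with `AutoModelForSequenceClassification.from_pretrained(...)`; here a small randomly initialised BERT of the same structure stands in so the head wiring can be shown without downloading weights, and all dimensions below are illustrative assumptions:

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny illustrative configuration (real MuRIL/BERT-base is much larger).
config = BertConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,  # OFF vs NOT
)
model = BertForSequenceClassification(config)

# Dummy batch of token ids; real inputs would come from the matching tokenizer.
input_ids = torch.randint(0, config.vocab_size, (2, 16))
logits = model(input_ids=input_ids).logits  # shape (batch, num_labels) = (2, 2)
```

Fine-tuning then proceeds by minimising cross-entropy over the two logits, exactly the binary objective used by the other models in this study.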
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>This section presents the results of all experiments done during this study, as described in Section 4. The results shown below correspond to the model predictions on the validation data and are reported in terms of precision, recall, and F1-score for the OFF (Offensive) and NOT (Not Offensive) classes. A model is considered the best if it reports the highest weighted averages of precision, recall, and F1-score. The best results for each data set are presented in bold for each model used in this study. Traditional ML models were built using 1- to 5-gram character TF-IDF features and included LR, RF, NB, XGB and SVM; their results are shown in Table <ref type="table" target="#tab_1">2</ref>. On the Malayalam code-mixed data set, the LR classifier gave the best performance, with recall and F1-score of 0.70. Similarly, on the Tamil code-mixed text, the LR classifier performed best and reported a precision of 0.83 with recall and F1-score of 0.82.</p><p>Results of the neural network models are presented in Table <ref type="table" target="#tab_2">3</ref>. A simple DNN with an embedding layer provided the best results on the Malayalam code-mixed data, with a precision of 0.75 and recall and F1-score of 0.74. On the Tamil code-mixed data, the CNN showed the best performance, with precision reaching 0.90 and recall and F1-score of 0.89.</p><p>In Table <ref type="table" target="#tab_3">4</ref> the results of the different BERT models are presented. On both the Malayalam and Tamil data, the MuRIL model performed best. On the Malayalam data, the precision was 0.79 with recall and F1-score of 0.78, and on the Tamil data, precision, recall and F1-score were all 0.91, the highest among all experimented models.</p></div>
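The weighted averages in the tables follow the standard definition: per-class scores weighted by class support. As a worked check, using the LR row for the Malayalam validation split (per-class F1 from Table 2, support from Table 1):

```python
# Per-class F1 for the LR classifier on Malayalam validation data (Table 2).
f1_off, f1_not = 0.69, 0.72
# Validation support for each class (Table 1): 478 Offensive, 473 Not Offensive.
n_off, n_not = 478, 473

# Weighted-average F1 = support-weighted mean of the per-class F1 scores.
weighted_f1 = (f1_off * n_off + f1_not * n_not) / (n_off + n_not)
print(round(weighted_f1, 2))  # 0.7 -- matches the Weighted Avg row in Table 2
```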
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion</head><p>Among all experimented models, MuRIL, a transfer learning model, performed the best for both Malayalam and Tamil code-mixed data. The experimental outcomes show that the traditional machine learning models are unable to understand the context of the message and hence may not be a good choice for this task. A simple Deep Neural Network (DNN) with an embedding layer performed better than most of the machine learning models (Tables <ref type="table" target="#tab_2">2,  3</ref>). Although some machine learning results came close to those of the neural network models, we were dealing mostly with text data consisting of single sentences; for multi-sentence texts, a neural network able to hold some memory, such as an LSTM, would have outclassed the machine learning models <ref type="bibr" target="#b20">[21]</ref>.</p><p>As shown in Table <ref type="table" target="#tab_3">4</ref>, the IndicBERT model is not able to perform as well as the multilingual BERT model. This may be because the data set consisted of code-mixed data in the Roman script only. Had there been text written in the traditional scripts, the multilingual BERT model would have treated most of the tokens as unknown tokens, which would have affected its performance and benefited the IndicBERT model, as it was trained on monolingual Indian scripts. Finally, MuRIL, which was trained on a corpus of both the traditional scripts and their transliterations, performed better than all the other models. The results reported above (Tables <ref type="table" target="#tab_2">2, 3</ref>, 4) were based on predictions over the validation data set. On the test data, the proposed MuRIL model achieved precision, recall and F1-score values of 0.679, 0.673 and 0.636, respectively, for the Tamil code-mixed data, while on the Malayalam code-mixed data the precision, recall and F1-score values were 0.752, 0.727, and 0.734, respectively, for the best case.</p><p>The models were re-run on the labelled test data, and the results obtained with the different machine learning, neural network and transfer learning models are shown in Table <ref type="table" target="#tab_4">5</ref>. As on the validation data, MuRIL, a transfer learning model, produced the best prediction outcomes in terms of weighted average precision, recall and F1-score for both the Tanglish and Manglish test datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>Hate speech and offensive language detection is still a challenge for low-resource and code-mixed languages in NLP. We implemented various machine learning, deep learning and transfer learning models to find the most suitable model for the code-mixed Tamil and Malayalam datasets. The reported results show that the deep learning models, specifically the pre-trained models, outperformed the machine learning models. The MuRIL model performed the best, reporting a weighted F1-score of 0.636 on the Tamil code-mixed data; the same model provided a weighted F1-score of 0.734 on the Malayalam code-mixed data. On the test data, the BERT and MuRIL transfer learning models yielded almost similar outcomes. In the future, a better model may be built with additional preprocessing steps on the dataset to achieve better prediction accuracy.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Framework used to predict the offensive post</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Distribution of Data in the Training and Validation classes</figDesc><table><row><cell>Data Set</cell><cell>Class</cell><cell cols="3">Offensive Not Offensive Total</cell></row><row><cell>Malayalam</cell><cell>Train</cell><cell>1952</cell><cell>2047</cell><cell>3999</cell></row><row><cell>Code-Mixed</cell><cell>Validation</cell><cell>478</cell><cell>473</cell><cell>951</cell></row><row><cell>Tamil</cell><cell>Train</cell><cell>2019</cell><cell>1980</cell><cell>3999</cell></row><row><cell>Code-Mixed</cell><cell>Validation</cell><cell>465</cell><cell>465</cell><cell>940</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Results of Traditional ML Models on Malayalam and Tamil code-mixed validation data set</figDesc><table><row><cell cols="2">Model Class</cell><cell cols="3">Malayalam Code-Mixed</cell><cell cols="3">Tamil Code-Mixed</cell></row><row><cell></cell><cell></cell><cell cols="6">Precision Recall F1-score Precision Recall F1-score</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.73</cell><cell>0.65</cell><cell>0.69</cell><cell>0.81</cell><cell>0.86</cell><cell>0.83</cell></row><row><cell>LR</cell><cell>Not Offensive</cell><cell>0.68</cell><cell>0.75</cell><cell>0.72</cell><cell>0.85</cell><cell>0.79</cell><cell>0.82</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.70</cell><cell>0.70</cell><cell>0.70</cell><cell>0.83</cell><cell>0.82</cell><cell>0.82</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.78</cell><cell>0.51</cell><cell>0.61</cell><cell>0.78</cell><cell>0.87</cell><cell>0.82</cell></row><row><cell>RF</cell><cell>Not Offensive</cell><cell>0.63</cell><cell>0.85</cell><cell>0.73</cell><cell>0.85</cell><cell>0.76</cell><cell>0.80</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.71</cell><cell>0.68</cell><cell>0.67</cell><cell>0.81</cell><cell>0.81</cell><cell>0.81</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.71</cell><cell>0.66</cell><cell>0.69</cell><cell>0.84</cell><cell>0.77</cell><cell>0.81</cell></row><row><cell>NB</cell><cell>Not Offensive</cell><cell>0.68</cell><cell>0.73</cell><cell>0.71</cell><cell>0.78</cell><cell>0.85</cell><cell>0.82</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.70</cell><cell>0.70</cell><cell>0.70</cell><cell>0.81</cell><cell>0.81</cell><cell>0.81</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.81</cell><cell>0.33</cell><cell>0.47</cell><cell>0.67</cell><cell>0.93</cell><cell>0.78</cell></row><row><cell>XGB</cell><cell>Not 
Offensive</cell><cell>0.58</cell><cell>0.92</cell><cell>0.71</cell><cell>0.89</cell><cell>0.52</cell><cell>0.66</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.70</cell><cell>0.63</cell><cell>0.59</cell><cell>0.77</cell><cell>0.73</cell><cell>0.72</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.72</cell><cell>0.60</cell><cell>0.66</cell><cell>0.76</cell><cell>0.88</cell><cell>0.82</cell></row><row><cell>SVM</cell><cell>Not Offensive</cell><cell>0.66</cell><cell>0.77</cell><cell>0.71</cell><cell>0.85</cell><cell>0.72</cell><cell>0.78</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.69</cell><cell>0.68</cell><cell>0.68</cell><cell>0.81</cell><cell>0.80</cell><cell>0.80</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Results of Neural Network based models on Malayalam and Tamil code-mixed data set</figDesc><table><row><cell cols="2">Model Class</cell><cell cols="3">Malayalam Code-Mixed</cell><cell cols="3">Tamil Code-Mixed</cell></row><row><cell></cell><cell></cell><cell cols="6">Precision Recall F1-score Precision Recall F1-score</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.73</cell><cell>0.67</cell><cell>0.70</cell><cell>0.84</cell><cell>0.83</cell><cell>0.83</cell></row><row><cell>DNN</cell><cell>Not Offensive</cell><cell>0.69</cell><cell>0.75</cell><cell>0.72</cell><cell>0.83</cell><cell>0.83</cell><cell>0.83</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.71</cell><cell>0.71</cell><cell>0.71</cell><cell>0.83</cell><cell>0.83</cell><cell>0.83</cell></row><row><cell>DNN+ Emb</cell><cell>Offensive Not Offensive Weighted Avg</cell><cell>0.79 0.70 0.75</cell><cell>0.65 0.82 0.74</cell><cell>0.72 0.76 0.74</cell><cell>0.84 0.91 0.87</cell><cell>0.92 0.82 0.87</cell><cell>0.88 0.86 0.87</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.78</cell><cell>0.61</cell><cell>0.68</cell><cell>0.95</cell><cell>0.84</cell><cell>0.89</cell></row><row><cell>CNN</cell><cell>Not Offensive</cell><cell>0.68</cell><cell>0.83</cell><cell>0.74</cell><cell>0.85</cell><cell>0.95</cell><cell>0.90</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.73</cell><cell>0.72</cell><cell>0.71</cell><cell>0.90</cell><cell>0.89</cell><cell>0.89</cell></row><row><cell>Bi-LSTM</cell><cell>Offensive Not Offensive Weighted Avg</cell><cell>0.77 0.64 0.70</cell><cell>0.52 0.84 0.68</cell><cell>0.62 0.72 0.67</cell><cell>0.92 0.79 0.86</cell><cell>0.76 0.93 0.84</cell><cell>0.83 0.86 0.84</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Results of Transfer Learning based models on Malayalam and Tamil code-mixed validation data set</figDesc><table><row><cell cols="2">Model Class</cell><cell cols="3">Malayalam Code-Mixed</cell><cell cols="3">Tamil Code-Mixed</cell></row><row><cell></cell><cell></cell><cell cols="6">Precision Recall F1-score Precision Recall F1-score</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.75</cell><cell>0.73</cell><cell>0.74</cell><cell>0.88</cell><cell>0.92</cell><cell>0.90</cell></row><row><cell>BERT</cell><cell>Not Offensive</cell><cell>0.74</cell><cell>0.75</cell><cell>0.74</cell><cell>0.92</cell><cell>0.87</cell><cell>0.89</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.74</cell><cell>0.74</cell><cell>0.74</cell><cell>0.90</cell><cell>0.90</cell><cell>0.90</cell></row><row><cell>In-</cell><cell>Offensive</cell><cell>0.71</cell><cell>0.69</cell><cell>0.70</cell><cell>0.78</cell><cell>0.82</cell><cell>0.80</cell></row><row><cell>dic</cell><cell>Not Offensive</cell><cell>0.70</cell><cell>0.72</cell><cell>0.71</cell><cell>0.81</cell><cell>0.76</cell><cell>0.78</cell></row><row><cell>BERT</cell><cell>Weighted Avg</cell><cell>0.71</cell><cell>0.71</cell><cell>0.71</cell><cell>0.79</cell><cell>0.79</cell><cell>0.79</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.82</cell><cell>0.72</cell><cell>0.77</cell><cell>0.92</cell><cell>0.91</cell><cell>0.92</cell></row><row><cell>MuRIL</cell><cell>Not Offensive</cell><cell>0.75</cell><cell>0.84</cell><cell>0.79</cell><cell>0.91</cell><cell>0.92</cell><cell>0.91</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.79</cell><cell>0.78</cell><cell>0.78</cell><cell>0.91</cell><cell>0.91</cell><cell>0.91</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Test Data Prediction Results on selected models</figDesc><table><row><cell cols="2">Model Class</cell><cell cols="3">Malayalam Code-Mixed</cell><cell cols="3">Tamil Code-Mixed</cell></row><row><cell></cell><cell></cell><cell cols="6">Precision Recall F1-score Precision Recall F1-score</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.51</cell><cell>0.65</cell><cell>0.57</cell><cell>0.52</cell><cell>0.56</cell><cell>0.54</cell></row><row><cell>LR</cell><cell>Not Offensive</cell><cell>0.81</cell><cell>0.70</cell><cell>0.75</cell><cell>0.70</cell><cell>0.66</cell><cell>0.68</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.71</cell><cell>0.69</cell><cell>0.69</cell><cell>0.62</cell><cell>0.62</cell><cell>0.62</cell></row><row><cell>DNN+ Emb</cell><cell>Offensive Not Offensive Weighted Avg</cell><cell>0.49 0.80 0.70</cell><cell>0.65 0.68 0.67</cell><cell>0.56 0.73 0.67</cell><cell>0.55 0.70 0.64</cell><cell>0.53 0.72 0.64</cell><cell>0.54 0.71 0.64</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.55</cell><cell>0.52</cell><cell>0.53</cell><cell>0.56</cell><cell>0.50</cell><cell>0.53</cell></row><row><cell>CNN</cell><cell>Not Offensive</cell><cell>0.78</cell><cell>0.79</cell><cell>0.78</cell><cell>0.69</cell><cell>0.75</cell><cell>0.72</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.70</cell><cell>0.70</cell><cell>0.70</cell><cell>0.64</cell><cell>0.65</cell><cell>0.64</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.56</cell><cell>0.65</cell><cell>0.60</cell><cell>0.61</cell><cell>0.84</cell><cell>0.71</cell></row><row><cell>BERT</cell><cell>Not Offensive</cell><cell>0.82</cell><cell>0.75</cell><cell>0.78</cell><cell>0.73</cell><cell>0.44</cell><cell>0.55</cell></row><row><cell></cell><cell>Weighted 
Avg</cell><cell>0.73</cell><cell>0.72</cell><cell>0.72</cell><cell>0.67</cell><cell>0.64</cell><cell>0.63</cell></row><row><cell></cell><cell>Offensive</cell><cell>0.55</cell><cell>0.60</cell><cell>0.58</cell><cell>0.58</cell><cell>0.43</cell><cell>0.50</cell></row><row><cell>MuRIL</cell><cell>Not Offensive</cell><cell>0.80</cell><cell>0.77</cell><cell>0.78</cell><cell>0.68</cell><cell>0.80</cell><cell>0.74</cell></row><row><cell></cell><cell>Weighted Avg</cell><cell>0.75</cell><cell>0.72</cell><cell>0.73</cell><cell>0.68</cell><cell>0.67</cell><cell>0.64</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/Sbhawal/HASOC-FIRE-2021-CODES</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A framework for hate speech detection using deep convolutional neural network</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Tripathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">K</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X.-Z</forename><surname>Gao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="204951" to="204962" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Offensive language identification in Dravidian code-mixed social media text</title>
		<author>
			<persName><forename type="first">S</forename><surname>Saumya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Singh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</title>
				<meeting>the First Workshop on Speech and Language Technologies for Dravidian Languages</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="36" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Overview of the HASOC track at FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Modha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar M</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Forum for Information Retrieval Evaluation</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="29" to="32" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">NITP-AI-NLP@HASOC-Dravidian-CodeMix-FIRE2020: A machine learning approach to identify offensive languages from Dravidian code-mixed text</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saumya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Singh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">FIRE (Working Notes)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="384" to="390" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">NITP-AI-NLP@HASOC-FIRE2020: Fine-tuned BERT for the hate speech and offensive content identification from social media</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saumya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Singh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">FIRE (Working Notes)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="266" to="273" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Automated hate speech detection and the problem of offensive language</title>
		<author>
			<persName><forename type="first">T</forename><surname>Davidson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Warmsley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Macy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Weber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International AAAI Conference on Web and Social Media</title>
				<meeting>the International AAAI Conference on Web and Social Media</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">11</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Deep learning for hate speech detection in tweets</title>
		<author>
			<persName><forename type="first">P</forename><surname>Badjatiya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Varma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th international conference on World Wide Web companion</title>
				<meeting>the 26th international conference on World Wide Web companion</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="759" to="760" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Jayanthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gupta</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2102.01051</idno>
		<title level="m">SJ_AJ@DravidianLangTech-EACL2021: Task-adaptive pre-training of multilingual BERT models for offensive language identification</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Hypers@DravidianLangTech-EACL2021: Offensive language identification in Dravidian code-mixed YouTube comments and posts</title>
		<author>
			<persName><forename type="first">C</forename><surname>Vasantharajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Thayasivam</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</title>
				<meeting>the First Workshop on Speech and Language Technologies for Dravidian Languages</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="195" to="202" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Priyadharshini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Jose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar M</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Kumaresan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ponnusamy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Mccrae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sherly</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.dravidianlangtech-1.17" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics</title>
				<meeting>the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics<address><addrLine>Kyiv</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="133" to="145" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Kumaresan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sakuntharaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Madasamy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Thavareesan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>B</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chinnaudayar Navaneethakrishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Mccrae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</title>
				<imprint>
			<publisher>CEUR</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Overview of the DravidianCodeMix 2021 shared task on sentiment detection in Tamil, Malayalam, and Kannada</title>
		<author>
			<persName><forename type="first">R</forename><surname>Priyadharshini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Thavareesan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chinnappa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Durairaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sherly</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Forum for Information Retrieval Evaluation, FIRE 2021</title>
				<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Findings of the shared task on troll meme classification in Tamil</title>
		<author>
			<persName><forename type="first">S</forename><surname>Suryawanshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chakravarthi</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.dravidianlangtech-1.16" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics</title>
				<meeting>the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics<address><addrLine>Kyiv</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="126" to="132" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Comparison of pre-trained embeddings to identify hate speech in Indian code-mixed text</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chakravarthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Mccrae</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICACCCN51052.2020.9362731</idno>
	</analytic>
	<monogr>
		<title level="m">2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="21" to="25" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A novel hybrid credit scoring model based on ensemble feature selection and multilayer ensemble classification</title>
		<author>
			<persName><forename type="first">D</forename><surname>Tripathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Edla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cheruku</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kuppili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Intelligence</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="371" to="394" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Multilayer convolutional neural network to filter low-quality content from Quora</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Roy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Processing Letters</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="805" to="821" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages</title>
		<author>
			<persName><forename type="first">D</forename><surname>Kakwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kunchukuttan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Golla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">N C</forename></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bhattacharyya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Khapra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of EMNLP</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">MuRIL: Multilingual representations for Indian languages</title>
		<author>
			<persName><forename type="first">S</forename><surname>Khanuja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bansal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mehtani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Khosla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Gopalan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Margam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Aggarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Nagipogu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C B</forename><surname>Gali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Subramanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Talukdar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<idno type="arXiv">arXiv:2103.10730</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">ktrain: A low-code library for augmented machine learning</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Maiya</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.10703</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Deep learning to filter SMS spam</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">102</biblScope>
			<biblScope unit="page" from="524" to="533" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
