<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Offensive Language Identification on Multilingual Code-Mixed Text using BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Snehaan Bhawal</string-name>
          <email>mailtosnehaan@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pradeep Kumar Roy</string-name>
          <email>pradeep.roy@iiitsurat.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhinav Kumar</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Information Technology Surat</institution>
          ,
          <addr-line>Gujarat</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kalinga Institute of Industrial Technology</institution>
          ,
          <addr-line>Odisha</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Siksha 'O' Anusandhan, Deemed to be University</institution>
          ,
          <addr-line>Bhubaneswar, Odisha</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Hate speech and offensive content detection on social media has been an active field of research for the last couple of years. For the majority of the world, consisting of non-native English speakers, informal messages are often written in code-mixed language, combining words from a native language with English text. The current study focuses on using machine learning and deep learning techniques to detect hate speech and offensive content in Malayalam and Tamil code-mixed text collected from social media. The study showed that deep learning models perform better than machine learning models; specifically, BERT-based transfer learning models performed best.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Hate Speech is generally defined as content that expresses hate or prejudice against a particular
group, ethnicity, religion, nationality or sexual orientation [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Social media platforms
host a large amount of user-generated content, and because much of it is unmoderated,
targeted hate speech against certain individuals is widespread, which has become
a very critical issue [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        Humans cannot manually moderate social media networks to read, identify, and deal with the
hateful text that these platforms generate at such high frequency, affecting users mentally.
Thus, there is a need for automation, and it has already been established that automated detection of
such content is successful to a certain extent. Davidson et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used
Logistic Regression with n-gram TF-IDF features to classify Offensive and
Non-Offensive text. At the same time, a neural network-based approach was
presented by Badjatiya et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where they used GloVe embeddings with CNNs and LSTMs to
achieve better results.
      </p>
      <p>
        However, most of the research on hate speech and offensive language
detection has been conducted predominantly for the English language [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In a country like India, home
to numerous regional languages, people have adapted to using a mix of regional and English
languages to express themselves on social media [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. The current research is done on
bilingual texts, which contain words from both languages and are written in one script, called
code-mixed text. Another way of combining words is to mix native-script writing with the
English script, which is known as script-mixed text. Script-mixed texts are far more challenging to work
with, as they require a different tokenization process compared to what is needed for English texts.
Examples of some popular code-mixed languages in India are Hinglish (Hindi and English),
Tanglish (Tamil and English), Manglish (Malayalam and English), and a mixed language of
Kannada and English [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Identifying hate speech in such code-mixed languages is much more challenging than in
English [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] due to the absence of sufficient NLP resources. Models trained on a
monolingual corpus may find it difficult to provide satisfactory results. This is because the
system learns and recognizes only the words present in the given vocabulary during training. In
the case of code-mixed text, many new words will be introduced that are not present in
the training vocabulary. These words are then marked as out-of-vocabulary tokens that make no
difference in the estimation of the model. Thus, the performance of the model decreases.
      </p>
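      <p>This out-of-vocabulary behaviour can be illustrated with a minimal sketch; the vocabulary and tokens below are invented for illustration and are not from the dataset:</p>

```python
# Toy illustration of the out-of-vocabulary (OOV) problem described above:
# every word unseen during training collapses to a single <UNK> index, so it
# carries no signal for the model. Tokens here are invented for illustration.
vocab = {"<UNK>": 0, "movie": 1, "super": 2, "bad": 3}

def encode(tokens):
    # Unknown words all map to the same <UNK> id
    return [vocab.get(t, vocab["<UNK>"]) for t in tokens]

# "padam" (a romanized Tamil word) was never seen in training, so it maps to
# <UNK> and is indistinguishable from any other unseen word.
print(encode(["super", "padam"]))  # [2, 0]
print(encode(["bad", "padam"]))    # [3, 0]
```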
      <p>
        The current study focuses on offensive language identification in the code-mixed languages
Tanglish and Manglish, using the dataset provided in the HASOC-Dravidian-CodeMix-FIRE2021
challenge. An overview of the dataset can be found in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We have implemented a number
of machine learning and deep learning models, including transfer learning models like BERT,
to distinguish between offensive and non-offensive text.
      </p>
      <p>The rest of the article is organized as follows: Section 2 discusses related work. Sections
3, 3.1 and 4 provide the task description and the pre-processing steps taken, followed by an explanation
of the proposed methodology. The experimental results and discussion are presented in Sections
5 and 6, respectively. Section 7 concludes the work by highlighting the limitations and future
scope.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <p>
        The use of hate speech and offensive language has become one of the major issues concerning
social networking platforms and has hence received considerable attention from researchers
worldwide [
        <xref ref-type="bibr" rid="ref1 ref2 ref4 ref8 ref9">1, 2, 4, 8, 9</xref>
        ]. Roy et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] developed a deep learning-based framework to address
the hate speech issue on Twitter. They used a Convolutional Neural Network to process the
tweets and predict whether they were hateful or not. They considered only tweets written
in the English language, and hence their system cannot handle multilingual texts such as
Tamil-English, Kannada-English and others. Badjatiya et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] developed a deep learning model to
classify tweets into racist, sexist or neither categories. Their model was evaluated on 16k
labelled samples and outperformed existing models. The main issue with the existing work is
language coverage: most existing studies use English datasets. However,
people currently prefer to post messages on social platforms in code-mixed languages
like Hindi-English, Tamil-English and others.
      </p>
      <p>
        Recent work by Kumar et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] suggested a deep learning-based framework to classify
Tamil and Malayalam code-mixed YouTube comments into offensive and non-offensive
categories. Many machine learning and deep learning models were experimented with. The best result was
obtained when character n-gram TF-IDF features were passed to a dense neural network. Their
model achieved a weighted F1-score of 0.95. Suryawanshi et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] developed
resources for Tamil meme detection. The developed dataset consisted of two labels: troll and
not_troll. A total of ten models were submitted, and the model with an F1-score of 0.55
secured the first rank among them. Banerjee et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] compared the performance of
pre-trained models on a Hinglish code-mixed dataset for predicting hateful and non-hateful
posts.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task and Data Description</title>
      <p>The current study implements and compares different machine learning and deep learning
models for a hate speech and offensive language detection system for Tamil and Malayalam
code-mixed texts written in Roman script. The dataset consists of sentences collected from comments or posts
on social media. Table 1 shows an overview of the data used in this analysis. There are two
sets of data, Malayalam code-mixed and Tamil code-mixed, each consisting of code-mixed
sentences, with various emojis added in most cases.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Preprocessing</title>
        <p>As the data was code-mixed, with Malayalam or Tamil mixed with English, no stop-word
removal was done. The text, being informal in nature, contained emojis and emoticons, which
were replaced with their respective textual meanings using data from the Unicode Consortium’s
emoji code repository via the demoji library. This was followed by the removal of
punctuation, URLs, email IDs, hyperlinks and numeric data from the text.</p>
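        <p>A minimal sketch of this cleaning pipeline is shown below. The paper uses the demoji library for the emoji-to-text step; here a tiny hand-rolled mapping stands in for it so the example stays self-contained (the mapping and the clean_text helper are illustrative, not the authors' exact code):</p>

```python
import re

# Stand-in for demoji: a tiny emoji -> textual-meaning map (illustrative only).
EMOJI_MAP = {"\U0001F600": " grinning face ", "\U0001F44D": " thumbs up "}

def clean_text(text: str) -> str:
    # Replace emojis with their textual meaning (demoji in the actual study)
    for emoji, desc in EMOJI_MAP.items():
        text = text.replace(emoji, desc)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs / hyperlinks
    text = re.sub(r"\S+@\S+\.\S+", " ", text)           # email IDs
    text = re.sub(r"\d+", " ", text)                    # numeric data
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

print(clean_text("Super movie \U0001F44D visit https://example.com now!!! 123"))
# -> Super movie thumbs up visit now
```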
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section discusses the working of the implemented models in detail; the code can
be found in the GitHub repository1. In our current study, three different approaches were used,
as shown in Figure 1:
i Conventional machine learning based models
ii Neural network based models
iii Transfer learning based models</p>
      <sec id="sec-4-1">
        <title>4.1. Traditional ML Models</title>
        <p>
          In traditional ML-based models, we used a 1-5 gram word TF-IDF feature set. The
extracted features were then fed to classifiers such as Logistic Regression (LR), Naive Bayes (NB),
Random Forest (RF), XGBoost (XGB), and Support Vector Machine (SVM). The performance
of these models was evaluated in terms of precision, recall, and F1-score [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The detailed
performance report of these models is provided in Section 5.
        </p>
        <p>1https://github.com/Sbhawal/HASOC-FIRE-2021-CODES</p>
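        <p>A minimal sketch of one such pipeline, assuming scikit-learn; the toy sentences and labels below are invented for illustration and are not from the HASOC dataset:</p>

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Toy code-mixed examples (invented), labelled OFF / NOT as in the shared task.
texts = ["super movie", "worst padam", "nalla movie", "waste movie"]
labels = ["NOT", "OFF", "NOT", "OFF"]

# 1-5 gram word TF-IDF features fed to an LR classifier, as described above.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 5))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)
preds = model.predict(texts)

# Per-class precision, recall and F1-score, the metrics used in the paper.
print(classification_report(labels, preds))
```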
        <p>[Figure 1: Overview of the three approaches: conventional ML models (LR, RF, NB, XGB, SVM); neural network models with an embedding layer, Conv1D and dense layers (512, 256, 128 and 64 neurons); and transfer learning models (BERT, IndicBERT, MuRIL), all applied to preprocessed code-mixed social data.]</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Neural Network based models</title>
        <p>In neural network-based models, the 1-5 gram TF-IDF features extracted while working with
the machine learning models were used again as input to a simple deep neural network
(DNN) model. This model consisted of four fully connected layers in sequential order, with 512,
256, 128 and 1 neurons in the first, second, third and fourth (output) layers. Since the classification
is between two distinct labels, only one output neuron was used. The
hidden neurons used the ReLU activation function, while the output neuron
used the sigmoid activation function, with Adam and binary cross-entropy as the
chosen optimizer and loss function, respectively.</p>
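        <p>A sketch of this DNN, shown here in Keras; the TF-IDF feature dimension (1000) is a placeholder assumption, not a value from the paper:</p>

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 1000  # placeholder TF-IDF vocabulary size (assumption)

# Four dense layers (512 / 256 / 128 / 1) over TF-IDF inputs, as described above.
model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # single neuron for binary labels
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

# Sanity check on random inputs: one probability per sample.
probs = model.predict(np.random.rand(2, n_features), verbose=0)
print(probs.shape)  # (2, 1)
```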
        <p>
          The second neural model experimented with was a Convolutional Neural Network (CNN) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The
CNN consisted of one Conv1D layer followed by a Global Max Pooling and a Dropout layer,
connected to a fully connected sequential network with two hidden layers of 128 and 64 neurons,
respectively. The activation function for the hidden neurons was ReLU, and the output
was a single neuron with sigmoid activation. An embedding layer was used as the input layer,
with the embedding dimension set to 50 and the input length set to 120. Therefore, a (120,
50) dimensional embedding matrix was given as input to the CNN. The convolutional layer
consisted of 64 filters with a kernel size of three.
        </p>
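        <p>The CNN above can be sketched in Keras as follows; the vocabulary size and dropout rate are placeholder assumptions not stated in the paper:</p>

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 5000  # placeholder vocabulary size (assumption)

model = keras.Sequential([
    layers.Input(shape=(120,)),              # input length padded to 120
    layers.Embedding(vocab_size, 50),        # (120, 50) embedding matrix
    layers.Conv1D(64, kernel_size=3, activation="relu"),  # 64 filters, kernel 3
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),                     # dropout rate assumed
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy")

probs = model.predict(np.random.randint(0, vocab_size, (2, 120)), verbose=0)
print(probs.shape)  # (2, 1)
```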
        <p>Our final neural network based model was a Bidirectional Long Short-Term Memory model
(Bi-LSTM), consisting of 256 memory units followed by Global Max Pooling and Batch
Normalization. The input layer was an embedding layer with 50 dimensions and length padded
to 120, as in the previous model. Two fully connected dense layers served as the hidden layers,
comprising 20 and 10 neurons, respectively, with ReLU as the activation function,
connected to a single output neuron with sigmoid activation.
Subsequent hyper-parameter tuning was done for the described models to find the
optimal performance by adjusting the optimizer, learning rate and embedding dimensions. Our
experiments gave the best results with a learning rate of 0.0001 and the Adam optimizer.
The embedding dimension was set to 50, as it gave the best result. Due to the binary
classification task and the overall balanced nature of the dataset, binary cross-entropy was kept
as the loss function, with the sigmoid activation function for the output neuron.</p>
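        <p>The Bi-LSTM described above can be sketched in Keras as follows; the vocabulary size is a placeholder assumption:</p>

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 5000  # placeholder vocabulary size (assumption)

model = keras.Sequential([
    layers.Input(shape=(120,)),                # length padded to 120
    layers.Embedding(vocab_size, 50),          # 50-dimensional embeddings
    layers.Bidirectional(layers.LSTM(256, return_sequences=True)),  # 256 units
    layers.GlobalMaxPooling1D(),
    layers.BatchNormalization(),
    layers.Dense(20, activation="relu"),
    layers.Dense(10, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy")

probs = model.predict(np.random.randint(0, vocab_size, (2, 120)), verbose=0)
print(probs.shape)  # (2, 1)
```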
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Transfer Learning</title>
        <p>This study implemented BERT (Bidirectional Encoder Representations from Transformers)
models to make use of their transfer learning capabilities. For these models, no
preprocessing was done. Three different variants of BERT models were studied.</p>
        <p>
          i BERT (multilingual) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
ii IndicBERT [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
iii Multilingual Representations for Indian Languages (MuRIL) [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>
          The BERT [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] multilingual model was trained on 102 languages with masked language
modelling. The case-sensitive model was chosen, as no prior data pre-processing was done
in the case of the transfer learning models. IndicBERT [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] is a multilingual ALBERT model
pretrained exclusively on a corpus of 12 major Indian languages. Compared to other such BERT-based
models, IndicBERT is smaller and has far fewer parameters.
We used the ktrain [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] library to develop the IndicBERT model. The last model that we used is
MuRIL (Multilingual Representations for Indian Languages) [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. MuRIL is a BERT model trained
on a monolingual corpus of 17 Indian languages along with their translated and transliterated
counterparts. The differentiating factor between the two is that IndicBERT
is trained only on native Indian scripts, whereas MuRIL is trained on traditional scripts
as well as their transliterated counterparts in Roman script. The benefit of this is evident in our
experiment, which deals with code-mixed data of Indian languages and English written strictly
in Roman script.
        </p>
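        <p>As a minimal sketch, a MuRIL checkpoint can be set up for binary offensive-language classification with the Hugging Face transformers library. This is an illustrative setup under assumed defaults, not the authors' exact training code (the paper mentions ktrain only for IndicBERT):</p>

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# MuRIL checkpoint published on the Hugging Face Hub.
MODEL_NAME = "google/muril-base-cased"

def build(model_name: str = MODEL_NAME):
    """Load tokenizer and a sequence-classification head with two labels
    (OFF / NOT). Downloads the pretrained weights on first call."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)
    return tokenizer, model

# tokenizer, model = build()  # note: fetches the full checkpoint from the Hub
```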
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>
This section presents the results of all the experiments done during this study, described in
Section 4. The results shown below correspond to the model predictions on the validation data
and are reported in terms of precision, recall, and F1-score for the OFF (Offensive) and NOT
(Not Offensive) classes. A model is considered the best if it reports the highest weighted averages of
precision, recall, and F1-score. The best results for each dataset are presented
in bold for each model used in this study.</p>
      <p>Traditional ML models were built using 1 to 5-gram character TF-IDF features and included
the following models: LR, RF, NB, XGB and SVM. Their results are shown in Table 2.
On the Malayalam code-mixed dataset, the LR classifier gave the best performance, with recall
and F1-score of 0.70. Similarly, on the Tamil code-mixed text, the LR classifier performed best,
reporting a precision of 0.83 with recall and F1-score of 0.82.</p>
      <p>Results of the neural network models are presented in Table 3. A simple DNN
provided the best results on the Malayalam code-mixed data, with a precision of 0.75
and recall and F1-score of 0.74. On the Tamil code-mixed data, the CNN showed the best performance,
with precision reaching 0.90 and recall and F1-score of 0.89.</p>
      <p>In Table 4, the results of the different BERT models are presented. On both the Malayalam and
Tamil data, the MuRIL model performed the best among all models. On the
Malayalam data, the precision was 0.79 with recall and F1-score of 0.78, and on the Tamil data,
precision, recall and F1-score were all 0.91, the highest among all experimented models.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>
        Among all experimented models, MuRIL, a transfer learning model, performed the best
for both Malayalam and Tamil code-mixed data. The experimental outcomes show that the
traditional machine learning models are unable to understand the context of the message and
hence may not be a good choice for this task. A simple Deep Neural Network (DNN) with
an embedding layer performed better than most of the machine learning models (Tables 2,
3). Although some machine learning results came near those of the neural network models, we were
dealing mostly with text data consisting of single sentences. For multi-sentence texts, a
neural network with the ability to hold memory, like an LSTM, would have outclassed the
machine learning models [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>As shown in Table 4, the IndicBERT model is not able to perform as well as the multilingual
BERT model. This may be because the dataset consisted of code-mixed data in the Roman
script only. If there were text written in the traditional scripts, then the multilingual BERT
model would have treated most of the tokens as unknown tokens, which would have affected
the model's performance and benefited the IndicBERT model, as it was trained on monolingual
Indian scripts. Finally, MuRIL, which was trained on a corpus of both traditional and
transliterated scripts, performed better than all the other models.</p>
      <p>The results reported above (Tables 2, 3, 4) were based on predictions over the
validation dataset. On the test data, the proposed MuRIL model achieved precision,
recall and F1-score values of 0.679, 0.673 and 0.636, respectively, for the Tamil code-mixed data, while
on the Malayalam code-mixed data, the precision, recall and F1-score values were 0.752, 0.727, and
0.734, respectively, in the best case.</p>
      <p>The models were re-run on the labelled test data, and the results obtained with the
different machine learning, neural network and transfer learning models are shown in Table 5.
As with the results on the validation data, MuRIL, a transfer learning model, produced the
best prediction outcomes in terms of weighted average precision, recall and F1-score on both
the Tanglish and Manglish test datasets.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Hate speech and offensive language detection is still a challenge for low-resource and
code-mixed languages in NLP. We implemented various machine learning, deep learning and transfer
learning models to find the most suitable model for the code-mixed Tamil and Malayalam datasets.
The reported results show that the deep learning models, specifically the pre-trained
models, outperformed the machine learning models. The MuRIL model performed the best,
reporting a weighted F1-score of 0.636 on Tamil code-mixed data. The same model provided a
weighted F1-score of 0.734 on Malayalam code-mixed data. On the test data, the BERT and MuRIL
transfer learning models both yielded almost similar outcomes. In the future, a better model
can be built with additional preprocessing steps on the dataset to achieve better prediction
accuracy.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Tripathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-Z.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>A framework for hate speech detection using deep convolutional neural network</article-title>
          ,
          <source>IEEE Access 8</source>
          (
          <year>2020</year>
          )
          <fpage>204951</fpage>
          -
          <lpage>204962</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Offensive language identification in dravidian code-mixed social media text</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar M</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Overview of the hasoc track at fire 2020: Hate speech and offensive language identification in tamil, malayalam, hindi, english and german</article-title>
          ,
          <source>in: Forum for Information Retrieval Evaluation</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Nitp-ai-nlp@hasoc-dravidian-codemix-fire2020: A machine learning approach to identify offensive languages from dravidian code-mixed text</article-title>
          ,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>384</fpage>
          -
          <lpage>390</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Nitp-ai-nlp@hasoc-fire2020: Fine tuned bert for the hate speech and offensive content identification from social media</article-title>
          ,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>266</fpage>
          -
          <lpage>273</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warmsley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Macy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <article-title>Automated hate speech detection and the problem of offensive language</article-title>
          ,
          <source>in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume
          <volume>11</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Badjatiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>Deep learning for hate speech detection in tweets</article-title>
          ,
          <source>in: Proceedings of the 26th international conference on World Wide Web companion</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>759</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jayanthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Sj_aj@dravidianlangtech-eacl2021: Task-adaptive pretraining of multilingual bert models for offensive language identification</article-title>
          ,
          <source>arXiv preprint arXiv:2102.01051</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Vasantharajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Thayasivam</surname>
          </string-name>
          ,
          <article-title>Hypers@dravidianlangtech-eacl2021: Offensive language identification in dravidian code-mixed youtube comments and posts</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>195</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar M</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>R L</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <article-title>Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics</source>
          , Kyiv,
          <year>2021</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>145</lpage>
          . URL: https://aclanthology.org/2021.dravidianlangtech-1.17.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sakuntharaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Madasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          , P. B,
          <string-name>
            <given-names>S. Chinnaudayar</given-names>
            <surname>Navaneethakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam</article-title>
          ,
          <source>in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chinnappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          , E. Sherly,
          <article-title>Overview of the DravidianCodeMix 2021 shared task on sentiment detection in Tamil, Malayalam, and Kannada</article-title>
          ,
          <source>in: Forum for Information Retrieval Evaluation, FIRE 2021, Association for Computing Machinery</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Findings of the shared task on troll meme classification in Tamil</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics</source>
          , Kyiv,
          <year>2021</year>
          , pp.
          <fpage>126</fpage>
          -
          <lpage>132</lpage>
          . URL: https://aclanthology.org/2021.dravidianlangtech-1.16.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Raja</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Comparison of pretrained embeddings to identify hate speech in Indian code-mixed text</article-title>
          ,
          <source>in: 2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>25</lpage>
          .
          doi:10.1109/ICACCCN51052.2020.9362731.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tripathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Edla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cheruku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kuppili</surname>
          </string-name>
          ,
          <article-title>A novel hybrid credit scoring model based on ensemble feature selection and multilayer ensemble classification</article-title>
          ,
          <source>Computational Intelligence</source>
          <volume>35</volume>
          (
          <year>2019</year>
          )
          <fpage>371</fpage>
          -
          <lpage>394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>Multilayer convolutional neural network to filter low quality content from Quora</article-title>
          ,
          <source>Neural Processing Letters</source>
          <volume>52</volume>
          (
          <year>2020</year>
          )
          <fpage>805</fpage>
          -
          <lpage>821</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kakwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kunchukuttan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Golla</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. N.C.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Khapra</surname>
          </string-name>
          , P. Kumar,
          <article-title>IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages</article-title>
          ,
          <source>in: Findings of EMNLP</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khanuja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehtani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gopalan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Margam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Nagipogu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C. B.</given-names>
            <surname>Gali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Talukdar</surname>
          </string-name>
          ,
          <article-title>MuRIL: Multilingual representations for Indian languages</article-title>
          ,
          <source>arXiv preprint arXiv:2103.10730</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Maiya</surname>
          </string-name>
          ,
          <article-title>ktrain: A low-code library for augmented machine learning</article-title>
          ,
          <source>arXiv preprint arXiv:2004.10703</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <article-title>Deep learning to filter SMS spam</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>102</volume>
          (
          <year>2020</year>
          )
          <fpage>524</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>