=Paper=
{{Paper
|id=Vol-3395/T4-6
|storemode=property
|title=Emotional Threat Speech Detection in Urdu Language using BERT Variants
|pdfUrl=https://ceur-ws.org/Vol-3395/T4-6.pdf
|volume=Vol-3395
|authors=Sakshi Kalra,Kushank Maheshwari,Saransh Goel,Yashvardhan Sharma
|dblpUrl=https://dblp.org/rec/conf/fire/KalraMGS22
}}
==Emotional Threat Speech Detection in Urdu Language using BERT Variants==
Sakshi Kalra, Kushank Maheshwari, Saransh Goel and Yashvardhan Sharma

Department of CSIS, BITS Pilani, 333031, Rajasthan, INDIA

===Abstract===
Threatening speech is a particular kind of content that is usually regarded as illegal and must be isolated and curbed. Threat speech identification cannot be done manually because of the volume and speed of the data being generated; for example, over 350,000 tweets are sent per minute. Numerous studies have addressed threat speech detection in European languages, but South Asian languages with limited resources have received less attention, leaving millions of users vulnerable on social media. Around 230 million people speak Urdu as their first language worldwide. The corpus of Urdu tweets used in this work is divided into three categories: Non-Threatening, Group (targeting a group), and Individual (targeting an individual). In our approach, we fine-tuned five different pre-trained BERT models, which are transformer-based machine learning models. The results show that MuRIL outperformed all other models, achieving an F1 score of 71.6%, an accuracy of 73.8% and a ROC-AUC value of 72.9% on the test data for binary classification.

Keywords: Threat Speech, Social Media, BERT, MuRIL, Transformers model, Multi-Class Classification

===1. Introduction===
Online social media platforms have exploded in popularity over the past ten years, and their user bases are expanding at an exponential rate. Users of these platforms have the freedom to share their thoughts and the opportunity to communicate with others from various groups. However, these platforms are also used to spread, incite, promote, or justify hatred, violence, and discrimination against users based on their gender, religion, race, affiliation with particular groups, and views related to certain events or subjects (such as politics). On the one hand, this freedom has led to exchanges of ideas and fostered relationships.
On the other hand, it is exploited to spread hateful, offensive, derogatory, or obscene language against individuals and groups. Over 400 languages are listed in the SIL Ethnologue as being spoken in India; 24 of these languages have more than a million native speakers, while 114 have more than 10,000. Thus, there is a need for automated threat monitoring. Online forums, social media enterprises, and technology firms are investing heavily in threat speech detection and advancing research in this area by organising tasks and seminars. One such group is FIRE, which has been actively running the EmoThreat challenge to address the problem.

FIRE 2022: Forum for Information Retrieval Evaluation, December 9-13, 2022, India. Contact: p20180437@pilani.bits-pilani.ac.in (S. Kalra); f20180679@pilani.bits-pilani.ac.in (K. Maheshwari); f20190988@pilani.bits-pilani.ac.in (S. Goel); yash@pilani.bits-pilani.ac.in (Y. Sharma). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

EmoThreat 2022 is looking for ways to detect threats in speech without human intervention. The competition, on emotions and threat detection in Urdu, is broken up into two subtasks, and this paper tackles Task B. This is a multi-class classification task in which the aim is to classify a tweet by a user as either non-threatening, group (targeting a group), or individual (targeting an individual). We tackled the problem using five different transformer-based models, namely UrduHack, MuRIL, Multilingual-BERT, bert-base-uncased, and distilroberta. In the past, these models have shown better results than conventional machine learning algorithms in natural language processing tasks such as text classification.
The pre-trained transformer models listed above, taken from the HuggingFace library<sup>1</sup>, were fine-tuned on the Urdu dataset provided by FIRE. The code is available from the GitHub repository<sup>2</sup>.

===2. Related Work===
Several researchers have already participated in hate speech detection tasks [1], [2], [3], [4], [5], [6], [7]. Several machine learning and deep learning algorithms have been tested for automatically detecting offensive and threat speech [8]. In [9], features such as TF-IDF weightings and word embeddings are fed into machine learning algorithms like logistic regression, random forest, and support vector classifier. Both ML models and transformer-based models have been used for the Urdu language in [10]. BERT models were also used for the identification of hate speech in the Urdu language at FIRE 2021 [11].

Deep learning techniques [12] are currently growing in acceptance in a variety of disciplines, including language modelling, sentiment analysis, machine translation, and text classification. These include long short-term memories (LSTMs) [13], convolutional neural networks (CNNs) [14], recurrent neural networks (RNNs) [15], and bidirectional encoder representations from transformers (BERT) [16]. The paper [17] reports the performance of BERT across different active learning strategies in multi-class text classification, and thus indicates the use of BERT for multi-class classification in applications such as a pickup and delivery service. Another step in this direction is [18], which compares BERT against traditional machine learning text classification. Various versions of BERT have been developed depending on the application, like DocBERT [19], which is used for document classification. In these studies, BERT is reported to perform better than traditional machine learning approaches.

===3. Dataset===
The dataset for the task is provided by the organisers of EmoThreat'22<sup>3</sup>. Task B in the EmoThreat Urdu challenge is a multi-class classification task.
A statement likely to cause damage or danger is classified as "Threatening". Threatening is further divided into "Group" and "Individual". We need to categorise the sentences in the Urdu-language dataset into the following classes:
• Non-Threatening - Tweets with this label do not contain any threatening or profane content.
• Group - This label indicates that the Twitter post contains threatening content aimed at one or more groups.
• Individual - This label indicates that the Twitter post contains threatening or profane content aimed at an individual.

Table 1 shows the data statistics based on the binary labels. Table 2 shows the statistics for the multi-class labels.

Table 1: Dataset statistics on the basis of binary label data

{| class="wikitable"
! Data !! Threatening !! Non-Threatening !! Total Entries
|-
| Training Data || 1782 || 1782 || 3564
|-
| Testing Data || 308 || 627 || 935
|}

Table 2: Dataset statistics on the basis of multi-class label data

{| class="wikitable"
! Data !! Group !! Individual !! Non-Threatening !! Total Entries
|-
| Training Data || 441 || 1341 || 1782 || 3564
|-
| Testing Data || 253 || 55 || 627 || 935
|}

As the training data show, the classes Threatening and Non-Threatening have the same number of entries (1782 each), but the sub-division of Threatening into Group and Individual is uneven (441 vs. 1341 entries). A better view can be obtained from Figure 1.

Figure 1: Training set distribution in the Urdu Dataset

<sup>1</sup> https://huggingface.co/

<sup>2</sup> https://github.com/Kushank24/fknw

<sup>3</sup> https://sites.google.com/view/multi-label-emotionsfire-task/dataset?authuser=0

===4. Handling the Class Imbalance Issue===
As seen from Figure 1, the labels are imbalanced, so we split the data set in a stratified fashion. Stratification preserves the proportion of each class in the target column, so the train-test split shows the same class proportions as the data it was drawn from. Stratification therefore does not make the classes equal in size; it distributes the target (label) across the training and test sets in the same proportions as in the original dataset.
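The stratified split described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the function name `stratified_split`, the 20% held-out fraction, and the seed are assumptions made for illustration, since the paper does not state the split ratio or the exact function call. Only the class counts are taken from Table 2.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so that each class keeps roughly its original share.

    Hypothetical helper for illustration; test_fraction and seed are assumptions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):  # sorted for a deterministic class order
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # per-class share of held-out data
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Training-set class counts from Table 2.
labels = ["Group"] * 441 + ["Individual"] * 1341 + ["Non-Threatening"] * 1782

train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)

for name in sorted(train_counts):
    print(name, train_counts[name], test_counts[name])
```

Because the held-out size is computed per class, the Group class makes up roughly 12% of both parts, matching its share of the full training data; the imbalance itself is untouched and has to be handled separately.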
After stratification, we oversampled the dataset using the Imblearn library. We chose oversampling over undersampling because the training instances are few, and removing examples from the majority class would reduce them further.

===5. Proposed Techniques and Algorithms===
For many NLP-related tasks, such as fake news identification, question answering systems, machine translation, and rumour detection, transformer-based models provide state-of-the-art results. They outperform other ML methods because of their bidirectional training and improved language understanding. A transformer-based model is built in two phases: it is first pre-trained and then fine-tuned. In pre-training, the model is trained on large datasets in a single language (monolingual) or in a variety of languages (multilingual). Only the encoder part of the transformer architecture is employed to get the word embeddings. An additional output layer is added to calculate the class probabilities. The pre-trained models that have been employed are listed below:
• UrduHack<sup>4</sup> - The Urdu News Corpus was used to train Roberta-Urdu-Small. The normalisation module from urduhack was used to remove characters from other languages, such as Arabic, from the training data.
• MuRIL<sup>5</sup> - This model uses a BERT base architecture that was pre-trained on corpora from 17 Indian languages drawn from Common Crawl, Wikipedia, Dakshina, and PMINDIA.
• bert-base<sup>6</sup> - An English-language pre-trained model trained with the masked language modelling (MLM) objective.
• Multilingual-BERT<sup>7</sup> - This model is pre-trained on 104 languages. The texts are tokenized and lowercased using WordPiece, and a vocabulary with a size of 110,000 is employed. The languages with fewer resources are oversampled, whereas the languages with more Wikipedia articles are undersampled.
• Distil-BERT<sup>8</sup> - This distilled model (listed as distilroberta in the Introduction) has six layers, 82 million parameters, 768 dimensions, and 12 heads.
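The additional output layer mentioned above turns the encoder's sentence representation into one probability per class. The sketch below shows that step in plain Python, assuming a single linear layer followed by a softmax. The four-dimensional feature vector, the weights, and the bias are made-up illustrative values rather than parameters from any of the fine-tuned models, whose encoder outputs are much larger.

```python
import math

CLASSES = ["Non-Threatening", "Group", "Individual"]


def linear(features, weights, bias):
    """Compute one logit per class: dot(weights[c], features) + bias[c]."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical sentence embedding and head parameters (illustrative values only).
features = [0.5, -1.0, 0.25, 2.0]
weights = [
    [0.1, 0.0, 0.2, -0.3],   # Non-Threatening
    [0.4, -0.2, 0.0, 0.5],   # Group
    [-0.1, 0.3, 0.1, 0.0],   # Individual
]
bias = [0.0, 0.1, -0.1]

logits = linear(features, weights, bias)
probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]

for name, p in zip(CLASSES, probs):
    print(f"{name}: {p:.3f}")
print("predicted:", predicted)
```

During fine-tuning, the weights of this layer are learned together with the encoder, and the predicted class is the one with the highest probability.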
The flowchart in Figure 2 shows the overall approach and its intermediate steps.

Figure 2: Flowchart of our methodology and techniques

The following hyperparameters were used while training the models:
• Optimizer - an optimizer is a function or algorithm that modifies the model's parameters to reduce the overall loss and improve accuracy. In our implementation, we used the AdamW optimizer, which is a variant of the Adam optimizer with an improved implementation of weight decay.
• Learning Rate - a tuning parameter of the optimization algorithm that sets the step size at each iteration. In the implementation, a learning rate of 1e-5 is used.
• Number of Epochs - the number of complete passes over the training dataset. The models were trained for five epochs.
• Batch Size - the number of samples processed before the model is updated. A batch size of 3 was used during implementation.

<sup>4</sup> https://huggingface.co/urduhack/roberta-urdu-small

<sup>5</sup> https://huggingface.co/google/MuRIL-base-cased

<sup>6</sup> https://huggingface.co/bert-base-uncased

<sup>7</sup> https://huggingface.co/bert-base-multilingual-cased

<sup>8</sup> https://huggingface.co/distilroberta-base

===6. Results and Evaluations===
ROC-AUC, accuracy, and the F1-score are used to evaluate each model's performance. On binary F1 and ROC-AUC, UrduHack and MuRIL gave broadly similar results, which were better than those of the other three BERT models. The models were evaluated on the test data provided by EmoThreat with the following hyperparameters: Number of Epochs = 5, Batch Size = 3, Optimizer = AdamW, and Learning Rate = 1e-5. The results are shown separately for binary classification ("Threatening" vs. "Non-Threatening") and multi-class classification ("Individual" vs. "Group" vs. "Non-Threatening"). They are presented in Table 3 and in Figures 3 to 14. Table 3 shows the comparison of the five fine-tuned BERT models. As seen from Table 3, MuRIL performed best on the test data, while Multilingual BERT and UrduHack performed similarly.
Distil-BERT and bert-base performed the worst of all the models. Together, the ROC-AUC, F1-score, and accuracy give a complete comparison between all models. Additionally, the confusion matrix for each model shows the errors made in classification. Finally, the ROC curves for MuRIL multi-class and UrduHack multi-class are shown for comparison of the ROC values. The blank values in the table indicate that the ROC curve was not plotted for that model. As seen from the ROC curves, Individual vs. Rest differs between MuRIL and UrduHack: UrduHack is better able to classify Individual vs. Rest than MuRIL.

Table 3: Comparison of the 5 fine-tuned BERT models (test data results)

{| class="wikitable"
! rowspan="2" | Model !! colspan="3" | Binary Classification !! colspan="3" | Multi-Class Classification
|-
! Accuracy !! F1 !! ROC-AUC !! Accuracy !! F1 !! ROC-AUC
|-
| MuRIL || 73.8% || 71.6% || 72.9% || 54.4% || 32.3% || 60.9%
|-
| Multilingual BERT || 70.37% || 65.61% || 65.27% || 56.14% || 31.11% || 56.4%
|-
| UrduHack || 70.2% || 67.9% || 69.1% || 51.2% || 30.1% || 56.4%
|-
| Bert-base || 67.37% || 65.43% || 67.08% || 48.98% || 29.45% || -
|-
| Distil-Bert || 65.13% || 60.60% || 60.62% || 52.40% || 29.66% || -
|}

Figure 3: MuRIL Binary Confusion Matrix

Figure 4: MuRIL Multi-class Confusion Matrix

Figure 5: mBert Binary Confusion Matrix

Figure 6: mBert Multi-class Confusion Matrix

Figure 7: UrduHack Binary Confusion Matrix

Figure 8: UrduHack Multi-class Confusion Matrix

Figure 9: bertbase Binary Confusion Matrix

Figure 10: bertbase Multi-class Confusion Matrix

Figure 11: distilbert Binary Confusion Matrix

Figure 12: distilbert Multi-class Confusion Matrix

Figure 13: ROC curve for MuRIL Multi-class

Figure 14: ROC curve for UrduHack Multi-class

===7. Error Analysis===
As seen from the confusion matrices, the number of false positives (FP) for MuRIL in the binary task is higher than for mBERT, while the overall accuracy of MuRIL is higher than that of mBERT; to improve results, a combination of MuRIL and mBERT should therefore be tried. Similarly, for multi-class, the false-positive total for Group vs. all is lower for mBERT than for MuRIL, so a combination or an ensemble of these two may be a good model. On the other hand, the number of false negatives for UrduHack is very low compared with MuRIL and mBERT. Thus, a combination or an ensemble of all three models may prove to be better.

===8. Conclusion and Future Work===
According to the results shown above, pre-trained BERT models perform better and have a better understanding of the meaning of a sentence, making them superior learning representations. Therefore, the transfer learning strategy using pre-trained BERT models is more appropriate for identifying threat speech than standard feature extraction methods. Out of all the models, MuRIL performed the best, while mBERT and UrduHack were comparable. We were ranked first on the public leaderboard. Because the findings above show that different models outperformed the others in particular respects, an ensemble of several models can also be tested to see whether accuracy increases. To further increase accuracy, models can be trained on a larger corpus in the future: the Group and Individual classes have few data points compared with the total number of entries, so the models are not trained well on them, and increasing the number of data entries would allow the models to be trained properly. Future research on deeper transformer architectures may also be done.

===References===
[1] M. Amjad, N. Ashraf, A. Zhila, G. Sidorov, A. Zubiaga, A. Gelbukh, Threatening language detection and target identification in Urdu tweets, IEEE Access 9 (2021) 128302–128313.
[2] M. Amjad, A. Zhila, G. Sidorov, A. Labunets, S. Butt, H. I. Amjad, O. Vitman, A. Gelbukh, UrduThreat@FIRE2021: Shared track on abusive threat identification in Urdu, in: Forum for Information Retrieval Evaluation, 2021, pp. 9–11.
[3] M. Amjad, A. Zhila, G. Sidorov, A. Labunets, S. Butt, H. I. Amjad, O. Vitman, A. Gelbukh, Overview of the shared task on threatening and abusive detection in Urdu at FIRE 2021, in: FIRE (Working Notes), CEUR Workshop Proceedings, 2021.
[4] N. Ashraf, A. Rafiq, S. Butt, H. M. F. Shehzad, G. Sidorov, A. Gelbukh, YouTube based religious hate speech and extremism detection dataset with machine learning baselines, Journal of Intelligent & Fuzzy Systems (2022) 1–9.
[5] N. Ashraf, R. Mustafa, G. Sidorov, A. Gelbukh, Individual vs. group violent threats classification in online discussions, in: Companion Proceedings of the Web Conference 2020, 2020, pp. 629–633.
[6] S. Butt, M. Amjad, F. Balouchzahi, N. Ashraf, R. Sharma, G. Sidorov, A. Gelbukh, Overview of EmoThreat: Emotions and Threat Detection in Urdu at FIRE 2022, in: CEUR Workshop Proceedings, 2022.
[7] S. Butt, M. Amjad, F. Balouchzahi, N. Ashraf, R. Sharma, G. Sidorov, A. Gelbukh, EmoThreat@FIRE2022: Shared Track on Emotions and Threat Detection in Urdu, in: Forum for Information Retrieval Evaluation, FIRE 2022, Association for Computing Machinery, New York, NY, USA, 2022.
[8] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 11, 2017, pp. 512–515.
[9] S. Kalra, K. N. Inani, Y. Sharma, G. S. Chauhan, Applying transfer learning using BERT-based models for hate speech detection (2020).
[10] S. Kalra, Y. Bansal, Y. Sharma, Detection of abusive records by analyzing the tweets in Urdu language exploring transformer based models (2021).
[11] S. Kalra, M. Agrawal, Y. Sharma, Detection of threat records by analyzing the tweets in Urdu language exploring deep learning transformer-based models (2021).
[12] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep learning for hate speech detection in tweets, in: Proceedings of the 26th International Conference on World Wide Web Companion, 2017, pp. 759–760.
[13] A. Bisht, A. Singh, H. Bhadauria, J. Virmani, et al., Detection of hate speech and offensive language in Twitter data using LSTM model, in: Recent Trends in Image and Signal Processing in Computer Vision, Springer, 2020, pp. 243–264.
[14] Z. Zhang, L. Luo, Hate speech detection: A solved problem? The challenging case of long tail on Twitter, Semantic Web 10 (2019) 925–945.
[15] G. K. Pitsilis, H. Ramampiaro, H. Langseth, Effective hate-speech detection in Twitter data using recurrent neural networks, Applied Intelligence 48 (2018) 4730–4742.
[16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[17] S. Prabhu, M. Mohamed, H. Misra, Multi-class text classification using BERT-based active learning, arXiv preprint arXiv:2104.14289 (2021).
[18] S. González-Carvajal, E. C. Garrido-Merchán, Comparing BERT against traditional machine learning text classification, arXiv preprint arXiv:2005.13012 (2020).
[19] A. Adhikari, A. Ram, R. Tang, J. Lin, DocBERT: BERT for document classification, arXiv preprint arXiv:1904.08398 (2019).