1. Introduction

Forum for Information Retrieval Evaluation, December

Urdu Abusive Language Detection using Machine Learning

Muhammad Owais Raza

owais.leghari@hotmail.com 0 2

Qaisar Khan

qaisar.k@imail.sunway.edu.my 0 3

Ghulam Muhammad Soomro

soomrogm95@gmail.com 0 1 0 Machine Learning, NLP, Urdu Abuse Detection , Python, Logistic Regression 1 Mehran University Institute of Science, Technology and Development, Indus Hwy , Jamshoro, Sindh 76062 2 Mehran University of Engineering and Technology , Indus Hwy, Jamshoro, Sindh 76062 , Pakistan 3 Sunway University, 5 Jalan University , Bandar Sunway, 47500 Petaling Jaya, Selangor , Malaysia

2021

1 3 17

The growing popularity of user-generated material on social media has increased the quantity of offensive language used online. The tendency of user-generated material on social media is growing, giving rise to offensive language on these platforms. The offensive language negatively impacts individuals and affects society as a whole, which is why it is a dire need of time to identify vulgar remarks in languages used online. 'Urdu' is one of the many languages used on the internet that faces the same issue. Manually labeling the text as abusive on social media platforms is unattainable due to the production of a large amount of daily content. Therefore, automation (machine learning) is used to create the solution. This study uses machine learning algorithms, namely logistic regression, bagging algorithms, decision trees, and artificial neural networks (ANN), to detect abuse in the text. The F1 score is used as the primary metric, along with accuracy, precision, recall, and AUC-ROC, to measure the performance. Based on the evaluation, the bagging and logistic regression perform equally with an 83% F1 score. However, logistic regression is better for this use case because it is computationally less expensive and requires less effort than the bagging classifier.

1. Introduction

Historically, mass communication mediums were utilized under ethical and moral obligations dictated by societal standards. In this digital age, the wide acceptance of social media continues to be fueled by the prevalence of internet connection and mobile technologies, particularly smartphones and tablets [ 2 ]. The growing opportunities to express opinions online have given a high rise in hate speech and offensive language. Studies show that people may use offensive language online that affects other people's feelings [ 4 ]. The internet's secrecy has a detrimental effect on the population, encouraging obscene language, disparaging phrases, poisonous, unpleasant, and abusive language on the web, specifically social media. This derogatory content on the internet can be aggressive and harmful. It can erode people's self-esteem, inflict suicidal thoughts, and compel them to wipe out their social media existence. Due to this rise of cyberterrorism, cyberbullying, and widespread usage of derogatory language on the internet, identifying hate speech has become a critical component of anti-bullying measures for social media platforms [ 5 ]. The manual detection and removal of hate speech and undesirable information is a time-consuming process, owing to the vastness of the web and the growing number of internet users. The work gets harder considering the anonymity of online users. Hence, it is high time for technologies and approaches to rapidly detect abusive language on social media platforms and eradicate the spread of hate speech.

2021 Copyright for this paper by its authors.

Urdu is South Asia's resource-scarce language [ 6 ]. Compared to resource-rich languages such as English, a few annotated corpora are available for different NLP applications. The lack of linguistic resources such as stemmers and annotated corpora complicates and inspires study. Studying abusive language detection in Urdu [23] presents several difficulties. There is a dearth of sufficient annotated corpora and Urdu text preprocessing tools. This study presents different techniques to detect the abusive language in Urdu using different machine learning techniques and discusses the challenges and solutions.

2. Related Work

The rise in the use of social media has given birth to numerous problems, one of which is abusive behavior on social media platforms. Each platform has its policies to detect such behavior. For example, Twitter defines abusive behavior as an attempt to intimidate, harass, or silence someone's voice. Based on this definition, Twitter can classify a tweet as abusive or non-abusive. Recently, the computational linguistics community has focused on detecting abusive language and hate speech from various online social media platforms, such as Twitter [ 7 ], [ 8 ]. Early identification of many social abnormalities, such as hate speech, cyberbullying [ 9 ], [ 10 ], trolling [ 11 ], false news [ 12 ], rumor [ 13 ], fake profile identification [ 14 ], and sexism [ 15 ], has been a current trend in social media-based research. In [22] researchers have detected threatening tweets in urdu. Different researchers used different techniques to identify inappropriate text online. Researchers in [ 16 ] employ a variety of machine learning approaches that include support vector machines, decision trees, instance-based and rule-based, and algorithms from the WEKA toolkit used to identify bully-specific language patterns and built rules for automatically detecting cyberbullying content. In [ 17 ], researchers use a variety of classifiers to detect cyberbullying, including support vector machines, naive bayes, random forest, JRip, J48, k-nearest neighbors, sentence pattern extraction architecture (SPEC), and convolutional neural network (CNN). The results indicate that CNN outperforms other classifiers by over 11% in F-score. However, a challenge in this study is that there is no limitation for the language on these social media platforms. It is easier to create a machine learning model to detect abusive language for English because of resources. However, when it comes to languages like Urdu, the resources are low, and the process is laborious.

The dataset is adapted from HASOC abusive and threatening language detection in Urdu competition [20][21]. The dataset was gathered and labeled in the natural language and text processing laboratory at the center for computing research of Instituto Politécnico Nacional. The collector and annotator of the datasets are native Urdu speakers. The dataset used in the study contains two columns tweet and target with 2400 rows. Each row represents a tweet and its corresponding labels. There are two labels, 0 and 1, '0' represents a neutral text while '1' shows abusive text. Table 1 shows the distribution of the dataset on the label count. We have two labels in the dataset, abusive and nonabusive, and 1187 abusive tweets and 1213 non abusive tweets, which balances the dataset.

Label Abusive Non-Abusive Count 1187 1213 3. Dataset 4. Algorithms

The problem we are tackling in this research is a binary classification problem, and the following algorithms were used: 1. Logistic Regression 2. Decision Tree 3. Bagging classifier 4. Neural Network 4.1.

Logistic Regression

Logistic regression is a powerful technique of simulated results of a binary classification. It is used to allocate data to a discrete label, and being a classification method, it relies on probability [ 16 ]. The logistic regression function is represented by equation 1.

The prediction function is represented by equation 2 ( ) =

1 1 − ℎ ( ) = ( ) =

1 1 + − (1) (2) The value of has a unique significance; it indicates the likelihood that ℎ ( ) is 1 [ 17 ].

In this study, all the hyperparameters for the logistic regression are kept default because of getting the best results. 4.2.

Decision Tree

The decision tree is a technique for successfully supervising inductive learning through the generation of rules from data. Typically, one event might trigger two or more subsequent events, each with a distinct outcome. The decision tree's structure is top-down as a result of this characteristic. This structure resembles a flowchart. Every branch of a tree corresponds to a new decision result. The children node on each node corresponds to the corresponding attribute test. This child node's ID, generated from decision-making algorithms, is passed on [ 18 ]. 4.3.

Bagging Classifier

The bagging technique (bootstrap aggregation) generates a collection of classifiers. A bootstrapped duplicate of the original dataset is created for each classifier by randomly selecting N instances with replacement. When a new input is desired to be classified, the number of classifiers that anticipate the instance's class value is counted for every label state. The number of votes and the state with the most votes are projected to win the instance. In this study, we are using bagging for bootstrapping different logistic regression classifiers. 4.4.

Artificial Neural Network

A neural network comprises a linked group of artificial neurons that analyses data in a connectionist fashion. In general, an ANN is a self-organizing system that fine-tunes its organization in response to external or internal data that flows through the network throughout the learning process. They are typically used to describe complex connections between inputs and outputs or to deduce patterns from data. ANN has been successfully utilized in a variety of applications. For instance, ANNs have been successfully utilized in predictions, handwritten character recognition, and assessing home values [ 19 ]. The methodology of this study is shown in figure 1. The methodology for the research consists of 6 The first step is to import the respective dataset, so the abuse language dataset was imported. After importing the dataset, ambiguities are searched. Then, tokenizing takes place, removing any punctuation marks, unique characters, and numbers in the data using regular expression. Next is to remove any stopwords, which are high-frequency words with low semantic importance. The remaining

Methodology

steps: 1. Importing Dataset:

Cleaning Dataset: data is cleaned data.

Extracting Features: To extract features, TF-IDF is used. The TF-IDF algorithm (term frequency-inverse document frequency) is an enhancement to the DF technique. It is a type of statistical technique used to determine the significance of a word within a file collection. The significance of a word is related to its frequency of occurrence in the text and inversely related to its frequency of occurrence in the total document collection.

, is the rate of input in document , as shown by equation (3) is represented by equation 4 = log

| | |{ : ∈ }| , = , (3) (4) (5) Here | | is the total number of files and indicates the total number of occurrences of a word.

, is represented by equation 5. 4. Splitting Dataset:

Using test train split for creating two sets of datasets—one for training and the other for testing. The train test ratio is 75:25. That is, 75% of the dataset is used for training and 20% for evaluating results. 5.

Apply Machine Learning Algorithm:

In this step, machine learning algorithms are applied to the training set to create a classification model. The algorithms used in this study are logistic regression, decision tree, bagging ensemble classifier, and ANN.

Evaluate Machine Learning Model:

The last step is where inference is performed on 25% of the dataset to determine how well the classifiers are performed. The classification metrics used in this study are accuracy, precision, recall, F1 score, and AUC ROC.

6. Results 6.1. Evaluation Metrics

The evaluation parameter taken for this study is accuracy, precision, recall, F1 score, and AUC ROC. To decide the best-performing model, the key parameter used is the F1 score.

6.1.1. Accuracy

Accuracy is an evaluation parameter that shows the degree to which a classifier fits all classes. It is helpful since it treats all classes equally. It is measured as the fraction of correct predictions to all predictions. Equation 6 represents accuracy in mathematical terms.

in equation 6 represent true positive, is true negative, is false positive, and is false (6) (7) negative. 6.1.2. Precision 6.1.3. Recall

+ ( + + + )

( + )

The precision is determined as the fraction of Positive samples accurately classified to all positive cases classified. Precision is a measure that shows a model's predictive accuracy in classifying a sample as positive. It is determined by equation 7.

in equation 7 represent true positive and is false positive. (8) (9)

The recall is calculated as the ratio of correctly classified positive samples to all accessible, positive occurrences. The recall parameter specifies the model's ability to detect positive samples. The higher the recall, the more positive samples are discovered. Equation 8 represents mathematical representation. in equation 6 represent true positive and is false negative.

( + ) 2 ∗

∗ + 6.1.4. F1 Score: 6.1.5. AUC ROC:

The F1 score is a statistic that indicates how accurate a model is on a given dataset. It is used to assess binary classification systems that categorize examples as either positive or negative. Equation 9 represents mathematical representation.

The AUC - ROC curve is a benchmarking tool for classification problems using a variety of threshold values. The receiver operating characteristic (ROC) curve denotes the extent or measure of separability. In contrast, the area under the curve (AUC) indicates the level or degree of separability. It indicates the degree to which the model is adept at differentiating between classes. The larger the AUC, the more accurately the model predicts 1 class as '1' and 0 class as '0'. For example, the greater the AUC, the more accurate the model discriminates between abusive and non-abusive texts. 6.2.

Accuracy, Precision, and Recall

Three fundamental parameters for deciding which classifier performed best are accuracy, precision, and recall. The classifier was created using all four algorithms. Table 2 shows the accuracy, precision, and recall value for all the algorithms.

Accuracy, Precision, and Recall Evaluation Table Algorithm Decision Tree Logistic Regression Bagging Classifier

difference between logistic regression and bagging classifier, which shows the use of extra effort in bagging classifier is not worth it. The ANN was the second least performing model with 79.1%, 74.1%, and 70.5% accuracy, precision, and recall. 6.3.

F1 Score:

The main parameter used in this study is the F1 score, which is considered a reliable metric in classification tasks due to the involvement of both precision and recall. Figure 2 represents the F1 score for all the algorithms. F1 score for bagging and logistic regression are the same, 83%. Due to the extra effort being put into the bagging classifier, logistic regression is considered the better choice for the task. It provides the same F1 score with less effort.

F1 score

ANN s hm Bagging Classifer t i r oLogistic Regression g l A

Decision Tree Decision Tree F1 Score 77.1 Logistic Regression 83 76.8 77.1

The area under the curve for the receiver operation curve is an important parameter to judge the algorithm's performance. The AUC ROC values for each of the algorithm is shown in Figure 3. Looking at figure 3, we can see that bagging has the best value of 91.4, followed by logistic regression with 90.7%, which is not a significant difference compared to the extra efforts put into the bagging classifier. ANN also proved to be a good algorithm with a value of 87%. The algorithm with the lowest value of AUC ROC is the decision tree having 75.5% AUC ROC.

100 80 CO 60 R CU 40 A 20 0 75.5 90.7

The receiver operating characteristic (ROC) curve illustrates the relationship between TPR and FPR at various categorization levels. Reduce the classification threshold, and more items are classified as positive, increasing both true and false positives. Figure 4 shows the ROC curve for the decision tree, logistic regression, bagging classifier, and ANN. The more the area under the curve, the better the model. According to curves, it can be seen that the best performing models are logistic regression and bagging classifier. Although the ROC curve is not bad, compared to logistic regression and bagging classifier, it does not cut the best algorithms for this case due to the limitation of the dataset.

7. Conclusion

To eradicate the problem of the use of abusive language on social media, machine learning is employed to detect abusive remarks in Urdu tweets. The dataset used in this study was obtained from the text processing laboratory at the center for computing research of Instituto Politécnico Nacional. The detection of abusive language in Urdu text is performed as a classification task. The dataset has only two labels which makes it a binary classification problem. In order to solve this problem, the algorithm chosen were logistic regression, decision tree bagging algorithm, and ANN, which could work well on binary classification. Accuracy, precision, recall, F1 score, and AUC ROC were used as evaluation metrics. The key parameter for deciding the best algorithm was the F1 score. Based on the F1 score, logistic regression, and bagging performed equally well. However, logistic regression was chosen as the best performing model with an 83% of F1 score. ANN did not perform well due to the limitation of the dataset. All these evaluations were made on 25% of the test split.

For future work, different embedding layers will be trained on Urdu data. Different pre-trained models will be tuned for this use case to improve accuracy. After acquiring a model with better results, a chrome extension can be created to detect abusive Urdu text on social media.

8. References

[20] Amjad, Maaz, Alisa Zhila, Oxana Vitman, Sabur Butt, Hamza Imam Amjad, Grigori Sidorov, Alexander Gelbukh. "Overview of the shared task on threatening and abusive detection in Urdu at fire 2021." In CEUR Workshop Proceedings. (2021). [21] Amjad, Maaz, Alisa Zhila, Oxana Vitman, Sabur Butt, Hamza Imam Amjad, Grigori Sidorov, Alexander Gelbukh. "UrduThreat@ FIRE2021: Shared Track on abusive threat Identification in Urdu." In Forum for Information Retrieval Evaluation. (2021). [22] Amjad, Maaz, Noman Ashraf, Alisa Zhila, Grigori Sidorov, Arkaitz Zubiaga, and Alexander Gelbukh. "Threatening Language Detecting and Threatening Target Identification in Urdu Tweets." IEEE Access. (2021). [23] Maaz Amjad, Noman Ashraf, Grigori Sidorov, Alisa Zhila, Liliana Chanona-Hernandez, Alexander Gelbukh. "Automatic Abusive Language Detection in Urdu Tweets." Acta Polytechnica Hungarica. (2021).

[1]

Mubarak and

Darwish , "Arabic offensive language classification on Twitter," in Proceedings of the International Conference on Social Informatics , pp. 269 - 276 , National Research Council of Pisa, Pisa, Italy, May 2019 .

[2]

Abozinadah , Detecting Abusive Arabic Language Twitter Accounts Using a Multidimensional Analysis Model , George Mason University, Fairfax, VA , USA, 2017 .

[3] Nayel , Hamada A. , and

H. L.

Shashirekha . "DEEP at HASOC2019: A Machine Learning Framework for Hate Speech and Offensive Language Detection." FIRE (Working Notes) . 2019 .

[4]

Stapleton , "Swearing and perceptions of the speaker: a discursive approach," Journal of Pragmatics , vol. 170 , pp. 381 - 395 , 2020 .

[5] . de Gibert, O. , Perez , N. , Garc´ıa- Pablos , A. , Cuadros , M. : Hate Speech Dataset from a White Supremacy Forum . In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2) . pp. 11 - 20 . Association for Computational Linguistics, Brussels, Belgium (Oct 2018 ). https://doi.org/10.18653/v1/ W18 -5102

[6] Akhter , Muhammad Pervez , et al. "Exploring deep learning approaches for Urdu text classification in product manufacturing." Enterprise Information Systems ( 2020 ): 1 - 26 .

[7]

Watanabe ,

Bouazizi and

Ohtsuki , "Hate speech on Twitter: A pragmatic approach to collect hateful and offensive expressions and perform hate speech detection" , IEEE Access , vol. 6 , pp. 13825 - 13835 , 2018

[8]

J. M.

Schneider ,

Roller ,

Bourgonje ,

Hegele and

Rehm , "Towards the automatic classification of offensive language and related phenomena in German tweets" , Proc. 14th Conf. Natural Lang. Process. (Konvens) , pp. 95 , 2018 .

[9]

Fortuna and

Nunes , "A survey on automatic detection of hate speech in text," ACM Comput. Surv. (CSUR) , vol. 51 , no. 4 , pp. 1 - 30 , 2018 .

[10]

MacAvaney , H. -R. Yao , E.

Yang , K.

Russell , N.

Goharian , and O.

Frieder , "Hate speech detection: Challenges and solutions," PLOS One , vol. 14 , no. 8 , p. e0221152 , 2019 .

[11]

Tomaiuolo , G. Lombardo,

Mordonini ,

Cagnoni , and

Poggi , "A survey on troll detection," Future Internet , vol. 12 , no. 2 , p. 31 , 2020 .

[12]

Jiang ,

J. P.

Li ,

A. U.

Haq ,

Saboor , and

Ali , "A Novel Stacking Approach for Accurate Detection of Fake News," IEEE Access , vol. 9 , pp. 22626 - 22639 , 2021 .

[13]

Kumar ,

Singh ,

Ali ,

Pal , and

Singh , "Empirical evaluation of shallow and deep classifiers for rumor detection," in Proc. ICACM 2019 , in Advances in Computing and Intelligent Systems, in Algorithms for Intelligent Systems , 2020 , pp. 239 - 252 .

[14]

S. R.

Sahoo and

Gupta , "Real-time detection of fake account in twitter using machine-learning approach," in Proc. CICT 2019 , in Advances in computational intelligence and communication technology , in Advances in Intelligent Systems and Computing , vol. 1086 , 2021 , pp. 149 - 159 .

[15]

E. W.

Pamungkas ,

Basile , and

Patti , "Misogyny detection in twitter: a multilingual and cross-domain study," Inf . Process. Manag., vol. 57 , no. 6 , p. 102360 , 2020 .

[16] Shah , Kanish , et al. "A comparative analysis of logistic regression, random forest and KNN models for the text classification . " Augmented Human Research 5.1 ( 2020 ): 1 - 16 .

[17] Wang , Peng , et al. "Classification of proactive personality: Text mining based on weibo text and short-answer questions text." IEEE Access 8 ( 2020 ): 97370 - 97382 .

[18] Chen , Caixia, Liwei

Geng , and Sheng

Zhou . "Design and implementation of bank CRM system based on decision tree algorithm . " Neural Computing and Applications 33 .14 ( 2021 ): 8237 - 8247 .

[19] El-Mahelawi , Jamal Khamis , et al. "Tumor Classification Using Artificial Neural Networks." International Journal of Academic Engineering Research (IJAER) 4 .11 ( 2020 ).