Urdu Abusive Language Detection using Machine Learning
Muhammad Owais Raza 1, Qaisar Khan 2, Ghulam Muhammad Soomro 3
1 Mehran University of Engineering and Technology, Indus Hwy, Jamshoro, Sindh 76062, Pakistan
2 Sunway University, 5 Jalan University, Bandar Sunway, 47500 Petaling Jaya, Selangor, Malaysia
3 Mehran University Institute of Science, Technology and Development, Indus Hwy, Jamshoro, Sindh 76062, Pakistan


                Abstract
                The growing volume of user-generated content on social media has increased the amount
                of offensive language used online. Offensive language harms individuals and affects
                society as a whole, so there is a pressing need to identify abusive remarks in the
                languages used online. Urdu is one of the many languages used on the internet that faces
                this issue. Manually labeling text as abusive on social media platforms is impractical
                because of the large amount of content produced daily. Therefore, machine learning is
                used to automate the task. This study uses machine learning algorithms, namely logistic
                regression, a bagging classifier, decision trees, and artificial neural networks (ANN),
                to detect abuse in text. The F1 score is the primary metric, alongside accuracy,
                precision, recall, and AUC-ROC. In the evaluation, bagging and logistic regression
                perform equally, each with an 83% F1 score. However, logistic regression is better
                suited to this use case because it is computationally less expensive and requires less
                effort than the bagging classifier.

                Keywords
                Machine Learning, NLP, Urdu Abuse Detection, Python, Logistic Regression

1. Introduction
   Historically, mass communication media were used under ethical and moral obligations dictated by
societal standards. In this digital age, the wide acceptance of social media continues to be fueled by
the prevalence of internet connectivity and mobile technologies, particularly smartphones and tablets
[2]. The growing opportunities to express opinions online have given rise to hate speech and offensive
language. Studies show that people may use offensive language online that affects other people's
feelings [4]. The anonymity of the internet has a detrimental effect, encouraging obscene, disparaging,
toxic, and abusive language on the web, specifically on social media. This derogatory content can be
aggressive and harmful. It can erode people's self-esteem, induce suicidal thoughts, and compel them to
delete their social media presence. Due to the rise of cyberterrorism, cyberbullying, and the widespread
use of derogatory language on the internet, identifying hate speech has become a critical component of
anti-bullying measures for social media platforms [5]. Manual detection and removal of hate speech and
undesirable content is time-consuming, owing to the vastness of the web and the growing number of
internet users. The anonymity of online users makes the work harder. Hence, technologies and approaches
are needed to rapidly detect abusive language on social media platforms and curb the spread of hate
speech.


Forum for Information Retrieval Evaluation, December 13-17, 2021, India
EMAIL: owais.leghari@hotmail.com (A. 1); qaisar.k@imail.sunway.edu.my (A. 2); soomrogm95@gmail.com (A. 3)
ORCID: 0000-0002-3065-385X (A. 1); 0000-0001-7903-0277 (A. 2); 0000-0002-9327-9674 (A. 3)
https://github.com/owais4321/Urdu-Abusive-Language-Detection-using-Machine-Learning
             © 2021 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)
   Urdu is a resource-scarce South Asian language [6]. Compared to resource-rich languages such as
English, few annotated corpora are available for different NLP applications. The lack of linguistic
resources such as stemmers and annotated corpora both complicates and motivates research. Abusive
language detection in Urdu [23] therefore presents several difficulties, chiefly the dearth of
annotated corpora and of Urdu text preprocessing tools. This study applies several machine learning
techniques to detect abusive language in Urdu and discusses the challenges and solutions.

2. Related Work
    The rise in the use of social media has given birth to numerous problems, one of which is abusive
behavior on social media platforms. Each platform has its own policies to detect such behavior. For
example, Twitter defines abusive behavior as an attempt to intimidate, harass, or silence someone's
voice. Based on this definition, Twitter can classify a tweet as abusive or non-abusive. Recently, the
computational linguistics community has focused on detecting abusive language and hate speech on
various online social media platforms, such as Twitter [7], [8]. Early identification of social
abnormalities such as hate speech, cyberbullying [9], [10], trolling [11], false news [12], rumor [13],
fake profiles [14], and sexism [15] is a current trend in social media-based research. In [22],
researchers detected threatening tweets in Urdu. Different researchers have used different techniques
to identify inappropriate text online. Researchers in [16] employ a variety of machine learning
approaches from the WEKA toolkit, including support vector machines, decision trees, and instance-based
and rule-based algorithms, to identify bully-specific language patterns and build rules for
automatically detecting cyberbullying content. In [17], researchers use a variety of classifiers to
detect cyberbullying, including support vector machines, naive Bayes, random forest, JRip, J48,
k-nearest neighbors, sentence pattern extraction architecture (SPEC), and a convolutional neural
network (CNN). The results indicate that the CNN outperforms the other classifiers by over 11% in
F-score. However, a challenge for such work is that social media platforms place no restriction on the
language users write in. It is easier to create a machine learning model that detects abusive language
in English because resources are plentiful. For languages like Urdu, resources are scarce and the
process is laborious.

3. Dataset
   The dataset is adapted from the HASOC abusive and threatening language detection in Urdu
competition [20][21]. It was gathered and labeled in the natural language and text processing
laboratory at the center for computing research of Instituto Politécnico Nacional. The collectors and
annotators of the dataset are native Urdu speakers. The dataset used in this study contains two
columns, tweet and target, and 2400 rows. Each row represents a tweet and its corresponding label.
There are two labels: '0' represents neutral (non-abusive) text and '1' represents abusive text.
Table 1 shows the distribution of the labels. With 1187 abusive tweets and 1213 non-abusive tweets,
the dataset is nearly balanced.

Table 1
Label Count
                       Label                                                Count
                      Abusive                                               1187
                    Non-Abusive                                             1213


4. Algorithms
   The problem we are tackling in this research is a binary classification problem, and the following
algorithms were used:
   1. Logistic Regression
   2. Decision Tree
   3. Bagging classifier
   4. Neural Network

4.1.    Logistic Regression
    Logistic regression is a widely used technique for modeling the outcome of a binary classification.
It is used to assign data to a discrete label, and being a classification method, it relies on
probability [16]. The logistic (sigmoid) function is represented by equation 1.

$$ g(z) = \frac{1}{1 + e^{-z}} \tag{1} $$

   The prediction function is represented by equation 2.

$$ h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \tag{2} $$

   The value of $h_\theta(x)$ has a particular significance: it indicates the estimated likelihood that
the label of $x$ is 1 [17].
   In this study, all the hyperparameters of the logistic regression are kept at their default values
because these gave the best results.
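
   As a minimal sketch of equations 1 and 2, the snippet below evaluates the standard logistic function and the prediction function in plain Python. The weight vector `theta`, the feature vector `x`, and the 0.5 decision threshold are illustrative assumptions, not values taken from this study.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """h_theta(x) = g(theta^T x): estimated probability that the label is 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical weights and features (the first feature acts as a bias term).
theta = [-1.0, 2.0, 0.5]
x = [1.0, 0.8, 0.2]

p = predict_proba(theta, x)      # theta^T x = 0.7
label = 1 if p >= 0.5 else 0     # assumed decision threshold of 0.5
```

   Thresholding the probability at 0.5 is the usual convention for a balanced binary task; a different threshold trades precision against recall.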

4.2.    Decision Tree
    The decision tree is a supervised inductive learning technique that generates rules from data.
Typically, one event can lead to two or more subsequent events, each with a distinct outcome, which
gives the tree its top-down structure. This structure resembles a flowchart: each internal node
corresponds to a test on an attribute, each branch corresponds to an outcome of that test, and the
sample is passed on to the child node selected by that outcome [18].
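
    The flowchart-like traversal can be illustrated with a small hand-written tree. This is only a sketch: the feature names and thresholds are hypothetical and were not learned from the dataset used in this study.

```python
# Each internal node tests one feature against a threshold; leaves hold a label.
# Feature names and thresholds are hypothetical placeholders.
tree = {
    "feature": "tfidf_word_a",
    "threshold": 0.3,
    "left": {"label": 0},               # branch taken when value <= threshold
    "right": {                          # branch taken when value > threshold
        "feature": "tfidf_word_b",
        "threshold": 0.1,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, following one branch per attribute test."""
    while "label" not in node:
        value = sample.get(node["feature"], 0.0)  # absent features count as 0
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["label"]

abusive_like = {"tfidf_word_a": 0.5, "tfidf_word_b": 0.4}
neutral_like = {"tfidf_word_a": 0.1}
```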

4.3.    Bagging Classifier
   The bagging technique (bootstrap aggregation) builds a collection of classifiers. For each
classifier, a bootstrap replicate of the original dataset is created by randomly selecting N instances
with replacement. To classify a new input, each classifier predicts a class, the votes for every label
are counted, and the label with the most votes is assigned to the instance. In this study, bagging is
applied to logistic regression base classifiers, each trained on a different bootstrap replicate.
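
   The bootstrap and voting steps can be sketched as follows. To keep the example short and dependency-free, a toy one-dimensional threshold learner stands in for logistic regression; the data, the number of estimators, and the seed are all illustrative assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) instances from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_threshold(sample):
    """Toy base learner (stand-in for logistic regression): cut midway
    between the class means, assuming class 1 has the larger values."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:               # degenerate replicate: constant model
        constant = 1 if pos else 0
        return lambda x: constant
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0

def fit_bagging(data, n_estimators=11, seed=0):
    """Train one base learner per bootstrap replicate."""
    rng = random.Random(seed)
    return [fit_threshold(bootstrap_sample(data, rng)) for _ in range(n_estimators)]

def predict_bagging(models, x):
    """Majority vote over the base learners."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical (feature, label) pairs.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
models = fit_bagging(data)
```

   An odd number of estimators avoids tied votes in the binary case.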

4.4.    Artificial Neural Network
   A neural network comprises a linked group of artificial neurons that analyses data in a connectionist
fashion. In general, an ANN is a self-organizing system that fine-tunes its organization in response to
external or internal data that flows through the network throughout the learning process. They are
typically used to describe complex connections between inputs and outputs or to deduce patterns from
data. ANN has been successfully utilized in a variety of applications. For instance, ANNs have been
successfully utilized in predictions, handwritten character recognition, and assessing home values [19].

5. Methodology
The methodology of this study is shown in figure 1 and consists of six steps:
    1. Importing Dataset:
The first step is to import the abusive language dataset.
    2. Cleaning Dataset:
After importing the dataset, it is checked for ambiguities. The text is then tokenized, and
punctuation marks, special characters, and numbers are removed using regular expressions. Next,
stopwords, which are high-frequency words with low semantic importance, are removed. What remains is
the cleaned data.
    3. Extracting Features:
To extract features, TF-IDF (term frequency-inverse document frequency) is used. TF-IDF is an
enhancement to the DF technique. It is a statistical measure of the significance of a term within a
document collection. The significance of a term increases with its frequency of occurrence in a
document and decreases with the number of documents in the collection that contain it.

   $TF_{i,j}$ is the relative frequency of term $t_i$ in document $d_j$, as shown by equation 3, where
$n_{i,j}$ is the number of occurrences of $t_i$ in $d_j$.

$$ TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \tag{3} $$

   $IDF_i$ is represented by equation 4.

$$ IDF_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \tag{4} $$

   Here $|D|$ is the total number of documents and $|\{j : t_i \in d_j\}|$ is the number of documents
that contain the term $t_i$.
   $TFIDF_{i,j}$ is represented by equation 5.

$$ TFIDF_{i,j} = TF_{i,j} \times IDF_i \tag{5} $$


     4. Splitting Dataset:
    A train-test split is used to create two sets, one for training and the other for testing. The
train-test ratio is 75:25. That is, 75% of the dataset is used for training and 25% for evaluating results.
    5. Apply Machine Learning Algorithm:
    In this step, machine learning algorithms are applied to the training set to create a classification
model. The algorithms used in this study are logistic regression, decision tree, bagging ensemble
classifier, and ANN.
    6. Evaluate Machine Learning Model:
    In the last step, inference is performed on the 25% test split to determine how well the
classifiers perform. The classification metrics used in this study are accuracy, precision, recall,
F1 score, and AUC ROC.
Figure 1: Methodology Flowchart
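
Steps 2 to 4 above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used in the study: the stopword list and the toy English documents are hypothetical stand-ins for the Urdu tweets, and the TF-IDF weights follow equations 3 to 5 directly. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values differ.

```python
import math
import random
import re

# Hypothetical stopword list; a real pipeline would use an Urdu list.
STOPWORDS = {"the", "is", "a"}

def clean(text):
    """Replace digits and punctuation with spaces, tokenize, drop stopwords."""
    text = re.sub(r"[\d_]|[^\w\s]", " ", text)
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (equations 3-5)."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = []
    for doc in docs:
        row = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)            # equation 3
            idf = math.log(n_docs / doc_freq[term])    # equation 4
            row[term] = tf * idf                       # equation 5
        weights.append(row)
    return weights

def train_test_split(rows, test_ratio=0.25, seed=42):
    """Shuffle the rows and hold out test_ratio of them for evaluation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]

raw = ["The movie is great!!", "The plot is bad, bad, bad 123", "A great cast"]
docs = [clean(text) for text in raw]
weights = tf_idf(docs)

# With 2400 rows, a 75:25 split gives 1800 training and 600 test rows.
train, test = train_test_split(range(2400))
```

A term that occurs in every document receives an IDF of zero under equation 4, so it contributes nothing to the feature vector.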

6. Results
6.1. Evaluation Metrics
   The evaluation metrics used in this study are accuracy, precision, recall, F1 score, and AUC ROC.
The key metric used to decide the best-performing model is the F1 score.

6.1.1. Accuracy
   Accuracy is an evaluation metric that shows the degree to which a classifier is correct across all
classes. It is helpful since it treats all classes equally. It is measured as the fraction of correct
predictions among all predictions. Equation 6 represents accuracy in mathematical terms.

$$ \text{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \tag{6} $$

   In equation 6, $T_p$ represents true positives, $T_n$ true negatives, $F_p$ false positives, and
$F_n$ false negatives.

6.1.2. Precision
   Precision is the fraction of samples classified as positive that are actually positive. It shows how
reliable the model is when it classifies a sample as positive. It is determined by equation 7.

$$ \text{Precision} = \frac{T_p}{T_p + F_p} \tag{7} $$

   In equation 7, $T_p$ represents true positives and $F_p$ false positives.

6.1.3. Recall
   Recall is the ratio of correctly classified positive samples to all actual positive samples. It
specifies the model's ability to detect positive samples: the higher the recall, the more positive
samples are discovered. Equation 8 gives its mathematical form.

$$ \text{Recall} = \frac{T_p}{T_p + F_n} \tag{8} $$

   In equation 8, $T_p$ represents true positives and $F_n$ false negatives.


6.1.4. F1 Score:
   The F1 score is the harmonic mean of precision and recall, summarizing both in a single value. It
is used to assess binary classification systems that categorize examples as either positive or
negative. Equation 9 gives its mathematical form.

$$ F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{9} $$
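
   Equations 6 to 9 can be computed directly from the four confusion-matrix counts. The sketch below uses hypothetical counts for a 600-tweet test set; they are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # equation 6
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # equation 7
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # equation 8
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0   # equation 9
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=240, tn=260, fp=40, fn=60)
```

   The zero-denominator guards return 0.0 when a model predicts no positives or when the test set has no positives, which is one common convention.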

6.1.5. AUC ROC:
   The AUC ROC is a performance measure for classification problems evaluated across a range of
threshold values. The receiver operating characteristic (ROC) curve plots the true positive rate
against the false positive rate at each threshold, and the area under the curve (AUC) summarizes the
degree of separability, that is, how well the model differentiates between the classes. The larger the
AUC, the more reliably the model predicts class 1 as '1' and class 0 as '0'. In this study, the greater
the AUC, the better the model discriminates between abusive and non-abusive texts.
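
   One way to make the separability reading concrete is the pairwise interpretation of AUC: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below computes it in plain Python on hypothetical labels and scores.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Tied scores count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = abusive) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = auc_score(labels, scores)   # 8 of the 9 pairs are ordered correctly
```

   The quadratic loop is fine for illustration; for large test sets a rank-based computation is the usual choice.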

6.2.    Accuracy, Precision, and Recall
   Three fundamental metrics for deciding which classifier performed best are accuracy, precision,
and recall. A classifier was trained with each of the four algorithms. Table 2 shows the accuracy,
precision, and recall values for all the algorithms.

Table 2
Accuracy, Precision, and Recall Evaluation Table
        Algorithm               Accuracy (%)                Precision (%)               Recall (%)
      Decision Tree                  74.6                        71.5                       83.7
   Logistic Regression               83.6                        89.4                       77.1
    Bagging Classifier               83.6                        88.5                       78.1
           ANN                       79.1                        74.1                       70.5

According to the table, bagging and logistic regression achieve the highest accuracy, 83.6%
each. Bagging has slightly higher recall than logistic regression (78.1% versus 77.1%), while
logistic regression has slightly higher precision (89.4% versus 88.5%). The difference between
the two is small, which shows that the extra effort of the bagging classifier is not worthwhile.
The decision tree has the lowest accuracy and precision, 74.6% and 71.5%, although its recall,
83.7%, is the highest of the four models. The ANN has the second lowest accuracy, with 79.1%
accuracy, 74.1% precision, and 70.5% recall.

6.3.    F1 Score:
   The main metric used in this study is the F1 score, which is considered a reliable metric
in classification tasks because it combines precision and recall. Figure 2 shows the F1 score
for all the algorithms. The F1 scores for bagging and logistic regression are the same, 83%.
Because the bagging classifier requires extra effort, logistic regression is considered the
better choice for the task: it provides the same F1 score with less effort.
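
   As a quick check, the F1 scores of the two best models can be recomputed from the precision and recall values in Table 2 using the harmonic mean. The helper name `f1_from` is an illustrative choice.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) in percent, taken from Table 2.
table2 = {
    "Logistic Regression": (89.4, 77.1),
    "Bagging Classifier": (88.5, 78.1),
}
f1 = {name: f1_from(p, r) for name, (p, r) in table2.items()}
```

   Both values round to 83 and differ by less than 0.2 points before rounding.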


[Bar chart of F1 score values by algorithm: Decision Tree 77.1, Logistic Regression 83, Bagging Classifier 83, ANN 76.8]


Figure 2: F1 score Bar Chart

6.4.    AUC ROC:
    The area under the receiver operating characteristic curve is an important metric for judging an
algorithm's performance. The AUC ROC value for each algorithm is shown in Figure 3. Bagging has the
best value, 91.4%, followed by logistic regression with 90.7%, which is not a significant difference
considering the extra effort put into the bagging classifier. The ANN also performs well, with a value
of 87%. The algorithm with the lowest AUC ROC is the decision tree, with 75.5%.

[Bar chart of AUC ROC by algorithm: Decision Tree 75.5, Logistic Regression 90.7, Bagging Classifier 91.4, ANN 87]


   Figure 3: AUC ROC Bar Chart

6.5.    ROC Curve:
   The receiver operating characteristic (ROC) curve illustrates the relationship between the true
positive rate (TPR) and the false positive rate (FPR) at various classification thresholds. Lowering
the classification threshold classifies more items as positive, increasing both true and false
positives. Figure 4 shows the ROC curves for the decision tree, logistic regression, bagging
classifier, and ANN. The larger the area under the curve, the better the model. According to the
curves, the best performing models are logistic regression and the bagging classifier. Although the
ROC curve of the ANN is not bad, it does not match logistic regression and the bagging classifier, so
the ANN is not among the best algorithms for this case, owing to the limitation of the dataset.
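
   The threshold sweep described above can be sketched in plain Python: each distinct score is used as a threshold, the resulting (FPR, TPR) point is recorded, and the area is obtained with the trapezoidal rule. The labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold past each score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]            # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels (1 = abusive) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

points = roc_points(labels, scores)
area = trapezoid_area(points)
```

   Each step down in threshold can only add predicted positives, so the curve moves monotonically from (0, 0) to (1, 1).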


   Figure 4: ROC Chart for All the Algorithms

7. Conclusion
   To address the use of abusive language on social media, machine learning is employed to detect
abusive remarks in Urdu tweets. The dataset used in this study was obtained from the text processing
laboratory at the center for computing research of Instituto Politécnico Nacional. The detection of
abusive language in Urdu text is performed as a classification task. The dataset has only two labels,
which makes it a binary classification problem. To solve this problem, the algorithms chosen were
logistic regression, decision tree, bagging, and ANN, all of which are suited to binary
classification. Accuracy, precision, recall, F1 score, and AUC ROC were used as evaluation metrics.
The key metric for deciding the best algorithm was the F1 score. Based on the F1 score, logistic
regression and bagging performed equally well. However, logistic regression was chosen as the best
performing model, with an F1 score of 83%. The ANN did not perform well due to the limitation of the
dataset. All these evaluations were made on the 25% test split.
   For future work, different embedding layers will be trained on Urdu data. Different pre-trained
models will be tuned for this use case to improve accuracy. After acquiring a model with better results,
a Chrome extension can be created to detect abusive Urdu text on social media.

8. References
     [1] H. Mubarak and K. Darwish, "Arabic offensive language classification on Twitter," in
         Proceedings of the International Conference on Social Informatics, pp. 269–276, National
         Research Council of Pisa, Pisa, Italy, May 2019.
     [2] E. Abozinadah, Detecting Abusive Arabic Language Twitter Accounts Using a
         Multidimensional Analysis Model, George Mason University, Fairfax, VA, USA, 2017.
     [3] Nayel, Hamada A., and H. L. Shashirekha. "DEEP at HASOC2019: A Machine Learning
         Framework for Hate Speech and Offensive Language Detection." FIRE (Working Notes).
         2019.
     [4] K. Stapleton, "Swearing and perceptions of the speaker: a discursive approach," Journal of
         Pragmatics, vol. 170, pp. 381–395, 2020.
     [5] de Gibert, O., Perez, N., García-Pablos, A., Cuadros, M.: Hate Speech Dataset from a
         White Supremacy Forum. In: Proceedings of the 2nd Workshop on Abusive Language
         Online (ALW2). pp. 11–20. Association for Computational Linguistics, Brussels, Belgium
         (Oct 2018). https://doi.org/10.18653/v1/W18-5102
     [6] Akhter, Muhammad Pervez, et al. "Exploring deep learning approaches for Urdu text
         classification in product manufacturing." Enterprise Information Systems (2020): 1-26.
     [7] H. Watanabe, M. Bouazizi and T. Ohtsuki, "Hate speech on Twitter: A pragmatic approach
         to collect hateful and offensive expressions and perform hate speech detection", IEEE
         Access, vol. 6, pp. 13825-13835, 2018
     [8] J. M. Schneider, R. Roller, P. Bourgonje, S. Hegele and G. Rehm, "Towards the automatic
         classification of offensive language and related phenomena in German tweets", Proc. 14th
         Conf. Natural Lang. Process. (Konvens), pp. 95, 2018.
     [9] P. Fortuna and S. Nunes, "A survey on automatic detection of hate speech in text,"
         ACM Comput. Surv. (CSUR), vol. 51, no. 4, pp. 1-30, 2018.
    [10] S. MacAvaney, H.-R. Yao, E. Yang, K. Russell, N. Goharian, and O. Frieder, "Hate
         speech detection: Challenges and solutions," PLOS One, vol. 14, no. 8, p. e0221152,
         2019.
    [11] M. Tomaiuolo, G. Lombardo, M. Mordonini, S. Cagnoni, and A. Poggi, "A survey on
         troll detection," Future Internet, vol. 12, no. 2, p. 31, 2020.
    [12] T. Jiang, J. P. Li, A. U. Haq, A. Saboor, and A. Ali, "A Novel Stacking Approach for
         Accurate Detection of Fake News," IEEE Access, vol. 9, pp. 22626-22639, 2021.
    [13] A. Kumar, V. Singh, T. Ali, S. Pal, and J. Singh, "Empirical evaluation of shallow and
         deep classifiers for rumor detection," in Proc. ICACM 2019, in Advances in
         Computing and Intelligent Systems, in Algorithms for Intelligent Systems, 2020, pp.
         239-252.
    [14] S. R. Sahoo and B. Gupta, "Real-time detection of fake account in twitter using
         machine-learning approach," in Proc. CICT 2019, in Advances in computational
         intelligence and communication technology, in Advances in Intelligent Systems and
         Computing, vol. 1086, 2021, pp. 149-159.
    [15] E. W. Pamungkas, V. Basile, and V. Patti, "Misogyny detection in twitter: a
         multilingual and cross-domain study," Inf. Process. Manag., vol. 57, no. 6, p. 102360,
         2020.
    [16] Shah, Kanish, et al. "A comparative analysis of logistic regression, random forest and KNN
         models for the text classification." Augmented Human Research 5.1 (2020): 1-16.
    [17] Wang, Peng, et al. "Classification of proactive personality: Text mining based on weibo text
         and short-answer questions text." IEEE Access 8 (2020): 97370-97382.
    [18] Chen, Caixia, Liwei Geng, and Sheng Zhou. "Design and implementation of bank CRM
         system based on decision tree algorithm." Neural Computing and Applications 33.14 (2021):
         8237-8247.
    [19] El-Mahelawi, Jamal Khamis, et al. "Tumor Classification Using Artificial Neural
         Networks." International Journal of Academic Engineering Research (IJAER) 4.11 (2020).
    [20] Amjad, Maaz, Alisa Zhila, Oxana Vitman, Sabur Butt, Hamza Imam Amjad, Grigori
         Sidorov, Alexander Gelbukh. "Overview of the shared task on threatening and abusive
         detection in Urdu at fire 2021." In CEUR Workshop Proceedings. (2021).
    [21] Amjad, Maaz, Alisa Zhila, Oxana Vitman, Sabur Butt, Hamza Imam Amjad, Grigori
         Sidorov, Alexander Gelbukh. "UrduThreat@ FIRE2021: Shared Track on abusive threat
         Identification in Urdu." In Forum for Information Retrieval Evaluation. (2021).
    [22] Amjad, Maaz, Noman Ashraf, Alisa Zhila, Grigori Sidorov, Arkaitz Zubiaga, and Alexander
         Gelbukh. "Threatening Language Detecting and Threatening Target Identification in Urdu
         Tweets." IEEE Access. (2021).
    [23] Amjad, Maaz, Noman Ashraf, Grigori Sidorov, Alisa Zhila, Liliana Chanona-Hernandez,
         Alexander Gelbukh. "Automatic Abusive Language Detection in Urdu Tweets." Acta
         Polytechnica Hungarica. (2021).