<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Urdu Abusive Language Detection using Machine Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Owais Raza</string-name>
          <email>owais.leghari@hotmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qaisar Khan</string-name>
          <email>qaisar.k@imail.sunway.edu.my</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ghulam Muhammad Soomro</string-name>
          <email>soomrogm95@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Machine Learning, NLP, Urdu Abuse Detection</institution>
          ,
          <addr-line>Python, Logistic Regression</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mehran University Institute of Science, Technology and Development, Indus Hwy</institution>
          ,
          <addr-line>Jamshoro, Sindh 76062</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Mehran University of Engineering and Technology</institution>
          ,
          <addr-line>Indus Hwy, Jamshoro, Sindh 76062</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Sunway University, 5 Jalan University</institution>
          ,
          <addr-line>Bandar Sunway, 47500 Petaling Jaya, Selangor</addr-line>
          ,
          <country country="MY">Malaysia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>3</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>The growing popularity of user-generated content on social media has increased the quantity of offensive language used online. Offensive language negatively impacts individuals and society as a whole, so there is a pressing need to identify abusive remarks in the languages used online. Urdu is one of the many languages used on the internet that faces this issue. Manually labeling text as abusive on social media platforms is infeasible because of the large amount of content produced daily. Therefore, machine learning is used to automate the task. This study uses machine learning algorithms, namely logistic regression, bagging, decision trees, and artificial neural networks (ANN), to detect abuse in text. The F1 score is used as the primary metric, along with accuracy, precision, recall, and AUC-ROC, to measure performance. Based on the evaluation, bagging and logistic regression perform equally well, each with an F1 score of 83%. However, logistic regression is better suited to this use case because it is computationally less expensive and requires less effort than the bagging classifier.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Historically, mass communication media were used under ethical and moral obligations
dictated by societal standards. In this digital age, the wide acceptance of social media continues to be
fueled by the prevalence of internet connectivity and mobile technologies, particularly smartphones and
tablets [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The growing opportunities to express opinions online have led to a sharp rise in hate speech
and offensive language. Studies show that people may use offensive language online that hurts other
people's feelings [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The anonymity of the internet has a detrimental effect, encouraging
obscene, disparaging, toxic, and abusive language on the web,
particularly on social media. This derogatory content can be aggressive and harmful. It can
erode people's self-esteem, induce suicidal thoughts, and compel them to delete their social media
presence. Due to the rise of cyberterrorism, cyberbullying, and widespread use of derogatory
language on the internet, identifying hate speech has become a critical component of anti-bullying
measures for social media platforms [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The manual detection and removal of hate speech and
undesirable content is a time-consuming process, owing to the vastness of the web and the growing
number of internet users. The anonymity of online users makes the work harder. Hence, there is an
urgent need for technologies and approaches that rapidly detect abusive language on social media platforms
and curb the spread of hate speech.
      </p>
      <p>2021 Copyright for this paper by its authors.</p>
      <p>
        Urdu is a resource-scarce South Asian language [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Compared to resource-rich languages such as
English, few annotated corpora are available for different NLP applications. The lack of linguistic
resources such as stemmers and annotated corpora both complicates and motivates research. Studying abusive
language detection in Urdu [23] therefore presents several difficulties, chiefly the dearth of annotated
corpora and of Urdu text preprocessing tools. This study applies several machine learning techniques
to detect abusive language in Urdu and discusses the challenges and
solutions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The rise in the use of social media has given birth to numerous problems, one of which is abusive
behavior on social media platforms. Each platform has its policies to detect such behavior. For example,
Twitter defines abusive behavior as an attempt to intimidate, harass, or silence someone's voice. Based
on this definition, Twitter can classify a tweet as abusive or non-abusive. Recently, the computational
linguistics community has focused on detecting abusive language and hate speech from various online
social media platforms, such as Twitter [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Early identification of many social abnormalities, such
as hate speech, cyberbullying [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], trolling [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], false news [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], rumor [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], fake profile
identification [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and sexism [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], has been a recent trend in social media-based research. In [22],
researchers detected threatening tweets in Urdu. Researchers have used various techniques to
identify inappropriate text online. The researchers in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] employ a variety of machine learning approaches from the WEKA toolkit,
including support vector machines, decision trees, and instance-based and rule-based algorithms,
to identify bully-specific language patterns and build rules for
automatically detecting cyberbullying content. In [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], researchers use a variety of classifiers to detect
cyberbullying, including support vector machines, naive bayes, random forest, JRip, J48, k-nearest
neighbors, sentence pattern extraction architecture (SPEC), and convolutional neural network (CNN).
The results indicate that CNN outperforms the other classifiers by over 11% in F-score. However, a
challenge is that social media platforms place no restriction on the language used. It
is easier to create a machine learning model to detect abusive language in English because of the resources available.
For languages like Urdu, resources are scarce and the process is laborious.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset is adapted from the HASOC abusive and threatening language detection in Urdu
competition [20][21]. It was gathered and labeled in the natural language and text processing
laboratory at the Center for Computing Research of Instituto Politécnico Nacional. The collectors and
annotators of the dataset are native Urdu speakers. The dataset used in this study has two columns,
tweet and target, and 2400 rows. Each row holds a tweet and its label. There are
two labels: '0' denotes neutral text and '1' denotes abusive text. Table 1 shows the
label distribution: 1187 abusive tweets and 1213 non-abusive tweets, so the dataset is nearly balanced.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Distribution of labels in the dataset</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Label</th>
              <th>Count</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Abusive</td>
              <td>1187</td>
            </tr>
            <tr>
              <td>Non-Abusive</td>
              <td>1213</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>4. Algorithms</title>
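      <p>The problem tackled in this research is a binary classification problem, and the following
algorithms were used:</p>
      <list list-type="order">
        <list-item>
          <p>Logistic Regression</p>
        </list-item>
        <list-item>
          <p>Decision Tree</p>
        </list-item>
        <list-item>
          <p>Bagging classifier</p>
        </list-item>
        <list-item>
          <p>Neural Network</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-5">
      <title>4.1. Logistic Regression</title>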
      <p>
        Logistic regression is a widely used technique for modeling the outcome of a binary classification. It
assigns data to a discrete label and, as a classification method, relies on probability [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The
logistic function is given by equation 1.
      </p>
      <disp-formula id="eq-1">
        <label>(1)</label>
        <tex-math>g(z) = \frac{1}{1 + e^{-z}}</tex-math>
      </disp-formula>
      <p>The prediction function is given by equation 2.</p>
      <disp-formula id="eq-2">
        <label>(2)</label>
        <tex-math>h_\theta(x) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}</tex-math>
      </disp-formula>
      <p>
        The value of <inline-formula><tex-math>h_\theta(x)</tex-math></inline-formula> has a specific interpretation: it is the estimated probability that the label is 1 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>In this study, all hyperparameters of the logistic regression model are kept at their default values,
since these gave the best results.</p>
    </sec>
    <sec id="sec-6">
      <title>4.2. Decision Tree</title>
      <p>
        The decision tree is a supervised inductive learning technique that
generates rules from data. Typically, one event can trigger two or more subsequent events, each
with a distinct outcome, which gives the decision tree its top-down structure. This
structure resembles a flowchart. Every branch of the tree corresponds to a decision outcome, and
each child node corresponds to a test on an attribute. The child node's ID,
generated by the decision-making algorithm, is passed on [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>4.3. Bagging Classifier</title>
      <p>The bagging technique (bootstrap aggregation) builds an ensemble of classifiers. For each classifier, a bootstrap
replicate of the original dataset is created by randomly selecting N instances with
replacement. To classify a new input, each classifier predicts a label, the votes for
each label are counted, and the label with the most
votes is assigned to the instance. In this study, bagging is applied to
logistic regression classifiers, each trained on a different bootstrap sample.</p>
    </sec>
    <sec id="sec-8">
      <title>4.4. Artificial Neural Network</title>
      <p>
        A neural network comprises a linked group of artificial neurons that processes data in a connectionist
fashion. In general, an ANN is a self-organizing system that adjusts its structure in response to
external or internal data that flows through the network during the learning process. ANNs are
typically used to model complex relationships between inputs and outputs or to find patterns in
data. They have been successfully applied in a variety of areas, for instance
prediction, handwritten character recognition, and the assessment of home values [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
    </sec>
    <sec id="sec-9">
      <title>5. Methodology</title>
      <p>The methodology of this study is shown in Figure 1. It consists of six steps:</p>
      <p>1. Importing Dataset:</p>
      <p>The first step is to import the dataset; here, the abusive language dataset was imported.</p>
      <p>2. Cleaning Dataset:</p>
      <p>After importing the dataset, it is checked for ambiguities. The text is then tokenized, and
punctuation marks, special characters, and numbers are removed using regular expressions. Next,
stopwords, which are high-frequency words with low semantic importance, are removed. The remaining
data is the cleaned data.</p>
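      <p>The cleaning step (tokenizing, stripping punctuation, special characters, and numbers with a regular expression, then dropping stopwords) might look roughly like the sketch below. The function name <monospace>clean_tweet</monospace>, the regular expression, and the tiny stopword set are illustrative assumptions; the study does not list the exact pattern or stopword list it used.</p>
      <preformat>
```python
import re

# Hypothetical stopword set; a real Urdu stopword list would be much longer.
URDU_STOPWORDS = {"کا", "کی", "کے", "ہے", "اور", "میں"}

def clean_tweet(text, stopwords=URDU_STOPWORDS):
    """Tokenize a tweet and drop punctuation, symbols, digits, and stopwords."""
    # Replace every run of non-letter characters (punctuation, symbols,
    # digits, underscores) with a single space.
    letters_only = re.sub(r"[\W\d_]+", " ", text)
    tokens = letters_only.split()
    return [token for token in tokens if token not in stopwords]

# Example call on a neutral sample sentence.
tokens = clean_tweet("یہ فلم بہت اچھی ہے!!! 123")
```
      </preformat>
      <p>Note that this pattern also treats diacritics and zero-width joiners as separators, so a production pipeline would probably need an Urdu-aware normalization step first.</p>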
      <p>3. Extracting Features:</p>
      <p>To extract features, TF-IDF is used. The TF-IDF algorithm (term frequency-inverse document
frequency) is an enhancement to the DF technique. It is a statistical technique used to determine
the significance of a word within a document collection. The significance of a word is proportional to its frequency
of occurrence in the document and inversely related to its frequency of occurrence in the whole document
collection.</p>
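      <p>A minimal sketch of TF-IDF weighting on tokenized documents is given below, using the plain definitions: term frequency as a term's share of the tokens in a document, and inverse document frequency as the logarithm of the number of documents divided by the number of documents containing the term. The function name <monospace>tf_idf</monospace> and the toy documents are hypothetical. The study does not state which implementation it used, and library implementations often apply smoothing or normalization that this sketch omits.</p>
      <preformat>
```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        doc_weights = {}
        for term, count in counts.items():
            tf = count / total                         # term frequency
            idf = math.log(n_docs / doc_freq[term])    # inverse document frequency
            doc_weights[term] = tf * idf
        weights.append(doc_weights)
    return weights

# Hypothetical toy corpus of three tokenized documents.
docs = [["good", "film", "good"], ["bad", "film"], ["bad", "bad", "day"]]
weights = tf_idf(docs)
```
      </preformat>
      <p>A term that occurs in every document receives weight zero under this definition, which is the sense in which TF-IDF discounts words that are common across the collection.</p>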
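      <p>The term frequency <inline-formula><tex-math>tf_{i,j}</tex-math></inline-formula> is the rate of occurrence of term <inline-formula><tex-math>t_i</tex-math></inline-formula> in document <inline-formula><tex-math>d_j</tex-math></inline-formula>, as shown by equation (3).</p>
      <disp-formula id="eq-3">
        <label>(3)</label>
        <tex-math>tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}</tex-math>
      </disp-formula>
      <p>The inverse document frequency <inline-formula><tex-math>idf_i</tex-math></inline-formula> is given by equation (4).</p>
      <disp-formula id="eq-4">
        <label>(4)</label>
        <tex-math>idf_i = \log \frac{|D|}{|\{ j : t_i \in d_j \}|}</tex-math>
      </disp-formula>
      <p>Here <inline-formula><tex-math>|D|</tex-math></inline-formula> is the total number of documents and <inline-formula><tex-math>n_{i,j}</tex-math></inline-formula> is the number of occurrences of term <inline-formula><tex-math>t_i</tex-math></inline-formula> in document <inline-formula><tex-math>d_j</tex-math></inline-formula>. The TF-IDF weight <inline-formula><tex-math>tfidf_{i,j}</tex-math></inline-formula> is given by equation (5).</p>
      <disp-formula id="eq-5">
        <label>(5)</label>
        <tex-math>tfidf_{i,j} = tf_{i,j} \times idf_i</tex-math>
      </disp-formula>
      <p>4. Splitting Dataset:</p>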
      <p>A train-test split is used to create two datasets, one for training and the other for testing. The
train-test ratio is 75:25; that is, 75% of the dataset is used for training and 25% for evaluating results.</p>
      <p>5. Apply Machine Learning Algorithm:</p>
      <p>In this step, machine learning algorithms are applied to the training set to create a classification
model. The algorithms used in this study are logistic regression, decision tree, bagging ensemble
classifier, and ANN.</p>
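      <p>To illustrate how a 75:25 split and a bagging ensemble of logistic regression classifiers fit together, the sketch below implements both in plain Python on toy data. It is not the study's code: the gradient-descent trainer, the learning rate, the number of epochs, the ten estimators, the random seeds, and the one-feature toy dataset are all assumptions made for the example.</p>
      <preformat>
```python
import math
import random
from collections import Counter

def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle the rows and hold out test_ratio of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def fit_logistic(rows, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent; rows are (features, label)."""
    n_features = len(rows[0][0])
    theta = [0.0] * (n_features + 1)   # index 0 holds the bias term
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for x, y in rows:
            z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
            error = 1.0 / (1.0 + math.exp(-z)) - y
            grad[0] += error
            for j, v in enumerate(x):
                grad[j + 1] += error * v
        theta = [t - lr * g / len(rows) for t, g in zip(theta, grad)]
    return theta

def predict(theta, x):
    """Predict 1 when the linear score is non-negative (probability of at least 0.5)."""
    z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
    return 1 if z >= 0 else 0

def fit_bagging(rows, n_estimators=10, seed=0):
    """Train one logistic regression model per bootstrap replicate of the rows."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        sample = [rng.choice(rows) for _ in rows]   # N draws with replacement
        models.append(fit_logistic(sample))
    return models

def predict_bagging(models, x):
    """Majority vote over the predictions of all base classifiers."""
    votes = Counter(predict(theta, x) for theta in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: one numeric feature, label 1 for positive values.
rows = [([-1.0 - i / 10], 0) for i in range(20)] + [([1.0 + i / 10], 1) for i in range(20)]
train_rows, test_rows = train_test_split(rows)
models = fit_bagging(train_rows)
correct = sum(1 for x, y in test_rows if predict_bagging(models, x) == y)
accuracy = correct / len(test_rows)
```
      </preformat>
      <p>Because every base classifier must be trained separately, the ensemble costs roughly as many logistic regression fits as it has estimators, which is the extra effort weighed against a single logistic regression model.</p>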
      <p>6. Evaluate Machine Learning Model:</p>
      <p>The last step is where inference is performed on the held-out 25% of the dataset to determine how well the
classifiers perform. The classification metrics used in this study are accuracy, precision, recall,
F1 score, and AUC ROC.</p>
    </sec>
    <sec id="sec-10">
      <title>6. Results</title>
    </sec>
    <sec id="sec-11">
      <title>6.1. Evaluation Metrics</title>
      <p>The evaluation metrics used in this study are accuracy, precision, recall, F1 score, and AUC ROC.
The F1 score is the key metric used to decide the best-performing model.</p>
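      <p>As a rough illustration, accuracy, precision, recall, and F1 score can all be computed from the four confusion-matrix counts (true positives, true negatives, false positives, false negatives). The function names, the example labels, and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch, not details reported by the study.</p>
      <preformat>
```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = abusive)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Returning 0.0 for an empty denominator is a convention chosen here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical true labels and predictions for eight tweets.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```
      </preformat>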
    </sec>
    <sec id="sec-12">
      <title>6.1.1. Accuracy</title>
      <p>Accuracy measures how well a classifier performs across all classes. It is
useful when all classes are equally important. It is calculated as the ratio of correct predictions to all
predictions, as given by equation 6.</p>
      <disp-formula id="eq-6">
        <label>(6)</label>
        <tex-math>\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}</tex-math>
      </disp-formula>
      <p>In equation 6, TP is the number of true positives, TN of true negatives, FP of false positives, and FN of false
negatives.</p>
    </sec>
    <sec id="sec-12-2">
      <title>6.1.2. Precision</title>
      <p>Precision is the ratio of positive samples correctly classified to all samples
classified as positive. It shows how reliable the model is when it classifies a sample
as positive. It is given by equation 7.</p>
      <disp-formula id="eq-7">
        <label>(7)</label>
        <tex-math>\mathrm{Precision} = \frac{TP}{TP + FP}</tex-math>
      </disp-formula>
      <p>In equation 7, TP is the number of true positives and FP is the number of false positives.</p>
    </sec>
    <sec id="sec-12-3">
      <title>6.1.3. Recall</title>
      <p>Recall is the ratio of correctly classified positive samples to all actual positive
samples. It measures the model's ability to detect positive samples: the higher
the recall, the more positive samples are detected. It is given by equation 8.</p>
      <disp-formula id="eq-8">
        <label>(8)</label>
        <tex-math>\mathrm{Recall} = \frac{TP}{TP + FN}</tex-math>
      </disp-formula>
      <p>In equation 8, TP is the number of true positives and FN is the number of false negatives.</p>
    </sec>
    <sec id="sec-12-4">
      <title>6.1.4. F1 Score:</title>
      <p>The F1 score combines precision and recall into a single value, their harmonic mean, and indicates how accurate a model is on a given dataset. It is used to
assess binary classification systems that categorize examples as either positive or negative. It is given by equation 9.</p>
      <disp-formula id="eq-9">
        <label>(9)</label>
        <tex-math>F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}</tex-math>
      </disp-formula>
    </sec>
    <sec id="sec-12-5">
      <title>6.1.5. AUC ROC:</title>
      <p>The AUC-ROC curve is a performance measure for classification problems across a range of
threshold values. The receiver operating characteristic (ROC) curve shows how the classifier behaves as the
threshold varies, while the area under the curve (AUC) indicates the degree of separability, that is,
how well the model distinguishes between classes. The larger the AUC,
the better the model is at predicting class 1 as '1' and class 0 as '0'. For example, the greater the AUC,
the better the model discriminates between abusive and non-abusive texts.</p>
    </sec>
    <sec id="sec-13">
      <title>6.2. Accuracy, Precision, and Recall</title>
      <p>Accuracy, precision, and recall are three fundamental metrics for deciding which classifier performed best.
A classifier was built with each of the four algorithms. Table 2 shows the accuracy, precision,
and recall values for all the algorithms.</p>
      <p>[Table 2. Accuracy, Precision, and Recall Evaluation Table. Rows: Decision Tree, Logistic Regression, Bagging Classifier.]</p>
      <p>... difference between logistic regression and the bagging classifier, which shows that the extra
effort required by the bagging classifier is not worthwhile. The ANN was the second-lowest-performing model,
with 79.1% accuracy, 74.1% precision, and 70.5% recall.</p>
    </sec>
    <sec id="sec-14">
      <title>6.3. F1 Score:</title>
      <p>The main metric used in this study is the F1 score, which is considered a reliable metric
for classification tasks because it takes both precision and recall into account. Figure 2 shows
the F1 score for all the algorithms. The F1 scores for bagging and logistic regression are the same,
83%. Because the bagging classifier requires extra effort, logistic regression is
considered the better choice for the task: it provides the same F1 score with less effort.</p>
      <p>[Figure 2. F1 score by algorithm (Decision Tree, Logistic Regression, Bagging Classifier, ANN). Legible data labels: 77.1, 83, 76.8, 77.1.]</p>
      <sec id="sec-14-5">
        <title>6.4. AUC ROC:</title>
        <p>The area under the receiver operating characteristic curve is an important metric for judging an
algorithm's performance. The AUC ROC value for each algorithm is shown in Figure 3.
Bagging has the best value, 91.4%, followed by logistic regression with
90.7%, which is not a significant difference given the extra effort required by the bagging classifier.
ANN also proved to be a good algorithm, with a value of 87%. The algorithm with the lowest
AUC ROC is the decision tree, at 75.5%.</p>
        <p>[Figure 3. AUC ROC (%) by algorithm; vertical axis from 0 to 100.]</p>
        <p>The receiver operating characteristic (ROC) curve illustrates the relationship between the true positive rate (TPR) and the false positive rate (FPR)
at various classification thresholds. Lowering the classification threshold classifies more items as
positive, increasing both true and false positives. Figure 4 shows the ROC curves for the decision tree,
logistic regression, bagging classifier, and ANN. The larger the area under the curve, the better the
model. The curves show that the best-performing models are logistic regression and the
bagging classifier. Although the ROC curve of the ANN is not bad, it falls short of logistic regression and the bagging
classifier, so the ANN is not among the best algorithms for this case, owing to the limitations of the dataset.</p>
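        <p>The effect of lowering the threshold can be made concrete with a small sketch that sweeps the threshold over a set of predicted scores, records the resulting (FPR, TPR) points, and integrates them with the trapezoidal rule to obtain the AUC. The function names, labels, and scores below are hypothetical and are not results from the study; the sketch also assumes that both classes are present.</p>
        <preformat>
```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step.

    Assumes y_true contains at least one positive and one negative label.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]   # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical labels (1 = abusive) and classifier scores for six tweets.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)
```
        </preformat>
        <p>In this toy example eight of the nine abusive and non-abusive pairs are ranked correctly, so the area is 8/9; a classifier that ranks every abusive tweet above every non-abusive one would reach an area of 1.</p>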
      </sec>
    </sec>
    <sec id="sec-15">
      <title>7. Conclusion</title>
      <p>To address the use of abusive language on social media, machine learning is
employed to detect abusive remarks in Urdu tweets. The dataset used in this study was obtained from
the text processing laboratory at the Center for Computing Research of Instituto Politécnico Nacional.
The detection of abusive language in Urdu text is performed as a classification task. The dataset has
only two labels, which makes it a binary classification problem. To solve this problem, the
algorithms chosen were logistic regression, decision tree, bagging, and ANN, all of which
are suited to binary classification. Accuracy, precision, recall, F1 score, and AUC ROC were used as
evaluation metrics. The key metric for deciding the best algorithm was the F1 score. Based on the
F1 score, logistic regression and bagging performed equally well. However, logistic regression was
chosen as the best-performing model, with an F1 score of 83%. ANN did not perform well due to the
limitations of the dataset. All these evaluations were made on the 25% test split.</p>
      <p>For future work, different embedding layers will be trained on Urdu data, and different pre-trained
models will be fine-tuned for this use case to improve accuracy. Once a model with better results is obtained,
a Chrome extension can be created to detect abusive Urdu text on social media.</p>
    </sec>
    <sec id="sec-16">
      <title>8. References</title>
      <p>[20] Amjad, Maaz, Alisa Zhila, Oxana Vitman, Sabur Butt, Hamza Imam Amjad, Grigori
Sidorov, Alexander Gelbukh. "Overview of the shared task on threatening and abusive
detection in Urdu at FIRE 2021." In CEUR Workshop Proceedings. (2021).</p>
      <p>[21] Amjad, Maaz, Alisa Zhila, Oxana Vitman, Sabur Butt, Hamza Imam Amjad, Grigori
Sidorov, Alexander Gelbukh. "UrduThreat@ FIRE2021: Shared Track on abusive threat
Identification in Urdu." In Forum for Information Retrieval Evaluation. (2021).</p>
      <p>[22] Amjad, Maaz, Noman Ashraf, Alisa Zhila, Grigori Sidorov, Arkaitz Zubiaga, and Alexander
Gelbukh. "Threatening Language Detecting and Threatening Target Identification in Urdu
Tweets." IEEE Access. (2021).</p>
      <p>[23] Amjad, Maaz, Noman Ashraf, Grigori Sidorov, Alisa Zhila, Liliana Chanona-Hernandez,
Alexander Gelbukh. "Automatic Abusive Language Detection in Urdu Tweets." Acta
Polytechnica Hungarica. (2021).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          ,
          <article-title>"Arabic offensive language classification on Twitter,"</article-title>
          <source>in Proceedings of the International Conference on Social Informatics</source>
          , pp.
          <fpage>269</fpage>
          -
          <lpage>276</lpage>
          , National Research Council of Pisa, Pisa, Italy, May
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Abozinadah</surname>
          </string-name>
          ,
          <article-title>Detecting Abusive Arabic Language Twitter Accounts Using a Multidimensional Analysis Model</article-title>
          , George Mason University, Fairfax, VA, USA,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Nayel</surname>
            ,
            <given-names>Hamada A.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          .
          <article-title>"DEEP at HASOC2019: A Machine Learning Framework for Hate Speech and Offensive Language Detection."</article-title>
          <source>FIRE (Working Notes)</source>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Stapleton</surname>
          </string-name>
          ,
          <article-title>"Swearing and perceptions of the speaker: a discursive approach,"</article-title>
          <source>Journal of Pragmatics</source>
          , vol.
          <volume>170</volume>
          , pp.
          <fpage>381</fpage>
          -
          <lpage>395</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>de Gibert</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Pablos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuadros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Hate Speech Dataset from a White Supremacy Forum</article-title>
          .
          <source>In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)</source>
          . pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          . Association for Computational Linguistics, Brussels, Belgium (Oct
          <year>2018</year>
          ). https://doi.org/10.18653/v1/W18-5102
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Akhter</surname>
            ,
            <given-names>Muhammad Pervez</given-names>
          </string-name>
          , et al.
          <article-title>"Exploring deep learning approaches for Urdu text classification in product manufacturing."</article-title>
          <source>Enterprise Information Systems</source>
          (
          <year>2020</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bouazizi</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohtsuki</surname>
          </string-name>
          ,
          <article-title>"Hate speech on Twitter: A pragmatic approach to collect hateful and offensive expressions and perform hate speech detection"</article-title>
          ,
          <source>IEEE Access</source>
          , vol.
          <volume>6</volume>
          , pp.
          <fpage>13825</fpage>
          -
          <lpage>13835</lpage>
          ,
          <year>2018</year>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Roller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bourgonje</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hegele</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Rehm</surname>
          </string-name>
          ,
          <article-title>"Towards the automatic classification of offensive language and related phenomena in German tweets"</article-title>
          ,
          <source>Proc. 14th Conf. Natural Lang. Process. (Konvens)</source>
          , pp.
          <fpage>95</fpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <article-title>"A survey on automatic detection of hate speech in text,"</article-title>
          <source>ACM Comput. Surv. (CSUR)</source>
          , vol.
          <volume>51</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-R.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <article-title>"Hate speech detection: Challenges and solutions,"</article-title>
          <source>PLOS One</source>
          , vol.
          <volume>14</volume>
          , no.
          <issue>8</issue>
          , p.
          <fpage>e0221152</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tomaiuolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lombardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mordonini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cagnoni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <article-title>"A survey on troll detection,"</article-title>
          <source>Future Internet</source>
          , vol.
          <volume>12</volume>
          , no.
          <issue>2</issue>
          , p.
          <fpage>31</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. U.</given-names>
            <surname>Haq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saboor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <article-title>"A Novel Stacking Approach for Accurate Detection of Fake News,"</article-title>
          <source>IEEE Access</source>
          , vol.
          <volume>9</volume>
          , pp.
          <fpage>22626</fpage>
          -
          <lpage>22639</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>"Empirical evaluation of shallow and deep classifiers for rumor detection,"</article-title>
          <source>in Proc. ICACM</source>
          <year>2019</year>
          ,
          <source>in Advances in Computing and Intelligent Systems, in Algorithms for Intelligent Systems</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>239</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>"Real-time detection of fake account in twitter using machine-learning approach,"</article-title>
          <source>in Proc. CICT</source>
          <year>2019</year>
          ,
          <source>in Advances in computational intelligence and communication technology</source>
          ,
          <source>in Advances in Intelligent Systems and Computing</source>
          , vol.
          <volume>1086</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E. W.</given-names>
            <surname>Pamungkas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <article-title>"Misogyny detection in twitter: a multilingual and cross-domain study,"</article-title>
          <source>Inf. Process. Manag.</source>
          , vol.
          <volume>57</volume>
          , no.
          <issue>6</issue>
          , p.
          <fpage>102360</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>Kanish</given-names>
          </string-name>
          , et al.
          <article-title>"A comparative analysis of logistic regression, random forest and KNN models for the text classification."</article-title>
          <source>Augmented Human Research</source>
          <volume>5</volume>
          .1 (
          <year>2020</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Peng</given-names>
          </string-name>
          , et al.
          <article-title>"Classification of proactive personality: Text mining based on weibo text and short-answer questions text."</article-title>
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          ):
          <fpage>97370</fpage>
          -
          <lpage>97382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Caixia</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Liwei</given-names>
            <surname>Geng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sheng</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>"Design and implementation of bank CRM system based on decision tree algorithm."</article-title>
          <source>Neural Computing and Applications</source>
          <volume>33</volume>
          .14 (
          <year>2021</year>
          ):
          <fpage>8237</fpage>
          -
          <lpage>8247</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>El-Mahelawi</surname>
            ,
            <given-names>Jamal Khamis</given-names>
          </string-name>
          , et al.
          <article-title>"Tumor Classification Using Artificial Neural Networks."</article-title>
          <source>International Journal of Academic Engineering Research (IJAER)</source>
          <volume>4</volume>
          .11 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>