<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detection of Abusive Records by Analyzing the Tweets in Urdu Language Exploring Transformer Based Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sakshi Kalra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yash Bansal</string-name>
          <email>yash@pilani.bits-pilani.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashvardhan Sharma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Campus</institution>
          ,
          <addr-line>Rajasthan</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science and Information Systems, Birla Institute of Technology and Science Pilani</institution>
          ,
          <addr-line>Pilani</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>As social media platforms grow in popularity and importance, the consequences of their misuse become more severe. Numerous posts containing abusive language directed at specific users worsen users' experiences on such platforms. In this paper, we look at the task of detecting abuse in the Urdu language. We experiment with different machine learning algorithms and Transformer-based models to achieve the best results on this one-of-a-kind task of abusive language detection in Urdu. We obtained an accuracy of 0.93607 on the test dataset using soft voting over three transformer-based models: Urduhack, BERT, and XLM-Roberta.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the advent of social media, anti-social and abusive behavior has become a prominent
occurrence online. Undesirable psychological effects of abuse on individuals make it an important
societal problem of our time [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Pew Research Centre, in its latest report on online harassment
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], revealed that 40% of adults in the United States had experienced abusive behavior online, of
which 18% have faced severe forms of harassment, e.g., that of sexual nature. These statistics
stress the need for automated detection and moderation systems. Hence, a new research effort
on abusive language detection has sprung up in NLP in recent years.
      </p>
      <p>
        Online communities, social media enterprises, and technology companies are investing
heavily and encouraging research in this area by organizing tasks and workshops. One such
community is FIRE, which has been actively organizing the HASOC tasks since 2019 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The
Urdu language has more than 230 million speakers worldwide, with vast social networks and
digital media representation. This paper describes our work on Subtask A, abusive
language detection in Urdu tweets, of the Abusive and Threatening Language Detection
Task in Urdu.
      </p>
      <p>© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings. https://www.bits-pilani.ac.in/pilani/yash/profile (Y. Sharma)</p>
      <p>This is a binary classification task in which participating systems are required to
classify tweets into two classes, namely Abusive and Non-Abusive:</p>
      <p>• Abusive: This Twitter post contains any abusive content.</p>
      <p>• Non-Abusive: This Twitter post does not contain any abusive or profane content.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Techniques for abuse detection have gone through several stages of development, starting with
extensive manual feature engineering and then turning to deep learning. Early approaches
experimented with feature extraction from speech text like a bag of words or n-grams [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
lexical and linguistic features [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and user-specific features, such as age [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. With the
advent of deep learning, the trend shifted, with much work focusing on neural architectures
for abuse detection, initially with extensive use of CNNs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and then moving on to
LSTMs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Most recently, the use of pre-trained transformer-based architectures such as BERT
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] has given state-of-the-art results. Amjad et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] describe the first shared task for fake
news detection in the Urdu language. The dataset consists of news articles from five domains
with 900 annotated articles for the training and 400 annotated news articles for the testing
part. In this shared task, nine teams submitted their results, and the best performing system
achieved an F-score value of 0.90. Amjad et al. in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduced a new dataset for classifying
threatening and non-threatening language in Urdu. The dataset
comprises 3,564 tweets manually annotated by human specialists. They applied different
machine learning and deep learning models and compared three forms of
text representation. Their research reveals that an MLP classifier with the combination of
word n-gram features outperformed other classifiers. The approaches in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] have also performed well in
abusive language detection in the Urdu language.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The datasets for the tasks are provided by the organizers of HASOC ’21<sup>2</sup>, and the code is
available in the GitHub repository<sup>3</sup>. The data consists of tweets in Urdu annotated for a binary
classification task: Abusive, Non-Abusive. Abusive - This Twitter post contains any abusive
content. Non-Abusive - This Twitter post does not contain any abusive or profane content.
Table 1 lists the statistics of the dataset. Twitter's definition describes abusive behavior as
comments directed at individuals or groups to harass, intimidate, or silence someone else’s voice.
The dataset was collected and annotated in the Natural Language and Text Processing laboratory at
the Center of Computing Research of Instituto Politécnico Nacional, Mexico, by Ph.D. candidate
Maaz Amjad, a native Urdu speaker<sup>4</sup>.</p>
      <sec id="sec-3-1">
        <title>Footnotes</title>
        <p>2: https://www.Urduthreat2021.cicling.org/home</p>
        <p>3: https://github.com/Kalra-Sakshi/Abusive-HASOC.git</p>
        <p>4: https://ods.ai/competitions/urdu-hack-soc2021/data</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Techniques and Algorithms</title>
      <p>The paper describes various approaches and draws a comparison between them. The first
approach extracts n-gram features from the tweets, which are weighted according to their TF-IDF
values. Machine learning models are then trained on these features. Fig
2 shows the proposed architecture using machine learning-based techniques such as Logistic
Regression, Random Forest Classifier, and Support Vector Machine. The second approach
uses pre-trained transformer-based models and their associated tokenizers. Three pre-trained
models are used for this task:</p>
      <p>• Urduhack Roberta-Urdu-small<sup>5</sup>: Trained on news data from Urdu news resources in Pakistan.</p>
      <p>• BERT (checkpoint: bert-base-multilingual-cased<sup>6</sup>): Trained on 104 different languages.</p>
      <p>• XLM-Roberta<sup>7</sup>: Trained on 2.5TB of newly created clean Common Crawl data in 100 languages.</p>
      <p>Fig 3 shows the proposed architecture using transformer-based
techniques.</p>
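      <p>As an illustration of the first approach, the following minimal sketch computes TF-IDF weights for word n-grams in plain Python. It is not the authors' implementation: the function names <monospace>word_ngrams</monospace> and <monospace>tfidf_vectors</monospace>, the whitespace tokenization, the n-gram range of 1 to 2, and the unsmoothed weighting (relative term frequency times log(N/df)) are assumptions made for this example. Library implementations such as the one in scikit-learn apply their own smoothing and normalization, so their values will differ.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max (whitespace tokenization)."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(docs, n_min=1, n_max=2):
    """Map each document to a sparse {ngram: tf-idf weight} dictionary."""
    counts = [Counter(word_ngrams(doc, n_min, n_max)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        # Relative term frequency times unsmoothed inverse document frequency.
        vectors.append({
            gram: (count / total) * math.log(n_docs / doc_freq[gram])
            for gram, count in c.items()
        })
    return vectors


# Toy English stand-ins for tweets (the real data is in Urdu).
docs = ["you are kind", "you are rude", "you rock"]
vectors = tfidf_vectors(docs)
```
]]></preformat>
      <p>In this toy corpus, an n-gram that occurs in every document receives a weight of zero, while n-grams specific to one document receive the highest weights. The resulting sparse vectors are the kind of input a classifier such as Logistic Regression would be trained on.</p>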
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Work</title>
      <p>The primary evaluation metric for the applied machine learning and Transformer-based
models is the F1 score, and ROC AUC is the secondary evaluation metric.</p>
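      <p>For reference, the two metrics can be sketched in plain Python as follows. This is an illustrative sketch, not the evaluation script used in the competition: it assumes binary labels with 1 standing for the Abusive class, computes F1 for that positive class only (the averaging mode used by the organizers is not stated here), and computes ROC AUC as the fraction of positive/negative pairs that the scores rank correctly, counting ties as half. The function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels, hard predictions, and scores (not taken from the paper).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
```
]]></preformat>
      <p>F1 depends on the hard predictions and therefore on the decision threshold, whereas ROC AUC depends only on how the scores rank the examples, which is why the two are reported together.</p>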
      <sec id="sec-5-1">
        <title>5.1. Logistic Regression, Support Vector Classifier, Random Forest Classifier</title>
        <p>Here, we use three machine learning algorithms: Logistic Regression, Support Vector Classifier,
and Random Forest Classifier, available in the ’scikit-learn’ package. During training, a 5-fold
grid search is performed on the entire training dataset to find the best set of hyperparameters.</p>
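        <p>The idea of a 5-fold grid search can be sketched without any library as shown below. This is a simplified illustration under stated assumptions, not the configuration used in the experiments: the helpers <monospace>k_fold_indices</monospace> and <monospace>grid_search_cv</monospace> are hypothetical, the folds are contiguous and unshuffled, and the "model" is a toy threshold on a single feature. In practice, scikit-learn's grid search utilities handle fold construction and refitting.</p>
        <preformat><![CDATA[
```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def grid_search_cv(param_grid, fit, score, X, y, k=5):
    """Return (best_params, best_mean_score) over all grid combinations."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train_idx, val_idx in k_fold_indices(len(X), k):
            X_train = [X[i] for i in train_idx]
            y_train = [y[i] for i in train_idx]
            X_val = [X[i] for i in val_idx]
            y_val = [y[i] for i in val_idx]
            model = fit(X_train, y_train, **params)
            fold_scores.append(score(model, X_val, y_val))
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical stand-in model: a fixed threshold on a 1-D feature.
def fit_threshold(X_train, y_train, threshold):
    return threshold  # nothing is learned; the hyperparameter is the model


def accuracy(threshold, X_val, y_val):
    preds = [1 if x >= threshold else 0 for x in X_val]
    return sum(p == t for p, t in zip(preds, y_val)) / len(y_val)


X = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.15, 0.85]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
best_params, best_score = grid_search_cv(
    {"threshold": [0.05, 0.5, 0.95]}, fit_threshold, accuracy, X, y, k=5
)
```
]]></preformat>
        <p>Each hyperparameter combination is scored by its mean validation score over the five folds, and the combination with the highest mean is kept. Every training example is used for validation exactly once per combination.</p>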
        <sec id="sec-5-1-1">
          <title>Footnotes</title>
          <p>5: https://github.com/urduhack/urduhack</p>
          <p>6: https://huggingface.co/docs/transformers/multilingualbert</p>
          <p>7: https://huggingface.co/docs/transformers/multilingualxlm-roberta</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Transformer-Based Models</title>
        <p>For initial experimentation, pre-processing is carried out using Urduhack normalization<sup>8</sup>, but results without
normalization are significantly better. Hyperparameter tuning for the models is carried
out using Ray Tune. The Population-Based Training scheduler is used for all three models, with
the train batch size chosen from (2, 4, 8, 16). The learning rate is sampled from a log-uniform distribution between
5e-6 and 5e-5. Tables 1 and 3 list the hyperparameter descriptions. For the multilingual BERT and
Urduhack models, the number of training epochs is selected from 2, 3, and 4. Given the large size of XLM-Roberta,
its training epochs are fixed at 2. Finally, soft voting is carried out, taking the average of each model’s
output scores and predicting the target class.</p>
        <sec id="sec-5-2-1">
          <title>Footnotes</title>
          <p>8: https://docs.urduhack.com/en/stable/reference/normalization.html</p>
        </sec>
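        <p>The soft voting step can be sketched as follows. This is a minimal illustration, assuming that each model outputs one probability of the Abusive class per tweet and that the averaged probability is thresholded at 0.5; the function name <monospace>soft_vote</monospace>, the threshold, and the toy probabilities are assumptions for this example and are not taken from the experiments.</p>
        <preformat><![CDATA[
```python
def soft_vote(model_probs, threshold=0.5):
    """Average per-model positive-class probabilities and threshold the mean.

    model_probs: one list of probabilities per model, all of equal length.
    Returns (averaged_probabilities, predicted_labels).
    """
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    if any(len(probs) != n_samples for probs in model_probs):
        raise ValueError("all models must score the same number of samples")
    averaged = [
        sum(probs[i] for probs in model_probs) / n_models
        for i in range(n_samples)
    ]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Toy probabilities for three models on three tweets (illustrative only).
model_a = [0.9, 0.4, 0.2]
model_b = [0.8, 0.6, 0.1]
model_c = [0.7, 0.3, 0.4]
averaged, labels = soft_vote([model_a, model_b, model_c])
```
]]></preformat>
        <p>In the second toy tweet one model votes for the Abusive class (0.6) but the averaged probability stays below the threshold, which shows how soft voting lets two less confident models outweigh a single disagreeing one.</p>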
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Evaluations</title>
      <p>The following results are obtained on the test set made public at the end of the competition
and are described in Table 2<sup>9</sup>. All models used the best parameters obtained through a 5-fold
grid search. The submission for the competition was made using the Urduhack model with
normalization, and the results are listed in Table 4. Further, soft voting is carried out using the three
transformer-based models without normalizing the tweets, using the parameters listed
in Table 5. The results obtained on the entire test set are listed in Table 6.</p>
      <sec id="sec-6-1">
        <title>Footnotes</title>
        <p>9: https://drive.google.com/file/d/19G9ntBaDCGnf765ELctEX2ZPmbCvyy1G/view</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>This paper started with experimentation using classical machine learning models such as
Logistic Regression, SVM, and Random Forest Classifier. We then moved on to leveraging
recent advances in large-scale Transformer-based pre-trained language models. The larger
pre-trained models outperform the classical models, which nonetheless perform well. Pre-processing
performed using the UrduHack library did not necessarily yield better results, which suggests
that punctuation and diacritics carry information valuable for abuse detection. Our model
achieves an accuracy of 0.9340 on the public data with normalization of the tweets and 0.9360 without
normalization. For future work, we can try out different multilingual transformer-based models
to get a more robust model.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Online Resources</title>
      <sec id="sec-8-1">
        <title>Implementations of different pre-trained BERT models are available at</title>
        <p>• Huggingface.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E. R.</given-names>
            <surname>Munro</surname>
          </string-name>
          ,
          <article-title>The protection of children online: a brief scoping review to identify vulnerable groups</article-title>
          ,
          <source>Childhood Wellbeing Research Centre</source>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Duggan</surname>
          </string-name>
          ,
          <article-title>Online harassment 2017</article-title>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Threatening language detecting and threatening target identification in urdu tweets</article-title>
          , IEEE Access (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gaydhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Doma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kendre</surname>
          </string-name>
          , L. Bhagwat,
          <article-title>Detecting hate speech and offensive language on twitter using machine learning: An n-gram and tfidf based approach</article-title>
          ,
          <source>arXiv preprint arXiv:1809.08651</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Gitari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zuping</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <article-title>A lexicon-based approach for hate speech detection</article-title>
          ,
          <source>International Journal of Multimedia and Ubiquitous Engineering</source>
          <volume>10</volume>
          (
          <year>2015</year>
          )
          <fpage>215</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dadvar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Trieschnigg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          , F. de Jong,
          <article-title>Improving cyberbullying detection with user context</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>693</fpage>
          -
          <lpage>696</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Park</surname>
          </string-name>
          , P. Fung,
          <article-title>One-step and two-step classification for abusive language detection on twitter</article-title>
          ,
          <source>arXiv preprint arXiv:1706.01206</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bisht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bhadauria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Virmani</surname>
          </string-name>
          , et al.,
          <article-title>Detection of hate speech and offensive language in twitter data using lstm model</article-title>
          ,
          <source>in: Recent trends in image and signal processing in computer vision</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on fake news detection in urdu at fire 2020</article-title>
          , in: FIRE (Working Notes),
          <year>2020</year>
          , pp.
          <fpage>434</fpage>
          -
          <lpage>446</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Labunets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. I.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vitman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Urduthreat@FIRE2021: Shared track on abusive threat identification in urdu</article-title>
          ,
          <source>in: Forum for Information Retrieval Evaluation</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Automatic abusive language detection in urdu tweets</article-title>
          ,
          <source>Acta Polytechnica Hungarica</source>
          (
          <year>2021</year>
          )
          , ISSN 1785-8860
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>