<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>H. Akram);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Models for Urdu Fake News Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hammad Akram</string-name>
          <email>hammad.fast.nu@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khurram Shahzad</string-name>
          <email>khurram@pucit.edu.pk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, National University of Computer and Emerging Sciences</institution>
          ,
          <addr-line>Lahore</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Data Science, University of the Punjab</institution>
          ,
          <addr-line>New Campus, Lahore</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1957</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Fake news detection is recognized as a key natural language processing task. Recognizing its importance, several attempts have been made at fake news detection in Western languages. However, fake news detection in Urdu has received little attention from researchers. A key reason is the scarcity of fake news datasets for Urdu. To promote research and development in the area, a track, UrduFake'21, is dedicated for the second time to fake news detection in the Urdu language. This study proposes an ensemble of machine learning models for Urdu fake news detection. The proposed approach employs a voting scheme over the three most effective techniques to decide whether a given news article is fake or real. To evaluate the proposed approach, experiments are performed using several classical machine learning techniques, three types of features (unigram, bigram and trigram), and the released dataset. The results of the experiments reveal that the proposed approach is more effective than the individual techniques. According to the results released by the organizers, our proposed approach achieved macro average F1 and accuracy scores of 0.621 and 0.713, respectively, and was ranked 4th among the 19 submissions.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake news</kwd>
        <kwd>Urdu fake news</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Classical machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the years, news and media have become an integral part of our life. According to Statista,
the worldwide news, entertainment and media market has a market value of 2.1 trillion US
dollars [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A news article that is intentionally and verifiably false is called fake news [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Several individuals, as well as organizations, purposefully publish fake news to serve their
own purposes and interests. Typically, such news is used to discredit individuals, organizations, communities
and political parties, to undermine peace and stability in society, or to gain political mileage.
The advent of social media has also played a prominent role in making such news go viral, thus
amplifying its impact on society. Therefore, it is desirable to detect fake news and
prevent it from spreading.
      </p>
      <p>
        Recognizing the importance of the problem, several studies have been conducted on fake news
detection [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. In particular, since the 2016 presidential elections, fake news detection has
received considerable attention from researchers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Consequently, resources have been developed
for the fake news detection task in English and Chinese; however, little attention
has been paid to fake news detection in South Asian languages. In particular, few studies
have been conducted on fake news detection in Urdu, and relevant resources for
this task are scarce, even though the language has over 100 million speakers worldwide.
      </p>
      <p>
        To the best of our knowledge, [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] commenced fake news detection in Urdu. To promote
research and development in Urdu fake news detection, the Center for Computing Research (CIC),
Instituto Politécnico Nacional (IPN), Mexico, introduced a track at the 12th Forum for Information
Retrieval Evaluation 2020 (FIRE 2020) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The organizers released a dataset of fake Urdu news
containing a substantial number of news articles. In total, 42 teams participated in the competition and
9 teams submitted their final experimental results on the test dataset. The best result, a macro F1 score
of 0.900, was achieved by a BERT-based approach. This year, the second track is announced
at the 13th Forum for Information Retrieval Evaluation 2021 (FIRE 2021). The new track is
more advanced and more challenging compared to that of the preceding year, as the revised dataset has a
larger number of news articles. This study uses an ensemble of machine learning models for Urdu fake
news detection.
      </p>
      <p>The rest of the paper is organized as follows. Section 2 provides an overview of the related
work. Section 3 presents the specifications of the dataset used for the experimentation.
Section 4 presents our proposed ensembling approach and the machine learning techniques
used for the experimentation. The results of the experiments are presented in Section 5. Finally,
Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The majority of the studies on fake news detection have been conducted for resource-rich
languages, including English, Chinese and Italian, whereas little work has been
done on low-resource languages, such as Urdu [9]. Therefore, this section provides an overview
of the notable existing studies conducted on Urdu fake news detection.</p>
      <p>Table 1 provides a summary of the techniques proposed in the literature for Urdu fake news
detection. It can be observed from the table that the existing studies have used diverse techniques
for fake news detection. For instance, a study from UrduFake’20 used different variants of
XGBoost and RoBERTa [10]. The results of the experiments showed that XGBoost outperformed
the other techniques for Urdu fake news detection. Similarly, another study [11] used the XLNet
model with the autoregressive (AR) pre-training method, whereas [12] used an ensemble of machine learning
techniques along with a Multi-layer Dense Neural Network, which achieved a very high F1 score.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Urdu Fake News Corpus</title>
      <p>The UrduFake’21 track at the Forum for Information Retrieval Evaluation 2021 (FIRE 2021) has
provided a corpus of 1600 news articles. The key strength of the corpus is that it includes real
and fake news from diverse domains, namely business, health,
showbiz, sports and technology. Another notable observation is that the corpus is balanced across domains,
as it includes an equal number of news articles from each of the five domains, hence providing equal
opportunity for learning about fake news from all domains.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Summary of the techniques proposed in the literature for Urdu fake news detection.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Paper</th><th>Techniques</th></tr>
          </thead>
          <tbody>
            <tr><td>[9]</td><td>LR, RF, SVM, AdaBoost, CharCNN-RoBERTa, XLNet pre-trained model, Dense Neural Network, Bi-directional GRU model, ULMFiT model</td></tr>
            <tr><td>[10]</td><td>XGBoost with multiple features, RoBERTa</td></tr>
            <tr><td>[11]</td><td>XLNet with AR pre-training</td></tr>
            <tr><td>[12]</td><td>Ensemble approach, Multi-layer Dense Neural Network</td></tr>
            <tr><td>[13]</td><td>RF, Bi-directional Gated Recurrent Unit, Multi-head self-attention based transformer</td></tr>
            <tr><td>[14]</td><td>SVN, RR, BERT, MLP, AdaBoost, Gradient boosting, Extra trees</td></tr>
            <tr><td>[15]</td><td>HTC, TL model, Ensemble techniques</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>A summary of the specifications of the Urdu fake news corpus is presented in Table 2. It can be
observed from the table that the dataset is composed of 950 real and 650 fake news articles. This class
imbalance may impede the effectiveness of supervised learning techniques. It
can also be observed from the table that the training dataset is composed of 1300 news articles, whereas
the testing dataset is composed of 300 news articles. Furthermore, the testing dataset is also
imbalanced in favor of real news with a ratio of 2:1.</p>
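      <p>The class proportions implied by these counts can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes that the 2:1 ratio of the testing dataset is exact, which the text does not state explicitly, and it derives the per-class training counts by subtraction rather than taking them from the corpus. All variable names are hypothetical.</p>
      <preformat preformat-type="code">
```python
# Counts stated in the text.
TOTAL_REAL, TOTAL_FAKE = 950, 650
TRAIN_SIZE, TEST_SIZE = 1300, 300

# Assumption: the 2:1 real-to-fake ratio of the testing dataset is exact.
test_real = TEST_SIZE * 2 // 3
test_fake = TEST_SIZE - test_real

# Derived by subtraction; these values are not reported in the text.
train_real = TOTAL_REAL - test_real
train_fake = TOTAL_FAKE - test_fake


def share(part, whole):
    """Return part / whole rounded to three decimals."""
    return round(part / whole, 3)


summary = {
    "overall_real_share": share(TOTAL_REAL, TOTAL_REAL + TOTAL_FAKE),
    "train_real_share": share(train_real, TRAIN_SIZE),
    "test_real_share": share(test_real, TEST_SIZE),
}
print(test_real, test_fake, train_real, train_fake)
print(summary)
```
</preformat>
      <p>Under this assumption the derived training counts add up to the stated training size of 1300, so the reading is at least consistent with the other figures given above.</p>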
    </sec>
    <sec id="sec-4">
      <title>4. The Proposed Approach</title>
      <p>The focus of this study is the use of classical machine learning techniques for Urdu fake news
detection. We have employed a systematic approach for ensembling classical machine learning
techniques using a voting scheme. The approach was developed using the training dataset and
the preliminary testing dataset released before the submission of the notebook to the UrduFake’21
track. However, the results presented in Section 5 are generated using an unseen testing dataset
released by the organizers of UrduFake’21 for the competition.</p>
      <p>As a starting point of the approach, a comprehensive set of classical machine learning
techniques was identified. The techniques identified for this study are presented
in Table 3. It can be observed from the table that we identified nine techniques for the initial
experimentation. Subsequently, experiments were performed using the initially released dataset,
and precision, recall and F1 score were computed for each technique. Accordingly, the
three most effective techniques, AdaBoost, LightGBM and XGBoost, were identified for further
processing. To develop a deeper understanding of the three top-performing techniques, the
confusion matrix of each technique was generated. The confusion matrices of the three
techniques are presented in Figure 1.</p>
      <table-wrap id="tab-3">
        <label>Table 3</label>
        <caption>
          <p>Classical machine learning techniques identified for the initial experimentation.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Technique</th></tr>
          </thead>
          <tbody>
            <tr><td>Decision Tree</td></tr>
            <tr><td>Logistic Regression</td></tr>
            <tr><td>Random Forest</td></tr>
            <tr><td>Naive Bayes</td></tr>
            <tr><td>Support Vector Machine</td></tr>
            <tr><td>K-Nearest Neighbor</td></tr>
            <tr><td>AdaBoost</td></tr>
            <tr><td>LightGBM</td></tr>
            <tr><td>XGBoost</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The proposed ensemble technique employs a voting-based approach for predicting the final
label of each news article. An overview of the proposed approach is presented in Figure 2, whereas
the details of the proposed approach are as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>Convert all annotations into numeric form by replacing real news (R) with 0
and fake news (F) with 1. That is, convert the benchmark values, as well as the predictions,
into binary numbers.</p>
        </list-item>
        <list-item>
          <p>Calculate the sum of the predicted values for each news article and store it separately. That is, for one news
article, the sum is 0 if all three top-performing techniques predicted
the news as real, and 3 if all three techniques predicted the news as fake. In general,
the value varies between 0 and 3 depending upon the predictions of the three techniques.</p>
        </list-item>
        <list-item>
          <p>Produce the prediction labels by using the sums stored in the previous
step. That is, if at least two of the three techniques predicted that the news is fake, declare the news
as fake (1); otherwise, declare the news as real (0).</p>
        </list-item>
        <list-item>
          <p>As a final step, determine the final label of each news article by using the voting scheme. That
is, if a label is declared as 1 (fake), whereas the prediction of XGBoost is real (0) and at
the same time the sum of results is 0, declare the final label as 1 (fake). Similarly, if the label
is 0 (real), the prediction of XGBoost is 1 and the sum of the separately stored results is 2,
declare the final label as 0 (real). If neither condition holds, the prediction of
XGBoost is taken as the final label.</p>
        </list-item>
      </list>
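      <p>The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the experiments, and the function and parameter names are hypothetical. One point needs care: as literally stated, the final step pairs a label of 1 with a sum of 0 and a label of 0 with a sum of 2, combinations that the preceding step cannot produce, so with those values the function would always return the prediction of XGBoost. The sketch therefore assumes that the intended sums are 2 and 1, respectively, and exposes them as parameters so that either reading can be tried.</p>
      <preformat preformat-type="code">
```python
def to_binary(label):
    """Map the annotation 'R' (real) to 0 and 'F' (fake) to 1."""
    return {"R": 0, "F": 1}[label]


def ensemble_label(adaboost, lightgbm, xgboost, fake_sum=2, real_sum=1):
    """Combine three binary predictions (0 = real, 1 = fake).

    fake_sum and real_sum are the vote sums checked in the final step.
    Their defaults (2 and 1) are an assumption of this sketch, not values
    taken from the text.
    """
    votes = adaboost + lightgbm + xgboost   # step 2: sum between 0 and 3
    majority = 1 if votes >= 2 else 0       # step 3: at least two say fake
    # Step 4: the other two techniques overrule XGBoost when both disagree with it.
    if majority == 1 and xgboost == 0 and votes == fake_sum:
        return 1
    if majority == 0 and xgboost == 1 and votes == real_sum:
        return 0
    return xgboost                          # otherwise keep the XGBoost prediction


# Hypothetical predictions, ordered as (AdaBoost, LightGBM, XGBoost).
predictions = [("F", "F", "R"), ("R", "R", "F"), ("F", "R", "F"), ("R", "R", "R")]
final = [ensemble_label(*map(to_binary, row)) for row in predictions]
print(final)
```
</preformat>
      <p>Under the assumed sums, the final label coincides with a simple majority vote of the three techniques for every combination of predictions; calling the function with fake_sum=0 and real_sum=2 reproduces the literal reading.</p>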
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Experiments are performed using ten techniques: nine classical machine learning techniques
and our proposed technique. For these techniques, unigram, bigram and trigram features are
used. For the experiments, the training and testing datasets discussed in Section 3 are used.
That is, 1300 news articles are used for training and 300 news articles are used for testing.
The code used for the experiments can be downloaded from GitHub<sup>1</sup>. Subsequently, precision,
recall and F1 scores are computed for the two classes. Also, the macro average F1 and accuracy
scores are computed for all the techniques. The detailed results for unigram features are
presented in Table 4, as unigram features achieved higher effectiveness
scores than bigram and trigram features.</p>
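      <p>To make the three feature types concrete, the following sketch extracts unigram, bigram and trigram counts from a text. It is an assumption-laden illustration: the text does not state whether word-level or character-level n-grams were used, nor how they were weighted, so the sketch assumes whitespace-tokenized word n-grams with raw counts. The helper names and the romanized sample sentence are hypothetical.</p>
      <preformat preformat-type="code">
```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a whitespace-tokenized text, in order."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Bag-of-n-grams representation: maps each n-gram to its frequency."""
    return Counter(word_ngrams(text, n))


# Hypothetical sample, romanized for readability.
sample = "yeh khabar jaali hai yeh khabar"

for n, name in ((1, "unigram"), (2, "bigram"), (3, "trigram")):
    print(name, dict(ngram_counts(sample, n)))
```
</preformat>
      <p>Longer n-grams yield sparser features, since each distinct sequence occurs less often; on a corpus of this size, that sparsity is one possible reason for unigrams being the most effective.</p>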
      <p>It can be observed from the table that three techniques achieved an accuracy score
greater than 0.70. These are Random Forest, AdaBoost and our ensembling approach. It can also be
observed from the table that the macro average F1 score of two of these techniques is greater than 0.60,
whereas the macro average F1 score of the third, Random Forest, is less than 0.60. The two techniques
that achieved a macro average F1 score greater than 0.60 are AdaBoost and our ensembling
approach.</p>
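      <p>The gap between accuracy and macro average F1 noted above is typical of imbalanced data, because accuracy is dominated by the majority class whereas macro F1 weights both classes equally. The sketch below computes both measures from their standard definitions on a small set of hypothetical labels with a 2:1 class ratio; the labels are invented for illustration and are not results from the experiments.</p>
      <preformat preformat-type="code">
```python
def class_scores(y_true, y_pred, positive):
    """Precision, recall and F1 with the given class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(class_scores(y_true, y_pred, c)[2] for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels (0 = real, 1 = fake): four real and two fake articles.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

print("accuracy:", round(accuracy(y_true, y_pred), 3))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```
</preformat>
      <p>In this toy case a single missed fake article lowers the recall of the fake class to 0.5, which pulls the macro average F1 below the accuracy.</p>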
      <p>As the macro average F1 scores of these two techniques are both equal to 0.62, a
further comparison of the two techniques is performed in terms of the F1 scores for each type
of news. It can be observed from the table that the F1 score of the fake news class for the ensembling
approach is slightly higher than that of AdaBoost. Furthermore, the recall score achieved by
the proposed approach for the fake news class is higher than that of AdaBoost. This indicates
that the proposed technique is slightly more effective than AdaBoost for identifying fake news.</p>
      <p><sup>1</sup>https://github.com/socialmedialisteninglab/FakeNewsDetectionUrdu2021</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The importance of fake news detection is well established, and a number of studies have been
conducted on fake news detection for Western and Asian languages. However, Urdu fake
news detection has received less attention, despite the fact that the risk posed by fake news is
comparable with that in the Western world. To that end, the UrduFake’21 track at the Forum for
Information Retrieval Evaluation 2021 (FIRE 2021) has taken a significant leap towards promoting fake
news detection research in the Urdu language. In this study, an ensemble of classical machine
learning models is proposed for fake news detection. The proposed approach relies on a
voting scheme between the three most effective techniques for the detection of Urdu fake news.
Experiments are performed using the unseen testing dataset released by UrduFake’21,
nine classical machine learning techniques and the proposed approach. According to the results
released by the organizers, our proposed approach was ranked 4th among the 19 submissions.</p>
      <p>[8] M. Amjad, G. Sidorov, A. Zhila, A. Gelbukh, P. Rosso, UrduFake@FIRE2020: Shared track
on fake news identification in Urdu, in: Proceedings of the Forum for Information Retrieval
Evaluation, volume 2826, CEUR-WS, 2020, pp. 37–40.</p>
      <p>[9] M. Amjad, G. Sidorov, A. Zhila, A. F. Gelbukh, P. Rosso, Overview of the shared task on
fake news detection in Urdu at FIRE 2020, in: Proceedings of the Forum for Information
Retrieval Evaluation, volume 2826, CEUR-WS, 2020, pp. 434–446.</p>
      <p>[10] N. Lin, S. Fu, S. Jiang, Fake news detection in the Urdu language using
CharCNN-RoBERTa, in: Proceedings of the Forum for Information Retrieval Evaluation, volume
2826, CEUR-WS, 2020, pp. 447–451.</p>
      <p>[11] A. F. U. R. Khilji, S. R. Laskar, P. Pakray, S. Bandyopadhyay, Urdu fake news detection
using generalized autoregressors, in: Proceedings of the Forum for Information Retrieval
Evaluation, volume 2826, CEUR-WS, 2020, pp. 452–457.</p>
      <p>[12] A. Kumar, S. Saumya, J. P. Singh, NITP-AI-NLP@UrduFake-FIRE2020: Multi-layer dense
neural network for fake news detection in Urdu news articles, in: Proceedings of the
Forum for Information Retrieval Evaluation, volume 2826, CEUR-WS, 2020, pp. 458–463.</p>
      <p>[13] S. M. Reddy, C. Suman, S. Saha, P. Bhattacharyya, A GRU-based fake news prediction system:
Working notes for UrduFake-FIRE 2020, in: Proceedings of the Forum for Information
Retrieval Evaluation, volume 2826, CEUR-WS, 2020, pp. 464–468.</p>
      <p>[14] N. N. A. Balaji, B. Bharathi, SSNCSE_NLP@Fake news detection in the Urdu language
(UrduFake) 2020, in: Proceedings of the Forum for Information Retrieval Evaluation,
volume 2826, CEUR-WS, 2020, pp. 469–473.</p>
      <p>[15] F. Balouchzahi, H. Shashirekha, Learning models for Urdu fake news detection, in:
Proceedings of the Forum for Information Retrieval Evaluation, volume 2826, CEUR-WS,
2020, pp. 474–479.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Guttmann</surname>
          </string-name>
          ,
          <article-title>Value of the global entertainment and media market 2011-2024</article-title>
          , https://www.statista.com/statistics/237749/value-of-the-global-entertainment-and-media-market/,
          <year>2020</year>
          . Accessed: 2021-10-09.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sliva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Fake news detection on social media: A data mining perspective</article-title>
          ,
          <source>ACM SIGKDD Explorations Newsletter</source>
          <volume>19</volume>
          (
          <year>2017</year>
          )
          <fpage>22</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>A survey on fake review detection using machine learning techniques</article-title>
          ,
          <source>in: 4th International Conference on Computing Communication and Automation (ICCCA)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vijayvargiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Churi</surname>
          </string-name>
          ,
          <article-title>A systematic survey on deep learning and machine learning approaches of fake news detection in the pre-and post-COVID19 pandemic</article-title>
          ,
          <source>International Journal of Intelligent Computing and Cybernetics</source>
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>617</fpage>
          -
          <lpage>646</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schiffer</surname>
          </string-name>
          ,
          <article-title>Media literacy in the EFL classroom</article-title>
          ,
          <source>Ph.D. thesis</source>
          , The University of Vienna,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gómez-Adorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Voronkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>“Bend the truth”: Benchmark dataset for fake news detection in Urdu language and its evaluation</article-title>
          ,
          <source>Journal of Intelligent &amp; Fuzzy Systems</source>
          <volume>39</volume>
          (
          <year>2020</year>
          )
          <fpage>2457</fpage>
          -
          <lpage>2469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <article-title>Data augmentation using machine translation for fake news detection in the Urdu language</article-title>
          ,
          <source>in: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2537</fpage>
          -
          <lpage>2542</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
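          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>UrduFake@FIRE2020: Shared track on fake news identification in Urdu</article-title>
          ,
          <source>in: Proceedings of the Forum for Information Retrieval Evaluation</source>
          , volume 2826, CEUR-WS,
          <year>2020</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>40</lpage>
          .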
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>