<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>N. N. A. Balaji); bharathib@ssn.edu.in (B. Bharathi)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SSNCSE_NLP@Fake news detection in the Urdu language (UrduFake) 2020</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nitin Nikamanth Appiah Balaji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>B. Bharathi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of CSE, Sri Siva Subramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Fake news and false rumors now spread further and faster, reach more people, and penetrate deeper into social networks, often crowding out the truth. Social media interaction is one of the major channels through which news spreads across the world today, and fake news spreads just as quickly through digital media. The objective of this work is to detect unreliable information in Urdu-language news content, using digital media text collected from different sources. We experimented with TFIDF and fastText features for this task and achieved an accuracy of 90% on the development data and 78.7% on the test data.</p>
      </abstract>
      <kwd-group>
        <kwd>TFIDF</kwd>
        <kwd>fastText</kwd>
        <kwd>Gradient boosting algorithm</kwd>
        <kwd>Random forest classifier</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Fake news detection has recently attracted growing interest from the public and the research
community as the spread of unreliable information online increases, predominantly in media
outlets such as social media feeds, news blogs, and online newspapers. In recent research, fake
news detection has become one of the prominent tasks in natural language processing. Urdu belongs to
the Indo-Aryan language group and is one of the most widely spoken languages in the world,
with more than 100 million speakers. Urdu is nevertheless an under-resourced language: only a small amount
of data is publicly available. To support automated fake news detection, a corpus has been
developed by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This Urdu fake news dataset, named Bend-The-Truth [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], is composed of news
articles in six different domains, which are listed in Table 1. Fake news detection is
treated as a classification task. In this paper, we build a binary classifier
that decides whether a given news article is fake or real, using linguistic features present
in the text together with different machine learning classifiers.
      </p>
      <p>The paper is organized as follows. Section 2 reviews the literature related to the fake
news detection task. Section 3 describes the dataset used in this work. The
proposed methodology is briefly explained in Section 4. Results and discussion are presented
in Section 5, and Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Approaches to fake news detection can be grouped in different ways: by the natural
language approach used to extract features from the document; by the machine
learning or deep learning model used to classify the news; or by the evidence considered, such as the facts presented in the
news, how the news is written, and how it spreads on social media. Automated fake
news detection is the task of assessing the truthfulness of claims in news, and it is
one of the challenging tasks in natural language processing. One of the datasets
available for detecting fake news is FAKENEWSNET [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a project that collects data for fake
news research. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a data augmentation method for fake news detection in the Urdu language
using machine translation is presented. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a combination of linguistic and semantic features
is used to discriminate between real and fake news. Recurrent and convolutional neural networks are used
to detect fake news by the authors of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The authors of [6] detect fake news from emotional
content using an LSTM.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data set</title>
      <p>
        The Urdu fake news dataset, named Bend-The-Truth, contains news articles collected
from different domains such as sports, social media, the education sector, technology,
business, and entertainment. The real news was collected by following a rigorous procedure
from a variety of mainstream news websites, predominantly in Pakistan, India, the UK, and the USA.
These news sources include BBC Urdu News, CNN Urdu, Express-News, Jung News, Naway Waqat,
and many other reliable news websites, covering the period from January 2018 to December 2018
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The official description of the task is given in [7, 8]. The distribution of the data across the different
categories is given in Table 1.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed work</title>
      <p>For this task, several feature extraction techniques are studied.
The investigated feature extraction strategies are explained in this section. The extracted
features are then classified using machine learning models such as Multi-Layer Perceptron
(MLP), AdaBoost (AB), ExtraTrees (ET), Random Forest (RF), Support Vector Machine (SVM),
and Gradient Boosting (GB). The scikit-learn implementations of these models
are used. The performance of the models is compared using F1-scores.</p>
      <p>4.1. TF-IDF</p>
      <p>Term Frequency-Inverse Document Frequency (TFIDF) gives a normalized representation of
the sentences that reduces the impact of frequently repeated, uninformative words. A TFIDF vectorizer is
fitted from scratch on the given training dataset for this task. Character-level (char) TFIDF is
used to capture character-level relationships, since words generally appear
in different forms and tenses. Both word and char TFIDF are therefore considered for comparison.
An n-gram range of 1-4 is used for both systems. The extracted feature vectors
are fitted and evaluated using different machine learning models such as Random Forest, Extra
Trees, Gradient Boosting, and AdaBoost.</p>
      <p>4.2. Text Embedding</p>
      <p>As the training set of the Bend-The-Truth dataset contains only 900 samples, which is very few,
fine-tuning pre-trained models is considered a better alternative. Pre-trained sentence-to-vector
conversion techniques such as Word2Vec [9], FastText [10], and BERT [11], trained on
CommonCrawl and Wikipedia data, are therefore used.</p>
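      <p>To illustrate why character-level n-grams can help when words appear in different forms (Section 4.1), the following is a minimal, standard-library-only sketch of word and char TFIDF with an n-gram range of 1-4. It is not the implementation used in our experiments, which relied on scikit-learn. The function names, the toy English sentences, and the unnormalized weighting tf * (ln(N / df) + 1) are assumptions made for illustration only.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=4):
    """Return all character n-grams of length n_min..n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, n_min=1, n_max=4):
    """Return all whitespace-token n-grams of length n_min..n_max."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs, analyzer):
    """Map each document to {term: tf * idf}, with idf = ln(N / df) + 1.

    Assumption: no smoothing and no length normalization are applied.
    """
    counts = [Counter(analyzer(doc)) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    return [{term: tf * (math.log(n_docs / doc_freq[term]) + 1.0)
             for term, tf in c.items()}
            for c in counts]


# Toy English stand-ins for news text (hypothetical, not from the dataset).
docs = ["walked home", "walking home", "market prices rise"]
char_vecs = tfidf(docs, char_ngrams)
word_vecs = tfidf(docs, word_ngrams)

# Features shared by the first two documents under each analyzer.
shared_chars = set(char_vecs[0]) & set(char_vecs[1])
shared_words = set(word_vecs[0]) & set(word_vecs[1])
print(sorted(shared_words))
print("walk" in shared_chars)
```
]]></preformat>
      <p>In this toy example the two inflected forms share only the word feature "home", whereas the char analyzer also shares the stem "walk", which is the kind of character-level relationship described in Section 4.1.</p>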
      <p>Word2Vec and FastText are CBOW- or Skip-gram-based models that are trained in an
unsupervised manner on large amounts of data. Urdu-specific pre-trained models are
used for the task. These models generate a fixed-length vector representation from variable-length
sentences. The fixed-length representations are then used to train machine learning
models such as Multi-Layer Perceptron, Random Forest, and Support Vector Machine. The
hyper-parameters of the models are shown in Table 2.</p>
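      <p>One common way to obtain a fixed-length vector from a variable-length sentence is to average the vectors of its words. The sketch below illustrates this idea under stated assumptions: the pooling strategy (a plain mean that skips unknown words), the function name, and the three-dimensional toy embeddings are hypothetical and are not taken from the pre-trained Urdu models used in this work.</p>
      <preformat><![CDATA[
```python
def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of known words; unknown words are skipped.

    Returns a zero vector of length `dim` if no word is known.
    """
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 3-dimensional word embeddings (real models use e.g. 300).
toy_embeddings = {
    "news": [1.0, 0.0, 2.0],
    "fake": [3.0, 2.0, 0.0],
    "real": [0.0, 4.0, 1.0],
}

short_vec = sentence_vector("fake news", toy_embeddings, dim=3)
long_vec = sentence_vector("real news fake news today", toy_embeddings, dim=3)
print(short_vec)
print(long_vec)
```
]]></preformat>
      <p>Both sentences map to vectors of the same length regardless of how many words they contain, so the output can be passed directly to classifiers such as a Multi-Layer Perceptron or a Random Forest.</p>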
      <p>BERT is a transformer-based neural network architecture that can be pre-trained in an
unsupervised manner and fine-tuned for particular tasks. The BERT model is considered because it has shown
excellent performance on Twitter dataset classification and other sentence
classification tasks. In our experiment, the model is trained end-to-end for 50 epochs, with the
BERT-base multilingual cased pre-trained weights as the starting weights.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results and discussions</title>
      <p>The performance on the Urdu fake news detection task is analyzed in this section. Among the
TFIDF models, the char TFIDF method gave better results than the word TFIDF method,
as expected, although the improvement was very small. Among the machine learning
models, Random Forest gave the best results with TFIDF features. Among the embedding
techniques, FastText with the Multi-Layer Perceptron model performed better
than the Word2Vec and BERT models. The performance
of the models on the dev set and the test set is tabulated in Table 2 and Table 3, respectively. Even
though TFIDF performed better on the dev set, the FastText model performed
better on the test set, as expected, since it is pre-trained on a much larger collection of text.</p>
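      <p>The F1-score used to compare the models (Section 4) can be illustrated with a small sketch. The averaging scheme is not specified in this paper, so the macro average shown here is an assumption, as are the function names and the toy labels (1 = fake, 0 = real), which are not results from our experiments.</p>
      <preformat><![CDATA[
```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 = 2 * P * R / (P + R) for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_binary(y_true, y_pred, positive=label)
               for label in labels) / len(labels)


# Toy predictions on an imbalanced sample (hypothetical values).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

f1_fake = f1_binary(y_true, y_pred, positive=1)
f1_real = f1_binary(y_true, y_pred, positive=0)
macro = macro_f1(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(f1_fake, f1_real, macro, accuracy)
```
]]></preformat>
      <p>On imbalanced data the per-class F1-scores can differ noticeably from the accuracy, which is one reason to report both.</p>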
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In the current scenario, fake news is a major problem in our society. It
spreads very quickly through social media, which causes a range of problems.
Natural language processing plays a major role in automatically classifying news
as real or fake. In the proposed system, fake news detection in the Urdu language is
studied using the "Bend the Truth" benchmark dataset. Our system achieved an accuracy of 90%
on the development data and 78.7% on the test data.</p>
      <ref-list>
        <ref id="ref6">
          <mixed-citation>[6] B. Ghanem, P. Rosso, F. Rangel, An emotional analysis of false information in social media and news articles, ACM Transactions on Internet Technology (2020).</mixed-citation>
        </ref>
        <ref id="ref7">
          <mixed-citation>[7] M. Amjad, G. Sidorov, A. Zhila, A. Gelbukh, P. Rosso, UrduFake@FIRE2020: Shared track on fake news detection in Urdu, in: Proceedings of the 12th Forum for Information Retrieval Evaluation (FIRE 2020), Hyderabad, India, 2020.</mixed-citation>
        </ref>
        <ref id="ref8">
          <mixed-citation>[8] M. Amjad, G. Sidorov, A. Zhila, A. Gelbukh, P. Rosso, Overview of the shared task on fake news detection in Urdu at FIRE 2020, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings, Hyderabad, India, 2020.</mixed-citation>
        </ref>
        <ref id="ref9">
          <mixed-citation>[9] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.</mixed-citation>
        </ref>
        <ref id="ref10">
          <mixed-citation>[10] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning word vectors for 157 languages, in: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.</mixed-citation>
        </ref>
        <ref id="ref11">
          <mixed-citation>[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gómez-Adorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Voronkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>"Bend the Truth": Benchmark dataset for fake news detection in Urdu language and its evaluation</article-title>
          ,
          <source>Journal of Intelligent &amp; Fuzzy Systems</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . doi:10.3233/JIFS-179905.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sliva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Fake news detection on social media: A data mining perspective</article-title>
          ,
          <source>CoRR abs/1708.01967</source>
          (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1708.01967. arXiv:1708.01967.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <article-title>Data augmentation using machine translation for fake news detection in the Urdu language</article-title>
          ,
          <source>in: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>2537</fpage>
          -
          <lpage>2542</lpage>
          . URL: https://www.aclweb.org/anthology/2020.lrec-1.309.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardalov</surname>
          </string-name>
          ,
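          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>In search of credible news</article-title>
          (
          <year>2019</year>
          ). arXiv:1911.08125.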
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks</article-title>
          ,
          <source>in: AAAI</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>