<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The 2021 Urdu Fake News Detection Task using Supervised Machine Learning and Feature Combinations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Humayoun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Higher Colleges of Technology</institution>
          ,
          <addr-line>Abu Dhabi</addr-line>
          ,
          <country country="AE">United Arab Emirates</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>796</fpage>
      <lpage>800</lpage>
      <abstract>
        <p>This paper describes the system submitted to the FIRE shared task "The 2021 Fake News Detection in the Urdu Language". The challenge aims at automatically identifying fake news written in Urdu. Our submitted results ranked fifth in the competition. After the results were announced, we obtained better results than those we submitted: the best F1 Macro score achieved by one of our models is 0.6674, which is higher than the second-best score in the competition. This result is achieved with Support Vector Machines (polynomial kernel of degree 1) with stopwords removed, lemmatization applied, and the 20K best features selected out of 1.557 million features in total (produced by word n-grams with n=1,2,3,4 and character n-grams with n=2,3,4,5,6). The code is made available for reproducibility.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake News</kwd>
        <kwd>Urdu</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Embeddings</kwd>
        <kwd>Support Vector Machines</kwd>
        <kwd>Feature Engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As the world becomes more connected in the information age, fake news is also on the rise.
Spreading fake news is a proven tool in the propaganda warfare of the twenty-first century. Fake news
can be spread to praise or defame an entity, person, group or society, based on geopolitical and religious
motives. Methods and techniques for fake news detection are actively studied for major languages
like English. Unfortunately, resource-poor languages are often neglected. In this context, the Urdu fake
news shared task is an excellent step towards developing tools and techniques [1].</p>
      <p>This paper describes the system that we submitted to the competition. Our submitted
results ranked fifth. After the results were announced, we
obtained better results than those we submitted: the best F1 Macro score achieved by
one of our models is 0.6674, which is higher than the second-best score in the competition. Related
research outside this competition, describing the dataset construction and reporting
excellent results, can be found in [2] [3].</p>
      <p>Urdu is a widely spoken language in South Asia and worldwide due to the large South Asian diaspora
[4]. Urdu has a modified Perso-Arabic alphabet, and it is written in cursive and context-sensitive
Nastalique writing style. Urdu is unique because it takes its literary vocabulary from Persian and Arabic
but informal vocabulary from the native languages of South Asia [5]. Some of the challenges that Urdu
computing faces are: lack of capitalization, optional use of diacritic marks, and space not being a reliable
word boundary marker [6] [7] [8]. In the absence of diacritics, context plays a vital role in guessing the
pronunciation of a word. Urdu is a Subject-Object-Verb language having a free word order [9].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset Description</title>
      <p>The training set has 1,300 instances (including the test set within the training set), of which 750
are labeled as Real and 550 as Fake. The test set has 300 instances (200
labeled as Real and 100 as Fake). The classes are slightly imbalanced, which
we ignore. A superficial analysis of the dataset reveals very little non-standard script.
As expected in news data, diacritic marks are absent, and the data is generally clean. Proper
segmentation of Urdu words remains an unresolved problem; until proper word segmentation tools
for Urdu are readily available, tokenizing on spaces is the best strategy.</p>
      <p>Fake news detection is fundamentally a difficult problem, mainly because domain knowledge
is needed to judge whether a news item is fake or real. Anything that happens unexpectedly could be considered fake
by those lacking sufficient domain knowledge. A recent example is the fall of Afghanistan
to the Taliban: the news was so unexpected that people felt the need to confirm it from
more than one source.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Preprocessing, Features and Classification Techniques</title>
      <p>In supervised learning, the task of fake news detection can be modeled as a binary classification
problem. A supervised learning algorithm, known as a classifier, is trained on a collection of training
documents and their labels. Once training is completed, the classifier takes a document or text as
input and returns a label as output. The framework we used consists of five steps: preprocessing
(Section 3.1), feature extraction and training classifiers (Section 3.2), producing labels, and their
evaluation against reference labels (Section 4). For this task, train and test sets are given. The models are
produced by training a classifier on the training set, and the label predictions are made on the test set.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 The Preprocessing</title>
      <p>Preprocessing plays a key role in NLP. We apply the following preprocessing steps:</p>
      <list list-type="order">
        <list-item><p>Diacritic Removal. Diacritics (vowel marks) are optional in Urdu. To keep the data consistent,
removing all of them is a common practice.</p></list-item>
        <list-item><p>Text Normalization. Persian and Arabic characters that look visually similar to their Urdu
counterparts are sometimes used in writing, resulting in orthographic variations. We
normalize all such variations to Normalization Form C [10].</p></list-item>
        <list-item><p>Stopword Removal. The stopword list we used is provided by [6, 7] and contains nearly
500 words.</p></list-item>
        <list-item><p>Lemmatization. We used the Urdu Morphological Analyzer [4] to convert the surface forms
of a word to its lemma or root. This tool covers approximately 5,000 words and can
handle 140,000 word forms.</p></list-item>
        <list-item><p>N-grams. A list of tokens is produced from word and character n-grams (unigram, bigram,
trigram, …).</p></list-item>
      </list>
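      <p>The normalization, diacritic-removal, stopword, and n-gram steps above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the code used for the submission: the function names, the two-entry look-alike map, and the choice to apply Unicode NFC before stripping combining marks are ours. Lemmatization is left out because it depends on the external morphological analyzer.</p>
      <preformat preformat-type="code">
```python
import unicodedata

# Hypothetical, deliberately tiny map from Arabic letters to Urdu look-alikes.
LOOKALIKES = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH to ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF to ARABIC LETTER KEHEH
}


def normalize(text):
    """Apply NFC, map look-alike letters, then drop combining marks."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(LOOKALIKES.get(ch, ch) for ch in text)
    # Category "Mn" (nonspacing mark) covers the Arabic-script diacritics.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword collection."""
    return [tok for tok in tokens if tok not in stopwords]


def word_ngrams(tokens, n):
    """Word n-grams, each joined with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def preprocess(text, stopwords):
    """Normalize, tokenize on spaces, and remove stopwords."""
    return remove_stopwords(normalize(text).split(), stopwords)
```
      </preformat>
      <p>Normalizing first matters in this sketch: a letter such as alef followed by a combining madda is composed into a single code point by NFC and therefore survives the removal of combining marks, whereas a free-standing vowel mark such as zer is dropped.</p>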
    </sec>
    <sec id="sec-5">
      <title>3.2 Classification Techniques</title>
      <p>Both classic supervised learning and neural network techniques have been extensively used in the
literature for similar tasks [3] [11]. We have used the following two techniques:</p>
    </sec>
    <sec id="sec-6">
      <title>3.2.1 Support Vector Machines with K-Best features</title>
      <p>We have used Support Vector Machines (SVM) with a polynomial kernel of degree 1 and K-Best features.
One beneficial characteristic of SVM is that it requires less memory to handle very large datasets.
We chose this specific kernel and degree because they gave better results in our initial experiments. A
standard bag-of-words model is built from a non-exhaustive list of features produced by character
n-grams (n=2,3,4,5,6) and word n-grams (n=1,2,3,4). The value of each feature in the bag-of-words model is
calculated using the TF-IDF weighting scheme. Since the number of features was huge, the K best
features were selected using the SelectKBest algorithm with the Chi-squared statistic. Another reason to
select the K best features was to keep a reasonable ratio between the number of features per instance and
the total number of instances in the training set.</p>
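      <p>The feature pipeline described above maps naturally onto scikit-learn, the library that provides SelectKBest. The sketch below is our assumption about how the pieces could fit together rather than the submitted code: the function name build_pipeline and the toy documents are hypothetical, and library defaults are used wherever the text does not state a value.</p>
      <preformat preformat-type="code">
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC


def build_pipeline(k=20000):
    """TF-IDF word and character n-grams, chi-squared K-best, degree-1 SVM."""
    features = FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 4))),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 6))),
    ])
    return Pipeline([
        ("features", features),
        ("select", SelectKBest(score_func=chi2, k=k)),
        ("svm", SVC(kernel="poly", degree=1)),
    ])


# Toy, made-up documents; k is kept tiny because they yield few features.
docs = ["alpha beta gamma", "alpha beta delta",
        "omega sigma tau", "omega sigma rho"]
labels = ["real", "real", "fake", "fake"]

model = build_pipeline(k=10).fit(docs, labels)
predictions = model.predict(docs)
```
      </preformat>
      <p>With a polynomial kernel of degree 1, the decision function is linear in the selected features, so this model behaves like a linear SVM.</p>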
    </sec>
    <sec id="sec-7">
      <title>3.2.2 Convolutional Neural Network</title>
      <p>A Convolutional Neural Network for sentence classification is reported in [11]. We used a simplified
version of this model without pre-trained word embeddings. Pre-trained word
embeddings for Urdu, such as [12], are available, but we have not used them, mainly because,
given the size of the task dataset, we were hoping to learn good embeddings from the dataset itself. We
have used the following two variants:</p>
      <list list-type="order">
        <list-item><p>The CNN model with four input channels. It was used for the results reported in the
competition.</p></list-item>
        <list-item><p>The CNN model with six input channels. During paper submission, we discovered comparable
results with six input channels.</p></list-item>
      </list>
      <p>The CNN Model:</p>
      <list list-type="order">
        <list-item><p>Each channel in the model is defined as:</p>
          <list list-type="order">
            <list-item><p>An input layer.</p></list-item>
            <list-item><p>An embedding layer whose input size is set to the size of the vocabulary and whose output is a 100-dimensional real-valued vector.</p></list-item>
            <list-item><p>A one-dimensional convolutional layer with 32 filters and a kernel size set to the number of words
or characters to read at once (word or character n-grams where n=k for channel<sub>k</sub> with k=1, 2,
3, 4, 5, 6; i.e., channel<sub>1</sub> uses unigrams, channel<sub>2</sub> uses bigrams, channel<sub>3</sub> uses trigrams, and so
on). Note that mixing word n-grams and character n-grams is not possible.</p></list-item>
            <list-item><p>A max pooling layer to consolidate the output from the convolutional layer.</p></list-item>
            <list-item><p>A flatten layer to reduce the three-dimensional output to two dimensions for concatenation.</p></list-item>
          </list>
        </list-item>
        <list-item><p>The output from the channels (four or six, depending on the variant) is concatenated into a single vector and processed by a dense
layer and an output layer. The model architecture with two channels for an example sentence is
shown in Figure 1.</p></list-item>
      </list>
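      <p>To make the flatten-and-concatenate step concrete, the following sketch traces the output size of each channel. It is shape bookkeeping only, not the model itself, and several values are assumptions that the text does not state: a padded input length of 100 units, a convolution without padding and with stride 1, and a pooling window of 2. The function names are hypothetical.</p>
      <preformat preformat-type="code">
```python
def channel_output_size(seq_len, kernel_size, filters=32, pool_size=2):
    """Flattened size of one channel: Conv1D, max pooling, then flatten."""
    conv_len = seq_len - kernel_size + 1   # positions the kernel can occupy
    pooled_len = conv_len // pool_size     # non-overlapping pooling windows
    return pooled_len * filters            # (steps, filters) becomes one axis


def merged_size(seq_len, kernel_sizes, filters=32, pool_size=2):
    """Length of the vector obtained by concatenating all channels."""
    return sum(channel_output_size(seq_len, k, filters, pool_size)
               for k in kernel_sizes)


four_channels = merged_size(100, [1, 2, 3, 4])
six_channels = merged_size(100, [1, 2, 3, 4, 5, 6])
print(four_channels, six_channels)  # prints: 6272 9312
```
      </preformat>
      <p>Under these assumptions, moving from four to six channels lengthens the concatenated vector that feeds the dense layer by roughly half, which adds trainable parameters without adding training data.</p>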
    </sec>
    <sec id="sec-8">
      <title>4. Experiments and Results</title>
      <p>For the task at hand, train and test sets are given. The models are produced by training a classifier
on the training set, and the predictions are made on the test set. The experiments were performed on
a laptop with an 8th-generation Intel Core i7 processor and 8 GB of RAM.</p>
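      <p>Since the results in this section are reported as F1 Macro, a short sketch of that metric may help. The helper names below are hypothetical; the computation is the standard unweighted mean of the per-class F1 scores. As a usage example, it scores a trivial baseline that labels every one of the 300 test instances as Real.</p>
      <preformat preformat-type="code">
```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score of one class from true-positive, false-positive, false-negative counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Test-set class distribution from Section 2; the baseline always says "Real".
y_true = ["Real"] * 200 + ["Fake"] * 100
y_pred = ["Real"] * 300
print(round(macro_f1(y_true, y_pred), 4), round(accuracy(y_true, y_pred), 4))
# prints: 0.4 0.6667
```
      </preformat>
      <p>The baseline reaches an accuracy of about 0.67 but an F1 Macro of only 0.4, which illustrates why the macro-averaged score is the more informative figure on this slightly imbalanced test set.</p>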
    </sec>
    <sec id="sec-9">
      <title>4.1 Experiment 1</title>
      <p>In this experiment, we produce a bag-of-words feature vector from a combination of word n-grams
(n=1,2,3,4) and character n-grams (n=2,3,4,5,6). Stopwords are removed and lemmatization
is applied in addition to the basic preprocessing settings described in Section 3.1. TF-IDF
weighting is applied to obtain the feature vector. The top results are shown in Table 1.</p>
      <p>We learned that:</p>
      <list list-type="bullet">
        <list-item><p>The best score in Table 1 (row 7) is better than the second-best score in the competition.</p></list-item>
        <list-item><p>Excluding n=1 for character n-grams improves the results.</p></list-item>
        <list-item><p>The optimal number of features in K-Best is ~20K (see rows 4 to 8).</p></list-item>
        <list-item><p>The best combination of features is word n-grams with n=1,2,3,4 and character n-grams with n=2,3,4,5,6.</p></list-item>
      </list>
    </sec>
    <sec id="sec-10">
      <title>4.2 Experiment 2</title>
      <p>In this experiment, we explored a non-exhaustive set of configurations under the following constraints: (1) the number of
channels is 4, 5, or 6; (2) the input consists of either character-level or word-level sequences, with the n-gram size set through the kernel
size in the convolutional layer of each channel. In all of these experiments, stopwords were removed
and lemmatization was applied in addition to the basic preprocessing settings described in Section
3.1. Note that it is not possible to combine word n-grams and character n-grams in our implementation
of the CNN, mainly because we rely on the Keras tokenizer class, which requires
choosing either word sequences or character sequences as the basic building block of the model. The
results are shown in Table 2.</p>
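      <p>The restriction mentioned above follows from how text is turned into integer sequences: a sequence indexes either words or characters, and the kernel size then counts units of that one kind. The sketch below mimics this with a hypothetical encode helper written in plain Python. It does not reproduce the Keras tokenizer itself; only the char_level flag is named after the corresponding Keras option.</p>
      <preformat preformat-type="code">
```python
def encode(texts, char_level=False):
    """Map each text to a list of integer ids over words or over characters."""
    vocab = {}
    sequences = []
    for text in texts:
        units = list(text) if char_level else text.split()
        # Ids start at 1, leaving 0 free for padding.
        sequences.append([vocab.setdefault(u, len(vocab) + 1) for u in units])
    return sequences, vocab


def windows(sequence, kernel_size):
    """The slices a 1-D convolution with this kernel size reads, one per position."""
    return [sequence[i:i + kernel_size]
            for i in range(len(sequence) - kernel_size + 1)]


texts = ["ab cd ab"]
word_seqs, word_vocab = encode(texts)
char_seqs, char_vocab = encode(texts, char_level=True)
print(word_seqs[0])              # prints: [1, 2, 1]
print(windows(word_seqs[0], 2))  # prints: [[1, 2], [2, 1]]
print(len(char_seqs[0]))         # prints: 8
```
      </preformat>
      <p>A kernel of size 2 over the word-level sequence reads word bigrams, while the same kernel over the character-level sequence reads character bigrams. Because one model consumes one kind of sequence, the two feature types cannot be mixed in this setup.</p>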
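      <p>The results obtained with the CNN are inferior to those achieved in
Experiment 1. We think this is mainly because of the mid-range size of the dataset: CNN models need
a massive number of training instances in order to outperform traditional models, and such a dataset is
not available in our case.</p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Model</th><th>Fake P</th><th>Fake R</th><th>Fake F1</th><th>Real P</th><th>Real R</th><th>Real F1</th><th>F1 Macro</th><th>Accuracy</th></tr>
          </thead>
          <tbody>
            <tr><td>Character level CNN with 4 channels (with kernel sizes 1,2,3,4; one for each channel)</td><td>0.48</td><td>0.49</td><td>0.49</td><td>0.74</td><td>0.74</td><td>0.74</td><td>0.6114</td><td>0.653</td></tr>
            <tr><td>Word level CNN with 4 channels (with kernel sizes 1,2,3,4; one for each channel)</td><td>0.49</td><td>0.58</td><td>0.53</td><td>0.77</td><td>0.7</td><td>0.73</td><td>0.629</td><td>0.656</td></tr>
            <tr><td>Word level CNN with 6 channels (with kernel sizes 1,2,3,4,5,6; one for each channel)</td><td>0.45</td><td>0.77</td><td>0.57</td><td>0.82</td><td>0.53</td><td>0.64</td><td>0.603</td><td>0.606</td></tr>
            <tr><td>Character level CNN with 6 channels (with kernel sizes 1,2,3,4,5,6; one for each channel)</td><td>0.47</td><td>0.66</td><td>0.55</td><td>0.79</td><td>0.64</td><td>0.70</td><td>0.627</td><td>0.643</td></tr>
          </tbody>
        </table>
      </table-wrap>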
    </sec>
    <sec id="sec-11">
      <title>5. Conclusion</title>
      <p>In this work, we have performed rigorous experimentation and achieved an F1 Macro
score of 0.6674, which is higher than the second-best score in the competition, although this result was obtained
after the competition results were announced. We demonstrated that traditional models with good feature engineering can
produce good results for a mid-range dataset. In addition, neural network-based methods such as
CNN work reasonably well for the mid-sized dataset at hand. One way of improving the CNN results
might be the use of pre-trained Urdu embeddings; however, such an investigation remains future
work. Recent transfer-learning techniques such as BERT fine-tuning can also be investigated in
the future, though obtaining a large enough Urdu BERT model might be a challenge.</p>
      <p><sup>4</sup> These are the results reported in the competition.</p>
    </sec>
    <sec id="sec-12">
      <title>6. References</title>
      <p>M. Humayoun and N. Akhtar, "CORPURES: Benchmark Corpus for Urdu Extractive
Summaries and Experiments using Supervised Learning," Intelligent Systems with
Applications, 2021.</p>
      <p>A. Gulzar, "Urdu Normalization Utility v1.0," Technical Report, Center for Language
Engineering, Al-Khawarizmi Institute of Computer Science (KICS), University of Engineering and Technology,
Lahore, Pakistan, 2007.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Butt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. I.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>"Overview of the shared task on fake news detection in Urdu at FIRE 2021,"</article-title>
          in
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>"Threatening Language Detecting and Threatening Target Identification in Urdu Tweets,"</article-title>
          <source>IEEE Access</source>
          , vol.
          <volume>9</volume>
          , pp.
          <fpage>128302</fpage>
          -
          <lpage>128313</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gómez-Adorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Voronkov</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>"“Bend the truth”: Benchmark dataset for fake news detection in Urdu language and its evaluation,"</article-title>
          <source>Journal of Intelligent &amp; Fuzzy Systems</source>
          , vol.
          <volume>39</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>2457</fpage>
          -
          <lpage>2469</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>