The 2021 Urdu Fake News Detection Task using Supervised Machine Learning and Feature Combinations

Muhammad Humayoun
Higher Colleges of Technology, Abu Dhabi, United Arab Emirates
Forum for Information Retrieval Evaluation, December 13-17, 2021, India
EMAIL: mhumayoun@hct.ac.ae, ORCID: 0000-0003-0623-6703

Abstract
This paper presents the system description submitted to the FIRE Shared Task "The 2021 Fake News Detection in the Urdu Language". The challenge aims at automatically identifying fake news written in Urdu. Our submitted results ranked fifth in the competition. However, after the result declaration of the competition, we managed to attain even better results than the submitted ones. The best F1 Macro score achieved by one of our models is 0.6674, higher than the second-best score in the competition. This result is achieved with Support Vector Machines (polynomial kernel, degree 1), with stopwords removed, lemmatization applied, and the 20K best features selected out of 1.557 million features in total (produced by word n-grams with n=1,2,3,4 and character n-grams with n=2,3,4,5,6). The code is made available for reproducibility at https://github.com/humsha/UrduFakeDetection2021.

Keywords
Fake News, Urdu, Convolutional Neural Network, Embeddings, Support Vector Machines, Feature Engineering

1. Introduction
As the world becomes more connected in the information age, fake news is also increasing. Spreading fake news is a proven tool in the propaganda warfare of the twenty-first century. Fake news can be spread to praise or defame an entity, person, group or society, based on geopolitical and religious motives. Methods and techniques for fake news detection are actively studied for major languages like English; unfortunately, resource-poor languages are often neglected. In this context, the Urdu fake news shared task (https://www.urdufake2021.cicling.org/) is an excellent step towards developing such tools and techniques [1]. This paper presents the system description submitted to the competition. Our submitted results ranked fifth in the competition. Moreover, after the result declaration of the competition, we managed to attain even better results than the submitted ones. The best F1 Macro score achieved by one of our models is 0.6674, which is higher than the second-best score in the competition. Related research outside of this competition, describing the dataset construction and reporting excellent results, is presented in [2, 3].
Urdu is a widely spoken language in South Asia and, due to the large South Asian diaspora, worldwide [4]. Urdu uses a modified Perso-Arabic alphabet and is written in the cursive, context-sensitive Nastalique style. Urdu is unique in that it takes its literary vocabulary from Persian and Arabic but its informal vocabulary from the native languages of South Asia [5]. Some of the challenges Urdu computing faces are the lack of capitalization, the optional use of diacritic marks, and the fact that space is not a reliable word-boundary marker [6, 7, 8]. In the absence of diacritics, context plays a vital role in guessing the pronunciation of a word. Urdu is a subject-object-verb language with free word order [9].

2. Dataset Description
The dataset has 1,300 instances in the training set (including the test set within the training set).
750 instances are labeled as Real, and 550 instances are labeled as Fake. The test set has 300 instances (200 labeled as Real and 100 labeled as Fake). The dataset is slightly imbalanced, which we consider small enough to ignore. A superficial analysis reveals very few occurrences of non-standard script in the dataset. As expected for news data, diacritic marks are absent. The data is generally clean. Proper segmentation of Urdu words remains an unresolved problem; however, tokenizing on spaces is the best strategy until proper word-segmentation tools for Urdu are readily available.
Fake news detection is fundamentally a difficult problem, mainly because domain knowledge is needed to judge whether a news item is fake or real. Anything that happens unexpectedly could be considered fake by those lacking sufficient domain knowledge. A recent example is the fall of Afghanistan to the Taliban: the news was so unexpected that people felt the need to confirm it from more than one source.

3. Preprocessing, Features and Classification Techniques
In supervised learning, the task of fake news detection can be modeled as a binary classification problem. A supervised learning algorithm, known as a classifier, is trained on a collection of training documents and their labels. Once training is completed, the classifier takes a document or text as input and returns a label as output. The framework we used consists of five steps: preprocessing (Section 3.1), feature extraction and classifier training (Section 3.2), label prediction, and evaluation against the reference labels (Section 4). Train and test sets are given for the task. The models are produced by training a classifier on the training set, and the label predictions are performed on the test set.

3.1 Preprocessing
Preprocessing plays a key role in NLP. We apply the following preprocessing:
1. Diacritic Removal. Diacritic marks (short vowels) are used only optionally in Urdu. To ensure consistency of the data, removing all diacritics is a common practice.
2. Text Normalization. Persian and Arabic characters that look visually similar to their Urdu counterparts are sometimes used in writing, resulting in orthographic variations. We normalize all such variations to Normalization Form C [10].
3. Stopword Removal. The stopword list we used is provided by [7, 6] and contains nearly 500 words.
4. Lemmatization. We used the Urdu Morphological Analyzer [4] to convert all surface forms of a word to its lemma or root. The tool covers approximately 5,000 words and can handle 140,000 word forms.
5. N-grams. A list of tokens is produced by word and character n-grams (unigram, bigram, trigram, …).

3.2 Classification Techniques
Both classic supervised learning and neural network techniques have been used extensively in the literature for similar tasks [3, 11]. We used the following two techniques.

3.2.1 Support Vector Machines with K-Best Features
We used Support Vector Machines (SVM) with a polynomial kernel of degree 1 and K-best feature selection. One beneficial characteristic of SVMs is that they require relatively little memory when handling very large datasets. We used this specific kernel and degree because it gave better results in our initial experiments. A standard bag-of-words model is produced from features generated by character n-grams (n=2,3,4,5,6) and word n-grams (n=1,2,3,4). Feature values in the bag-of-words model are computed using the TF-IDF weighting scheme. Since the number of features is huge, the K best features are selected with the SelectKBest algorithm using the chi-squared statistic. Another reason to select the K best features is to keep a reasonable ratio between the number of features and the number of instances in the training set.
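As an illustration, the sketch below shows how such a pipeline could be assembled with scikit-learn. It is a minimal sketch rather than our exact implementation: the variables train_texts, train_labels and test_texts are assumed to hold the already preprocessed documents and labels, and the preprocessing helpers of Section 3.1 are not shown.

```python
# A minimal sketch of the classical pipeline of Section 3.2.1, assuming the
# documents have already been normalized, stopword-filtered and lemmatized.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC

features = FeatureUnion([
    # Word n-grams with n = 1..4.
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 4))),
    # Character n-grams with n = 2..6.
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 6))),
])

model = Pipeline([
    ("tfidf", features),
    # Keep the 20K best features according to the chi-squared statistic.
    ("select", SelectKBest(chi2, k=20000)),
    # Polynomial kernel of degree 1, the configuration that worked best for us.
    ("svm", SVC(kernel="poly", degree=1)),
])

model.fit(train_texts, train_labels)   # train_texts: list of preprocessed strings
predictions = model.predict(test_texts)
```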
3.2.2 Convolutional Neural Network
A convolutional neural network (CNN) for sentence classification is reported in [11]. We used a simplified version of this model in which no pre-trained word embeddings are used. Pre-trained word embeddings for Urdu, such as [12], are available, but we did not use them, mainly because, given the size of the task dataset, we hoped to learn good embeddings from the dataset itself. We used the following two variants:
1. A CNN model with four input channels, used for the results reported in the competition.
2. A CNN model with six input channels; while preparing this paper, we found that it gives comparable results.
The CNN model:
1. Each channel in the model is defined as:
1.1. An input layer.
1.2. An embedding layer set to the size of the vocabulary, producing 100-dimensional real-valued vectors.
1.3. A one-dimensional convolutional layer with 32 filters and a kernel size set to the number of words or characters to read at once (word or character n-grams, where n = k for channel k with k = 1, 2, 3, 4, 5, 6; i.e., channel 1 uses unigrams, channel 2 bigrams, channel 3 trigrams, and so on). Note that mixing word n-grams and character n-grams is not possible.
1.4. A max-pooling layer to consolidate the output of the convolutional layer.
1.5. A flatten layer to reduce the three-dimensional output to two dimensions for concatenation.
2. The outputs of the channels are concatenated into a single vector and processed by a dense layer and an output layer.
The model architecture with two channels for an example sentence is shown in Figure 1.
Figure 1: Model architecture with two channels for an example sentence, taken from [11].
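For concreteness, a minimal sketch of this multi-channel architecture using the Keras functional API is given below. It is illustrative rather than our exact implementation: the sequence length, dense-layer size and training settings are assumptions, and build_cnn is a hypothetical helper name.

```python
# A simplified sketch of the multi-channel CNN described above. It assumes
# integer-encoded, padded sequences of fixed length `seq_len` over a vocabulary
# of size `vocab_size` (produced at either the word or the character level).
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     Flatten, Dense, concatenate)
from tensorflow.keras.models import Model

def build_cnn(seq_len, vocab_size, kernel_sizes=(1, 2, 3, 4, 5, 6)):
    inputs, channels = [], []
    for k in kernel_sizes:
        inp = Input(shape=(seq_len,))
        emb = Embedding(vocab_size, 100)(inp)            # 100-dimensional embeddings
        conv = Conv1D(filters=32, kernel_size=k, activation="relu")(emb)
        pool = MaxPooling1D(pool_size=2)(conv)           # consolidate conv output
        flat = Flatten()(pool)                           # flatten for concatenation
        inputs.append(inp)
        channels.append(flat)
    merged = concatenate(channels)                       # single vector from all channels
    dense = Dense(10, activation="relu")(merged)         # dense-layer size is illustrative
    out = Dense(1, activation="sigmoid")(dense)          # binary Fake/Real output
    model = Model(inputs=inputs, outputs=out)
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

# Training would feed the same padded matrix to every channel, e.g.:
# model = build_cnn(seq_len=200, vocab_size=20000)
# model.fit([X_train] * 6, y_train, epochs=10, batch_size=16)
```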
4. Experiments and Results
The experiments were performed on a laptop with an 8th-generation Intel Core i7 processor and 8 GB of RAM. The models are produced by training a classifier on the training set, and the predictions are performed on the test set.

4.1 Experiment 1
In this experiment, we produce bag-of-words feature vectors from a combination of word n-grams (n=1,2,3,4) and character n-grams (n=2,3,4,5,6). Note that stopwords are removed and lemmatization is applied in addition to the basic preprocessing settings mentioned in Section 3.1. TF-IDF weighting is applied to obtain the feature vectors. The top results are shown in Table 1. We learned that:
• The best score (Table 1, row 7) is higher than the second-best score in the competition.
• Excluding n=1 for character n-grams improves the results.
• The optimal number of features in K-best selection is about 20K (see rows 4 to 8).
• The best combination of features is word n-grams n=1,2,3,4 and character n-grams n=2,3,4,5,6.

Table 1
2021 Fake News Detection Task with SVM (polynomial kernel, degree 1) and K-best feature selection; stopwords removed and lemmatization applied. Columns report precision, recall and F1 for the Fake and Real classes, followed by F1 Macro and accuracy. The best F1 Macro score is in row 7.

SN | K-Best | Prec (Fake) | Recall (Fake) | F1 (Fake) | Prec (Real) | Recall (Real) | F1 (Real) | F1 Macro | Accuracy
Word n-grams n=1,2 and char n-grams n=2,3,4,5,6 (total features: 1.177 million)
1 | 20K | 0.5974 | 0.46 | 0.5198 | 0.7578 | 0.845 | 0.7991 | 0.6594 | 0.7167
Word n-grams n=1,2,3 and char n-grams n=2,3,4,5 (total features: 0.87 million)
2 | 50K | 0.5542 | 0.46 | 0.5027 | 0.7512 | 0.815 | 0.7818 | 0.6423 | 0.6967
3 | 20K | 0.5647 | 0.48 | 0.5189 | 0.7581 | 0.815 | 0.7855 | 0.6522 | 0.7033
Word n-grams n=1,2,3,4 and char n-grams n=2,3,4,5,6 (total features: 1.557 million)
4 | 70K | 0.5833 | 0.42 | 0.4884 | 0.7456 | 0.85 | 0.7944 | 0.6414 | 0.7067
5 | 50K | 0.625 | 0.4 | 0.4878 | 0.7458 | 0.88 | 0.8073 | 0.6476 | 0.72
6 | 25K | 0.5949 | 0.47 | 0.5251 | 0.7602 | 0.84 | 0.7981 | 0.6616 | 0.7167
7 | 20K | 0.6104 | 0.47 | 0.5311 | 0.7623 | 0.85 | 0.8038 | 0.6674 | 0.7233
8 | 10K | 0.6324 | 0.43 | 0.5119 | 0.7543 | 0.875 | 0.8102 | 0.661 | 0.7267
Word n-grams n=1,2,3,4 and char n-grams n=3,4,5,6 (total features: 1.553 million)
9 | 20K | 0.625 | 0.45 | 0.5233 | 0.7588 | 0.865 | 0.8084 | 0.6658 | 0.7267

4.2 Experiment 2
In this experiment, we explored a non-exhaustive set of configurations along the following dimensions: (1) the number of channels (4, 5 and 6), and (2) character-level versus word-level sequences (n-grams), controlled through the kernel size of the convolutional layer in each channel. In all of these experiments, stopwords were removed and lemmatization was applied in addition to the basic preprocessing settings mentioned in Section 3.1. Note that it is not possible to combine word n-grams and character n-grams in our CNN implementation, mainly because we rely on the Keras Tokenizer class, which requires choosing either word sequences or character sequences as the basic building block of the model. The results are shown in Table 2. The CNN results are inferior to those achieved in Experiment 1. We think this is mainly due to the mid-range size of the dataset: CNN models need a massive number of training instances to outperform traditional models, and such a dataset is not available in our case.

Table 2
2021 Fake News Detection Task with a convolutional neural network; stopwords removed and lemmatization applied. The results reported in the competition are shown in row 1.

SN | Prec (Fake) | Recall (Fake) | F1 (Fake) | Prec (Real) | Recall (Real) | F1 (Real) | F1 Macro | Accuracy
Character-level CNN with 4 channels (kernel sizes 1, 2, 3, 4; one per channel)
1 | 0.48 | 0.49 | 0.49 | 0.74 | 0.74 | 0.74 | 0.611 | 0.653
Word-level CNN with 4 channels (kernel sizes 1, 2, 3, 4; one per channel)
2 | 0.49 | 0.58 | 0.53 | 0.77 | 0.7 | 0.73 | 0.629 | 0.656
Word-level CNN with 6 channels (kernel sizes 1, 2, 3, 4, 5, 6; one per channel)
3 | 0.45 | 0.77 | 0.57 | 0.82 | 0.53 | 0.64 | 0.603 | 0.606
Character-level CNN with 6 channels (kernel sizes 1, 2, 3, 4, 5, 6; one per channel)
4 | 0.47 | 0.66 | 0.55 | 0.79 | 0.64 | 0.70 | 0.627 | 0.643
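For reference, the per-class scores and the macro-averaged F1 reported in Tables 1 and 2 can be computed as in the following minimal sketch; the variables y_true and y_pred are illustrative names for the reference and predicted labels of the test set.

```python
# A minimal sketch of the evaluation step, assuming `y_true` (reference labels of
# the 300 test instances) and `y_pred` (predicted labels) are lists containing the
# strings "Fake" and "Real".
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

accuracy = accuracy_score(y_true, y_pred)
# Per-class precision, recall and F1 for the Fake and Real classes.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=["Fake", "Real"])
# Macro-averaged F1, the score used to compare models throughout this paper.
f1_macro = f1_score(y_true, y_pred, average="macro")
```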
Conclusion
In this work, we performed rigorous experimentation and, after the result declaration, achieved an F1 Macro score higher than the second-best score in the competition. We demonstrated that traditional models with good feature engineering can produce good results for a mid-sized dataset. In addition, neural network-based methods such as CNNs work reasonably well on the mid-sized dataset at hand. One way of improving the CNN results might be the use of pre-trained Urdu embeddings; however, such an investigation remains future work. Recent transfer-learning techniques such as BERT fine-tuning could also be investigated in the future, though obtaining a sufficiently large Urdu BERT model might be a challenge.

References
[1] M. Amjad, S. Butt, H. I. Amjad, A. Zhila, G. Sidorov and A. Gelbukh, "Overview of the shared task on fake news detection in Urdu at FIRE 2021," in CEUR Workshop Proceedings, 2021.
[2] M. Amjad, N. Ashraf, A. Zhila, G. Sidorov, A. Zubiaga and A. Gelbukh, "Threatening Language Detection and Threatening Target Identification in Urdu Tweets," IEEE Access, vol. 9, pp. 128302-128313, 2021.
[3] M. Amjad, G. Sidorov, A. Zhila, H. Gómez-Adorno, I. Voronkov and A. Gelbukh, "'Bend the truth': Benchmark dataset for fake news detection in Urdu language and its evaluation," Journal of Intelligent & Fuzzy Systems, vol. 39, no. 2, pp. 2457-2469, 2020.
[4] M. Humayoun, H. Hammarström and A. Ranta, "Urdu morphology, orthography and lexicon extraction," in Ali Farghaly & Karine Megerdoomian (eds.), Proceedings of the 2nd Workshop on Computational Approaches to Arabic Script-based Languages, pp. 59-68, LSA 2007 Linguistic Institute, Stanford University, California, USA, 2007.
[5] M. Humayoun, H. Hammarström and A. Ranta, "Implementing Urdu Grammar as Open Source Software," in Conference on Language and Technology, University of Peshawar, 2007.
[6] M. Humayoun and H. Yu, "Analyzing Preprocessing Settings for Urdu Single-document Extractive Summarization," in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 2016.
[7] M. Humayoun, R. M. A. Nawab, M. Uzair, S. Aslam and O. Farzand, "Urdu Summary Corpus," in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 796-800, Portorož, Slovenia, 2016.
[8] M. Humayoun and N. Akhtar, "CORPURES: Benchmark Corpus for Urdu Extractive Summaries and Experiments using Supervised Learning," Intelligent Systems with Applications, 2021.
[9] S. M. Virk, M. Humayoun and A. Ranta, "An Open-Source Urdu Resource Grammar," in Proceedings of the Eighth Workshop on Asian Language Resources, pp. 153-160, 2010.
[10] A. Gulzar, "Urdu Normalization Utility v1.0," Technical Report, Center for Language Engineering, Al-Khawarizmi Institute of Computer Science (KICS), University of Engineering and Technology, Lahore, Pakistan, 2007.
[11] Y. Kim, "Convolutional Neural Networks for Sentence Classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014.
[12] S. Haider, "Urdu Word Embeddings," in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018.