A Simple N-Gram Model for Urdu Fake News Detection
Hamada Nayel (1), Ghada Amer (2)
(1) Department of Computer Science, Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt
(2) Electrical Engineering Department, Faculty of Engineering, Benha University, Benha, Egypt


Abstract
Fake news on social media platforms is a critical issue, and detecting it is necessary. In this
paper, we describe the system submitted to the UrduFake@FIRE2021 shared task, which aims
to detect fake news in the Urdu language. We used a machine learning approach to build our
model: a linear classifier trained with the Stochastic Gradient Descent (SGD) optimization
algorithm. The proposed model achieved an F1-score of 67.9% and ranked first among all
submissions.

Keywords
Social Media Analysis, Fake News Detection, Linear Classifiers, ML Approach


1. Introduction
News that is intentionally and verifiably false is called fake news [1]. In this era, detecting fake
news is a critical task due to the tremendous spread of news over social media platforms.
People or organizations with specific backgrounds might fabricate and publish fake news for
unethical purposes [2]. Fake news can be used to insult and defame individuals, as well as to
obstruct social order, incite political unrest, or even undermine the peace and stability of the
international community. Worse still, research on the spread of fake news shows that it
diffuses significantly faster, deeper, and more widely than true news [3].

   It has been shown that fake news spreads exponentially, so intervening in the early
stages would greatly help in reducing the problem [4]. Fake news detection has attracted a
great deal of interest from both academic researchers and industry [5].

   This paper presents the system submitted to the UrduFake@FIRE2021 shared task [6], held in
conjunction with FIRE2021. The rest of the paper is organized as follows: Section 2 reviews
the related work, Section 3 describes the dataset, Section 4 describes the structure of our
system in detail, Section 5 presents and discusses the results obtained by the proposed model,
and Section 6 concludes with an outlook on future work.

Forum for Information Retrieval Evaluation, December 13-17, 2021, India
Email: hamada.ali@fci.bu.edu.eg (H. Nayel); ghada.amer@bhit.bu.edu.eg (G. Amer)
URL: https://bu.edu.eg/staff/hamadaali14 (H. Nayel); https://bu.edu.eg/staff/ghadaamer5 (G. Amer)
ORCID: 0000-0002-2768-4639 (H. Nayel); 0000-0001-6083-2376 (G. Amer)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073


Hamada Nayel et al. CEUR Workshop Proceedings                                                      1–6


2. Related Work
Researchers have conducted extensive research in fake news detection. An ensemble approach has
been used to develop a model for detecting fake news in Urdu [7]. Amjad et al. [8] used
Machine Translation (MT) for dataset augmentation: they merged English fake news translated
into Urdu with the original Urdu dataset [9]. The authors used the Support Vector Machines
(SVM) algorithm with character and word n-gram features to train the model. The model achieved
F1-scores ranging from 0.83 to 0.89, higher than the F1-score obtained for the dataset produced
through MT. Lin et al. [2] combined a RoBERTa model for word embeddings and a Convolutional
Neural Network (CNN) for character-level embeddings, along with label smoothing and ensemble
learning, to develop a deep learning model for Urdu fake news detection. Most
of the work in fake news detection has focused on English [10, 11]. Research efforts have also
been made in other languages, such as Arabic [12], Indonesian [13] and Italian [14].


3. Dataset
The dataset used in the shared task, named Bend-The-Truth, was distributed by the organizers
and is divided into training, development and test sets. It consists of news articles in six
different domains: technology, education, business, sports, politics, and entertainment [9].
The sources of real news are news channel websites, such as BBC Urdu News, CNN Urdu,
Express-News, Jung News, Naway Waqat, and many other reliable news websites, for the time
frame from January 2018 to December 2018. A very rigorous procedure was followed while
collecting the real news. The fake news articles, on the other hand, were intentionally written by
a group of journalists, each an expert in the corresponding topic. The fake news articles are in the
same domains and of approximately the same length as the real news articles. Full details of the
dataset are given in [9].


4. Methodology
The UrduFake task is modeled as a binary classification task. Given a set of news articles in
Urdu, $N = \{n_1, n_2, n_3, \ldots\}$, the task aims at assigning a label from a predefined set
$L = \{F, R\}$ to each news article. The label $F$ refers to a news article that is fake, while
$R$ refers to a news article that is real.

   A vector space model has been used to represent each news article, and the weighting scores
for unique tokens were calculated by Term Frequency / Inverse Document Frequency (TF/IDF)
[15]. TF/IDF has been used effectively in native language identification [15], offensive language
detection [16, 17], irony detection [18] and author profiling [19, 20]. To evaluate the effect of
N-gram models, a wide range of N-gram models was generated and used along with
TF/IDF for building different systems. A set of classification algorithms, namely Multinomial
Naive Bayes, SVM, a linear classifier and Multi-Layer Perceptron (MLP), was used for training
the model.
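
   As an illustration of the representation described above, the following minimal sketch
computes TF/IDF weights over word n-grams using only the Python standard library. It is not
the system's actual implementation: the function names (word_ngrams, tfidf_vectors), the
placeholder documents, and the textbook weighting tf * log(N / df) are assumptions made for
this example, and the exact TF/IDF variant used in our experiments may differ.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Split on whitespace and return all word n-grams with n_min <= n <= n_max."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(docs, n_min=1, n_max=2):
    """Return one sparse {ngram: weight} dict per document.

    Assumed weighting (illustration only): raw term frequency
    multiplied by log(N / df), where N is the number of documents
    and df is the number of documents containing the n-gram.
    """
    grams_per_doc = [word_ngrams(doc, n_min, n_max) for doc in docs]

    # Document frequency: count each n-gram once per document.
    df = Counter()
    for grams in grams_per_doc:
        df.update(set(grams))

    n_docs = len(docs)
    vectors = []
    for grams in grams_per_doc:
        tf = Counter(grams)
        vectors.append({g: tf[g] * math.log(n_docs / df[g]) for g in tf})
    return vectors


# Placeholder documents; real input would be Urdu news articles.
docs = ["a b c", "a b d", "a e f"]
vectors = tfidf_vectors(docs, n_min=1, n_max=2)
for vector in vectors:
    print(vector)
```

   In this toy input the token "a" occurs in every document, so its weight is zero, while
n-grams that occur in a single document receive the largest weights.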




4.1. Model Structure
The structure of our model is shown in Figure 1. The first phase of our model is feature
extraction, in which TF/IDF is applied to extract the features of the input data. The
next phase is training the model; in this phase, we tried different algorithms and evaluated
them on the development set. The final phase is producing the model and applying it to the
blind test set to get the final output.


Figure 1: Model Structure


4.2. Experimental Setup
A simple tokenization technique based on the white space character has been used. In feature
extraction, tokens have been used without any preprocessing, which increases the number of
features. TF/IDF features have been extracted for uni-gram, bi-gram and tri-gram models. SVM
with a linear kernel, a linear classifier with SGD training, and MLP with different numbers of
nodes have been implemented and evaluated on the development set.
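
To make the idea of a linear classifier trained with SGD concrete, the sketch below updates a
weight vector one training example at a time. It is an illustration under stated assumptions,
not our submitted system: the hinge loss, the fixed learning rate, the absence of
regularization, the binary placeholder features, and the names train_sgd_linear, score and
predict are all choices made for this example.

```python
import random


def score(weights, bias, features):
    """Linear decision function over a sparse {feature: value} dict."""
    return bias + sum(weights.get(name, 0.0) * value
                      for name, value in features.items())


def train_sgd_linear(samples, labels, epochs=20, learning_rate=0.1, seed=0):
    """Train a linear classifier with SGD on the hinge loss (assumed loss).

    samples: list of sparse feature dicts; labels: +1 (fake) or -1 (real).
    """
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a random order each epoch
        for i in order:
            features, y = samples[i], labels[i]
            # The hinge loss has a non-zero gradient only when the margin is below 1.
            if y * score(weights, bias, features) < 1.0:
                for name, value in features.items():
                    weights[name] = weights.get(name, 0.0) + learning_rate * y * value
                bias += learning_rate * y
    return weights, bias


def predict(weights, bias, features):
    """Return +1 (fake) or -1 (real)."""
    return 1 if score(weights, bias, features) > 0 else -1


# Placeholder bi-gram features; real features would be TF/IDF weights.
samples = [
    {"w1 w2": 1.0, "w2 w3": 1.0},  # fake
    {"w1 w2": 1.0, "w4 w5": 1.0},  # fake
    {"w6 w7": 1.0, "w7 w8": 1.0},  # real
    {"w6 w7": 1.0, "w9 w0": 1.0},  # real
]
labels = [1, 1, -1, -1]

weights, bias = train_sgd_linear(samples, labels)
print([predict(weights, bias, s) for s in samples])
```

Because each update touches only the features present in the current example, this style of
training scales well to the large, sparse feature sets produced by n-gram extraction.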


5. Results and Discussion
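In the model development phase, we trained the different models on the training set and
evaluated them on the development set; the results are given in Table 1. The results show that
the SGD classifier outperforms SVM for all n-gram models and outperforms MLP for the tri-gram
model; for the bi-gram model, SGD and MLP tie on F1-Macro (0.752) while SGD has the higher
accuracy, and for the uni-gram model MLP performs best. We decided to submit the output of the
SGD classifier. Table 2 shows the results of applying the different algorithms with the
different n-gram models on the test set. It is clear that the best-performing classifier is SGD
with the bi-gram model, with an F1-Macro of 0.679. It also secured the first rank among
all the participants.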
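
  For reference, the two metrics reported in the tables can be computed as in the following
sketch. The function name macro_f1_and_accuracy and the toy label sequences are assumptions
made for illustration; they are not taken from the shared task data. The labels F and R follow
the notation of Section 4.

```python
def macro_f1_and_accuracy(y_true, y_pred, labels=("F", "R")):
    """Return (macro-averaged F1, accuracy) for two label sequences."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    # Macro averaging gives every class the same weight, regardless of its size.
    macro_f1 = sum(f1_scores) / len(f1_scores)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return macro_f1, accuracy


# Toy example: six real articles, two fake ones, one fake article missed.
y_true = ["R", "R", "R", "R", "R", "R", "F", "F"]
y_pred = ["R", "R", "R", "R", "R", "R", "R", "F"]
macro_f1, accuracy = macro_f1_and_accuracy(y_true, y_pred)
print(round(macro_f1, 3), round(accuracy, 3))
```

  In this toy example the accuracy is higher than the macro-averaged F1, because the single
missed fake article lowers the F1 of the smaller class considerably while changing the
accuracy only slightly.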

  The proposed model used different machine learning classification algorithms, and the
best-performing algorithm was used for the output submission. TF/IDF with a wide range of n-gram
models was used to extract the features for training the model, producing a rich set of
features.

Table 1
Results on the development set
          N-gram     Classifier   F1-Macro   Accuracy
          Uni-gram   SVM          0.688      0.737
          Uni-gram   SGD          0.741      0.756
          Uni-gram   MLP          0.766      0.786
          Bi-gram    SVM          0.626      0.695
          Bi-gram    SGD          0.752      0.767
          Bi-gram    MLP          0.752      0.756
          Tri-gram   SVM          0.611      0.683
          Tri-gram   SGD          0.717      0.737
          Tri-gram   MLP          0.699      0.729


Table 2
Results on the test set
          N-gram     Classifier   F1-Macro   Accuracy
          Uni-gram   SVM          0.538      0.693
          Uni-gram   SGD          0.677      0.737
          Uni-gram   MLP          0.644      0.737
          Bi-gram    SVM          0.549      0.710
          Bi-gram    SGD          0.679      0.757
          Bi-gram    MLP          0.630      0.737
          Tri-gram   SVM          0.542      0.707
          Tri-gram   SGD          0.677      0.757
          Tri-gram   MLP          0.643      0.743


6. Conclusion
Our system represents text with TF/IDF, which is very basic and simple. Using a richer
representation such as word embeddings may improve the performance of the model. In
addition, adding a preprocessing step may enhance the accuracy of the system.


References
 [1] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data
     mining perspective, SIGKDD Explor. Newsl. 19 (2017) 22–36. URL:
     https://doi.org/10.1145/3137597.3137600. doi:10.1145/3137597.3137600.




 [2] N. Lin, S. Fu, S. Jiang, Fake news detection in the urdu language using charcnn-roberta,
     in: P. Mehta, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2020 -
     Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020,
     volume 2826 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 447–451. URL:
     http://ceur-ws.org/Vol-2826/T3-2.pdf.
 [3] S. Vosoughi, D. Roy, S. Aral, The spread of true and false news online, Science 359 (2018)
     1146–1151. URL: https://www.science.org/doi/abs/10.1126/science.aap9559.
     doi:10.1126/science.aap9559.
 [4] A. Peck, A problem of amplification: Folklore and fake news in the age of social media, The
     Journal of American Folklore 133 (2020) 329–351. doi:10.5406/jamerfolk.133.529.0329.
 [5] D. M. J. Lazer, M. A. Baum, Y. Benkler, A. J. Berinsky, K. M. Greenhill, F. Menczer, M. J.
     Metzger, B. Nyhan, G. Pennycook, D. Rothschild, M. Schudson, S. A. Sloman, C. R. Sunstein,
     E. A. Thorson, D. J. Watts, J. L. Zittrain, The science of fake news, Science 359 (2018)
     1094–1096. URL: https://www.science.org/doi/abs/10.1126/science.aao2998.
     doi:10.1126/science.aao2998.
 [6] A. Maaz, S. Butt, H. I. Amjad, A. Zhila, G. Sidorov, A. Gelbukh, Overview of the shared
     task on fake news detection in urdu at fire 2021, in: Working Notes of FIRE 2021 - Forum
     for Information Retrieval Evaluation, CEUR Workshop Proceedings, CEUR-WS.org, 2021.
 [7] F. Balouchzahi, H. L. Shashirekha, Learning models for urdu fake news detection, in:
     P. Mehta, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2020 - Forum for
     Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826
     of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 474–479. URL:
     http://ceur-ws.org/Vol-2826/T3-7.pdf.
 [8] M. Amjad, G. Sidorov, A. Zhila, Data augmentation using machine translation for fake
     news detection in the Urdu language, in: Proceedings of the 12th Language Resources
     and Evaluation Conference, European Language Resources Association, Marseille, France,
     2020, pp. 2537–2542. URL: https://aclanthology.org/2020.lrec-1.309.
 [9] M. Amjad, G. Sidorov, A. Zhila, H. Gómez-Adorno, I. Voronkov, A. F. Gelbukh, "Bend the
     truth": Benchmark dataset for fake news detection in urdu language and its evaluation,
     Journal of Intelligent & Fuzzy Systems 39 (2020) 2457–2469. URL:
     https://doi.org/10.3233/JIFS-179905. doi:10.3233/JIFS-179905.
[10] F. M. R. Pardo, A. Giachanou, B. Ghanem, P. Rosso, Overview of the 8th author profiling
     task at PAN 2020: Profiling fake news spreaders on twitter, in: L. Cappellato, C. Eickhoff,
     N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the
     Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR
     Workshop Proceedings, CEUR-WS.org, 2020. URL:
     http://ceur-ws.org/Vol-2696/paper_267.pdf.
[11] V. Pérez-Rosas, B. Kleinberg, A. Lefevre, R. Mihalcea, Automatic detection of fake news,
     in: Proceedings of the 27th International Conference on Computational Linguistics,
     Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 3391–3401.
     URL: https://aclanthology.org/C18-1287.
[12] M. Alkhair, K. Meftouh, K. Smaïli, N. Othman, An arabic corpus of fake news: Collection,
     analysis and classification, in: K. Smaïli (Ed.), Arabic Language Processing: From Theory
     to Practice, Springer International Publishing, Cham, 2019, pp. 292–302.




[13] I. Y. R. Pratiwi, R. A. Asmara, F. Rahutomo, Study of hoax news detection using naïve bayes
     classifier in indonesian language, in: 2017 11th International Conference on Information
     Communication Technology and System (ICTS), 2017, pp. 73–78.
     doi:10.1109/ICTS.2017.8265649.
[14] F. Pierri, A. Artoni, S. Ceri, Investigating italian disinformation spreading on twitter in
     the context of 2019 european elections, PloS one 15 (2020). URL:
     https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0227821.
[15] H. A. Nayel, H. L. Shashirekha, Mangalore-University@INLI-FIRE-2017: Indian Native
     Language Identification using Support Vector Machines and Ensemble Approach, in:
     P. Majumder, M. Mitra, P. Mehta, J. Sankhavara (Eds.), Working notes of FIRE 2017 -
     Forum for Information Retrieval Evaluation, Bangalore, India, December 8-10, 2017,
     volume 2036 of CEUR Workshop Proceedings, CEUR-WS.org, 2017, pp. 106–109. URL:
     http://ceur-ws.org/Vol-2036/T4-2.pdf.
[16] H. Nayel, NAYEL at SemEval-2020 task 12: TF/IDF-based approach for automatic offensive
     language detection in Arabic tweets, in: Proceedings of the Fourteenth Workshop on
     Semantic Evaluation, International Committee for Computational Linguistics, Barcelona
     (online), 2020, pp. 2086–2089. URL: https://aclanthology.org/2020.semeval-1.276.
[17] A. Allam, H. Abdallah, E. Amer, H. Nayel, Machine learning-based model for sentiment
     and sarcasm detection, in: Proceedings of the Sixth Arabic Natural Language Processing
     Workshop, Association for Computational Linguistics, Kyiv, Ukraine (Virtual), 2021, pp.
     386–389. URL: https://aclanthology.org/2021.wanlp-1.51.
[18] H. A. Nayel, W. Medhat, M. Rashad, BENHA@IDAT: Improving Irony Detection in Arabic
     Tweets using Ensemble Approach, in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.),
     Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India,
     December 12-15, 2019, volume 2517 of CEUR Workshop Proceedings, CEUR-WS.org, 2019,
     pp. 401–408. URL: http://ceur-ws.org/Vol-2517/T4-3.pdf.
[19] H. A. Nayel, NAYEL@APDA: Machine Learning Approach for Author Profiling and
     Deception Detection in Arabic Texts, in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.),
     Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India,
     December 12-15, 2019, volume 2517 of CEUR Workshop Proceedings, CEUR-WS.org, 2019,
     pp. 92–99. URL: http://ceur-ws.org/Vol-2517/T2-3.pdf.
[20] M. Sobhi, A. Hassan, A. El-Sawy, H. Nayel, Machine learning-based approach for Arabic
     dialect identification, in: Proceedings of the Sixth Arabic Natural Language Processing
     Workshop, Association for Computational Linguistics, Kyiv, Ukraine (Virtual), 2021, pp.
     287–290. URL: https://aclanthology.org/2021.wanlp-1.34.

