Detecting Fake News in Urdu using Classical Supervised Machine Learning Methods and Word/Char N-grams

Yaakov HaCohen-Kerner, Natan Manor, Netanel Bashan, and Elyasaf Dimant
Computer Science Department, Jerusalem College of Technology, Jerusalem 9116001, Israel

Abstract
In this paper, we describe our submissions for the UrduFake 2021 track. We tackled the task entitled "Fake News Detection in the Urdu Language". We developed different models using three classical supervised machine learning methods: Support Vector Classifier, Random Forest, and Logistic Regression. Our machine learning models were applied to various sets of character or word n-gram features. Our best submission was an SVC model using 7,500 char trigrams. This model was ranked in 11th place out of the 34 teams that participated in the track.

Keywords
Fake news, supervised machine learning, word/char n-grams

1. Introduction
"Fake News is a term used to represent fabricated news or propaganda comprising misinformation communicated through traditional media channels like print and television, as well as non-traditional media channels like social media" [1]. In previous years, fake news has been used to influence politics and promote advertising. During the last two years, the phenomenon of fake news has appeared dramatically in the field of coronavirus news. Fake news poses various dangers, such as incorrect (and sometimes even harmful) advice, social disorder, fear, panic, and hatred of population groups. Fake news spreads quickly and easily via various social media platforms (e.g., Facebook and Twitter). The large amount of fake news in social media poses a huge challenge to the research community. Therefore, there is a need for high-quality systems that can detect fake news in social media. Such systems will help to improve people's protection and security.
One of the recent results of this challenge was the organization of several fake news detection shared tasks in different languages, such as Constraint@AAAI2021 in English [2], FakeDeS 2021 in Spanish [3], and the Author Profiling Task at PAN 2020 in English and Spanish [4]. In 2020, the first shared task on fake news detection in Urdu was arranged [5-6]. The current shared task is the second shared task on fake news detection in Urdu [7-8]. In these shared tasks, researchers presented various models that combined natural language processing (NLP) and machine learning (ML) to detect fake news.

The structure of the rest of the paper is as follows. Section 2 introduces general background about fake news detection, NLP in Urdu, and text preprocessing. Section 3 describes the UrduFake 2021 task and datasets. In Section 4, we present the applied models and their experimental results. Section 5 summarizes and suggests ideas for future research.

Forum for Information Retrieval Evaluation, December 13-17, 2021, India
EMAIL: kerner@jct.ac.il; natanmanor@gmail.com; netanelb56@gmail.com; elyasafdi@gmail.com
ORCID: 0000-0002-4834-1272
© 2021 Copyright for this paper by the Forum for Information Retrieval Evaluation, December 13-17, 2021, India. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Related Work

2.1 Fake news detection
Posadas-Durán et al. [9] built a new fake news corpus for the Spanish language. This corpus contains 971 news articles collected from January to July 2018, divided into 491 real and 480 fake news articles. The corpus covers news from nine different topics: Science, Sport, Economy, Education, Entertainment, Politics, Health, Security, and Society. The resource is freely available at https://github.com/jpposadas/FakeNewsCorpusSpanish.
In addition, the authors trained four well-known classification methods on various lexical features: BOW, POS tags, n-grams (with n varying from 3 to 5), and n-gram combinations. The highest accuracy result, 0.7694, was obtained by Random Forest applied to BOW and POS features.

Shu et al. [10] explored the problem of exploiting social context for fake news detection. They proposed TriFN, a tri-relationship embedding framework that models publisher-news relations and user-news interactions simultaneously for fake news classification. Their experiments on two real-world datasets demonstrate that the proposed approach significantly outperforms other baseline methods, e.g., RST, Castillo, and LIWC, for fake news detection. In another study, Shu et al. [11] described their tool called FakeNewsTracker, which can automatically collect data for news pieces and their social context, benefiting further research on understanding and predicting fake news with effective visualization techniques. A systematic literature review on approaches to identify fake news is presented in [12]. The authors present the main approaches currently available to identify fake news and how these approaches can be applied in different situations.

2.2 NLP in Urdu
Amjad et al. [13] investigated whether machine translation from English to Urdu can be applied as a text data augmentation method to expand the limited annotated resources for Urdu. Their empirical results show that, at its current stage, the machine translation quality for this language pair does not enable efficient automated data augmentation, in particular for fake news detection, which is regarded as a relatively high-level task. Detection of threatening language and target identification in Twitter messages written in Urdu is described by Amjad et al. [14]. In this paper, the authors introduced a dataset that contains 3,564 Twitter messages manually annotated by human experts as either threatening or non-threatening.
The threatening tweets are further classified by target into one of two types: threatening to a person or threatening to a group. Extensive experiments using various machine learning (ML) methods, including deep learning classifiers, showed that the best threatening language detection was achieved by an MLP classifier with a combination of word n-grams, and the best target identification was achieved by an SVM classifier using fastText pre-trained word embeddings.

2.3 Text preprocessing
An important component for the success of the text classification (TC) process is the preprocessing component. In many cases, preprocessing can "clean" the data and improve its quality. There are various basic types of preprocessing methods, e.g., conversion of uppercase letters into lowercase letters, HTML tag removal, punctuation mark removal, and stop-word removal. HaCohen-Kerner et al. [15] investigated the impact of all possible combinations of six preprocessing methods (spelling correction, HTML tag removal, converting uppercase letters into lowercase letters, punctuation mark removal, reduction of repeated characters, and stopword removal) on TC in three benchmark mental disorder datasets. In one dataset, the best result showed a significant improvement over the baseline when all six preprocessing methods were applied. In the other two datasets, several combinations of preprocessing methods showed minimal improvements over the baseline results. In another study, HaCohen-Kerner et al. [16] explored the influence of various combinations of the same six basic preprocessing methods on TC in four general benchmark text corpora using a bag-of-words representation. The general conclusion was that it is always advisable to perform an extensive and systematic variety of preprocessing methods, combined with TC experiments, because this contributes to improving TC accuracy.

3. Task and Dataset Description
The 2021 shared task on fake news detection in Urdu [7-8] addresses the problem of "Fake News Detection in the Urdu Language". This task is a coarse-grained binary classification task in which participating systems are required to classify tweets into two classes: Real and Fake. The Urdu fake news dataset [17] is composed of news articles from six different domains: business, education, entertainment, politics, sports, and technology. The real news was collected from several mainstream Urdu news websites in Pakistan, India, the UK, and the USA. The fake news was intentionally written by a group of professional journalists, each proficient in the corresponding topics. The fake news covers the same domains and has approximately the same length as the real news. General statistics about the training dataset that we used (available at https://github.com/MaazAmjad/Urdu-Fake-news-detection-FIRE2021/blob/main/Training%20Dataset%40FIRE2021.zip) are provided in Table 1. This training dataset is divided into a training sub-dataset and a test sub-dataset, each containing both real and fake news.

Table 1
General statistics about the training dataset

             Training sub-dataset   Test sub-dataset   Total
Real news    600                    150                750
Fake news    438                    112                550
Total        1,038                  262                1,300

4. Applied Models and their Experimental Results
We used the training dataset described in the previous section according to its given split. Due to time limitations, we applied only one preprocessing method (converting uppercase letters into lowercase letters) and only three classical supervised ML methods: Support Vector Classifier (SVC), Random Forest (RF), and Logistic Regression (LR), using classical features such as character n-gram features and word n-gram features.

SVC is a variant of the support vector machine (SVM) ML method [18] implemented in Scikit-learn. SVC uses LibSVM [19], a fast implementation of the SVM method. SVM is a supervised ML method that, given training data, classifies vectors in a feature space into one of two sets.
It operates by constructing the optimal hyperplane dividing the two sets, either in the original feature space or in a higher-dimensional kernel space.

Random forest (RF) is an ensemble learning method for classification and regression [20]. Ensemble methods use multiple learning algorithms to obtain better predictive performance than can be obtained from any of the constituent learning algorithms. RF operates by constructing a multitude of decision trees at training time and outputting a classification for the case at hand. RF combines Breiman's "bagging" (bootstrap aggregating) idea [21] with the random selection of features introduced by Ho [22] to construct a forest of decision trees.

Logistic Regression (LR) [23-24] is a linear model for classification. It is also known as maximum entropy regression (MaxEnt), logit regression, and the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

These ML methods were applied using the following tools: the Python 3.7.3 programming language and Scikit-learn, a Python library for ML methods. In our experiments, we tested dozens of TC models. As mentioned above, we applied the three supervised ML methods to various combinations of character and/or word n-gram features. Under the username Elyasafdi, we submitted the three models described in Table 2. The models in Table 2 are sorted by their accuracy results. The best model was SVC applied to 7,500 char trigrams. This model was ranked in 11th place out of 34 teams.
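The kind of pipeline described above (character trigram counts capped at 7,500 features, fed to an SVC, with lowercasing as the only preprocessing step) can be sketched with Scikit-learn as follows. This is an illustrative reconstruction, not our exact submission code: the dummy strings stand in for the Urdu articles, and hyperparameters such as the SVC kernel and C are left at Scikit-learn defaults as an assumption.

```python
# Sketch of a char-trigram + SVC pipeline (assumed defaults, dummy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

# Placeholder texts; in the shared task these were Urdu news articles.
train_texts = ["example fake article text", "example real article text"]
train_labels = [1, 0]          # 1 = fake, 0 = real
test_texts = ["another example article"]
test_labels = [0]

# Character trigrams, capped at the 7,500 most frequent n-grams;
# lowercasing (our only preprocessing step) is CountVectorizer's default.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3),
                             max_features=7500, lowercase=True)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = SVC()                    # kernel and C left at defaults (assumption)
clf.fit(X_train, train_labels)
pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(test_labels, pred))
print("Macro F1:", f1_score(test_labels, pred, average="macro"))
```

Swapping `SVC()` for `RandomForestClassifier()` or `LogisticRegression()`, or changing `analyzer`/`ngram_range`, reproduces the other feature/method combinations we experimented with.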
Our main results were an F-Measure of 0.550 (while the F-Measure results of the teams ranked in 9th and 10th place were 0.592 and 0.590, respectively) and an Accuracy of 0.703 (while the Accuracy results of the teams ranked in 9th and 10th place were much lower than ours, 0.65 and 0.590, respectively). Table 2 provides detailed results for the three submitted models on the competition test dataset (available at https://github.com/MaazAmjad/Urdu-Fake-news-detection-FIRE2021/blob/main/Test%20Dataset%20%40%20FIRE%202021.zip; nine leftmost columns) and the training dataset (two rightmost columns).

Table 2
Detailed results for the three submitted models on the test and training sub-datasets

                            Results on the Competition Test Dataset                          Results on the Training Dataset
                            Fake class                    Real class
Model                       Precision  Recall  F1         Precision  Recall  F1       F1 Macro  Accuracy   F1 Macro  Accuracy
SVC - 7,500 char trigrams   0.720      0.180   0.288      0.701      0.965   0.812    0.550     0.703      0.832     0.793
SVC - 4,000 char trigrams   0.633      0.190   0.292      0.700      0.945   0.804    0.548     0.693      0.806     0.759
SVC - 2,533 char bigrams    0.667      0.100   0.174      0.684      0.975   0.804    0.489     0.683      0.834     0.793

As can be seen from Table 2, our results on the training dataset (F-Measure of 0.832 and Accuracy of 0.793) were significantly higher than our results on the competition test dataset (F-Measure of 0.550 and Accuracy of 0.703). Possible explanations for these significant differences are: (1) the training dataset is more balanced (550 fake and 750 real news items) than the competition test dataset (100 fake and 200 real news items), and (2) the content of a relatively high number of news items in the competition test dataset is fundamentally different from the content of the news in the training dataset.

5. Conclusions and Future Work
In this paper, we described our submitted models for the UrduFake 2021 track, which addresses the detection of fake news in the Urdu language.
We applied three classical ML methods (SVC, RF, and LR) to various sets of character and/or word n-gram features. The best submitted model was an SVC model applied to 7,500 char trigrams. This model obtained an F-Measure of 0.550 and an Accuracy of 0.703, and it was ranked in 11th place out of 34 teams. Potential future ideas include the application of: various deep learning models; acronym disambiguation [25-26]; skip character n-grams, which can serve as generalized n-grams [27]; stylistic feature sets [28]; key phrases [29]; and summaries [30].

Acknowledgments
We are grateful to the anonymous reviewers and the organizers for their fruitful comments and suggestions.

6. References
[1] A. Thota, P. Tilak, S. Ahluwalia, N. Lohia, Fake news detection: a deep learning approach, SMU Data Science Review, 1(3) (2018), Article 10.
[2] P. Patwa, M. Bhardwaj, V. Guptha, G. Kumari, S. Sharma, S. Pykl, ..., T. Chakraborty, Overview of CONSTRAINT 2021 shared tasks: Detecting English COVID-19 fake news and Hindi hostile posts, In International Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation, 2021, pp. 42-53, Springer, Cham.
[3] H. Gómez-Adorno, J. P. Posadas-Durán, G. Bel-Enguix, C. Porto, Overview of FakeDeS task at IberLEF 2020: Fake news detection in Spanish, Procesamiento del Lenguaje Natural, 67(0) (2021).
[4] F. Rangel, A. Giachanou, B. Ghanem, P. Rosso, Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter, In: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (eds.), CLEF 2020 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2020.
[5] M. Amjad, G. Sidorov, A. Zhila, A. F. Gelbukh, P. Rosso, Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2020, In FIRE (Working Notes), 2020, pp. 434-446.
[6] M. Amjad, G. Sidorov, A. Zhila, A. F. Gelbukh, P. Rosso, UrduFake@FIRE2020: Shared Track on Fake News Identification in Urdu, In Forum for Information Retrieval Evaluation, 2020, pp. 37-40.
[7] M. Amjad, S. Butt, H. I. Amjad, A. Zhila, G. Sidorov, A. Gelbukh, UrduFake@FIRE2021: Shared Track on Fake News Identification in Urdu, In Forum for Information Retrieval Evaluation, 2021.
[8] M. Amjad, S. Butt, H. I. Amjad, A. Zhila, G. Sidorov, A. Gelbukh, Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2021, In CEUR Workshop Proceedings, 2021.
[9] J. P. Posadas-Durán, H. Gómez-Adorno, G. Sidorov, J. J. M. Escobar, Detection of fake news in a new corpus for the Spanish language, Journal of Intelligent & Fuzzy Systems, 36(5) (2019) 4869-4876.
[10] K. Shu, S. Wang, H. Liu, Beyond News Contents: The Role of Social Context for Fake News Detection, In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19), 2019.
[11] K. Shu, D. Mahudeswaran, H. Liu, FakeNewsTracker: a tool for fake news collection, detection, and visualization, Computational and Mathematical Organization Theory, 25(1) (2019) 60-71.
[12] D. De Beer, M. Matthee, Approaches to identify fake news: a systematic literature review, In International Conference on Integrated Science, pp. 13-22, Springer, Cham, 2020.
[13] M. Amjad, G. Sidorov, A. Zhila, Data augmentation using machine translation for fake news detection in the Urdu language, In Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 2537-2542.
[14] M. Amjad, N. Ashraf, A. Zhila, G. Sidorov, A. Zubiaga, A. Gelbukh, Threatening Language Detection and Target Identification in Urdu Tweets, accepted for publication in IEEE Access, 2021.
[15] Y. HaCohen-Kerner, Y. Yigal, D. Miller, The Impact of Preprocessing on Classification of Mental Disorders, In Proceedings of the 19th Industrial Conference on Data Mining (ICDM 2019), New York, 2019.
[16] Y. HaCohen-Kerner, D. Miller, Y. Yigal, The influence of preprocessing on text classification using a bag-of-words representation, PLoS ONE, 15(5) (2020) e0232525.
[17] M. Amjad, G. Sidorov, A. Zhila, H. Gómez-Adorno, I. Voronkov, A. Gelbukh, "Bend the truth": Benchmark dataset for fake news detection in Urdu language and its evaluation, Journal of Intelligent & Fuzzy Systems, 39(2) (2020) 2457-2469.
[18] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning, 20 (1995) 273-297.
[19] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST), 2 (2011) 1-27.
[20] L. Breiman, Random forests, Machine Learning, 45(1) (2001) 5-32.
[21] L. Breiman, Bagging predictors, Machine Learning, 24(2) (1996) 123-140.
[22] T. K. Ho, Random decision forests, In Proceedings of the 3rd International Conference on Document Analysis and Recognition, 1995, Vol. 1, pp. 278-282, IEEE.
[23] D. R. Cox, The regression analysis of binary sequences, Journal of the Royal Statistical Society: Series B (Methodological), 20 (1958) 215-232.
[24] D. W. Hosmer Jr., S. Lemeshow, R. X. Sturdivant, Applied Logistic Regression, Vol. 398, John Wiley & Sons, 2013.
[25] Y. HaCohen-Kerner, A. Kass, A. Peretz, Combined one sense disambiguation of abbreviations, In Proceedings of ACL-08: HLT, Short Papers, Association for Computational Linguistics, Columbus, Ohio, 2008, pp. 61-64, URL: https://aclanthology.org/P08-2.
[26] Y. HaCohen-Kerner, A. Kass, A. Peretz, HAADS: A Hebrew Aramaic abbreviation disambiguation system, Journal of the American Society for Information Science and Technology, 61(9) (2010) 1923-1932.
[27] Y. HaCohen-Kerner, Z. Ido, R. Ya'akobov, Stance classification of tweets using skip char Ngrams, In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2017, pp. 266-278, Springer, Cham.
[28] Y. HaCohen-Kerner, H. Beck, E. Yehudai, M. Rosenstein, D. Mughaz, Cuisine: Classification using stylistic feature sets and/or name-based feature sets, Journal of the American Society for Information Science and Technology, 61(8) (2010) 1644-1657.
[29] Y. HaCohen-Kerner, I. Stern, D. Korkus, E. Fredj, Automatic machine learning of keyphrase extraction from short HTML documents written in Hebrew, Cybernetics and Systems: An International Journal, 38(1) (2007) 1-21.
[30] Y. HaCohen-Kerner, E. Malin, I. Chasson, Summarization of Jewish law articles in Hebrew, In Proceedings of the 16th International Conference on Computer Applications in Industry and Engineering (CAINE), Las Vegas, Nevada, USA, 2003, pp. 172-177.