Urdu Fake News Detection using Generalized Autoregressors

Abdullah Faiz Ur Rahman Khilji, Sahinur Rahman Laskar, Partha Pakray and Sivaji Bandyopadhyay

Department of Computer Science and Engineering, National Institute of Technology Silchar, Assam, India

Abstract
Automated fake news detection has become vital in today’s digital age. Differentiating legitimate news from fake news has become an important classification challenge in natural language processing (NLP). Transformer-based deep learning approaches have been widely adopted by the research community due to their outstanding performance. We participated in the 2020 Fake News Detection Challenge in the Urdu Language, organized by the Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico, and ranked second. In this work, we implement a generalized autoregressor based model to classify news as fake or real. We achieve an overall accuracy of 0.8400 and an F1 macro score of 0.8370.

Keywords
Fake News Detection, Classification, Autoregressors, XLNet

1. Introduction

The age of the internet has led to a massive free flow of information. This information is easily accessible through widespread virtualization, which lowers its cost. The increased accessibility has led to such widespread use of the services the internet offers that it has become the principal source of information for many people. Because the internet is ubiquitous and global, it is difficult for any central authority to control it without defeating its purpose. As we move more and more of our lives to the virtual world, the importance of detecting fake news has increased manifold, as fake news has been shown to have a negative impact on society [1]. There are various instances where fake news has been used intentionally [2] to spread misinformation.
This deliberate spread has been amplified in the era of social media, where individuals readily express their views without checking the facts; one such example is given by [3]. It has also been shown that fake news spreads exponentially [3], so early intervention could greatly help in curtailing the problem. Hence, there is a need to automate fake news detection. Fake news detection has also garnered a great deal of attention from both industry and academia [4].

Forum for Information Retrieval Evaluation 2020, December 16-20, 2020, Hyderabad, India
email: abdullah_ug@cse.nits.ac.in (A. F. U. R. Khilji); sahinur_ug@cse.nits.ac.in (S. R. Laskar); partha@cse.nits.ac.in (P. Pakray); sivaji.cse.ju@gmail.com (S. Bandyopadhyay)
url: https://abdullahkhilji.github.io/ (A. F. U. R. Khilji); https://sahinurlaskar.github.io/ (S. R. Laskar); http://cs.nits.ac.in/partha/ (P. Pakray)
orcid: 0000-0001-6621-1810 (A. F. U. R. Khilji); 0000-0002-8413-2718 (S. R. Laskar); 0000-0003-3834-5154 (P. Pakray)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Our work focuses on fake news detection for the Urdu language. We treat fake news detection as a binary classification problem. To train our models, we use XLNet [5], which is based on generalized autoregressive pre-training for language understanding. We have also created a monolingual corpus for the Urdu language and developed pre-trained language models to give an impetus to various downstream NLP tasks in Urdu. Our work is based on the dataset and the task provided by the organizers of 2020 Fake News Detection in the Urdu Language [6, 7]. The rest of the paper is organized as follows. In Section 2 we discuss the relevant works in fake news detection.
The dataset used in our work is described in Section 3. We describe our system in detail in Section 4. The experimental setup is discussed briefly in Section 5. We discuss the results of our model and their analysis in Section 6. Finally, we conclude and outline future work in Section 7.

2. Related Work

Work on automated fake news detection can be broadly classified into three categories, viz. approaches based on content, propagation, and social context [8]. Content-based approaches have been the most common; they leverage lexical and syntactic features to detect writing styles and words that capture the deceptive nature of the news. One such work [9] uses satirical cues with a support vector machine (SVM) [10] to detect fake news. Researchers have also compared the linguistic features of fake news with those of real news [11], which provides a basis for using stylistic cues to assess the truthfulness of a text. The work in [12] uses computational models and provides a comparative analysis with human performance on the fake news detection task. Researchers have also analyzed text from left-wing and right-wing news sources, compared it with mainstream news [13], and demonstrated a stylometric inquiry into hyper-partisan and fake news.

Very little work has been done on fake news detection in the Urdu language, owing to the scarcity of suitable datasets. For the given Urdu fake news detection task, [14] is the only available dataset. The work in [15] uses machine translation (MT) approaches to detect fake news in Urdu. Various machine learning (ML) approaches such as random forest (RF), logistic regression (LR), and boosting (BO) have been used by [16] to detect fake news in the Spanish language. Works such as [17] and [18] use emotional cues in the text to detect fake news.

3. Dataset

3.1. Description

For our work, the organizers of the 2020 Fake News Detection in the Urdu Language task provided the dataset for fake news detection [14]. The authors of [14] manually collected and annotated the data for this binary classification task. The training data included a total of 638 samples, 350 real and 288 fake. The evaluation set included a total of 262 samples, of which 150 are real and 112 fake. The test set comprised 400 news samples. Since a large Urdu monolingual corpus is not available for pre-training our model (see Section 2), we created our own Urdu monolingual corpus for pre-training the deep learning (DL) models. We prepared this corpus by extracting Urdu sentences from various sources including web pages, blogs, books, and government websites. The resulting monolingual data contains about 11 million sentences.

3.2. Preprocessing

The data provided by the organizers was already preprocessed, as discussed by the authors of [14]. For the monolingual corpus, we used UrduHack¹ to preprocess the collected text. The preprocessing step included the removal of URLs, email IDs, and mobile numbers. The text was also normalized: a space was added after punctuation marks and extra spaces were removed. Diacritics were removed and Urdu characters were normalized so that the text falls within the proper Urdu character range.

4. System Description

For pretraining as well as fine-tuning, we followed the architecture of [5]. The pretraining and fine-tuning procedures are described in Section 4.1 and Section 4.2, respectively. We used the XLNet model instead of bidirectional encoder representations from transformers (BERT) [19] because the latter has some drawbacks. BERT corrupts the input with mask tokens and does not consider the dependency between the masked positions.
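To make the missing dependency between masked positions concrete, the two objectives can be written side by side. This is a sketch that restates the formulation of [5]; the symbols are taken from that paper and are not otherwise defined here: x is the token sequence of length T, x̂ is the input corrupted with mask tokens, x̄ denotes the masked tokens, and m_t = 1 when position t is masked.

```latex
% Autoregressive (AR) objective: exact left-to-right factorization
\max_{\theta} \; \log p_{\theta}(\mathbf{x})
  = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t})

% BERT (AE) objective: reconstruct the masked tokens from the corrupted input
\max_{\theta} \; \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}})
  \approx \sum_{t=1}^{T} m_t \, \log p_{\theta}(x_t \mid \hat{\mathbf{x}})
```

The approximation sign in the second objective is where the independence assumption enters: each masked token is predicted from the corrupted input alone, without conditioning on the other masked tokens.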
BERT also suffers from a discrepancy between pretraining and fine-tuning [5], as artificial symbols such as [MASK] that are used during pretraining are absent during fine-tuning.

4.1. Pretraining

Leveraging unlabeled corpora has been very beneficial in various NLP tasks. Moreover, unlabeled data for unsupervised pre-training is readily available in far larger quantities than labeled data for supervised tasks. Principally, there are two distinct pre-training objectives, viz. autoencoding (AE) and autoregressive (AR) language modeling [5]. Following [5], we pre-trained our model on the XLNet architecture, which employs a generalized autoregressive method that combines the advantages of both the AE and AR objectives. Apart from its architectural improvements over BERT, XLNet builds on [20] and integrates its recurrence mechanism and relative encoding scheme, which improve performance on tasks involving longer text sequences, such as the given task of fake news classification.

¹ https://github.com/urduhack

Table 1: Test results of our team CNLP-NITS on the fake news classification task

              Precision   Recall   F1 Score
Fake Class    0.8359      0.7133   0.8235
Real Class    0.8419      0.9160   0.8235

4.2. Fine Tuning

To fine-tune the pre-trained model for the main binary classification task, i.e. detecting whether a news article is legitimate or fake, we followed the fine-tuning procedure of BERT [19] as given by [5]. Following [5], a span-based prediction is employed, wherein a span length is first sampled and a consecutive span of tokens of that length is randomly selected as the prediction targets. The pretrain-finetune discrepancy problem is also resolved here.

5. Experimental Setup

We trained the XLNet model, which uses the AR pre-training method with a permutation-based language modeling objective. Since each individual Urdu document (news article) provided by the organizers was to be predicted as fake or real, we followed [5] and used a longer sequence length of 512.
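As an illustration of the permutation-based objective mentioned above, the following minimal sketch samples one factorization order and derives which positions each token may attend to. It is not the training code used in this work: the function name `permutation_mask` and the toy sequence length are hypothetical, and the mask corresponds to the view in which a token never sees its own content (the query stream in the terminology of [5]).

```python
import random


def permutation_mask(seq_len, seed=None):
    """Sample a factorization order and build its attention mask.

    mask[i][j] is True when position i may attend to position j, i.e. when
    j comes strictly earlier than i in the sampled factorization order.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)
    order = list(range(seq_len))
    rng.shuffle(order)  # order[k] is the position predicted at step k

    # rank[p] is the step at which position p is predicted
    rank = [0] * seq_len
    for step, pos in enumerate(order):
        rank[pos] = step

    mask = [[rank[j] < rank[i] for j in range(seq_len)] for i in range(seq_len)]
    return order, mask


if __name__ == "__main__":
    order, mask = permutation_mask(seq_len=5, seed=0)
    print("factorization order:", order)
    for row in mask:
        # 'x' marks a position the row's token is allowed to attend to
        print("".join("x" if allowed else "." for allowed in row))
```

The token order of the input itself is never shuffled; only the mask changes from sample to sample, so over many sampled orders each position is predicted from contexts on both its left and its right.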
Due to the small classification dataset and limited computational power, we trained a much smaller model than the one originally trained by the authors of [5]. We used 6 layers with 4 attention heads and an embedding dimension of 256. In effect, the larger model's hyperparameters were scaled down by a factor of four. The XLNet-based model is pre-trained on data from the same news domain. Feature weighting schemes are not used. The embedding features from the pre-training step (Section 4.1) are used in the fine-tuning step (Section 4.2) for classification. The XLNet classification model itself serves as the classification algorithm.

6. Results

The results for the fake news detection task were announced by the organizers of the 2020 Fake News Detection in the Urdu Language task. We participated under the team name CNLP-NITS and, as shown by the overall results on the website², our team ranked second out of the 39 participating teams. The participating systems were evaluated on precision, recall, and F1 score. Our system achieved an overall accuracy of 0.8400 and an overall F1 macro score of 0.8370. Precision, recall, and F1 score for the fake class and the real class are shown in Table 1.

² https://www.urdufake2020.cicling.org/results-and-rankings

7. Conclusion and Future Work

Our automated fake news detection system adopts the generalized autoregressive technique for the binary classification task. Even though our XLNet-based model shows good results, there are avenues for improvement. The monolingual corpus used for the pre-training step can be enlarged further. Several architectural improvements could also be made to address the challenges specific to the Urdu language.

Acknowledgments

The authors would like to thank the organizers of the 2020 Fake News Detection in the Urdu Language task from the Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico for organizing this competition.
The authors are also grateful to the Center for Natural Language Processing (CNLP) and the Department of Computer Science and Engineering at National Institute of Technology, Silchar for providing the requisite support and infrastructure to execute this work.

References

[1] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data mining perspective, in: SIGKDD Explor., 2017, pp. 22–36.
[2] B. Nyhan, J. Reifler, When corrections fail: The persistence of political misperceptions, in: Political Behavior, 2010, pp. 303–330.
[3] A. Peck, A problem of amplification: Folklore and fake news in the age of social media, in: The Journal of American Folklore, 2020, pp. 329–351.
[4] D. M. J. Lazer, M. A. Baum, Y. Benkler, A. J. Berinsky, K. M. Greenhill, F. Menczer, M. J. Metzger, B. Nyhan, G. Pennycook, D. Rothschild, M. Schudson, S. A. Sloman, C. R. Sunstein, E. A. Thorson, D. J. Watts, J. L. Zittrain, The science of fake news, in: Science, 2018, pp. 1094–1096.
[5] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, in: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, 2019, pp. 5754–5764.
[6] M. Amjad, G. Sidorov, A. Zhila, A. Gelbukh, P. Rosso, UrduFake@FIRE2020: Shared track on fake news detection in Urdu (2020). Proceedings of the 12th Forum for Information Retrieval Evaluation (FIRE 2020), Hyderabad, India.
[7] M. Amjad, G. Sidorov, A. Zhila, A. Gelbukh, P. Rosso, Overview of the shared task on fake news detection in Urdu at FIRE 2020, CEUR Workshop Proceedings (2020). Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), Hyderabad, India.
[8] X. Zhou, R. Zafarani, Fake news: A survey of research, detection methods, and opportunities, in: CoRR, 2018, p. 1.
[9] V. Rubin, N. Conroy, Y. Chen, S. Cornwell, Fake news or truth? Using satirical cues to detect potentially misleading news, in: Proceedings of the Second Workshop on Computational Approaches to Deception Detection, 2016, pp. 7–17.
[10] W. S. Noble, What is a support vector machine?, in: Nature Biotechnology, 2006, pp. 1565–1567.
[11] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, Y. Choi, Truth of varying shades: Analyzing language in fake news and political fact-checking, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2931–2937.
[12] V. Pérez-Rosas, B. Kleinberg, A. Lefevre, R. Mihalcea, Automatic detection of fake news, in: Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, 2018, pp. 3391–3401.
[13] M. Potthast, J. Kiesel, K. Reinartz, J. Bevendorff, B. Stein, A stylometric inquiry into hyper-partisan and fake news, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, 2018, pp. 231–240.
[14] M. Amjad, G. Sidorov, A. Zhila, H. Gómez-Adorno, I. Voronkov, A. F. Gelbukh, “Bend the truth”: Benchmark dataset for fake news detection in Urdu language and its evaluation, in: J. Intell. Fuzzy Syst., 2020, pp. 2457–2469.
[15] M. Amjad, G. Sidorov, A. Zhila, Data augmentation using machine translation for fake news detection in the Urdu language, in: Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, 2020, pp. 2537–2542.
[16] J. P. Posadas-Durán, H. Gómez-Adorno, G. Sidorov, J. J. M. Escobar, Detection of fake news in a new corpus for the Spanish language, in: J. Intell. Fuzzy Syst., 2019, pp. 4869–4876.
[17] A. Giachanou, P. Rosso, F. Crestani, Leveraging emotional signals for credibility detection, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, 2019, pp. 877–880.
[18] B. Ghanem, P. Rosso, F. M. R. Pardo, An emotional analysis of false information in social media and news articles, in: ACM Trans. Internet Techn., 2020, pp. 19:1–19:18.
[19] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186.
[20] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, R. Salakhutdinov, Transformer-XL: Attentive language models beyond a fixed-length context, in: Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019, pp. 2978–2988.