ULMFiT for Twitter Fake News Spreader Profiling
Notebook for PAN at CLEF 2020

1 H. L. Shashirekha, 2 F. Balouchzahi
Department of Computer Science, Mangalore University, Mangalore - 574199, India
1 hlsrekha@gmail.com, 2 frs_b@yahoo.com

Abstract. The 21st century is known as the age of information technology. Social applications such as Facebook, Twitter and Instagram have become a fast and far-reaching medium for spreading news over the internet. At the same time, the wide spread of low-quality news carrying intentionally false information is wreaking havoc, causing damage to the extent of loss of life in society. Such news is termed fake news, and detecting fake news spreaders is drawing more attention these days, as fake news can manipulate communities' minds as well as social trust. To date, many studies have been carried out in this area, most of them based on Machine Learning and Deep Learning approaches. In this paper, we propose a Universal Language Model Fine-Tuning model based on Transfer Learning to detect potential fake news spreaders on Twitter. The proposed model uses Wikipedia text data to train a Language Model that captures general features of the language, and this knowledge is transferred to build a classifier over the fake news spreaders dataset provided by PAN 2020. The results obtained on the PAN 2020 fake news dataset are encouraging.

1 Introduction

In this era, social media is overwhelming people's lives, and people share various kinds of information using different social media platforms such as Google+, Facebook, WhatsApp and Twitter [1]. The velocity of news spreading on the internet is increasing rapidly due to the availability of various social media platforms and pocket-friendly mobile data packs.
Social media has become especially attractive for the younger generation, mainly because of the inherent benefits of fast dissemination of, and easy access to, information [2]. At the same time, the wide spread of low-quality news carrying intentionally false information is wreaking havoc, causing damage to the extent of loss of life in society [3]. Two major concepts of fake news are veracity and intention. Veracity concerns whether the authenticity of the information contained in a news item can be verified. For example, for a news item about an earthquake in Japan, the probability of the news being true is high, yet proving whether it is true or fake remains a challenge. Intention refers to the goal of the spreader, who uses false information deliberately to mislead the reader.

________
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece

Fake news is not a new challenge, as people have been exposed to propaganda, tabloid news and satirical reporting for ages. Nowadays, however, the heavy dependence on the internet, trending stories on social media, new methods of monetizing content, etc., lead people to rely on information without using trustworthy traditional media outlets [4]. Fake news is hazardous since it is spread to manipulate readers' opinions and beliefs [5]. Hence, detecting fake news spreaders is very important in today's scenario and is gaining attention day by day, as users play a key role in creating and sharing incorrect or false information, intentionally or accidentally [6]. In spite of many detection systems, both automatic and human-based, detecting fake news spreaders is still a challenging task [7].
Detecting fake news spreaders on Twitter can be modeled as a typical binary Text Classification (TC) problem that labels a given news spreader as fake or genuine. TC is a Supervised Machine Learning (ML) technique that automatically assigns a label from a predefined set of labels to a given unlabelled input. It has wide applications in various domains, such as target marketing, medical diagnosis, news classification and document organization [8]. There are several popular approaches for TC in general and for fake news spreader profiling in particular. In this paper, we propose a Universal Language Model Fine-Tuning (ULMFiT) model for fake news spreader detection based on Transfer Learning (TL).

1.1 Transfer Learning

TL is generally regarded as one of the novel inventions in the fields of Deep Learning and Computer Vision. Conventionally, in ML every model is built from scratch using a specific dataset. A model based on the TL approach, however, uses the knowledge obtained from building one model, called the source model, in building another model, called the target model. The former task is called the source task and the latter the target task. While the source task uses one dataset, called the source dataset, to build the source learning system or source model, the target task uses the knowledge obtained in building the source model along with the target dataset, which is used for fine-tuning the target model. For example, the source model can be a Language Model (LM) that represents the general features of a language, the target model can be a TC model, the source dataset can be Wikipedia text, and the target dataset can be fake news data [9]. An LM is a probability distribution over word sequences in a language and introduces a useful hypothesis space for many other NLP tasks [10]. As the knowledge obtained in building the source model is transferred to build the target model, this style of learning is named Transfer Learning. Figure 1 illustrates the difference between conventional ML and TL.
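To make the notion of an LM as "a probability distribution over word sequences" concrete, the following toy sketch estimates a bigram LM from counts. It is purely illustrative: the LM used in this work is the neural AWD-LSTM described later, and the corpus below is a made-up stand-in for the Wikipedia source data.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate P(next word | previous word) from raw bigram counts."""
    bigrams, unigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])                  # count conditioning words
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # count adjacent pairs
    def prob(prev, word):
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

# Toy corpus; a real source dataset would be the Wikipedia text of Table 1.
corpus = ["the news is fake", "the news is real", "the story is fake"]
p = train_bigram_lm(corpus)
# p("the", "news") is 2/3: "the" is followed by "news" in two of three sentences.
```

Such count-based estimates capture only local word order; a neural LM such as AWD-LSTM generalizes the same idea to long-range context through recurrent hidden states.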
After the introduction of TL, the LM has drawn more attention as it acts as informative knowledge of a language.

1.2 ULMFiT

ULMFiT is a model based on TL and can be used for many NLP tasks such as TC and NER [9]. It uses the knowledge of an LM as the source model and then fine-tunes the target model using the task-specific data or target dataset. Figure 2 represents the architecture of ULMFiT. It includes three steps: i) pre-training an LM on a large corpus such as Wikipedia to capture high-level language features, the resultant model being called the pre-trained LM; ii) fine-tuning the target model using the pre-trained LM and the task-specific or target dataset; iii) the final model, which accepts test/unlabelled data and assigns a label.

Figure 1. Conventional Machine Learning versus Transfer Learning

Figure 2. Architecture of ULMFiT

The advantage of TL is that, when a given dataset is too small to train a learning model, the knowledge obtained by an LM pre-trained on a source dataset can be transferred to the target task, resulting in an improved target model even when the source and target datasets have different distributions or features [9][11][12].

The rest of the paper is organized as follows. Section 2 gives the related work, followed by the proposed methodology in section 3. While section 4 describes the experiments and results, section 5 gives the conclusion of the paper.

2 Related Works

In spite of the availability of many automated tools and techniques for the detection of fake news spreaders, it is still a challenging task. Some of the relevant works are mentioned below.

An Artificial Neural Network model for the Language Identification task for Indian native languages, namely Tamil, Hindi, Kannada, Malayalam, Bengali and Telugu written in Roman script, has been explored by Nayel and Shashirekha [1]. The datasets used in the task are collections of comments from different regional newspapers and Facebook pages. They obtained an accuracy of 35.30%.
The same authors obtained accuracies of 47.60% and 47.30% respectively in another work using an ensemble classifier made up of multinomial Naive Bayes, SVM and Random Forest [13]. Rangel et al. [14] proposed a Low Dimensionality Representation (LDR) for language variety identification and applied LDR to the age and gender identification task at the PAN Lab at CLEF. The results they obtained are competitive with those of the best performing teams in the author profiling task. Shu et al. [2] construct a real-world dataset by measuring the trust levels of "experienced" users (those able to recognize fake news items as false) and "naïve" users (those more likely to believe fake news) on fake news. They perform a comparative analysis of explicit and implicit profile features between these user groups, which reveals their potential to differentiate fake news. Shu et al. [3] have explored the fake news problem from a data mining perspective, including feature extraction and model construction, and have reviewed different approaches for fake news detection. Ghanem et al. [5] present an approach based on a combination of emotional information from documents using a deep learning network. The authors used one dataset of trusted news (real news) created from the English Gigaword corpus and another dataset collecting news from seven different unreliable news sites as false news, and report an F1 score of 96%. A bot detection approach using behavioral and other informal cues is proposed by Hall et al. [15]. They used a random forest classifier and a gradient boosting classifier, and also applied hyperparameter optimization, over 476 million revisions collected from Wikipedia articles. They report a model performance of 88% precision and 60% recall. The EmoCred model based on an LSTM neural network proposed by Giachanou et al. [16] incorporates emotional signals to differentiate between credible and non-credible claims. It accepts as input word embeddings from the claims and a vector of emotional signals.
The authors used data from Politifact (www.politifact.com), a fact-checking website where the credibility of different claims is investigated, containing the text of the claims, the speaker, and the credit rating of each claim. Six different credibility ratings (true, mostly true, half true, mostly false, false, and pants-on-fire) were combined into two classes, true and false, and an F1 score of 61.7% was obtained when generating the emotional signals. "DeClarE" is an automated end-to-end neural network model proposed by Popat et al. [17]. They capture signals from external evidence articles and model joint interactions between various factors such as the context of a claim, the language of reporting articles, and the trustworthiness of their sources. Their model was evaluated on Snopes (www.snopes.com), Politifact, and a SemEval Twitter rumor dataset, obtaining F1 scores of 79% and 68% for Snopes and Politifact respectively and a macro accuracy of 57% on the SemEval dataset.

3 Methodology

An overview of the proposed fake news spreader detection model is given in Figure 3. The model, built on the state-of-the-art ULMFiT architecture developed by Howard and Ruder [10], consists of pre-training the LM and then fine-tuning the fake news spreader detection model using the pre-trained LM and the fake news spreader dataset provided by PAN 2020. Two separate models are constructed to detect fake news spreaders in English and in Spanish. Inspired by Merity et al. [18], the LM and the target classifier are created using the text.models module from the fastai library. This module implements the encoder for an ASGD Weight-Dropped LSTM (AWD-LSTM), which can be plugged in with a decoder to create an LM, and with some classifying layers to create a text classifier.
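The "weight-dropped" part of AWD-LSTM refers to DropConnect regularization: individual recurrent weights, rather than activations, are zeroed during training. The following framework-free sketch illustrates only this masking step on a plain matrix (a deliberate simplification; in fastai it is applied inside the LSTM's hidden-to-hidden matrices, and the tiny matrix below is hypothetical):

```python
import random

def weight_drop(weights, p, seed=None):
    """DropConnect on a weight matrix: zero each weight with probability p,
    scaling survivors by 1/(1-p) so the expected weight value is unchanged."""
    rng = random.Random(seed)
    keep = 1.0 - p
    return [[w / keep if rng.random() < keep else 0.0 for w in row]
            for row in weights]

# A tiny stand-in weight matrix; real AWD-LSTM layers have 1150 hidden units.
W = [[0.5, -0.2], [0.1, 0.8]]
W_dropped = weight_drop(W, p=0.5, seed=0)
# Roughly half of the weights are zeroed on each training step, which
# regularizes the recurrent transition without modifying the activations.
```

Because the mask is redrawn every step, the network cannot rely on any single recurrent connection, which is what makes the technique effective on the hidden-to-hidden matrices where ordinary dropout would disrupt the recurrence.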
AWD-LSTM is a regular LSTM to which several regularization and optimization techniques are applied; it is built layer by layer on top of a PyTorch neural network model [9]. Its architecture, as described by Howard and Ruder [10], consists of a word embedding of size 400, 3 layers and 1150 hidden activations per layer. AWD-LSTM has been dominating state-of-the-art language modeling, and many studies on word-level models incorporate AWD-LSTMs. It has also shown noticeable results on character-level models [18].

Figure 3. Overview of ULMFiT for Twitter fake news spreader profiling

3.1 Training LM (Source Learning Model)

The LM, also called the source learning model, is trained on source data collected from the English/Spanish Wikipedia. The source dataset is usually an unannotated dataset containing general-domain text, used to train the LM to learn general features such as the grammar of the language. Sufficiently large amounts of English and Spanish text were collected from Wikipedia to create the source datasets of the respective languages, and an LM was trained for each to learn the general features of the language. Wikipedia articles available in January 2020 were collected in XML format, and the sentences were then extracted from the raw text using the WikiExtractor module (https://github.com/attardi/wikiextractor). Once the source model completes its learning, the knowledge thus learned is used to build the target task of fake news spreader detection. This knowledge can also be saved for future use in other English/Spanish NLP applications. Details of the source datasets for both languages are given in Table 1.

3.2 Target Model

The target model is created using the knowledge obtained from the LM, followed by fine-tuning the model using the target dataset. The pre-trained LM is trained on the target task data for several cycles to fine-tune the knowledge to the target task. The target dataset is the labeled data used for the classification task, which is provided by PAN to registered users only.
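Each author in the target dataset arrives as one XML file of tweets plus a shared truth.txt of labels, which can be loaded with a few lines of standard-library Python. Note the assumptions here: the `<author>`/`<documents>`/`<document>` element names and the `:::` separator in truth.txt follow PAN's usual author-profiling layout, and the author ids and tweets are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parse_author_xml(xml_text):
    """Extract the list of tweets from one per-author XML file."""
    root = ET.fromstring(xml_text)
    return [doc.text or "" for doc in root.iter("document")]

def parse_truth(truth_text):
    """Map author ids to integer labels from 'author_id:::label' lines."""
    labels = {}
    for line in truth_text.strip().splitlines():
        author_id, label = line.split(":::")
        labels[author_id] = int(label)
    return labels

# Illustrative stand-ins for one author file and a two-line truth.txt.
sample_xml = """<author lang="en">
  <documents>
    <document>first tweet of this author</document>
    <document>second tweet of this author</document>
  </documents>
</author>"""
tweets = parse_author_xml(sample_xml)
labels = parse_truth("author1:::0\nauthor2:::1")
```

In practice one such call per XML file yields the 100 tweets of each author, which are then concatenated and preprocessed before fine-tuning.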
The dataset consists of 300 XML files in a folder per language (English, Spanish) [19]. Each folder contains:

- An XML file per author (Twitter user) consisting of 100 tweets each, where the name of the XML file corresponds to the unique author id.
- A truth.txt file with the list of authors and the ground truth.

The details of the dataset provided by PAN are given in Table 2. The target data is preprocessed and then used for fine-tuning the classification task. Preprocessing involves tokenization, removing punctuation and stop words, lemmatization, and removing other unwanted characters. Emojis are small images used to express emotion and are useful in text analysis [13]. Hence, they are converted to the corresponding words or phrases, which are then treated like content-bearing words.

Table 1. Details of source dataset

Language   No. of Articles   No. of Sentences   No. of Words
English    63341             2050239            68011619
Spanish    68490             1531438            64530355

Table 2. Details of target datasets provided by PAN

Language   No. of Authors   No. of Tweets per Author   No. of Class 0 Data   No. of Class 1 Data
English    300              100                        150                   150
Spanish    300              100                        150                   150

4 Experimental Results

As per the PAN 2020 rules for submitting software in a Virtual Machine (VM), the learning model has to be constructed and saved locally first, then loaded in the PAN VM, and finally submitted through the TIRA Integrated Research Architecture submission system [20]. The ULMFiT model was created using Google Colab (https://colab.research.google.com/), as it requires a GPU and a larger RAM size during the learning cycles. The proposed model was evaluated through the PAN submission system, and the performance of the model was made available by the task moderator. The model's runtime reported by PAN is 00:35:48 (hh:mm:ss). Almost half of this time is spent on loading the model using the fastai library and the rest on predictions. Details of the results obtained by the proposed model are given in Table 3.
The proposed model achieved 64% accuracy on the Spanish and 62% on the English language data.

Table 3. Performance of the proposed model

Language   Accuracy (%)
English    62
Spanish    64

5 Conclusion

This paper presents a ULMFiT model for profiling fake news spreaders on Twitter based on the Transfer Learning approach. The proposed model is initially trained on general-domain English/Spanish data collected from Wikipedia to build an LM, and the acquired knowledge is then transferred to build the fake news spreader detection model as the target task. The model achieved 64% accuracy on the Spanish and 62% on the English language data. Furthermore, the data collected from Wikipedia and the LM can be reused for any other English/Spanish NLP task.

References

1. Nayel Hamada A., and H. L. Shashirekha. "Mangalore University INLI@FIRE2018: Artificial Neural Network and Ensemble based Models for INLI". In FIRE (Working Notes), pp. 110-118, 2018.
2. Shu Kai, Suhang Wang, and Huan Liu. "Understanding User Profiles on Social Media for Fake News Detection". In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 430-435, 2018.
3. Shu Kai, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. "Fake News Detection on Social Media: A Data Mining Perspective". ACM SIGKDD Explorations Newsletter 19, no. 1, pp. 22-36, 2017.
4. Haber Morey. "The Real Risks of Fake News". Risk Management 64, no. 3, pp. 10-12, 2017.
5. Ghanem Bilal, Paolo Rosso, and Francisco Rangel. "An Emotional Analysis of False Information in Social Media and News Articles". ACM Transactions on Internet Technology (TOIT) 20, no. 2, pp. 1-18, 2020.
6. Giachanou Anastasia, Esteban A. Ríssola, Bilal Ghanem, Fabio Crestani, and Paolo Rosso. "The Role of Personality and Linguistic Patterns in Discriminating Between Fake News Spreaders and Fact Checkers". In International Conference on Applications of Natural Language to Information Systems, Springer, Cham, pp. 181-192, 2020.
7.
Vo Nguyen, and Kyumin Lee. "Learning from Fact-checkers: Analysis and Generation of Fact-checking Language". In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335-344, 2019.
8. Aggarwal Charu C., and ChengXiang Zhai. "A Survey of Text Classification Algorithms". In Mining Text Data, Springer, Boston, MA, pp. 163-222, 2012.
9. Faltl Sandra, Michael Schimpke, and Constantin Hackober. "ULMFiT: State-Of-The-Art in Text Analysis". Seminar Information Systems (WS18/19), 2019.
10. Howard Jeremy, and Sebastian Ruder. "Universal Language Model Fine-Tuning for Text Classification". arXiv preprint arXiv:1801.06146, 2018.
11. Semwal Tushar, Promod Yenigalla, Gaurav Mathur, and Shivashankar B. Nair. "A Practitioners' Guide to Transfer Learning for Text Classification Using Convolution Neural Networks". In Proceedings of the 2018 Society for Industrial and Applied Mathematics (SIAM) International Conference on Data Mining, pp. 513-521, 2018.
12. Pan Sinno Jialin, James T. Kwok, and Qiang Yang. "Transfer Learning via Dimensionality Reduction". In AAAI, vol. 8, pp. 677-682, 2008.
13. Nayel Hamada A., and H. L. Shashirekha. "Mangalore-University@INLI-FIRE-2017: Indian Native Language Identification using Support Vector Machines and Ensemble approach". In FIRE (Working Notes), pp. 106-109, 2017.
14. Rangel Francisco, Marc Franco-Salvador, and Paolo Rosso. "A Low Dimensionality Representation for Language Variety Identification". In International Conference on Intelligent Text Processing and Computational Linguistics, Springer, Cham, pp. 156-169, 2016.
15. Hall Andrew, Loren Terveen, and Aaron Halfaker. "Bot Detection on Wikidata Using Behavioral and Other Informal Cues". Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, Article 64, November 2018.
16. Giachanou Anastasia, Paolo Rosso, and Fabio Crestani. "Leveraging Emotional Signals for Credibility Detection".
In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 877-880, 2019.
17. Popat Kashyap, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. "DeClarE: Debunking Fake News and False Claims Using Evidence-Aware Deep Learning". arXiv preprint arXiv:1809.06416, 2018.
18. Merity Stephen, Nitish Shirish Keskar, and Richard Socher. "Regularizing and Optimizing LSTM Language Models". arXiv preprint arXiv:1708.02182, 2017.
19. Rangel F., Giachanou A., Ghanem B., and Rosso P. "Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter". In: L. Cappellato, C. Eickhoff, N. Ferro, and A. Névéol (eds.) CLEF 2020 Labs and Workshops, Notebook Papers, CEUR Workshop Proceedings, 2020.
20. Potthast Martin, Tim Gollub, Matti Wiegmann, and Benno Stein. "TIRA Integrated Research Architecture". In Information Retrieval Evaluation in a Changing World, Springer, Cham, pp. 123-160, 2019.