Detecting Fake News in Tweets from Text and Propagation Graph: IRISA's Participation to the FakeNews Task at MediaEval 2020

Vincent Claveau
CNRS, IRISA, Univ. Rennes, France
vincent.claveau@irisa.fr

ABSTRACT
This paper presents the participation of IRISA in the task of fake news detection from tweets, relying either on the text or on propagation information. For the text-based detection, variants of BERT-based classification are proposed. In order to improve this standard approach, we investigate the benefit of augmenting the dataset by creating tweets with fine-tuned generative models. For the graph-based detection, we propose models characterizing the propagation of the news or the users' reputation.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, December 14-15 2020, Online.

1 INTRODUCTION AND RELATED WORK
This paper describes the systems that we developed for the text-based and structure-based MediaEval 2020 Fake News detection challenge. These two subtasks and the datasets are detailed in [10] and [12].

Text classification is a common NLP task [6]. Although simple machine learning approaches have shown promising results for fake news detection [8], the recent transformer-based architectures, such as BERT [2], have set new standards. Several large pre-trained transformer models are now available; they are known to yield state-of-the-art results on many NLP tasks, including text classification [16, inter alia]. We rely on one of these pre-trained models to build our systems. In order to improve this standard approach, we have investigated the benefit of augmenting the dataset artificially by generating tweets with fine-tuned generative models (one for each class). These approaches and results are detailed in Sec. 2.

Similarly, classification of data represented as a graph, and in particular node classification, is not new, but the recent trend is to use deep learning [5]. Yet, for the specific domain of fake news detection, other approaches are possible. In particular, it has been shown that fake news is propagated differently (and faster) than legitimate news [15]. The use of node reputation and link-based analysis, as done in the detection of spam web pages from the Web graph (such as TrustRank [4], an adaptation of PageRank [1]), is another inspiration for our approaches. These two approaches are further detailed in Sec. 3.

2 TEXT-BASED APPROACHES

2.1 Pre-processing
From the tweets still online¹, the text is extracted and pre-processed as follows. Emojis are transformed into text [13]. URLs are changed to the fixed string 'URL'. Twitter usernames are removed if they appear only once; the others are kept, with the @ removed. The intuition is that some frequently cited users may be associated with a specific class. Hashtags are kept (with the # removed) and decomposed when they contain a mix of capital and small letters (e.g., #CovidHoax is changed into "CovidHoax Covid Hoax").

¹ At retrieval time, respectively 227, 128 and 80 tweets were no longer available for the classes 'non', '5G' and 'other' in the dev set.

2.2 Generating artificial examples
For this task, we wanted to investigate the use of generative models in order to artificially augment and balance the datasets. Indeed, the performance of neural language models based on transformers [14] makes this realistic. To do so, we use GPT-2 (Generative Pre-trained Transformer), a model built from stacked transformer decoders trained on a large corpus by auto-regression [11]. Three GPT-2 models (one for each class) are fine-tuned from the 355M-parameter pre-trained model with the tweets of the dev set. The amount of tweets available is very small; we stopped the iterations when the perplexity reached 0.5. The way this stopping criterion impacts the results would need further investigation, which was not possible due to the limited time of the challenge.

For the generation, we randomly picked tweets and kept their first two words to serve as a bootstrap prompt. The temperature, which controls the creativity of the model, was set to 0.7. Here again, we had no time to investigate the impact of this parameter. Approximately 20,000 tweets were generated for each class. Here are some tweets generated for the class '5G conspiracy':

  Crude and unproductive! Turn off the 5G in your area and see if that helps. Covid19 is not funny. I hope that the Wuhan government puts an end to this immediately.

  "Immigrants are the cause of 5G towers, they're the cause of the coronavirus outbreak, they're the covid-19 victims, the 5G towers are the weapon which will eradicate the world population, 5G lays the microchips for the virus, i read somewhere that the 5G was debuting prior to the introduction of the COVID-19 virus to negate some of the hype around COVID-19

2.3 Classification models
Our four classification variants are based on the RoBERTa-large model [7]. It was preferred over other transformer-based representations because its tokenizer is expected to be better suited to the specifics of tweet writing. We have tested models with different classification layers (SVM, logistic regression), with or without fine-tuning, and with or without artificial examples. The submitted runs are the following:

model 1: tweet embedding from the RoBERTa model (not fine-tuned), and SVM (RBF kernel);
model 2: RoBERTa model with a linear classification layer, fine-tuned on the task (3 epochs);
model 3: same as model 2, with artificially generated examples (3 epochs);
model 4: same as model 3 (4 epochs).

2.4 Results of text-based detection
The results of our models are given in Tab. 1. When available, in addition to the official score on the test set, we provide the Matthews correlation coefficient (MCC), micro-F1 (accuracy) and macro-F1 on the dev data (80% for training, 20% for validation). Note that, due to the cost of the artificial example generation and the small amount of data, the GPT-2 models are fine-tuned on all the available dev data; we therefore do not have reliable cross-validation results for models 3 and 4 (generated tweets added to the training set can be very similar to those in the validation set).

Table 1: Performance of the proposed systems for the text-based and graph-based detection; models are detailed in Sec. 2 and 3.

                        cross-validation results           official
model                MCC      micro-F1   macro-F1      MCC
model 1 (text)       0.4654   0.7460     0.5924        0.4680
model 2 (text)       0.5345   0.7945     0.6253        0.5571
model 3 (text)       -        -          -             0.4937
model 4 (text)       -        -          -             0.4888
reputation (graph)   0.4415   0.7274     0.5900        0.4093
propagation (graph)  0.3198   0.6051     0.4980        0.3036

From the results, we see that fine-tuning the representation (model 2 vs. model 1) is beneficial. Unfortunately, the artificially generated tweets (models 3 and 4) do not yield the expected improvement. From the confusion matrices, one can see that the class 'other conspiracy' has the poorest results, with its tweets being equally labeled as '5G', 'non' or 'other'.

3 GRAPH-BASED APPROACHES
For the second subtask, we have proposed two models, based on two different sets of features. They are described in the following subsections, together with the machine learning algorithms adopted and their results.

3.1 Modeling the user's reputation
This set of features aims at taking into account whether one of the users posting or propagating the news has already been seen. Each user is indeed associated with a score for each possible label, computed from the number of training samples of each class it was associated with. We also take into account the scores of the neighbors of this user, their own neighbors, and so on. In practice, this is implemented with the PageRank algorithm [1] on the undirected graph, with a damping factor set to 0.8 (optimized by cross-validation). Finally, each sample ends up with one value per class; these three scores are the features used by the classifier.

Several learning algorithms have been tested (logistic regression, random forests, SVM, as implemented in scikit-learn [9]). The optimal settings for their hyper-parameters are grid-searched using 20% of the dev set as a validation set. The weight of each sample is adapted according to the inverse of its class proportion ('balanced' strategy). With their optimal settings, the different learning algorithms finally show little difference. For this set of features, the submitted run was produced with a random forest (1,000 trees with a maximal depth set to 5, out-of-bag weights used in the prediction).

3.2 Modeling the propagation
This set of features is built by considering how the tweet is propagated (without considering the users' reputations). These features can be used even if every involved user has never been seen before and is not connected to any known user. The features include (with n0 the first user tweeting the piece of news): number of nodes in the propagation graph; total number of friends and followers (for all nodes involved), as well as the median, 25th percentile and 75th percentile of followers; number of followers and friends of n0; difference between the number of followers and friends of n0; maximal, minimal, average, median, 25th percentile and 75th percentile of retweet time; times to reach at least 100, 1,000, 10,000 followers, and so on up to 200,000 followers. With this set of features, an SVM has been used with the following parameters: standardized features (removed mean and scaled to unit variance), RBF kernel, C=0.9, gamma automatically set with the 'scale' heuristic.

3.3 Results of graph-based detection
The results of the systems are given in Tab. 1. The cross-validation and official results are consistent; they both show the advantage of the reputation-based approach, especially when considering micro-F1. The difference between cross-validation and official test scores may be explained by a lower proportion of already seen nodes in the test set, compared to what was generated by cross-validation. A system exploiting all the proposed features (propagation + reputation) was also tested but showed no statistically significant difference from the reputation-only features.

For both models, the 'other conspiracy' class is again the most error-prone (proportionally), with an equal share of its tweets being classified into the three classes. Overall, for both feature sets, many errors are caused by confusion between the 5G and non-5G conspiracy tweets.

4 CONCLUSION AND FUTURE WORK
For the detection of fake news based on the text, we have adopted a state-of-the-art approach based on RoBERTa. The scores obtained show that there is still a large margin for progress, especially when dealing with close classes (5G vs. other conspiracies). The idea of incorporating artificially generated examples did not result in better performance and still needs some work. First, we may find better ways to set the training and generation hyper-parameters. Secondly, we plan to investigate the use of generative models to expand the samples at inference time.

For the detection based on the structure, we have shown that simple approaches like reputation already offer promising results, even on small datasets with many previously unseen nodes. In addition to this type of approach, we want to explore more recent node representation techniques that make it possible to use deep learning, such as node2vec [3] or subsequent variants.
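The pre-processing rules of Sec. 2.1 can be sketched in Python. This is a minimal illustration, not the authors' actual code: the emoji conversion step is omitted, the helper names are ours, and the "appears once" username filtering is assumed to be computed over the whole dev set beforehand.

```python
import re
from collections import Counter

def split_camel(tag: str) -> str:
    """Decompose a mixed-case hashtag body, e.g. 'CovidHoax' -> 'Covid Hoax'."""
    parts = re.findall(r'[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+', tag)
    return ' '.join(parts) if len(parts) > 1 else tag

def preprocess(tweet: str, username_counts: Counter) -> str:
    # URLs are replaced by a fixed token.
    text = re.sub(r'https?://\S+', 'URL', tweet)

    # Usernames seen only once in the corpus are dropped;
    # the others are kept with the '@' removed.
    def user_repl(m):
        name = m.group(1)
        return name if username_counts[name] > 1 else ''
    text = re.sub(r'@(\w+)', user_repl, text)

    # Hashtags are kept (without '#'); mixed-case ones also get a decomposed form.
    def tag_repl(m):
        tag = m.group(1)
        decomposed = split_camel(tag)
        return f'{tag} {decomposed}' if decomposed != tag else tag
    text = re.sub(r'#(\w+)', tag_repl, text)

    return re.sub(r'\s+', ' ', text).strip()

counts = Counter({'WHO': 5, 'randomuser42': 1})
print(preprocess('Look at this #CovidHoax from @WHO @randomuser42 https://t.co/x', counts))
# -> Look at this CovidHoax Covid Hoax from WHO URL
```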
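Since the temperature setting (0.7) is central to the generation step, here is a small, self-contained illustration of what it does: the logits are divided by the temperature before the softmax, so lower values sharpen the next-token distribution (less "creative" sampling). This is generic sampling code, not tied to any particular GPT-2 implementation.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Softmax over logits/temperature, then sample one token index."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i, probs
    return len(probs) - 1, probs

rng = random.Random(0)
logits = [2.0, 1.0, 0.1]
_, probs_cool = sample_with_temperature(logits, 0.7, rng)  # sharper distribution
_, probs_hot = sample_with_temperature(logits, 2.0, rng)   # flatter distribution
print(probs_cool[0] > probs_hot[0])  # top token is more likely at low temperature
```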
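Model 1 of Sec. 2.3 (frozen RoBERTa embeddings fed to an RBF-kernel SVM) can be sketched as follows. Synthetic vectors stand in for the actual tweet embeddings, so this only illustrates the classifier part, not the authors' pipeline.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder 'embeddings': three synthetic clusters standing in for
# RoBERTa tweet vectors of the classes 'non', '5G' and 'other'.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 16)) for c in (-1.0, 0.0, 1.0)])
y = np.repeat(['non', '5G', 'other'], 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# RBF kernel as in model 1; class weights balanced for the skewed classes.
clf = SVC(kernel='rbf', class_weight='balanced')
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te) > 0.9)
```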
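The propagation features of Sec. 3.2 are simple statistics over the cascade; a possible implementation is sketched below. The input format (one tuple per retweet) and the helper name are our assumptions, and only a subset of the described features is shown.

```python
import numpy as np

def propagation_features(followers_n0, friends_n0, retweets):
    """Toy version of the propagation feature set.
    retweets: list of (follower_count, friend_count, retweet_time_seconds)."""
    fol = np.array([r[0] for r in retweets])
    fri = np.array([r[1] for r in retweets])
    t = np.array([r[2] for r in retweets])
    feats = {
        'n_nodes': 1 + len(retweets),                      # source user + retweeters
        'total_followers': followers_n0 + int(fol.sum()),
        'total_friends': friends_n0 + int(fri.sum()),
        'followers_median': float(np.median(fol)),
        'followers_p25': float(np.percentile(fol, 25)),
        'followers_p75': float(np.percentile(fol, 75)),
        'followers_n0': followers_n0,
        'friends_n0': friends_n0,
        'diff_n0': followers_n0 - friends_n0,
        'retweet_time_max': float(t.max()),
        'retweet_time_min': float(t.min()),
        'retweet_time_mean': float(t.mean()),
        'retweet_time_median': float(np.median(t)),
    }
    # Time to reach at least k cumulative followers (inf if never reached).
    order = np.argsort(t)
    cum = np.cumsum(fol[order])
    for k in (100, 1000, 10000):
        hit = int(np.searchsorted(cum, k))
        feats[f'time_to_{k}'] = float(t[order][hit]) if hit < len(cum) else float('inf')
    return feats

f = propagation_features(50, 80, [(120, 10, 5.0), (30, 40, 60.0), (900, 5, 30.0)])
print(f['n_nodes'], f['diff_n0'], f['time_to_100'])  # -> 4 -30 5.0
```

A vector of such features is what the SVM described above (standardized inputs, RBF kernel, C=0.9, gamma='scale') would consume.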
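The three scores reported in Tab. 1 (MCC, micro-F1, macro-F1) can be computed directly with scikit-learn [9]; a minimal example on dummy labels, illustrating in particular that micro-F1 equals accuracy in the multi-class single-label setting:

```python
from sklearn.metrics import matthews_corrcoef, f1_score, accuracy_score

# Dummy gold and predicted labels for the three task classes.
y_true = ['5G', '5G', 'non', 'non', 'other', 'other']
y_pred = ['5G', 'non', 'non', 'non', 'other', '5G']

mcc = matthews_corrcoef(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average='micro')  # == accuracy here
macro_f1 = f1_score(y_true, y_pred, average='macro')  # unweighted mean of per-class F1

print(round(micro_f1, 4), micro_f1 == accuracy_score(y_true, y_pred))
```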
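The reputation features of Sec. 3.1 can be approximated with personalized PageRank, one run per class. The per-class seeding below is our reading of the description (scores seeded by the number of training samples of each class a user was associated with), and the toy graph and counts are made up for illustration.

```python
import networkx as nx

# Toy undirected user graph; edges link users that co-occur in cascades.
G = nx.Graph([('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'a'), ('a', 'e')])

# Hypothetical counts: training tweets of each class each user was involved in.
class_counts = {
    '5G':    {'a': 3, 'b': 1},
    'non':   {'c': 4},
    'other': {'e': 2},
}

def reputation_scores(G, class_counts, damping=0.8):
    """One PageRank run per class, personalized by the class counts.
    Returns {user: [score_5G, score_non, score_other]}."""
    scores = {u: [] for u in G}
    for label in ('5G', 'non', 'other'):
        counts = class_counts[label]
        total = sum(counts.values())
        perso = {u: counts.get(u, 0) / total for u in G}  # teleport distribution
        pr = nx.pagerank(G, alpha=damping, personalization=perso)
        for u in G:
            scores[u].append(pr[u])
    return scores

scores = reputation_scores(G, class_counts)
print(scores['a'][0] > scores['a'][1])  # 'a' leans towards the '5G' class
```

Each user then contributes its three scores as features, matching the "one value per class" description above.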
REFERENCES
[1] Sergey Brin and Lawrence Page. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the Seventh International Conference on World Wide Web (WWW7). Elsevier Science Publishers B.V., Brisbane, Australia, 107–117.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[3] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). Association for Computing Machinery, New York, NY, USA, 855–864. https://doi.org/10.1145/2939672.2939754
[4] Z. Gyöngyi and H. Garcia-Molina. 2005. Link spam alliances. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). Trondheim, Norway, 517–528.
[5] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation Learning on Graphs: Methods and Applications. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering (2017).
[6] Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. 2019. Text Classification Algorithms: A Survey. Information (2019).
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
[8] Cédric Maigrot, Vincent Claveau, Ewa Kijak, and Ronan Sicre. 2016. MediaEval 2016: A Multimodal System for the Verifying Multimedia Use Task. In MediaEval 2016: "Verifying Multimedia Use" task. Hilversum, Netherlands.
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[10] Konstantin Pogorelov, Daniel Thilo Schroeder, Luk Burchard, Johannes Moe, Stefan Brenner, Petra Filkukova, and Johannes Langguth. 2020. FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020. In MediaEval 2020 Workshop.
[11] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog (2019).
[12] Daniel Thilo Schroeder, Konstantin Pogorelov, and Johannes Langguth. 2019. FACT: a Framework for Analysis and Capture of Twitter Graphs. In 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS). IEEE, 134–141.
[13] Taehoon Kim and Kevin Wurster. 2020. emoji: Emoji for Python. https://pypi.org/project/emoji/
[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[15] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359, 6380 (2018), 1146–1151. https://doi.org/10.1126/science.aap9559
[16] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA.