Detecting Conspiracy Theories from Tweets: Textual and Structural Approaches

Haoming Guo1*, Adam Ash1*, David Chung1*, Gerald Friedland1
1 University of California, Berkeley
mike0221@berkeley.edu, adamash@berkeley.edu, dachung@berkeley.edu, fractor@berkeley.edu
* indicates authors with equal contributions

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, December 14-15 2020, Online.

ABSTRACT

The sharing of biased and fake news on social media has skyrocketed in the past few years. These actions have caused real-world problems and harm. The Fake News Detection Task 2020 has two subtasks: an NLP-based approach and a graph-based approach (analyzing the repost structure of social media posts). We present baseline models for these two subtasks and their performance. For the NLP-based approach, transformers yielded the best results, with a Matthews Correlation Coefficient (MCC) score of 0.477. For the graph-based approach, the best results came from a Support Vector Machine (SVM) model, with an MCC score of 0.366.

1 INTRODUCTION & RELATED WORK

This paper discusses social media natural language processing and graph-based processing for detecting conspiracy theories. We present our work on two subtasks: one that classifies tweets based on their content and metadata (including images), and another that classifies tweets solely based on their graph structure with very little metadata (relative time posted, friends, followers). The task overview paper[10] describes the dataset in more depth and provides information on how the dataset was constructed[11].

FNC-1, a similar benchmark on fake news and stance detection from texts, has received much attention from researchers[4]. Using handcrafted features and a Multi-Layer Perceptron model has proved to perform well on the task[4]. Furthermore, Slovikovskaya showed that fine-tuned transformers achieve state-of-the-art results on the benchmark[12].

Methods incorporating graph structure have proved to be fairly effective in detecting "fake news" consisting of an article shared on Twitter or other social media[7]. Moreover, deep learning methods on graphs of variable size and connectivity have been shown to be effective tools for classification[3].

This paper presents several prediction models and features for an NLP approach as well as a graph-based approach. We present the performance of these predictors and describe our methodologies.

2 APPROACH

2.1 Bidirectional LSTM

We use a bidirectional LSTM (BiLSTM) as our baseline model for the NLP track. We tokenize and lemmatize each tweet into a list of 55 tokens (padding with 0s if below 55), and for each token we obtain a 300-dimensional embedding from a pretrained word2vec model[6]. We then use these vectors to train a BiLSTM for classification. We choose the Adam optimizer, categorical cross-entropy loss, 256 units for the LSTM, and two fully connected layers for the final prediction.
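To make the baseline concrete, below is a minimal sketch of such a model in Keras. The dimensions named in the paper (55 tokens, 300-dimensional embeddings, 256 LSTM units, two fully connected layers) are used as given; everything else (the 128-unit hidden layer, batch size, epoch count) is an illustrative assumption, not the authors' exact configuration.

```python
# Minimal sketch of the BiLSTM baseline (assumed Keras implementation;
# only the dimensions named in the paper are taken from the source).
from tensorflow.keras import Model, layers

MAX_LEN, EMB_DIM, NUM_CLASSES = 55, 300, 3   # 55 tokens, word2vec-300, 3 labels

inputs = layers.Input(shape=(MAX_LEN, EMB_DIM))          # pretrained word2vec vectors
x = layers.Bidirectional(layers.LSTM(256))(inputs)       # 256-unit BiLSTM
x = layers.Dense(128, activation="relu")(x)              # first fully connected layer (size assumed)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)  # second fully connected layer

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# X: (num_tweets, 55, 300) array of embeddings, zero-padded to 55 tokens;
# y: one-hot labels. Then: model.fit(X, y, epochs=10, batch_size=32)
```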
2.2 Transformers

We experiment with pretrained transformers to classify the tweets. BERT, which stands for Bidirectional Encoder Representations from Transformers, was introduced in 2018 and achieved state-of-the-art performance on most NLP tasks[2]. BERT uses a multi-layer bidirectional transformer encoder with a self-attention mechanism to learn a language representation of the input texts. Following BERT, two modifications, XLNet[13] and RoBERTa[5], were proposed to address some of BERT's shortcomings, and both outperformed BERT on a variety of tasks.

We use the flair framework[1] to obtain BERT, XLNet, and RoBERTa embeddings separately as three sets of features. We use the base 768-dimensional versions of all transformers. We then train a fully connected neural network with two hidden layers on each set of features to classify the tweets. The best hyperparameters, including hidden layer size, learning rate, and number of training iterations, are tuned separately for each of BERT, XLNet, and RoBERTa.
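The embedding-extraction step might look like the sketch below. It assumes flair's TransformerDocumentEmbeddings class (available in flair 0.5 and later) and the standard Hugging Face model identifiers; the paper does not say how token embeddings were pooled, so document-level embeddings are one plausible choice.

```python
# Sketch: extracting 768-dimensional tweet embeddings with flair
# (assumed usage; model names are standard Hugging Face identifiers).
import numpy as np
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

def embed_tweets(tweets, model_name="roberta-base"):
    """Return one 768-dimensional vector per tweet."""
    embedder = TransformerDocumentEmbeddings(model_name)
    vectors = []
    for text in tweets:
        sentence = Sentence(text)
        embedder.embed(sentence)  # runs the transformer over the tweet
        vectors.append(sentence.embedding.detach().cpu().numpy())
    return np.stack(vectors)

# Three feature sets, one per transformer; each feeds its own
# two-hidden-layer fully connected classifier:
# for name in ("bert-base-uncased", "xlnet-base-cased", "roberta-base"):
#     features = embed_tweets(tweets, name)
```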
2.3 Basic Graph Features

Some features are hand-prepared for each retweet graph. Features are categorized either as calculated solely from the graph structure, or as calculated with the help of separate node information. Examples of features based solely on graph structure include edge count, node count, number of connected components, and average clustering coefficient; a short sketch of these appears after Section 2.4 below. Features based on both node information and graph structure include average time to retweet, the original tweeter's follower count, and the percentage of the original tweeter's followers who retweeted.

2.4 Computed Graph Features

About 60,000 random subgraphs are sampled from the graphs in the training set. To create each sample, ten nodes and their corresponding edges are randomly chosen from a graph in the training set. Each sampled subgraph is given a 100-dimensional vector corresponding to its flattened adjacency matrix and a label corresponding to the label of its source graph. A logistic regression classifier is trained on all sampled subgraphs. For each graph in the test set, ten random subgraphs are computed in the same way, and the average of the model's predictions over these ten subgraphs is added to the test set features for the corresponding graph.

For each graph in the training set, the Graph2Vec package[8] is used to create a 64-dimensional representation of the graph's largest (highest node count) subgraph, which is used as 64 features for that graph. In addition, 64-dimensional representations are computed for every subgraph of each graph, and their average, weighted by the number of nodes in each subgraph, is added to the features for that graph.

The DeepWalk[9] algorithm is also used to generate a length-64 feature vector for each node in both the training and test sets. A logistic regression classifier is trained on the DeepWalk feature vectors of each node in each graph in the training set. For each graph in the test set, the average of the per-node predictions is used as a feature.
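As referenced in Section 2.3, the purely structural features are straightforward to compute. The sketch below uses networkx, an assumed (but natural) library choice, since the paper does not name one.

```python
# Sketch: the purely structural features of Section 2.3, computed with
# networkx (library choice assumed; the paper does not specify one).
import networkx as nx

def structural_features(g: nx.Graph) -> dict:
    return {
        "node_count": g.number_of_nodes(),
        "edge_count": g.number_of_edges(),
        "connected_components": nx.number_connected_components(g),
        "avg_clustering": nx.average_clustering(g),
    }
```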
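The random-subgraph feature at the start of Section 2.4 could be implemented as sketched below. The sampling routine, the use of scikit-learn, and the two-class scoring are all assumptions; the paper only specifies ten-node subgraphs, flattened adjacency matrices, and a logistic regression classifier whose averaged predictions become a feature.

```python
# Sketch of the random-subgraph pipeline of Section 2.4 (assumed details).
import random
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

def sample_subgraph_vector(g: nx.Graph, k: int = 10) -> np.ndarray:
    """Pick k random nodes (assumes the graph has at least k), keep the
    edges among them, and return the flattened k x k adjacency matrix
    (length 100 for k = 10)."""
    nodes = random.sample(list(g.nodes), k)
    return nx.to_numpy_array(g.subgraph(nodes), nodelist=nodes).flatten()

# Training on ~60000 samples, each labeled with its source graph's label;
# `samples` is a hypothetical list of (graph, label) pairs.
# X = np.stack([sample_subgraph_vector(g) for g, label in samples])
# y = np.array([label for g, label in samples])
clf = LogisticRegression(max_iter=1000)
# clf.fit(X, y)

def graph_score(g: nx.Graph, n_samples: int = 10) -> float:
    """Test-time feature: average predicted probability over ten random
    subgraphs (two-class case shown)."""
    vecs = np.stack([sample_subgraph_vector(g) for _ in range(n_samples)])
    return clf.predict_proba(vecs)[:, 1].mean()
```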
3 RESULTS AND ANALYSIS

3.1 Text-based Approaches

We split the data into an 80% training set and a 20% validation set. We evaluate our results using the Matthews correlation coefficient (MCC), which is considered a balanced measure even for unbalanced data distributions. We present validation results and the official test set results in Tables 1 and 2.

Table 1: NLP Validation Results

Model     Three-class MCC   Two-class MCC
BiLSTM    0.292             0.378
BERT      0.355             0.391
XLNet     0.342             0.426
RoBERTa   0.449             0.471

Table 2: NLP Official Test Results

Model     Three-class MCC   Two-class MCC
XLNet     0.326             0.318
RoBERTa   0.459             0.477

All transformers outperform the baseline BiLSTM, as in many other NLP tasks, but among the transformers RoBERTa significantly outperforms the others. We analyze the reasons behind this. First, BERT and XLNet are pretrained on 16GB of BookCorpus and English Wikipedia, while RoBERTa is pretrained on an additional 144GB from the CommonCrawl News dataset, a Web text corpus, and Stories from Common Crawl. We think that this additional data not only improves RoBERTa's generalizability but also makes the model more suitable for news subjects and informal language. Second, RoBERTa removes the Next Sentence Prediction (NSP) training objective. The NSP objective was hypothesized to improve performance on tasks that require reasoning about pairs of sentences, which is not a key element of our task. Liu et al.[5] also showed that the NSP objective is unhelpful in many settings. Therefore, the removal of the NSP loss is another possible reason why RoBERTa performs best.

3.2 Structure-based Approaches

We evaluate our results using MCC, as discussed in the previous subsection. Once again, the data is split into an 80% training set and a 20% validation set. Different models are tested for both the two-class and three-class problems. The models tested are SVM, neural nets, and random forest. The neural net has three hidden layers of 64 Rectified Linear Units each, with a sigmoid output layer.

For the two-class problem, an SVM model with a radial basis function kernel outperforms the other models on the validation sets. For the three-class problem, a random forest model with 40 estimators and a maximum depth of four outperforms the others. Average validation results using features computed only from the graph structure are shown in Table 3, and average validation results using features computed from all available data are shown in Table 4.

Table 3: Structural Approach (Graph Only) Validation Results

Model          Three-class MCC   Two-class MCC
SVM            0.308             0.306
Random Forest  0.321             0.304
Neural Net     0.288             0.326

Table 4: Structural Approach Validation Results

Model          Three-class MCC   Two-class MCC
SVM            0.276             0.389
Random Forest  0.370             0.115
Neural Net     0.263             0.338

Our final classifiers are run on a test set roughly one third the size of our training set. Our two-class SVM model receives an MCC of 0.370, and our three-class random forest model receives an average MCC of 0.318. It is not surprising that our two-class classifier performs better, as using two classes instead of three leads to a dataset with much less ambiguity.

4 DISCUSSION AND OUTLOOK

Above, we presented and experimented with several methods to detect conspiracy theories from social media content based on text and graph structure. Overall, a transformer-based approach exhibited the best performance for text-based classification, while SVM and random forest models trained on our crafted graph features proved best for structure-based classification.

There are many ways to extend the methodologies described above. Below we list some possible directions for furthering this work.

(1) Our preliminary experimentation with the provided metadata yielded worse results than our transformer-based approaches. Further experiments could determine whether training a classifier on the metadata would yield better results.

(2) In this paper, we treated text-based and structure-based approaches separately for the specific subtasks. Incorporating different modalities, such as jointly analyzing tweet texts, tweet structures, metadata, and the images associated with tweets, could prove useful.

REFERENCES

[1] Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In COLING 2018, 27th International Conference on Computational Linguistics. 1638-1649.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
[3] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional Networks on Graphs for Learning Molecular Fingerprints. (2015). arXiv:cs.LG/1509.09292
[4] Andreas Hanselowski, Avinesh P.V.S., Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M. Meyer, and Iryna Gurevych. 2018. A Retrospective Analysis of the Fake News Challenge Stance-Detection Task. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018). http://tubiblio.ulb.tu-darmstadt.de/105434/
[5] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692
[6] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
[7] Federico Monti, Fabrizio Frasca, Davide Eynard, Damon Mannion, and Michael M. Bronstein. 2019. Fake News Detection on Social Media using Geometric Deep Learning. (2019). arXiv:cs.SI/1902.06673
[8] Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. 2017. graph2vec: Learning Distributed Representations of Graphs. (2017). arXiv:cs.AI/1707.05005
[9] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. CoRR abs/1403.6652 (2014). arXiv:1403.6652 http://arxiv.org/abs/1403.6652
[10] Konstantin Pogorelov, Daniel Thilo Schroeder, Luk Burchard, Johannes Moe, Stefan Brenner, Petra Filkukova, and Johannes Langguth. 2020. FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020. In MediaEval 2020 Workshop.
[11] Daniel Thilo Schroeder, Konstantin Pogorelov, and Johannes Langguth. 2019. FACT: a Framework for Analysis and Capture of Twitter Graphs. In 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS). IEEE, 134-141.
[12] Valeriya Slovikovskaya. 2019. Transfer Learning from Transformers to Fake News Challenge Stance Detection (FNC-1) Task. CoRR abs/1910.14353 (2019). arXiv:1910.14353 http://arxiv.org/abs/1910.14353
[13] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. CoRR abs/1906.08237 (2019). arXiv:1906.08237 http://arxiv.org/abs/1906.08237