=Paper=
{{Paper
|id=Vol-3181/paper57
|storemode=property
|title=Classifying COVID-19 Conspiracy Tweets with Word Embedding and BERT
|pdfUrl=https://ceur-ws.org/Vol-3181/paper57.pdf
|volume=Vol-3181
|authors=Yuta Yanagi,Ryouhei Orihara,Yasuyuki Tahara,Yuichi Sei,Akihiko Ohsuga
|dblpUrl=https://dblp.org/rec/conf/mediaeval/YanagiOTSO21
}}
==Classifying COVID-19 Conspiracy Tweets with Word Embedding and BERT==
Classifying COVID-19 Conspiracy Tweets with Word Embedding and BERT

Yuta Yanagi, Ryohei Orihara, Yasuyuki Tahara, Yuichi Sei, Akihiko Ohsuga
The University of Electro-Communications, Japan
yanagi.yuta@ohsuga.lab.uec.ac.jp, orihara@acm.org, tahara@uec.ac.jp, sei@is.uec.ac.jp, ohsuga@uec.ac.jp

ABSTRACT
We, team OTS-UEC, contributed to the automatic detection of conspiracy tweets in MediaEval 2021. The dataset consists of tweets that refer to COVID-19, part of which support or discuss conspiracy theories. Following the results in the MediaEval 2020 working notes, we use a BERT-based classifier. We implement three proposed models and compare them experimentally. In this year's task as well, the BERT-based model classifies better than a word-embedding-based one. This result suggests that a pre-trained model is well suited to classifying conspiracy tweets with only a small amount of preparation.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

1 INTRODUCTION
FakeNews, one of the MediaEval 2021 tasks, focuses on automatically classifying tweets with respect to conspiracies [10]. The task has three classification subtasks. The first (Text-Based Misinformation Detection, MD) classifies tweets into three stance classes: supporting, discussing, and non-conspiracy (no conspiracy mentioned). The second (Text-Based Conspiracy Theory Recognition, CTR) consists of nine binary classifications that decide, for each pre-defined conspiracy, whether a tweet refers to it. The third (Text-Based Combined Misinformation and Conspiracies Detection, CMCD) requires classifying the three stances for each of the nine conspiracies (3 × 9 output types).

The FakeNews task requires two model types. On the one hand, the "required run" must be built from the dataset alone. On the other hand, "optional run(s)" may use data from outside the dataset, which includes pre-trained language models. We measure the effect of pre-trained language models through the difference in results between these two model types. The results show solid improvements from using a pre-trained language model, and the BERT-based model gives the best results in our experiments.

2 RELATED WORK
The COVID-19 epidemic affects not only the medical domain but also social media. The diffusion of misinformation (including fake news) reduces the credibility of governments and of medical treatments such as vaccines [13]. Moreover, some people, influenced by psychological factors, argue for relationships between the epidemic and conspiracies [5, 14]. The automatic detection of conspiracy tweets is therefore crucial to lighten the burden imposed on medical workers.

The FakeNews task in 2021 extends the automatic detection of the 5G conspiracy in COVID-19 tweets from MediaEval 2020 [11, 12]. Among its participants, two teams used BERT, in a single model [8] or in an ensemble model [9]. In both cases, the BERT model improved classification performance for the 5G-conspiracy / other-conspiracy / non-conspiracy classes.

3 APPROACH
In this section, we show how we implement our proposed models for each subtask.

3.1 Preprocessing
The organizers provided raw tweet texts as the dataset, so we apply the following preprocessing rules:
• Expand contracted forms with a publicly available tool [15] and manual fixes.
• Lowercase all letters.
• Remove all characters except letters, numbers, and whitespace.
• Replace every number with zero (0), except in "covid19".
• Eliminate stopwords with a tool from NLTK [3].
The removed characters include emojis. When improving classification performance further, taking emojis into account might yield more accurate tweet features.

3.2 Language Models
We compare the effect of using pre-trained language models in every subtask. In addition, we compare two language models: one based on a pre-trained NNLM [2], the other on pre-trained BERT [4]. All implementations are done in Keras [7].

3.2.1 Required run. First, we encode tweets into integer sequences with TextVectorization. Second, we obtain word embeddings from an Embedding layer, initialized from a uniform distribution. Finally, we obtain a tweet feature by average-pooling all word embeddings with GlobalAveragePooling1D; the pooled output has 128 dimensions. We then add a fully connected layer with a 10% dropout layer; its 32-dimensional output is the tweet feature in the required run.

3.2.2 Optional runs. In these runs, we may use resources from outside the given dataset, including pre-trained models. Following the results of the FakeNews task in MediaEval 2020 [8, 9], we use a BERT-based language model: small_bert from TensorFlow Hub [6]. We again add the fully connected layer to obtain the tweet features. In the stance classification subtask, we also compare against an NNLM-based language model from TensorFlow Hub [2].

3.3 Classification Models
We prepare three classification models, one per subtask, each taking the tweet features as input.

3.3.1 Misinformation Detection. Because the ratio of the labels is nearly 2:1:1, we build two binary classifiers. Figure 1 shows the correspondence between the given labels and the labels we impose for this subtask. The first classifier decides whether a tweet refers to any conspiracy; if it does, the second decides whether the tweet supports the conspiracy. Consequently, non-conspiracy tweets are not used to train the second classifier. We expect this to reduce the bias caused by the imbalance of the given labels. In the experiments, we compare this structure with a model that classifies the three labels directly.

Figure 1: Comparison table between the given labels and the new labels in Misinformation Detection.
# | Stance label | Refers to conspiracy | Agrees with conspiracy
1 | Non-conspiracy | NO (0) | excluded from training
2 | Discusses conspiracy | YES (1) | NO (0)
3 | Promotes/supports conspiracy | YES (1) | YES (1)

3.3.2 Conspiracy Theory Recognition. We build a classifier with nine parallel outputs, one per pre-defined conspiracy, using another fully connected layer that outputs nine values.

3.3.3 Combined Misinformation and Conspiracies Detection. We prepare nine three-class classifiers that deduce the stances. We do not use two binary classifiers here because too few tweets refer to each individual conspiracy.

4 RESULTS AND ANALYSIS
4.1 Effect of Language Model
Table 1 shows the returned results of the FakeNews task. All result values are the Matthews correlation coefficient (MCC) [1].

Table 1: Results of the implemented models.
Subtask | Model | MCC
Misinformation Detection | Word emb. | 0.142
Misinformation Detection | BERT-based | 0.413
Misinformation Detection | NNLM-based | 0.388
Conspiracy Theory Recognition | Word emb. | 0.133
Conspiracy Theory Recognition | BERT-based | 0.267
Combined Misinformation & Conspiracies Detection | Word emb. | 0.000
Combined Misinformation & Conspiracies Detection | BERT-based | 0.000

Using the pre-trained language model improves the results in every subtask except CMCD. In the CMCD subtask, both models output only one label, which means non-conspiracy. We attribute this to the fact that separating the classifiers by pre-defined conspiracy increases the ratio of non-conspiracy tweets. According to the task organizers, other participants also submitted such all-one outputs.

Table 2 shows the detailed results for the CTR subtask. The BERT model is better for seven of the nine pre-defined conspiracies, and even in the remaining two cases the differences are small.

Table 2: Detailed MCC results in conspiracy detection. See the overview paper for the abbreviations of the pre-defined conspiracies [10].
Model | SC | BMC | A | FV | IP | HRI | PR | NWO | S
Emb. | 0.01 | 0.16 | 0.30 | -0.09 | 0.15 | 0.10 | 0.19 | 0.20 | 0.16
BERT | 0.04 | 0.41 | 0.44 | 0.09 | 0.05 | 0.53 | 0.30 | 0.41 | 0.13

4.2 Double Binary Classification
Table 3 shows the results of the two classification structures, both of which use the BERT-based language model. The double binary classification achieves a better score than the single three-class classification.

Table 3: Results of the double binary classification and the single three-class classification with BERT in the MD subtask.
Model | MCC
Double binary classifications | 0.413
Single three-class classification | 0.258

5 DISCUSSION AND OUTLOOK
In this paper, we participated in the FakeNews task, which requires classifying tweets with respect to conspiracies. To this end, we employed pre-trained language models, following models from the FakeNews task of MediaEval 2020 [11], and compared them with models that use only word embeddings. According to the experimental results, the pre-trained language model helps extract conspiracy information in both stance classification and conspiracy detection. However, in the CMCD subtask, all outputs collapse to a single label. We suspect the classification models fail because tweets mentioning each pre-defined conspiracy are scarce and scattered; however, judging from the models of other teams, it is also possible that we designed our CMCD models incorrectly. A closer look at the CTR results shows that the effectiveness of the pre-trained language model varies across the pre-defined conspiracies. This may stem from characteristics of the tweet content and needs further research. Moreover, we compared the two classification structures in the MD subtask: the experiments show that the double binary classification is better than the single three-class classification. We expect this is because the ratio of the three classes is nearly 2:1:1; with a different ratio, the trend might not hold.

ACKNOWLEDGMENTS
This work was supported by JSPS KAKENHI Grant Numbers JP18H03229, JP18H03340, 18K19835, JP19H04113, JP19K12107, and JP21H03496.
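The preprocessing rules of Section 3.1 can be sketched in Python. This is an illustrative reimplementation, not the authors' code: the contraction map and stopword set below are tiny hand-rolled stand-ins for the contractions tool [15] and NLTK's stopword list [3].

```python
import re

# Hand-rolled stand-ins for the external tools used in the paper
# (the `contractions` package [15] and NLTK's stopword corpus [3]).
CONTRACTIONS = {"can't": "cannot", "it's": "it is", "don't": "do not"}
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "by"}

def preprocess(tweet: str) -> str:
    """Apply the Section 3.1 rules, in order."""
    # 1. Expand contracted forms.
    for short, full in CONTRACTIONS.items():
        tweet = re.sub(re.escape(short), full, tweet, flags=re.IGNORECASE)
    # 2. Lowercase everything.
    tweet = tweet.lower()
    # 3. Keep only letters, digits, and whitespace (this drops emojis too).
    tweet = re.sub(r"[^a-z0-9\s]", " ", tweet)
    # 4. Replace every number with 0, except in the token "covid19".
    tweet = " ".join(
        tok if tok == "covid19" else re.sub(r"\d+", "0", tok)
        for tok in tweet.split()
    )
    # 5. Remove stopwords.
    return " ".join(tok for tok in tweet.split() if tok not in STOPWORDS)

cleaned = preprocess("It's 100% true: covid19 isn't caused by 5G towers!")
assert "covid19" in cleaned.split() and "100" not in cleaned
```

In the actual pipeline, calls such as `contractions.fix(...)` and `nltk.corpus.stopwords.words("english")` would replace these stand-ins.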
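The required-run feature extractor of Section 3.2.1 is, in the paper, a Keras pipeline (TextVectorization, Embedding, GlobalAveragePooling1D, then a dense layer). A minimal pure-Python sketch of the same mechanics, with hypothetical helper names, shows how a tweet becomes a fixed-size vector:

```python
import random

# Pure-Python sketch of the "required run" feature path (Section 3.2.1).
# The helper names are hypothetical; the paper builds this from Keras layers.
EMBED_DIM = 128  # dimensionality of the pooled representation, per the paper

def build_vocab(corpus):
    """Assign each distinct token an integer id; 0 is reserved for unknown tokens."""
    vocab = {}
    for text in corpus:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def encode(text, vocab):
    """TextVectorization analogue: token sequence -> integer id sequence."""
    return [vocab.get(tok, 0) for tok in text.split()]

def embed_and_pool(ids, table):
    """Embedding lookup followed by global average pooling over the sequence."""
    vectors = [table[i] for i in ids]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

rng = random.Random(0)
corpus = ["covid19 vaccine news", "0g towers cause covid19"]
vocab = build_vocab(corpus)
# Uniformly initialized embedding table, as in the paper.
table = [[rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
         for _ in range(len(vocab) + 1)]

feature = embed_and_pool(encode(corpus[0], vocab), table)
assert len(feature) == EMBED_DIM
```

In the paper, a fully connected layer with a 10% dropout layer then maps this 128-dimensional pooled vector to the final 32-dimensional tweet feature.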
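The label scheme of Figure 1 and the two-stage inference of Section 3.3.1 can be sketched as follows; the function names are hypothetical stand-ins, and the binary predictions would come from the two trained classifiers.

```python
def to_binary_labels(stance):
    """Map a given three-class stance (Figure 1) to the two binary training
    targets: (refers to conspiracy, agrees with conspiracy). The second
    target is None for non-conspiracy tweets, which are excluded when
    training the second classifier."""
    mapping = {
        "non-conspiracy": (0, None),
        "discusses": (1, 0),
        "supports": (1, 1),
    }
    return mapping[stance]

def combine(refers, agrees):
    """Merge the two binary predictions back into one of the three stances.
    The second prediction is only consulted when the first says YES."""
    if refers == 0:
        return "non-conspiracy"
    return "supports" if agrees == 1 else "discusses"
```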
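All scores in Section 4 are the Matthews correlation coefficient [1]. For a binary confusion matrix it is MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)), conventionally set to 0 when the denominator vanishes. A small sketch (binary case only; the task itself uses the multiclass generalization) also makes clear why an all-one-label output, as in the CMCD runs of Table 1, scores exactly 0.000:

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix [1].
    Returns 0.0 when the denominator is zero, e.g. when the classifier
    emits only a single label."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A classifier that predicts the same label for everything has
# tn = fn = 0, so the denominator is zero and the MCC is 0.0.
assert mcc(tp=10, tn=0, fp=5, fn=0) == 0.0
```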
REFERENCES
[1] Pierre Baldi, Søren Brunak, Yves Chauvin, Claus A. F. Andersen, and Henrik Nielsen. 2000. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16, 5 (2000), 412–424. https://doi.org/10.1093/bioinformatics/16.5.412
[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research 3 (2003), 1137–1155.
[3] Steven Bird and Edward Loper. 2004. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, Barcelona, Spain, 214–217. https://www.aclweb.org/anthology/P04-3031
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:cs.CL/1810.04805
[5] Karen M. Douglas. 2021. COVID-19 conspiracy theories. Group Processes & Intergroup Relations 24, 2 (2021), 270–275. https://doi.org/10.1177/1368430220982068
[6] TensorFlow Hub. 2021. small_bert/bert_en_uncased_L-4_H-512_A-8. https://tfhub.dev/tensorflow/small
[7] Nikhil Ketkar. 2017. Introduction to Keras. In Deep Learning with Python. Springer, 97–111.
[8] Andrey Malakhov, Alessandro Patruno, and Stefano Bocconi. 2020. Fake News Classification with BERT. In Working Notes Proceedings of the MediaEval 2020 Workshop, Online, 14-15 December 2020 (CEUR Workshop Proceedings, Vol. 2882), Steven Hicks, Debesh Jha, Konstantin Pogorelov, Alba García Seco de Herrera, Dmitry Bogdanov, Pierre-Etienne Martin, Stelios Andreadis, Minh-Son Dao, Zhuoran Liu, José Vargas Quiros, Benjamin Kille, and Martha A. Larson (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-2882/paper38.pdf
[9] Olga Papadopoulou, Giorgos Kordopatis-Zilos, and Symeon Papadopoulos. 2020. MeVer Team Tackling Corona Virus and 5G Conspiracy Using Ensemble Classification Based on BERT. In Working Notes Proceedings of the MediaEval 2020 Workshop, Online, 14-15 December 2020 (CEUR Workshop Proceedings, Vol. 2882), Steven Hicks, Debesh Jha, Konstantin Pogorelov, Alba García Seco de Herrera, Dmitry Bogdanov, Pierre-Etienne Martin, Stelios Andreadis, Minh-Son Dao, Zhuoran Liu, José Vargas Quiros, Benjamin Kille, and Martha A. Larson (Eds.). CEUR-WS.org. http://ceur-ws.org/Vol-2882/paper76.pdf
[10] Konstantin Pogorelov, Daniel Thilo Schroeder, Stefan Brenner, and Johannes Langguth. 2021. FakeNews: Corona Virus and Conspiracies Multimedia Analysis Task at MediaEval 2021. In the MediaEval 2021 Workshop, Online, 13-15 December 2021.
[11] Konstantin Pogorelov, Daniel Thilo Schroeder, Luk Burchard, Johannes Moe, Stefan Brenner, Petra Filkukova, and Johannes Langguth. 2020. FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020. In MediaEval 2020 Workshop. Online.
[12] Konstantin Pogorelov, Daniel Thilo Schroeder, Petra Filkuková, Stefan Brenner, and Johannes Langguth. 2021. WICO Text: A Labeled Dataset of Conspiracy Theory and 5G-Corona Misinformation Tweets. In 2021 Workshop on Open Challenges in Online Social Networks. Online, 21–25.
[13] Jon Roozenbeek, Claudia R. Schneider, Sarah Dryhurst, John Kerr, Alexandra L. J. Freeman, Gabriel Recchia, Anne Marthe van der Bles, and Sander van der Linden. 2020. Susceptibility to misinformation about COVID-19 around the world. Royal Society Open Science 7, 10 (Oct. 2020), 201199. https://doi.org/10.1098/rsos.201199
[14] Joseph E. Uscinski, Adam M. Enders, Casey Klofstad, Michelle Seelig, John Funchion, Caleb Everett, Stefan Wuchty, Kamal Premaratne, and Manohar Murthi. 2020. Why do people believe COVID-19 conspiracy theories? Harvard Kennedy School Misinformation Review 1, 3 (2020).
[15] Pascal van Kooten. 2020. contractions. https://github.com/kootenpv/contractions