=Paper=
{{Paper
|id=Vol-2882/MediaEval_20_paper_76
|storemode=property
|title=MeVer Team Tackling Corona Virus and 5G Conspiracy Using Ensemble Classification Based on BERT
|pdfUrl=https://ceur-ws.org/Vol-2882/paper76.pdf
|volume=Vol-2882
|authors=Olga Papadopoulou,Giorgos Kordopatis-Zilos,Symeon Papadopoulos
|dblpUrl=https://dblp.org/rec/conf/mediaeval/PapadopoulouKP20
}}
==MeVer Team Tackling Corona Virus and 5G Conspiracy Using Ensemble Classification Based on BERT==
Olga Papadopoulou (olgapapa@iti.gr), Giorgos Kordopatis-Zilos (georgekordopatis@iti.gr), Symeon Papadopoulos (papadop@iti.gr)
Information Technologies Institute, CERTH, Thessaloniki, Greece
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'20, December 14-15 2020, Online

ABSTRACT
This paper presents the approach developed by the Media Verification (MeVer) team to tackle the task of FakeNews: Coronavirus and 5G conspiracy at the MediaEval 2020 Challenge. We build a two-stage classification approach based on ensemble learning of multiple classification networks. Given the imbalanced and relatively small dataset, our ensemble method leads to improved performance compared to a single classification model. We fine-tune pre-trained Bidirectional Encoder Representations from Transformers (BERT), one of the most popular transformer models, on the problem of Coronavirus and 5G conspiracy detection. Our approach achieved a score of 0.413 in terms of the Matthews Correlation Coefficient (MCC), which is the official evaluation metric of the task.

1 INTRODUCTION
COVID-19 emerged as a health crisis (pandemic) and soon evolved into an infodemic (“infodemic” refers to an overabundance of information). COVID-19 conspiracy theories, and 5G disinformation in particular, are already having harmful impacts on society. The incident of British 5G towers being set on fire because of coronavirus conspiracy theories [14] is a representative example of how important it is to detect and prevent the dissemination of such theories. The FakeNews: Coronavirus and 5G conspiracy task is a challenge of MediaEval 2020 that focuses on the analysis of tweets around Coronavirus and 5G conspiracy theories in order to detect misinformation spreaders. For further details on the subtasks and the respective dataset, the reader is referred to [9].

Our approach focuses on ensemble classification in order to overcome the relatively small training dataset and predict the Coronavirus and 5G conspiracy tweets more accurately. In short, a first-level classification is applied using majority voting over nine classifiers to detect conspiracy and non-conspiracy tweets. A second-level classification is then applied to distinguish the conspiracy tweets related to 5G from the other conspiracy ones. For the training process, we leverage the pre-trained BERT [1] model and the implementation provided by the HuggingFace library [15] (https://huggingface.co/transformers/model_doc/bert.html).

2 RELATED WORK
In case of a pandemic such as that of the Coronavirus, the intentional or unintentional dissemination of manipulated content, conspiracy theories, and propaganda is critical [12]. Several works have been recently published dealing with the detection and verification of COVID-19-related misinformation [2, 3, 10, 11]. Misinformation can be spread in the form of text, images, and videos. Natural language processing (NLP) is a means of dealing with many types of content. For example, the authors of [8] collected a database of debunked and verified user-generated videos and developed a method to detect them using the contextual information surrounding them rather than the video content. The emergence of BERT (Bidirectional Encoder Representations from Transformers) has led many researchers to use it for text classification and thus in the detection of fake news [5, 7]. A key limitation when dealing with emerging topics, which require models dedicated to a specific topic, is the lack of sufficient training samples. To this end, researchers are leaning towards solutions based on ensemble methods, unsupervised learning, and data augmentation.

3 PROPOSED APPROACH
Figure 1 illustrates the pipeline of the proposed approach. We follow a two-step classification approach:
• The first step consists of an initial classification based on ensemble learning in order to provide a first-level classification of Conspiracy and Non-conspiracy tweets.
• The second step consists of the final prediction that classifies the detected Conspiracy tweets as 5G-conspiracy or Other-conspiracy.

The provided dataset consists of 1,135 samples of the 5G-conspiracy class, 712 of the Other-conspiracy class and 4,198 samples of the Non-conspiracy class. As described in [4], imbalanced datasets for training machine learning algorithms or deep learning approaches pose risks of bias towards the majority class. To this end, we sub-sample training tweets of the majority classes in order to balance the training sets and build the proposed classifiers. Specifically, Table 1 presents the number of training samples considered per classifier. In $CL_i$, the training samples of 5G-conspiracy and Other-conspiracy are concatenated into an overall Conspiracy class (1,847 tweets) and an equal number of tweets is randomly sampled from the Non-conspiracy tweets.

In the first step of our approach, we train $N$ classifiers $CL_i$, which are used to predict Conspiracy and Non-conspiracy tweets. $N$ is empirically selected to be nine. An odd number of classifiers guarantees that majority voting never ends in a tie. Each classifier $CL_i$ predicts a label of 1 for Conspiracy or 0 for Non-conspiracy tweets. Majority voting is applied and the final prediction per tweet is given by the test $\sum_{i=1}^{N} CL_i > N/2$, where $N = 9$: if the inequality holds, the prediction is Conspiracy; otherwise, it is Non-conspiracy. For each model, a different sample of Non-conspiracy tweets is selected.
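The first step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `balanced_subsets` and `majority_vote` are hypothetical, the classifiers are represented only by their 0/1 outputs, and each Non-conspiracy sample is drawn independently per model, since the text does not say whether samples may overlap between models.

```python
import random
from typing import List, Sequence, Tuple


def balanced_subsets(
    conspiracy: Sequence[str],
    non_conspiracy: Sequence[str],
    n_models: int = 9,
    seed: int = 0,
) -> List[Tuple[List[str], List[str]]]:
    """Build one balanced training set per classifier (hypothetical helper).

    Every set keeps all Conspiracy tweets and adds an equally sized random
    draw, without replacement, from the Non-conspiracy tweets.
    Raises ValueError if there are fewer Non-conspiracy than Conspiracy tweets.
    """
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_models):
        # Assumption: a fresh, independent draw for every model.
        negatives = rng.sample(list(non_conspiracy), k=len(conspiracy))
        subsets.append((list(conspiracy), negatives))
    return subsets


def majority_vote(votes: Sequence[int]) -> int:
    """Return 1 (Conspiracy) if sum(votes) > N/2, else 0 (Non-conspiracy)."""
    return 1 if sum(votes) > len(votes) / 2 else 0


# Hypothetical outputs of nine classifiers for a single tweet.
votes = [1, 1, 0, 1, 0, 1, 1, 0, 0]
label = majority_vote(votes)  # five of the nine votes are 1
```

With the 1,847 Conspiracy and 4,198 Non-conspiracy tweets reported above, every subset would hold 1,847 tweets per class, and a tweet needs at least five votes of 1 out of nine to be labeled Conspiracy.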
Figure 1: Our proposed pipeline for tackling the challenge of Corona virus and 5G conspiracy

Table 1: Summary of the training samples used to build the respective models
Label | $CL_i$ | $CL_{multi}$ | $CL_{consp}$
5G conspiracy | 1847 (5G conspiracy and Other conspiracy combined) | 712 | 712
Other conspiracy | (included in the 1847 above) | 712 | 712
Non-conspiracy | 1847 | 712 | -

Table 2: Evaluation results in terms of MCC, the official metric proposed for the task.
Method | MCC
three-class BERT | 0.42
Proposed approach | 0.81
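MCC is the metric reported for the task. As a minimal sketch of how such a score can be computed for the three task labels, the Python function below implements the standard multi-class generalisation of MCC from label counts. The function name `mcc` and the label values are illustrative, and returning 0.0 when the denominator vanishes is a convention assumed here, not something the paper specifies.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mcc(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Multi-class Matthews Correlation Coefficient (illustrative helper).

    With s samples, c correct predictions, t_k true occurrences and
    p_k predicted occurrences of class k:

        MCC = (c*s - sum_k p_k*t_k)
              / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)
    labels = set(t_counts) | set(p_counts)
    # Counter returns 0 for labels that never occur on one side.
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    denominator = math.sqrt(
        (s * s - sum(v * v for v in p_counts.values()))
        * (s * s - sum(v * v for v in t_counts.values()))
    )
    # Assumed convention: report 0.0 when the score is undefined.
    return numerator / denominator if denominator else 0.0
```

A score of 1 means every label is predicted correctly, 0 corresponds to chance-level agreement, and negative values indicate systematic disagreement, which is why MCC is less sensitive to class imbalance than plain accuracy.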
In the second step, the predictions of Non-conspiracy are considered as final predictions without further processing, while the Conspiracy tweets are further processed to distinguish 5G-conspiracy from Other-conspiracy. In this step, two additional models are trained focusing on the detection of 5G-conspiracy tweets. The first, $CL_{multi}$, is a three-class model (1: 5G-conspiracy, 2: Other-conspiracy and 3: Non-conspiracy) trained using random samples from the majority classes and the total number of minority class samples (Other-conspiracy). The other model, $CL_{consp}$, is a binary classifier trained on the two Conspiracy classes. A tweet is labeled as 5G-conspiracy only if $CL_{multi} = CL_{consp} = 1$, i.e., both models predict 5G-conspiracy. In any other case, the tweet is labeled as Other-conspiracy.

3.1 Implementation details
For tokenization, we employ the bert-base-uncased version of BertTokenizer applied to the text of the tweets. The text is limited to 160 tokens as input to the network. Considering that the maximum tweet length is 280 characters, it is most likely that the entire text is processed to calculate the prediction. As a backbone network, we employ the bert-base-uncased version of BERT [13], which is a compact transformer model, trained on lower-cased English text. The network architecture consists of 12 layers (i.e., Transformer blocks), with 768 hidden units, and 12 heads for multi-head attention layers, resulting in a total of 109M parameters.

We fine-tune our networks using the Adam optimizer [6] with learning rate $2 \cdot 10^{-5}$. The models are trained for 10 epochs with batch size 32 and categorical cross-entropy as the loss function. During training, we use dropout after the backbone network with 0.3 drop rate to prevent overfitting. Our models are evaluated against a validation set, and we select the versions that achieve the best performance in terms of accuracy as our final models.

4 RESULTS AND ANALYSIS
Initially, we trained a three-class model using the implementation details presented in subsection 3.1. From the annotated dataset, we randomly selected 100 samples per class as a testing set and excluded them from the training phase in all runs. The performance of this model is 0.42 in terms of MCC. In order to improve the performance, we implemented the presented two-step classification approach, resulting in an increase of the MCC to 0.81, as presented in Table 2.

Our proposed approach achieved a score of 0.413 in terms of MCC on the provided testing set of unseen tweets.

5 DISCUSSION AND OUTLOOK
The proposed method achieves an MCC of 0.81 on our held-out split and 0.413 on the official testing set of the FakeNews: Coronavirus and 5G conspiracy task. More deep learning models, variants of BERT or other models, will be used in future experiments trying to achieve better performance. To tackle the limitation of insufficient training samples, we also intend to experiment with data augmentation approaches in order to create more samples of the minority classes and build more robust classifiers.

ACKNOWLEDGMENTS
This work is supported by the WeVerify project, which is funded by the European Commission under contract number 825297.
REFERENCES
[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[2] Mohamed K Elhadad, Kin Fun Li, and Fayez Gebali. 2020. Detecting Misleading Information on COVID-19. IEEE Access 8 (2020), 165201–165215.
[3] Tamanna Hossain, Robert L Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sameer Singh, and Sean Young. 2020. Detecting covid-19 misinformation on social media. (2020).
[4] Justin M Johnson and Taghi M Khoshgoftaar. 2019. Survey on deep learning with class imbalance. Journal of Big Data 6, 1 (2019), 27.
[5] Heejung Jwa, Dongsuk Oh, Kinam Park, Jang Mook Kang, and Heuiseok Lim. 2019. exBAKE: Automatic Fake News Detection Model Based on Bidirectional Encoder Representations from Transformers (BERT). Applied Sciences 9, 19 (2019), 4062.
[6] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[7] Chao Liu, Xinghua Wu, Min Yu, Gang Li, Jianguo Jiang, Weiqing Huang, and Xiang Lu. 2019. A Two-Stage Model Based on BERT for Short Fake News Detection. In International Conference on Knowledge Science, Engineering and Management. Springer, 172–183.
[8] Olga Papadopoulou, Markos Zampoglou, Symeon Papadopoulos, and Ioannis Kompatsiaris. 2019. A corpus of debunked and verified user-generated videos. Online information review (2019).
[9] Konstantin Pogorelov, Daniel Thilo Schroeder, Luk Burchard, Johannes Moe, Stefan Brenner, Petra Filkukova, and Johannes Langguth. 2020. FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020. In MediaEval 2020 Workshop.
[10] Juan Carlos Medina Serrano, Orestis Papakyriakopoulos, and Simon Hegelich. 2020. NLP-based Feature Extraction for the Detection of COVID-19 Misinformation Videos on YouTube. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020.
[11] Karishma Sharma, Sungyong Seo, Chuizheng Meng, Sirisha Rambhatla, Aastha Dua, and Yan Liu. 2020. Coronavirus on social media: Analyzing misinformation in Twitter conversations. arXiv preprint arXiv:2003.12309 (2020).
[12] Samia Tasnim, Md Mahbub Hossain, and Hoimonty Mazumder. 2020. Impact of rumors or misinformation on coronavirus disease (COVID-19) in social media. (2020).
[13] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962 (2019).
[14] Tom Warren. 2020. British 5G towers are being set on fire because of coronavirus conspiracy theories. (Apr 2020). https://www.theverge.com/2020/4/4/21207927/5g-towers-burning-uk-coronavirus-conspiracy-theory-link
[15] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6