=Paper= {{Paper |id=Vol-2882/MediaEval_20_paper_76 |storemode=property |title=MeVer Team Tackling Corona Virus and 5G Conspiracy Using Ensemble Classification Based on BERT |pdfUrl=https://ceur-ws.org/Vol-2882/paper76.pdf |volume=Vol-2882 |authors=Olga Papadopoulou,Giorgos Kordopatis-Zilos,Symeon Papadopoulos |dblpUrl=https://dblp.org/rec/conf/mediaeval/PapadopoulouKP20 }} ==MeVer Team Tackling Corona Virus and 5G Conspiracy Using Ensemble Classification Based on BERT== https://ceur-ws.org/Vol-2882/paper76.pdf
             MeVer Team Tackling Corona Virus and 5G Conspiracy
                Using Ensemble Classification Based on BERT
             Olga Papadopoulou                                   Giorgos Kordopatis-Zilos                         Symeon Papadopoulos
     Information Technologies Institute,                     Information Technologies Institute,             Information Technologies Institute,
        CERTH, Thessaloniki, Greece                             CERTH, Thessaloniki, Greece                     CERTH, Thessaloniki, Greece
              olgapapa@iti.gr                                     georgekordopatis@iti.gr                             papadop@iti.gr

ABSTRACT
This paper presents the approach developed by the Media Verification (MeVer) team to tackle the task of FakeNews: Coronavirus and 5G conspiracy at the MediaEval 2020 Challenge. We build a two-stage classification approach based on ensemble learning of multiple classification networks. Because the dataset is imbalanced and relatively small, our ensemble method leads to improved performance compared to a single classification model. We fine-tune pre-trained Bidirectional Encoder Representations from Transformers (BERT), one of the most popular transformer models, on the problem of Coronavirus and 5G conspiracy detection. Our approach achieved a score of 0.413 in terms of the Matthews Correlation Coefficient (MCC), which is the official evaluation metric of the task.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’20, December 14-15 2020, Online

1    INTRODUCTION
COVID-19 emerged as a health crisis (pandemic) and soon evolved into an infodemic (‘infodemic’ refers to an overabundance of information). COVID-19 conspiracy theories, and specifically 5G disinformation, already have harmful impacts on society. The incident of British 5G towers being set on fire because of coronavirus conspiracy theories [14] is a representative example of how important it is to detect and prevent the dissemination of such theories. The FakeNews: Coronavirus and 5G conspiracy task is a challenge of MediaEval 2020 that focuses on the analysis of tweets around Coronavirus and 5G conspiracy theories in order to detect misinformation spreaders. For further details on the subtasks and the respective dataset, the reader is referred to [9].
   Our approach focuses on ensemble classification in order to overcome the relatively small training dataset and predict Coronavirus and 5G conspiracy tweets more accurately. In short, a first-level classification is applied using majority voting over nine classifiers to detect conspiracy and non-conspiracy tweets. A second-level classification is then applied to distinguish the conspiracy tweets related to 5G from the other conspiracy ones. For the training process, we leverage the pre-trained BERT [1] model and the implementation provided by the HuggingFace library [15]¹.

¹ https://huggingface.co/transformers/model_doc/bert.html

2    RELATED WORK
In the case of a pandemic such as that of the Coronavirus, the intentional or unintentional dissemination of manipulated content, conspiracy theories, and propaganda is critical [12]. Several works have been recently published dealing with the detection and verification of COVID-19-related misinformation [2, 3, 10, 11]. Misinformation can be spread in the form of text, images, and videos. Natural language processing (NLP) is a means of dealing with many types of content. For example, the authors of [8] collected a database of debunked and verified user-generated videos and developed a method to detect them using the contextual information surrounding them rather than the video content. The emergence of BERT (Bidirectional Encoder Representations from Transformers) has led many researchers to use it for text classification and thus in the detection of fake news [5, 7]. A key limitation when dealing with emerging topics, which call for models dedicated to a specific topic, is the lack of sufficient training samples. To this end, researchers are leaning towards solutions based on ensemble methods, unsupervised learning, and data augmentation.

3    PROPOSED APPROACH
Figure 1 illustrates the pipeline of the proposed approach. We follow a two-step classification approach:

   • The first step consists of an initial classification based on ensemble learning in order to provide a first-level classification of Conspiracy and Non-conspiracy tweets.
   • The second step consists of the final prediction that classifies the detected Conspiracy tweets as 5G-conspiracy or Other-conspiracy.

   The provided dataset consists of 1,135 samples of the 5G-conspiracy class, 712 of the Other-conspiracy class and 4,198 samples of the Non-conspiracy class. As described in [4], imbalanced datasets for training machine learning algorithms or deep learning approaches pose risks of bias towards the majority class. To this end, we sub-sample training tweets of the majority classes in order to balance the training sets and build the proposed classifiers. Specifically, Table 1 presents the number of training samples considered per classifier. In $CL_i$, the training samples of 5G-conspiracy and Other-conspiracy are concatenated into an overall Conspiracy class (1,847 tweets) and an equal number of tweets is randomly sampled from the Non-conspiracy tweets.
   In the first step of our approach, we train $N$ classifiers $CL_i$, which are used to predict Conspiracy and Non-conspiracy tweets. $N$ is empirically selected to be nine. An odd number of classifiers ensures that majority voting cannot end in a tie. Each classifier $CL_i$ predicts a label of 1 for Conspiracy or 0 for Non-conspiracy tweets. Majority voting is applied and the final prediction per tweet is Conspiracy if $\sum_{i=1}^{N} CL_i > N/2$, where $N = 9$, and Non-conspiracy otherwise. For each model, a different sample of Non-conspiracy tweets is selected.
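The first step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_balanced_sets` and `majority_vote`, the fixed random seed, and the placeholder tweet lists are all hypothetical, and the classifiers themselves are left out. It only shows how nine balanced training sets could be drawn and how nine 0/1 votes are combined with the rule sum(CL_i) > N/2.

```python
import random

N_CLASSIFIERS = 9  # N in the text, chosen empirically by the authors


def build_balanced_sets(conspiracy, non_conspiracy, n_sets=N_CLASSIFIERS, seed=0):
    """Build n_sets training sets. Each set pairs all Conspiracy tweets
    (label 1) with an equally sized random sample of Non-conspiracy
    tweets (label 0); a fresh sample is drawn for every set."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    sets = []
    for _ in range(n_sets):
        sampled = rng.sample(non_conspiracy, len(conspiracy))
        examples = [(t, 1) for t in conspiracy] + [(t, 0) for t in sampled]
        rng.shuffle(examples)
        sets.append(examples)
    return sets


def majority_vote(votes):
    """votes holds one 0/1 label per classifier CL_i.
    Returns 1 (Conspiracy) if sum(votes) > N/2, else 0 (Non-conspiracy)."""
    return 1 if sum(votes) > len(votes) / 2 else 0


if __name__ == "__main__":
    # Placeholder tweets sized like the dataset in the text (1,847 vs 4,198).
    conspiracy = [f"conspiracy_{i}" for i in range(1847)]
    non_conspiracy = [f"non_conspiracy_{i}" for i in range(4198)]
    training_sets = build_balanced_sets(conspiracy, non_conspiracy)
    print(len(training_sets), len(training_sets[0]))   # 9 3694
    print(majority_vote([1, 1, 1, 1, 1, 0, 0, 0, 0]))  # 1
```

With nine binary votes the sum can never equal N/2 = 4.5, so the rule always yields a decision.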




                  Figure 1: Our proposed pipeline for tackling the challenge of Corona virus and 5G conspiracy

Table 1: Summary of the training samples used to build the respective models

          Label                 $CL_i$    $CL_{multi}$    $CL_{consp}$
          5G conspiracy         1847 *    712             712
          Other conspiracy      *         712             712
          Non-conspiracy        1847      712             -
          * For $CL_i$, 5G conspiracy and Other conspiracy are merged into a single Conspiracy class of 1847 tweets.

Table 2: Evaluation results in terms of MCC, the official metric proposed for the task.

          Method                MCC
          three-class BERT      0.42
          Proposed approach     0.81
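Since MCC is the metric reported for the task, a small reference computation may help in reading the scores. The sketch below uses the standard multi-class generalization of the Matthews Correlation Coefficient, which reduces to the familiar binary formula for two classes. It is an illustration only: the paper does not state which implementation was used for scoring, and the function name `mcc` and the zero-denominator convention are assumptions.

```python
import math
from collections import Counter


def mcc(y_true, y_pred):
    """Multi-class Matthews Correlation Coefficient.

    With s samples, c correct predictions, t_k true occurrences and
    p_k predicted occurrences of class k:
        MCC = (c*s - sum_k t_k*p_k)
              / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k = Counter(y_true)  # how often each class truly occurs
    p_k = Counter(y_pred)  # how often each class is predicted
    classes = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in classes)
    denominator = math.sqrt(
        (s * s - sum(v * v for v in p_k.values()))
        * (s * s - sum(v * v for v in t_k.values()))
    )
    # Assumed convention: report 0.0 when the denominator vanishes,
    # e.g. when every sample is assigned to the same class.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    labels = ["5G", "other", "non", "5G", "non", "non"]
    print(mcc(labels, labels))  # 1.0 for a perfect prediction
```

A value of 1 means perfect agreement and 0 means no better than chance, which is why MCC is often preferred over accuracy on imbalanced data such as this dataset.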

   In the second step, the predictions of Non-conspiracy are considered as final predictions without further processing, while the Conspiracy tweets are further processed to distinguish 5G-conspiracy from Other-conspiracy. In this step, two additional models are trained focusing on the detection of 5G-conspiracy tweets. The first, $CL_{f1}$ (the $CL_{multi}$ column of Table 1), is a three-class model (1: 5G-conspiracy, 2: Other-conspiracy and 3: Non-conspiracy) trained using random samples from the majority classes and the total number of minority class samples (Other-conspiracy). The other model, $CL_{f2}$ (the $CL_{consp}$ column of Table 1), is a binary classifier trained on the two Conspiracy classes. A tweet is labeled as 5G-conspiracy only if both models agree, i.e., $CL_{f1} = CL_{f2} = 1$ (5G-conspiracy). In any other case, the tweet is labeled as Other-conspiracy.

3.1    Implementation details
For tokenization, we employ the bert-base-uncased version of BertTokenizer applied to the text of the tweets. The text is limited to 160 tokens as input to the network. Considering that the maximum tweet length is 280 characters, it is most likely that the entire text is processed to calculate the prediction. As a backbone network, we employ the bert-base-uncased version of BERT [13], which is a compact transformer model, trained on lower-cased English text. The network architecture consists of 12 layers (i.e., Transformer blocks), with 768 hidden units, and 12 heads for multi-head attention layers, resulting in a total of 109M parameters.
   We fine-tune our networks using the Adam optimizer [6] with learning rate $2 \times 10^{-5}$. The models are trained for 10 epochs with batch size 32 and categorical cross-entropy as the loss function. During training, we use dropout after the backbone network with a 0.3 drop rate to prevent overfitting. Our models are evaluated against a validation set, and we select the versions that achieve the best performance in terms of accuracy as our final models.

4   RESULTS AND ANALYSIS
Initially, we trained a three-class model using the implementation details presented in subsection 3.1. From the annotated dataset, we randomly selected 100 samples per class as a testing set and excluded them from the training phase in all runs. The performance of the model is 0.42 in terms of MCC. In order to improve the performance, we implemented the presented two-step classification approach, resulting in an increase of the MCC metric to 0.81, as presented in Table 2.
   Our proposed approach achieved a score of 0.413 in terms of MCC on the provided testing set of unseen tweets.

5   DISCUSSION AND OUTLOOK
The proposed method achieves fairly accurate results in the task of FakeNews: Coronavirus and 5G conspiracy (an MCC of 0.413 on the official testing set). More deep learning models, variants of BERT or other models, will be used in future experiments trying to achieve better performance. To tackle the limitation of insufficient training samples, we also intend to experiment with data augmentation approaches in order to create more samples of the minority classes and build more robust classifiers.

ACKNOWLEDGMENTS
This work is supported by the WeVerify project, which is funded by the European Commission under contract number 825297.
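For reference, the second-step decision rule of Section 3 can be written as a small function. This is a hedged sketch rather than the authors' code: the name `final_label` and the string labels are hypothetical, and the binary model's output 1 is assumed to mean 5G-conspiracy, with any other value treated as Other-conspiracy, since the paper does not spell out that encoding.

```python
FIVE_G = "5G-conspiracy"
OTHER = "Other-conspiracy"
NON = "Non-conspiracy"


def final_label(first_step_is_conspiracy, cl_f1_label, cl_f2_label):
    """Combine the first-step vote with the two second-step models.

    first_step_is_conspiracy: result of the first-step majority vote.
    cl_f1_label: three-class model output (1: 5G, 2: Other, 3: Non).
    cl_f2_label: binary model output; 1 is assumed to mean 5G-conspiracy.
    """
    # Non-conspiracy predictions from the first step are final.
    if not first_step_is_conspiracy:
        return NON
    # 5G-conspiracy requires agreement: CL_f1 = CL_f2 = 1.
    if cl_f1_label == 1 and cl_f2_label == 1:
        return FIVE_G
    # Any disagreement falls back to Other-conspiracy.
    return OTHER


if __name__ == "__main__":
    print(final_label(True, 1, 1))   # 5G-conspiracy
    print(final_label(True, 3, 1))   # Other-conspiracy
    print(final_label(False, 1, 1))  # Non-conspiracy
```

Requiring both models to agree makes the 5G-conspiracy label conservative: a tweet the three-class model places in another class is never labeled 5G-conspiracy, even if the binary model votes for it.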


REFERENCES
[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[2] Mohamed K Elhadad, Kin Fun Li, and Fayez Gebali. 2020. Detecting Misleading Information on COVID-19. IEEE Access 8 (2020), 165201–165215.
[3] Tamanna Hossain, Robert L Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sameer Singh, and Sean Young. 2020. Detecting covid-19 misinformation on social media. (2020).
[4] Justin M Johnson and Taghi M Khoshgoftaar. 2019. Survey on deep learning with class imbalance. Journal of Big Data 6, 1 (2019), 27.
[5] Heejung Jwa, Dongsuk Oh, Kinam Park, Jang Mook Kang, and Heuiseok Lim. 2019. exBAKE: Automatic Fake News Detection Model Based on Bidirectional Encoder Representations from Transformers (BERT). Applied Sciences 9, 19 (2019), 4062.
[6] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[7] Chao Liu, Xinghua Wu, Min Yu, Gang Li, Jianguo Jiang, Weiqing Huang, and Xiang Lu. 2019. A Two-Stage Model Based on BERT for Short Fake News Detection. In International Conference on Knowledge Science, Engineering and Management. Springer, 172–183.
[8] Olga Papadopoulou, Markos Zampoglou, Symeon Papadopoulos, and Ioannis Kompatsiaris. 2019. A corpus of debunked and verified user-generated videos. Online information review (2019).
[9] Konstantin Pogorelov, Daniel Thilo Schroeder, Luk Burchard, Johannes Moe, Stefan Brenner, Petra Filkukova, and Johannes Langguth. 2020. FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020. In MediaEval 2020 Workshop.
[10] Juan Carlos Medina Serrano, Orestis Papakyriakopoulos, and Simon Hegelich. 2020. NLP-based Feature Extraction for the Detection of COVID-19 Misinformation Videos on YouTube. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020.
[11] Karishma Sharma, Sungyong Seo, Chuizheng Meng, Sirisha Rambhatla, Aastha Dua, and Yan Liu. 2020. Coronavirus on social media: Analyzing misinformation in Twitter conversations. arXiv preprint arXiv:2003.12309 (2020).
[12] Samia Tasnim, Md Mahbub Hossain, and Hoimonty Mazumder. 2020. Impact of rumors or misinformation on coronavirus disease (COVID-19) in social media. (2020).
[13] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962 (2019).
[14] Tom Warren. 2020. British 5G towers are being set on fire because of coronavirus conspiracy theories. (Apr 2020). https://www.theverge.com/2020/4/4/21207927/5g-towers-burning-uk-coronavirus-conspiracy-theory-link
[15] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6