<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MeVer Team Tackling Corona Virus and 5G Conspiracy Using Ensemble Classification Based on BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olga Papadopoulou</string-name>
          <email>olgapapa@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgos Kordopatis-Zilos</string-name>
          <email>georgekordopatis@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <email>papadop@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Technologies Institute, CERTH</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents the approach developed by the Media Verification (MeVer) team to tackle the task of FakeNews: Coronavirus and 5G conspiracy at the MediaEval 2020 Challenge. We build a two-stage classification approach based on ensemble learning of multiple classification networks. Given the imbalanced and relatively small dataset, our ensemble method leads to improved performance compared to a single classification model. We fine-tune pre-trained Bidirectional Encoder Representations from Transformers (BERT), one of the most popular transformer models, on the problem of Coronavirus and 5G conspiracy detection. Our approach achieved a score of 0.413 in terms of the Matthews Correlation Coefficient (MCC), which is the official evaluation metric of the task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        COVID-19 emerged as a health crisis (pandemic) and soon evolved
into an infodemic, that is, an overabundance of information.
Conspiracy theories about COVID-19, and 5G disinformation in particular,
have already had harmful effects on society.
The incident of the British 5G tower fires caused by coronavirus
conspiracy theories [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a representative example of how
important it is to detect and prevent the dissemination of such theories.
The FakeNews: Coronavirus and 5G conspiracy task is a challenge
of MediaEval 2020 that focuses on the analysis of tweets around
Coronavirus and 5G conspiracy theories in order to detect
misinformation spreaders. For further details on the subtasks and the
respective dataset, the reader is referred to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Our approach focuses on ensemble classification in order to
overcome the relatively small training dataset and predict
Coronavirus and 5G conspiracy tweets more accurately. In short,
a first-level classification is applied using majority voting over
nine classifiers to detect conspiracy and non-conspiracy tweets. A
second-level classification is then applied to separate the conspiracy
tweets related to 5G from the other conspiracy ones. For the training
process, we leverage the pre-trained BERT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] model and the
implementation provided by the HuggingFace library [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] (https://huggingface.co/transformers/model_doc/bert.html).
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        In the case of a pandemic such as that of the Coronavirus, the
intentional or unintentional dissemination of manipulated content,
conspiracy theories, and propaganda is critical [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Several works
have been recently published dealing with the detection and
verification of COVID-19-related misinformation [
        <xref ref-type="bibr" rid="ref10 ref11 ref2 ref3">2, 3, 10, 11</xref>
        ].
Misinformation can be spread in the form of text, images, and videos.
Natural language processing (NLP) is a means of dealing with many
types of content. For example, the authors of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] collected a
database of debunked and verified user-generated videos and developed
a method to detect them using the contextual information
surrounding them rather than the video content. The emergence of BERT
(Bidirectional Encoder Representations from Transformers) has led
many researchers to use it for text classification and thus in the
detection of fake news [
        <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
        ]. A key limitation when dealing with emerging topics,
which call for models dedicated to a specific topic, is the
lack of sufficient training samples. To this end, researchers are
leaning towards solutions based on ensemble methods, unsupervised
learning, and data augmentation.
      </p>
    </sec>
    <sec id="sec-3">
      <title>PROPOSED APPROACH</title>
      <p>Figure 1 illustrates the pipeline of the proposed approach. We follow
a two-step classification approach:
• The first step consists of an initial classification based on
ensemble learning in order to provide a first-level
classification of Conspiracy and Non-conspiracy tweets.
• The second step consists of the final prediction that
classifies the detected Conspiracy tweets as 5G-conspiracy or
Other-conspiracy.</p>
      <p>
        The provided dataset consists of 1,135 samples of the 5G-conspiracy
class, 712 of the Other-conspiracy class, and 4,198 samples of the
Non-conspiracy class. As described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], imbalanced datasets for
training machine learning algorithms or deep learning approaches pose
risks of bias towards the majority class. To this end, we sub-sample
training tweets of the majority classes in order to balance the
training sets and build the proposed classifiers. Specifically, Table 1
presents the number of training samples considered per classifier.
For the first-step classifiers, the training samples of 5G-conspiracy and Other-conspiracy
are concatenated into an overall Conspiracy class (1,847 tweets)
and an equal number of tweets is randomly sampled from the
Non-conspiracy tweets.
      </p>
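      <p>The balancing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the class sizes come from the text, while the placeholder tweet IDs, the helper name build_balanced_binary_set, and the use of one random seed per classifier are assumptions made for the example.</p>
      <preformat>
```python
import random

# Class sizes reported in the text; the tweet IDs below are placeholders.
N_5G, N_OTHER, N_NON = 1135, 712, 4198


def build_balanced_binary_set(conspiracy, non_conspiracy, seed):
    """Pair all Conspiracy items with an equally sized random draw of Non-conspiracy items."""
    rng = random.Random(seed)  # hypothetical: one seed per classifier
    sampled_non = rng.sample(non_conspiracy, k=len(conspiracy))
    # Label 1 = Conspiracy, 0 = Non-conspiracy.
    return [(x, 1) for x in conspiracy] + [(x, 0) for x in sampled_non]


conspiracy_ids = [f"c{i}" for i in range(N_5G + N_OTHER)]
non_conspiracy_ids = [f"n{i}" for i in range(N_NON)]

# One training set per first-step classifier, each with its own random draw.
training_sets = [
    build_balanced_binary_set(conspiracy_ids, non_conspiracy_ids, seed=s)
    for s in range(9)
]
print(len(training_sets[0]))  # 2 * 1847 items
```
      </preformat>
      <p>Each set contains all 1,847 Conspiracy items and a different draw of 1,847 Non-conspiracy items, so every classifier sees the full minority data but a different slice of the majority class.</p>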
      <p>In the first step of our approach, we train $N$ classifiers $C_i$, $i = 1, \dots, N$, which
are used to predict Conspiracy and Non-conspiracy tweets. $N$ is
empirically selected to be nine. An odd number of classifiers makes
it possible to apply majority voting without ties. Each classifier $C_i$ predicts a
label $c_i$ of 1 for Conspiracy or 0 for Non-conspiracy tweets. Majority
voting is applied and the final prediction per tweet is given by
$\sum_{i=1}^{N} c_i &gt; N/2$, where $N = 9$: if the inequality holds, the prediction is Conspiracy, otherwise
it is Non-conspiracy. For each model, a different sample of
Non-conspiracy tweets is selected.</p>
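      <p>The majority-voting rule of the first step can be written in a few lines. The sketch below assumes the nine classifier outputs are already available as a list of 0/1 labels; the function name majority_vote and the example votes are illustrative only.</p>
      <preformat>
```python
def majority_vote(votes):
    """Return 1 (Conspiracy) if more than half of the binary votes are 1, else 0."""
    n = len(votes)
    if n % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    return 1 if sum(votes) > n / 2 else 0


# Illustrative votes from nine classifiers: five of nine say Conspiracy.
example_votes = [1, 0, 1, 1, 0, 1, 0, 1, 0]
print(majority_vote(example_votes))
```
      </preformat>
      <p>With nine voters, at least five Conspiracy votes are needed for the tweet to be passed on to the second step.</p>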
      <p>In the second step, the predictions of Non-conspiracy are
considered as final predictions without further processing, while the
Conspiracy tweets are further processed to distinguish 5G-conspiracy
from Other-conspiracy. In this step, two additional models are trained
focusing on the detection of 5G-conspiracy tweets. The first, $M_1$,
is a three-class model (1: 5G-conspiracy, 2: Other-conspiracy and 3:
Non-conspiracy) trained using random samples from the majority
classes and all samples of the minority class
(Other-conspiracy). The other model, $M_2$, is a binary classifier trained
on the two Conspiracy classes. A tweet is labeled as 5G-conspiracy only if
$M_1 = M_2 = 1$, i.e., both models predict 5G-conspiracy. In any other case, the tweet is labeled as
Other-conspiracy.</p>
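      <p>Putting the two steps together, the decision logic can be sketched as below. The label encoding of the three-class model follows the text (1: 5G-conspiracy, 2: Other-conspiracy, 3: Non-conspiracy); the encoding of the binary model (1 for 5G-conspiracy, anything else for Other-conspiracy) and the function name final_label are assumptions for illustration.</p>
      <preformat>
```python
FIVE_G, OTHER, NON = "5G-conspiracy", "Other-conspiracy", "Non-conspiracy"


def final_label(first_step_votes, m1_pred, m2_pred):
    """Combine the first-step ensemble with the two second-step models.

    first_step_votes: binary votes (1 = Conspiracy) from the first-step classifiers.
    m1_pred: three-class model output (1 = 5G, 2 = Other, 3 = Non-conspiracy).
    m2_pred: binary model output (assumed: 1 = 5G-conspiracy, else Other-conspiracy).
    """
    if not sum(first_step_votes) > len(first_step_votes) / 2:
        return NON  # Non-conspiracy predictions are final.
    if m1_pred == 1 and m2_pred == 1:
        return FIVE_G  # Both second-step models must agree on 5G.
    return OTHER


print(final_label([1, 1, 1, 1, 1, 0, 0, 0, 0], m1_pred=1, m2_pred=1))
```
      </preformat>
      <p>Note that a tweet that passes the first step is never relabeled as Non-conspiracy, even if the three-class model outputs class 3; it falls back to Other-conspiracy.</p>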
    </sec>
    <sec id="sec-4">
      <title>Implementation details</title>
      <p>
        For tokenization, we employ the bert-base-uncased version of BertTokenizer,
applied to the text of the tweets. The text is limited to 160 tokens
as input to the network. Considering that the maximum tweet
length is 280 characters, it is most likely that the entire text is
processed to calculate the prediction. As a backbone network, we
employ the bert-base-uncased version of BERT [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which is a
compact transformer model, trained on lower-cased English text.
The network architecture consists of 12 layers (i.e., Transformer
blocks) with 768 hidden units and 12 heads per multi-head attention
layer, resulting in a total of 109M parameters.
      </p>
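      <p>As a sanity check on the 109M figure, the parameter count of the backbone can be reproduced by hand. The layer count and hidden size come from the text; the vocabulary size (30,522), maximum position count (512), number of segment types (2), and feed-forward size (3,072) are taken from the public bert-base-uncased configuration and are assumptions here, as is the inclusion of the pooler layer. Any classification head added on top is not counted.</p>
      <preformat>
```python
# Architecture values stated in the text.
NUM_LAYERS, HIDDEN = 12, 768
# Assumed from the public bert-base-uncased configuration (not given in the text).
VOCAB_SIZE, MAX_POSITIONS, TYPE_VOCAB, FFN_SIZE = 30522, 512, 2, 3072


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Token, position, and segment embeddings followed by a layer norm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm
# Query, key, value, and output projections followed by a layer norm.
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm
# Two dense layers of the feed-forward block followed by a layer norm.
feed_forward = linear(HIDDEN, FFN_SIZE) + linear(FFN_SIZE, HIDDEN) + layer_norm
pooler = linear(HIDDEN, HIDDEN)

total = embeddings + NUM_LAYERS * (attention + feed_forward) + pooler
print(total)  # 109482240
```
      </preformat>
      <p>The number of attention heads does not enter the count, since the heads split the 768-dimensional projections rather than adding new ones.</p>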
      <p>
        We fine-tune our networks using the Adam optimizer [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] with
learning rate $2 \times 10^{-5}$. The models are trained for 10 epochs with batch
size 32 and categorical cross-entropy as the loss function.
During training, we use dropout after the backbone network with a 0.3
drop rate to prevent overfitting. Our models are evaluated against
a validation set, and we select the versions that achieve the best
performance in terms of accuracy as our final models.
      </p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Method</th><th>MCC</th></tr>
          </thead>
          <tbody>
            <tr><td>three-class BERT</td><td>0.42</td></tr>
            <tr><td>Proposed approach</td><td>0.81</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND ANALYSIS</title>
      <p>Initially, we trained a three-class model using the implementation
details presented in subsection 3.1 (Implementation details). From the annotated dataset, we
randomly selected 100 samples per class as a testing set and excluded
them from the training phase in all runs. This
model achieves an MCC of 0.42. In order to improve the performance,
we implemented the presented two-step classification approach,
which increases the MCC to 0.81, as presented in
Table 2.</p>
      <p>Our proposed approach achieved a score of 0.413 in terms of
MCC on the provided testing set of unseen tweets.</p>
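      <p>For reference, MCC for a multi-class problem can be computed directly from the true and predicted labels. The sketch below implements a common multi-class generalization of the coefficient; whether the official task scorer uses exactly this variant is an assumption, and the function name mcc and the toy labels are illustrative.</p>
      <preformat>
```python
import math
from collections import Counter


def mcc(y_true, y_pred):
    """Multi-class Matthews Correlation Coefficient; returns 0.0 when undefined."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)  # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correctly predicted samples
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    cov_tp = c * s - sum(pred_counts[k] * true_counts[k] for k in true_counts)
    cov_tt = s * s - sum(v * v for v in true_counts.values())
    cov_pp = s * s - sum(v * v for v in pred_counts.values())
    denom = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0


# Toy example with the three task classes encoded as 1, 2, 3.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 2, 3, 3]
print(mcc(y_true, y_pred))
```
      </preformat>
      <p>The coefficient is 1 for perfect agreement and 0 when the predictions carry no information about the true labels, for example when every tweet is assigned to the same class.</p>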
    </sec>
    <sec id="sec-6">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>The proposed method achieves an MCC of 0.81 on our held-out split and 0.413 on the official test set
of the FakeNews: Coronavirus and 5G conspiracy task. In future experiments,
we will use more deep learning models, both variants of BERT and other
architectures, to try to achieve better performance. To tackle the
limitation of insufficient training samples, we also intend to
experiment with data augmentation approaches in order to create more
samples of the minority classes and build more robust classifiers.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work is supported by the WeVerify project, which is funded
by the European Commission under contract number 825297.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Mohamed K</given-names>
            <surname>Elhadad</surname>
          </string-name>
          , Kin Fun Li, and
          <string-name>
            <given-names>Fayez</given-names>
            <surname>Gebali</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Detecting Misleading Information on COVID-19</article-title>
          .
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          ),
          <fpage>165201</fpage>
          -
          <lpage>165215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Tamanna</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robert L</given-names>
            <surname>Logan</surname>
            <suffix>IV</suffix>
          </string-name>
          ,
          <string-name>
            <given-names>Arjuna</given-names>
            <surname>Ugarte</surname>
          </string-name>
          , Yoshitomo Matsubara,
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Singh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sean</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Detecting covid-19 misinformation on social media</article-title>
          . (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Justin M</given-names>
            <surname>Johnson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Taghi M</given-names>
            <surname>Khoshgoftaar</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Survey on deep learning with class imbalance</article-title>
          .
          <source>Journal of Big Data</source>
          <volume>6</volume>
          ,
          <issue>1</issue>
          (
          <year>2019</year>
          ),
          <fpage>27</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Heejung</given-names>
            <surname>Jwa</surname>
          </string-name>
          , Dongsuk Oh, Kinam Park, Jang Mook Kang, and
          <string-name>
            <given-names>Heuiseok</given-names>
            <surname>Lim</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>exBAKE: Automatic Fake News Detection Model Based on Bidirectional Encoder Representations from Transformers (BERT)</article-title>
          .
          <source>Applied Sciences</source>
          <volume>9</volume>
          ,
          <issue>19</issue>
          (
          <year>2019</year>
          ),
          <fpage>4062</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Diederik P</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Chao</given-names>
            <surname>Liu</surname>
          </string-name>
          , Xinghua Wu, Min Yu,
          <string-name>
            <given-names>Gang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianguo</given-names>
            <surname>Jiang</surname>
          </string-name>
          , Weiqing Huang, and
          <string-name>
            <given-names>Xiang</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A Two-Stage Model Based on BERT for Short Fake News Detection</article-title>
          .
          <source>In International Conference on Knowledge Science, Engineering and Management</source>
          . Springer,
          <fpage>172</fpage>
          -
          <lpage>183</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Olga</given-names>
            <surname>Papadopoulou</surname>
          </string-name>
          , Markos Zampoglou, Symeon Papadopoulos, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A corpus of debunked and verified user-generated videos</article-title>
          .
          <source>Online Information Review</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Konstantin</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          , Daniel Thilo Schroeder, Luk Burchard, Johannes Moe, Stefan Brenner, Petra Filkukova, and
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Langguth</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020</article-title>
          . In MediaEval 2020 Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Juan Carlos</given-names>
            <surname>Medina Serrano</surname>
          </string-name>
          , Orestis Papakyriakopoulos, and
          <string-name>
            <given-names>Simon</given-names>
            <surname>Hegelich</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>NLP-based Feature Extraction for the Detection of COVID-19 Misinformation Videos on YouTube</article-title>
          .
          <source>In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL</source>
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Karishma</given-names>
            <surname>Sharma</surname>
          </string-name>
          , Sungyong Seo, Chuizheng Meng, Sirisha Rambhatla, Aastha Dua, and Yan Liu.
          <year>2020</year>
          .
          <article-title>Coronavirus on social media: Analyzing misinformation in Twitter conversations</article-title>
          . arXiv preprint arXiv:2003.12309 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Samia</given-names>
            <surname>Tasnim</surname>
          </string-name>
          , Md Mahbub Hossain, and
          <string-name>
            <given-names>Hoimonty</given-names>
            <surname>Mazumder</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Impact of rumors or misinformation on coronavirus disease (COVID-19) in social media</article-title>
          . (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Iulia</given-names>
            <surname>Turc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Well-read students learn better: On the importance of pre-training compact models</article-title>
          . arXiv preprint arXiv:1908.08962 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Tom</given-names>
            <surname>Warren</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>British 5G towers are being set on fire because of coronavirus conspiracy theories</article-title>
          . (Apr
          <year>2020</year>
          ). https://www.theverge.com/2020/4/4/21207927/5g-towers-burning-uk-coronavirus-conspiracy-theory-link
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and
          <string-name>
            <given-names>Alexander M.</given-names>
            <surname>Rush</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Transformers: State-of-the-Art Natural Language Processing</article-title>
          .
          <source>In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          . Association for Computational Linguistics, Online,
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . https://www.aclweb.org/anthology/2020.emnlp-demos.6
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>