<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fake News Detection in Social Media Using Graph Neural Networks and NLP Techniques: A COVID-19 Use-Case</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abdullah Hamid</string-name>
          <email>Abdullahhamid@uetpeshawar.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nasrullah Sheikh</string-name>
          <email>nasrullah.sheikh@ibm.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naina Said</string-name>
          <email>nainasaid@uetpeshawar.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kashif Ahmad</string-name>
          <email>kahmad@hbku.edu.qa</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asma Gul</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laiq Hasan</string-name>
          <email>laiqhasan@uetpeshawar.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ala Al-Fuqaha</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DCSE, University of Engineering and Technology</institution>
          ,
          <addr-line>Peshawar</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Statistics, Shaheed Benazir Bhutto Women University</institution>
          ,
          <addr-line>Peshawar</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University</institution>
          ,
          <addr-line>Qatar Foundation, Doha</addr-line>
          ,
          <country country="QA">Qatar</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>IBM Research - Almaden</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents our solutions for the MediaEval 2020 task "FakeNews: Corona Virus and 5G Conspiracy Multimedia Twitter-Data-Based Analysis". The task aims to analyze tweets related to COVID-19 and 5G conspiracy theories in order to detect misinformation spreaders. It is composed of two sub-tasks, namely (i) text-based and (ii) structure-based fake news detection. For the first sub-task, we propose six different solutions relying on Bag of Words (BoW) and BERT embeddings. Three of the methods address a binary classification task by differentiating 5G conspiracy tweets from the rest of the COVID-19 related tweets, while the other three treat the task as a ternary classification problem. In the ternary classification task, our BoW and BERT based methods obtained F1-scores of 0.606 and 0.566 on the development set, respectively. In the binary classification task, the BoW and BERT based solutions obtained average F1-scores of 0.666 and 0.693, respectively. For structure-based fake news detection, we rely on Graph Neural Networks (GNNs), achieving an average ROC of 0.95 on the development set.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In the modern world, social media plays a part in many activities. For
instance, social media outlets such as Twitter, Facebook, and Instagram have
proved very effective for news dissemination and information sharing [
        <xref ref-type="bibr" rid="ref1 ref6 ref7 ref9">1, 6, 7, 9</xref>
        ]. However, they also come with
several challenges, such as collecting information from several sources and
detecting and filtering misinformation [
        <xref ref-type="bibr" rid="ref11 ref4 ref5">4, 5, 11</xref>
        ]. Like other
events and pandemics, and as one of the deadliest pandemics in
history, COVID-19 has been a subject of discussion on social
media since its emergence. Unsurprisingly, a lot of
misinformation about the pandemic circulates on social networks.
In order to identify misinformation spreaders and filter fake news
about COVID-19 and the 5G conspiracy, a task named "FakeNews:
Corona Virus and 5G Conspiracy Multimedia Twitter-Data-Based
Analysis" has been proposed in the MediaEval 2020 benchmark
competition [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        This paper provides a detailed description of the methods
proposed by team DCSE_UETP for the fake news detection task. The
task consists of two parts, namely (i) text-based misinformation
detection (TMD), and (ii) structure-based misinformation
detection (SMD). The first task (TMD) is based on textual analysis of
COVID-19 related information shared on Twitter between January
2020 and 15 July 2020, and aims to detect different types of
conspiracy theories about COVID-19 and its vaccines, such as the claim that
"5G weakens the immune system and thus caused the current
corona-virus pandemic" [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In the SMD task, the participants
are provided with a set of graphs, each of which represents a sub-graph
of Twitter and corresponds to a single tweet; the vertices
of the graphs represent accounts. As in TMD, the
participants need to detect and differentiate between 5G and other
COVID-19 conspiracy theories.
      </p>
    </sec>
    <sec id="sec-2">
      <title>PROPOSED APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>Methodology for TMD Task</title>
      <p>
        For the text-based analysis, we employed two different methods:
(i) a Bag of Words (BoW) based solution, and (ii) a BERT model-based
solution [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Before describing the proposed methods, we
note that the dataset provided for the text-based analysis
is not balanced: one of the classes, namely non-conspiracy,
contains a very high number of samples, while the others
contain relatively few. In total, the majority class
contains 4412 samples, while the other two classes, namely 5G conspiracies and
other conspiracies, contain 1263 and 785 samples,
respectively. In order to balance the dataset, we rely on an ensemble of
different re-sampled datasets, where n models are trained by
dividing the class with the higher number of samples into n different
parts, as illustrated in Figure 1. After training the n models, their results
are combined using two different late fusion methods:
majority voting and summation of the
posterior probabilities. In majority voting, since we have four models,
a tie is possible; in that case we use the accumulated probabilities/scores
to assign a label to the test sample.
      </p>
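      <p>The re-sampling and late fusion scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the submitted system: the helper names (split_majority, build_training_sets, sum_fusion, majority_vote) are hypothetical, the interleaved split is only one possible way of dividing the majority class, and the posterior probabilities in the usage example are made-up numbers.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def split_majority(majority, n):
    """Split the majority-class samples into n nearly equal, disjoint parts."""
    return [majority[i::n] for i in range(n)]


def build_training_sets(minority_sets, majority, n):
    """Combine all minority-class samples with one majority part per model."""
    minority = [s for subset in minority_sets for s in subset]
    return [minority + part for part in split_majority(majority, n)]


def summed_scores(probabilities):
    """Sum the per-class posterior probabilities over all models."""
    n_classes = len(probabilities[0])
    return [sum(p[c] for p in probabilities) for c in range(n_classes)]


def sum_fusion(probabilities):
    """Late fusion by summation: pick the class with the largest summed score."""
    totals = summed_scores(probabilities)
    return max(range(len(totals)), key=totals.__getitem__)


def majority_vote(probabilities):
    """Late fusion by majority voting; ties are broken by the summed scores."""
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in probabilities)
    top = max(votes.values())
    tied = [label for label, count in votes.items() if count == top]
    if len(tied) == 1:
        return tied[0]
    totals = summed_scores(probabilities)
    return max(tied, key=totals.__getitem__)


# Hypothetical posteriors of four models for one tweet over three classes.
example_probs = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
]
print(majority_vote(example_probs), sum_fusion(example_probs))
```
]]></preformat>
      <p>In the usage example, two models vote for class 0 and two for class 1, so the vote is tied and the summed scores decide the label.</p>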
      <p>Before applying BoW and BERT, the text is cleaned by
removing punctuation marks (such as commas and full stops), emojis,
URLs, and stop words. Once the text is pre-processed, we proceed
with tokenization and the creation of the BoW vocabulary, which is
followed by the generation of a feature vector for each sentence. A
Naive Bayes classifier is then trained on the extracted features.
In the second method, a logistic regression model is trained on word
embeddings generated via BERT.</p>
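      <p>The BoW branch of this pipeline can be illustrated with a small, self-contained sketch. It is an assumption-laden illustration rather than the code used for the runs: the stop-word list is a tiny placeholder, the regular expressions are one possible cleaning rule, the classifier is a plain multinomial Naive Bayes with add-one smoothing, and the tweets and labels ("5g", "other") are invented examples. The BERT branch is not sketched here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def clean(text):
    """Lower-case, drop URLs, non-ASCII symbols (emojis), punctuation, stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(documents):
    """Map every distinct token to a column index."""
    tokens = sorted({t for doc in documents for t in doc})
    return {token: i for i, token in enumerate(tokens)}


def bow_vector(doc, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(doc).items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


def train_naive_bayes(vectors, labels):
    """Multinomial Naive Bayes with add-one smoothing, stored as log-probabilities."""
    model = {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        counts = [sum(col) + 1 for col in zip(*rows)]
        total = sum(counts)
        model[c] = (math.log(len(rows) / len(labels)),
                    [math.log(n / total) for n in counts])
    return model


def predict(model, vector):
    """Return the class with the highest posterior log-probability."""
    def score(c):
        prior, likelihood = model[c]
        return prior + sum(x * w for x, w in zip(vector, likelihood))
    return max(model, key=score)


# Invented example tweets and labels, for illustration only.
tweets = [
    "5G towers spread the virus! https://example.com \U0001F637",
    "Wash your hands to stop the virus.",
    "5G radiation weakens immunity",
    "Vaccines and masks reduce the spread",
]
labels = ["5g", "other", "5g", "other"]
docs = [clean(t) for t in tweets]
vocab = build_vocabulary(docs)
nb_model = train_naive_bayes([bow_vector(d, vocab) for d in docs], labels)
print(predict(nb_model, bow_vector(clean("5G causes the virus"), vocab)))
```
]]></preformat>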
      <p>[Figure 1: block diagram with the labels "Training Samples" (C1, C2, and the majority-class parts C_3_1, C_3_2, ..., C_3_n), "Models", "Late Fusion", and "Predicted_Label".]</p>
    </sec>
    <sec id="sec-4">
      <title>Methodology for SMD Task</title>
      <p>
        Graph representation learning using Graph Neural Networks (GNNs)
has been shown to be effective in various domains, such as social
networks, biological networks, and financial networks. GNNs
aggregate the neighborhood representation within k hops and then
apply a pooling operation, such as SUM, MEAN, or MAX, to obtain the final
representation of a node. Furthermore, GNNs can be used to learn
the representation of simple graph structures [
        <xref ref-type="bibr" rid="ref10 ref12 ref2">2, 10, 12</xref>
        ], which
can then be used to classify the graphs. For graph classification,
these methods learn the representation of nodes, followed by a graph
READOUT step, which aggregates the node features obtained
after the final iteration of the GNN.
      </p>
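      <p>The neighborhood aggregation and READOUT steps can be sketched in a few lines. This is a deliberately simplified illustration: it omits the learned transformations (for example MLP layers) that a real GNN applies at each iteration, and the function names (aggregate, readout, embed_graph) and the toy graph are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def aggregate(features, adjacency, pool):
    """One iteration: pool each node's own features with its neighbours' features."""
    updated = {}
    for node, neighbours in adjacency.items():
        stacked = [features[node]] + [features[n] for n in neighbours]
        updated[node] = [pool(column) for column in zip(*stacked)]
    return updated


def readout(features, pool):
    """Graph-level READOUT: pool the final node features into a single vector."""
    return [pool(column) for column in zip(*features.values())]


def embed_graph(features, adjacency, k, neighbour_pool=max, graph_pool=mean):
    """Run k aggregation iterations (k hops), then apply READOUT."""
    for _ in range(k):
        features = aggregate(features, adjacency, neighbour_pool)
    return readout(features, graph_pool)


# Toy path graph a - b - c with two-dimensional node features (invented values).
toy_adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
toy_features = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [3.0, 1.0]}
print(embed_graph(toy_features, toy_adjacency, k=2))
```
]]></preformat>
      <p>With MAX neighbor pooling, two iterations are enough for every node of this three-node path to see the largest value of each feature, so the MEAN readout returns those maxima.</p>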
      <p>
        We model this problem as a graph classification task. Following
Xu et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we train our model using three classes of
graphs (5G conspiracy, non-conspiracy, and other-conspiracy) and learn
the representation of the graphs.
      </p>
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND ANALYSIS</title>
    </sec>
    <sec id="sec-6">
      <title>Evaluation Metric</title>
      <p>For the evaluation of the proposed methods, we used two different
metrics, namely (i) the micro F1-score, and (ii) the area under the
receiver operating characteristic curve (AUC ROC). AUC ROC
is the official evaluation metric of the task, and all the test results
are reported in terms of AUC ROC. The F1-score, on the other hand, is
used for the evaluation of the methods on the development set.</p>
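      <p>For reference, both metrics can be computed from scratch as in the sketch below. It is an illustration only: the function names are hypothetical, the AUC ROC is shown for the binary case using its pairwise-ranking interpretation, and how the task organizers average the score over several classes is not specified here. Both functions assume non-empty inputs, and auc_roc assumes at least one positive and one negative sample.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (t == c and p == c)
            fp += (t != c and p == c)
            fn += (t == c and p != c)
    return 2 * tp / (2 * tp + fp + fn)


def auc_roc(y_true, scores):
    """Binary AUC ROC: probability that a positive is scored above a negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    # Each (positive, negative) pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Invented labels and scores, for illustration only.
print(micro_f1([0, 1, 2, 2, 1], [0, 2, 2, 2, 1]))
print(auc_roc([1, 0, 1, 0], [0.9, 0.4, 0.6, 0.8]))
```
]]></preformat>
      <p>For single-label multi-class predictions, the micro F1-score computed this way coincides with plain accuracy, because every misclassified sample contributes exactly one false positive and one false negative.</p>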
    </sec>
    <sec id="sec-7">
      <title>Runs Description in TMD Task</title>
      <p>For TMD, we submitted six different runs, relying mainly on two
approaches, namely BERT and BoW, under two late fusion schemes.
Three of the runs are based on binary classification, while the other three
treat the task as a ternary classification problem. Note
that the fusion schemes are used to combine the scores/outputs of
the four individual models trained as a result of the data balancing
method described earlier.</p>
      <p>The first three runs are based on the ternary classification task,
where Run 1 and Run 2 are based on BoW with majority voting and with
summation of the classification scores of the individual models, respectively. The
third and final ternary run is based on BERT features, where a
logistic regression model is trained on word embeddings generated
by BERT. As can be seen in Table 1(a), overall, better results are
obtained with the BoW approach under the majority voting scheme.</p>
      <p>The last three runs are based on the binary classification task,
where the first two (i.e., Run 4 and Run 5) are based on BoW with the
majority voting and score summation fusion methods, respectively,
while the final one (i.e., Run 6) is based on BERT with the
score summation fusion scheme. As expected, the performance of all the
methods is significantly higher on the binary classification task
than on the ternary classification task.</p>
      <p>A similar trend was also observed on the test set, where overall
better results are obtained with BoW under the majority voting scheme.</p>
      <p>For the SMD task, we divide the dataset into train/validation/test
splits (80/10/10) and use grid search to obtain the best
hyperparameters. The model has four MLP layers and uses MAX and MEAN
operations for neighbor pooling and graph pooling, respectively.
The model is trained for 1000 epochs with a learning rate of 0.01, and
a dropout of 0.3 is applied to the final layer output. The final embedding
size is 128. We evaluate our model on AUC-ROC, and the result on
the test set is given in Table 1(b). The results show that the model
has the discriminative power to classify the graph structures.
Furthermore, they suggest that information diffuses in a pattern that
depends on the type of information being spread.</p>
    </sec>
    <sec id="sec-8">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>The challenge is composed of two tasks: one aims to analyze and
detect COVID-19 related fake news using the text of tweets, while the
other aims to analyze network structure for the possible detection
of fake news. For the first task, we mainly relied on two
state-of-the-art methods, namely BoW and BERT embeddings, under different
fusion schemes. Overall, better results are obtained with BoW under
the majority voting scheme. For the SMD task, we rely on GNNs
to differentiate among different conspiracy theories about
COVID-19. In the current implementation, textual and structural
information are used independently. In the future, we aim to enrich
the structural information with the textual information for better
detection of fake news.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Kashif</given-names>
            <surname>Ahmad</surname>
          </string-name>
          , Konstantin Pogorelov, Michael Riegler, Nicola Conci, and
          <string-name>
            <given-names>Pal</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Social media and satellites: Disaster event detection, linking and summarization</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          <volume>78</volume>
          ,
          <issue>3</issue>
          (
          <year>2019</year>
          ),
          <fpage>2837</fpage>
          -
          <lpage>2875</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Cătălina</given-names>
            <surname>Cangea</surname>
          </string-name>
          , Petar Veličković, Nikola Jovanović, Thomas Kipf, and
          <string-name>
            <given-names>Pietro</given-names>
            <surname>Liò</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Towards Sparse Hierarchical Graph Classifiers</article-title>
          . (
          <year>2018</year>
          ). arXiv:stat.ML/1811.01287
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Siva Charan Reddy</given-names>
            <surname>Gangireddy</surname>
          </string-name>
          , Cheng Long, and
          <string-name>
            <given-names>Tanmoy</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Unsupervised Fake News Detection: A Graph-based Approach</article-title>
          .
          <source>In Proceedings of the 31st ACM Conference on Hypertext and Social Media</source>
          .
          <fpage>75</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Yi</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shanika</given-names>
            <surname>Karunasekera</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Leckie</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Graph Neural Networks with Continual Learning for Fake News Detection from Social Media</article-title>
          . arXiv preprint arXiv:2007.03316 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Muhammad</given-names>
            <surname>Imran</surname>
          </string-name>
          , Prasenjit Mitra, and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Castillo</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Twitter as a lifeline: Human-annotated twitter corpora for NLP of crisis-related messages</article-title>
          .
          <source>arXiv preprint arXiv:1605.05894</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Chuang</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiu-Xiu</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zi-Ke</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gui-Quan</given-names>
            <surname>Sun</surname>
          </string-name>
          , and Pak Ming Hui.
          <year>2015</year>
          .
          <article-title>How events determine spreading patterns: information transmission via internal and external influences on social networks</article-title>
          .
          <source>New Journal of Physics</source>
          <volume>17</volume>
          ,
          <issue>11</issue>
          (
          <year>2015</year>
          ),
          <fpage>113045</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Konstantin</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          , Daniel Thilo Schroeder, Luk Burchard, Johannes Moe, Stefan Brenner, Petra Filkukova, and
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Langguth</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020</article-title>
          . In MediaEval 2020 Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Naina</given-names>
            <surname>Said</surname>
          </string-name>
          , Kashif Ahmad, Michael Riegler, Konstantin Pogorelov, Laiq Hassan, Nasir Ahmad, and
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Conci</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Natural disasters detection in social media and satellite imagery: a survey</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          <volume>78</volume>
          ,
          <issue>22</issue>
          (
          <year>2019</year>
          ),
          <fpage>31267</fpage>
          -
          <lpage>31302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Keyulu</given-names>
            <surname>Xu</surname>
          </string-name>
          , Weihua Hu, Jure Leskovec, and
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Jegelka</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>How Powerful are Graph Neural Networks?</article-title>
          CoRR abs/1810.00826 (
          <year>2018</year>
          ). arXiv:1810.00826 http://arxiv.org/abs/1810.00826
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Shuo</given-names>
            <surname>Yang</surname>
          </string-name>
          , Kai Shu, Suhang Wang, Renjie Gu,
          <string-name>
            <given-names>Fan</given-names>
            <surname>Wu</surname>
          </string-name>
          , and Huan Liu.
          <year>2019</year>
          .
          <article-title>Unsupervised fake news detection on social media: A generative approach</article-title>
          .
          <source>In Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , Vol.
          <volume>33</volume>
          .
          <fpage>5644</fpage>
          -
          <lpage>5651</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Rex</given-names>
            <surname>Ying</surname>
          </string-name>
          , Jiaxuan You, Christopher Morris, Xiang Ren,
          <string-name>
            <given-names>William L.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jure</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Hierarchical Graph Representation Learning with Differentiable Pooling</article-title>
          . CoRR abs/1806.08804 (
          <year>2018</year>
          ). arXiv:1806.08804 http://arxiv.org/abs/1806.08804
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>