<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>FakeNews Detection Using Pre-trained Language Models and Graph Convolutional Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nguyen Manh Duc Tuan</string-name>
          <email>ductuan024@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pham Quang Nhat Minh</string-name>
          <email>minhpham@aimesoft.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aimesoft JSC.</institution>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Toyo University</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>We introduce methods for detecting FakeNews related to the coronavirus and 5G conspiracy theories, based on textual data and graph data. For the Text-Based Fake News Detection subtask, we propose a neural network that combines textual features encoded by a pre-trained BERT model with tweet metadata encoded by a multi-layer perceptron. For the Structure-Based Fake News Detection subtask, we apply Graph Convolutional Networks (GCN) and propose a set of features for each node of the GCN. Experimental results show that textual data contains more useful information for detecting FakeNews than graph data, and that using tweet metadata improves the results of the text-based model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In this paper, we present our methods for two subtasks of the
FakeNews Detection Task at MediaEval 2020 [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. We formalize the
FakeNews detection task as a classification problem. In the text-based
subtask, we applied the BERT model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which is a state-of-the-art
model for many NLP tasks, including text classification. We used
CovidTwitter-BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (CT-BERT), which was trained on a corpus of
160M tweets about the coronavirus. Because the data used to train CT-BERT
comes from the same domain as the data provided for the
FakeNews detection task, we expected CT-BERT to give better results
than general BERT models trained
on open-domain data. We combined metadata-based features with
textual features obtained by CT-BERT and fine-tuned CT-BERT on
our task-specific data. Experimental results show that combining
metadata with textual features is better than using textual features
only. In the structure-based subtask, we adopted Graph
Convolutional Networks (GCN) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to capture the relations of nodes in
retweet graphs.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        One of the approaches to fake news detection is using the content of
the news. Content-based features are extracted from textual aspects
and visual aspects. Textual information can be extracted by layers
of CNN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. From textual information, we can observe features that
are specific to fake news, such as writing style or emotions [
        <xref ref-type="bibr" rid="ref13 ref16 ref3">3, 13,
16</xref>
        ]. Furthermore, both textual and visual information can be used
together to detect fake news [
        <xref ref-type="bibr" rid="ref16 ref17 ref5">5, 16, 17</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>In this section, we describe our methods for the two subtasks: text-based
misinformation detection and structure-based misinformation
detection.</p>
    </sec>
    <sec id="sec-4">
      <title>Text-Based Misinformation Detection</title>
      <p>Since tweet data is very noisy, we performed the following pre-processing
steps before feeding the data into the CT-BERT model.</p>
      <p>We deleted mentions and emojis with tweet-preprocessor,
a pre-processing library for tweet data.</p>
      <p>We converted all words to lowercase.</p>
      <p>Some emoticons are written in text form, such as “:)” and
“:(”. We replaced those emoticons with the sentiment words
“happy” or “sad”.</p>
      <p>We deleted punctuation characters that are not useful such
as “;”, “:”, “-”, “=”.</p>
      <p>
        We performed tokenization, word normalization, and word
segmentation with ekphrasis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a text analysis tool for social
media.
      </p>
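      <p>The cleaning steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the pipeline used in the paper: it does not call tweet-preprocessor or ekphrasis, it does not remove Unicode emojis, and the emoticon table and the function name preprocess_tweet are assumptions made for the example.</p>
      <preformat>
```python
import re

# Hypothetical emoticon table; the text names only ":)" and ":(" explicitly.
EMOTICON_WORDS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}
MENTION_PATTERN = re.compile(r"@\w+")
# Punctuation listed in the text as not useful: ";", ":", "-", "=".
NOISE_PATTERN = re.compile(r"[;:\-=]")


def preprocess_tweet(text: str) -> str:
    """Apply the cleaning steps in order and normalise whitespace."""
    text = MENTION_PATTERN.sub(" ", text)           # drop @mentions
    text = text.lower()                              # lowercase all words
    for emoticon, word in EMOTICON_WORDS.items():    # emoticon to sentiment word
        text = text.replace(emoticon, f" {word} ")
    text = NOISE_PATTERN.sub(" ", text)              # drop noisy punctuation
    return " ".join(text.split())


print(preprocess_tweet("@WHO 5G is SAFE :) ; really = no-doubt"))
```
      </preformat>
      <p>Emoticons are replaced before punctuation is stripped, because removing “:” first would destroy them.</p>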
      <p>
        The FakeNews detection data is imbalanced: the number
of tweets labeled as conspiracy is much smaller than the number
of tweets labeled as non-conspiracy. Therefore, we balanced the
dataset with the Easy Data Augmentation (EDA) method [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
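      <p>EDA consists of four word-level operations: synonym replacement, random insertion, random swap, and random deletion. The sketch below illustrates only the two operations that need no thesaurus (random swap and random deletion). The function names, the deletion probability of 0.1, and the alternating schedule are assumptions for the example, not settings reported in the paper.</p>
      <preformat>
```python
import random


def random_swap(words, n_swaps, rng):
    """Swap the words at two random positions, n_swaps times."""
    words = list(words)
    if len(words) >= 2:
        for _ in range(n_swaps):
            i, j = rng.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, but keep at least one word."""
    words = list(words)
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def augment(sentence, n_copies, seed=0):
    """Return n_copies perturbed variants of a minority-class sentence."""
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for k in range(n_copies):
        if k % 2 == 0:
            variants.append(" ".join(random_swap(words, 1, rng)))
        else:
            variants.append(" ".join(random_deletion(words, 0.1, rng)))
    return variants


for variant in augment("5g towers spread the virus", 4):
    print(variant)
```
      </preformat>
      <p>Generating several variants per conspiracy tweet increases the size of the minority class until it is comparable to the majority class.</p>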
      <p>The pre-processed and augmented data was then fed into neural
networks. We conducted experiments with the following two
models.</p>
      <p>In the first model, we simply passed the tweet text into CT-BERT
and used the hidden vector at the [CLS] token as the representation
of the tweet. The hidden state at [CLS] is then fed into a sigmoid
layer for 2-class classification or into a softmax layer for 3-class
classification.</p>
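      <p>The two output heads can be written out in a few lines. The sketch below is illustrative only: the [CLS] vector is a toy three-dimensional list, and the weights and biases are made-up values rather than trained parameters.</p>
      <preformat>
```python
import math


def linear(vector, weights, bias):
    """One output unit: dot product of the [CLS] vector with weights, plus bias."""
    return sum(x * w for x, w in zip(vector, weights)) + bias


def sigmoid(logit):
    """Probability of the positive class for the 2-class head."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)          # avoids overflow for very negative logits
    return e / (1.0 + e)


def softmax(logits):
    """Class probabilities for the 3-class head, shifted by the max for stability."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


cls_vector = [0.2, -0.1, 0.4]                      # toy stand-in for the [CLS] state
p_conspiracy = sigmoid(linear(cls_vector, [1.0, 0.5, -0.3], 0.1))
three_class = softmax([linear(cls_vector, w, 0.0)
                       for w in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])])
print(p_conspiracy, three_class)
```
      </preformat>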
      <p>
        In the second model, we combined text-based features with
metadata-based features in the neural network shown in Figure 1. First, we
obtained the embedding vector of a tweet text using CT-BERT. After that,
we applied a 1D-CNN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] with different filter sizes. By doing that, we
can use more information from various sources for prediction. We
passed the metadata-based features into a fully-connected layer with
batch normalization. Finally, we concatenated the metadata features
with all outputs from the 1D-CNN and passed them into a sigmoid layer
for 2-class classification or a softmax layer for 3-class
classification. In addition to the provided metadata, we extracted other features,
including the number of retweets, favorites, characters, words,
question marks, hashtags, mentions, and URLs in the tweet, the posting
time of the tweet, and a binary feature indicating whether or not the
tweet is sensitive. From users’ profiles, we extracted the number
of friends, followers, groups, favorites, and statuses that users have
posted. We also used the account creation time, whether or not the users’
profiles have been edited, and whether or not the accounts are verified.
In total, we extracted 22 features, including the metadata features.
      </p>
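      <p>Some of the features listed above can be computed directly from the tweet text. The following sketch covers only those text-derived counts. The regular expressions and the feature names are assumptions for illustration, and the remaining features (retweets, favorites, profile statistics, and so on) would come from the tweet and user metadata, which are not modelled here.</p>
      <preformat>
```python
import re

# Simple patterns chosen for the example; real tweets may need stricter rules.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")
MENTION_PATTERN = re.compile(r"@\w+")


def text_count_features(tweet_text):
    """Count surface properties of the raw (not yet cleaned) tweet text."""
    return {
        "n_characters": len(tweet_text),
        "n_words": len(tweet_text.split()),
        "n_question_marks": tweet_text.count("?"),
        "n_hashtags": len(HASHTAG_PATTERN.findall(tweet_text)),
        "n_mentions": len(MENTION_PATTERN.findall(tweet_text)),
        "n_urls": len(URL_PATTERN.findall(tweet_text)),
    }


print(text_count_features("Is #5G safe? @WHO says yes https://example.org"))
```
      </preformat>
      <p>These counts must be taken before pre-processing, since the cleaning steps remove mentions and punctuation.</p>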
      <p>
        In our experiments, we used the BERT implementation in the
HuggingFace Transformers library [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Structure-Based Misinformation Detection</title>
      <p>
        We applied Graph Convolutional Network [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] (GCN) to the
structure-based subtask. The model applies a standard GCN to a first-order
proximity matrix and a second-order proximity matrix. The first-order
proximity matrix is created by adding edges to the original adjacency
matrix in order to convert the directed graph into an undirected graph. The
second-order proximity matrix also represents an undirected graph and is
created by taking into account the shared neighbors of each pair of nodes.
      </p>
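      <p>One plausible reading of these two constructions is sketched below in plain Python. The first-order matrix simply symmetrizes the adjacency matrix. For the second-order matrix, we assume that two nodes are linked when they share a predecessor (direction "in") or a successor (direction "out"), with each shared neighbor contributing a weight normalized by its degree, loosely following [12]. The exact weighting used in the submitted system is not stated in the text, so the normalization and the function names are assumptions.</p>
      <preformat>
```python
def first_order_proximity(adj):
    """Symmetrize a directed 0/1 adjacency matrix (add the reverse of every edge)."""
    n = len(adj)
    return [[1 if (adj[i][j] or adj[j][i]) else 0 for j in range(n)]
            for i in range(n)]


def second_order_proximity(adj, direction):
    """Link nodes i and j that share a neighbor k; the result is symmetric."""
    n = len(adj)
    if direction == "in":
        links = adj                                  # links[k][i]: k points to i
    else:
        links = [list(col) for col in zip(*adj)]     # links[k][i]: i points to k
    prox = [[0.0] * n for _ in range(n)]
    for k in range(n):
        shared = [i for i in range(n) if links[k][i]]
        for i in shared:
            for j in shared:
                if i != j:
                    prox[i][j] += 1.0 / len(shared)  # normalize by the degree of k
    return prox


# Toy retweet graph with edges 0 to 1, 0 to 2, and 1 to 2.
adj = [[0, 1, 1],
       [0, 0, 1],
       [0, 0, 0]]
print(first_order_proximity(adj))
print(second_order_proximity(adj, "in"))
print(second_order_proximity(adj, "out"))
```
      </preformat>
      <p>Under this reading, one first-order matrix and two second-order matrices are produced, which may correspond to the three graphs passed to the GCN, although the text does not say so explicitly.</p>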
      <p>We passed the three created graphs into two GCN layers with a
filter size of 64. After that, we concatenated the three output graphs
horizontally and then used global max pooling to obtain the embedding
vector of the entire graph. Finally, we passed it into a fully-connected
layer of 512 nodes with dropout and then added a sigmoid layer for
2-class classification or a softmax layer for 3-class classification.</p>
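      <p>The data flow of this step (graph convolution, horizontal concatenation, global max pooling) can be illustrated with a small pure-Python sketch. It assumes the widely used symmetric normalization of Kipf and Welling with added self-loops and a ReLU activation. The text does not state which normalization or activation was used, and the tiny matrices and weights below are toy values.</p>
      <preformat>
```python
import math


def normalize_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a square, non-negative matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(norm_adj, features, weights):
    """One graph convolution: ReLU(norm_adj @ features @ weights)."""
    hidden = matmul(matmul(norm_adj, features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


def concat_horizontally(*node_matrices):
    """Join the per-node rows of several output matrices side by side."""
    return [sum((list(row) for row in rows), []) for rows in zip(*node_matrices)]


def global_max_pool(node_embeddings):
    """Take the maximum over nodes for every feature column."""
    return [max(column) for column in zip(*node_embeddings)]


norm = normalize_adjacency([[0, 1], [1, 0]])
h = gcn_layer(norm, [[1.0], [3.0]], [[1.0, -1.0]])
graph_vector = global_max_pool(concat_horizontally(h, h))
print(graph_vector)
```
      </preformat>
      <p>Global max pooling makes the graph vector independent of the number of nodes, so retweet graphs of different sizes can share the same fully-connected layer.</p>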
      <p>For each node of the input graph, we used the networkx
library (https://networkx.org) to create nine features: PageRank, in-degree, out-degree, hub
score, authority score, betweenness centrality, closeness centrality, number of
triangles, and eigenvector centrality. For the first run, we used only the nine
extracted features as node features. For the second run, we added the
provided metadata features to the node features.</p>
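      <p>A possible way to compute the nine node features with networkx is sketched below. It assumes that networkx and its SciPy dependency are installed. Two choices are assumptions of the example rather than details given in the text: triangles are counted on the undirected version of the graph, and eigenvector centrality falls back to zeros if it cannot be computed.</p>
      <preformat>
```python
import networkx as nx


def node_features(graph):
    """Return {node: [nine structural features]} for a directed graph."""
    pagerank = nx.pagerank(graph)
    hubs, authorities = nx.hits(graph)
    betweenness = nx.betweenness_centrality(graph)
    closeness = nx.closeness_centrality(graph)
    # nx.triangles is only defined for undirected graphs.
    triangles = nx.triangles(graph.to_undirected())
    try:
        eigenvector = nx.eigenvector_centrality(graph, max_iter=1000)
    except nx.NetworkXException:
        # Assumption: use zeros when the centrality cannot be computed.
        eigenvector = {node: 0.0 for node in graph}
    return {
        node: [
            pagerank[node],
            graph.in_degree(node),
            graph.out_degree(node),
            hubs[node],
            authorities[node],
            betweenness[node],
            closeness[node],
            triangles[node],
            eigenvector[node],
        ]
        for node in graph
    }


toy_graph = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)])
for node, feats in node_features(toy_graph).items():
    print(node, feats)
```
      </preformat>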
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND ANALYSIS</title>
    </sec>
    <sec id="sec-7">
      <title>Text-Based Misinformation Detection</title>
      <p>We submitted two runs for each of the two-class and
three-class classifiers.</p>
      <p>Table 1 shows the results of our submitted runs. For the first run,
with tweets only, we obtained a Matthews correlation
coefficient (MCC) of 0.361 for 2-class classification and 0.412 for 3-class
classification. In the second run, using tweets and other features, we
obtained an MCC of 0.396 and 0.419 for 2-class and 3-class
classification, respectively.</p>
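      <p>For reference, the binary form of the MCC can be computed from the four confusion-matrix counts as sketched below. The function name is hypothetical, and returning 0 when the denominator vanishes is a common convention rather than something specified by the task. The 3-class scores rely on the multiclass generalization of the MCC, which is not shown here.</p>
      <preformat>
```python
import math


def matthews_corrcoef_binary(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0   # convention when a row or column of the confusion matrix is empty
    return numerator / denominator


print(matthews_corrcoef_binary(tp=3, tn=3, fp=2, fn=2))
```
      </preformat>
      <p>The MCC ranges from -1 to 1, and a value of 0 corresponds to chance-level prediction, which makes it better suited to imbalanced data than plain accuracy.</p>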
    </sec>
    <sec id="sec-8">
      <title>Structure-Based Misinformation Detection</title>
      <p>We submitted two runs in the structure-based subtask.</p>
      <p>Run-1: We used 9 extracted features as node features in
graphs.</p>
      <p>Run-2: We included metadata-based features along with 9
extracted features as node features.</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>We have presented our methods for the two subtasks
of the MediaEval 2020 FakeNews Detection Task. In the text-based
subtask, we have shown that the model using metadata-based features and
other proposed features outperformed the model using only text
features. The MCC scores of our proposed models are still low,
especially in the structure-based subtask. In future work, we plan
to use external resources to compare different information sources
and calculate the probability that a piece of information is false. We
believe that this is a natural way to detect misinformation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Christos</given-names>
            <surname>Baziotis</surname>
          </string-name>
          , Nikos Pelekis, and
          <string-name>
            <given-names>Christos</given-names>
            <surname>Doulkeridis</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis</article-title>
          .
          <source>In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Vancouver, Canada,
          <fpage>747</fpage>
          -
          <lpage>754</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers).
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Minneapolis, Minnesota,
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . https://doi.org/10.18653/v1/N19-1423
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Souvick</given-names>
            <surname>Ghosh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chirag</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Towards automatic fake news classification</article-title>
          .
          <source>Proceedings of the Association for Information Science and Technology 55</source>
          ,
          <issue>1</issue>
          (
          <year>2018</year>
          ),
          <fpage>805</fpage>
          -
          <lpage>807</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Rohit Kumar</given-names>
            <surname>Kaliyar</surname>
          </string-name>
          , Anurag Goswami, Pratik Narang, and
          <string-name>
            <given-names>Soumendu</given-names>
            <surname>Sinha</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>FNDNet-A deep convolutional neural network for fake news detection</article-title>
          .
          <source>Cognitive Systems Research</source>
          <volume>61</volume>
          (
          <year>2020</year>
          ),
          <fpage>32</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Dhruv</given-names>
            <surname>Khattar</surname>
          </string-name>
          , Jaipal Singh Goud,
          <string-name>
            <given-names>Manish</given-names>
            <surname>Gupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vasudeva</given-names>
            <surname>Varma</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>MVAE: Multimodal Variational Autoencoder for Fake News Detection</article-title>
          .
          <source>In The World Wide Web Conference (WWW '19)</source>
          .
          <publisher-name>Association for Computing Machinery</publisher-name>
          , New York, NY, USA,
          <fpage>2915</fpage>
          -
          <lpage>2921</lpage>
          . https://doi.org/10.1145/3308558.3313552
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Convolutional Neural Networks for Sentence Classification</article-title>
          . (
          <year>2014</year>
          ).
          <source>arXiv:cs.CL/1408.5882</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Identifying Tweets with Fake News</article-title>
          .
          <source>In 2018 IEEE International Conference on Information Reuse and Integration (IRI)</source>
          .
          <fpage>460</fpage>
          -
          <lpage>464</lpage>
          . https://doi.org/10.1109/IRI.2018.00073
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Müller</surname>
          </string-name>
          , Marcel Salathé, and
          <string-name>
            <given-names>Per E</given-names>
            <surname>Kummervold</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter</article-title>
          . (
          <year>2020</year>
          ).
          <source>arXiv:cs.CL/2005.07503</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Konstantin</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          , Daniel Thilo Schroeder, Luk Burchard, Johannes Moe, Stefan Brenner, Petra Filkukova, and
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Langguth</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020</article-title>
          . In MediaEval 2020 Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Daniel Thilo</given-names>
            <surname>Schroeder</surname>
          </string-name>
          , Konstantin Pogorelov, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Langguth</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>FACT: a Framework for Analysis and Capture of Twitter Graphs</article-title>
          .
          <source>2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS)</source>
          (
          <year>2019</year>
          ),
          <fpage>134</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Kai</given-names>
            <surname>Shu</surname>
          </string-name>
          , Xinyi Zhou, Suhang Wang,
          <string-name>
            <given-names>Reza</given-names>
            <surname>Zafarani</surname>
          </string-name>
          , and Huan Liu.
          <year>2019</year>
          .
          <article-title>The Role of User Profile for Fake News Detection</article-title>
          . (
          <year>2019</year>
          ).
          <source>arXiv:cs.SI/1904.13355</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Zekun</given-names>
            <surname>Tong</surname>
          </string-name>
          , Yuxuan Liang, Changsheng Sun,
          <string-name>
            <given-names>David S.</given-names>
            <surname>Rosenblum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Lim</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Directed Graph Convolutional Network</article-title>
          . (
          <year>2020</year>
          ).
          <source>arXiv:cs.LG/2004.13970</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Yaqing</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Fenglong</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          , Ye Yuan, G. Xun, Kishlay Jha, Lu Su, and
          <string-name>
            <given-names>Jing</given-names>
            <surname>Gao</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection</article-title>
          .
          <source>Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Wei</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kai</given-names>
            <surname>Zou</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Hong Kong, China,
          <fpage>6383</fpage>
          -
          <lpage>6389</lpage>
          . https://www.aclweb.org/anthology/D19-1670
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and others.
          <year>2019</year>
          .
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing</article-title>
          .
          <source>ArXiv</source>
          (
          <year>2019</year>
          ), arXiv-
          <fpage>1910</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Yang</surname>
            <given-names>Yang</given-names>
          </string-name>
          , Lei Zheng, Jiawei Zhang, Qingcai Cui,
          <string-name>
            <given-names>Zhoujun</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Philip S.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>TI-CNN: Convolutional Neural Networks for Fake News Detection</article-title>
          . (
          <year>2018</year>
          ).
          <source>arXiv:cs.CL/1806.00749</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Xinyi</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jindi</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Reza</given-names>
            <surname>Zafarani</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>SAFE: Similarity-Aware Multi-Modal Fake News Detection</article-title>
          . (
          <year>2020</year>
          ).
          <source>arXiv:cs.CL/2003.04981</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Xinyi</given-names>
            <surname>Zhou</surname>
          </string-name>
          and
          <string-name>
            <given-names>Reza</given-names>
            <surname>Zafarani</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Network-based Fake News Detection: A Pattern-driven Approach</article-title>
          . (
          <year>2019</year>
          ).
          <source>arXiv:cs.SI/1906.04210</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>