<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fake News Classification with BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrey Malakhov</string-name>
          <email>andrey@zephyros.solutions</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Patruno</string-name>
          <email>alex@zephyros.solutions</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Bocconi</string-name>
          <email>stefano@zephyros.solutions</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the use of BERT-family transformers for the multi-class classification task of the "FakeNews: Corona virus and 5G conspiracy" track, a Natural Language Processing based fake news detection challenge organized by MediaEval. It demonstrates how pretrained transformers can be used to discriminate between tweets. We investigated the application of the Bidirectional Encoder Representations from Transformers (BERT) model to a classification task on a Twitter dataset. The task is described on the MediaEval webpage [1] and discussed in Pogorelov et al. 2020 [2]. The dataset consists of two parts: the full-text content of the tweets and a sequence of images representing graph networks. It is described in Schroeder et al. 2019 [3]. Because of the limited time available, we worked only with the full-text tweets and discarded the network graph information. We trained a classifier on top of the pretrained BERT base model to discriminate between 5G conspiracy, other conspiracy, and non-conspiracy tweets.</p>
      </abstract>
      <kwd-group>
        <kwd>Figure 1</kwd>
        <kwd>Training Loss</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>The analysis is mainly based on the BERT model, which is described
in Devlin et al. 2018 [4]. BERT builds on (but differs from) the
architecture of the original Transformer paper (Vaswani et al. 2017) [5]:
it uses the Transformer encoder and adds bidirectionality to the language
model, so that each token representation is conditioned on both its left
and right context rather than on a single direction.</p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>This section describes the steps applied to train the model.</p>
    </sec>
    <sec id="sec-4">
      <title>Data preprocessing</title>
      <p>Only the tweet text was extracted from the training data. We did not use
any metadata, so the model is trained purely on text.</p>
      <p>The first step was to remove the hash sign (#) from the text. This is
necessary because hashtags are often incorporated into the tweet text as
words of the sentence, for example:</p>
      <p>This is an #example of a #tweet with #hashtags within the #text</p>
      <p>The second step was to replace user mentions in tweets, which occur for
example in a reply to multiple users. Such mentions can inflate the length
of the text, which is a problem given the sequence length limitation of the
BERT model. In addition, the usernames carry no useful information for our
task. The example below illustrates the replacement:</p>
      <p>original tweet: @username1, @username2, @username3 test message</p>
      <p>processed tweet: username test message</p>
      <p>The third step was to apply the BERT tokenizer. The tokenizer prepares
the text to be fed to the model by splitting the sentence into tokens,
adding special tokens (for example to mark the beginning of a sentence),
padding the sentence, etc. We limited the length of the tweets to 256
tokens: shorter texts are padded to this length and longer ones are
truncated. Very few tweets exceeded this length, so this choice reduces the
computation time without a significant loss of information and therefore
appears justified.</p>
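      <p>The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the submission: the names clean_tweet and pad_or_truncate, the regular expression, and the padding id 0 are assumptions made for the example, and the padding helper only mimics, on a plain list of token ids, what the BERT tokenizer does internally.</p>
      <preformat>
```python
import re

# Assumption: a run of mentions such as "@user1, @user2 " is collapsed
# into a single placeholder word, as in the example tweet above.
MENTION_RUN = re.compile(r"(?:@\w+[,\s]*)+")


def clean_tweet(text):
    """Apply the two text-cleaning steps to one tweet."""
    # Step 1: drop the hash sign but keep the hashtag word itself.
    text = text.replace("#", "")
    # Step 2: replace each run of user mentions by the word "username".
    text = MENTION_RUN.sub("username ", text)
    # Normalise whitespace left over from the substitutions.
    return " ".join(text.split())


def pad_or_truncate(token_ids, max_len=256, pad_id=0):
    """Bring a list of token ids to exactly max_len entries.

    Simplified stand-in for the tokenizer's padding and truncation;
    pad_id=0 is an assumed padding id.
    """
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


cleaned = clean_tweet("@username1, @username2, @username3 test message")
fixed_length = pad_or_truncate([101, 2023, 102])
```
      </preformat>
      <p>A real tokenizer also keeps the closing special token when it truncates a long text, which this simplified helper does not do.</p>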
    </sec>
    <sec id="sec-5">
      <title>Classification model</title>
      <p>The model uses a single linear layer with three output neurons on top
of the BERT base output. Adding a second linear layer did not improve the
results significantly. The optimizer was Adam, with weight decay disabled
for the bias and normalization-layer parameters.</p>
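      <p>Excluding bias and normalization parameters from weight decay is commonly done by splitting the model parameters into two groups by name. The sketch below shows one way to do this, under stated assumptions: the function name build_param_groups is hypothetical, the name markers "bias" and "LayerNorm.weight" follow the parameter naming of common BERT implementations, and the decay value 0.01 is a placeholder, not a value reported in this paper. The returned list has the per-parameter-group form that PyTorch optimizers accept.</p>
      <preformat>
```python
# Assumed name fragments that identify parameters exempt from weight decay.
NO_DECAY_MARKERS = ("bias", "LayerNorm.weight")


def build_param_groups(named_params, weight_decay=0.01):
    """Split (name, parameter) pairs into decayed and non-decayed groups."""
    decay, no_decay = [], []
    for name, param in named_params:
        if any(marker in name for marker in NO_DECAY_MARKERS):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Placeholder strings stand in for real parameter tensors.
example_params = [
    ("encoder.layer.0.output.dense.weight", "w"),
    ("encoder.layer.0.output.dense.bias", "b"),
    ("encoder.layer.0.output.LayerNorm.weight", "g"),
    ("classifier.weight", "c"),
]
groups = build_param_groups(example_params)
```
      </preformat>
      <p>With a real model, the pairs would come from the model's named parameters, and the resulting groups would be passed to the optimizer in place of a flat parameter list.</p>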
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND ANALYSIS</title>
      <p>The model was trained for 20 epochs. Figure 1 and Figure 2 show the
log-loss per training iteration for the training and validation
datasets.</p>
      <p>We chose the epoch for our final model based on the accuracy per
class. In the end, epoch 19 was chosen; its confusion matrix is given in
Table 1. Applying the model to the test set gave a final score of 0.400846
on the official metric of the MediaEval task.</p>
      <p>Table 1: Confusion matrix for epoch 19 over the classes 5G conspiracy,
other conspiracy, and non-conspiracy.</p>
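      <p>Selecting an epoch by per-class accuracy can be illustrated as follows. This is a sketch under assumptions: the confusion matrices are invented numbers, not the values of Table 1, the function names are hypothetical, and ranking epochs by the mean of the per-class accuracies is one possible selection rule, which may differ from the one applied for the submission.</p>
      <preformat>
```python
CLASSES = ("5G conspiracy", "other conspiracy", "non-conspiracy")


def per_class_accuracy(confusion):
    """Fraction of correctly predicted tweets for each true class.

    Rows of the matrix are true classes, columns are predicted classes.
    """
    result = {}
    for i, name in enumerate(CLASSES):
        row_total = sum(confusion[i])
        result[name] = confusion[i][i] / row_total if row_total else 0.0
    return result


def best_epoch(confusions_by_epoch):
    """Return the epoch with the highest mean per-class accuracy."""
    def mean_accuracy(epoch):
        accuracies = per_class_accuracy(confusions_by_epoch[epoch])
        return sum(accuracies.values()) / len(accuracies)
    return max(confusions_by_epoch, key=mean_accuracy)


# Hypothetical validation confusion matrices for two epochs.
confusions = {
    18: [[5, 3, 2], [2, 6, 2], [1, 1, 8]],
    19: [[8, 1, 1], [2, 6, 2], [1, 1, 8]],
}
chosen = best_epoch(confusions)
```
      </preformat>
      <p>Averaging over classes rather than over tweets keeps a large non-conspiracy class from dominating the choice.</p>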
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>Even a simple classifier reaches a notable accuracy on the raw tweet
text. The following improvements could be made:</p>
      <p>(1) Use multiple folds over the training data to obtain a more robust model</p>
      <p>(2) Increase the complexity of the classifier by adding several layers</p>
      <p>(3) Use BERT Large as the foundation instead of BERT base</p>
      <p>(4) Feed tweet metadata, such as the number of replies, likes, retweets, and mentions, into the classifier part</p>
      <p>(5) Use an ensemble of transformers (RoBERTa, ALBERT, BERT, etc.)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] https://multimediaeval.github.io/editions/2020/tasks/fakenews/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><surname>Pogorelov</surname></string-name> et al., <year>2020</year>, <article-title>"FakeNews: Corona Virus and 5G Conspiracy Task at MediaEval 2020"</article-title>, <source>MediaEval 2020 Workshop</source>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><surname>Schroeder</surname></string-name> et al., <year>2019</year>, <article-title>"FACT: a Framework for Analysis and Capture of Twitter Graphs"</article-title>, <source>In 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS)</source>, pp. <fpage>134</fpage>-<lpage>141</lpage>. IEEE.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><surname>Devlin</surname></string-name> et al., <year>2018</year>, <article-title>"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"</article-title>, <source>arXiv:1810.04805</source>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] <string-name><surname>Vaswani</surname></string-name> et al., <year>2017</year>, <article-title>"Attention is all you need"</article-title>, <source>arXiv:1706.03762</source>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>