<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A BERT based Two-stage Fake News Spreaders Profiling System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shih-Hung Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sheng-Lun Chien</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Engineering, Chaoyang University of Technology Taichung</institution>
          ,
          <country country="TW">Taiwan, R.O.C</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>This paper describes our two-stage classification approach to the CLEF 2020 lab: Profiling Fake News Spreaders on Twitter. The task can be briefly defined as: given a Twitter feed, determine whether its author is keen to be a spreader of fake news. Our approach adopts the pretrained model BERT as a tweet classifier and spots potential spreaders whose tweets are strongly suspected to be fake news. The accuracy of our approach reached 0.71 on the English data set during system development. However, the performance dropped to 0.56 in the final PAN at CLEF 2020 shared task evaluation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A great amount of fake news and rumors is propagated in online social networks.
According to experience in developing anti-spam techniques, spotting the source is a
better approach than trying to check the content one by one. The aim of the profiling
fake news spreaders task at PAN 2020 is to determine whether it is possible to
discriminate authors who have posted some fake news in the past from those who
have never done so [1].</p>
      <p>The organizers propose the task from a multilingual perspective, provide data
sets in English and Spanish, and recommend that participants take part in both
languages. The uncompressed dataset consists of one folder per language (en, es). Each
folder contains an XML file per author (Twitter user) with 100 tweets, and the
filenames of these XML files correspond to the unique author IDs. There is also a
separate truth.txt file with the list of authors and the ground truth of whether they are
fake news spreaders or not. Systems are ranked by their accuracy
in discriminating between the two classes.</p>
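      <p>To make the data layout concrete, the loader below sketches how the per-author XML files and truth.txt might be read. The ":::" separator and the &lt;document&gt; elements wrapping each tweet follow the usual PAN author-profiling conventions; they are assumptions here, not details stated in this paper.</p>

```python
import xml.etree.ElementTree as ET


def load_truth(path):
    """Parse truth.txt; each line is assumed to be 'author_id:::label',
    where label is 1 for a fake news spreader and 0 otherwise."""
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            author_id, label = line.split(":::")
            labels[author_id] = int(label)
    return labels


def load_author_tweets(xml_path):
    """Each author XML file (named after the author ID) is assumed to
    hold its 100 tweets in <document> elements."""
    tree = ET.parse(xml_path)
    return [doc.text or "" for doc in tree.iter("document")]
```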
      <p>However, due to limitations of time and resources, we built a system only
for the English tweets, based on content analysis, and skipped the Spanish tweets.
The decision process of our system is a two-stage classification approach to the
Profiling Fake News Spreaders on Twitter task. Our system adopts the pre-trained
bidirectional transformer language model known as BERT [2] as its NLP tool for
content analysis. During the training phase, we fine-tune the pretrained model
BERT as a tweet classifier and use it to classify each tweet as potential fake news or
not. Then our system spots a spreader by checking the percentage of each author's
tweets that are classified as fake news. If the percentage is higher than a threshold,
we consider the author a fake news spreader.</p>
    </sec>
    <sec id="sec-2">
      <title>The BERT Pre-trained Model</title>
      <p>The system flow is shown in the following figures. Figure 1 shows the BERT
model and classifier architecture. The core of our system is the pretrained language
model BERT, which stands for Bidirectional Encoder Representations from Transformers.
The BERT model is a bidirectional transformer pre-trained using a
combination of a masked language modeling (MLM) objective and next sentence
prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.
BERT is designed to pre-train deep bidirectional representations from unlabeled text. As a
result, the pre-trained BERT model can be fine-tuned with just one additional output
layer to create models for new tasks. The implementation of BERT that we use is
BERT for sequence classification from the Hugging Face library1. The pretrained model is
"bert-base-uncased", which requires all English text to be in lower case. The
hyperparameters in the training phase are: hidden size = 768, learning rate = 6.0e-5, and
vocabulary size = 30522. We train the model for 10 epochs in each experiment setting.</p>
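      <p>As an illustration, the sketch below instantiates a binary BERT sequence classifier with these hyperparameters using the Hugging Face transformers library, and runs one fine-tuning step on a dummy batch. In the real system the weights would be loaded with from_pretrained("bert-base-uncased") and the inputs would come from the tokenized tweets; the random inputs here are placeholders.</p>

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Hyperparameters from the paper: hidden size 768, vocabulary 30522,
# learning rate 6.0e-5, two output classes (potential fake news or not).
config = BertConfig(vocab_size=30522, hidden_size=768, num_labels=2)
model = BertForSequenceClassification(config)
# The actual system instead loads pretrained weights:
#   BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=6.0e-5)

# One fine-tuning step on a dummy batch (label 1 = tweet of a spreader).
input_ids = torch.randint(0, config.vocab_size, (1, 32))
loss = model(input_ids=input_ids, labels=torch.tensor([1])).loss
loss.backward()
optimizer.step()
```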
      <p>[Figure 1. The BERT model and classifier architecture: the input tokens [CLS], W1, W2, W3, ..., Wn are fed into BERT, and the [CLS] output is passed to a linear classifier that predicts the class.]</p>
      <p>1 https://huggingface.co/transformers/model_doc/bert.html</p>
      <p>Figure 2 shows the training flow. Non-English characters are filtered out, and only
English characters are kept for the training data. Then we combine all data into a training
dataset for the model, to let it learn which tweets may be telling fake news or not.</p>
      <p>[Figure 2. Training flow: input all XML files; extract every tweet and associate the truth label with each tweet; process the text and keep only English characters; convert the files into CSV format; transform the CSV file into the training dataset; run the BERT classification.]</p>
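      <p>The character filtering and CSV-building steps of the training flow can be sketched as follows. The exact character set kept by the original system is not specified, so the regular expression below (ASCII letters, digits, and basic punctuation) is an assumption, as are the helper names.</p>

```python
import csv
import re


def clean_tweet(text):
    """Keep only English letters, digits, whitespace, and basic punctuation,
    mirroring the paper's 'only English characters' filtering, then
    lower-case the result (as "bert-base-uncased" requires)."""
    text = re.sub(r"[^A-Za-z0-9\s.,!?'\"#@]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def write_training_csv(authors, out_path):
    """authors: {author_id: (label, [tweets])}. Every tweet inherits its
    author's truth label, as the paper's training assumption dictates."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "text"])
        for author_id, (label, tweets) in authors.items():
            for tweet in tweets:
                writer.writerow([label, clean_tweet(tweet)])
```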
      <p>There are 300 authors in the training set, each with 100 tweets. Half
of them are spreaders; however, we do not know whether each individual tweet is fake news or
not. We therefore assume that all tweets belonging to the spreaders are potential fake
news and all tweets belonging to the non-spreaders are real news. Under this assumption, we trained a
classifier that classifies tweets into potential fake ones and real ones.</p>
      <p>We know this assumption is imprecise, so the classifier cannot spot fake news
well. Therefore, when we use it to spot a spreader, we apply a threshold mechanism to
avoid identifying too many authors as spreaders. Only if the percentage of an
author's tweets classified as fake news passes the threshold will he/she be labelled a
spreader; an author with only a few such tweets will not be. The decision is made
with an empirical threshold. We divide the training set
into two parts, using 70% of the data as a training set and 30% as a test set, and use
this development set to find the best threshold. Figure 3 shows the
accuracy vs. threshold result, where the threshold ranges from 60% to 90%. The
system reaches an accuracy of 0.71 with a threshold of 74%. This threshold was selected
manually and used in our system.</p>
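      <p>The threshold search over the development split can be sketched as below. The helper names spreader_accuracy and best_threshold are hypothetical, and the 60-90% sweep mirrors the range explored in Figure 3.</p>

```python
def spreader_accuracy(tweet_preds, author_labels, threshold):
    """tweet_preds: {author_id: [0/1 prediction per tweet]};
    author_labels: {author_id: 0/1 ground truth}. An author is flagged
    a spreader when the fraction of fake-flagged tweets exceeds the threshold."""
    correct = 0
    for author_id, preds in tweet_preds.items():
        fake_ratio = sum(preds) / len(preds)
        predicted = 1 if fake_ratio > threshold else 0
        correct += (predicted == author_labels[author_id])
    return correct / len(tweet_preds)


def best_threshold(tweet_preds, author_labels):
    """Sweep thresholds over the 60%-90% range and return the best one
    (ties broken toward the lowest threshold)."""
    grid = [t / 100 for t in range(60, 91)]
    return max(grid, key=lambda t: spreader_accuracy(tweet_preds, author_labels, t))
```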
      <p>To estimate how the system might perform, we conducted several similar experiments
on the training set. Table 1 shows the test results. The accuracy is around 0.65
to 0.71 given enough training data, i.e., 60% to 80% of the data in the training set. We
expected our system to achieve similar results in the formal test.</p>
      <p>Figure 4 shows how our system performs the test. Before testing, we again do the
data preprocessing first. We extract every tweet from each author file (XML file)
and use the model to predict a label for every tweet. After all tweets of one author are labelled
1 or 0, the threshold mechanism decides whether the author is a
spreader or not by checking whether the percentage of 1s exceeds 74%. Our
system then writes the final answer with the author id to an XML file and finishes the task.</p>
      <p>[Figure 4. Test flow: input the XML file; extract every tweet; transform the data into a test-set CSV file; run the BERT classifier on every tweet; calculate the average class label of the tweets for each author; check whether the average passes the threshold.]</p>
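      <p>The per-author decision at test time reduces to a few lines, sketched below. The predict_tweet callable stands in for the fine-tuned BERT classifier, and the answer-file format (an &lt;author&gt; element with id, lang, and type attributes) follows the usual PAN output convention; both are assumptions, not details given in this paper.</p>

```python
import xml.etree.ElementTree as ET

THRESHOLD = 0.74  # the empirically chosen development-set threshold


def classify_author(tweets, predict_tweet):
    """predict_tweet returns 1 if a single tweet is classified as potential
    fake news. The author is flagged a spreader (returns 1) when the share
    of such tweets exceeds the 74% threshold, else 0."""
    preds = [predict_tweet(t) for t in tweets]
    return 1 if sum(preds) / len(preds) > THRESHOLD else 0


def write_answer(author_id, lang, decision, out_path):
    # Assumed PAN answer format: <author id="..." lang="..." type="0|1" />
    el = ET.Element("author", id=author_id, lang=lang, type=str(decision))
    ET.ElementTree(el).write(out_path, encoding="utf-8")
```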
      <p>The test was run on the virtual machine provided by the organizers, where we met
a technical error. In the training phase, all non-English characters are filtered out and
only English characters are kept for the training data, but we omitted this step during the
test phase. This is one of the reasons our system's performance decreased. Table 2
shows our system's official final test result against some benchmarks. The accuracy
of our system is 0.560, which is equal to the LSTM benchmark but lower than our
best result on the development set.</p>
      <p>
        This paper describes our two-stage classification approach to the Profiling Fake News
Spreaders on Twitter task. The performance of our approach reached 0.71 on the
development set. However, the performance dropped to 0.56 in the
        <xref ref-type="bibr" rid="ref3">final PAN evaluation
at CLEF 2020</xref>
        shared task.
      </p>
      <p>As future work, we intend to investigate what other information might
help to detect fake news spreaders [3]. For example, in addition to news content and
labels, fake news articles in some datasets also provide information on the Twitter social
network, which contains Twitter users and their following relationships, i.e.,
user-user relationships, and how the news has been propagated (tweeted/re-tweeted) by users,
i.e., news-user relationships [4].</p>
      <p>Acknowledgements
This study was supported by the Ministry of Science and Technology under grant
number MOST 109-2221-E-324-024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Rangel, F., Giachanou, A., Ghanem, B., Rosso, P.: Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings, CEUR-WS.org (2020)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805v2 [cs.CL] (2018)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>Ruchansky, N., Seo, S., Liu, Y.: CSI: A Hybrid Deep Model for Fake News Detection. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 797-806. ACM (2017)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>Shu, K., Wang, S., Liu, H.: Beyond News Contents: The Role of Social Context for Fake News Detection. In: WSDM (2019)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>Rangel, F., Franco-Salvador, M., Rosso, P.: A Low Dimensionality Representation for Language Variety Identification. In: Postproc. 17th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2016, Springer-Verlag, Revised Selected Papers, Part II, LNCS(9624), pp. 156-169 (arXiv:1705.10754)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>Ghanem, B., Rosso, P., Rangel, F.: An Emotional Analysis of False Information in Social Media and News Articles. ACM Transactions on Internet Technology (TOIT) 20(2), pp. 1-18 (2020)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>