<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fake news detection: Network data from social media used to predict fakes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Torstein Granskogen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jon Atle Gulla</string-name>
          <email>jon.atle.gulla@ntnu.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Norwegian University of Science and Technology</institution>
          ,
          <addr-line>Trondheim</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Fake news has swept through the media world in the last few years, and with it comes a wish to detect these fakes accurately and automatically so that action can be taken against them. Social network sites are among the places where this kind of content is most widely shared. Using the structure of these sites, we can predict with high accuracy whether a post is fake or not. We do this not by analyzing the contents of the posts, but by using the social structure of the site. These social network data mimic the real world, where people with similar interests come together around topics and positions. Using logistic regression and crowdsourcing algorithms, we consolidate previous findings, with prediction accuracy as high as 93% on datasets ranging from about 4200 to 15,500 posts. The algorithms perform best on full datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake news detection</kwd>
        <kwd>Social Networks</kwd>
        <kwd>Contextual Information</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Introduction</title>
      <sec id="sec-2-1">
        <title>Problem description</title>
        <p>Fake news is a phenomenon that has swept over the world in a massive way over the last
few years. Suddenly we feel bombarded by news that we cannot tell is true or not. To
combat this, the scientific community is devising ways to automatically detect whether a
piece of information is reliable. In this paper we propose a different approach, based not
on the contents of the news articles, text snippets, tweets, etc., but on the traffic, the
users, and their relations.</p>
        <p>
          As shown in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] there is a high correlation between the users who actively
comment on or like fake articles and stories on Facebook. We want to build on this idea,
both by expanding the techniques used by [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], and by trying to apply them to data that
is not as structured as social media. Finally, we want to generate a web-of-trust structure
on top of the existing data, that can be used to compute a reliability score for nodes. We
hope that this type of scoring can be used on other actors, such as news agencies,
publishers and other important contributors in the information industry.
        </p>
        <p>Copyright held by the author(s). NOBIDS 2017.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Dataset</title>
      <p>
        Our dataset is twofold. First, we have recreated the dataset
used in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as closely as possible using the same techniques, collecting
data from 2016-07-01 to 2016-12-31. Some of the data is no longer available,
and therefore the dataset is not complete, but it contains about one third of the original
data. We take this into account when comparing the results to the original ones. The
information is volatile, especially the fake parts since Facebook actively removes
unwanted information on their site [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The gathered data contains the posts from both scientific and
non-scientific sources, together with the likes on those posts, including likes on
comments. Likes on comments were attributed to the parent post ID rather than to the
individual comments. The posts were sorted by the community they belonged to, so that a
hierarchy of source → post → likes was generated. The identifier for a source is a string of
numbers, and each post is identified by the ID sourceID_postID. Beyond that, the IDs of the
users are the only information stored per post; no other information about the users
was used. The data was processed to find the likes from each unique user, and also
the co-occurrence of users on the same posts. The datasets were gathered using the
Facebook Graph API[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
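      <p>As a concrete illustration of the layout described above, the following sketch builds the per-user like sets and the user co-occurrence counts. All IDs are hypothetical stand-ins; the real data was fetched through the Facebook Graph API.</p>

```python
# Illustrative sketch of the data layout: posts are keyed "sourceID_postID",
# and only the IDs of the users who liked each post are stored.
from collections import defaultdict
from itertools import combinations

likes_per_post = {  # toy stand-in for the gathered data
    "1001_1": {"u1", "u2", "u3"},
    "1001_2": {"u2", "u3"},
    "2002_1": {"u3", "u4"},
}

# Likes from each unique user.
likes_per_user = defaultdict(set)
for post_id, users in likes_per_post.items():
    for u in users:
        likes_per_user[u].add(post_id)

# Co-occurrence of users on the same posts.
cooccurrence = defaultdict(int)
for users in likes_per_post.values():
    for a, b in combinations(sorted(users), 2):
        cooccurrence[(a, b)] += 1

print(len(likes_per_user["u3"]))   # u3 liked 3 posts
print(cooccurrence[("u2", "u3")])  # u2 and u3 met on 2 posts
```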
      <sec id="sec-3-1">
        <title>Original dataset</title>
        <p>The original dataset consisted of 15,500 posts and 909,236 users, while the one we
were able to generate consists of 4286 posts with a total of 158,789 users.</p>
        <p>This dataset is a combination of scientific and non-scientific pages. The non-scientific
pages are known to publish or embrace fake information, whereas the scientific ones
are known to publish only truthful information. This gives a two-way differentiation,
with two major groups at the extremes that help us differentiate news stories.</p>
      </sec>
      <sec id="sec-3-2">
        <title>New dataset</title>
        <p>
          In addition to this dataset, we have gathered our own, both to test the same methods
as in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] on a different dataset, and also to check whether locale, location, or topic has an
impact on the results. Locale here means the geographical and social affiliation of the
users. The second dataset is divided in the same way as the first, and comprises a
combination of sources from [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The two sources were needed to get a
dataset of similar size and complexity. Not all the sources had a Facebook page, so all of
them are not part of the dataset. The complete list of the sources used in both datasets
can be found in Table 1.
        </p>
        <p>
          The new dataset consists of 5943 posts, over 9.5 million likes, and 5.6 million
unique users. This means that the new dataset consists of fewer posts, but more users
and likes. This is because the sources for the data are mostly big English or
international mainstream sites, especially the scientific ones, which therefore have
much greater coverage than the mostly local Italian sites that were used in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], and
contain a bigger spread in locale. This was done to check whether a more densely
populated dataset with more low-quality users would perform as well as the
geographically restricted dataset of [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <table-wrap id="table1">
          <label>Table 1</label>
          <caption>
            <p>The sources used in both datasets.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Dataset</th>
                <th>Scientific</th>
                <th>Non-scientific</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Original dataset</td>
                <td>Scientificast; Cicap.org; Oggiscienza.it; Queryonline; Gravitazeroeu; COELUM Astronomia; MedBunker; Scienze Naturali; Perché vaccino; Le Scienze; Vera scienza; Scienza in rete; Galileo, giornale di scienza e problemi globali; Scie Chimiche: Informazione Corretta; Complottismo? No grazie; Scienza Live; In Difesa della Sperimentazione Animale; Italia Unita per la Scienza; La scienza come non l’avete mai vista; Liberascienza</td>
                <td>Eco(R)esistenza; AmbienteBio; Scienza di Confine; CSSC - Cieli Senza Scie Chimiche; STOP ALLE SCIE CHIMICHE; vaccinibasta; Tanker Enemy; Scie Chimiche; MES Dittatore Europeo; Lo sai; Curarsialnaturale; La Resistenza; Radical Bio; Fuori da Matrix; Graviola Italia; Signoraggio.it; Informare Per Resistere; Sul Nuovo Ordine Mondiale; Avvistamenti e Contatti; Umani in Divenire</td>
              </tr>
              <tr>
                <td>New dataset</td>
                <td>The Wall Street Journal; The Economist; BBC News; NPR; CBS; ABC News; USA Today; The Guardian; NBC; The Washington Post</td>
                <td>Before it’s News; InfoWars; Real News. Right Now.; American Flavor; World Politics Now; We Conservative; Washington Feed; American People Network; Uspoln; US INFO News; Clash Daily</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-3">
        <title>Methodology</title>
        <p>The methods used were based on two different algorithms, Logistic Regression (LR)
and Harmonic Boolean Label Crowdsourcing (HBLC). LR is a simpler algorithm than
HBLC and does not transfer information between users, whereas HBLC does. LR considers
a set of posts I and users U, where each post i in I has a feature vector (x_u)_{u in U},
with x_u = 1 if user u liked post i and x_u = 0 otherwise. The posts are classified based
on the users who liked them.</p>
        <p>The classification is done using an LR model, where each user is given a weight.
The summed weight of the users who liked a post indicates whether it is a hoax or not:
the higher the weight, the more likely the post is to be a hoax.</p>
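        <p>The scoring step can be sketched as follows. The user weights here are hypothetical stand-ins for learned coefficients, not values from our experiments:</p>

```python
# Minimal sketch of the LR scoring: each user has a weight, a post's score is the
# sum of the weights of the users who liked it, and the logistic function turns
# the score into a hoax probability. Weights are hypothetical.
import math

w = {"u1": 1.2, "u2": 0.8, "u3": -1.5, "u4": -0.3}  # positive = hoax-leaning

def hoax_probability(liking_users):
    score = sum(w[u] for u in liking_users)
    return 1.0 / (1.0 + math.exp(-score))  # logistic function

print(hoax_probability({"u1", "u2"}) > 0.5)  # liked by hoax-leaning users
print(hoax_probability({"u3"}) > 0.5)        # liked by a truth-leaning user
```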
        <p>HBLC is based on Boolean labels, where a label is either True or False. The value
is set to True if the user likes the post, i.e. gives the post confidence. The dataset is
represented by a bipartite graph consisting of the users, the likes, and the posts. The
harmonic algorithm maintains two beta distributions per user, representing the number of
times the user has been seen liking hoax and non-hoax posts, respectively.</p>
        <p>HBLC calculates the quality of a post based on these distributions over all the users
that have interacted with it; if the quality is negative, the post is considered a hoax, and a
non-hoax otherwise. Because of the iterative nature of the harmonic algorithm, it can
propagate information: a hoax-prone user will have an increased value in its hoax
beta distribution, which is reflected in the post beliefs and consequently influences the
inferred preferences of other, similar users.</p>
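        <p>The following is a deliberately simplified sketch of this idea, not the exact harmonic algorithm of [1]: two per-user counts play the role of the beta distributions, and user counts and post beliefs are updated in turn so that information propagates through the bipartite graph. All data is hypothetical.</p>

```python
# Simplified propagation sketch (hypothetical data, not the authors' implementation).
likes_per_post = {"p1": {"u1", "u2"}, "p2": {"u1"}, "p3": {"u3"}}
belief = {"p1": -1.0, "p2": 0.0, "p3": 1.0}  # seeds: p1 hoax, p3 non-hoax, p2 unknown

users = {"u1", "u2", "u3"}
hoax_seen = {u: 1.0 for u in users}     # count behind the "hoax" beta distribution
nonhoax_seen = {u: 1.0 for u in users}  # count behind the "non-hoax" beta distribution

for _ in range(10):  # fixed number of rounds for the sketch
    # Update each liker's counts from the current post beliefs.
    for post, likers in likes_per_post.items():
        for u in likers:
            if belief[post] < 0:
                hoax_seen[u] += 1
            elif belief[post] > 0:
                nonhoax_seen[u] += 1
    # Re-estimate each post's quality from its likers' tendencies;
    # negative quality means the post is considered a hoax.
    for post, likers in likes_per_post.items():
        belief[post] = sum(
            (nonhoax_seen[u] - hoax_seen[u]) / (nonhoax_seen[u] + hoax_seen[u])
            for u in likers
        ) / len(likers)

print("hoax" if belief["p2"] < 0 else "non-hoax")  # p2 inherits u1's hoax leaning
```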
        <p>
          A more detailed description of both LR and HBLC can be found in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Preliminary results</title>
      <p>
        We have been able to recreate the results that [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] got using our own version of their
dataset with similar results, thereby confirming the findings from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A discussion of
these results in detail can be found in Section 3.2.
      </p>
      <p>
        Since we were not able to fully recreate the dataset from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the results cannot be
compared directly. Instead, we can use them to test the boundaries of the viability of
the different algorithms, and thereby get an indication of how much data is needed for
adequate results.
      </p>
      <sec id="sec-4-1">
        <title>Dataset results</title>
      </sec>
      <sec id="sec-4-2">
        <title>Original dataset</title>
        <p>The smaller size of the dataset we gathered does not change the results very much,
but we see that the smaller the dataset, the more each post impacts the total score;
thus the standard deviation increases and the robustness of the results falls.</p>
        <p>In addition to these tests, we have done some work on testing other algorithms and
how they react to this kind of network data. There is still work to be done to figure out
the best parameters using different techniques for this kind of problem, since the data
are non-textual and different from what these methods are normally applied on, and to
figure out if they are applicable at all.</p>
        <p>For the original dataset, we can see that the differences using logistic regression
(LR) on the two different versions of the dataset are minor. This is a good indication
that LR is a robust algorithm for this kind of data. It performs similarly and
predictably on much lower volumes of data. The standard deviation increases, but that is to
be expected, as the individual posts have a bigger impact in a smaller dataset.</p>
        <p>
          On the other hand, harmonic Boolean label crowdsourcing (HBLC) seems to be
more volatile as the size of the dataset decreases. This might be an indication that
HBLC needs bigger datasets to perform as well as it did in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>New dataset</title>
        <p>On the new dataset, we can see that the results are similar to those on the original
dataset, which gives a good indication that the algorithms can handle data from different
sources.</p>
        <p>For LR, the results are almost identical to those on the original dataset. This is an
indication that LR is a robust and reliable algorithm. Since the sources were not checked
for structural similarities before being collected, this shows that as long as the input
data can be divided into non-scientific and scientific groups, LR can be used with good
results.</p>
        <p>
          For HBLC, we can see that it performs better than LR overall, but it seems to be
more prone to fluctuations when working with smaller datasets. On larger
datasets, HBLC can predict with very high accuracy whether a post is truthful or not.
However, HBLC does not produce results on our dataset as good as those on the one
used originally in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <table-wrap id="table2">
          <caption>
            <p>Average prediction accuracy of the two algorithms on each dataset.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Dataset</th>
                <th>Logistic Regression</th>
                <th>Harmonic BLC</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Original dataset</td>
                <td>0.772</td>
                <td>0.939</td>
              </tr>
              <tr>
                <td>New dataset</td>
                <td>0.794</td>
                <td>0.732</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Further work</title>
      <p>Going forward, we would like to improve the results we have. This can be done in
several ways, and we are going to concentrate on a few of them. First and foremost,
we want to look at how further preprocessing of the data will change the results. Since
the datasets have a clear majority of users with only a few likes, or just a single like per post,
and these users do not contribute much to the result since they have few connections
to the rest of the data, removing them or in some way reducing their impact will most
likely improve the results.</p>
      <p>In addition to this, when using some of the more well-known sites as sources, such
as The Wall Street Journal and BBC News, the number of users and data increases
rapidly, and the runtime increases even faster. Because of this, a few different
approaches can be used. If the system is going to be used in a time sensitive fashion,
applying a best-effort algorithm like simulated annealing might help. These kinds of
algorithms will give a best possible solution within a given timeframe, and will come
closer to the optimal solution the more time it is given to find it. Another way to
decrease the complexity is to cluster the users in one way or another. By clustering the
users after either closeness to each other or how important they are, the number of
operations will be drastically reduced, but some information will be lost to the loss of
granularity.</p>
      <p>Since the number of usable users is so small when dealing with the mainstream
sites, the intersection dataset becomes very small compared to the total size: out of
over 5.6 million users, only 14 thousand have liked posts from both scientific and
non-scientific sources. This might be a consequence of the choice of fake news sites,
but it also indicates that a certain size is needed for a site to be viable. To be able to
use these algorithms successfully in an industrial setting, we need to be able to
extrapolate the value each user has, or else the intersection dataset will be too small
for reliable results.</p>
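      <p>The intersection dataset itself is a plain set intersection over the two groups of likers; with hypothetical IDs:</p>

```python
# Users who liked posts from BOTH groups carry the cross-group signal.
scientific_likers = {"u1", "u2", "u3", "u4"}
nonscientific_likers = {"u3", "u4", "u5"}

intersection = scientific_likers & nonscientific_likers
print(sorted(intersection))  # only these users connect the two groups
```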
      <p>
        Because of that, we want to try to apply a web-of-trust, like what was done in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
on top of the existing results, and in that way use it as an early classifier based
only on the users. The web will consist of users and the weighted edges between
them. We can then use these weights, based on which nodes are already contained in
the different posts, and apply social techniques such as nearest neighbor or clustering
to get an indication of what these users prefer. This score can then be used in addition
to the one from the algorithms, and hopefully give a better indication of whether a
post is fake or not.
      </p>
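      <p>As a rough sketch of how such a web of trust could be queried (all names, weights, and scores hypothetical), an unseen user's preference can be extrapolated as a weighted average over its known neighbors:</p>

```python
# Hypothetical web of trust: users are nodes, weighted edges link them (e.g. from
# co-occurrence), and a user's hoax score is extrapolated from its neighbors.
edges = {
    ("u1", "u2"): 3.0,
    ("u2", "u3"): 1.0,
    ("u1", "u4"): 2.0,
}
known_score = {"u1": -0.8, "u3": 0.6}  # negative = hoax-leaning (from LR/HBLC)

def neighbors(u):
    for (a, b), weight in edges.items():
        if a == u:
            yield b, weight
        elif b == u:
            yield a, weight

def extrapolated_score(u):
    # Weighted average of the known scores of u's neighbors.
    pairs = [(known_score[v], weight) for v, weight in neighbors(u) if v in known_score]
    total = sum(weight for _, weight in pairs)
    return sum(s * weight for s, weight in pairs) / total if total else 0.0

print(extrapolated_score("u2"))  # pulled towards u1's hoax-leaning score
```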
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        We have shown that logistic regression and harmonic Boolean label crowdsourcing
are both viable algorithms on datasets that differ from the original ones that [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
published. On datasets with a smaller intersection between the users, both algorithms
perform worse, but we hope to remedy this later through further preprocessing of the data.
The algorithms show robustness across different datasets: one where the number of users
relative to pages is small, and another with more users on a smaller number of
pages.
      </p>
      <p>
        The approach proposed here also does not consider what kind of fake or truthful
information is shown, such as whether the fakes are serious fabrications, large-scale
hoaxes, or humorous fakes, as mentioned in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Tacchini</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballarin</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Della Vedova</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moret</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Alfaro</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Some Like it Hoax: Automated Fake News Detection in Social Networks</article-title>
          .
          <source>Technical Report UCSC-SOE-17-05</source>
          , School of Engineering, University of California, Santa Cruz (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          CNET:
          <article-title>Mark Zuckerberg on fake news</article-title>
          , https://www.cnet.com/news/facebook-fake-news-mark-zuckerberg/, last accessed
          <year>2017</year>
          /11/6.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          The Facebook Graph API, https://developers.facebook.com/docs/graph-api/, last accessed
          <year>2017</year>
          /11/2.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          Buzzfeed
          <article-title>Political News Data repository</article-title>
          , https://github.com/rpitrust/fakenewsdata1, last accessed
          <year>2017</year>
          /10/28.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <article-title>PolitiFact’s guide to fake news websites</article-title>
          , http://www.politifact.com/punditfact/article/2017/apr/20/politifacts-guide-fake-news-websites-and-what-they/, last accessed
          <year>2017</year>
          /10/28.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>V. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conroy</surname>
            ,
            <given-names>N. J.</given-names>
          </string-name>
          :
          <article-title>Deception Detection for News: Three Types of Fakes</article-title>
          . University of Western Ontario, London, Ontario (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Tavakolifard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almeroth</surname>
            ,
            <given-names>K. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulla</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          :
          <article-title>Does Social Contact Matter? Modelling the Hidden Web of Trust Underlying Twitter*</article-title>
          .
          <source>In: WWW '13 Proceedings of the 22nd International Conference on World Wide Web</source>
          , p.
          <fpage>981</fpage>
          -
          <lpage>988</lpage>
          . Norwegian University of Science and Technology, Trondheim, Norway and University of California at Santa Barbara, Santa Barbara, USA (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>