Fake news detection: Network data from social media used to predict fakes

Torstein Granskogen¹ and Jon Atle Gulla²

¹ Norwegian University of Science and Technology, Trondheim, Norway, torsteig@stud.ntnu.no
² Norwegian University of Science and Technology, Trondheim, Norway, jon.atle.gulla@ntnu.no

Copyright held by the author(s). NOBIDS 2017.

Abstract. Fake news has swept through the media world in the last few years, and with it comes a wish to detect these fakes accurately and automatically, so that action can be taken against them. Social network sites are among the places where this kind of data is shared the most. Using the structure of these sites, we can predict to a high degree whether a post is fake or not. We do this not by analyzing the contents of the posts, but by using the social structure of the site. These social network data mimic the real world, where people with similar interests come together around topics and positions. Using logistic regression and crowdsourcing algorithms, we consolidate previous findings, with prediction accuracy as high as 93% on datasets ranging from about 4,200 to 15,500 posts. The algorithms show the best performance on full datasets.

Keywords: Fake News Detection, Social Networks, Contextual Information

1 Introduction

1.1 Problem description

Fake news is a phenomenon that has swept over the world in a massive way during the last few years. Suddenly we feel bombarded by news that we cannot know to be true or not. To combat this, the scientific community is looking for ways to automatically detect whether a piece of information is reliable. In this paper we propose a different approach, based not on the contents of the news articles, text snippets, tweets etc., but on the traffic, the users, and their relations. As shown in [1], there is a high correlation between the users that actively comment on or like fake articles and stories on Facebook. We want to build on this idea, both by expanding the techniques used in [1] and by trying to apply them to data that is less structured than social media. Finally, we want to generate a web-of-trust structure on top of the existing data that can be used to compute a reliability score for nodes. We hope that this type of scoring can be applied to other actors, such as news agencies, publishers and other important contributors in the information industry.

2 Dataset

The dataset is twofold. First, we have recreated the dataset used in [1] as closely as possible, using the same techniques. We collected older data, covering 2016-07-01 to 2016-12-31. Some of the data is no longer available, so the dataset is not complete, but it contains about one third of the original data. We take this into account when comparing our results to the original ones. The information is volatile, especially the fake parts, since Facebook actively removes unwanted information from its site [2].

The data gathered contains the posts from the different scientific and non-scientific sources, together with the likes on those posts, including likes in comments. The likes were concatenated onto the post ID instead of being stored per individual comment. The posts were sorted by the community they belonged to, so that a hierarchy source → post → likes was generated. The identifier for a source is a string of numbers, and each post has the ID sourceID_postID. Beyond that, the IDs of the users are the only information stored per post; no other information about the users was used. The data was processed to find the likes from each unique user, but also to find the co-occurrence of users in the same posts. The datasets were gathered using the Facebook Graph API [3].
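To illustrate the collection step, the sketch below shows how such a crawl could look in Python. It is not the pipeline used for this paper: the access token and page IDs are placeholders, and pagination details may differ between Graph API versions.

```python
# Minimal sketch of the crawl described above. The access token is a
# placeholder; rate limiting, retries and error handling are omitted.
import requests

GRAPH = "https://graph.facebook.com/v2.10"
TOKEN = "<access-token>"  # placeholder

def get_all(url, params):
    """Follow the Graph API's cursor-based pagination, yielding records."""
    params = dict(params, access_token=TOKEN)
    while url:
        page = requests.get(url, params=params).json()
        yield from page.get("data", [])
        url = page.get("paging", {}).get("next")
        params = {}  # the 'next' URL already carries all parameters

def crawl_page(page_id, since="2016-07-01", until="2016-12-31"):
    """Return {post_id: set(user_ids)} for one source page.

    Post IDs come back as sourceID_postID, which matches the
    source -> post -> likes hierarchy described above."""
    likes_per_post = {}
    posts = get_all(GRAPH + "/" + page_id + "/posts",
                    {"since": since, "until": until, "fields": "id"})
    for post in posts:
        likers = get_all(GRAPH + "/" + post["id"] + "/likes", {})
        likes_per_post[post["id"]] = {u["id"] for u in likers}
    return likes_per_post
```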
2.1 Original dataset

The original dataset consisted of 15,500 posts and 909,236 users, while the one we were able to regenerate consists of 4,286 posts with a total of 158,789 users. This dataset is a combination of scientific and non-scientific pages. The non-scientific pages are known to publish or embrace fake information, whereas the scientific ones are known to publish only truthful information. This gives a two-way differentiation, with two major groups containing the extremes, which helps us differentiate news stories.

2.2 New dataset

In addition to this dataset, we have gathered our own, both to test the same methods as in [1] on a different dataset and to check whether locale, location or topic have an impact on the results. Locale is the geographical and social affiliation of the users. The second dataset is divided in the same way as the first and is composed of a combination of sources from [4] and [5]. The two sources were needed to get a dataset of similar size and complexity. Not all of the sources had a Facebook page, so some of them are not part of the dataset. The complete list of sources used in both datasets can be found in Table 1.

The new dataset consists of 5,943 posts, over 9.5 million likes and 5.6 million unique users. The new dataset thus has fewer posts, but more users and likes. This is because its sources are mostly big English or international mainstream sites, especially the scientific ones, which have much greater coverage than the mostly local Italian sites used in [1], and a bigger spread in locale. This was done to check whether a more densely populated dataset with more low-quality users would perform as well as the geographically restricted dataset of [1].

Table 1. Sources used for the datasets.

| Original dataset: Scientific | Original dataset: Non-scientific | New dataset: Scientific | New dataset: Non-scientific |
|---|---|---|---|
| Scientificast | Scienza di Confine | The Wall Street Journal | Before it's News |
| Cicap.org | CSSC - Cieli Senza Scie Chimiche | The Economist | InfoWars |
| Oggiscienza.it | STOP ALLE SCIE CHIMICHE | BBC News | Real News. Right Now. |
| Queryonline | vaccinibasta | NPR | American Flavor |
| Gravitazeroeu | Tanker Enemy | ABC News | World Politics Now |
| COELUM Astronomia | Scie Chimiche | CBS | We Conservative |
| MedBunker | MES Dittatore Europeo | USA Today | Washington Feed |
| In Difesa della Sperimentazione Animale | Lo sai | The Guardian | American People Network |
| Italia Unita per la Scienza | AmbienteBio | NBC | Uspoln |
| La scienza come non l'avete mai vista | Eco(R)esistenza | The Washington Post | US INFO News |
| Liberascienza | Curarsialnaturale | | Clash Daily |
| Scienze Naturali | La Resistenza | | |
| Perché vaccino | Radical Bio | | |
| Le Scienze | Fuori da Matrix | | |
| Vera scienza | Graviola Italia | | |
| Scienza in rete | Signoraggio.it | | |
| Galileo, giornale di scienza e problemi globali | Informare Per Resistere | | |
| Scie Chimiche: Informazione Corretta | Sul Nuovo Ordine Mondiale | | |
| Complottismo? No grazie | Avvistamenti e Contatti | | |
| Scienza Live | Umani in Divenire | | |

2.3 Methodology

The methods used are based on two different algorithms, Logistic Regression (LR) and Harmonic Boolean Label Crowdsourcing (HBLC). LR is a simpler algorithm than HBLC and does not transfer information between users and posts, whereas HBLC does.

LR considers a set of posts I and users U, where each post i has a set of binary features x_iu, with x_iu = 1 if user u liked post i and 0 otherwise. The posts are classified based on the users who liked them. The classification is done with an LR model that assigns each user a weight; the summed weight of the users who liked a post indicates whether it is a hoax or not. The higher the sum, the more likely the post is a hoax.
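As an illustration, the following sketch shows how such a classifier could be set up with scikit-learn. It is a reconstruction from the description above, not the authors' implementation; the data structures follow the {post: likers} mapping produced during collection.

```python
# Sketch of the logistic-regression classifier over user-like features,
# reconstructed from the description above (not the authors' code).
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.linear_model import LogisticRegression

def build_matrix(likes_per_post, post_ids, user_index):
    """Binary posts-by-users matrix: X[i, u] = 1 iff user u liked post i."""
    X = lil_matrix((len(post_ids), len(user_index)), dtype=np.int8)
    for i, pid in enumerate(post_ids):
        for uid in likes_per_post[pid]:
            X[i, user_index[uid]] = 1
    return X.tocsr()

def classify(X_train, y_train, X_test):
    """Fit one weight per user; y = 1 for posts from non-scientific pages.

    A post's decision value is the sum of the weights of the users who
    liked it, so a high sum marks the post as a likely hoax."""
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    return clf.predict(X_test)
```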
HBLC is based on Boolean labels, here True or False. A label is True if the user liked the post, i.e. gave it confidence. The dataset is represented as a bipartite graph consisting of the users and the posts, with the likes as edges. The harmonic algorithm maintains two beta distributions per user, representing the number of times that user has been seen on hoax and on non-hoax posts, respectively. HBLC calculates the quality of a post from these distributions over all the users who have interacted with it; if the quality is negative the post is considered a hoax, and a non-hoax otherwise. Because of the iterative nature of the harmonic algorithm, it can propagate information: a hoax-prone user gets an increased value in their hoax beta distribution, which is reflected in the post beliefs and consequently influences the inferred preferences of other, similar users. A more detailed description of both LR and HBLC can be found in [1].
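To make the propagation idea concrete, the sketch below alternates between updating user evidence and post quality. It is a simplified illustration of the mechanism only, not the harmonic update rules of [1]; the pseudo-count updates and the reliability formula are our own approximations.

```python
# Simplified illustration of HBLC-style propagation on the bipartite
# user-post graph. This is NOT the exact harmonic algorithm of [1];
# the update rules here are a plain pseudo-count approximation.

def hblc_sketch(likes_per_post, train_labels, n_iters=20):
    """likes_per_post: {post_id: set(user_ids)};
    train_labels: {post_id: +1 (non-hoax) or -1 (hoax)} for known pages.
    Returns {post_id: quality}; quality < 0 is read as hoax."""
    users = {u for likers in likes_per_post.values() for u in likers}
    quality = {p: float(train_labels.get(p, 0.0)) for p in likes_per_post}
    for _ in range(n_iters):
        # User step: Beta(1, 1) priors; each like adds the current belief
        # about the post to the user's non-hoax (a) or hoax (b) evidence.
        a = {u: 1.0 for u in users}
        b = {u: 1.0 for u in users}
        for p, likers in likes_per_post.items():
            for u in likers:
                if quality[p] >= 0:
                    a[u] += quality[p]
                else:
                    b[u] -= quality[p]
        # Post step: quality is the mean normalized evidence gap of the
        # post's likers; training labels stay fixed so they can propagate.
        for p, likers in likes_per_post.items():
            if p in train_labels:
                continue
            gaps = [(a[u] - b[u]) / (a[u] + b[u]) for u in likers]
            quality[p] = sum(gaps) / len(gaps) if gaps else 0.0
    return quality
```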
3 Preliminary results

We have been able to recreate the results of [1] using our own version of their dataset, with similar outcomes, thereby confirming their findings. These results are discussed in detail in section 3.1. Since we were not able to fully recreate the dataset from [1], the results cannot be compared directly. Instead, we use them to test the boundaries of viability for the different algorithms, and thereby get an indication of how much data is needed for adequate results.

3.1 Dataset results

3.1.1 Original dataset

The smaller size of the dataset we gathered does not change the results very much, but the smaller the dataset, the more each piece of data affects the total score; the standard deviation therefore increases, and the robustness of the results falls.

In addition to these tests, we have done some work on testing other algorithms and how they react to this kind of network data. Work remains to find the best parameters for different techniques on this kind of problem, since the data are non-textual and differ from what these methods are normally applied to, and to determine whether they are applicable at all.

For the original dataset, the differences when using logistic regression (LR) on the two versions of the dataset are minor. This is a good indication that LR is a robust algorithm for this kind of data: it performs similarly and predictably on much lower volumes of data. The standard deviation increases, but that is to be expected, as individual posts have a bigger impact in a smaller dataset. Harmonic Boolean label crowdsourcing (HBLC), on the other hand, seems more volatile as the size of the dataset decreases. This may indicate that HBLC needs bigger datasets to perform as well as it did in [1].

3.1.2 New dataset

On the new dataset, the results are similar to those on the original dataset, which is a good indication that the algorithms can handle data from different sources. For LR the results are almost identical to the original dataset, indicating that LR is a robust and reliable algorithm. Since the sources were not checked for structural similarities before being collected, this suggests that as long as the input data can be divided into scientific and non-scientific groups, LR can be used with good results.

HBLC performs better than LR overall, but seems more prone to fluctuations when working with smaller datasets. On larger datasets, however, HBLC can predict with very high accuracy whether a post is truthful or not. Still, HBLC does not produce results on our dataset as good as on the one originally used in [1].

Table 2. New dataset, algorithm results.

|  | One-page-out: Avg. accuracy | One-page-out: Stdev. | Half-pages-out: Avg. accuracy | Half-pages-out: Stdev. |
|---|---|---|---|---|
| Logistic Regression | 0.772 | 0.288 | 0.683 | 0.121 |
| Harmonic BLC | 0.939 | 0.234 | 0.906 | 0.102 |

Table 3. Original dataset results compared to ours from the same sources (original / ours).

|  | One-page-out: Avg. accuracy | One-page-out: Stdev. | Half-pages-out: Avg. accuracy | Half-pages-out: Stdev. |
|---|---|---|---|---|
| Logistic Regression | 0.794 / 0.732 | 0.303 / 0.363 | 0.716 / 0.745 | 0.143 / 0.093 |
| Harmonic BLC | 0.992 / 0.978 | 0.023 / 0.075 | 0.993 / 0.955 | 0.002 / 0.062 |

4 Further work

Going forward, we would like to improve on these results. This can be done in several ways, and we will concentrate on a few of them. First and foremost, we want to examine how further preprocessing of the data changes the results. The datasets have a clear majority of users with only a few likes, or just a single like, and these users contribute little to the result since they have few connections to the rest of the data. Removing them, or reducing their impact in some way, will most likely improve the results; a sketch of such pruning is shown below.
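A minimal sketch of this preprocessing step, using the {post: likers} mapping from the collection step. The threshold of two likes is an arbitrary example, not a tuned parameter; the second helper computes the intersection of users across the scientific divide that is discussed further down.

```python
# Sketch of the pruning step described above, plus a helper for the
# user intersection discussed below. The min_likes threshold is an
# arbitrary example value, not a tuned parameter.
from collections import Counter

def prune_rare_users(likes_per_post, min_likes=2):
    """Drop users whose total number of likes falls below min_likes."""
    counts = Counter(u for likers in likes_per_post.values() for u in likers)
    return {p: {u for u in likers if counts[u] >= min_likes}
            for p, likers in likes_per_post.items()}

def intersection_users(likes_per_post, is_scientific):
    """Users who liked posts on both sides of the scientific divide;
    is_scientific maps post_id -> bool. On the new dataset this set
    contains only about 14,000 of the 5.6 million users."""
    sci = {u for p, likers in likes_per_post.items()
           if is_scientific[p] for u in likers}
    non = {u for p, likers in likes_per_post.items()
           if not is_scientific[p] for u in likers}
    return sci & non
```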
In addition, when using some of the more well-known sites as sources, such as The Wall Street Journal and BBC News, the number of users and the amount of data increase rapidly, and the runtime increases even faster. Because of this, a few different approaches can be used. If the system is going to be used in a time-sensitive fashion, applying a best-effort algorithm like simulated annealing might help. Such algorithms give the best possible solution within a given timeframe, and come closer to the optimal solution the more time they are given. Another way to decrease the complexity is to cluster the users. By clustering users by either closeness to each other or importance, the number of operations is drastically reduced, though some information is lost with the loss of granularity.

Since the number of usable users is so sparse when dealing with the mainstream sites, the intersection dataset becomes very small compared to the total size. For example, out of over 5.6 million users, only 14 thousand have liked posts from both scientific and non-scientific sources. This might be due to the choice of fake news sites, but it also indicates that a certain size is needed for a site to be viable. To use these algorithms successfully in an industrial setting, we need to be able to extrapolate the value each user has, or else the intersection dataset will be too small for reliable results. Because of this, we want to apply a web-of-trust, similar to what was done in [7], on top of the existing results, and use it as an early classifier based only on the users. The web will consist of users and the weighted edges between them. We can then use these weights, based on which nodes are already contained in the different posts, and extrapolate with social techniques such as nearest neighbor or clustering to get an indication of what these users prefer. This score can then be used in addition to the one from the algorithms, and hopefully give a better indication of whether a post is fake or not.

5 Conclusion

We have shown that logistic regression and harmonic Boolean label crowdsourcing are both viable algorithms on datasets that differ from the original ones published in [1]. On datasets with a smaller intersection between the users, both algorithms perform worse, but we hope to remedy this later through further preprocessing of the data. The algorithms show robustness across different datasets: one where the number of users compared to pages is small, and another with more users on a smaller number of pages. The approach proposed here does not consider what kind of fake or truthful information is shown, such as whether the fakes are serious fabrications, large-scale hoaxes or humorous fakes, as discussed in [6].

References

1. Tacchini, E., Ballarin, G., Della Vedova, M. L., Moret, S., de Alfaro, L.: Some Like it Hoax: Automated Fake News Detection in Social Networks. Technical Report UCSC-SOE-17-05, School of Engineering, University of California, Santa Cruz (2017).
2. CNET article, Mark Zuckerberg on fake news, https://www.cnet.com/news/facebook-fake-news-mark-zuckerberg/, last accessed 2017/11/6.
3. The Facebook Graph API, https://developers.facebook.com/docs/graph-api/, last accessed 2017/11/2.
4. Buzzfeed Political News Data repository, https://github.com/rpitrust/fakenewsdata1, last accessed 2017/10/28.
5. PolitiFact's guide to fake news websites, http://www.politifact.com/punditfact/article/2017/apr/20/politifacts-guide-fake-news-websites-and-what-they/, last accessed 2017/10/28.
6. Rubin, V. L., Chen, Y., Conroy, N. J.: Deception Detection for News: Three Types of Fakes. University of Western Ontario, London, Ontario (2015).
7. Tavakolifard, M., Almeroth, K. C., Gulla, J. A.: Does Social Contact Matter? Modelling the Hidden Web of Trust Underlying Twitter. In: WWW '13 Proceedings of the 22nd International Conference on World Wide Web, pp. 981-988 (2013).