<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Some Like it Hoax: Automated Fake News Detection in Social Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eugenio Tacchini</string-name>
          <email>eugenio.tacchini@unicatt.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Ballarin</string-name>
          <email>gabriele.ballarin@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco L. Della Vedova</string-name>
          <email>marco.dellavedova@unicatt.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Moret</string-name>
          <email>moret.stefano@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca de Alfaro</string-name>
          <email>luca@ucsc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science</institution>
          ,
          <addr-line>UC Santa Cruz, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>École Polytechnique Fédérale de Lausanne</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Independent researcher</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università Cattolica</institution>
          ,
          <addr-line>Brescia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Università Cattolica</institution>
          ,
          <addr-line>Piacenza</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, the reliability of information on the Internet has emerged as a crucial issue of modern society. Social network sites (SNSs) have revolutionized the way in which information is spread by allowing users to freely share content. As a consequence, SNSs are also increasingly used as vectors for the diffusion of misinformation and hoaxes. The amount of disseminated information and the rapidity of its diffusion make it practically impossible to assess reliability in a timely manner, highlighting the need for automatic online hoax detection systems. As a contribution towards this objective, we show that Facebook posts can be classified with high accuracy as hoaxes or non-hoaxes on the basis of the users who "liked" them. We present two classification techniques, one based on logistic regression, the other on a novel adaptation of boolean crowdsourcing algorithms. On a dataset consisting of 15,500 Facebook posts and 909,236 users, we obtain classification accuracies exceeding 99% even when the training set contains less than 1% of the posts. We further show that our techniques are robust: they work even when we restrict our attention to the users who like both hoax and non-hoax posts. These results suggest that mapping the diffusion pattern of information can be a useful component of automatic hoax detection systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The World Wide Web (WWW) has revolutionized the way in which
information is disseminated. In particular, social network sites (SNSs) are
platforms where content can be freely shared, enabling users to actively
participate in, and possibly influence, information diffusion processes.
As a consequence, SNSs are also increasingly used as vectors for the
dissemination of spam [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], conspiracy theories and hoaxes, i.e. intentionally
crafted fake information. This recently led to the emphatic definition of
our current times as the age of misinformation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A significant share of
hoaxes on SNSs diffuses rapidly, with a peak in the first 2 hours [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This
finding, together with the large amount of shared content, highlights the
need for automatic online hoax detection systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In the literature, various approaches have been proposed for automatic
hoax detection, covering quite heterogeneous applications. Historically,
one of the first applications has been hoax detection in e-mail messages
and webpages. In the context of scam e-mail detection, SpamAssassin uses
keyword-based methods with logistic regression [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; Petkovic et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and
Ishak et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposed the use of distance-based methods; Vukovic et
al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] applied neural networks and advanced text processing; Yevseyeva
et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] used evolutionary algorithms for the development of anti-spam
filters. Sharifi et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] applied logistic regression to automatically detect
scams on webpages, reaching an accuracy of 98%.
      </p>
      <p>
        The concepts of trust and reputation [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] can be adopted for
hoax detection in applications with a dominant social component.
Metrics and algorithms for this purpose have been proposed by Golbeck and
Hendler [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Adler and de Alfaro [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] developed a content-driven user
reputation system for Wikipedia, which makes it possible to predict the quality of new
contributions. The detection of Wikipedia hoaxes has been addressed e.g.
in [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ]. More recently, automatic hoax detection in SNSs has gained
increasing interest. As an example, Chen et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] developed a
semi-supervised scam detector for Twitter based on self-learning and
clustering analysis, while Ito et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] proposed the use of Latent Dirichlet
Allocation (LDA) to assess the credibility of tweets.
      </p>
      <p>
        The key idea behind our work, which constitutes its main novelty,
is that hoaxes can be identified with great accuracy on the basis of the
users that interact with them. In particular, focusing on Facebook, we
answer the following research question: Can a hoax be identified based
on the users who "liked" it? We consider a dataset consisting of 15,500
posts and 909,236 users; the posts originate from pages that deal with
either scientific topics or with conspiracies and fake scientific news [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We
propose two classification techniques. One consists of applying logistic
regression, using the interactions of users with posts as features. The
other technique is a novel adaptation of boolean label
crowdsourcing techniques to a setting where a training set is available, but no
prior assumption on users being mostly reliable can be made.
      </p>
      <p>The proposed techniques yield an accuracy exceeding 99% even for
training sets consisting of less than 1% of the posts. These results are obtained
in spite of the fact that the communities of users participating in the
scientific and conspiracy pages overlap. Our main contributions, in
summary, are: i) the proposal of a novel way to identify hoaxes on SNSs based
on the users who interacted with them rather than on their content; ii) an
improved version of the harmonic crowdsourcing method, suited to hoax
detection in SNSs; iii) the application to Facebook and, in particular, to
a representative dataset obtained from the literature.</p>
      <p>The code we developed for this paper is available from
https://github.com/gabll/some-like-it-hoax.</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>Our dataset consists of all the public posts, and the likes on those posts, of a list of
selected Facebook pages during the second half of 2016: from Jul. 1st,
2016 to Dec. 31st, 2016. We collected the data by means of the Facebook
Graph API<sup>6</sup> on Jan. 27th, 2017.</p>
      <p>
        We based our selection of pages on [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In that work, the authors
present a list of Facebook pages divided into two categories: scientific news
sources vs. conspiracy news sources. We assume all posts from scientific
pages to be reliable, i.e. "non-hoaxes", and all posts from conspiracy pages
to be "hoaxes". Among the 73 pages listed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we limited our analysis
to the top 20 pages of each category. It is worth noting that at the time
of data collection, not all the pages were still available: some of them had
been deleted in the meantime, or were no longer publicly accessible. We
note also that the actual posts comprising our dataset are distinct from
those originally included in the dataset of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], as we performed our data
collection in a different, and more recent, period.
      </p>
      <p>The resulting dataset, the so-called complete dataset, is composed of
15,500 posts from 32 pages (14 conspiracy and 18 scientific), with more
than 2,300,000 likes by 900,000+ users (Table 1). Among posts, 8,923
(57.6%) are hoaxes and 6,577 (42.4%) non-hoaxes.</p>
      <p>As a first observation, the distribution of the number of likes per post
is exponential-like, as attested by the histograms in Fig. 1 (a); the
majority of the posts have few likes. Hoax posts have, on average, more likes
than non-hoax posts. In particular, some figures about the number of likes
per post are: average, 204.5 (hoax) vs. 84.0 (non-hoax); median,
22 (hoax) vs. 14 (non-hoax); maximum, 121,491 (hoax) vs. 13,608
(non-hoax).</p>
      <p><sup>6</sup> See https://developers.facebook.com/docs/graph-api. We used version 2.6.</p>
      <p>[Fig. 1: histograms of (a) likes per post, with separate series for hoax posts and non-hoax posts, and (b) likes per user. Horizontal axis: N. of likes.]</p>
      <p>A second observation is related to the number of likes per user: once
again, Fig. 1 (b) shows an exponential-like distribution. The majority
of the users appear in the dataset with one single like (629,146 users,
69.2%), while the maximum number of likes by a user is 1,028. Users
can be divided into three categories based on what they liked: i) those
who liked hoax posts only, ii) those who liked non-hoax posts only, and
iii) those who liked at least one post belonging to a hoax page, and
one belonging to a non-hoax page. Fig. 2 (a) shows that, despite a high
polarization, there are many users in the mixed category: among users
with at least 2 likes, 209,280 (74.7%) liked hoax posts only, 56,671 (20.3%)
liked non-hoax posts only, and 14,139 (5.0%) are in the mixed category.
This latter category gives rise to the intersection dataset, which consists
only of the users who liked both hoax and non-hoax posts, and of the
posts these users liked. The intersection dataset was introduced to study
the performance of our methods for communities of users that are not
strongly polarized towards hoax or non-hoax posts, as will be discussed
in Section 4. The composition of the intersection dataset is summarized
in Table 1.</p>
      <p>[Fig. 2: (a) users by number of likes to hoax posts (horizontal axis, 0 to 40) and number of likes to non-hoax posts (vertical axis, 0 to 40); (b) users in common between pairs of pages.]</p>
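      <p>The three user categories and the intersection dataset described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the code from the repository: the function names <monospace>split_users</monospace> and <monospace>intersection_dataset</monospace>, the representation of likes as a list of (post, user) pairs, and the toy data are all hypothetical. Unlike the percentages reported in the text, the sketch does not restrict itself to users with at least 2 likes.</p>
      <preformat>
```python
def split_users(likes, is_hoax):
    """Group users by the classes of the posts they liked.

    likes:   list of (post, user) pairs (hypothetical representation)
    is_hoax: dict post -> True for hoax posts, False for non-hoax posts
    """
    liked_classes = {}
    for post, user in likes:
        liked_classes.setdefault(user, set()).add(is_hoax[post])
    hoax_only = {u for u, c in liked_classes.items() if c == {True}}
    non_hoax_only = {u for u, c in liked_classes.items() if c == {False}}
    mixed = {u for u, c in liked_classes.items() if c == {True, False}}
    return hoax_only, non_hoax_only, mixed


def intersection_dataset(likes, is_hoax):
    """Keep only the likes cast by mixed users, and hence only the posts they liked."""
    _, _, mixed = split_users(likes, is_hoax)
    return [(post, user) for post, user in likes if user in mixed]


# Toy data, for illustration only.
is_hoax = {"h1": True, "h2": True, "s1": False}
likes = [("h1", "ann"), ("h2", "ann"), ("s1", "bob"), ("h1", "cat"), ("s1", "cat")]
hoax_only, non_hoax_only, mixed = split_users(likes, is_hoax)
subset = intersection_dataset(likes, is_hoax)
```
      </preformat>
      <p>In the toy data only one user liked posts of both classes, so the intersection dataset keeps that user's two likes and drops everything else.</p>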
      <p>A third observation concerns the relation between pages, measured
by the number of users that pages have in common: for each pair of
pages, we study how many users liked at least one post from one page
and one post from the other page. Fig. 2 (b) shows the result as a
symmetric matrix: each page vs. each other page. Color intensity shows
that hoax pages have more users in common with other hoax pages
(upper-left part, which appears darker) than with non-hoax pages (upper-right and
bottom-left). The same applies to non-hoax pages (bottom-right).
Nevertheless, the figure shows that the communities gravitating around hoax
and non-hoax pages share many common users (as is also evident from
the composition of the intersection dataset).</p>
    </sec>
    <sec id="sec-3">
      <title>Algorithmic Classification of Posts</title>
      <p>
        Our goal is to classify posts into hoax and non-hoax posts. According to
the analysis of social media sharing by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], "users tend to aggregate in
communities of interest, which causes reinforcement and fosters
confirmation bias, segregation, and polarization", and "users mostly tend to select
and share content according to a specific narrative and to ignore the rest."
This suggests that the set of users who like a post should be highly
indicative of the nature of the post. We present two approaches, one based on
logistic regression, the other based on boolean crowdsourcing algorithms.
      </p>
      <sec id="sec-3-1">
        <title>Classification via logistic regression</title>
        <p>We formulate the post classification problem as a supervised learning,
binary classification problem. We consider a set of posts I and a set of
users U. Each post i ∈ I has an associated set of features {x<sub>iu</sub> | u ∈ U},
where x<sub>iu</sub> = 1 if u liked post i, and x<sub>iu</sub> = 0 otherwise. We classify the
posts on the basis of their features, that is, on the basis of which users
liked them.</p>
        <p>To perform the classification, we use a logistic regression model. The
logistic regression model learns a weight w<sub>u</sub> for each user u ∈ U; the
probability p<sub>i</sub> that a post i is non-hoax is then given by p<sub>i</sub> = 1/(1 + e<sup>−y<sub>i</sub></sup>),
where y<sub>i</sub> = Σ<sub>u ∈ U</sub> x<sub>iu</sub> w<sub>u</sub>. Intuitively, w<sub>u</sub> &gt; 0 (resp. w<sub>u</sub> &lt; 0) indicates that
u likes mostly non-hoax (resp. hoax) posts.</p>
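        <p>As a rough illustration of this model, the following Python sketch learns one weight per user and evaluates the probability that a post is non-hoax from the users who liked it. It is a sketch under our own assumptions rather than the implementation used for the experiments: the function names, the plain gradient-ascent training loop without bias term or regularization, the learning rate, and the toy data are all hypothetical.</p>
        <preformat>
```python
import math


def predict_non_hoax_probability(likers, weights):
    """Return 1 / (1 + exp(-y)), with y the sum of the weights of the users who liked the post.

    Users that are missing from `weights` contribute 0, i.e. they carry no evidence.
    """
    y = sum(weights.get(u, 0.0) for u in likers)
    return 1.0 / (1.0 + math.exp(-y))


def train(posts, labels, epochs=200, lr=0.5):
    """Fit one weight per user by per-post gradient ascent on the log-likelihood.

    posts:  dict post -> set of users who liked it
    labels: dict post -> 1 for non-hoax, 0 for hoax
    """
    weights = {}
    for _ in range(epochs):
        for post, likers in posts.items():
            error = labels[post] - predict_non_hoax_probability(likers, weights)
            for u in likers:
                weights[u] = weights.get(u, 0.0) + lr * error
    return weights


# Toy training set, for illustration only.
posts = {"p1": {"ann", "bob"}, "p2": {"bob", "cat"}, "p3": {"dan", "eve"}, "p4": {"eve"}}
labels = {"p1": 1, "p2": 1, "p3": 0, "p4": 0}
weights = train(posts, labels)

p_science = predict_non_hoax_probability({"ann", "cat"}, weights)  # liked by non-hoax likers
p_hoax = predict_non_hoax_probability({"dan"}, weights)            # liked by a hoax liker
p_unseen = predict_non_hoax_probability({"zed"}, weights)          # liker never seen in training
```
        </preformat>
        <p>A post liked only by users absent from the training set gets probability exactly 0.5, since no weight was learned for them.</p>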
        <p>We chose logistic regression for two reasons. First, logistic regression is
well suited to problems with a very large, and uniform, set of features. In
our case, we have about a million features (users) in our dataset, but a real
application would involve up to hundreds of millions of users. Second, our
logistic regression setting enjoys a non-interference property with respect
to unrelated sets of users that facilitates learning, and is appealing on
conceptual grounds. Specifically, assume that the sets of users and posts
are partitioned into disjoint subsets U = U<sub>1</sub> ∪ U<sub>2</sub>, I = I<sub>1</sub> ∪ I<sub>2</sub>, so that
users in U<sub>k</sub> like only posts in I<sub>k</sub>, for k = 1, 2. This situation can arise, for
instance, when there are two populations of users and posts in different
languages, or simply when two topics are very unrelated. In such a setting,
it is equivalent to train a single model, or to train separately two models,
one for I<sub>1</sub>, U<sub>1</sub>, one for I<sub>2</sub>, U<sub>2</sub>, and then take their "union". This is because
the weights w<sub>u</sub> for u ∈ U<sub>3−k</sub> do not matter for classifying posts in I<sub>k</sub>,
k = 1, 2, since the features x<sub>iu</sub> with i ∈ I<sub>k</sub> and u ∈ U<sub>3−k</sub> are all zero.
In other words, models for unrelated communities do not interfere: if we
learn a model for I<sub>1</sub>, U<sub>1</sub>, we do not need to revise the model once the
community I<sub>2</sub>, U<sub>2</sub> is discovered: all we need to do is learn a model of this
second community, and use it jointly with the first.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Classification via harmonic boolean label crowdsourcing</title>
        <p>The weak aspect of logistic regression is that it does not transfer
information across users who liked some of the same posts. In particular, if
the training set does not contain any post liked by a user u, then logistic
regression will not be able to learn anything about u, and w<sub>u</sub> will be
undetermined. Thus, posts that are only liked by users not in the
training set cannot be classified. As an alternative approach, we propose to
perform the hoax/non-hoax classification using algorithms derived from
crowdsourcing, and precisely, from the boolean label crowdsourcing (BLC)
problem.</p>
        <p>
          In the BLC problem, users provide True/False labels for posts,
indicating for instance whether a post is vandalism, or whether it violates
community guidelines. The BLC problem consists in computing the
consensus labels from the user input [
          <xref ref-type="bibr" rid="ref20 ref21 ref22">20, 21, 22</xref>
          ]. We model liking a post as
voting True on that post.
        </p>
        <p>
          Our setting differs from standard BLC in one important respect.
Standard BLC algorithms do not use a learning set: rather, they assume that
people are more likely to tell the truth than to lie. The algorithms
compare what people say, correct for the effect of the liars, and reconstruct a
consensus truth [
          <xref ref-type="bibr" rid="ref20 ref22">20, 22</xref>
          ]. In our setting, we cannot assume that users are
more likely to tell the truth, that is, to preferentially like non-hoax posts.
Indeed, hoax articles may well have more "likes" than non-hoax ones.
Rather, we will rely on a learning set of posts for which the ground truth
is known.
        </p>
        <p>
          We present here an adaptation of the harmonic algorithm of [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] to
a setting with a learning set of posts. We chose the harmonic algorithm
because it is computationally efficient, can cope with large datasets, and
offers good accuracy in practice, as evidenced in [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Furthermore, while
the harmonic algorithm can be adapted to the presence of a learning set,
it is less obvious how to do so for some of the other algorithms, such as
those of [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>We represent the dataset as a bipartite graph (I ∪ U, L), where L ⊆
I × U is the set of likes. We denote by ∂i = {u | (i, u) ∈ L} and ∂u =
{i | (i, u) ∈ L} the 1-neighborhoods of a post i ∈ I and user u ∈ U,
respectively.</p>
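        <p>In code, the two 1-neighborhoods are simply two adjacency maps built from the set of likes. The sketch below is a minimal illustration; the function name <monospace>neighborhoods</monospace> and the representation of L as a list of (post, user) pairs are our own assumptions.</p>
        <preformat>
```python
from collections import defaultdict


def neighborhoods(likes):
    """Build the 1-neighborhoods of posts and users from (post, user) pairs."""
    users_of_post = defaultdict(set)  # neighborhood of a post: the users who liked it
    posts_of_user = defaultdict(set)  # neighborhood of a user: the posts the user liked
    for post, user in likes:
        users_of_post[post].add(user)
        posts_of_user[user].add(post)
    return users_of_post, posts_of_user


# Toy set of likes, for illustration only.
likes = [("p1", "ann"), ("p1", "bob"), ("p2", "bob")]
users_of_post, posts_of_user = neighborhoods(likes)
```
        </preformat>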
        <p>The harmonic algorithm maintains for each node v ∈ I ∪ U two
nonnegative parameters α<sub>v</sub>, β<sub>v</sub>. These parameters define a beta distribution:
intuitively, for a user u, α<sub>u</sub> − 1 represents the number of times we have seen
the user like a non-hoax post, and β<sub>u</sub> − 1 represents the number of times
we have seen the user like a hoax post. For a post i, α<sub>i</sub> − 1 represents the
number of non-hoax votes it has received, and β<sub>i</sub> − 1 represents the number
of hoax votes it has received. For each node v, let p<sub>v</sub> = α<sub>v</sub>/(α<sub>v</sub> + β<sub>v</sub>) be the
mean of its beta distribution: for a user u, p<sub>u</sub> is the (average) probability
that the user is truthful (likes non-hoax posts), and for a post i, p<sub>i</sub> is
the (average) probability that i is not a hoax. Letting q<sub>v</sub> = 2p<sub>v</sub> − 1 =
(α<sub>v</sub> − β<sub>v</sub>)/(α<sub>v</sub> + β<sub>v</sub>), positive values of q<sub>v</sub> indicate propensity for
non-hoax, and negative values, propensity for hoax.</p>
        <p>Let the training set consist of two subsets I<sub>H</sub>, I<sub>N</sub> ⊆ I of known hoax
and non-hoax posts. The algorithm sets q<sub>i</sub> := −1 for all i ∈ I<sub>H</sub>, and
q<sub>i</sub> := 1 for all i ∈ I<sub>N</sub>; it sets q<sub>i</sub> := 0 for all other posts i ∈ I \ (I<sub>H</sub> ∪ I<sub>N</sub>).
The algorithm then proceeds by iterative updates. First, for each user
u ∈ U, it lets:
<disp-formula id="eq1"><label>(1)</label><tex-math>\alpha_u := A + \sum_{i \in \partial u,\; q_i &gt; 0} q_i, \qquad \beta_u := B - \sum_{i \in \partial u,\; q_i &lt; 0} q_i, \qquad q_u := \frac{\alpha_u - \beta_u}{\alpha_u + \beta_u}.</tex-math></disp-formula>
The positive constants A, B determine the amount of evidence needed
to sway the algorithm towards believing that a user likes hoax or
non-hoax posts: the higher the values of A and B, the more evidence will be
required. After some experimentation, we settled on the values A = 5.01
and B = 5, corresponding to a very weak a-priori preference of users
for non-hoax posts. This corresponds to needing about 5 "likes" from
known good (resp. bad) users to reach a 2:1 probability ratio in favor of
non-hoax (resp. hoax), which seems intuitively reasonable. The algorithm
then updates the values for each post i ∈ I \ (I<sub>H</sub> ∪ I<sub>N</sub>) by:
<disp-formula id="eq2"><label>(2)</label><tex-math>\alpha_i := A' + \sum_{u \in \partial i,\; q_u &gt; 0} q_u, \qquad \beta_i := B' - \sum_{u \in \partial i,\; q_u &lt; 0} q_u, \qquad q_i := \frac{\alpha_i - \beta_i}{\alpha_i + \beta_i}.</tex-math></disp-formula>
We choose A′ = B′ = 5, thus adopting a symmetrical a-priori for items
being hoax vs. non-hoax. The updates (1) and (2) are performed iteratively;
while they could be performed until a fixpoint is reached, we just perform
them 5 times, as further updates do not yield increased accuracy. Finally,
we classify a post i as hoax if q<sub>i</sub> &lt; 0, and as non-hoax otherwise.</p>
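        <p>The iteration can be sketched in Python as follows. The update equations are damaged in the available copy of the text, so the sketch rests on an assumption that we state explicitly: each node adds the positive beliefs of its neighbors to its first parameter (prior A for users, A′ for posts) and the absolute values of the negative beliefs to its second parameter (prior B for users, B′ for posts), and then sets q to their normalized difference. The function name <monospace>harmonic_blc</monospace>, the argument names (with <monospace>A2</monospace>, <monospace>B2</monospace> standing for A′, B′), and the toy graph are hypothetical; the authors' repository may differ.</p>
        <preformat>
```python
from collections import defaultdict


def harmonic_blc(likes, hoax, non_hoax, A=5.01, B=5.0, A2=5.0, B2=5.0, iterations=5):
    """Return a belief q in [-1, 1] for every post; negative values lean towards hoax."""
    users_of_post = defaultdict(set)
    posts_of_user = defaultdict(set)
    for post, user in likes:
        users_of_post[post].add(user)
        posts_of_user[user].add(post)

    # Training posts are clamped to -1 (hoax) or +1 (non-hoax); all others start at 0.
    q_post = {i: 0.0 for i in users_of_post}
    q_post.update({i: -1.0 for i in hoax})
    q_post.update({i: 1.0 for i in non_hoax})
    q_user = {}

    for _ in range(iterations):
        # Assumed form of update (1): users collect evidence from the posts they liked.
        for u, posts in posts_of_user.items():
            alpha = A + sum(max(q_post[i], 0.0) for i in posts)
            beta = B + sum(max(-q_post[i], 0.0) for i in posts)
            q_user[u] = (alpha - beta) / (alpha + beta)
        # Assumed form of update (2): unlabeled posts collect evidence from their likers.
        for i, users in users_of_post.items():
            if i in hoax or i in non_hoax:
                continue
            alpha = A2 + sum(max(q_user[u], 0.0) for u in users)
            beta = B2 + sum(max(-q_user[u], 0.0) for u in users)
            q_post[i] = (alpha - beta) / (alpha + beta)
    return q_post


# Toy graph, for illustration only.
likes = [
    ("h1", "u1"), ("h1", "u2"),              # known hoax post
    ("n1", "u3"), ("n1", "u4"),              # known non-hoax post
    ("x", "u1"), ("x", "u2"), ("x", "u5"),   # unlabeled, shares likers with h1
    ("y", "u3"), ("y", "u4"),                # unlabeled, shares likers with n1
    ("z", "u5"),                             # linked to h1 only through x and u5
]
q = harmonic_blc(likes, hoax={"h1"}, non_hoax={"n1"})
predicted = {i: ("hoax" if 0 > qi else "non-hoax") for i, qi in q.items()}
```
        </preformat>
        <p>In the toy graph, post z shares no liker with a labeled post, yet it is classified as hoax because belief flows from h1 to its likers, to post x, to user u5, and finally to z over successive iterations.</p>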
        <p>The harmonic algorithm satisfies the non-interference property
described for logistic regression, since information is only propagated along
graph edges that correspond to "likes".</p>
        <p>The harmonic algorithm is able to propagate information from posts
where the ground truth is known, to posts that are connected by
common users. In the first iteration, the users who liked mostly hoax (resp.
non-hoax) posts will see their β (resp. α) coefficient increase, and thus
their preferences will be characterized. In the next iteration, the user
preferences will be reflected on post beliefs, and these post beliefs will
subsequently be used to infer the preferences of more users, and so on.
We will see how the ability to transfer information will allow the
harmonic algorithm to reach high levels of accuracy even starting from small
training sets.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>We characterize the performance of the logistic regression and harmonic
BLC algorithms via two sets of experiments. The first set of experiments
measures the accuracy of the algorithms as a function of the number of
posts available as training set. Since the training set can be produced,
in general, only via a laborious process of manual post inspection, these
results tell us how much we need to invest in manual labeling to reap
the benefits of automated classification. The second set of experiments
measures how much information our learning is able to transfer from one
set of pages to another. As the community of Facebook users is organized
around pages, these experiments shed light on how much of what we learn
from one community can be transferred to another, via the users
shared among communities.</p>
      <sec id="sec-4-1">
        <title>Accuracy of classification vs. training set size</title>
        <p>Cross-validation analysis. We performed a standard cross-validation
analysis of logistic regression and of the harmonic algorithm for BLC. The
cross-validation was performed by dividing the posts in the dataset into
80% training and 20% testing, and performing a 5-fold cross-validation
analysis. Both approaches performed remarkably well, with accuracies
exceeding 98.6% for logistic regression and 99.4% for the harmonic
algorithm.</p>
        <p>Accuracy vs. training set size. Cross-validation is not the most insightful
evaluation of our algorithms. In classifying news posts as hoax or
non-hoax, there is a cost involved in creating the training set, as it may be
necessary to examine each post individually. The interesting question is
not the level of accuracy we can reach when we know the ground truth
for 80% of the posts, but rather, how large a training set we need in
order to reach a certain level of accuracy. In order to be able to scale up
to the size of social network information sharing, our approaches need to
be able to produce an accurate classification relying on a small fraction
of posts of known class.</p>
        <p>To better understand this point, it helps to contrast the situation for
standard ML settings with our post-classification problem. In standard
ML settings, the set of features is chosen in advance, and the model that
is developed from the 80% of data in the training set is expected to be
useful for all future data, and not merely the 20% that constitutes the
evaluation set. Thus, cross-validation provides a measure of performance
for any future data. In contrast, in our setting the "features" consist of
the users that liked the posts. The larger the set of posts we consider, the
larger the set of users that might have interacted with them; we cannot
assume that the model developed from 80% of our data will be valid for
any set of future posts to be classified. Rather, the interesting question
is, how many posts do we need to randomly select and classify, in order
to be able to automatically classify all others?</p>
        <p>We report the classification accuracy both for the complete dataset,
and for the intersection dataset. The intersection dataset (defined in
Section 2) allows us to study the performance of our methods for communities
of users that are not strongly polarized towards hoax or non-hoax posts.</p>
        <p>In Fig. 3, we report the accuracy of our methods as a function of the
size of the training set. In the figure, the classification accuracy reported
for each training set size is the average of 50 runs. In each run, we select
randomly a subset of posts to serve as training set, and we measure the
classification accuracy on all other posts. The error bars in the figure
denote the standard deviation of the classification accuracy of each run.
Thus, the error bars provide an indication of run-to-run variability (how
much the accuracy varies with the particular training set), rather than of
the precision in measuring the average accuracy. The standard deviation
with which the average accuracy is known is about seven times smaller
(a factor of √50 ≈ 7.07 for 50 independent runs).</p>
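        <p>The "seven times smaller" figure follows from the usual relation between the standard deviation of individual runs and the standard error of their mean, which shrinks with the square root of the number of runs. The short Python sketch below illustrates the arithmetic; the function name <monospace>summarize_runs</monospace> is hypothetical and the accuracies are made-up numbers, not results from the experiments.</p>
        <preformat>
```python
import math
import statistics


def summarize_runs(accuracies):
    """Return (mean, run-to-run standard deviation, standard error of the mean)."""
    mean = statistics.mean(accuracies)
    run_sd = statistics.stdev(accuracies)       # spread between runs (the error bars)
    sem = run_sd / math.sqrt(len(accuracies))   # uncertainty of the average itself
    return mean, run_sd, sem


# Made-up accuracies for 50 runs, for illustration only.
accuracies = [0.990 + 0.001 * (k % 5) for k in range(50)]
mean, run_sd, sem = summarize_runs(accuracies)
shrink_factor = run_sd / sem  # equals sqrt(50), roughly 7.07
```
        </preformat>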
        <p>For the complete dataset, the harmonic BLC algorithm is the
superior one. As long as the training set contains at least 0.5% of the posts,
or about 80 posts, the accuracy exceeds 99.4%. For even lower training
set sizes the accuracy decreases, but it is still about 80% for a training
set consisting of 0.1% of posts, or about 15 posts. Logistic regression is
somewhat inferior, but still yields accuracy above 90% for training sets
consisting of only 1% of the posts.</p>
        <p>On the intersection dataset, on the other hand, the logistic regression
approach is the superior one. While the difference between the logistic
regression and harmonic BLC algorithms is not large, the performance of
logistic regression starts at 91.6% for a training set consisting of 10% of
posts, and degrades towards 56% for a training set consisting of 0.1% of
posts, maintaining a performance margin of 3 to 4% over harmonic BLC.</p>
        <p>Generally, these results indicate that harmonic BLC is more efficient
at transferring information across the dataset. Its inferior performance for
the intersection dataset may be explained by the fact that the artificial
construction of the intersection dataset biases towards the transfer of
erroneous information. Most users have only a few likes (see Fig. 1).
The intersection dataset filters out all users who liked only one post,
and of the users who liked two posts, the intersection dataset filters out
all those who liked two posts of the same hoax/non-hoax class. As a
consequence, the intersection dataset heavily over-samples "straddling"
users who like exactly two posts, one hoax, one not; these straddling
users constitute 32% of the users in the intersection dataset. When the
two posts liked by a straddling user belong one to the training, one to the
evaluation dataset, the straddling user contributes in the wrong direction
to the classification of the post in the evaluation set.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Cross-page learning</title>
        <p>As the community of Facebook users naturally revolves around common
interests and pages, an interesting question concerns whether what we
learn from one community of users on one page transfers to other pages.
In order to answer this question, we test our classifiers on posts related
to pages that they have not seen during the training phase. This further
allows us to assess the validity of the proposed method in real-world
situations, in which the system will need to detect fake news in new pages,
i.e. pages not belonging to the ground truth. To this end, we perform two
experiments in which the set of pages from which we learn, and those
on which we test, are disjoint. In the first experiment, one-page-out, we
select in turn each page, and we place all its posts in the testing set; the
posts belonging to all other pages are in the training set. In the second
experiment, half-pages-out, we perform 50 runs. In each run, we randomly
select a set consisting of half of the pages in the dataset, and we place
the posts belonging to those pages in the testing set, and all others in the
training set. The results are reported in Table 2.</p>
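        <p>The two page-disjoint splits can be sketched as follows. This is an illustration under our own assumptions: the function names, the mapping <monospace>page_of_post</monospace> from posts to pages, the fixed random seed, and the rounding down of "half of the pages" for an odd page count are all hypothetical choices, not details taken from the experiments.</p>
        <preformat>
```python
import random


def one_page_out(page_of_post):
    """Yield (held_out_page, train_posts, test_posts), one split per page."""
    for held_out in sorted(set(page_of_post.values())):
        test = {p for p, page in page_of_post.items() if page == held_out}
        train = set(page_of_post) - test
        yield held_out, train, test


def half_pages_out(page_of_post, runs=50, seed=0):
    """Yield (train_posts, test_posts) with a random half of the pages held out per run."""
    pages = sorted(set(page_of_post.values()))
    rng = random.Random(seed)
    for _ in range(runs):
        test_pages = set(rng.sample(pages, len(pages) // 2))
        test = {p for p, page in page_of_post.items() if page in test_pages}
        train = set(page_of_post) - test
        yield train, test


# Toy mapping from posts to pages, for illustration only.
page_of_post = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
one_out = list(one_page_out(page_of_post))
half_out = list(half_pages_out(page_of_post))
```
        </preformat>
        <p>In both cases no page contributes posts to both sides of a split, so any accuracy on the testing set has to come from users shared across pages.</p>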
        <p>The results clearly indicate that the harmonic BLC algorithm is the
superior one for transferring information across pages, achieving
essentially perfect accuracy in both one-page-out and half-pages-out
experiments. Surprisingly, for harmonic BLC, the performance is slightly
superior in the half-pages-out than in the one-page-out experiments. This
is due to the fact that for one page the performance is only 87.3%; the
performance for all other pages is always above 97.2%, and is 100% for 23
pages in the dataset. The poor performance on one particular page drags
down the average for one-page-out, compared to half-pages-out where
better-performing pages improve the average.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>The high accuracy achieved by both logistic regression and the harmonic
BLC algorithm confirms our basic hypothesis: the set of users that interact
with news posts in social network sites can be used to predict whether
posts are hoaxes.</p>
      <p>We presented two techniques for exploiting this information: one based
on logistic regression, the other on boolean label crowdsourcing (BLC).
Both algorithms provide good performance, with the harmonic BLC
algorithm providing accuracy above 99% even when trained over sets of posts
consisting of 0.5% of the full dataset (or about 80 posts). This suggests
that the algorithms can scale up to the size of entire social networks,
while requiring only a modest amount of manual classification.</p>
      <p>We also analyzed the extent to which our performance depends on
the community of users naturally aggregating around pages of similar
content. We showed that the harmonic BLC algorithm can transfer
information across pages: even when only half of the pages are represented
in the training set, the performance is above 99%. Even on the
"intersection dataset", consisting of only users who liked both hoax and non-hoax
posts, our methods achieve performance of 90%, albeit requiring for this
a training set consisting of 10% of the posts; this provides evidence that
our approach might work even when applied to communities of users that
are not strongly polarized towards scientific vs. conspiracy pages. We note
that the intersection dataset is a borderline example that does not occur
in the communities we studied. Together, these results seem to indicate
that the techniques proposed may be sufficiently robust for an extensive
application in a real-world scenario.</p>
      <p>
        Future work involves implementing the presented method
within a real-world automated online hoax detection system for Facebook.
Two steps are foreseen: i) extending the method to other language
communities besides the Italian community considered as the example
application in this work, and ii) classifying posts for the associated
extension of the ground truth. For the first point, under the
assumption that there is no substantial difference among countries and language
communities, the method can be replicated by appropriately enlarging
the ground truth to include posts (and therefore users) not related to
the Italian Facebook community. For the second point, in this work we
assumed that all posts published by conspiracy pages can be classified
as hoaxes, and that all posts published by scientific pages can be
classified as non-hoaxes. This merely practical simplification, based on the
approach and findings in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], avoided the need for a manual classification
of the individual posts. However, in a real-world application, classification
of individual posts can of course be adopted. Additionally, it would be
worthwhile to evaluate other machine learning methods besides logistic
regression and harmonic crowdsourcing.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Heymann</surname>
          </string-name>
          , G. Koutrika, and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>"Fighting Spam on Social Web Sites: A Survey of Approaches and Future Challenges"</article-title>
          .
          <source>In: IEEE Internet Computing 11.6</source>
          (
          <issue>Nov</issue>
          .
          <year>2007</year>
          ), pp.
          <fpage>36</fpage>
          –
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bessi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coletto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Davidescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scala</surname>
          </string-name>
          , G. Caldarelli, and
          <string-name>
            <given-names>W.</given-names>
            <surname>Quattrociocchi</surname>
          </string-name>
          .
          <article-title>"Science vs Conspiracy: Collective Narratives in the Age of Misinformation"</article-title>
          .
          <source>In: PLOS ONE 10.2</source>
          (
          <issue>Feb</issue>
          .
          <year>2015</year>
          ),
          <elocation-id>e0118093</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Del Vicario</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bessi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scala</surname>
          </string-name>
          , G. Caldarelli,
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Stanley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Quattrociocchi</surname>
          </string-name>
          .
          <article-title>"The Spreading of Misinformation Online"</article-title>
          .
          <source>In: Proceedings of the National Academy of Sciences 113.3</source>
          (
          <issue>Jan</issue>
          .
          <year>2016</year>
          ), pp.
          <fpage>554</fpage>
          –
          <lpage>559</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Sierra</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ribagorda</surname>
          </string-name>
          .
          <article-title>"A First Step towards Automatic Hoax Detection"</article-title>
          .
          <source>In: Proceedings. 36th Annual 2002 International Carnahan Conference on Security Technology</source>
          .
          <year>2002</year>
          , pp.
          <fpage>102</fpage>
          –
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mason</surname>
          </string-name>
          .
          <article-title>"Filtering spam with spamassassin"</article-title>
          .
          <source>In: HEANet Annual Conference</source>
          .
          <year>2002</year>
          , p.
          <fpage>103</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Petkovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kostanjcar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Pale</surname>
          </string-name>
          .
          <article-title>E-Mail System for Automatic Hoax Recognition</article-title>
          .
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ishak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-P.</given-names>
            <surname>Yong</surname>
          </string-name>
          .
          <article-title>"Distance-Based Hoax Detection System"</article-title>
          .
          <source>In: 2012 International Conference on Computer Information Science (ICCIS)</source>
          . Vol.
          <volume>1</volume>
          . June 2012, pp.
          <fpage>215</fpage>
          –
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vukovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pripuzic</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Belani</surname>
          </string-name>
          .
          <article-title>"An Intelligent Automatic Hoax Detection System"</article-title>
          .
          <source>In: Knowledge-Based and Intelligent Information and Engineering Systems</source>
          . Springer, Berlin, Heidelberg, Sept.
          <year>2009</year>
          , pp.
          <fpage>318</fpage>
          –
          <lpage>325</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Yevseyeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basto-Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ruano-Ordas</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Mendez</surname>
          </string-name>
          .
          <article-title>"Optimising Anti-Spam Filters with Evolutionary Algorithms"</article-title>
          .
          <source>In: Expert Systems with Applications</source>
          <volume>40</volume>
          .10 (
          <issue>Aug</issue>
          .
          <year>2013</year>
          ), pp.
          <fpage>4010</fpage>
          –
          <lpage>4021</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sharifi</surname>
          </string-name>
          , E. Fink, and J. G. Carbonell.
          <article-title>"Detection of Internet Scam Using Logistic Regression"</article-title>
          .
          <source>In: 2011 IEEE International Conference on Systems, Man, and Cybernetics</source>
          . Oct.
          <year>2011</year>
          , pp.
          <fpage>2168</fpage>
          –
          <lpage>2172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Mui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mohtashemi</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Halberstadt</surname>
          </string-name>
          .
          <article-title>"A Computational Model of Trust and Reputation for E-Businesses"</article-title>
          .
          <source>In: Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS'02), Volume 7. HICSS '02</source>
          . Washington, DC, USA: IEEE Computer Society,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dellarocas</surname>
          </string-name>
          .
          <article-title>"The digitization of word of mouth: Promise and challenges of online feedback mechanisms"</article-title>
          .
          <source>In: Management science 49.10</source>
          (
          <year>2003</year>
          ), pp.
          <fpage>1407</fpage>
          –
          <lpage>1424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Golbeck</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          .
          <article-title>"Accuracy of Metrics for Inferring Trust and Reputation in Semantic Web-Based Social Networks"</article-title>
          .
          <source>In: Engineering Knowledge in the Age of the Semantic Web</source>
          . Springer, Berlin, Heidelberg, Oct.
          <year>2004</year>
          , pp.
          <fpage>116</fpage>
          –
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B. T.</given-names>
            <surname>Adler</surname>
          </string-name>
          and L. de Alfaro.
          <article-title>"A Content-Driven Reputation System for the Wikipedia"</article-title>
          .
          <source>In: Proceedings of the 16th International Conference on World Wide Web. WWW '07</source>
          .
          Banff, Alberta, Canada: ACM,
          <year>2007</year>
          , pp.
          <fpage>261</fpage>
          –
          <lpage>270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Gerling</surname>
          </string-name>
          .
          <article-title>"Automatic vandalism detection in Wikipedia"</article-title>
          .
          <source>In: European Conference on Information Retrieval</source>
          . Springer.
          <year>2008</year>
          , pp.
          <fpage>663</fpage>
          –
          <lpage>668</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Adler</surname>
          </string-name>
          , L. De Alfaro,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mola-Velasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>West</surname>
          </string-name>
          .
          <article-title>"Wikipedia vandalism detection: Combining natural language, metadata, and reputation features"</article-title>
          .
          <source>In: Computational linguistics and intelligent text processing</source>
          (
          <year>2011</year>
          ), pp.
          <fpage>277</fpage>
          –
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>West</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>"Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia Hoaxes"</article-title>
          .
          <source>In: Proceedings of the 25th International Conference on World Wide Web. WWW '16</source>
          . Montreal, Quebec, Canada: International World Wide Web Conferences Steering Committee,
          <year>2016</year>
          , pp.
          <fpage>591</fpage>
          –
          <lpage>602</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chandramouli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Subbalakshmi</surname>
          </string-name>
          .
          <article-title>"Scam Detection in Twitter"</article-title>
          .
          <source>In: Data Mining for Service</source>
          . Ed. by K. Yada. Studies in Big Data 3
          . Springer Berlin Heidelberg,
          <year>2014</year>
          , pp.
          <fpage>133</fpage>
          –
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Toda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koike</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Oyama</surname>
          </string-name>
          .
          <article-title>"Assessment of Tweet Credibility with LDA Features"</article-title>
          .
          <source>In: Proceedings of the 24th International Conference on World Wide Web. WWW '15 Companion</source>
          . New York, NY, USA: ACM,
          <year>2015</year>
          , pp.
          <fpage>953</fpage>
          –
          <lpage>958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Karger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <article-title>"Iterative Learning for Reliable Crowdsourcing Systems"</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          .
          <year>2011</year>
          , pp.
          <fpage>1953</fpage>
          –
          <lpage>1961</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Ihler</surname>
          </string-name>
          .
          <article-title>"Variational Inference for Crowdsourcing"</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          .
          <year>2012</year>
          , pp.
          <fpage>692</fpage>
          –
          <lpage>700</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>L.</given-names>
            <surname>de Alfaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Polychronopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Shavlovsky</surname>
          </string-name>
          .
          <article-title>"Reliable Aggregation of Boolean Crowdsourced Tasks"</article-title>
          .
          <source>In: Third AAAI Conference on Human Computation and Crowdsourcing</source>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>