<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Automatic Framework to Continuously Monitor Multi-Platform Information Spread</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhouhan Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin Aslett</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jen Rosiere Reynolds</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juliana Freire</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonathan Nagler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joshua A. Tucker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Bonneau</string-name>
          <email>bonneaug@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>New York University</institution>
          ,
          <addr-line>New York NY 10003</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Identifying and tracking the proliferation of misinformation, or fake news, poses unique challenges to academic researchers and online social networking platforms. Fake news increasingly traverses multiple platforms, posted on one platform and then re-shared on another, making it difficult to manually track the spread of individual messages. Also, the prevalence of fake news cannot be measured by a single indicator, but requires an ensemble of metrics that quantify information spread along multiple dimensions. To address these issues, we propose a framework, called Information Tracer, that can (1) track the spread of news URLs over multiple platforms, (2) generate customizable metrics, and (3) enable investigators to compare, calibrate, and identify possible fake news stories. We implement a system that tracks URLs over Twitter, Facebook, and Reddit, and operationalize three impact indicators (Total Interaction, Breakout Scale, and Coefficient of Traffic Manipulation) to quantify news spread patterns. Using a collection of human-verified false URLs, we show that URLs from different origins have different propensities to spread to multiple platforms and cover different topics, while exhibiting similar retweet patterns. We also demonstrate how our system can discover URLs whose spread patterns deviate from the norm, and how it can be used to coordinate human fact-checking of news domains. Our framework provides a readily usable solution for researchers to trace information across multiple platforms, to experiment with new indicators, and to discover low-quality news URLs in near real-time.</p>
      </abstract>
      <kwd-group>
        <kwd>misinformation</kwd>
        <kwd>cross platform</kwd>
        <kwd>fake news</kwd>
        <kwd>human-computer interaction</kwd>
        <kwd>information flow</kwd>
        <kwd>anomaly detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The COVID-19 pandemic has increased the consumption of news via social
media. For example, a recent global survey [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] found that, since the beginning of
COVID-19, 43% of consumers increased time spent on YouTube, 40% on
Facebook, and 23% on Twitter. As people spend more time consuming news from
online platforms, the volume of online misinformation has also increased,
resulting in the World Health Organization declaring an Infodemic [23]. To mitigate
misinformation and promote high-quality content, it is important to first
understand where information originates and how it spreads. Two major
technical challenges remain. First, information is often posted on one platform and
shared on another, but recent work on cross-platform news spread focuses on
single events, and is ad hoc and not scalable [
        <xref ref-type="bibr" rid="ref7">21,7</xref>
        ]. Second, there is no unified
approach to measure and quantify information spread. Different measurements
result in different estimates of misinformation prevalence. For example, [22]
points out that, depending on the chosen datasets and metrics, the amount of
misinformation on Twitter can range between 1% and 70%. Measuring the
prevalence of fake news with a single indicator is inadequate.
      </p>
      <p>In this paper, we propose a framework called Information Tracer that
contributes three major improvements over previous work. First, we define a unified
data collection pipeline to trace and visualize data from multiple platforms.
Second, we support a multi-pronged approach that uses multiple indicators to
measure information spread. Third, we provide a user interface that enables
researchers to comparatively identify URLs with unusual metrics, and that
facilitates fact-checking by contextualizing URL spread across multiple platforms.
We implement Information Tracer to track URLs over three platforms
(Twitter, Facebook, and Reddit), the most popular mobile social networking
platforms in the United States as of September 2019 [19]. To quantify information
spread, we operationalize three impact indicators: Total Interaction, Breakout
Scale, and Coefficient of Traffic Manipulation. Finally, we create a web interface
(https://informationtracer.com/) to visualize both raw data and aggregated
statistics.</p>
      <p>We also present three real-world applications to demonstrate the
capability of Information Tracer. In Application One, we investigate three main
questions using a collection of fake news URLs from four origins (Twitter, Facebook,
YouTube, News outlets):
1. Do URLs from different origins have different likelihoods of spreading across
multiple platforms?
2. Do they have different Twitter retweet traffic patterns?
3. Do they cover different topics?
We find that URLs from Facebook are less likely to spread over multiple
platforms; URLs from different platforms cover different false stories; and there is
no significant difference in retweet patterns.</p>
      <p>Applications Two (A2) and Three (A3) add human oversight and
interaction, a so-called human-in-the-loop capability, to our framework. In A2, we
demonstrate how Information Tracer can help humans identify URLs whose impact
indicators deviate from the sample average. In A3, we instruct human coders to
fact-check the quality of news domains with the help of Information Tracer. We
show that our system can potentially reduce the time it takes to discover
previously unknown low-quality news sites.</p>
      <p>The paper is organized as follows: Section 2 details each component of
the Information Tracer system. Section 3 applies our framework in three
real-world settings. Section 4 discusses the limitations of our research. We examine
related work in Section 5, and conclude the paper in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Information Tracer System</title>
      <p>At a high level, Information Tracer consists of three components: a data
collector, a data aggregator, and a data visualizer. These modules collect data,
generate summary statistics, and enable data visualization, respectively. Figure 1
shows the system architecture. In this section, we detail how we implement each
component.</p>
      <p>Although we implement our framework with a particular set of configurations
that help us answer our research questions, the proposed framework is
customizable, and users can design their own metrics to better answer other questions.
A metric can be a simple count or a numerical output from a machine learning
model. Our framework is also extendable: users can integrate additional sources
(social media platforms, weblogs, messaging software) into the system without
altering the overall data pipeline.</p>
      <p>
The goal of the data collector is to parse queries submitted by end users, then
collect posts that match those queries from a list of platforms. For the scope of
this paper, we restrict the query to a valid URL, and we consider three
platforms: Twitter, Facebook, and Reddit. We focus on URLs because a URL has a
well-defined structure, is indexed by all three platforms, and serves as a unique
identifier of a news story.</p>
      <p>URL sanitization and normalization. Before we make API calls to each
platform, we sanitize and normalize the input URL to maximize the number of
matched posts on our three platforms. We sanitize a URL in the following ways:
- Remove the prefixes http://, https://, and www.. For example, the query
http://www.abc.com/xyz becomes abc.com/xyz. This ensures that we match all
posts that refer to the article abc.com/xyz.
- Remove query parameters. A query parameter is a substring that follows a "?".
Query parameters are usually appended to the end of a URL for tracking purposes.
We strip them to normalize the input URL, with a few exceptions. For
example, a standard YouTube video URL looks like youtube.com/watch?v=VideoID,
in which the "?" is essential and cannot be removed. We maintain an allowlist
of such domains.</p>
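The two sanitization rules above can be sketched as follows (a minimal Python illustration; the function name and allowlist contents are ours, and the real system's allowlist is presumably larger):

```python
import re

# Domains whose query string is meaningful and must be kept.
# (Illustrative allowlist; the system's actual list is not published.)
QUERY_ALLOWLIST = {"youtube.com"}

def sanitize_url(url: str) -> str:
    """Sanitize a URL as described above: strip the scheme and www.
    prefix and, unless the domain is allowlisted, drop query parameters."""
    url = re.sub(r"^https?://", "", url)   # remove http:// or https://
    url = re.sub(r"^www\.", "", url)       # remove a leading www.
    domain = url.split("/", 1)[0]
    if domain not in QUERY_ALLOWLIST:
        url = url.split("?", 1)[0]         # strip everything after "?"
    return url
```

For example, `sanitize_url("http://www.abc.com/xyz?utm_source=x")` yields `abc.com/xyz`, while a YouTube watch URL keeps its `?v=` parameter.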
      <p>
        Twitter Collection. Our Twitter search is powered by the Twitter Academic Track
API1. This API provides us with access to Twitter's full-archive tweet corpus.
As of February 17, 2021, the API imposes a cap of 10,000,000 tweets per month.
Due to this rate limit, we have to be judicious about how we collect tweets. Our
strategy is to collect influential tweets that receive a high level of interactions,
such as retweets and replies, and to avoid collecting tweets with low interaction
(tweets along the "long tail"). This intuition comes from a previous study on
Twitter user characterization, which finds that a small number of influential users
control most of conversation diffusion [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Because the definition of "influential"
is subjective, we introduce five tunable parameters that can be specified by
users during query submission: minimum number of retweets (min_retweets),
minimum number of replies (min_replies), maximum number of original tweets
(max_originals), maximum number of retweets (max_retweets), and maximum
number of replies (max_replies). The following is our data collection protocol:
1. Given a URL=q, min_retweets=x, min_replies=y, we construct a special URL:
https://twitter.com/search?q=min_retweets:x%20min_replies:y%20url:
q&amp;f=live. This URL returns matched original tweets with at least x
retweets and y replies. We use a Python Selenium headless browser to
automatically visit this URL, scroll down the page, and extract up to max_originals
tweets, or stop when there are no more results. We have to use a headless browser
to automate this process because the two search parameters (min_retweets,
min_replies) are not available via the API.
2. Then, for each original tweet with id=TweetID, we use the full-archive search
endpoint2 to collect retweets and replies. We set query=status/TweetID to
retrieve all quoted tweets and retweets of quoted tweets. We set
query=conversation_id:TweetID to match all replies of the original tweet. We
collect up to max_retweets and max_replies results.
1 https://developer.twitter.com/en/solutions/academic-research
2 https://api.twitter.com/2/tweets/search/all
      </p>
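The search-URL construction in step 1 can be sketched in Python (a minimal illustration; the function name and percent-encoding choices are ours):

```python
from urllib.parse import quote

def build_search_url(url_query: str, min_retweets: int, min_replies: int) -> str:
    """Construct the Twitter live-search URL described in step 1.
    The returned page is then visited with a headless browser."""
    q = f"min_retweets:{min_retweets} min_replies:{min_replies} url:{url_query}"
    # Keep ":" and "/" literal, encode spaces as %20, and request live results.
    return "https://twitter.com/search?q=" + quote(q, safe=":/") + "&f=live"
```

For example, `build_search_url("abc.com/xyz", 10, 2)` produces the same URL shape as in the protocol above, which a Selenium driver can then open and scroll.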
      <p>Our Twitter collection module is thus customizable: by tuning each threshold,
one can collect more or fewer tweets, and adapt to different questions and API
rate limits. For example, to collect all matched retweets and replies, one can set
min_retweets=0, min_replies=0, max_originals=∞, max_replies=∞, max_retweets=∞.
In practice, we strongly recommend setting finite thresholds to avoid exhausting
API quota. These settings should be application-specific, and thus we present use cases
in Section 3.</p>
      <p>Facebook Collection. We use Crowdtangle to collect Facebook public posts
containing the input URL. Crowdtangle is a tool that collects and aggregates
engagement data from Facebook, Instagram, and Reddit posts, and it provides an API
to journalists and academic researchers. We use the Search API to collect
Facebook posts containing the input URL. The API returns up to 1,000 posts. To
collect influential posts, we use the sort parameter to retrieve posts with the
highest score. The score is a metric designed by Crowdtangle to indicate whether a
post "overperforms."3 Importantly, Crowdtangle does not index every single
Facebook page. According to Crowdtangle's documentation4, as of February 24,
2021, more than six million Facebook pages, groups, and verified profiles are
indexed. This includes "all public Facebook pages and groups with more than 100K
likes, all US-based public groups with 2k+ members, and all verified profiles,"
and therefore misses private groups and pages.</p>
      <p>Reddit Collection. Similar to the Facebook data collection, we use Crowdtangle
to collect the top 1,000 Reddit posts containing the input URL, sorted by the
"overperform" score. Crowdtangle indexes more than 20,000 of the most active
subreddits, and adds more subreddits on an ongoing basis.</p>
      <p>To summarize, due to limitations of each API endpoint, we are not able
to retrieve every post that matches a query. Specifically, private posts are
unavailable, and posts from less popular groups may not be indexed yet. We argue
that the omission of those low-interaction posts is acceptable because they do
not play a significant role in spreading information. From a resource allocation
perspective, storing only popular posts (cutting off the long tail) saves storage
space and improves data processing speed.</p>
      <sec id="sec-2-1">
        <title>Component Two: data aggregator</title>
        <p>The goal of the data aggregator is to distill intelligence from heterogeneous
cross-platform data sources. It achieves this goal by calculating summary statistics
to quantify information spread. In this paper, we refer to those statistics as
impact indicators, as they indicate the relative impact of a URL on one or more
platforms. Over the years, many indicators have been proposed and explored. In
this paper, we operationalize three indicators: Total Interaction, Breakout
Scale [12], and Coefficient of Traffic Manipulation (CTM) [11].</p>
      </sec>
      <sec id="sec-2-2">
        <title>Impact Indicators</title>
        <p>
          We choose those measurements because they are compatible with our dataset.
Specifically, Breakout Scale requires multi-platform data to measure information
spread, CTM requires retweet data to measure Twitter traffic patterns, and Total
Interaction requires the total number of interactions of every post. All three types
of data are available in our collection. We want to point out that our framework
is indicator-agnostic: the indicators we operationalize may be more helpful on
one dataset and less so on another. We now introduce each indicator in detail.
Total Interaction. Interaction count is a simple yet effective measurement to
quantify the popularity of a post. This metric has proven useful in recent
studies quantifying fake news spread during COVID-19 [
          <xref ref-type="bibr" rid="ref10 ref15 ref5">15,5,10</xref>
          ]. For each URL,
we define its total interaction as the sum of the total interactions of every
matching Twitter, Facebook, and Reddit post. We define the post-level total interaction
as5:
- Twitter post: the total number of retweets, replies, and likes.
- Facebook post: the total number of reactions, shares, and comments.
- Reddit post: the total number of upvotes and comments.
3 https://help.crowdtangle.com/en/articles/3213537-crowdtangle-codebook
4 https://help.crowdtangle.com/en/articles/1140930-what-data-is-crowdtangle-tracking
        </p>
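The post-level and URL-level definitions above can be sketched as follows (a minimal illustration; the dict field names are ours, not the actual platform API schemas):

```python
def post_interactions(post: dict) -> int:
    """Post-level total interaction, per the platform-specific definitions above."""
    if post["platform"] == "twitter":
        return post["retweets"] + post["replies"] + post["likes"]
    if post["platform"] == "facebook":
        return post["reactions"] + post["shares"] + post["comments"]
    if post["platform"] == "reddit":
        return post["upvotes"] + post["comments"]
    raise ValueError("unknown platform")

def total_interaction(posts: list[dict]) -> int:
    """URL-level total interaction: sum over every matched post."""
    return sum(post_interactions(p) for p in posts)
```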
        <p>
          Breakout Scale. The Breakout Scale was originally proposed as a comparative model
for measuring and calibrating Information Operations (IOs) based on "data that
are observable, replicable, verifiable, and available from the moment they were
posted [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]." It measures how many platforms an IO percolates to, and assigns
an IO to one of six categories, as shown in Table 1.
        </p>
        <p>We find the Breakout Scale framework appealing as it allows us to quantify
how many platforms a URL is popular on. To operationalize this framework,
we use total interaction as a proxy for popularity. Formally, for each URL u, we
denote the total number of interactions it receives on platform p as interaction_p.
We then set a threshold t: if interaction_p &gt; t, we consider u to be popular
on platform p. The final Breakout Scale for u is the total number of popular
platforms.</p>
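Under this operationalization, the Breakout Scale reduces to a threshold count (a sketch; the function name is ours, and the default t=100 matches the threshold used later in Section 3):

```python
def breakout_scale(interactions_by_platform: dict[str, int], t: int = 100) -> int:
    """Number of platforms on which the URL's total interactions exceed t."""
    return sum(1 for v in interactions_by_platform.values() if v > t)
```

For example, a URL with 544 Twitter interactions, 20 Facebook interactions, and 150 Reddit interactions gets a Breakout Scale of 2.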
        <p>
          Coefficient of Traffic Manipulation (CTM). We also compute fine-grained
indicators that quantify platform-specific patterns. Because we only have page-
and group-level statistics for Facebook and Reddit posts, we focus on
summarizing Twitter traffic here, for which we have full access. CTM is a comparative
model that allows one to compare different Twitter traffic flows "against
measurable criteria and assess which of those movements appear to have been subject
to manipulation [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]."
        </p>
        <p>Originally, CTM was a weighted average of three measurements: the average
number of tweets per user (m1), the percentage of retweets as a proportion of
total tweets (m2), and the proportion of tweets generated by the top fifty accounts
(m3). After analyzing real-world Twitter traffic containing manipulated
hashtags, the authors concluded that m1 and m3 are more informative for identifying
manipulated traffic. In our implementation, we modify and define CTM as a
tuple of two values: the average number of tweets per user, and the proportion of
tweets generated by the top 10% of accounts. We use a percentage rather than the
top fifty accounts because, in our experiments, we find tweet threads with fewer
than fifty accounts.
5 Definitions for Facebook and Reddit posts are adopted from
https://help.crowdtangle.com/en/articles/1184978-crowdtangle-glossary</p>
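Our two-value CTM can be sketched as follows (a minimal illustration over a list of tweet author ids; the function name and the rounding/tie-breaking choices are ours):

```python
from collections import Counter

def ctm(tweet_authors: list[str]) -> tuple[float, float]:
    """Two-value CTM as defined above: (average tweets per user,
    fraction of tweets produced by the top 10% most active accounts)."""
    counts = Counter(tweet_authors)          # tweets per author
    n_tweets = len(tweet_authors)
    n_users = len(counts)
    avg_tweets_per_user = n_tweets / n_users
    top_k = max(1, round(n_users * 0.10))    # at least one account
    top_tweets = sum(c for _, c in counts.most_common(top_k))
    return avg_tweets_per_user, top_tweets / n_tweets
```

For a thread where one account posts 8 of 10 tweets, this yields roughly (3.33, 0.8), i.e. a high CTM.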
        <p>We want to note here that a high CTM does not always imply traffic
manipulation. For example, a tweet thread with a high CTM could be caused by authentic
users who are engaged in the conversation and replied many times. Similarly, a
tweet corpus with a low CTM might be manipulated by a sophisticated bot
campaign in which each bot creates only one tweet, thus evading this metric. In
Section 3 we show how to use our system to discover the cause of a high CTM.</p>
        <p>Data visualization is a key element of both validating this platform and
enabling needed human interaction. Thus, we aim to facilitate the real-time exchange
of cross-platform data and intelligence. We propose and implement two main
data visualizations: a summary page and an item-wise detail page.</p>
        <p>Summary page visualization. The summary page allows investigators to
compare, calibrate, and identify data points (in our case, URLs) with unusual spread
patterns. Our summary page is available at https://informationtracer.com/
intelligence. We currently use a scatter plot to visualize all three impact
indicators. Investigators can identify an interesting quadrant, zoom in, and click
on an individual point (which represents a URL) to navigate to the detail page.</p>
        <p>Item-wise detail page visualization. The detail page allows investigators
to visualize individual posts from different platforms, and explore how posts
interact with each other along multiple dimensions, such as temporal, network,
and contextual. Figure 2 is a rendering of one detail page that contextualizes the
spread pattern of the URL www.armyfortrump.com. These visualizations provide
answers to questions such as when the URL is shared on each platform, who
posted it, and how users who share the URL interacted with each other via
retweets and replies.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Real world applications</title>
      <p>We now introduce three real-world applications (denoted A1, A2, and A3).
A1 uses Information Tracer to understand and compare how fake news URLs
from different origins spread over three platforms. We focus on four origins for
the fake news we trace: Twitter, Facebook, YouTube, and News domains. A2
and A3 incorporate human-in-the-loop intelligence. In A2, we use Information
Tracer to discover URLs with unusual impact indicators. In A3, we instruct
human coders to assess the quality of news domains using our system. In the rest
of the section, we first introduce our data sources, then explain each application.</p>
      <sec id="sec-3-1">
        <title>Overview of datasets</title>
      </sec>
      <sec id="sec-3-2">
        <title>Google Fact Check Dataset (abbr. Google FN)</title>
        <p>
          The Google Fact Check Dataset is a repository of false claims, fact-checked by journalists around the
world. The dataset has been adopted by many fact checkers,
including those verified by the International Fact Checking Network (IFCN). It also
powers fact-checking features behind Google Search, Google News, and Bing
Search [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          We collect all US-based claims from the Google Fact Check Dataset during 2020.
To do so, we first download all claims from the web portal [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. We sort all claims
by fact-checking organization, and manually check the origin of the top 30
organizations (which account for more than 90% of all claims). We identify six
organizations that operate in the United States: politifact.com, factcheck.org,
washingtonpost.com, usatoday.com, nytimes.com, and poynter.org. Then,
for each claim from each organization, we examine the API structure and
extract URLs from the field entry["itemReviewed"], which are URLs that point to the
source of the fake news. If the URL is archived, we run another script to extract the
original URL from the archived page. In the end, we extract 1,427 unique URLs.
        </p>
        <p>
          IFCN COVID-19 Fake News Dataset (abbr. IFCN FN). Our second
dataset contains 8,627 false claims compiled by fact checkers within the IFCN. The
earliest entry is from 1/5/2020, and the latest entry is from 8/26/2020. Each
entry contains a URL that points to the source of the false claim. However, there
are several special cases:
1. Shortened URLs. We write a script to resolve their final landing URLs.
2. Non-URL texts. Some URLs are plain texts such as "web page removed",
"There is no link," and "It is an e-mail". We remove those entries.
3. Duplicated URLs. We keep only the first entry.
        </p>
        <p>After the cleanup, the dataset has 4,178 unique URLs. We then use the
country column to select URLs whose column value is "United States." In the end,
our IFCN dataset has 501 URLs.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Tracing Information Cross-Platform</title>
        <p>After we compile the two datasets, we
use Information Tracer to collect posts containing those URLs from all three
platforms, as described in Section 2. For Twitter collection, we set min_retweets=10,
min_replies=2, max_originals=50, and max_replies=max_retweets=20,000. The
two minimum thresholds filter out low-information tweets, and the three
maximum thresholds prevent us from exhausting our API quota. Finally, to calculate the
Breakout Scale, we define the breakout threshold to be 100, which means a
URL is considered popular on a platform if its total number of interactions from
that platform is above 100. We experimented with different thresholds and found the
resulting trends to be consistent.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Application 1: understanding how fake news URLs spread across platforms</title>
        <p>Application 1 demonstrates the core utility of Information Tracer: its
ability to quantify information spread over multiple platforms. To facilitate
further discussion, we categorize each fake news URL into one of four origins
(Twitter, Facebook, YouTube, and News domains) and consider how URLs from each
origin are shared on three platforms: Twitter, Facebook, and Reddit. Here, Twitter
and Facebook can be both origin and destination platforms. When we say
the origin of URL A is Twitter, we simply mean A was created on Twitter (i.e.,
A is a tweet). When we say URL B breaks out on Twitter, we mean there is
a high number of tweets that contain URL B, while B can originate from any
platform. Table 2 shows the number of URLs from each origin in the IFCN and Google
datasets. Specifically, the definition of each origin is:
1. Twitter. The URL has the pattern twitter.com/username/tweetid.
2. Facebook. The URL has the pattern facebook.com/username/type/id, where
type can be posts or videos, or facebook.com/photo?fbid=id.
3. YouTube. The URL is a YouTube video. For example: youtube.com/watch?v=videoid.
4. News domain. The URL is a news article. For example: breitbart.com/article</p>
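The taxonomy above amounts to simple pattern matching on the URL, with News domain as the fallback (a simplified sketch; the actual rules check the full path patterns listed above):

```python
def url_origin(url: str) -> str:
    """Classify a fake-news URL into one of the four origins defined above."""
    if "twitter.com/" in url:
        return "Twitter"
    if "facebook.com/" in url:
        return "Facebook"
    if "youtube.com/" in url:
        return "YouTube"
    return "News domain"   # fallback: a regular news article
```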
        <p>Given this taxonomy and our multi-dimensional indicators, we investigate
three questions regarding fake news URLs from different origins. We start by
analyzing multi-platform patterns (using Breakout Scale and Total Interaction),
then compare single-platform traffic patterns (using CTM), and finally seek to
understand the content of fake news from each origin using unsupervised topic
modeling.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Q1: do URLs from different origins have different likelihoods of breaking out over multiple platforms?</title>
        <p>Using the Breakout Scale, we plot the
percentage of URLs within each origin that spread on 0, 1, 2, and 3 platforms, shown
in Figure 3. We find that fake news URLs originating from Facebook (Facebook pages
or images) are the least likely to spread over two or more platforms. Specifically,
more than 90% of URLs from Facebook do not break out on other platforms. In
contrast, 40% of URLs from Twitter, YouTube, and News domains break out on
more than one platform, and 20% of URLs from Twitter and YouTube break out
on all three major platforms. This suggests that when fake news is generated on
Facebook, it is more likely to stay within that platform. When a fake news URL
travels across platforms, it is more likely to be a tweet, a YouTube video, or a
news article.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Q2: do URLs from different origins receive different numbers of interactions and different Twitter traffic?</title>
        <p>To answer this question, we calculate
the median values of Total Interaction and Coefficient of Traffic Manipulation
(CTM), listed in Table 3. We use the median value instead of the mean because the
distribution of each indicator is skewed by extremely large values.</p>
        <p>The value of total interaction is heavily influenced by the scale and other
aspects of the underlying dataset and by API availability. For example, in the IFCN
and Google FN datasets, Facebook URLs have a median total interaction of
zero, which indicates that more than half of the Facebook URLs come from Facebook
groups with low interactions that are not indexed by the Crowdtangle API. In
addition, in the Google FN dataset, fake news URLs from Twitter have a median
total interaction of 544,444, a value far larger than the total interaction from other
origins. Upon further investigation, we find that most of the tweets in the dataset
spread political fake news and are created by high-profile accounts that receive
unusually high numbers of interactions, such as @realDonaldTrump (the suspended
account of former President Donald Trump, with 88 million followers at the time of
suspension) and @seanhannity (a TV host for Fox News, with 5.3 million followers as
of February 2021).</p>
        <p>For CTM, we do not find any difference among URLs from different origins.
Specifically, median values of average-tweets-per-user range from 1.03 to 1.08, and
median values of percent-tweets-from-top-10%-users range from 14 to 18. The
absence of a difference at the aggregate level, however, does not imply the absence
of differences at the individual level. In Section 3.3 we show how to identify
individual URLs whose indicators deviate from the norm.</p>
        <p>Fig. 3: Percentage of fake news URLs that break out on 0, 1, 2, or 3 platforms,
separated by origin. If the origin is a News domain, YouTube or Twitter, a URL
is more likely to spread over two or more platforms.</p>
      </sec>
      <sec id="sec-3-7">
        <title>Q3: do fake news URLs from different origins cover different topics?</title>
        <p>To better understand the substance of fake news, we investigate whether URLs
from different origins cover different topics. To quantify topics, we use
non-negative matrix factorization (NMF), an unsupervised clustering algorithm that
factorizes a document-word matrix into a document-topic matrix and a
word-topic matrix. Using both matrices, we can identify the top words within each topic,
and the top topics a document belongs to. Previous work has used NMF to discover
meaningful political topics from tweets censored by the Turkish government [20].</p>
        <p>
          In both the IFCN and Google datasets, there is a "claim" column that
summarizes the content of each false URL. The input to the NMF algorithm is thus
a claim-word matrix, where each row is a claim and each column is a unique
word. The cell value is the tf-idf weight of the word. We lower-case all words,
choose a dictionary size of 5,000 (that is, our matrix has at most 5,000 columns),
and remove all English stopwords. We use the Python Scikit-learn package
to calculate tf-idf and run NMF [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. We experiment with different numbers of
topics, and find that clustering the claims into 6 topics gives us meaningful
and interpretable results.
        </p>
        <p>Fig. 4: Percentage of fake news URLs that belong to each topic, separated by
origin of the URL. Each color represents a topic. The legend shows keywords
that are most likely to appear within each topic. Topics are discovered using
non-negative matrix factorization. We find that fake news originating from different
platforms covers different topics. For example, in the IFCN dataset, the topic "5G causes
coronavirus" is discussed more on YouTube than on other platforms,
percentage-wise.</p>
      </sec>
      <sec id="sec-3-8">
        <title>Application 2: investigating news stories with unusual spread patterns</title>
        <p>Fig. 5: Multi-dimensional visualization of impact indicators. Each marker
represents one URL. Its color reflects the Breakout Scale. Its text reflects the origin of
the URL. Its size is proportional to the total interaction the URL received, on a
logarithmic scale.</p>
        <p>The previous section shows how our framework can compare the news spread
patterns of various groups of URLs. Even though aggregated analysis is helpful
for revealing trends or patterns, investigators may also want to examine
individual data points. In this section we present a case study that uses
Information Tracer to understand a URL whose impact indicator deviates from the
sample mean. The URL we consider (denoted u1) is a YouTube video from our IFCN
dataset: https://youtube.com/watch?v=zFN5LUaqxOA. The video falsely claims that
coronavirus is caused by 5G. Though the video has been removed by YouTube,
tweets containing the link are still available. This URL has an
average-tweet-per-user value (part of CTM) of 2, the highest among all URLs in
the IFCN dataset. Figure 5 shows that the URL (inside the red circle) sits in
the top-right quadrant, a clear outlier.</p>
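        <p>The average-tweet-per-user signal that flags u1 is straightforward to compute from (user, tweet) records; the sketch below uses hypothetical records, not actual tweets of u1.</p>

```python
# Sketch of the average-tweet-per-user component of CTM:
# total tweets containing a URL divided by the number of distinct users.
# The sample records are hypothetical.
from collections import Counter

def avg_tweets_per_user(tweets):
    """tweets: list of (user, url) pairs collected for a single URL."""
    users = Counter(user for user, _ in tweets)
    return len(tweets) / len(users)

# one spammy account posting u1 eight times plus two ordinary sharers
u1_tweets = [("@spammer", "u1")] * 8 + [("@a", "u1"), ("@b", "u1")]
print(avg_tweets_per_user(u1_tweets))  # 10 tweets / 3 users
```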
        <p>To understand why u1 has a high CTM, we navigate to its detail page6,
study its retweet network, and find several accounts that repeatedly sent u1 to
targeted users. For example, Figure 6 shows Twitter user @erlhel sharing u1 with
verified accounts while encouraging users to watch u1. This spammy behavior
boosts the average-tweet-per-user count. Even though we cannot assess whether
account @erlhel is a human or a bot, its behavior warrants intervention such
as an account warning or account suspension.
6 The detail page is available on our web interface: https://informationtracer.com/
?url=youtube.com/watch?v=zFN5LUaqxOA. We encourage investigators to explore
the retweet and reply networks.</p>
        <p>Fig. 6: Screenshots of two reply chains. Using Information Tracer, we find that
account @erlhel replied under multiple verified accounts, encouraging users to
watch a YouTube video that falsely claims that coronavirus is caused by 5G. This
repetitive tweeting pattern results in a high coefficient of traffic manipulation
(CTM).</p>
      </sec>
      <sec id="sec-3-9">
        <title>Application 3: assessing quality of unknown news domains</title>
        <p>
          Another promising use case for Information Tracer is to facilitate human
fact-checking. To assess the utility of our framework, we recruited 30 native
English speakers from surgehq.ai, a platform that provides a high-skill
workforce. We asked the coders to assess the veracity of potentially fake news
domains. To obtain such a list of domains, we adopted and deployed a proactive
fake news discovery system developed by [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The system works in two steps. In step one, it collects live
tweets from the Twitter Streaming API based on pre-defined keywords, extracts
domains embedded in tweets, clusters domains together if they are shared by
similar users, and identifies the cluster most related to political news. In
step two, the system assigns a fakeness score to each domain based on a
pre-trained supervised classifier. Our deployed system collected tweets
containing the keyword "election" from October 29, 2020 to November 11, 2020.
We set a clustering threshold of 0.6, and selected the top 30 unlabeled domains
sorted by fakeness score. A detailed list of discovered domains is available at
https://zhouhanc.github.io/misinformation-discoverer/. For each domain, we used
Information Tracer to collect social media posts containing URLs from that
domain, and visualized the results on our web interface. For instance, Figure 2
is a screenshot of the social media presence of one discovered domain,
armyfortrump.com.
        </p>
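        <p>Step one of the discovery pipeline can be sketched roughly as grouping domains whose sharing-user sets overlap. The Jaccard measure, the toy data, and the way the 0.6 threshold is applied here are simplifying assumptions; the deployed system in [8] may differ in detail.</p>

```python
# Rough sketch of step one: pair up domains shared by similar user sets.
# Toy data and Jaccard similarity are illustrative assumptions; see the
# deployed system for the actual clustering procedure.

def jaccard(a, b):
    """Jaccard similarity of two sets of user ids."""
    return len(a & b) / len(a | b)

# domain -> set of user ids who tweeted a URL from that domain (toy data)
shared_by = {
    "domain-a.com": {"u1", "u2", "u3"},
    "domain-b.com": {"u1", "u2", "u3", "u4"},
    "news-x.org":   {"u7", "u8"},
}

THRESHOLD = 0.6  # same value as our deployed clustering threshold
domains = list(shared_by)
pairs = [
    (d1, d2)
    for i, d1 in enumerate(domains)
    for d2 in domains[i + 1:]
    if jaccard(shared_by[d1], shared_by[d2]) >= THRESHOLD
]
print(pairs)  # domain-a.com and domain-b.com are grouped together
```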
        <p>
          We then randomly assigned one domain to each coder, and asked everyone
to assess the factualness of the domain with the help of Information Tracer.
Specifically, we asked coders to look for the following signals: is the domain
shared across multiple platforms (e.g., Facebook, Twitter, Reddit)? If so, what
groups are sharing the domain on each platform? What hashtags do they use? Are
the accounts verified? To teach coders how to navigate our web interface, we
also shared with them a detailed video instruction7.
7 The 5-minute video instruction is available on Google Drive: https://drive.
google.com/file/d/1Hqaql5MHlyUKWAwmF7_uKCNyg_ed_nfB/view?usp=sharing
A comprehensive analysis of the accuracy of our deployed fake news discovery
system is beyond the scope of this paper. The result we want to highlight is
people's perceived utility of Information Tracer. According to Figure 7, when
asked "how helpful is Information Tracer," 93% of coders find it at least
somewhat helpful, and 57% find it very helpful. In addition, we asked coders
whether they had any feedback about Information Tracer. One said "I like it a
lot except the node part is really hard to understand"; another pointed out "It
was easier to look at the page/Twitter page itself, but the recent tweets on
Information Tracer gave a good idea of what the site would be." We will improve
our system based on these suggestions.</p>
        <p>Data access remains a bottleneck. Despite recent collaboration between
academia and social media platforms, getting access to more accurate metrics
remains a challenge. For example, [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] points out that social media platforms
aggregate two types of metrics: impressions and expressions. Impressions are
publicly available statistics such as the number of retweets, replies, and
likes. Expressions are more fine-grained measurements, such as "who scrolls
what tweet thread for how many seconds." Expressions could be a better proxy
for estimating the popularity of a post and for deriving the Breakout Scale.
Unfortunately, the current API does not expose expression data. We hope to
engage with platforms and deepen the current collaboration.</p>
        <p>Observational data versus experimental data. Even if we could collect all
social media posts, a gap remains: people's online actions do not necessarily
translate into real-world behavioral changes. For example, a story that receives
more interactions may or may not change more people's behavior. Measuring
behavioral change often requires controlled experiments. We plan to introduce
our framework to the broader political behavior research community. We also plan
to collect alternative data sources, such as direct web traffic logs or
responses from human subjects, to validate our observations.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>
        Information tracking tools Many open-source tracking systems have been
built over the years. For example, Hoaxy is a system that visualizes the spread
of fact-checking claims [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. FakeNewsTracker [18] is a similar framework for
collecting, analyzing, and visualizing tweets related to fake news claims. More recently, [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
built a dashboard to analyze COVID-19 misinformation, and [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] provides a more
detailed list of open-source tools that track misinformation. Current
frameworks are limited in that they (1) focus only on a single platform (usually
Twitter) and (2) do not provide sufficient metrics to assess the impact of
different news stories. Our framework aims to overcome those limitations.
      </p>
      <sec id="sec-4-1">
        <title>Cross-platform misinformation spread</title>
        <p>
          Research shows that misinformation is increasingly spread over multiple
platforms. Understanding where misinformation originates, and where it gets
amplified, can help researchers design effective mitigation strategies [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Recently, [21] analyzes the disinformation
campaign targeting the White Helmets group using Twitter and YouTube data. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
studies how different types of news spread on 4chan and Reddit. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] collected
URLs from four platforms (Facebook, Twitter, Reddit, and 4chan), quantified
information diffusion, and measured the impact of content moderation. As
suggested by [22], previous research on tracing cross-platform news spread lacks
a unified data collection pipeline and well-defined metrics. Our framework aims
to fill this gap.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
        <p>In this paper, we propose and implement Information Tracer, a framework to
track and quantify information spread across multiple platforms. We
operationalize three metrics (Total Interaction, Breakout Scale, and Coefficient
of Traffic Manipulation) and apply our framework to real-world datasets. We find
that fake news URLs with different origins have different likelihoods of
spreading over multiple platforms, with URLs originating from Facebook being the
least likely to spread to other platforms. Finally, our real-world use cases
demonstrate that Information Tracer can help investigators identify abnormal
spread patterns, facilitate fact-checking, and design better intervention
strategies.
</p>
      <p>18. Shu, K., Mahudeswaran, D., Liu, H.: FakeNewsTracker: a tool for fake news
collection, detection, and visualization. Computational and Mathematical
Organization Theory 25(1), 60-71 (2019). https://doi.org/10.1007/s10588-018-09280-3
19. Statista: Most popular mobile social networking apps in the United States
as of September 2019 (2020), https://www.statista.com/statistics/248074/
most-popular-us-social-networking-apps-ranked-by-audience/ [Online;
accessed January 8, 2021]
20. Tanash, R.S., Chen, Z., Thakur, T., Wallach, D.S., Subramanian, D.:
Known unknowns: An analysis of Twitter censorship in Turkey. In: Proceedings
of the 14th ACM Workshop on Privacy in the Electronic Society, pp. 11-20.
WPES '15, Association for Computing Machinery, New York, NY, USA (2015).
https://doi.org/10.1145/2808138.2808147
21. Wilson, T., Starbird, K.: Cross-platform disinformation campaigns: lessons learned
and next steps (2021), https://doi.org/10.37016/mr-2020-002
22. Yang, K.C., Pierri, F., Hui, P.M., Axelrod, D., Torres-Lugo, C., Bryden, J.,
Menczer, F.: The COVID-19 infodemic: Twitter versus Facebook (2020)
23. Zarocostas, J.: How to fight an infodemic. The Lancet 395(10225), 676
(2020). https://doi.org/10.1016/S0140-6736(20)30461-X</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Global Online Content Consumption Doubles in 2020. https://doubleverify.com/newsroom/global-online-content-consumption-doubles-in-2020-research-shows/ (2020), [Online; accessed February 22, 2021]</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Download Fact Check Data. https://datacommons.org/factcheck/download (2021), [Online; accessed April 11, 2021]</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Google Fact Check Tools API. https://developers.google.com/fact-check/tools/api (2021), [Online; accessed April 11, 2021]</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Scikit-learn API Reference: Non-Negative Matrix Factorization (NMF). https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html (2021), [Online; accessed April 11, 2021]</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Berriche, M., Altay, S.: Internet users engage more with phatic posts than with health misinformation on Facebook. Palgrave Communications 6(1), 71 (2020). https://doi.org/10.1057/s41599-020-0452-1</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Bild, D.R., Liu, Y., Dick, R.P., Mao, Z.M., Wallach, D.S.: Aggregate characterization of user behavior in Twitter and analysis of the retweet graph. ACM Trans. Internet Technol. 15(1) (Mar 2015). https://doi.org/10.1145/2700060</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Burton, A.G., Koehors, D.: Research note: The spread of political misinformation on online subcultural platforms (2021), https://doi.org/10.37016/mr-2020-40</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Chen, Z., Freire, J.: Proactive discovery of fake news domains from real-time social media feeds. In: Companion Proceedings of the Web Conference 2020, pp. 584-592. WWW '20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3366424.3385772</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. The RAND Corporation: Tools That Fight Disinformation Online. https://www.rand.org/research/projects/truth-decay/fighting-disinformation/search.html (2019), [Online; accessed January 13, 2021]</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Moscadelli, A., Albora, G., Biamonte, M.A., Giorgetti, D., Innocenzio, M., Paoli, S., Lorini, C., Bonanni, P., Bonaccorsi, G.: Fake news and COVID-19 in Italy: Results of a quantitative observational study. International Journal of Environmental Research and Public Health 17(16), 5850 (2020). https://doi.org/10.3390/ijerph17165850</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Nimmo, B.: Measuring traffic manipulation on Twitter (2019), https://comprop.oii.ox.ac.uk/wp-content/uploads/sites/93/2019/01/Manipulating-Twitter-Traffic.pdf [Online; accessed January 5, 2021]</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Nimmo, B.: The breakout scale: Measuring the impact of influence operations (2020), https://www.brookings.edu/wp-content/uploads/2020/09/Nimmo_influence_operations_PDF.pdf [Online; accessed January 5, 2021]</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Papakyriakopoulos, O., Serrano, J.C.M., Hegelich, S.: The spread of COVID-19 conspiracy theories on social media and the effect of content moderation (2021), https://doi.org/10.37016/mr-2020-034</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Pasquetto, I.V., Swire-Thompson, B., Amazeen, M.A., et al.: Tackling misinformation: What researchers could do with social media data (2021), https://doi.org/10.37016/mr-2020-49</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Pulido, C.M., Ruiz-Eugenio, L., Redondo-Sama, G., Villarejo-Carballido, B.: A new application of social impact in social media for overcoming fake news in health. International Journal of Environmental Research and Public Health 17(7), 2430 (2020). https://doi.org/10.3390/ijerph17072430</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Shao, C., Ciampaglia, G.L., Flammini, A., Menczer, F.: Hoaxy. In: Proceedings of the 25th International Conference Companion on World Wide Web - WWW '16 Companion (2016). https://doi.org/10.1145/2872518.2890098</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Sharma, K., Seo, S., Meng, C., Rambhatla, S., Liu, Y.: COVID-19 on social media: Analyzing misinformation in Twitter conversations (2020)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>