Visualising the Propagation of News on the Web

Svitlana Vakulenko*, Max Göbel†, Arno Scharl* and Lyndon Nixon*
* MODUL University Vienna
† Vienna University of Economics and Business
Vienna, Austria
{svitlana.vakulenko,arno.scharl,lyndon.nixon}@modul.ac.at, max.goebel@wu.ac.at

Abstract

When newsworthy events occur, information quickly spreads across the Web, along official news outlets as well as across social media platforms. Information diffusion models can help to uncover the path of an emerging news story across these channels, and thereby shed light on how these channels interact. The presented work enables journalists and other stakeholders to trace back the distribution process of news stories, and to identify their origin as well as the central information hubs that have amplified their dissemination.

1 Introduction

Newsworthy events are communicated via traditional news media sources such as CNN and the New York Times, as well as via social media platforms. However, the specific path a story takes across the various news distributors, and its interplay with the social network discussion, has not been well studied yet. This limits further research on rumour detection and news content verification. This paper presents an approach developed in the EU-funded Pheme project (www.pheme.eu) to track information contagions across various media sources, including major online news publishers as well as individual Twitter users.

Copyright © 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F., R. and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20 March 2016, published at http://ceur-ws.org

2 Related Work

Information diffusion is an established research field, traditionally applied to explicit networks such as social media, but less studied in communication scenarios where the information sources tend to be implicit.

One research area that links news articles to trace the origin of a piece of information is text reuse (plagiarism) detection. This approach has recently been applied to analyse information exchange networks based on historical newspaper texts [CIK14] and to study the evolution of memes [SHE+13]. In contrast to this work, our approach does not track stable phrases, but uses information pieces directly as relations.

Yang and Leskovec [YL10] model the total number of infected nodes over time as determined by the influence functions of the nodes infected in the past. They formulate this problem as an instance of Non-Negative Least Squares and use it to predict the future volume of information diffusion. Their approach differs from ours in that it does not model the implicit network to surface implicit links between the information sources.

3 Information Diffusion Model

3.1 Modeling Information Contagions

We propose a 'bag-of-relations' document representation model to capture the essential information contained in textual documents, such as news articles. The main idea behind our approach is to represent each document as a set of relations, encoded as n-gram-like similarity strings. Unlike n-grams, these strings are constructed from grammatical dependency relations instead of the sequential order of words in a sentence. We employ a dependency parser to obtain parse trees for each of the sentences and extract the relations by traversing these trees. The relations are then modeled as triples of the form:

s (subject) – p (predicate) – o (object)

We start off with the task of finding all the predicates in the sentence, which play the role of triggers for finding the corresponding relations.
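As an illustration of the dependency-based triple extraction described above, the traversal could be sketched as follows. This is a minimal sketch, not the paper's implementation: the parser output is assumed to be a list of (index, word, head_index, deprel) tuples in CoNLL style, and the set of trigger labels is a toy simplification.

```python
# Sketch: extract s-p-o relation triples from a dependency parse.
# Hypothetical input format: each token is (index, word, head_index, deprel),
# with head_index -1 marking the root (CoNLL-style parser output).

def extract_triples(tokens):
    """Find verb predicates and attach their subject/object branches."""
    triples = []
    for idx, word, head, deprel in tokens:
        # Predicates act as triggers; these labels are a toy trigger set.
        if deprel not in ('ROOT', 'xcomp', 'ccomp'):
            continue
        subj, objs = None, []
        for c_idx, c_word, c_head, c_deprel in tokens:
            if c_head != idx:
                continue  # only direct dependents of the predicate
            if c_deprel == 'nsubj':
                subj = c_word
            elif c_deprel in ('dobj', 'prep', 'pobj'):
                objs.append(c_word)
        if subj is not None:
            # one coarse relation with an empty object ...
            triples.append((subj, word, ''))
            # ... plus one fine-grained relation per object branch
            for o in objs:
                triples.append((subj, word, o))
    return triples

# "The plane landed in Panama" as a toy parse
parse = [
    (0, 'plane', 1, 'nsubj'),
    (1, 'landed', -1, 'ROOT'),
    (2, 'Panama', 1, 'prep'),
]
print(extract_triples(parse))
# -> [('plane', 'landed', ''), ('plane', 'landed', 'Panama')]
```

In practice the triggers and branch labels would come from the parser's actual tag set rather than this hard-coded toy set.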
We normalize the predicates to the form '{synset (or lemma)} + {flags}' by detecting, for each verb, the corresponding WordNet synset (or taking the verb's lemma otherwise), as well as its tense, voice, negation and auxiliary verbs (e.g. 'did not say' is transformed to 'state D N').

We define a set of words to be excluded from the predicate phrase to improve the results. For example, there are trivial relations, common among all news articles, which we would like to eliminate, e.g. the ones triggered by the predicates 'print', 'post' and 'update'. Words that do not carry any semantic information of the predicate, but are used solely for grammatical purposes (e.g. 'will', 'do'), are also excluded.

We introduce special symbols to preserve the grammatical information removed in the previous step: D indicates the past tense, F the future tense, N negation, and A an auxiliary verb (e.g. 'would'). Since there are multiple ways to express negation or the past tense, this approach allows us to disambiguate and group together semantically equivalent relations.
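A minimal sketch of this normalization step follows. The toy lookup table stands in for WordNet, and only the past-tense (D) and negation (N) flags are modelled; the word lists are illustrative assumptions, not the paper's actual resources.

```python
# Sketch: normalize a verb phrase to '{synset (or lemma)} + {flags}'.
# A toy synset table stands in for WordNet; only the D (past tense)
# and N (negation) flags are handled here.

SYNSET = {'say': 'state', 'said': 'state', 'tell': 'state'}
PAST = {'said', 'told', 'did', 'was', 'were'}
STOPWORDS = {'do', 'did', 'will', 'not'}  # grammatical words to strip

def normalize_predicate(words):
    flags = []
    if any(w in PAST for w in words):
        flags.append('D')  # past tense
    if 'not' in words or "n't" in words:
        flags.append('N')  # negation
    content = [w for w in words if w not in STOPWORDS]
    head = content[-1] if content else words[-1]
    lemma = SYNSET.get(head, head)  # fall back to the word itself
    return ' '.join([lemma] + flags)

print(normalize_predicate(['did', 'not', 'say']))  # -> 'state D N'
```

The same grouping effect holds for other surface forms: 'would not state' and "didn't say" would map to the same normalized predicate once the flag detection covers auxiliaries.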
Then, for each predicate, we pick the adjacent branches with clauses that correspond to the subject and objects of the relation. We designed a simple heuristic for English-language texts: assign a node to the subject element if it precedes the predicate in the sentence, and to the object otherwise (i.e. when it follows the predicate).

We construct a separate relation for each object element related to the predicate, plus one relation with an empty object if the subject is not empty. This simple heuristic allows us to create several fine-grained relations with different levels of detail. For example, the sentence "The plane landed in Panama on Tuesday" is decomposed into: 'plane – land D', 'plane – land D – in Panama', 'plane – land D – on Tuesday'. This approach enables us to spot articles that report on the same event but provide complementing or contradicting details.

3.2 Modeling Diffusion Cascades

We assume that all articles sharing the same information contagion are related to each other, i.e. there is a path between every pair of articles within the diffusion graph. We included this assumption in our model by enforcing a connectivity requirement over the diffusion graph: for each node (except the root node), we generate an incoming edge that links the node to its source. Here, we also use the single-source assumption: for each node (except the root node), there is exactly one incoming edge linking the node to its source (the closest neighbour). This assumption allows us to simplify the model and avoid making assumptions about the similarity threshold value, i.e. how similar the articles should be in order to be linked in the diffusion model.

The diffusion process is modeled as a graph with two types of edges: (1) explicit links referencing the source URL (edge direction: from the source to the post containing the URL); (2) implicit links connecting similar posts that share the same information contagions (edge direction: from the older to the more recent post).

We link news articles to social media posts by querying the Twitter API with the URL of a news article to obtain all the tweets that reference it explicitly. News media often do not cite their information sources, apart from references to the major news agencies, e.g. Reuters. Therefore, we focus on uncovering the latent relations between the news articles, which we construct based on content similarity. We construct the diffusion graph with edges generated using the pairwise similarity values computed over the relation bags of the articles.

There are two methods to compute the similarity between a pair of news articles: (1) considering the intersection of the relation bags; (2) hashing the relation bags and computing the similarity between the relation hashes. While the first method, which returns an integer count of shared relations, is simple and intuitive, it is limited to exact matches between relations. The second method is more powerful, as it allows for approximately similar relations.

We test both of these complementary methods of computing the similarity between two relation bags to evaluate which of them performs better in practice. We use Nilsimsa hashing and Hamming distance to generate and compare the relation hashes. Nilsimsa is one of the most popular locality-sensitive hashing algorithms and is traditionally employed for spam detection in emails. The Hamming distance measures the proportion of positions in the hash at which the corresponding symbols differ.

4 Experiment

4.1 Dataset and Configuration

The dataset is based on a recent news media snapshot exported from the Pheme dashboard [SWG+16], which contains 71,000 articles published between 27 November and 3 December 2015. We ran the relation extraction procedure on this corpus and picked one of the frequent information contagions to illustrate how it can be backtracked across the online media:

s: president barack obama – p: state D – o:

This relation provided us with a cluster of 12 news articles. It captures all expressions with a predicate that belongs to the WordNet synset 'state' and is used in the past tense ('D'), such as "president barack obama said", thereby indicating statements made by President Obama. For each article we retrieved the tweets referencing its URL via the Twitter Search API, which resulted in 150 tweets (127 and 23 for two of the articles).

Figure 1: Sample information diffusion model: 'president barack obama state D'
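The cascade construction described in Section 3.2 can be sketched as follows. This is a simplified sketch under stated assumptions: the posts, timestamps and relation bags are made-up data, and similarity is the plain bag intersection (method (1) above) rather than Nilsimsa hashing.

```python
# Sketch: build a diffusion graph over posts that share information
# contagions. Implicit edges run from the older to the more recent
# post; under the single-source assumption each non-root node keeps
# only one incoming edge, from its most similar earlier neighbour.

def build_diffusion_graph(posts):
    """posts: list of (post_id, timestamp, relation_bag), sorted by time."""
    edges = []
    for i, (pid, ts, bag) in enumerate(posts):
        best = None
        for prev_id, prev_ts, prev_bag in posts[:i]:
            shared = len(bag & prev_bag)  # method (1): bag intersection
            if shared > 0 and (best is None or shared > best[1]):
                best = (prev_id, shared)
        if best is not None:
            edges.append((best[0], pid, best[1]))  # older -> newer
    return edges

# made-up posts carrying the same contagions at increasing timestamps
posts = [
    ('a1', 1, {'obama-state_D', 'plane-land_D'}),
    ('a2', 2, {'obama-state_D'}),
    ('a3', 3, {'obama-state_D', 'plane-land_D'}),
]
print(build_diffusion_graph(posts))
# -> [('a1', 'a2', 1), ('a1', 'a3', 2)]
```

The edge weights here count shared relations; swapping in hashed relation bags compared via Hamming distance would only change the `shared` computation, not the cascade structure.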
We used the networkx¹ and matplotlib² Python libraries to visualize the resulting diffusion graph (see Figure 1).

4.2 Results

The nodes of the graph in Figure 1 represent the individual posts published at discrete time intervals (red: news articles; green: tweets). The sources get infected in sequence order, aligned along the vertical time axis as indicated on the node labels. The same source may appear more than once within the same network if it has published multiple articles containing the same information contagion within the given time interval.

The edges of the graph represent direct links in the case of tweets, or content similarity in the case of articles. Content similarity values are indicated as weights on the corresponding edges; values closer to 0 indicate more similar articles. Light edges indicate that the adjacent articles share a single information contagion, while solid edges indicate that the articles have more than one information contagion in common.

5 Conclusion and Future Work

We showed how to uncover the latent relations between news articles and used them to infer a model of the implicit diffusion network, which constitutes an important step towards rumour detection research. The results of our initial experiment indicate that our relation-based modeling approach is promising and merits further research. In future work we will evaluate our approach against baseline methods.

6 Acknowledgments

The presented work was conducted within the PHEME project (www.pheme.eu), which has received funding from the European Union's Seventh Framework Programme for Research, Technological Development and Demonstration under Grant Agreement No. 611233.

References

[CIK14] Giovanni Colavizza, Mario Infelise, and Frederic Kaplan. Mapping the Early Modern News Flow: An Enquiry by Robust Text Reuse Detection. In Social Informatics, pages 244–253, 2014.

[SHE+13] Caroline Suen, Sandy Huang, Chantat Eksombatchai, Rok Sosic, and Jure Leskovec. NIFTY: A System for Large Scale Information Flow Tracking and Clustering. In 22nd International Conference on World Wide Web, pages 1237–1248, 2013.

[SWG+16] Arno Scharl, Albert Weichselbraun, Max Göbel, Walter Rafelsberger, and Ruslan Kamolov. Scalable Knowledge Extraction and Visualization for Web Intelligence. In 49th Hawaii International Conference on System Sciences, pages 3749–3757, 2016.

[YL10] Jaewon Yang and Jure Leskovec. Modeling Information Diffusion in Implicit Networks. In 10th International Conference on Data Mining, pages 599–608, 2010.

1 networkx.github.io
2 matplotlib.org