Visualising the Propagation of News on the Web

Svitlana Vakulenko*, Max Göbel†, Arno Scharl* and Lyndon Nixon*
* MODUL University Vienna
† Vienna University of Economics and Business
Vienna, Austria
{svitlana.vakulenko,arno.scharl,lyndon.nixon}@modul.ac.at, max.goebel@wu.ac.at

Abstract

When newsworthy events occur, information quickly spreads across the Web, along official news outlets as well as across social media platforms. Information diffusion models can help to uncover the path of an emerging news story across these channels, and thereby shed light on how these channels interact. The presented work enables journalists and other stakeholders to trace back the distribution process of news stories, and to identify their origin as well as the central information hubs that have amplified their dissemination.

1 Introduction

Newsworthy events are communicated via traditional news media sources such as CNN and the New York Times, as well as via social media platforms. However, the specific path a story takes across the various news distributors, and its interplay with the social network discussion, has not been well studied yet. This limits further research on rumour detection and news content verification. This paper presents an approach developed in the EU-funded Pheme project (www.pheme.eu) to track information contagions across various media sources, including major online news publishers as well as individual Twitter users.

Copyright © 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F., R. and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20 March 2016, published at http://ceur-ws.org

2 Related Work

Information diffusion is an established research field, traditionally applied to explicit networks such as social media, but less studied in communication scenarios where the information sources tend to be implicit.

One research area that links news articles to trace the origin of a piece of information is text reuse (plagiarism) detection. This approach has recently been applied to analyse information exchange networks based on historical newspaper texts [CIK14] and to study the evolution of memes [SHE+13]. In contrast to this work, our approach does not track stable phrases, but uses information pieces directly as relations.

Yang and Leskovec [YL10] model the total number of infected nodes over time as determined by the influence functions of the nodes infected in the past. They formulate this problem as an instance of Non-Negative Least Squares and use it to predict the future volume of information diffusion. Their approach differs from ours in that it does not model the implicit network to surface implicit links between the information sources.

3 Information Diffusion Model

3.1 Modeling Information Contagions

We propose a 'bag-of-relations' document representation model to capture the essential information contained in textual documents, such as news articles. The main idea behind our approach is to represent each document as a set of relations, encoded as n-gram-like similarity strings. Unlike n-grams, these strings are constructed from grammatical dependency relations instead of the sequential order of words in a sentence. We employ a dependency parser to obtain parse trees for each of the sentences and extract the relations by traversing these trees. The relations are then modeled as triples of the form:

s (subject) – p (predicate) – o (object)

We start off with the task of finding all the predicates in the sentence, which play the role of triggers for finding the corresponding relations.
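As an illustration of the dependency-based triple extraction described above, the traversal could be sketched as follows. This is a minimal sketch, not the paper's implementation: the parser output is assumed to be a list of (index, word, head_index, deprel) tuples in CoNLL style, and the set of trigger labels is a toy simplification.

```python
# Sketch: extract s-p-o relation triples from a dependency parse.
# Hypothetical input format: each token is (index, word, head_index, deprel),
# with head_index -1 marking the root (CoNLL-style parser output).

def extract_triples(tokens):
    """Find verb predicates and attach their subject/object branches."""
    triples = []
    for idx, word, head, deprel in tokens:
        # Predicates act as triggers; these labels are a toy trigger set.
        if deprel not in ('ROOT', 'xcomp', 'ccomp'):
            continue
        subj, objs = None, []
        for c_idx, c_word, c_head, c_deprel in tokens:
            if c_head != idx:
                continue  # only direct dependents of the predicate
            if c_deprel == 'nsubj':
                subj = c_word
            elif c_deprel in ('dobj', 'prep', 'pobj'):
                objs.append(c_word)
        if subj is not None:
            # one coarse relation with an empty object ...
            triples.append((subj, word, ''))
            # ... plus one fine-grained relation per object branch
            for o in objs:
                triples.append((subj, word, o))
    return triples

# "The plane landed in Panama" as a toy parse
parse = [
    (0, 'plane', 1, 'nsubj'),
    (1, 'landed', -1, 'ROOT'),
    (2, 'Panama', 1, 'prep'),
]
print(extract_triples(parse))
# -> [('plane', 'landed', ''), ('plane', 'landed', 'Panama')]
```

In practice the triggers and branch labels would come from the parser's actual tag set rather than this hard-coded toy set.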
We normalize the predicates to the form '{synset (or lemma)} + {flags}' by detecting, for each verb, the corresponding WordNet synset (or taking the verb's lemma otherwise), as well as its tense, voice, negation and auxiliary verbs (e.g. 'did not say' is transformed to 'state D N').

We define a set of words to be excluded from the predicate phrase to improve the results. For example, there are trivial relations, common among all news articles, which we would like to eliminate, e.g. the ones triggered by the predicates 'print', 'post' and 'update'. Words that do not carry any semantic information of the predicate, but are used solely for grammatical purposes (e.g. 'will', 'do'), are also excluded.

We introduce special symbols to preserve the grammatical information removed in the previous step: D indicates the past tense, F the future tense, N negation, and A an auxiliary verb (e.g. 'would'). Since there are multiple ways to express negation or the past tense, this approach allows us to disambiguate and group together semantically equivalent relations.
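A minimal sketch of this normalization step follows. The toy lookup table stands in for WordNet, and only the past-tense (D) and negation (N) flags are modelled; the word lists are illustrative assumptions, not the paper's actual resources.

```python
# Sketch: normalize a verb phrase to '{synset (or lemma)} + {flags}'.
# A toy synset table stands in for WordNet; only the D (past tense)
# and N (negation) flags are handled here.

SYNSET = {'say': 'state', 'said': 'state', 'tell': 'state'}
PAST = {'said', 'told', 'did', 'was', 'were'}
STOPWORDS = {'do', 'did', 'will', 'not'}  # grammatical words to strip

def normalize_predicate(words):
    flags = []
    if any(w in PAST for w in words):
        flags.append('D')  # past tense
    if 'not' in words or "n't" in words:
        flags.append('N')  # negation
    content = [w for w in words if w not in STOPWORDS]
    head = content[-1] if content else words[-1]
    lemma = SYNSET.get(head, head)  # fall back to the word itself
    return ' '.join([lemma] + flags)

print(normalize_predicate(['did', 'not', 'say']))  # -> 'state D N'
```

The same grouping effect holds for other surface forms: 'would not state' and "didn't say" would map to the same normalized predicate once the flag detection covers auxiliaries.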
Then, for each predicate, we pick the adjacent branches with clauses that correspond to the subject and objects of the relation. We designed a simple heuristic for English-language texts: assign a node to the subject element if it precedes the predicate in the sentence, and to the object otherwise (i.e. when it follows the predicate).

We construct a separate relation for each object element related to the predicate, plus one relation with an empty object if the subject is not empty. This simple heuristic allows us to create several fine-grained relations with different levels of detail. For example, the sentence "The plane landed in Panama on Tuesday" is decomposed into: 'plane – land D', 'plane – land D – in Panama', 'plane – land D – on Tuesday'. This approach enables us to spot articles that report on the same event but provide complementing or contradicting details.

3.2 Modeling Diffusion Cascades

We assume that all articles sharing the same information contagion are related to each other, i.e. there is a path between every pair of articles within the diffusion graph. We included this assumption in our model by enforcing a connectivity requirement over the diffusion graph: for each node (except the root node), we generate an incoming edge that links the node to its source. Here, we also use the single-source assumption: for each node (except the root node), there is exactly one incoming edge linking the node to its source (the closest neighbour). This assumption allows us to simplify the model and avoid making assumptions about the similarity threshold value, i.e. how similar the articles should be in order to be linked in the diffusion model.

The diffusion process is modeled as a graph with two types of edges: (1) explicit links referencing the source URL (edge direction: from the source to the post containing the URL); (2) implicit links connecting similar posts that share the same information contagions (edge direction: from the older to the more recent post).

We link news articles to social media posts by querying the Twitter API with the URL of a news article to obtain all the tweets that reference it explicitly. News media often do not cite their information sources, apart from references to the major news agencies, e.g. Reuters. Therefore, we focus on uncovering the latent relations between the news articles, which we construct based on content similarity. We construct the diffusion graph with edges generated using the pairwise similarity values computed over the relation bags of the articles.

There are two methods to compute the similarity between a pair of news articles: (1) considering the intersection of the relation bags; (2) hashing the relation bags and computing the similarity between the relation hashes. While the first method, which returns an integer count of shared relations, is simple and intuitive, it is limited to exact matches between relations. The second method is more powerful, as it allows for approximately similar relations.

We test both of these complementary methods of computing the similarity between two relation bags to evaluate which of them performs better in practice. We use Nilsimsa hashing and Hamming distance to generate and compare the relation hashes. Nilsimsa is one of the most popular locality-sensitive hashing algorithms and is traditionally employed for spam detection in emails. The Hamming distance measures the proportion of positions in the hash at which the corresponding symbols differ.

4 Experiment

4.1 Dataset and Configuration

The dataset is based on a recent news media snapshot exported from the Pheme dashboard [SWG+16], which contains 71,000 articles published between 27 November and 3 December 2015. We ran the relation extraction procedure on this corpus and picked one of the frequent information contagions to illustrate how it can be backtracked across the online media:

s: president barack obama – p: state D – o:

This relation provided us with a cluster of 12 news articles. It captures all expressions with a predicate that belongs to the WordNet synset 'state' and is used in the past tense ('D'), such as "president barack obama said", thereby indicating statements made by President Obama. For each article we retrieved the tweets referencing its URL via the Twitter Search API, which resulted in 150 tweets (127 and 23 for two of the articles).

Figure 1: Sample information diffusion model: 'president barack obama state D'
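The cascade construction described in Section 3.2 can be sketched as follows. This is a simplified sketch under stated assumptions: the posts, timestamps and relation bags are made-up data, and similarity is the plain bag intersection (method (1) above) rather than Nilsimsa hashing.

```python
# Sketch: build a diffusion graph over posts that share information
# contagions. Implicit edges run from the older to the more recent
# post; under the single-source assumption each non-root node keeps
# only one incoming edge, from its most similar earlier neighbour.

def build_diffusion_graph(posts):
    """posts: list of (post_id, timestamp, relation_bag), sorted by time."""
    edges = []
    for i, (pid, ts, bag) in enumerate(posts):
        best = None
        for prev_id, prev_ts, prev_bag in posts[:i]:
            shared = len(bag & prev_bag)  # method (1): bag intersection
            if shared > 0 and (best is None or shared > best[1]):
                best = (prev_id, shared)
        if best is not None:
            edges.append((best[0], pid, best[1]))  # older -> newer
    return edges

# made-up posts carrying the same contagions at increasing timestamps
posts = [
    ('a1', 1, {'obama-state_D', 'plane-land_D'}),
    ('a2', 2, {'obama-state_D'}),
    ('a3', 3, {'obama-state_D', 'plane-land_D'}),
]
print(build_diffusion_graph(posts))
# -> [('a1', 'a2', 1), ('a1', 'a3', 2)]
```

The edge weights here count shared relations; swapping in hashed relation bags compared via Hamming distance would only change the `shared` computation, not the cascade structure.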
We used the networkx¹ and matplotlib² Python libraries to visualize the resulting diffusion graph (see Figure 1).

4.2 Results

The nodes of the graph in Figure 1 represent the individual posts published at discrete time intervals (red: news articles; green: tweets). The sources get infected in sequence order, aligned along the vertical time axis as indicated on the node labels. The same source may appear more than once within the same network if it has published multiple articles containing the same information contagion within the given time interval.

The edges of the graph represent direct links in the case of tweets, or content similarity in the case of articles. Content similarity values are indicated as weights on the corresponding edges; values closer to 0 indicate more similar articles. Light edges indicate that the adjacent articles share a single information contagion, while solid edges indicate that the articles have more than one information contagion in common.

5 Conclusion and Future Work

We showed how to uncover the latent relations between news articles and used them to infer a model of the implicit diffusion network, which constitutes an important step towards rumour detection research. The results of our initial experiment indicate that our relation-based modeling approach is promising and merits further research. In future work we will evaluate our approach against baseline methods.

6 Acknowledgments

The presented work was conducted within the PHEME project (www.pheme.eu), which has received funding from the European Union's Seventh Framework Programme for Research, Technological Development and Demonstration under Grant Agreement No. 611233.

References

[CIK14] Giovanni Colavizza, Mario Infelise, and Frederic Kaplan. Mapping the Early Modern News Flow: An Enquiry by Robust Text Reuse Detection. In Social Informatics, pages 244–253, 2014.

[SHE+13] Caroline Suen, Sandy Huang, Chantat Eksombatchai, Rok Sosic, and Jure Leskovec. NIFTY: A System for Large Scale Information Flow Tracking and Clustering. In 22nd International Conference on World Wide Web, pages 1237–1248, 2013.

[SWG+16] Arno Scharl, Albert Weichselbraun, Max Göbel, Walter Rafelsberger, and Ruslan Kamolov. Scalable Knowledge Extraction and Visualization for Web Intelligence. In 49th Hawaii International Conference on System Sciences, pages 3749–3757, 2016.

[YL10] Jaewon Yang and Jure Leskovec. Modeling Information Diffusion in Implicit Networks. In 10th International Conference on Data Mining, pages 599–608, 2010.

1 networkx.github.io
2 matplotlib.org