Detecting Attention Dominating Moments Across Media Types

Igor Brigadir, Derek Greene, Pádraig Cunningham
{igor.brigadir, derek.greene, padraig.cunningham}@insight-centre.org
Insight Centre for Data Analytics
University College Dublin, Ireland

Copyright © 2016 for the individual papers by the paper’s authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR’16 Workshop at ECIR, Padua, Italy, 20-March-2016, published at http://ceur-ws.org

Abstract

In this paper we address the problem of identifying attention dominating moments in online media. We are interested in discovering moments when everyone seems to be talking about the same thing. We investigate one particular aspect of breaking news: the tendency of multiple sources to concentrate attention on a single topic, leading to a collapse in diversity of content for a period of time. In this work we show that diversity at a topic level is effective for capturing this effect in blogs, in news articles, and on Twitter. The phenomenon is present in three distinctly different media types, each with their own unique features. We describe the phenomenon using case studies relating to major news stories from September 2015.

1 Introduction

The problem of detecting breaking news events has inspired a host of approaches, extracting useful signals from activity on social networks, newswire, and other types of media. The online communication platforms that have been adopted allow these events to persist in some form. These digital traces can never fully capture the original experience, but offer us an opportunity to revisit significant phenomena with different points of view, or help us to characterise and learn something about the processes involved. Many different forms of news media attempt to record and disseminate information deemed important enough to communicate, and as the barriers to broadcasting and sharing information are removed, attention becomes a scarce commodity.

We define the problem of detecting attention dominating moments across different media types as a collapse in diversity in the content generated by a set of online sources in a topic during a given time period. Media types here include mainstream news articles, blog posts, and tweets. These media types differ in both the category of topics covered [22] and their use of language [10]. In the context of Twitter, we define sources as unique user accounts. For mainstream news and blogs, sources refer to individual publications or outlets. Publications may have different numbers of authors, but as unique author information is not available, we treat each unique blog or news outlet as a single source.

In Section 3, we describe the two stages of our proposed event detection procedure. In the first stage, content generated by the news, blog and tweet sources is grouped into broad topical categories, through the application of matrix factorization to the content generated by these sources. In the second stage, we examine the variation in similarity between content generated by sources within a given topic during a given time period, in order to identify a collapse in diversity within a topic which corresponds to an attention dominating moment. In Section 5, we evaluate this procedure on a collection of one million news articles and blog posts from September 2015, along with a parallel corpus of tweets collected during the same time period.

Rather than formulating the problem as tracking the evolution of topics themselves, we consider the diversity of content within a specific topic over time. The motivation is that, for instance, a collapse in diversity around a major sporting event will be strongly evident in certain news sources, but not evident in others. The distinction is important, as this approach is more suited to retrospective analysis, when the entire collection of documents of interest is available. The topics do not change over time, as opposed to a real-time setting where topics must be updated as new documents arrive [21]. The information need is guided by two major questions. Firstly, when have significant collapses in diversity occurred in a topic of interest? Secondly, are there differences between media types when these events occur?

Our main contributions here are: 1) a diversity-based approach to detecting attention dominating news events; 2) a comparison between traditional news sources, blogs, and Twitter during these events; 3) a parallel corpus of newsworthy tweets for the NewsIR dataset.

2 Related Work

In previous work, attention dominating news stories have been described as media explosions [2] or firestorms [14]. The idea of combining signals from multiple sources for detecting or tracking the evolution of events has proved effective in the past. Osborne et al. [16] used signals from Wikipedia page views, together with Twitter, to improve “first story detection”. Concurrent Wikipedia edits were used as a signal for breaking news detection in [19].

Topic modeling applied to parallel corpora of news and tweets has been previously explored by a number of researchers [6, 9, 11]. Extensions to LDA to account for tweet specific features have been proposed [22]. A comparison between Twitter and content from newswires was explored in [18]. A Non-negative Matrix Factorization (NMF) approach is used for topic detection in [20].

How offline phenomena link to bursty behaviour online is discussed in [5] and [12]. In [12] Shannon’s Diversity Index was used to detect a “contraction of attention” in a tweet stream by measuring diversity of hashtags. In contrast, we employ a different measure of diversity based on document similarity, applying it to streams from different media types segmented by topic. Methods for automatically detecting anomalies or significant changes in a time series are discussed in [4]. In [15] a change-point detection approach is applied to time series constructed from tweet keyword frequencies.

As a broad overview, the common components involved in detecting high impact, attention dominating news stories include: selecting relevant subsets of documents; representation and feature extraction; constructing time series from features; event detection and analysis. In this paper we concentrate on a single key feature of breaking news: a collapse in content diversity within a fixed time window.

3 Proposed Method

Our objective is to detect when multiple articles in a topical stream become less diverse, signalling the emergence of an attention dominating news story. We consider attention to a phenomenon as the main driving force behind the decision to produce or broadcast a communication. Using the diversity of content within a time window, we attempt to characterise instances where a particular piece of information becomes dominant. Concretely, for each type of media, NMF is used to assign topics to documents; for documents in a topic, we calculate diversity between documents in a time window. This type of analysis allows us to examine the extent to which the onset of an important breaking news event is accompanied by a collapse in textual content diversity, both within a group of news sources and across different media types.

3.1 Finding Topics

We apply a Non-negative Matrix Factorization (NMF) topic modeling approach to extract potentially interesting topics from a stream of tweets or set of articles. For each media source, we build a tf-idf weighted term-document matrix and use this as input to NMF.

We also considered LDA to infer topics in these datasets. The choice of NMF over LDA was primarily due to computation time. LDA was significantly more computationally expensive than NMF with NNDSVD [1] initialisation. NMF also tends to produce more coherent topics [17].

3.2 Measuring Diversity

The same tf-idf representation used for topic modeling is used in diversity calculations. Each article, blog post or tweet is a tf-idf vector. A separate document-term matrix is built for each media type. Stopwords and words occurring in fewer than 10 documents are removed.

To measure diversity, we calculate the mean cosine similarity between all unique pairs of articles within a topic for a fixed time window, and negate it so that more similar content yields a lower diversity value. Given a set of documents D in a time window, the diversity is:

\[
\mathrm{diversity}(D) = -\frac{\sum_{i,j \in D,\; i \neq j} \mathrm{cosSim}(D_i, D_j)}{\sum_{i=1}^{|D|-1} i}
\]

where cosSim(D_i, D_j) is the cosine similarity of the tf-idf vectors of documents i and j in a time window. The sum is taken over unique pairs, and the denominator equals |D|(|D|-1)/2, the number of such pairs. In practice, calculating similarities between all pairs of documents can be efficiently performed in parallel, and can be completed in a matter of seconds.

Longer time windows consider more document pairs, which naturally results in smoother trends. In contrast, shorter time windows are more sensitive to brief attention dominating events, but also to false positive spikes, where a small number of articles happen to be similar in content but do not constitute an attention dominating story.

An alternative to content diversity is also considered. Ignoring document content, and just considering the sources of articles, diversity is calculated with Shannon’s Diversity Index:

\[
H' = -\sum_{i=1}^{R} p_i \ln p_i
\]

where p_i is the proportion of documents produced by the i-th source in a time window of interest, and R is the total number of sources in a given media type.

Both diversity measures produce a single diversity value per time window, generating a univariate time series. Changes in diversity that are 2 standard deviations away from the mean are naively considered to be important enough to warrant attention. Exploring more robust and well established methods for change point detection such as [15, 4] is left for future work.

For the case studies described in Section 5, the window length was set to 8 hours. While the fast-paced “24/7 news cycle” is described as a constant flood of information, we find that all three media types largely follow a more traditional publishing cycle, with prominent spikes in the number of published articles on weekday mornings, and low numbers of articles published outside of normal office hours. A more detailed analysis of publishing times and characteristics will be explored in future work.

4 Datasets

To explore attention dominating news stories, we apply the method described above to three media sources: mainstream news, blogs, and tweets. For the first two sources, the NewsIR dataset (available from http://research.signalmedia.co/newsir16/signal-dataset.html) is used. For the final source, we use our own parallel corpus collected from Twitter (data: https://dx.doi.org/10.6084/m9.figshare.2074105). In contrast to previous work [6, 11], where tweets are retrieved based on keywords extracted from news articles, the parallel corpus was derived from a large set of newsworthy sources, curated by journalists [3]. Journalists on Twitter curate lists of useful sources by location or general topic of interest (examples of such lists are available at https://twitter.com/storyful/lists/ and https://twitter.com/syflmid/lists); for example, “US Politics” may contain accounts of US politicians and other journalists who tend to cover US politics related stories.

Gathering all members of such lists covering different countries and topics follows the expert-digest strategy from [7]. A tweet dataset collected independently of news and blog articles preserves Twitter-specific features and topics. Source and document counts are summarised in Table 1.

Media Type   Sources   Documents   Docs. per 24h
News         18,948    730,634     8,177
Blogs        73,403    253,488     23,568
Tweets       30,448    3,274,089   125,568

Table 1: Summary of overall source and document counts by media type after filtering, and average number of documents in a 24 hour window.

Of the original 1 million articles provided, 15,878 were filtered out as non-English or as falling outside the date range of interest (2015-09-01 to 2015-09-30). Language detection used https://github.com/optimaize/language-detector. Interestingly, language detection proved effective for filtering “spammy” articles containing obfuscated text, large numbers of URLs, or tabular data. Tweet language filtering was performed using meta-data provided in the tweet.

5 Attention Dominating Events

In order to compare the same topics across different media types, we compare the top 10 terms representing the topics from different models. Specifically, when topics from two different models have strongly overlapping top term lists (measured using Jaccard similarity), this indicates that similar events were discussed in both media types.

Topics in a model that do not have any overlapping terms with topics in other models suggest that content unique to a platform is prominent. For example, the “live, periscope, follow, stream, updates” topic in the tweet corpus has no equivalent among the news or blog topics. This reflects the fact that the Periscope app became popular with journalists for broadcasting short live video streams, and Twitter is the main platform where these streams are announced. The “music, album, song, video, band” topic is prominent in the blogs and on Twitter, but is not present in news. This may reflect the fact that most Twitter accounts and blogs are far more personal in nature.

An indicative, but not necessary, feature of attention dominating news is the presence of a similar topic on multiple platforms. To illustrate the phenomenon of topical diversity collapse, we now describe three case studies.

For each case study, we present the following: the top 10 topic terms for a topic in a media type, and a plot of diversity over time, where:

• Solid lines show diversity of documents over time.
• Dashed lines show Shannon Diversity of sources.
• Highlighted time periods are when major developments occurred, based on the Wikipedia Current Events Portal for September 2015 (https://en.wikipedia.org/wiki/Portal:Current_events/September_2015).
• Dot and Triangle markers indicate periods when diversity drops 2 standard deviations below the mean.

5.1 European Refugee Crisis

The European crisis began in 2015, as increasing numbers of refugees from areas in Syria, Afghanistan, and the Western Balkans [8] sought asylum in the EU. Figure 1 shows a plot of diversity for the documents assigned to this topic in each 8 hour time window, for the three media types. To help with visualisation, raw diversity values are standardised with z-scores on the y axis, while the x axis grid separates days.

Media    Top 10 Topic Terms
Blogs    refugees, syria, syrian, war, president, government, military, europe, russia, iran
News     refugees, migrants, border, hungary, eu, europe, european, refugee, asylum, germany
Tweets   refugees, syrian, hungary, help, migrants, europe, border, germany, austria, asylum

[Plot not reproduced. Panels: Blogs, News, Tweets. x axis: day of September 2015 (01 to 30). y axis: standardised diversity (z-score).]

Figure 1: Standardised diversity scores for the European refugee crisis topic during September 2015, across three media types.

The downward trend in diversity between September 3rd and 5th in the refugee crisis topic can be explained by the death of Aylan Kurdi. News of his drowning quickly spread online and made global headlines. This was a particularly far-reaching story, dominating news coverage until an announcement on relaxing controls on the Austro-Hungarian border by Chancellors Faymann of Austria and Merkel of Germany. Both Twitter and mainstream news streams experienced a diversity collapse, while blogs maintained a more diverse set of articles. Between the 19th and 21st, smaller drops in diversity coincide with Pope Francis’ visit, where the issue of refugees was a prominent topic of discussion.

5.2 Donald Trump Presidential Campaign

Donald Trump’s presidential campaign has attracted considerable attention across all types of media (see https://en.wikipedia.org/wiki/Donald_Trump_presidential_campaign,_2016). Positions on issues of immigration and religion are particularly polarising, frequently causing controversies in mainstream media.

Media    Top 10 Topic Terms
Blogs    trump, donald, republican, presidential, debate, gop, president, candidates, candidate, bush
News     trump, republican, presidential, donald, debate, clinton, bush, fiorina, candidates, campaign
Tweets   trump, im, love, donald, going, debate, happy, gop, president, think

[Plot not reproduced. Panels: Blogs, News, Tweets. x axis: day of September 2015 (01 to 30). y axis: standardised diversity (z-score).]

Figure 2: Standardised diversity scores for Donald Trump Presidential Campaign topic.

The significant events marked around the 12th, 17th and 21st in Figure 2 relate to the following. Around the 12th, Trump’s comments on Senator Rand Paul on Twitter were discussed on mainstream news, but not as prominently on blogs. On the 16th-17th, there was coverage of a Republican presidential debate hosted by CNN. Around the 21st, mainstream news covered reactions to events on the 17th: during a town hall meeting in Rochester, Donald Trump declined to correct a man who said that President Obama is a Muslim.

The statement prompted a significant drop in the diversity of stories across all platforms. On the 25th, during a speech given to conservative voters in Washington, Trump called fellow Republican presidential candidate Marco Rubio “a clown”. Based on the data, it appears that the reaction to the latter on Twitter was not as pronounced as among journalists and bloggers.

5.3 Pope Francis visits North America

The visit of Pope Francis spanned 19 to 27 September 2015, with an itinerary that included venues in both Cuba and the United States. This event is a good illustrative example as it was widely documented (see https://en.wikipedia.org/wiki/Pope_Francis’_2015_visit_to_North_America), and highlights a case where a collapse in diversity did not occur at the same time on different media platforms.

Media    Top 10 Topic Terms
Blogs    pope, francis, church, catholic, visit, cuba, popes, climate, philadelphia, vatican
News     pope, francis, catholic, church, philadelphia, popes, cuba, united, vatican, visit
Tweets   pope, francis, visit, house, congress, popeindc, cuba, white, popeinphilly, philadelphia

[Plot not reproduced. Panels: Blogs, News, Tweets. x axis: day of September 2015 (01 to 30). y axis: standardised diversity (z-score).]

Figure 3: Standardised diversity scores for the Papal visit topic during September 2015.

In the case of news publishers, the largest drop in diversity coincided with the beginning of the Pope’s visit to Havana. Twitter users and bloggers reacted more on September 23rd and 24th, when the Pope met with Barack Obama and became the first Pope to address a joint session of US Congress.

In the Twitter stream, the notable event around the 16th-17th is due to large numbers of similar tweets as preparations for the visit were being discussed, and #TellThePope trended briefly.

Earlier in the month, we see evidence of overlapping attention dominating events. Between 6th and 7th September, the Pope announced that the Vatican’s churches would welcome families of refugees. This announcement followed a significant development in the ongoing European refugee crisis: around 6,500 refugees arrived in Vienna following Austria’s and Germany’s decision to waive asylum system rules. This suggests that an attention dominating news event in one topic can trigger events in other topics, especially where prominent public figures are involved.

6 Discussion

While the diversity measure we propose is relatively simple, it can be easily augmented to account for other factors. In the simplest form, every similarity value between a unique pair of articles within a time window carries an equal weight in the diversity calculation, implying that a strong similarity between two highly influential publishers is just as important as one between two inconsequential publishers with a small audience. However, this weight could be tuned, either manually or automatically using external information (e.g. Alexa rankings). Accounting for social context [13] could also be achieved by augmenting the topic modeling stage of the process. Instead of using a classic tf-idf vector space model, alternative representations that capture more semantic similarity between documents can be used. We aim to explore extensions to this measure in future work.

The sequence of events in the European refugee crisis and papal visit case studies suggests that it may be possible to identify and track major developments with global impact by linking attention dominating moments across multiple topics, as well as across sources on different platforms. Social media communities both influence and are influenced by traditional news media [11]. Stories break both on Twitter and through traditional news publishers. Tracking or linking instances of diversity collapse to explain the direction of influence between the different media types is also a potential avenue for future work.

Acknowledgments: This publication has emanated from research conducted with the support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.

References

[1] C. Boutsidis and E. Gallopoulos. SVD based initialization: A head start for nonnegative matrix factorization. Pattern Recognition, 41(4), 2008.

[2] A. E. Boydstun. Making the news: Politics, the media, and agenda setting. University of Chicago Press, 2013.

[3] I. Brigadir, D. Greene, and P. Cunningham. Adaptive representations for tracking breaking news on twitter. CoRR, abs/1403.2923, 2014.

[4] P. Esling and C. Agon. Time-series data mining. ACM Computing Surveys (CSUR), 45(1):12, 2012.

[5] Y. Gandica, J. Carvalho, F. S. D. Aidos, R. Lambiotte, and T. Carletti. On the origin of burstiness in human behavior: The wikipedia edits case, 2016.

[6] W. Gao, P. Li, and K. Darwish. Joint topic modeling for event summarization across news and social media streams. In Proc. 21st ACM international conference on Information and knowledge management, pages 1173–1182. ACM, 2012.

[7] S. Ghosh, M. B. Zafar, P. Bhattacharya, N. Sharma, N. Ganguly, and K. Gummadi. On sampling the wisdom of crowds: Random vs. expert sampling of the twitter stream. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 1739–1744. ACM, 2013.

[8] E.-M. P. Giulio Sabbati and S. Saliba. Asylum in the eu: Facts and figures. European Parliamentary Research Service, (PE 551.332), mar 2015.

[9] Y. Hu, A. John, F. Wang, and S. Kambhampati. Et-lda: Joint topic modeling for aligning events and their twitter feedback. In AAAI Conference on Artificial Intelligence, 2012.

[10] Y. Hu, K. Talamadupula, and S. Kambhampati. Dude, srsly?: The surprisingly formal nature of Twitter’s language, pages 244–253. AAAI press, 2013.

[11] T. Hua, F. Chen, C.-T. Lu, and N. Ramakrishnan. Topical analysis of interactions between news and social media. Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.

[12] A. Jungherr and J. Pascal. Forecasting the pulse: how deviations from regular patterns in online data can identify offline phenomena. Internet Research, 23(5):589–607, 2013.

[13] J. Kalyanam, A. Mantrach, D. Saez-Trumper, H. Vahabi, and G. Lanckriet. Leveraging social context for modeling topic evolution. In Proc. 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 517–526, 2015.

[14] H. Lamba, M. M. Malik, and J. Pfeffer. A tempest in a teacup? analyzing firestorms on twitter. In Proc. International Conference on Advances in Social Networks Analysis and Mining, pages 17–24, 2015.

[15] S. Liu, M. Yamada, N. Collier, and M. Sugiyama. Change-Point Detection in Time-Series Data by Relative Density-Ratio Estimation. ArXiv e-prints, Mar. 2012.

[16] M. Osborne, S. Petrovic, R. McCreadie, C. Macdonald, and I. Ounis. Bieber no more: First story detection using twitter and wikipedia. In SIGIR Workshop on Time-aware Information Access, 2012.

[17] D. O’Callaghan, D. Greene, J. Carthy, and P. Cunningham. An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications, 42(13):5645–5657, 2015.

[18] S. Petrovic, M. Osborne, R. McCreadie, C. Macdonald, I. Ounis, and L. Shrimpton. Can twitter replace newswire for breaking news? In Proc. 7th International Conference on Weblogs and Social Media, ICWSM, 2013.

[19] T. Steiner, S. van Hooland, and E. Summers. Mj no more: Using concurrent wikipedia edit spikes with social network plausibility checks for breaking news detection. In Proc. 22nd International Conference on World Wide Web, pages 791–794, 2013.

[20] C. K. Vaca, A. Mantrach, A. Jaimes, and M. Saerens. A time-based collective factorization for topic discovery and monitoring in news. In Proceedings of the 23rd international conference on World wide web, pages 527–538. ACM, 2014.

[21] K. Zhai and J. Boyd-Graber. Online latent dirichlet allocation with infinite vocabulary. In Proc. 30th International Conference on Machine Learning, pages 561–569, 2013.

[22] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing twitter and traditional media using topic models. In Advances in Information Retrieval, pages 338–349. Springer, 2011.
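
As a supplementary illustration of the two diversity measures and the 2 standard deviation rule described in Section 3.2, the following is a minimal sketch rather than the implementation used for the experiments. It assumes each document is already available as a sparse tf-idf vector stored in a Python dict, and that each time window is given as a list of such documents or a list of source identifiers. The function names (cos_sim, content_diversity, shannon_diversity, flag_drops) are hypothetical, and the use of the population standard deviation is an assumption.

```python
import math
from collections import Counter
from itertools import combinations


def cos_sim(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def content_diversity(docs):
    """Negated mean cosine similarity over all unique pairs of documents."""
    pairs = list(combinations(docs, 2))  # |D|(|D|-1)/2 unique pairs
    if not pairs:
        return 0.0  # assumption: windows with fewer than two documents are neutral
    return -sum(cos_sim(a, b) for a, b in pairs) / len(pairs)


def shannon_diversity(sources):
    """Shannon's Diversity Index over the sources of documents in one window.

    Sources with no documents in the window contribute zero to the sum,
    so only the observed sources need to be counted.
    """
    counts = Counter(sources)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def flag_drops(series, threshold=2.0):
    """Indices of windows more than `threshold` standard deviations below the mean."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0.0:
        return []  # a constant series has no drops
    return [i for i, x in enumerate(series) if (x - mean) / std < -threshold]
```

Applying content_diversity (or shannon_diversity) to each 8 hour window of a topic gives one value per window, and flag_drops can then be run over the resulting series. The pairwise loop is quadratic in the number of documents per window; for larger windows a matrix-based computation would be the more practical choice.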