Exploring Significant Interactions in Live News

Erich Schubert, Andreas Spitz, Michael Gertz
Heidelberg University, Germany
{schubert,spitz,gertz}@informatik.uni-heidelberg.de

Abstract

News monitoring is of interest for detecting current news and tracking developing stories, but also for exploring what is being talked about. In this article, we present an approach to monitoring live feeds of news articles and detecting significant (co-)occurrences of terms compared to a learning background corpus. We visualize the result as a graph-structured semantic word cloud that uses a stochastic neighbor embedding (SNE) based layout and visualizes edges between related terms. We give visual examples of our prototype that processes news articles as they are crawled from dozens of news sites.

Copyright © 2018 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR, Grenoble, France, 26 March 2018, published at http://ceur-ws.org

Figure 1: News coverage of Super Bowl LII.

1 Introduction

The prospect of obtaining an overview of recently published news at a glance is becoming increasingly difficult, simply due to the sheer number of articles that are published daily by a multitude of news outlets. Ultimately, any system that is designed to provide this functionality has to present selected content from such articles to the user, who then selects interesting items and investigates them further. Thus, any system that is designed to give an overview of news has to present the entirety of current news in a way that both summarizes the contents and enables the user to select items for subsequent reading. To address this task, numerous approaches have been proposed that aim to aggregate and visualize current news in a format that is interpretable by the user. For example, the articles can be embedded spatially and displayed in a geographic representation on a map, based on the location of the described news events [TLP+08]. Similarly, an exploration along a temporal axis is possible, which leads to the generation of timelines of news [SA00]. However, such a temporal exploration is best employed for archives of news and suffers when processing live streams with very short intervals of interest. Another natural direction for news aggregation and exploration is the events or incidents that are contained in the articles [FA07]. However, the detection of events in unstructured text is a challenging task with relatively low recall, and is thus not necessarily ideal for obtaining an encompassing view of current news with many emerging events. A frequently used alternative for the exploration of (arbitrary) texts are word clouds (for an overview, see [BKP14]), which provide intuitive representations of document contents and do not rely on an extended stack of natural language processing tools. However, while word clouds provide the user with an overview of words in the documents, they do not account for the co-occurrences of words or their significance.

To address these shortcomings, we investigate the application of semantic word clouds to the task of representing and exploring news streams in real time. To obtain an informative layout, instead of relying on just the placement of words, we utilize the connections between individual words that we derive from a normalization in relation to a background corpus and stochastic neighbor embedding [SSW+17]. The resulting representation provides a quick overview of recently published news, while emphasizing significant word cooccurrences in the selected timeframe (e.g., see Figure 1). Furthermore, our proposed system can easily serve as an index of related news articles for further exploration by the user.

Related work. The European Media Monitor [AV09] monitors 5000 sources in 70 languages (with a focus on Europe) and provides the top 10 clusters for each language in their NewsBrief. Our approach is more dynamic and emphasizes the interactions of terms and entities. In [KMS+10], the authors explore interaction networks of persons in news, but aggregated over a static four-month snapshot, while we explore increased interactions within the latest news only. SigniTrend [SWK14, SWK16] was primarily built for analyzing a stream of Tweets, but has also been applied to a stream of Reuters news. It is conceptually similar to our approach, but we focus on the requirements for news exploration rather than just event detection. Implicit networks of terms or entities support the representation of document collections or streams as graph structures [SG16], and can be used for searching entity relations in the documents or visualizing them as explorable subgraphs [SAG17]. However, they do not include a normalization of relations against a background corpus and are thus limited in assessing the significance of emerging news.

2 Methodology

While our approach is inspired by SigniTrend [SWK14], we diverge from it in several important ways. Much of our text processing is derived from [SSW+17], and this paper can be seen as a continuation of that work. In contrast to SigniTrend [SWK14], we do not consider a standard deviation, which we leave as future work. Such an estimation of the standard deviation as done in their model may be beneficial for filtering trivial recurring patterns such as weekday names.

We process documents in the input stream in micro-batches of 25 articles, because we need enough data to avoid false significance detections. The new data contributes 10% to the current counts, while 90% is based on the previous counts, which yields an exponentially weighted moving window. After about 200 documents, the weight of a document is reduced by half. This is necessary to obtain stable enough frequency estimates, while keeping the computational effort sufficiently low.

We use a distributed microservice architecture around an Apache Kafka message broker to enable live processing of streaming data. One service receives push notifications for new news articles and adds them to the processing queue. The crawler service then crawls the articles and stores them in a database. The third service extracts the text content from these articles and removes the boilerplate (parts of the web site, such as navigation, menus, and advertisements, that are not part of the actual news article). The resulting extracted text is then processed using Stanford CoreNLP [MSB+14] and the entity extraction discussed below, and fed into our model. An initial semantic layout is computed on the server side and pushed to the clients via a WebSocket. A force-directed graph in D3.js is used for the final layout fine-tuning, and the result is rendered in the browser.
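As an illustration of the exponentially weighted micro-batch counting described above, the following is a minimal sketch. It assumes plain in-memory dictionaries rather than the hashed tables the actual system uses, and the function and constant names are ours:

```python
from collections import Counter

DECAY_OLD = 0.9   # weight of the previous counts per micro-batch
DECAY_NEW = 0.1   # contribution of the current micro-batch

def update_counts(current: dict, batch_docs: list[list[str]]) -> dict:
    """Blend the counts of one micro-batch (e.g., 25 articles) into the
    running counts, yielding an exponentially weighted moving window."""
    batch = Counter()
    for tokens in batch_docs:
        batch.update(tokens)
    keys = set(current) | set(batch)
    return {k: DECAY_OLD * current.get(k, 0.0) + DECAY_NEW * batch.get(k, 0.0)
            for k in keys}

# With a decay of 0.9 per batch, a document's weight halves after
# log(0.5)/log(0.9) ~ 6.6 batches, i.e. on the order of 200 documents
# at 25 articles per batch.
```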
2.1 Text Processing

We extract news articles from the news websites (English or German, for now) and split them into paragraphs based on their HTML structure. We then use Stanford CoreNLP [MSB+14] to further split the paragraphs into sentences and annotate parts of speech (POS). To lemmatize words (and to split compound words in German news), we employ an approach based on the Hunspell spell checking system that is used, for example, by Firefox, Google Chrome, and LibreOffice.

For our analysis, we only use entities, verbs, and nouns (the latter two based on their POS tag). Other words are not considered when counting cooccurrences. Since we only consider verbs and nouns, we only need a small set of stop words with common verbs such as "be", "have", and "give". Entities are merged into a single token with the canonical name, which accounts for synonyms and abbreviations, such as "EU" and "European Union", as long as they are included in the entity linker's dictionary.

We use a Gaussian weighting scheme for cooccurrences with weight w_d = exp(−d² / (2σ²)) for words at a token distance of d, and use σ = 4 with a window width of 12. For this paper, we currently only consider cooccurrences within the same sentence (this differs, e.g., from the LOAD model for implicit entity networks [SG16]).
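A minimal sketch of this Gaussian cooccurrence weighting follows, assuming that the tokens of a sentence have already been POS-filtered and entity-merged as described above. Reading the window width as a maximum token distance, as well as the function and variable names, are our assumptions rather than the system's exact implementation:

```python
import math
from collections import Counter
from itertools import combinations

SIGMA = 4.0    # standard deviation of the Gaussian distance weighting
WINDOW = 12    # maximum token distance considered

def cooccurrence_weights(sentence_tokens: list[str]) -> Counter:
    """Weight each token pair within a sentence by exp(-d^2 / (2*sigma^2)),
    where d is the token distance; pairs farther apart than WINDOW are ignored."""
    weights = Counter()
    for (i, a), (j, b) in combinations(enumerate(sentence_tokens), 2):
        d = j - i
        if d > WINDOW or a == b:
            continue
        weights[tuple(sorted((a, b)))] += math.exp(-d * d / (2.0 * SIGMA * SIGMA))
    return weights

# Example: one sentence after POS filtering and entity merging.
print(cooccurrence_weights(["Super_Bowl", "touchdown", "Nick_Foles", "pass"]))
```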
2.2 Learning the Background Model

Our initial word cooccurrence frequency model is trained by counting word occurrences and cooccurrences on Wikipedia. We used the January 1st, 2018 dump of the English Wikipedia for the experiments presented here. Our entity detection system is also trained on this data, based on the probability that a page containing the entity string contains a link to the expected target page. To reduce the memory requirements (counting all word cooccurrences on Wikipedia requires an enormous amount of memory), we use a hashing-based approach as previously used by SigniTrend [SWK14, SWK16]. The accuracy of this hashing technique is studied in detail in [SSW+17].

Because the word distribution in Wikipedia is different from that of news sources, and since we also want to emphasize new developments, we continually incorporate the documents we have already seen into our background model. To this end, we update our background model at the end of each micro-batch such that it combines a fraction 1−η of the previous model and a fraction η based on the documents we just processed. We used η = 1% in our model, so that the half-life of the data is about 70 batch iterations. Using the hash table, this approach can be implemented efficiently and does not require storing or revisiting documents outside the current micro-batch.
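The following is a simplified sketch of such a hashed frequency table with the exponential background update. The bucket count, the number of hash functions, and the use of blake2b hashing are illustrative assumptions, not the configuration of the actual system:

```python
import hashlib

class HashedFrequencies:
    """Count-min-sketch-like table: a term maps to several buckets, and its
    frequency estimate is the minimum over those buckets."""

    def __init__(self, n_buckets: int = 2**20, n_hashes: int = 4, eta: float = 0.01):
        self.buckets = [0.0] * n_buckets
        self.n_buckets = n_buckets
        self.n_hashes = n_hashes
        self.eta = eta  # fraction of the model replaced by each micro-batch

    def _indexes(self, term: str):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(f"{i}:{term}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.n_buckets

    def estimate(self, term: str) -> float:
        """Background estimate E[t] as the minimum of all buckets of the term."""
        return min(self.buckets[i] for i in self._indexes(term))

    def update(self, batch_freqs: dict) -> None:
        """Blend one micro-batch into the background: (1 - eta) of the old
        model plus eta of the new frequencies; with eta = 1% the old data
        has a half-life of roughly 70 batches."""
        decay = 1.0 - self.eta
        self.buckets = [b * decay for b in self.buckets]
        for term, freq in batch_freqs.items():
            for i in self._indexes(term):
                self.buckets[i] += self.eta * freq
```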
Figure 2: Word cloud for the opening of the Olympic Winter Games in South Korea, with North Korea's Kim Yo-jong meeting South Korean president Moon Jae-in. The oiled Tongan flag bearer, Pita Taufatofua, got a lot of attention.

2.3 Judging Significance

In order to evaluate the significance of a term or a pair of terms t, we have to compare the observed frequency c(t) with an estimated frequency E[t]. However, there are several biases for which we need to adjust. The bias term β_D = 1/(2|D|) accounts for the size of the batch, while β_C = 1/(2|C|) accounts for a corpus size bias, where |D| is the number of documents in the batch and |C| is the number of documents in the normalization corpus. The prior probability p = k/|W| reflects the number of words k we want to choose from the vocabulary of size |W| (details can be found in [SSW+17]). We obtain the overall relevance of a term as

    r(t) = max{ 0, (c(t) − β_D) / (E[t] + β_C · p) }    (1)

The weighted (co-)occurrence c(t) is obtained by counting term frequencies in the current micro-batch, whereas the value E[t] is estimated from the background model's hash table as the minimum of all buckets, similar to count-min sketches [CM05].

The motivation of this measure is to capture the interestingness of a term (or term cooccurrence), and it yields a ratio coefficient where values of 1 and below correspond to a non-interesting frequency. This ratio is then converted into a probability using the common transformation p(t) = r(t) / (r(t) + 1).
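The sketch below is a direct transcription of Equation (1) and the probability transform as reconstructed above; the grouping of the bias and prior terms, the argument names, and the example numbers are our assumptions:

```python
def relevance(c_t: float, e_t: float, n_batch_docs: int, n_corpus_docs: int,
              k: int, vocab_size: int) -> float:
    """Relevance of a term (or term pair) following Eq. (1): the observed
    batch frequency c(t) relative to the background expectation E[t],
    with small bias corrections for batch and corpus size."""
    beta_d = 0.5 / n_batch_docs    # batch size bias
    beta_c = 0.5 / n_corpus_docs   # normalization corpus size bias
    p = k / vocab_size             # prior: choose k words out of |W|
    return max(0.0, (c_t - beta_d) / (e_t + beta_c * p))

def to_probability(r_t: float) -> float:
    """Convert the relevance ratio into a probability: p(t) = r(t) / (r(t) + 1)."""
    return r_t / (r_t + 1.0)

# Example: a term observed far more often in the batch than the background expects.
r = relevance(c_t=0.4, e_t=0.001, n_batch_docs=25, n_corpus_docs=5_000_000,
              k=50, vocab_size=1_000_000)
print(r, to_probability(r))
```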
Figure 3: Declassification of the Nunes memo, Larry Nassar sentencing, and Groundhog Day.

2.4 Word Cloud Visualization

We use semantic word clouds [SSW+17] for visualization. However, rather than using t-distributed stochastic neighbor embedding (tSNE), we found the results to be better with its predecessor SNE [HR02]. We attribute this to the sparsity of the similarity matrix (where most entries will have a cooccurrence significance of 0, which is different from the distance-based matrices commonly used), and to the heavier tails of tSNE, which cause it to overemphasize separation (a demonstration of this effect can be seen at https://stats.stackexchange.com/a/264647). The Student t-distribution used by tSNE causes many non-zero pairwise repulsive forces, while in regular SNE, these forces become 0 once words are sufficiently separated.

The initial word layout is generated in the backend, while the web browser frontend uses a force-directed graph to optimize the layout for the user's screen size. We employ forces that draw words to the desired location, forces that try to keep linked words close to each other, and additional forces that try to avoid word overlap based on a bounding box collision algorithm.

A key contribution of the visualization is the inclusion of links between cooccurring words and entities. The link strength visualizes the value of the probability p described above, i.e., it measures how much more common this connection is than expected. It does not just display the count, which would produce too many uninteresting links.

2.5 Word Clustering

We cluster words based on their link strength, using the classic AGNES hierarchical clustering algorithm [KR90] with group-average linkage, and extract clusters using the method described in [SSW+17]. This approach tries to find 8 clusters with a minimum size of 2 words, while it puts unclustered words into a separate "noise" set, inspired by DBSCAN clustering [EKSX96, SSE+17]. The main benefit over DBSCAN is that we do not need to set appropriate thresholds, but can extract clusters based on a desired number and minimum cluster size. Furthermore, the group-average linkage strategy takes the pairwise relatedness of all cluster members into account and has a lower tendency to transitively merge topics. Experimentally, it produced more intuitive results than other linkage strategies. Our clustering implementations are derived from the open-source implementations available in the ELKI data mining framework [SKE+15].

In future work, we intend to cluster links instead of words (or a hybrid of these two approaches), because we repeatedly observe words that are used in two separate topics. For example, in Figure 3, the word "use" causes an undesired connection between the Nunes memo event and the use of Sarin gas in Syria.
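As a rough sketch of the cluster extraction, the following uses SciPy's average-linkage (AGNES-style) clustering rather than the ELKI implementation used in the prototype; the conversion of link strengths to distances and the noise-set heuristic are simplifications of the method in [SSW+17], and the default parameters follow the values given above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def extract_clusters(words: list, similarity: np.ndarray,
                     n_clusters: int = 8, min_size: int = 2):
    """Group-average hierarchical clustering on link strengths in [0, 1];
    clusters smaller than min_size are collected into a 'noise' set."""
    distance = 1.0 - similarity          # turn link strength into a distance
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")

    clusters, noise = [], []
    for label in np.unique(labels):
        members = [w for w, lab in zip(words, labels) if lab == label]
        if len(members) >= min_size:
            clusters.append(members)
        else:
            noise.extend(members)
    return clusters, noise
```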
3 Experiments

The objective evaluation of explorative data mining methods such as clustering is inherently difficult [FGK+10]. For example, a user study that involves asking users to summarize recent events when presented with either a word cloud or just a ranked list of articles takes considerable effort. In the case of our model in particular, such a study is unlikely to contribute to a deeper understanding of the approach. Since we add additional information to the visualization by clustering words and adding relations between them, it is difficult to imagine a setting in which a user will find it less informative than a traditional word cloud. In the following, we therefore present some anecdotal evidence and leave it to the reader to decide whether this approach is worth exploring.

Figure 1 on the first page shows the Super Bowl LII news coverage, including the touchdown by Zach Ertz after a pass by Nick Foles, but also the announcement of Kylie Jenner's baby (and pregnancy), which broke Instagram records. Figure 2 is a snapshot of the Olympics opening ceremony, and Figure 3 of the Nunes memo. The latter also shows the limitations of our approach: the declassification of this memo had been discussed in the media since January 18 and did not come as much of a surprise. Much of the new coverage discusses the concern that it might be used to try to fire Jeff Sessions, and thus Robert Mueller.

In Figure 4, using German news, we show the emergence of a cluster around the first gold medal for a German athlete in the 2018 Winter Olympics. Prior to this event, the center stage was occupied by politics and the ongoing struggle of German politicians to form a coalition government. Within an hour, the main news events are centered around the Olympic Games (unsurprisingly, as there have been no major new developments on the political side, so this is the desired behavior). Figure 5 showcases the development of news coverage of a passenger airplane crash in Russia, exhibiting a similar development in English news.

Figure 4: Temporal development of the first Olympic gold medal for a German athlete in Biathlon (in German news) taking center stage, while recent political news on government coalition troubles slowly fade. Panels: (a) emerging, (b) increasing coverage, (c) Olympics taking center stage.

Words occurring in different contexts can cause undesired links between clusters. For example, "police car" and "train car" share the common term "car", but will often belong to different events. Thus, a more complex clustering based on links instead of words – or a hybrid that uses both – could improve the clustering quality. Instead of clustering, the use of topic models for colorizing the plots is also worth exploring, but it is not obvious how to adapt topic models to use (co-)occurrence significance instead of mere frequency.

Recurring patterns are problematic, such as month names, weekdays, or the word "weekend". While calendar words are easily added to a stopword list, a learning approach is preferable. Entity linking with Wikipedia performs very well for frequent entities, but cannot adapt to entities not yet present in Wikipedia [GTBd14, EAM+17], as needed for emerging events. For example, an article on Larry Nassar was not created until January 18, 2018. The system will benefit from improved entity disambiguation and the ability to learn new entities (such as names) on the fly, even when they cannot yet be linked.

Temporal aspects are of high interest: when do clusters and links emerge, and when do they disappear? In our prototype, the user can navigate between micro-batches (each covering about 15-30 minutes of news data) with the arrow buttons, but we do not overlay frequency histograms onto the words as done in some of the related work [LHKC10, LBSW12]. Different events may require different timeframes for analysis. For example, the Super Bowl itself was in the media for days before the actual event, and we observe a gradual ramp-up. In-game events occur on a much shorter time frame. This may require using multiple estimators that learn at different rates.

The quality of the results depends strongly on the crawling lag, i.e., the delay between the publication time of an article and the time when it is added to the feed. With regular crawling, it is common to see lags of well over 30 minutes. The best results are obtained if push notifications by the publishers notify the system of new content, which usually reduces the lag to less than 5 minutes (compared to the reported publishing time; we cannot avoid lag on the publisher side between assigning the publishing date, the actual availability on the web site, and the delivery of the notification). Republished articles and duplicates add further data quality challenges. This requires complex systems to gather the data, until news outlets provide reliable and low-latency push notification APIs.

Figure 5: Temporal development of news covering the plane crash of the Saratov Airlines passenger flight. Panels: (a) prior to the event, (b) emerging, (c) dominant topic.

4 Conclusions and Future Directions

This paper calls for future research in many directions. The current visualization is limited by readability aspects, because we cannot display much more than 50 words. To improve the user experience when exploring the data, it will be interesting to instead cluster many more words and allow a drill-down approach in which the user can zoom into groups of words that are then displayed in more detail with additional words.

Our prototype provides a single link for each word. A Google News style user interface that allows the user to browse related results from different news sources based on short snippets would be beneficial.

All in all, the significance of cooccurrences and its visualization is an interesting step towards exploring the interactions of multiple entities in current news articles, and allows both for interesting applications and for further research.
References

[AV09] M. Atkinson and E. Van der Goot. Near real time information mining in multilingual news. In WWW, 2009.

[BKP14] L. Barth, S. G. Kobourov, and S. Pupyrev. Experimental comparison of semantic word clouds. In SEA, 2014.

[CM05] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. J. Algorithms, 55(1), 2005.

[EAM+17] J. Esquivel, D. Albakour, M. Martinez-Alvarez, D. Corney, and S. Moussa. On the long-tail entities in news. In ECIR, 2017.

[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In ACM KDD, 1996.

[FA07] A. Feng and J. Allan. Finding and linking incidents in news. In ACM CIKM, 2007.

[FGK+10] I. Färber, S. Günnemann, H.-P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, and A. Zimek. On using class-labels in evaluation of clusterings. In MultiClust Workshop, 2010.

[GTBd14] D. Graus, M. Tsagkias, L. Buitinck, and M. de Rijke. Generating pseudo-ground truth for predicting new concepts in social streams. In ECIR, 2014.

[HR02] G. E. Hinton and S. T. Roweis. Stochastic neighbor embedding. In NIPS 15, 2002.

[KMS+10] M. Krstajic, F. Mansmann, A. Stoffel, M. Atkinson, and D. A. Keim. Processing online news streams for large-scale semantic analysis. In ICDE Workshops, 2010.

[KR90] L. Kaufman and P. J. Rousseeuw. Agglomerative Nesting (Program AGNES). John Wiley & Sons, Inc., 1990.

[LBSW12] S. Lohmann, M. Burch, H. Schmauder, and D. Weiskopf. Visual analysis of microblog content using time-varying co-occurrence highlighting in tag clouds. In Advanced Visual Interfaces, AVI, 2012.

[LHKC10] B. Lee, N. Henry Riche, A. K. Karlson, and S. Carpendale. SparkClouds: Visualizing trends in tag clouds. IEEE Trans. Vis. Comput. Graph., 16(6), 2010.

[MSB+14] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations, 2014.

[SA00] R. C. Swan and J. Allan. Automatic generation of overview timelines. In ACM SIGIR Conf. on Research and Development in Information Retrieval, 2000.

[SAG17] A. Spitz, S. Almasian, and M. Gertz. EVELIN: Exploration of event and entity links in implicit networks. In Int. Conf. on World Wide Web, WWW Companion, 2017.

[SG16] A. Spitz and M. Gertz. Terms over LOAD: Leveraging named entities for cross-document extraction and summarization of events. In ACM SIGIR Conf. on Research and Development in Information Retrieval, 2016.

[SKE+15] E. Schubert, A. Koos, T. Emrich, A. Züfle, K. A. Schmid, and A. Zimek. A framework for clustering uncertain data. Proc. VLDB Endowment, 8(12), 2015.

[SSE+17] E. Schubert, J. Sander, M. Ester, H.-P. Kriegel, and X. Xu. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 2017.

[SSW+17] E. Schubert, A. Spitz, M. Weiler, J. Geiß, and M. Gertz. Semantic word clouds with background corpus normalization and t-distributed stochastic neighbor embedding. CoRR, abs/1708.03569, 2017.

[SWK14] E. Schubert, M. Weiler, and H.-P. Kriegel. SigniTrend: Scalable detection of emerging topics in textual streams by hashed significance thresholds. In ACM KDD, 2014.

[SWK16] E. Schubert, M. Weiler, and H.-P. Kriegel. SPOTHOT: Scalable detection of geo-spatial events in large textual streams. In SSDBM, 2016.

[TLP+08] B. E. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. NewsStand: A new view on news. In ACM SIGSPATIAL, 2008.