Exploring Significant Interactions in Live News

Erich Schubert, Andreas Spitz, Michael Gertz
Heidelberg University, Germany
{schubert,spitz,gertz}@informatik.uni-heidelberg.de

Abstract

News monitoring is of interest for detecting current news and tracking developing stories, but also for exploring what is being talked about. In this article, we present an approach to monitoring live feeds of news articles and detecting significant (co-)occurrences of terms compared to a learning background corpus. We visualize the result as a graph-structured semantic word cloud that uses a stochastic neighbor embedding (SNE) based layout and visualizes edges between related terms. We give visual examples of our prototype that processes news articles as they are crawled from dozens of news sites.

Copyright © 2018 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR, Grenoble, France, 26 March 2018, published at http://ceur-ws.org

Figure 1: News coverage of Super Bowl LII.

1 Introduction

The prospect of obtaining an overview of recently published news at a glance is becoming increasingly difficult, simply due to the sheer number of articles that are published daily by a multitude of news outlets. Ultimately, any system that is designed to provide this functionality has to present selected content from such articles to the user, who then selects interesting items and investigates them further. Thus, any system that is designed to give an overview of news has to present the entirety of current news in a way that both summarizes the contents and enables the user to select items for subsequent reading. To address this task, numerous approaches have been proposed that aim to aggregate and visualize current news in a format that is interpretable by the user. For example, the articles can be embedded spatially and displayed in a geographic representation on a map, based on the location of the described news events [TLP+08]. Similarly, an exploration along a temporal axis is possible, which leads to the generation of timelines of news [SA00]. However, such a temporal exploration is best employed for archives of news and suffers when processing live streams with very short intervals of interest. Another natural direction for news aggregation and exploration is the events or incidents that are contained in the articles [FA07]. However, the detection of events in unstructured text is a challenging task with relatively low recall, and is thus not necessarily ideal for obtaining an encompassing view of current news with many emerging events. A frequently used alternative for the exploration of (arbitrary) texts are word clouds (for an overview, see [BKP14]), which provide intuitive representations of document contents and do not rely on an extended stack of natural language processing tools. However, while word clouds provide the user with an overview of words in the documents, they do not account for the co-occurrences of words or their significance.

To address these shortcomings, we investigate the application of semantic word clouds to the task of representing and exploring news streams in real time. To obtain an informative layout, instead of relying on just the placement of words, we utilize the connections between individual words that we derive from a normalization in relation to a background corpus and stochastic neighbor embedding [SSW+17]. The resulting representation provides a quick overview of recently published news, while emphasizing significant word cooccurrences in the selected timeframe (e.g., see Figure 1). Furthermore, our proposed system can easily serve as an index of related news articles for further exploration by the user.

Related work. The European Media Monitor [AV09] monitors 5000 sources in 70 languages (with a focus on Europe) and provides the top 10 clusters for each language in their NewsBrief. Our approach is more dynamic and emphasizes the interactions of terms and entities. In [KMS+10], the authors explore interaction networks of persons in news, but aggregated over a static four-month snapshot, while we explore increased interactions within the latest news only. SigniTrend [SWK14, SWK16] was primarily built for analyzing a stream of Tweets, but has also been applied to a stream of Reuters news. It is conceptually similar to our approach, but we focus on the requirements for news exploration rather than just event detection. Implicit networks of terms or entities support the representation of document collections or streams as graph structures [SG16], and can be used for searching entity relations in the documents or visualizing them as explorable subgraphs [SAG17]. However, they do not include a normalization of relations against a background corpus and are thus limited in assessing the significance of emerging news.

2 Methodology

While our approach is inspired by SigniTrend [SWK14], we diverge from it in several important ways. Much of our text processing is derived from [SSW+17], and this paper can be seen as a continuation of that work. In contrast to SigniTrend [SWK14], we do not consider a standard deviation, which we leave as future work. Such an estimation of the standard deviation as done in their model may be beneficial for filtering trivial recurring patterns such as weekday names.

We process documents in the input stream in micro-batches of 25 articles, because we need enough data to avoid false significance detections. The new data contributes 10% to the current counts, while 90% is based on the previous counts, which yields an exponentially weighted moving window. After about 200 documents, the weight of a document is reduced by half. This is necessary to obtain stable enough frequency estimates, while keeping the computational effort sufficiently low.

We use a distributed microservice architecture around an Apache Kafka message broker to enable live processing of streaming data. One service receives push notifications for new news articles and adds them to the processing queue. The crawler service then crawls the articles and stores them in a database. The third service extracts the text content from these articles and removes the boilerplate (parts of the web site, such as navigation, menus, and advertisements, that are not part of the actual news article). The resulting extracted text is then processed using Stanford CoreNLP [MSB+14] and the entity extraction discussed below, and fed into our model. An initial semantic layout is computed on the server side and pushed to the clients via a WebSocket. A force-directed graph in D3.js is used for the final layout fine-tuning, and the result is rendered in the browser.
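As an illustration of the exponentially weighted micro-batch counting described above, the following is a minimal sketch. It assumes plain in-memory dictionaries rather than the hashed tables the actual system uses, and the function and constant names are ours:

```python
from collections import Counter

DECAY_OLD = 0.9   # weight of the previous counts per micro-batch
DECAY_NEW = 0.1   # contribution of the current micro-batch

def update_counts(current: dict, batch_docs: list[list[str]]) -> dict:
    """Blend the counts of one micro-batch (e.g., 25 articles) into the
    running counts, yielding an exponentially weighted moving window."""
    batch = Counter()
    for tokens in batch_docs:
        batch.update(tokens)
    keys = set(current) | set(batch)
    return {k: DECAY_OLD * current.get(k, 0.0) + DECAY_NEW * batch.get(k, 0.0)
            for k in keys}

# With a decay of 0.9 per batch, a document's weight halves after
# log(0.5)/log(0.9) ~ 6.6 batches, i.e. on the order of 200 documents
# at 25 articles per batch.
```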
2.1 Text Processing

We extract news articles from the news websites (English or German, for now) and split them into paragraphs based on their HTML structure. We then use Stanford CoreNLP [MSB+14] to further split the paragraphs into sentences and annotate parts of speech (POS). To lemmatize words (and to split compound words in German news), we employ an approach based on the Hunspell spell checking system that is used, for example, by Firefox, Google Chrome, and LibreOffice.

For our analysis, we only use entities, verbs, and nouns (the latter two based on their POS tag). Other words are not considered when counting cooccurrences. Since we only consider verbs and nouns, we only need a small set of stop words with common verbs such as "be", "have", and "give". Entities are merged into a single token with the canonical name, which accounts for synonyms and abbreviations, such as "EU" and "European Union", as long as they are included in the entity linker's dictionary.

We use a Gaussian weighting scheme for cooccurrences with weight w_d = exp(−d² / (2σ²)) for words at a token distance of d, and use σ = 4 with a window width of 12. For this paper, we currently only consider cooccurrences within the same sentence (this differs, e.g., from the LOAD model for implicit entity networks [SG16]).
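A minimal sketch of this Gaussian cooccurrence weighting follows, assuming that the tokens of a sentence have already been POS-filtered and entity-merged as described above. Reading the window width as a maximum token distance, as well as the function and variable names, are our assumptions rather than the system's exact implementation:

```python
import math
from collections import Counter
from itertools import combinations

SIGMA = 4.0    # standard deviation of the Gaussian distance weighting
WINDOW = 12    # maximum token distance considered

def cooccurrence_weights(sentence_tokens: list[str]) -> Counter:
    """Weight each token pair within a sentence by exp(-d^2 / (2*sigma^2)),
    where d is the token distance; pairs farther apart than WINDOW are ignored."""
    weights = Counter()
    for (i, a), (j, b) in combinations(enumerate(sentence_tokens), 2):
        d = j - i
        if d > WINDOW or a == b:
            continue
        weights[tuple(sorted((a, b)))] += math.exp(-d * d / (2.0 * SIGMA * SIGMA))
    return weights

# Example: one sentence after POS filtering and entity merging.
print(cooccurrence_weights(["Super_Bowl", "touchdown", "Nick_Foles", "pass"]))
```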
2.2 Learning the Background Model

Our initial word cooccurrence frequency model is trained by counting word occurrences and cooccurrences on Wikipedia. We used the January 1st, 2018 dump of the English Wikipedia for the experiments presented here. Our entity detection system is also trained on this data, based on the probability that a page containing the entity string contains a link to the expected target page. To reduce the memory requirements (counting all word cooccurrences on Wikipedia requires an enormous amount of memory), we use a hashing-based approach as previously used by SigniTrend [SWK14, SWK16]. The accuracy of this hashing technique is studied in detail in [SSW+17].

Because the word distribution in Wikipedia is different from that of news sources, and since we also want to emphasize new developments, we continually incorporate the documents we have already seen into our background model. To this end, we update our background model at the end of each micro-batch such that it combines a fraction 1−η of the previous model and a fraction η based on the documents we just processed. We used η = 1% in our model, so that the half-life of the data is about 70 batch iterations. Using the hash table, this approach can be implemented efficiently and does not require storing or revisiting documents outside the current micro-batch.
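The following is a simplified sketch of such a hashed frequency table with the exponential background update. The bucket count, the number of hash functions, and the use of blake2b hashing are illustrative assumptions, not the configuration of the actual system:

```python
import hashlib

class HashedFrequencies:
    """Count-min-sketch-like table: a term maps to several buckets, and its
    frequency estimate is the minimum over those buckets."""

    def __init__(self, n_buckets: int = 2**20, n_hashes: int = 4, eta: float = 0.01):
        self.buckets = [0.0] * n_buckets
        self.n_buckets = n_buckets
        self.n_hashes = n_hashes
        self.eta = eta  # fraction of the model replaced by each micro-batch

    def _indexes(self, term: str):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(f"{i}:{term}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.n_buckets

    def estimate(self, term: str) -> float:
        """Background estimate E[t] as the minimum of all buckets of the term."""
        return min(self.buckets[i] for i in self._indexes(term))

    def update(self, batch_freqs: dict) -> None:
        """Blend one micro-batch into the background: (1 - eta) of the old
        model plus eta of the new frequencies; with eta = 1% the old data
        has a half-life of roughly 70 batches."""
        decay = 1.0 - self.eta
        self.buckets = [b * decay for b in self.buckets]
        for term, freq in batch_freqs.items():
            for i in self._indexes(term):
                self.buckets[i] += self.eta * freq
```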
Figure 2: Word cloud for the opening of the Olympic Winter Games in South Korea, with North Korea's Kim Yo-jong meeting South Korean president Moon Jae-in. The oiled Tongan flag bearer, Pita Taufatofua, got a lot of attention.

2.3 Judging Significance

In order to evaluate the significance of a term or a pair of terms t, we have to compare the observed frequency c(t) with an estimated frequency E[t]. However, there are several biases for which we need to adjust. The bias term β_D = 1/(2|D|) accounts for the size of the batch, while β_C = 1/(2|C|) accounts for a corpus size bias, where |D| is the number of documents in the batch and |C| is the number of documents in the normalization corpus. The prior probability p = k/|W| reflects the number of words k we want to choose from the vocabulary of size |W| (details can be found in [SSW+17]). We obtain the overall relevance of a term as

    r(t) = max{ 0, (c(t) − β_D) / (E[t] + β_C · p) }    (1)

The weighted (co-)occurrence c(t) is obtained by counting term frequencies in the current micro-batch, whereas the value E[t] is estimated from the background model's hash table as the minimum of all buckets, similar to count-min sketches [CM05].

The motivation of this measure is to capture the interestingness of a term (or term cooccurrence), and it yields a ratio coefficient where values of 1 and below correspond to a non-interesting frequency. This ratio is then converted into a probability using the common transformation p(t) = r(t) / (r(t) + 1).
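The sketch below is a direct transcription of Equation (1) and the probability transform as reconstructed above; the grouping of the bias and prior terms, the argument names, and the example numbers are our assumptions:

```python
def relevance(c_t: float, e_t: float, n_batch_docs: int, n_corpus_docs: int,
              k: int, vocab_size: int) -> float:
    """Relevance of a term (or term pair) following Eq. (1): the observed
    batch frequency c(t) relative to the background expectation E[t],
    with small bias corrections for batch and corpus size."""
    beta_d = 0.5 / n_batch_docs    # batch size bias
    beta_c = 0.5 / n_corpus_docs   # normalization corpus size bias
    p = k / vocab_size             # prior: choose k words out of |W|
    return max(0.0, (c_t - beta_d) / (e_t + beta_c * p))

def to_probability(r_t: float) -> float:
    """Convert the relevance ratio into a probability: p(t) = r(t) / (r(t) + 1)."""
    return r_t / (r_t + 1.0)

# Example: a term observed far more often in the batch than the background expects.
r = relevance(c_t=0.4, e_t=0.001, n_batch_docs=25, n_corpus_docs=5_000_000,
              k=50, vocab_size=1_000_000)
print(r, to_probability(r))
```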
Figure 3: Declassification of the Nunes memo, Larry Nassar sentencing, and Groundhog Day.

2.4 Word Cloud Visualization

We use semantic word clouds [SSW+17] for visualization. However, rather than using t-distributed stochastic neighbor embedding (tSNE), we found the results to be better with its predecessor SNE [HR02]. We attribute this to the sparsity of the similarity matrix (where most entries will have a cooccurrence significance of 0, which is different from the distance-based matrices commonly used), and to the heavier tails of tSNE, which cause it to overemphasize separation (a demonstration of this effect can be seen at https://stats.stackexchange.com/a/264647). The Student t-distribution used by tSNE causes many non-zero pairwise repulsive forces, while in regular SNE, these forces become 0 once words are sufficiently separated.

The initial word layout is generated in the backend, while the web browser frontend uses a force-directed graph to optimize the layout for the user's screen size. We employ forces that draw words to the desired location, forces that try to keep linked words close to each other, and additional forces that try to avoid word overlap based on a bounding box collision algorithm.

A key contribution of the visualization is the inclusion of links between cooccurring words and entities. The link strength visualizes the value of the probability p described above, i.e., it measures how much more common this connection is than expected. It does not just display the count, which would produce too many uninteresting links.

2.5 Word Clustering

We cluster words based on their link strength, using the classic AGNES hierarchical clustering algorithm [KR90] with group-average linkage, and extract clusters using the method described in [SSW+17]. This approach tries to find 8 clusters with a minimum size of 2 words, while it puts unclustered words into a separate "noise" set, inspired by DBSCAN clustering [EKSX96, SSE+17]. The main benefit over DBSCAN is that we do not need to set appropriate thresholds, but can extract clusters based on a desired number and minimum cluster size. Furthermore, the group-average linkage strategy takes the pairwise relatedness of all cluster members into account and has a lower tendency to transitively merge topics. Experimentally, it produced more intuitive results than other linkage strategies. Our clustering implementations are derived from the open-source implementations available in the ELKI data mining framework [SKE+15].

In future work, we intend to cluster links instead of words (or a hybrid of these two approaches), because we repeatedly observe words that are used in two separate topics. For example, in Figure 3, the word "use" causes an undesired connection between the Nunes memo event and the use of Sarin gas in Syria.
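As a rough sketch of the cluster extraction, the following uses SciPy's average-linkage (AGNES-style) clustering rather than the ELKI implementation used in the prototype; the conversion of link strengths to distances and the noise-set heuristic are simplifications of the method in [SSW+17], and the default parameters follow the values given above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def extract_clusters(words: list, similarity: np.ndarray,
                     n_clusters: int = 8, min_size: int = 2):
    """Group-average hierarchical clustering on link strengths in [0, 1];
    clusters smaller than min_size are collected into a 'noise' set."""
    distance = 1.0 - similarity          # turn link strength into a distance
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")

    clusters, noise = [], []
    for label in np.unique(labels):
        members = [w for w, lab in zip(words, labels) if lab == label]
        if len(members) >= min_size:
            clusters.append(members)
        else:
            noise.extend(members)
    return clusters, noise
```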
3 Experiments

The objective evaluation of explorative data mining methods such as clustering is inherently difficult [FGK+10]. For example, a user study that involves asking users to summarize recent events when presented with either a word cloud or just a ranked list of articles takes considerable effort. In the case of our model in particular, such a study is unlikely to contribute to a deeper understanding of the approach. Since we add additional information to the visualization by clustering words and adding relations between them, it is difficult to imagine a setting in which a user will find it less informative than a traditional word cloud. In the following, we therefore present some anecdotal evidence and leave it to the reader to decide whether this approach is worth exploring.

Figure 1 on the first page shows the Super Bowl LII news coverage, including the touchdown by Zach Ertz after a pass by Nick Foles, but also the announcement of Kylie Jenner's baby (and pregnancy), which broke Instagram records. Figure 2 is a snapshot of the Olympics opening ceremony, and Figure 3 of the Nunes memo. The latter also shows the limitations of our approach: the declassification of this memo had been discussed in the media since January 18 and did not come as much of a surprise. Much of the new coverage discusses the concern that it might be used to try to fire Jeff Sessions, and thus Robert Mueller.

In Figure 4, using German news, we show the emergence of a cluster around the first gold medal for a German athlete in the 2018 Winter Olympics. Prior to this event, the center stage was occupied by politics and the ongoing struggle of German politicians to form a coalition government. Within an hour, the main news events are centered around the Olympic Games (unsurprisingly, as there have been no major new developments on the political side, so this is the desired behavior). Figure 5 showcases the development of news coverage of a passenger airplane crash in Russia, exhibiting a similar development in English news.

Figure 4: Temporal development of the first Olympic gold medal for a German athlete in Biathlon (in German news) taking center stage, while recent political news on government coalition troubles slowly fade. Panels: (a) emerging, (b) increasing coverage, (c) Olympics taking center stage.

Words occurring in different contexts can cause undesired links between clusters. For example, "police car" and "train car" share the common term "car", but will often belong to different events. Thus, a more complex clustering based on links instead of words – or a hybrid that uses both – could improve the clustering quality. Instead of clustering, the use of topic models for colorizing the plots is also worth exploring, but it is not obvious how to adapt topic models to use (co-)occurrence significance instead of mere frequency.

Recurring patterns are problematic, such as month names, weekdays, or the word "weekend". While calendar words are easily added to a stopword list, a learning approach is preferable. Entity linking with Wikipedia performs very well for frequent entities, but cannot adapt to entities not yet present in Wikipedia [GTBd14, EAM+17], as needed for emerging events. For example, an article on Larry Nassar was not created until January 18, 2018. The system will benefit from improved entity disambiguation and the ability to learn new entities (such as names) on the fly, even when they cannot yet be linked.

Temporal aspects are of high interest: when do clusters and links emerge, and when do they disappear? In our prototype, the user can navigate between micro-batches (each covering about 15-30 minutes of news data) with the arrow buttons, but we do not overlay frequency histograms onto the words as done in some of the related work [LHKC10, LBSW12]. Different events may require different timeframes for analysis. For example, the Super Bowl itself was in the media for days before the actual event, and we observe a gradual ramp-up. In-game events occur on a much shorter time frame. This may require using multiple estimators that learn at different rates.

The quality of the results depends strongly on the crawling lag, i.e., the delay between the publication time of an article and the time when it is added to the feed. With regular crawling, it is common to see lags of well over 30 minutes. The best results are obtained if push notifications by the publishers notify the system of new content, which usually reduces the lag to less than 5 minutes (compared to the reported publishing time; we cannot avoid lag on the publisher side between assigning the publishing date, the actual availability on the web site, and the delivery of the notification). Republished articles and duplicates add further data quality challenges. This requires complex systems to gather the data, until news outlets provide reliable and low-latency push notification APIs.

Figure 5: Temporal development of news covering the plane crash of the Saratov Airlines passenger flight. Panels: (a) prior to the event, (b) emerging, (c) dominant topic.

4 Conclusions and Future Directions

This paper calls for future research in many directions. The current visualization is limited by readability aspects, because we cannot display much more than 50 words. To improve the user experience when exploring the data, it will be interesting to instead cluster many more words and allow a drill-down approach in which the user can zoom into groups of words that are then displayed in more detail with additional words.

Our prototype provides a single link for each word. A Google News style user interface that allows the user to browse related results from different news sources based on short snippets would be beneficial.

All in all, the significance of cooccurrences and its visualization is an interesting step towards exploring the interactions of multiple entities in current news articles, and allows both for interesting applications and for further research.
References

[AV09] M. Atkinson and E. Van der Goot. Near real time information mining in multilingual news. In WWW, 2009.

[BKP14] L. Barth, S. G. Kobourov, and S. Pupyrev. Experimental comparison of semantic word clouds. In SEA, 2014.

[CM05] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. J. Algorithms, 55(1), 2005.

[EAM+17] J. Esquivel, D. Albakour, M. Martinez-Alvarez, D. Corney, and S. Moussa. On the long-tail entities in news. In ECIR, 2017.

[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In ACM KDD, 1996.

[FA07] A. Feng and J. Allan. Finding and linking incidents in news. In ACM CIKM, 2007.

[FGK+10] I. Färber, S. Günnemann, H.-P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, and A. Zimek. On using class-labels in evaluation of clusterings. In MultiClust Workshop, 2010.

[GTBd14] D. Graus, M. Tsagkias, L. Buitinck, and M. de Rijke. Generating pseudo-ground truth for predicting new concepts in social streams. In ECIR, 2014.

[HR02] G. E. Hinton and S. T. Roweis. Stochastic neighbor embedding. In NIPS 15, 2002.

[KMS+10] M. Krstajic, F. Mansmann, A. Stoffel, M. Atkinson, and D. A. Keim. Processing online news streams for large-scale semantic analysis. In ICDE Workshops, 2010.

[KR90] L. Kaufman and P. J. Rousseeuw. Agglomerative Nesting (Program AGNES). John Wiley & Sons, Inc., 1990.

[LBSW12] S. Lohmann, M. Burch, H. Schmauder, and D. Weiskopf. Visual analysis of microblog content using time-varying co-occurrence highlighting in tag clouds. In Advanced Visual Interfaces, AVI, 2012.

[LHKC10] B. Lee, N. Henry Riche, A. K. Karlson, and S. Carpendale. SparkClouds: Visualizing trends in tag clouds. IEEE Trans. Vis. Comput. Graph., 16(6), 2010.

[MSB+14] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations, 2014.

[SA00] R. C. Swan and J. Allan. Automatic generation of overview timelines. In ACM SIGIR Conf. on Research and Development in Information Retrieval, 2000.

[SAG17] A. Spitz, S. Almasian, and M. Gertz. EVELIN: Exploration of event and entity links in implicit networks. In Int. Conf. on World Wide Web, WWW Companion, 2017.

[SG16] A. Spitz and M. Gertz. Terms over LOAD: Leveraging named entities for cross-document extraction and summarization of events. In ACM SIGIR Conf. on Research and Development in Information Retrieval, 2016.

[SKE+15] E. Schubert, A. Koos, T. Emrich, A. Züfle, K. A. Schmid, and A. Zimek. A framework for clustering uncertain data. Proc. VLDB Endowment, 8(12), 2015.

[SSE+17] E. Schubert, J. Sander, M. Ester, H.-P. Kriegel, and X. Xu. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 2017.

[SSW+17] E. Schubert, A. Spitz, M. Weiler, J. Geiß, and M. Gertz. Semantic word clouds with background corpus normalization and t-distributed stochastic neighbor embedding. CoRR, abs/1708.03569, 2017.

[SWK14] E. Schubert, M. Weiler, and H.-P. Kriegel. SigniTrend: Scalable detection of emerging topics in textual streams by hashed significance thresholds. In ACM KDD, 2014.

[SWK16] E. Schubert, M. Weiler, and H.-P. Kriegel. SPOTHOT: Scalable detection of geo-spatial events in large textual streams. In SSDBM, 2016.

[TLP+08] B. E. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. NewsStand: A new view on news. In ACM SIGSPATIAL, 2008.