Exploring a Large News Collection Using Visualization Tools

Tiago Devezas1,2 (tdevezas@fe.up.pt), José Devezas2 (jld@fe.up.pt), Sérgio Nunes1,2 (ssn@fe.up.pt)
INESC TEC1 and DEI2, FEUP, University of Porto
Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal

Abstract

The overwhelming amount of news content published online every day has made it increasingly difficult to perform macro-level analysis of the news landscape. Visual exploration tools harness both computing power and human perception to assist in making sense of large data collections. In this paper, we employed three visualization tools to explore a dataset comprising one million articles published by news organizations and blogs. The visual analysis of the dataset revealed that 1) news and blog sources evaluate the importance of similar events very differently, granting them distinct amounts of coverage, 2) there are both dissimilarities and overlaps in the publication patterns of the two source types, and 3) the content's direction and diversity behave differently over time.

1 Introduction

Finding valuable information in large collections of data can resemble looking for a needle in a haystack. An effective way to address this problem is the use of data visualization tools to explore datasets [Kei01]. The presentation of abstract data through interactive visual tools leverages human perceptual abilities and enhances cognitive performance, thus promoting discovery and sensemaking. In this paper, we present three distinct visualization tools for exploring large news collections, and apply them to the Signal Media One-Million News Articles Dataset1, a collection of one million news and blog articles.

We show three use cases that highlight how these tools allow the investigation of distinct dimensions of the data. The first case evaluates how the hierarchy of importance given to a set of selected global events, manifested through the amount of coverage, varies between news and blog sources. The second investigates the publication patterns of both source types during 24-hour and seven-day weekly cycles. The third use case studies the variation of topical diversity for news and blogs over time, and employs a visualization tool developed specifically for this work. To develop this tool, an analysis was conducted to identify the topic vectors representing the directions followed daily by the articles' contents, compute a diversity score, and measure the topic diversity over time for news and blogs.

Copyright © 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20-March-2016, published at http://ceur-ws.org

1 http://research.signalmedia.co/newsir16/signal-dataset.html

2 Corpus Characterization

The Signal 1M Dataset comprises one million articles published by 93,345 distinct media sources of two types: news and blogs. An analysis of the articles' media type reveals that 18,533 sources published exclusively news articles, 74,333 sources published only blog stories, and 479 had documents of both types. As for the article count by media type, nearly three-fourths were news (734,488 or 73.4%) and one-fourth blog items (265,512 or 26.6%). Thus, despite their lower number, news sources were responsible for the publication of the majority of articles.

Even though the publication period extends from Jul 2nd 2015 to Sep 30th 2015, the majority of the articles were published between Sep 1st 2015 and Sep 30th 2015 (987,248 or 98.7%). Of these, 734,488 (74.4%) were news articles and 265,512 (26.9%) blog articles.
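As an illustration, this breakdown can be reproduced from the dataset's JSON-lines distribution with a few lines of Python. This is only a sketch: the field names `source` and `media-type` are assumed from the dataset's schema, and the helper name is ours.

```python
import json
from collections import Counter, defaultdict

def corpus_breakdown(jsonl_lines):
    """Tally article counts per media type and classify each source as
    news-only, blog-only, or mixed (published both types)."""
    articles_by_type = Counter()
    types_by_source = defaultdict(set)
    for line in jsonl_lines:
        doc = json.loads(line)
        articles_by_type[doc["media-type"]] += 1
        types_by_source[doc["source"]].add(doc["media-type"])
    source_classes = Counter(
        "mixed" if len(kinds) > 1 else next(iter(kinds)).lower() + "-only"
        for kinds in types_by_source.values()
    )
    return articles_by_type, source_classes
```

Running this over the full dump yields the per-type article totals and the news-only / blog-only / mixed source counts reported above.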
The highest number of articles published by a single source was 192,228, and the lowest was a single article. Regarding the overall distribution of articles, the majority of the sources (91,693 or 98.2%) published 100 articles or less, 1,565 sources (1.7%) published between 101 and 1,000 articles, 85 (0.09%) between 1,001 and 5,000, one (0.001%) between 5,001 and 10,000, and one between 10,000 and 20,000 articles.

The topic analysis conducted for each media type stream (see Section 5.3.2) found that the top five n-grams, based on the TF-IDF score of the topic vectors, were 'south africa', 'pope francis', 'total volume table', 'high school football', and 'college football' for news articles, and 'star wars', 'school district', 'syrian refugees', 'executive director', and 'kansas city' for the blog document set.

3 Visualization of Large News Archives

The visualization and analysis of large volumes of news content is an emerging field of research [KBMK10]. The ThemeRiver application [HHN02] was one of the first efforts in this domain. It provides an interactive visualization of thematic changes across a large set of news documents over time. It uses the metaphor of a river to assist in the recognition of relationships, trends and patterns in the data. Themes are displayed as colored streams whose width — the measure of their strength — varies as they flow across time from left to right. A similar river-like visual metaphor is employed by the NewsLab system [GLYR07], which allows exploratory analysis of the temporal variation of themes, and of their hierarchical structure, in a large collection of news videos.

Krstajić et al. [KBK11] present CloudLines, a visualization technique that displays a compact view of multiple time series, each showing a sequence of related events and event episodes (high-density sequences of events). The relative importance of events is conveyed through variations in the clusters' opacity and size. The system also permits fine-detailed analysis of individual event data points.

The complexities of visualizing the dynamics of news data streams are addressed by Krstajić et al. [KBMK10]. The system displays the evolution of news in real time by converting the stream into threads comprised of similar articles. In addition to showing recent threads, the system computes the threads' relevance on the fly — based on the items' age and their relationships — to determine which threads to keep on screen and which ones to remove.

The development of news stories and their relationships through time is also explored by Story Tracker [KNAMK13]. The application represents the evolution of stories over time, and how they merge and split. Story clusters are displayed as rectangles whose size corresponds to the number of articles and which include labels for the story title and the most important keywords. Related clusters have the same color, are edge-connected, and can be zoomed to the level of the individual articles that compose them.

The NewsStream service [NGSM15] provides several interactive tools to visually explore a continuously updated collection of financial articles, published via the RSS feeds of multiple news and blog sources. The system displays occurrences and co-occurrences of financial and geographic entities in the news, the related sentiment, a summary of the linked content through tag clouds, and temporal country co-occurrence networks displayed on a world map.

4 The MediaViz Platform

The MediaViz platform [DNR15] aims to assist in gaining insight from a large archive of news through interactive visualization tools. It comprises two components. The first is a back-end application that fetches and stores articles published via the RSS feeds of multiple online news sources and provides access to the data through an API. The second is a client application which retrieves the data provided by the API and allows its exploration through interactive visualization tools. Our approach is based on open technologies and was built with extensibility in mind: the client application is decoupled from the back-end, so it can be configured to work with different datasets with minimal effort. For this paper, we stored the Signal 1M Dataset in a relational database and built a simple API. No major modifications were required for the existing visualization tools to work with the new API. However, a new tool was developed to explore topic diversity over time for news and blog articles. A fully functional demo is available online2.

2 http://irlab.fe.up.pt/p/mediaviz/newsir/

5 MediaViz Visualization Tools

Rather than focusing on individual sources, we opted to explore the two types of media sources that comprise the corpus — news and blogs — as they allow a macro-level analysis and comparison of the dataset.

5.1 Variations in Coverage

The dynamics of the coverage that each source type granted to different themes over time are displayed by the Keywords tool. Users can insert multiple search terms and see how many articles (in absolute terms or as a percentage of all articles published on the respective day) with those keywords were published daily during the selected period. Additional context can be obtained by clicking the data points, which displays a list of all related articles. Each list item includes the title, summary, publication date and the source's name, and can be clicked to display the full text.

Figure 1 displays the daily percentage of articles published between Sep 1st 2015 and Sep 30th 2015 by each source type with the terms 'star wars', 'chile earthquake', 'tsipras', and 'stampede saudi arabia'. These particular terms were chosen because they are related to some relevant global events — identified after consulting several online resources — that took place in September 2015. The visualization's peaks highlight the selected events: the merchandise for the latest Star Wars movie was released on Sep 4th; an earthquake in Chile, which led to the evacuation of millions of people, took place on Sep 16th; on Sep 20th, Alexis Tsipras was reelected as Prime Minister of Greece after resigning and calling a snap election; and, on Sep 24th, hundreds of people died in a stampede during the annual pilgrimage to Mecca, in Saudi Arabia. As shown in Figure 1, the attention given to these events differed greatly between the two source types. News sources (top) gave similar attention to each event, while in blogs (bottom) the primacy belongs to articles mentioning Star Wars.

Figure 1: MediaViz Keywords tool. Top: Daily percentage of articles published by all news sources containing the given terms. Bottom: Daily percentage of articles published by all blog sources containing the same terms.

5.2 Publication Patterns

The Sources tool allows the comparison of publication patterns (count and percentage of articles) for multiple sources according to distinct temporal granularities: weekly, monthly and 24-hour cycles. To obtain comparable results, publication times are converted to the UTC time standard. The ability to compare several sources on the same screen can thus provide meaningful perspectives regarding their production cycles.
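The normalization to UTC and the bucketing by weekday and by hour of day behind this kind of comparison can be sketched in a few lines. This is a minimal illustration using only the standard library, not the MediaViz implementation itself, and it assumes ISO-8601 timestamps with an explicit UTC offset.

```python
from collections import Counter
from datetime import datetime, timezone

def publication_profile(timestamps):
    """Bucket ISO-8601 publication times (with UTC offsets) by UTC
    weekday and hour. Weekdays are 0=Monday .. 6=Sunday."""
    weekday_counts, hour_counts = Counter(), Counter()
    for ts in timestamps:
        # Parse the offset-aware timestamp and normalize it to UTC.
        dt = datetime.fromisoformat(ts).astimezone(timezone.utc)
        weekday_counts[dt.weekday()] += 1
        hour_counts[dt.hour] += 1
    return weekday_counts, hour_counts
```

Comparing the two resulting histograms for the news and blog streams gives exactly the weekday and 24-hour profiles discussed next.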
This can be seen in Figure 2. News sources published a higher percentage of articles than blogs during business days, a behavior that is reversed during the weekend. While this pattern might be expected, given the particularities of each media type, the Sources tool quantitatively shows that the assertion is indeed true.

Figure 2: MediaViz Sources tool. Percentage of articles published by both source types for each day of the week.

When looking at a 24-hour cycle, news and blog sources exhibit similar patterns. As Figure 3 displays, publications follow a typical working schedule: the most active publication period occurs between 08:00 and 16:00 UTC, after which activity gradually decreases. One possible explanation for this overlap is the growing professionalization and influence of blogs, which often compete with traditional news sources for online eyeballs. The most significant difference between the two patterns, the news sources' peak at 07:00, can potentially be explained by the publication of early morning news.

Figure 3: MediaViz Sources tool. Percentage of articles published by both source types during a 24-hour cycle.

5.3 Diversity Explorer

The Diversity Explorer tool was developed specifically for this work. Below we describe our strategy for detecting topics and measuring topical diversity between the news and blog streams.

5.3.1 Topic Detection

Our topic detection strategy was based on the clustering of text documents using n-grams of size n = 2 (bigrams) and n = 3 (trigrams) as features. The base strategy consisted of, for a given day, transforming each document into a bag of n-grams and then running k-means [HW79] using the n-gram frequencies as features.
The value of k was selected based on the Silhouette method [Rou87], by testing successive values of k ∈ [2, 15] on a random sample of 100 or fewer documents — fewer in case less than 100 documents were available. Constraining the value of k indirectly enforced the number of topics to range between 2 and 15. The result of this process was a set of k topics, represented by the centroid of each cluster and associated with the documents of each day.

Prior to the clustering phase, and in order to ensure performance, we reduced the number of features by removing n-grams that were over 99.6% sparse, i.e., features with more than 99.6% zeros, which were less useful in distinguishing documents. The sparsity threshold of 99.6% was determined empirically, by experimenting with the largest daily document set and ensuring that the number of features would not explode (a 99% decrease, from 1,834,310 to 350 features, for the largest daily document set), but also with smaller daily document sets, to ensure that the number of features would not become too small (a decrease of nearly 0% for daily document sets with less than 100 documents). After completing the feature reduction process, we repeated the previously described clustering process on the smaller matrix, obtaining k topic vectors that illustrated the different directions followed by the contents of daily news.

5.3.2 Measuring Topic Diversity

In order to measure topic diversity within a corpus, we took the topic vectors for a given day and performed an element-wise aggregation based on the maximum weight of each n-gram. This resulted in a set of daily vectors, describing the overall topical direction of news and blog articles per day.

Our approach to measuring topic diversity was based on a combined distance metric between all n-gram daily vectors, for a given corpus — the more distant the topics are from every other topic, the higher the diversity. We computed the normalized cosine distances X for each pair of n-gram daily vectors, separately for the news and blog corpora. Next, we calculated the mean and standard deviation of the obtained values, and combined the mean E[X] and standard deviation σ(X) into a diversity score, as described in Equation 1.

score(X) = E[X] − 2 × (F(E[X]; 0.5, 1/50) − 0.5) × (1 − E[X]) × (E[X] × σ(X))    (1)

F(x; µ, s) = 1 / (1 + e^(−(x−µ)/s))    (2)

The idea was for the variance to affect the mean cosine distance in the following way: for a low mean, a low variance would result in a small increase, while a high variance would result in a large increase; for a high mean, a low variance would result in a small decrease, while a high variance would result in a large decrease. For example, given a mean cosine distance of 0.9 with a 0.9 standard deviation, we know that there are several values below the mean and that, since we are using a normalized cosine distance, its maximum is one. Thus, it makes sense to decrease (negative sign) the diversity score, with the intuition that a subset of documents would be less diverse among themselves than average. On the other hand, for a mean cosine distance of 0.1, it would only make sense to increase (positive sign) the value based on the standard deviation. To determine the sign, we took advantage of a logistic distribution (Equation 2), centered on µ = 0.5 and scaled by s = 1/50. We used it as a sign function by shifting the result by −0.5 and multiplying by 2, which gave us a value in the interval [−1, 1] with a sigmoidal behavior. We then combined the mean and standard deviation to obtain the absolute value of the increase or decrease, and multiplied it by the sign function.

We repeated this process for news, blogs, and the concatenated n-gram daily vectors of both corpora, for an overall topic diversity measurement. This resulted in a diversity score between zero and one, where zero meant that all the topics were exactly the same, while one meant that all the topics were completely distinct. Based on our results, topics have, overall for the combined samples, a diversity score of 0.970, a value that is as high as 0.986 for blogs, and as low as 0.976 for news. Topic diversity is similarly high in either case, despite blogs having a slightly higher diversity score.

5.3.3 Exploring Diversity Over Time

We also measured topic diversity over time, for small temporal windows, comparing news and blogs. Figure 4 shows the resulting diversity score for a sequence of 5-day windows starting at the given date (x-axis), from Sep 1st to Sep 30th 2015, with news in green and blogs in red. As we can see, both corpora have a diversity behavior that is similar over time, with the exception of the temporal windows from Sep 15th to Sep 19th 2015. The correlation between the two diversity score distributions is 28.9% for the whole month of September, but rises to 69.3% when ignoring the period of Sep 15-19. We calculated the differences between diversity scores over time and found that the temporal window starting at Sep 19th 2015 represented the largest break in consistency between news and blogs, with a difference in diversity of 0.205.
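Equations 1 and 2 translate directly into code. The sketch below uses our own variable names and the population standard deviation; it is an illustration of the scoring scheme, not the authors' implementation.

```python
import math
from statistics import mean, pstdev

def logistic(x, mu=0.5, s=1 / 50):
    """Equation 2: the logistic function used to determine the sign."""
    return 1 / (1 + math.exp(-(x - mu) / s))

def diversity_score(distances):
    """Equation 1: combine the mean and standard deviation of the pairwise
    normalized cosine distances into a diversity score."""
    m, sd = mean(distances), pstdev(distances)
    sign = 2 * (logistic(m) - 0.5)   # sigmoidal value in [-1, 1]
    magnitude = (1 - m) * (m * sd)   # shrinks near the extremes of the mean
    return m - sign * magnitude
```

As the surrounding text explains, a high mean with a high spread pulls the score down, while a low mean with a high spread pushes it up.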
We analyzed the n-grams of the topics, for each corpus, within this temporal window. For the news corpus, we found 111 unique n-grams out of 175 total n-grams, meaning that 63.43% of the n-grams are unique, which indicates a high diversity. On the other hand, for the blog corpus, we found 64 unique n-grams out of 164 total n-grams, meaning that 39.02% of the n-grams are unique, which indicates a lower diversity. This is consistent with our diversity score. We also calculated the Jaccard index for the sets of n-grams of each corpus, for the Sep 19th 2015 temporal window, finding that 15.89% of the total number of unique n-grams appear in both news and blogs.

Figure 4: MediaViz Diversity Explorer. Top: diversity over time for windows of 5 days, starting at the given date. Bottom: number of documents for windows of 5 days, starting at the given date.

6 Conclusion

In this paper we presented an exploration of the Signal 1M Dataset, which comprises a large collection of news and blog articles, using distinct visualization tools. The visual analysis of the corpus provided interesting perspectives that would be much more difficult to obtain without the assistance of such tools. The Keywords tool allowed us to see that news and blog sources granted different levels of importance to a given set of keywords related to major global events that took place in September 2015. It was also evident, using the Sources tool, that the temporal publication patterns of these two media behaved differently — blogs published a higher percentage of content during the weekend than news sources — but also similarly — both source types followed an identical curve during a 24-hour cycle. Finally, through the Diversity Explorer tool, we were able to visualize variations in the dynamics of topical diversity over time for each media type's content stream.

Acknowledgements

Project 'NORTE-01-0145-FEDER-000020' is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).

References

[DNR15] Tiago Devezas, Sérgio Nunes, and María Teresa Rodríguez. MediaViz: An interactive visualization platform for online media studies. In Proceedings of the 2015 International Workshop on Human-centric Independent Computing, pages 7-11. ACM, 2015.

[GLYR07] Mohammad Ghoniem, Dongning Luo, Jing Yang, and William Ribarsky. NewsLab: Exploratory broadcast news video analysis. In IEEE Symposium on Visual Analytics Science and Technology (VAST 2007), pages 123-130. IEEE, 2007.

[HHN02] Susan Havre, Beth Hetzler, and Lucy Nowell. ThemeRiver: In search of trends, patterns, and relationships. IEEE Transactions on Visualization and Computer Graphics, 8(1):9-20, 2002.

[HW79] J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. Journal of the Royal Statistical Society, 28(1):100-108, 1979.

[KBK11] Miloš Krstajić, Enrico Bertini, and Daniel A. Keim. CloudLines: Compact display of event episodes in multiple time-series. IEEE Transactions on Visualization and Computer Graphics, 17(12):2432-2439, 2011.

[KBMK10] Miloš Krstajić, Enrico Bertini, Florian Mansmann, and Daniel A. Keim. Visual analysis of news streams with article threads. In Proceedings of the First International Workshop on Novel Data Stream Pattern Mining Techniques, pages 39-46. ACM, 2010.

[Kei01] Daniel A. Keim. Visual exploration of large data sets. Communications of the ACM, 44(8):38-44, 2001.

[KNAMK13] Miloš Krstajić, Mohammad Najm-Araghi, Florian Mansmann, and Daniel A. Keim. Story Tracker: Incremental visual text analytics of news story development. Information Visualization, 12(3-4):308-323, 2013.

[NGSM15] Petra Kralj Novak, Miha Grcar, Borut Sluban, and Igor Mozetic. Analysis of financial news with NewsStream. Technical report IJS-DP-11965, CoRR, abs/1508.00027, 2015.

[Rou87] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65, 1987.