<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring a Large News Collection Using Visualization Tools</article-title>
      </title-group>
      <contrib-group>
<contrib contrib-type="author">
          <string-name>Jose Devezas</string-name>
          <email>jld@fe.up.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Nunes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tiago Devezas</string-name>
          <email>tdevezas@fe.up.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FEUP, University of Porto</institution>
          ,
          <addr-line>Rua Dr. Roberto Frias, s/n, 4200-465 Porto</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
<p>The overwhelming amount of news content published online every day has made it increasingly difficult to perform macro-level analysis of the news landscape. Visual exploration tools harness both computing power and human perception to assist in making sense of large data collections. In this paper, we employed three visualization tools to explore a dataset comprising one million articles published by news organizations and blogs. The visual analysis of the dataset revealed that 1) news and blog sources evaluate the importance of similar events very differently, granting them distinct amounts of coverage, 2) there are both dissimilarities and overlaps in the publication patterns of the two source types, and 3) the content's direction and diversity behave differently over time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Finding valuable information in large collections of data can resemble looking for a needle in a haystack. An effective way to address this problem is the use of data visualization tools to explore datasets [Kei01]. The presentation of abstract data through interactive visual tools leverages human perceptual abilities and enhances cognitive performance, thus promoting discovery and sensemaking. In this paper, we present three distinct visualization tools for exploring large news collections, and apply them to the Signal Media One-Million News Articles Dataset (http://research.signalmedia.co/newsir16/signal-dataset.html), a collection of one million news and blog articles.</p>
<p>We show three use cases that highlight how these tools allow the investigation of distinct dimensions of the data. The first case evaluates how the hierarchy of importance given to a set of selected global events, manifested through the amount of coverage, varies between news and blog sources. The second investigates the publication patterns of both source types during 24-hour daily and seven-day weekly cycles. The third use case studies the variation of topical diversity for news and blogs over time and employs a visualization tool developed specifically for this work. To develop this tool, an analysis was conducted to identify the topic vectors representing the directions followed daily by the articles' contents, compute a diversity score, and measure the topic diversity over time for news and blogs.</p>
    </sec>
    <sec id="sec-2">
      <title>Corpus Characterization</title>
<p>The Signal 1M Dataset comprises one million articles published by 93,345 distinct media sources of two types: news and blogs. An analysis of the articles' media type reveals that 18,533 sources published exclusively news articles, 74,333 sources published only blog stories, and 479 had documents of both types. As for the article count by media type, nearly three-fourths were news (734,488 or 73.4%) and one-fourth blog items (265,512 or 26.6%). Thus, despite their lower number, news sources were responsible for the publication of the majority of articles.</p>
<p>Even though the publication period extends from Jul 2nd 2015 to Sep 30th 2015, the majority of the articles were published between Sep 1st 2015 and Sep 30th 2015 (987,248 or 98.7%). Of these, 734,488 (74.4%) were news articles and 265,512 (26.9%) blog articles. The highest number of articles published by a single source was 192,228 and the lowest amount, a single article. Regarding the overall distribution of articles, the majority of the sources (91,693 or 98.2%) published 100 articles or less, 1,565 sources (1.7%) published between 101 and 1,000 articles, 85 (0.09%) between 1,001 and 5,000, one (0.001%) between 5,001 and 10,000, and one between 10,000 and 20,000 articles.</p>
<p>The topic analysis conducted for each media type stream (see Section 5.3.2) found that the top five n-grams, based on the TF-IDF score of the topic vectors, were 'south africa', 'pope francis', 'total volume table', 'high school football', and 'college football' for news articles, and 'star wars', 'school district', 'syrian refugees', 'executive director', and 'kansas city' for the blog document set.</p>
    </sec>
<sec id="sec-3">
      <title>Visualization of Large News Archives</title>
<p>The visualization and analysis of large volumes of news content is an emerging field of research [KBMK10]. The ThemeRiver application [HHN02] was one of the first efforts in this domain. It provides an interactive visualization of thematic changes across a large set of news documents over time. It uses the metaphor of a river to assist in the recognition of relationships, trends, and patterns in the data. Themes are displayed as colored streams whose width, a measure of their strength, varies as they flow across time from left to right. A similar river-like visual metaphor is employed by the NewsLab system [GLYR07], which allows exploratory analysis of the temporal variation of themes, and of their hierarchical structure, in a large collection of news videos.</p>
<p>Krstajic et al. [KBK11] present CloudLines, a visualization technique that displays a compact view of multiple time series, each showing a sequence of related events and event episodes (high-density sequences of events). The relative importance of events is conveyed through variations in the clusters' opacity and size. The system also permits fine-grained analysis of individual event data points.</p>
<p>The complexities of visualizing the dynamics of news data streams are addressed by Krstajic et al. [KBMK10]. The system displays the evolution of news in real time by converting the stream into threads comprised of similar articles. In addition to showing recent threads, the system computes the threads' relevance on the fly, based on the items' age and their relationships, to determine which threads to keep on screen and which ones to remove.</p>
<p>The development of news stories and their relationships through time is also explored by Story Tracker [KNAMK13]. The application represents the evolution of stories over time, and how they merge and split. Story clusters are displayed as rectangles whose size corresponds to the number of articles, with labels for the story title and the most important keywords. Related clusters have the same color, are edge-connected, and can be zoomed down to the level of the individual articles that compose them.</p>
<p>The NewsStream service [NGSM15] provides several interactive tools to visually explore a continuously updated collection of financial articles, published via the RSS feeds of multiple news and blog sources. The system displays occurrences and co-occurrences of financial and geographic entities in the news, the related sentiment, a summary of the linked content through tag clouds, and temporal country co-occurrence networks displayed on a world map.</p>
    </sec>
    <sec id="sec-5">
      <title>The MediaViz Platform</title>
<p>The MediaViz platform [DNR15] aims to assist in gaining insight from a large archive of news through interactive visualization tools. It comprises two components. The first is a back-end application that fetches and stores articles published via the RSS feeds of multiple online news sources and provides access to the data through an API. The second is a client application which retrieves the data provided by the API and allows its exploration through interactive visualization tools. Our approach is based on open technologies and was built with extensibility in mind: the client application is decoupled from the back-end, so it can be configured to work with different datasets with minimal effort. For this paper, we stored the Signal 1M Dataset in a relational database and built a simple API. No major modifications were required for the existing visualization tools to work with the new API. However, a new tool was developed to explore topic diversity over time for news and blog articles. A fully functional demo is available online (http://irlab.fe.up.pt/p/mediaviz/newsir/).</p>
    </sec>
    <sec id="sec-6">
      <title>MediaViz Visualization Tools</title>
<p>Rather than focusing on individual sources, we opted to explore the two types of media sources that comprise the corpus, news and blogs, as they allow a macro-level analysis and comparison of the dataset.</p>
      <sec id="sec-6-1">
        <title>Variations in Coverage</title>
<p>The dynamics of the coverage that each source type granted to different themes over time are displayed by the Keywords tool. Users can insert multiple search terms and see how many articles containing those keywords were published daily during the selected period, either in absolute terms or as a percentage of all articles published on the respective day. Additional context can be obtained by clicking the data points, which displays a list of all related articles. Each list item includes the title, summary, publication date, and source name, and can be clicked to display the full text.</p>
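<p>The quantity plotted by the Keywords tool can be illustrated with a simple pass over the collection (a sketch, not the MediaViz implementation; the articles and keyword below are invented for the example).</p>
        <preformat>
```python
from collections import Counter

def daily_keyword_share(articles, keyword):
    """articles: iterable of (date, text) pairs. Returns {date: percentage of
    that day's articles whose text contains the keyword}."""
    totals, matches = Counter(), Counter()
    needle = keyword.lower()
    for date, text in articles:
        totals[date] += 1
        if needle in text.lower():
            matches[date] += 1
    return {date: 100.0 * matches[date] / totals[date] for date in totals}

articles = [
    ("2015-09-04", "Star Wars merchandise hits the shelves"),
    ("2015-09-04", "Markets close higher on tech rally"),
    ("2015-09-16", "Earthquake strikes off the coast of Chile"),
    ("2015-09-16", "New Star Wars trailer breaks viewing records"),
]
print(daily_keyword_share(articles, "star wars"))  # 50% on each of the two days
```
        </preformat>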
<p>Figure 1 displays the daily percentage of articles published between Sep 1st 2015 and Sep 30th 2015 by each source type with the terms 'star wars', 'chile earthquake', 'tsipras', and 'stampede saudi arabia'. These particular terms were chosen because they are related to relevant global events, identified after consulting several online resources, that took place in September 2015. The visualization's peaks highlight the selected events: the merchandise for the latest Star Wars movie was released on Sep 4th; an earthquake in Chile which led to the evacuation of millions of people took place on Sep 16th; on Sep 20th, Alexis Tsipras was reelected as Prime Minister of Greece after resigning and calling a snap election; and, on Sep 24th, hundreds of people died after a stampede during the annual pilgrimage to Mecca, in Saudi Arabia. As shown in Figure 1, the attention given to these events differed greatly between the two source types. News sources (top) gave similar attention to each event, while in blogs (bottom) the primacy belongs to articles mentioning Star Wars.</p>
        <p>The Sources tool allows the comparison of publication patterns (count and percentage of articles) for multiple sources according to distinct temporal granularities: weekly, monthly, and 24-hour cycles. To obtain comparable results, publication times are converted to the UTC time standard. The ability to compare several sources on the same screen can thus provide meaningful perspectives regarding their production cycles. This can be seen in Figure 2. News sources published a higher percentage of articles than blogs during business days, a behavior that is reversed during the weekend. While this pattern might be expected, given the particularities of each media type, the Sources tool shows quantitatively that this assertion is indeed true.</p>
<p>When looking at a 24-hour cycle, news and blog sources exhibit similar patterns. As Figure 3 displays, publications follow a typical working schedule: the most active publication period occurs between 08:00 and 16:00 UTC and then gradually decreases. One possible explanation for this overlap is the growing professionalization and influence of blogs, which often compete with traditional news sources for online eyeballs. The most significant difference between the two patterns, the news sources' peak at 07:00, can potentially be explained by the publication of early morning news.</p>
        <p>The Diversity Explorer tool was developed specifically for this work. Below we describe our strategy for detecting topics and measuring topical diversity between the news and blog streams.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Topic Detection</title>
<p>Our topic detection strategy was based on the clustering of text documents using n-grams of size n = 2 (bigrams) and n = 3 (trigrams) as features. The base strategy consisted of, for a given day, transforming each document into a bag of n-grams and then running k-means [HW79] using the n-gram frequencies as features. The value of k was selected using the silhouette method [Rou87], by testing successive values of k ∈ [2, 15] on a random sample of at most 100 documents (the full set was used when fewer than 100 documents were available). Constraining the value of k indirectly enforced the number of topics to range between 2 and 15. The result of this process was a set of k topics, represented by the centroid of each cluster and associated with the documents for each day.</p>
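<p>The clustering step described above can be sketched as a minimal, self-contained re-implementation of k-means with silhouette-based selection of k (the paper relies on [HW79] and [Rou87]; this toy version runs on two-dimensional points rather than on n-gram frequency vectors).</p>
        <preformat>
```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=0):
    # Plain Lloyd's algorithm: assign to nearest centroid, then recompute means.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

def silhouette(points, labels, k):
    # Mean silhouette coefficient s(i) = (b - a) / max(a, b) [Rou87].
    scores = []
    for i, p in enumerate(points):
        same = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not same:
            continue  # singleton cluster: silhouette undefined, skip the point
        a = sum(dist(p, q) for q in same) / len(same)
        b = None  # smallest mean distance to any other cluster
        for c in range(k):
            if c == labels[i]:
                continue
            others = [q for j, q in enumerate(points) if labels[j] == c]
            if others:
                d = sum(dist(p, q) for q in others) / len(others)
                b = d if b is None else min(b, d)
        if b is not None:
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores) if scores else -1.0

def pick_k(points, candidates=range(2, 16)):
    # Test successive values of k and keep the one with the best silhouette.
    best_k, best_score = None, -2.0
    for k in candidates:
        if k >= len(points):
            break
        score = silhouette(points, kmeans(points, k), k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Two well-separated blobs, so the silhouette method should settle on k = 2.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
print(pick_k(points))
```
        </preformat>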
<p>Prior to the clustering phase, and in order to ensure acceptable performance, we reduced the number of features by removing n-grams that were over 99.6% sparse; that is, features with more than 99.6% zeros, which are less useful in distinguishing documents, were simply discarded. The sparsity threshold of 99.6% was determined empirically, by experimenting with the largest daily document set and ensuring that the number of features would not explode (a 99% decrease, from 1,834,310 to 350 features, for the largest daily document set), but also with smaller daily document sets to ensure that the number of features would not become too small (a decrease of nearly 0% for daily document sets with fewer than 100 documents). After completing the feature reduction process, we repeated the previously described clustering process on the smaller matrix, obtaining k topic vectors that illustrated the different directions followed by the contents of the daily news.</p>
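<p>The sparsity-based feature reduction amounts to dropping every n-gram column whose fraction of zeros exceeds the threshold. A sketch (with a 0.5 threshold standing in for the paper's 99.6%, so the toy example has something to remove):</p>
        <preformat>
```python
def reduce_features(matrix, max_sparsity):
    """matrix: one row of n-gram counts per document. Drops every column whose
    fraction of zeros exceeds max_sparsity; returns (reduced, kept_indices)."""
    n_docs = len(matrix)
    kept = []
    for j in range(len(matrix[0])):
        zeros = sum(1 for row in matrix if row[j] == 0)
        if zeros / n_docs > max_sparsity:
            continue  # too sparse to help distinguish documents
        kept.append(j)
    reduced = [[row[j] for j in kept] for row in matrix]
    return reduced, kept

# Toy document-term counts: four documents, four n-gram columns.
counts = [
    [1, 0, 2, 0],
    [3, 0, 1, 0],
    [0, 0, 2, 1],
    [2, 0, 0, 0],
]
reduced, kept = reduce_features(counts, max_sparsity=0.5)
print(kept)  # [0, 2]: the all-zero column and the 75%-zero column are dropped
```
        </preformat>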
      </sec>
      <sec id="sec-6-3">
        <title>Measuring Topic Diversity</title>
<p>In order to measure topic diversity within a corpus, we took the topic vectors for a given day and performed an element-wise aggregation based on the maximum weight of each n-gram. This resulted in a set of daily vectors, each describing the overall topical direction of news and blog articles for that day.</p>
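<p>The daily aggregation step reduces to an element-wise maximum across a day's topic vectors (the weights and n-gram order below are illustrative):</p>
        <preformat>
```python
def aggregate_topics(topic_vectors):
    """topic_vectors: equal-length weight vectors -> element-wise maximum."""
    return [max(weights) for weights in zip(*topic_vectors)]

# Columns: ('star wars', 'pope francis', 'chile earthquake')
day_topics = [
    [0.9, 0.0, 0.1],
    [0.0, 0.7, 0.2],
]
print(aggregate_topics(day_topics))  # [0.9, 0.7, 0.2]
```
        </preformat>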
<p>Our approach to measuring topic diversity was based on a combined distance metric between all n-gram daily vectors for a given corpus: the more distant each topic is from every other topic, the higher the diversity. We computed the normalized cosine distances X for each pair of n-gram daily vectors, separately for the news and blog corpora. Next, we calculated the mean and standard deviation of the obtained values, and combined the mean E[X] and standard deviation σ(X) into a diversity score, as described in Equation 1.</p>
<p>score(X) = E[X] − 2 · (F(E[X]; 0.5, 1/50) − 0.5) · (1 − E[X]) · (E[X] · σ(X))   (1)</p>
        <p>F(x; μ, s) = 1 / (1 + e^(−(x−μ)/s))   (2)</p>
<p>The idea was for the variance to affect the mean cosine distance in the following way: for a low mean, a low variance would result in a small increase, while a high variance would result in a large increase; for a high mean, a low variance would result in a small decrease, while a high variance would result in a large decrease. For example, given a mean cosine distance of 0.9 with a 0.9 standard deviation, we know that there are several values below the mean, since we are using a normalized cosine distance whose maximum is one. Thus, it makes sense to decrease (negative sign) the diversity score, with the intuition that a subset of documents would be less diverse among themselves than average. On the other hand, for a mean cosine distance of 0.1, it would only make sense to increase (positive sign) the value based on the standard deviation. To determine the sign, we took advantage of a logistic distribution (Equation 2), centered on μ = 0.5 and scaled by s = 1/50. We used this as a sign function by shifting the result by 0.5 and multiplying by 2, which gave us a value in the interval [−1, 1] with a sigmoidal behavior. We then combined the mean and standard deviation to obtain the absolute value of the increase or decrease, and multiplied it by the sign function.</p>
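<p>Under this reading of Equations 1 and 2 (σ(X) taken as the standard deviation of the pairwise distances, and the logistic CDF F acting as a smooth sign function), the score can be sketched as follows; the function names and example distance lists are ours, for illustration only.</p>
        <preformat>
```python
import math
from statistics import mean, pstdev

def logistic_cdf(x, mu=0.5, s=1 / 50):
    # Equation 2: F(x; mu, s) = 1 / (1 + exp(-(x - mu) / s))
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def diversity_score(distances):
    """distances: pairwise normalized cosine distances, each in [0, 1]."""
    m, sd = mean(distances), pstdev(distances)
    # Smooth sign in [-1, 1]: roughly -1 for low means, +1 for high means.
    sign = 2.0 * (logistic_cdf(m) - 0.5)
    # Equation 1: a high mean is pushed down by the spread, a low mean pushed up.
    return m - sign * (1.0 - m) * (m * sd)

print(diversity_score([0.9, 0.95, 0.2, 0.99, 0.85]))  # high mean, decreased
print(diversity_score([0.1, 0.05, 0.8, 0.01, 0.15]))  # low mean, increased
```
        </preformat>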
<p>We repeated this process for news, for blogs, and for the concatenated n-gram daily vectors of both corpora, for an overall topic diversity measurement. This resulted in a diversity score between zero and one, where zero meant that all the topics were exactly the same, while one meant that all the topics were completely distinct. Based on our results, topics have, overall for the combined samples, a diversity score of 0.970, a value that is as high as 0.986 for blogs and as low as 0.976 for news. Topic diversity is similarly high in either case, despite blogs having a slightly higher diversity score.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Exploring Diversity Over Time</title>
<p>We also measured topic diversity over time, for small temporal windows, comparing news and blogs. Figure 4 shows the resulting diversity score for a sequence of 5-day windows starting at the given date (x-axis), from Sep 1st to Sep 30th 2015, with news in green and blogs in red. As we can see, both corpora have a diversity behavior that is similar over time, with the exception of the temporal windows from Sep 15th to Sep 19th 2015. The correlation between the two diversity score distributions is 28.9% for the whole month of September, but rises to 69.3% when ignoring the period of Sep 15-19. We calculated the differences between diversity scores over time and found that the temporal window starting at Sep 19th 2015 represented the largest break in consistency between news and blogs, with a difference in diversity of 0.205.</p>
<p>We analyzed the n-grams of the topics, for each corpus, within this temporal window. For the news corpus, we found 111 unique n-grams out of 175 total n-grams, meaning that 63.43% of the n-grams are unique, which indicates a high diversity. On the other hand, for the blog corpus, we found 64 unique n-grams out of 164 total n-grams, meaning that 39.02% of the n-grams are unique, which indicates a low diversity. This is consistent with our diversity score. We also calculated the Jaccard index for the sets of n-grams of the two corpora, for the Sep 19th 2015 temporal window, finding that 15.89% of the total number of unique n-grams appears in both news and blogs.</p>
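<p>The Jaccard index used in this overlap analysis is the size of the intersection over the size of the union of the two n-gram sets; a sketch with invented toy sets, where 'star wars' is the only shared n-gram:</p>
        <preformat>
```python
def jaccard(a, b):
    """Jaccard index of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a.intersection(b)) / len(a.union(b))

news_ngrams = {"pope francis", "chile earthquake", "star wars"}
blog_ngrams = {"star wars", "school district", "syrian refugees"}
print(jaccard(news_ngrams, blog_ngrams))  # 1 shared of 5 distinct n-grams: 0.2
```
        </preformat>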
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
<p>In this paper we presented the exploration of the Signal 1M Dataset, which comprises a large collection of news and blog articles, using distinct visualization tools. The visual analysis of the corpus provided interesting perspectives that would be much more difficult to obtain without the assistance of such tools. The Keywords tool allowed us to see that news and blog sources granted different levels of importance to a given set of keywords related to major global events that took place in September 2015. Using the Sources tool, it was also evident that the temporal publication patterns of these two media types behaved differently in some respects (blogs published a higher percentage of content during the weekend than news sources) but similarly in others (both followed an identical curve during a 24-hour cycle). Finally, through the Diversity Explorer tool, we were able to visualize variations in the dynamics of topical diversity over time for each media type's content stream.</p>
      <sec id="sec-7-1">
        <title>Acknowledgements</title>
<p>Project 'NORTE-01-0145-FEDER-000020' is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
<ref id="ref1">
        <mixed-citation>[DNR15] Tiago Devezas, Sergio Nunes, and María Teresa Rodríguez. MediaViz: An interactive visualization platform for online media studies. In Proceedings of the 2015 International Workshop on Human-centric Independent Computing, pages 7-11. ACM, 2015.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[GLYR07] Mohammad Ghoniem, Dongning Luo, Jing Yang, and William Ribarsky. NewsLab: Exploratory broadcast news video analysis. In IEEE Symposium on Visual Analytics Science and Technology (VAST 2007), pages 123-130. IEEE, 2007.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[HHN02] Susan Havre, Beth Hetzler, and Lucy Nowell. ThemeRiver: In search of trends, patterns, and relationships. IEEE Transactions on Visualization and Computer Graphics, 8(1):9-20, 2002.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[HW79] J. A. Hartigan and M. A. Wong. A K-Means Clustering Algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100-108, 1979.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[KBK11] Milos Krstajic, Enrico Bertini, and Daniel A. Keim. CloudLines: Compact display of event episodes in multiple time-series. IEEE Transactions on Visualization and Computer Graphics, 17(12):2432-2439, 2011.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[KBMK10] Milos Krstajic, Enrico Bertini, Florian Mansmann, and Daniel A. Keim. Visual analysis of news streams with article threads. In Proceedings of the First International Workshop on Novel Data Stream Pattern Mining Techniques, pages 39-46. ACM, 2010.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[Kei01] Daniel A. Keim. Visual exploration of large data sets. Communications of the ACM, 44(8):38-44, 2001.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[KNAMK13] Milos Krstajic, Mohammad Najm-Araghi, Florian Mansmann, and Daniel A. Keim. Story Tracker: Incremental visual text analytics of news story development. Information Visualization, 12(3-4):308-323, 2013.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[NGSM15] Petra Kralj Novak, Miha Grcar, Borut Sluban, and Igor Mozetic. Analysis of financial news with NewsStream. Technical report IJS-DP-11965, CoRR, abs/1508.00027, 2015.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[Rou87] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65, 1987.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>