=Paper=
{{Paper
|id=Vol-1568/paper9
|storemode=property
|title=Exploring a Large News Collection Using Visualization Tools
|pdfUrl=https://ceur-ws.org/Vol-1568/paper9.pdf
|volume=Vol-1568
|authors=Tiago Devezas,José Devezas,Sérgio Nunes
|dblpUrl=https://dblp.org/rec/conf/ecir/DevezasDN16
}}
==Exploring a Large News Collection Using Visualization Tools==
Tiago Devezas (1, 2), José Devezas (2), Sérgio Nunes (1, 2)
tdevezas@fe.up.pt, jld@fe.up.pt, ssn@fe.up.pt
(1) INESC TEC and (2) DEI, FEUP, University of Porto
Rua Dr. Roberto Frias, s/n
4200-465 Porto, Portugal

Copyright © 2016 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20-March-2016, published at http://ceur-ws.org

===Abstract===
The overwhelming amount of news content published online every day has made it increasingly difficult to perform macro-level analysis of the news landscape. Visual exploration tools harness both computing power and human perception to assist in making sense of large data collections. In this paper, we employed three visualization tools to explore a dataset comprising one million articles published by news organizations and blogs. The visual analysis of the dataset revealed that 1) news and blog sources assess the importance of similar events very differently, granting them distinct amounts of coverage, 2) there are both dissimilarities and overlaps in the publication patterns of the two source types, and 3) the content's direction and diversity behave differently over time.

===1 Introduction===
Finding valuable information in large collections of data can resemble looking for a needle in a haystack. An effective way to address this problem is the use of data visualization tools to explore datasets [Kei01]. The presentation of abstract data through interactive visual tools leverages human perceptual abilities and enhances cognitive performance, thus promoting discovery and sensemaking. In this paper, we present three distinct visualization tools for exploring large news collections, and apply them to the Signal Media One-Million News Articles Dataset (http://research.signalmedia.co/newsir16/signal-dataset.html), a collection of one million news and blog articles.

We show three use cases that highlight how these tools allow the investigation of distinct dimensions of the data. The first case evaluates how the hierarchy of importance given to a set of selected global events, manifested through the amount of coverage, varies between news and blog sources. The second investigates the publication patterns of both source types during 24-hour and seven-day weekly cycles. The third use case studies the variation of topical diversity for news and blogs over time and employs a visualization tool developed specifically for this work. To develop this tool, an analysis was conducted to identify the topic vectors representing the directions followed daily by the articles' contents, compute a diversity score, and measure the topic diversity over time for news and blogs.

===2 Corpus Characterization===
The Signal 1M Dataset comprises one million articles published by 93,345 distinct media sources of two types: news and blogs. An analysis of the articles' media type reveals that 18,533 sources published exclusively news articles, 74,333 sources published only blog stories, and 479 had documents of both types. As for the article count by media type, nearly three-fourths were news (734,488 or 73.4%) and about one-fourth were blog items (265,512 or 26.6%). Thus, despite being fewer in number, news sources were responsible for the publication of the majority of articles.

Even though the publication period extends from Jul 2nd 2015 to Sep 30th 2015, the majority of the articles were published between Sep 1st 2015 and Sep 30th 2015 (987,248 or 98.7%). Of these, 734,488 (74.4%)
were news articles and 265,512 (26.9%) blog articles. The highest number of articles published by a single source was 192,228 and the lowest, a single article. Regarding the overall distribution of articles, the majority of the sources (91,693 or 98.2%) published 100 articles or fewer, 1,565 sources (1.7%) published between 101 and 1,000 articles, 85 (0.09%) between 1,001 and 5,000, one (0.001%) between 5,001 and 10,000, and one between 10,000 and 20,000 articles.

The topic analysis conducted for each media type stream (see Section 5.3.2) found that the top five n-grams, based on the TF-IDF score of the topic vectors, were ‘south africa’, ‘pope francis’, ‘total volume table’, ‘high school football’, and ‘college football’ for news articles, and ‘star wars’, ‘school district’, ‘syrian refugees’, ‘executive director’, and ‘kansas city’ for the blog document set.

===3 Visualization of Large News Archives===
The visualization and analysis of large volumes of news content is an emerging field of research [KBMK10]. The ThemeRiver application [HHN02] was one of the first efforts in this domain. It provides an interactive visualization of thematic changes across a large set of news documents over time. It uses a metaphor of a river to assist in the recognition of relationships, trends and patterns in the data. Themes are displayed as colored streams whose width (the measure of a theme's strength) varies as it flows across time from left to right. A similar river-like visual metaphor is employed by the NewsLab system [GLYR07], which allows exploratory analysis of the temporal variation of themes, and their hierarchical structure, from a large collection of news videos.

Krstajić et al. [KBK11] present CloudLines, a visualization technique to display a compact view of multiple time series, each showing a sequence of related events and event episodes (high density sequences of events). The relative importance of events is conveyed through variations in the clusters' opacity and size. The system also permits fine-detailed analysis of individual event data points.

The complexities of visualizing the dynamics of news data streams are addressed by Krstajić et al. [KBMK10]. The system displays the evolution of news in real-time by converting the stream into threads comprised of similar articles. In addition to showing recent threads, the system computes the threads' relevance on the fly, based on the items' age and their relationships, to determine which threads to keep on screen and which ones to remove.

The development of news stories and their relationships through time is also explored by Story Tracker [KNAMK13]. The application represents the evolution of stories over time, and how they merge and split. Story clusters are displayed as rectangles whose size corresponds to the number of articles and include labels for the story title and the most important keywords. Related clusters have the same color, are edge-connected, and can be zoomed to the level of the individual articles that compose them.

The NewsStream service [NGSM15] provides several interactive tools to visually explore a continuously updated collection of financial articles, published via the RSS feeds of multiple news and blog sources. The system displays occurrences and co-occurrences of financial and geographic entities in the news, the related sentiment, a summary of the linked content through tag clouds, and temporal country co-occurrence networks displayed on a world map.

===4 The MediaViz Platform===
The MediaViz platform [DNR15] aims to assist in gaining insight from a large archive of news through interactive visualization tools. It comprises two components. The first is a back-end application that fetches and stores articles published via the RSS feeds of multiple online news sources and provides access to the data through an API. The second is a client application which retrieves the data provided by the API and allows its exploration through interactive visualization tools. Our approach is based on open technologies and was built with extensibility in mind: the client application is decoupled from the back-end so it can be configured to work with different datasets with minimal effort. For this paper, we stored the Signal 1M Dataset in a relational database and built a simple API. No major modifications were required for the existing visualization tools to work with the new API. However, a new tool was developed to explore topic diversity over time for news and blog articles. A fully functional demo is available online (http://irlab.fe.up.pt/p/mediaviz/newsir/).

===5 MediaViz Visualization Tools===
Rather than focusing on individual sources, we opted to explore the two types of media sources that comprise the corpus, news and blogs, as they allow a macro-level analysis and comparison of the dataset.

====5.1 Variations in Coverage====
The dynamics of the coverage that each source type granted to different themes over time are displayed by the Keywords tool. Users can insert multiple search terms and see how many articles (in absolute terms or
as a percentage of all articles published on the respective day) with those keywords were published daily during the selected period. Additional context can be obtained by clicking the data points, which displays a list of all related articles. Each list item includes the title, summary, publication date and the source's name, and can be clicked to display the full text.

Figure 1: MediaViz Keywords tool. Top: Daily percentage of articles published by all news sources containing the given terms. Bottom: Daily percentage of articles published by all blog sources containing the same terms.

Figure 1 displays the daily percentage of articles published between Sep 1st 2015 and Sep 30th 2015 by each source type with the terms ‘star wars’, ‘chile earthquake’, ‘tsipras’, and ‘stampede saudi arabia’. These particular terms were chosen because they are related to some relevant global events, identified after consulting several online resources, that took place in September 2015. The visualization's peaks highlight the selected events: the merchandise for the latest Star Wars movie was released on Sep 4th; an earthquake in Chile which led to the evacuation of millions of people took place on Sep 16th; on Sep 20th Alexis Tsipras was reelected as Prime Minister of Greece after resigning and calling for a snap election; and, on Sep 24th, hundreds of people died after a stampede during the annual pilgrimage to Mecca, in Saudi Arabia. As shown in Figure 1, the attention given to these events differed greatly between the two source types. News sources (top) gave similar attention to each event, while in blogs (bottom) the primacy belongs to articles mentioning Star Wars.

====5.2 Publication Patterns====
The Sources tool allows the comparison of publication patterns (count and percentage of articles) for multiple sources according to distinct temporal granularities: weekly, monthly and 24-hour cycles. To have comparable results, publication times are converted to the UTC time standard. The ability to compare several sources on the same screen can thus provide meaningful perspectives regarding their production cycles. This can be seen in Figure 2. News sources published a higher percentage of articles than blogs during business days, a behavior that is reversed during the weekend. While this pattern might be expected, given the particularities of each media type, the Sources tool shows quantitatively that this expectation holds.

Figure 2: MediaViz Sources tool. Percentage of articles published by both source types for each day of the week.

When looking at a 24-hour cycle, news and blog sources exhibit similar patterns. As Figure 3 displays, publications follow a typical working schedule: the most active publication period occurs between 08:00 and 16:00 UTC and then gradually decreases. One possible explanation for this overlap is the growing professionalization and influence of blogs, which often compete with traditional news sources for online eyeballs. The most significant difference between the two patterns, the news sources' peak at 07:00, can potentially be explained by the publication of early morning news.

Figure 3: MediaViz Sources tool. Percentage of articles published by both source types during a 24-hour cycle.
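
The aggregation behind the publication patterns of Section 5.2 (conversion to UTC, then the share of articles per day of the week or per hour of the day) can be illustrated with a minimal Python sketch. It is not the MediaViz implementation: the function names `to_utc` and `publication_share`, the ISO 8601 input format, and the sample timestamps are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def to_utc(iso_timestamp):
    """Parse an ISO 8601 timestamp that carries a UTC offset and convert it to UTC."""
    return datetime.fromisoformat(iso_timestamp).astimezone(timezone.utc)


def publication_share(iso_timestamps, cycle="weekday"):
    """Percentage of articles per weekday (0 = Monday) or per hour of day (0-23), in UTC."""
    if cycle not in ("weekday", "hour"):
        raise ValueError("cycle must be 'weekday' or 'hour'")
    counts = Counter()
    for ts in iso_timestamps:
        moment = to_utc(ts)
        counts[moment.weekday() if cycle == "weekday" else moment.hour] += 1
    total = sum(counts.values())
    return {bucket: 100.0 * n / total for bucket, n in sorted(counts.items())}


# Hypothetical publication times for a single source type (not taken from the dataset).
sample = [
    "2015-09-01T09:30:00+01:00",  # 08:30 UTC on the same day
    "2015-09-01T23:30:00-02:00",  # falls on the next day once converted to UTC
    "2015-09-05T12:00:00+00:00",
    "2015-09-05T13:15:00+00:00",
]
weekday_share = publication_share(sample, "weekday")
hour_share = publication_share(sample, "hour")
```

Running the same function separately on the news and blog timestamps would give the two series that a chart such as Figure 2 or Figure 3 compares. The second sample timestamp shows why the UTC conversion matters: without it, the article would be counted on a different day.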
====5.3 Diversity Explorer====
The Diversity Explorer tool was developed specifically for this work. Below we describe our strategy for detecting topics and measuring topical diversity between the news and blog streams.

=====5.3.1 Topic Detection=====
Our topic detection strategy was based on the clustering of text documents using n-grams of size n = 2 (bigrams) and n = 3 (trigrams) as features. The base strategy consisted of, for a given day, transforming each document into a bag of n-grams and then running k-means [HW79] using the n-gram frequencies as features. The value of k was selected based on the Silhouette method [Rou87], by testing successive values of k ∈ [2, 15] for a random sample of 100 documents, or fewer when fewer than 100 documents were available. Constraining the value of k indirectly enforced the number of topics to range between 2 and 15. The result of this process was a set of k topics, represented by the centroid of each cluster and associated with the documents for each day.

Prior to the clustering phase, and in order to ensure performance, we reduced the number of features by discarding n-grams that were over 99.6% sparse, i.e., features with more than 99.6% zeros, which were less useful in distinguishing documents. The sparsity threshold of 99.6% was determined empirically, by experimenting with the largest daily document set and ensuring that the number of features would not explode (99% decrease from 1,834,310 to 350 features for the largest daily document set), but also with smaller daily document sets to ensure that the number of features would not be too small (nearly 0% decrease for daily document sets with fewer than 100 documents). After completing the feature reduction process, we repeated the previously described clustering process for the smaller matrix, obtaining k topic vectors that illustrated the different directions followed by the contents of daily news.

=====5.3.2 Measuring Topic Diversity=====
In order to measure topic diversity within a corpus, we took the topic vectors for a given day and did an element-wise aggregation based on the maximum weight of each n-gram. This resulted in a set of daily vectors, describing the overall topical direction of news and blog articles per day.

Our approach to measuring topic diversity was based on a combined distance metric between all n-gram daily vectors, for a given corpus: the more distant the topics are from every other topic, the higher the diversity. We computed the normalized cosine distances X for each pair of n-gram daily vectors, separately for the news and blog corpora. Next, we calculated the mean and standard deviation for the obtained values, and combined the mean E[X] and standard deviation σ(X) into a diversity score, as described in Equation 1.

$$\mathrm{score}(X) = E[X] - 2 \times \left(F(E[X]; 0.5, 1/50) - 0.5\right) \times (1 - E[X]) \times (E[X] \times \sigma(X)) \tag{1}$$

$$F(x; \mu, s) = \frac{1}{1 + e^{-\frac{x - \mu}{s}}} \tag{2}$$

The idea was for the variance to affect the mean cosine distance in the following way: for a low mean, a low variance would result in a small increase, while a high variance would result in a large increase; for a high mean, a low variance would result in a small decrease, while a high variance would result in a large decrease. For example, given a mean cosine distance of 0.9, with a 0.9 standard deviation, we know that there are several values below the mean and that, since we are using a normalized cosine distance, its maximum is one. Thus, it makes sense that we would decrease (negative sign) the diversity score with the intuition that a subset of documents would be less diverse among themselves than average. On the other hand, for a mean cosine distance of 0.1, it would only make sense to increase (positive sign) the value based on the standard deviation. To determine the sign, we took advantage of the cumulative distribution function of a logistic distribution (Equation 2), centered on µ = 0.5 with scale s = 1/50. We used this as a sign function by shifting the result by −0.5 and multiplying by 2, which gave us a value in the interval [−1, 1] with a sigmoidal behavior. We then combined the mean and standard deviation to obtain the absolute value of increase or decrease, and multiplied it by the sign function.

We repeated this process for news, blogs, and the concatenated n-gram daily vectors of both corpora, for an overall topic diversity measurement. This resulted in a diversity score between zero and one, where zero meant that all the topics were exactly the same, while one meant that all the topics were completely distinct. Based on our results, topics have a diversity score of 0.970 for the combined samples, 0.986 for blogs, and 0.976 for news. Topic diversity is similarly high in either case, despite blogs having a slightly higher diversity score.

=====5.3.3 Exploring Diversity Over Time=====
We also measured topic diversity over time, for small temporal windows, comparing news and blogs. Figure 4 shows the resulting diversity score for a sequence
of 5-day windows starting at the given date (x-axis), from Sep 1st to Sep 30th 2015, with news in green and blogs in red. As we can see, both corpora have a diversity behavior that is similar over time, with the exception of the temporal windows from Sep 15th to Sep 19th 2015. Correlation between the two diversity score distributions is 28.9% for the whole month of September, but rises to 69.3% when ignoring the period of 15–19 Sep. We calculated the differences between diversity scores over time and found that the temporal window starting at Sep 19th 2015 represented the largest break in consistency between news and blogs, with a difference in diversity of 0.205.

We analyzed the n-grams of the topics, for each corpus, within this temporal window. For the news corpus, we found 111 unique n-grams out of 175 total n-grams, meaning that 63.43% of the n-grams are unique, which indicates a high diversity. On the other hand, for the blog corpus, we found 64 unique n-grams out of 164 total n-grams, meaning that 39.02% of the n-grams are unique, which indicates a low diversity. This is consistent with our diversity score. We also calculated the Jaccard index for the sets of n-grams of the two corpora, for the Sep 19th 2015 temporal window, finding that 15.89% of the total number of unique n-grams appear in both news and blogs.

Figure 4: MediaViz Diversity Explorer. Top: diversity over time for windows of 5 days, starting at the given date. Bottom: number of documents for windows of 5 days, starting at the given date.

===6 Conclusion===
In this paper we presented the exploration of the Signal 1M Dataset, which comprises a large collection of news and blog articles, using distinct visualization tools. The visual analysis of the corpus provided interesting perspectives that would be much more difficult to obtain without the assistance of such tools. The Keywords tool allowed us to see that news and blog sources granted different levels of importance to a given set of keywords related to major global events that took place in September 2015. It was also evident, using the Sources tool, that the temporal publication patterns of these two media behaved differently (blogs published a higher percentage of content during the weekend than news sources), but also in a similar fashion (both source types followed a similar curve during a 24-hour cycle). Finally, through the Diversity Explorer tool, we were able to visualize variations in the dynamics of topical diversity over time for each media type's content stream.

===Acknowledgements===
Project ‘NORTE-01-0145-FEDER-000020’ is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).

===References===
[DNR15] Tiago Devezas, Sérgio Nunes, and María Teresa Rodríguez. MediaViz: An interactive visualization platform for online media studies. In Proceedings of the 2015 International Workshop on Human-centric Independent Computing, pages 7–11. ACM, 2015.
[GLYR07] Mohammad Ghoniem, Dongning Luo, Jing Yang, and William Ribarsky. Newslab: Exploratory broadcast news video analysis. In Visual Analytics Science and Technology, 2007. VAST 2007. IEEE Symposium on, pages 123–130. IEEE, 2007.
[HHN02] Susan Havre, Beth Hetzler, and Lucy Nowell. Themeriver™: In search of trends, patterns, and relationships. IEEE Transactions on Visualization and Computer Graphics, 8(1):9–20, 2002.
[HW79] J A Hartigan and M A Wong. A K-Means Clustering Algorithm. Journal of the Royal Statistical Society, 28(1):100–108, 1979.
[KBK11] Miloš Krstajić, Enrico Bertini, and Daniel A Keim. Cloudlines: Compact display of event episodes in multiple time-series. Visualization and Computer Graphics, IEEE Transactions on, 17(12):2432–2439, 2011.
[KBMK10] Miloš Krstajić, Enrico Bertini, Florian Mansmann, and Daniel A Keim. Visual analysis of news streams with article threads. In Proceedings of the First International Workshop on Novel Data Stream Pattern Mining Techniques, pages 39–46. ACM, 2010.
[Kei01] Daniel A Keim. Visual exploration of large data sets. Communications of the ACM, 44(8):38–44, 2001.
[KNAMK13] Miloš Krstajić, Mohammad Najm-Araghi, Florian Mansmann, and Daniel A Keim. Story tracker: Incremental visual text analytics of news story development. Information Visualization, 12(3-4):308–323, 2013.
[NGSM15] Petra Kralj Novak, Miha Grcar, Borut Sluban, and Igor Mozetic. Analysis of financial news with newsstream, technical report IJS-DP-11965. CoRR, abs/1508.00027, 2015.
[Rou87] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
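
As a supplementary illustration of the diversity score of Section 5.3.2 (Equations 1 and 2), the following minimal Python sketch computes the score from a set of daily vectors. It is not the authors' code and rests on several assumptions: the sign term follows the textual description (the logistic value is shifted by −0.5 and multiplied by 2), the standard deviation is the population standard deviation, the daily vectors are non-zero with non-negative weights so that the plain cosine distance already lies in [0, 1], and the function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def logistic_cdf(x, mu=0.5, s=1 / 50):
    """Equation 2: logistic CDF centered on mu with scale s."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors with non-negative weights."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def score_from_stats(m, sd):
    """Equation 1, given the mean m and standard deviation sd of the distances."""
    sign = 2.0 * (logistic_cdf(m) - 0.5)  # close to -1 for low means, +1 for high means
    return m - sign * (1.0 - m) * (m * sd)


def diversity_score(daily_vectors):
    """Diversity score over all pairs of daily vectors (at least two vectors needed)."""
    distances = [cosine_distance(u, v) for u, v in combinations(daily_vectors, 2)]
    # Assumption: population standard deviation; the paper does not say which one is used.
    return score_from_stats(mean(distances), pstdev(distances))


# Hypothetical daily vectors: three mutually orthogonal topic directions.
orthogonal_days = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
score = diversity_score(orthogonal_days)
```

With a mean distance below 0.5 the sign term is negative, so the score is pushed above the mean in proportion to the standard deviation; above 0.5 it is pushed below the mean, which matches the behavior described for Equation 1.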