Exploring a Large News Collection Using Visualization Tools

Tiago Devezas1,2 (tdevezas@fe.up.pt), José Devezas2 (jld@fe.up.pt), Sérgio Nunes1,2 (ssn@fe.up.pt)
INESC TEC1 and DEI2, FEUP, University of Porto
Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal

Abstract

The overwhelming amount of news content published online every day has made it increasingly difficult to perform macro-level analysis of the news landscape. Visual exploration tools harness both computing power and human perception to assist in making sense of large data collections. In this paper, we employed three visualization tools to explore a dataset comprising one million articles published by news organizations and blogs. The visual analysis of the dataset revealed that 1) news and blog sources evaluate the importance of similar events very differently, granting them distinct amounts of coverage, 2) there are both dissimilarities and overlaps in the publication patterns of the two source types, and 3) the content's direction and diversity behave differently over time.

1 Introduction

Finding valuable information in large collections of data can resemble looking for a needle in a haystack. An effective way to address this problem is the use of data visualization tools to explore datasets [Kei01]. The presentation of abstract data through interactive visual tools leverages human perceptual abilities and enhances cognitive performance, thus promoting discovery and sensemaking. In this paper, we present three distinct visualization tools for exploring large news collections, and apply them to the Signal Media One-Million News Articles Dataset1, a collection of one million news and blog articles.

We show three use cases that highlight how these tools allow the investigation of distinct dimensions of the data. The first case evaluates how the hierarchy of importance given to a set of selected global events, manifested through the amount of coverage, varies between news and blog sources. The second investigates the publication patterns of both source types during 24-hour and seven-day weekly cycles. The third use case studies the variation of topical diversity for news and blogs over time, and employs a visualization tool developed specifically for this work. To develop this tool, an analysis was conducted to identify the topic vectors representing the directions followed daily by the articles' contents, compute a diversity score, and measure the topic diversity over time for news and blogs.

Copyright © 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20-March-2016, published at http://ceur-ws.org

1 http://research.signalmedia.co/newsir16/signal-dataset.html

2 Corpus Characterization

The Signal 1M Dataset comprises one million articles published by 93,345 distinct media sources of two types: news and blogs. An analysis of the articles' media type reveals that 18,533 sources published exclusively news articles, 74,333 sources published only blog stories, and 479 had documents of both types. As for the article count by media type, nearly three-fourths were news (734,488 or 73.4%) and one-fourth blog items (265,512 or 26.6%). Thus, despite their lower number, news sources were responsible for the publication of the majority of articles.

Even though the publication period extends from Jul 2nd 2015 to Sep 30th 2015, the majority of the articles were published between Sep 1st 2015 and Sep 30th 2015 (987,248 or 98.7%). Of these, 734,488 (74.4%) were news articles and 265,512 (26.9%) blog articles.
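As an illustration, this breakdown can be reproduced from the dataset's JSON-lines distribution with a few lines of Python. This is only a sketch: the field names `source` and `media-type` are assumed from the dataset's schema, and the helper name is ours.

```python
import json
from collections import Counter, defaultdict

def corpus_breakdown(jsonl_lines):
    """Tally article counts per media type and classify each source as
    news-only, blog-only, or mixed (published both types)."""
    articles_by_type = Counter()
    types_by_source = defaultdict(set)
    for line in jsonl_lines:
        doc = json.loads(line)
        articles_by_type[doc["media-type"]] += 1
        types_by_source[doc["source"]].add(doc["media-type"])
    source_classes = Counter(
        "mixed" if len(kinds) > 1 else next(iter(kinds)).lower() + "-only"
        for kinds in types_by_source.values()
    )
    return articles_by_type, source_classes
```

Running this over the full dump yields the per-type article totals and the news-only / blog-only / mixed source counts reported above.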
The highest number of articles published by a single source was 192,228, and the lowest was a single article. Regarding the overall distribution of articles, the majority of the sources (91,693 or 98.2%) published 100 articles or less, 1,565 sources (1.7%) published between 101 and 1,000 articles, 85 (0.09%) between 1,001 and 5,000, one (0.001%) between 5,001 and 10,000, and one between 10,000 and 20,000 articles.

The topic analysis conducted for each media type stream (see Section 5.3.2) found that the top five n-grams, based on the TF-IDF score of the topic vectors, were 'south africa', 'pope francis', 'total volume table', 'high school football', and 'college football' for news articles, and 'star wars', 'school district', 'syrian refugees', 'executive director', and 'kansas city' for the blog document set.

3 Visualization of Large News Archives

The visualization and analysis of large volumes of news content is an emerging field of research [KBMK10]. The ThemeRiver application [HHN02] was one of the first efforts in this domain. It provides an interactive visualization of thematic changes across a large set of news documents over time. It uses the metaphor of a river to assist in the recognition of relationships, trends and patterns in the data. Themes are displayed as colored streams whose width — the measure of their strength — varies as they flow across time from left to right. A similar river-like visual metaphor is employed by the NewsLab system [GLYR07], which allows exploratory analysis of the temporal variation of themes, and of their hierarchical structure, in a large collection of news videos.

Krstajić et al. [KBK11] present CloudLines, a visualization technique that displays a compact view of multiple time series, each showing a sequence of related events and event episodes (high-density sequences of events). The relative importance of events is conveyed through variations in the clusters' opacity and size. The system also permits fine-detailed analysis of individual event data points.

The complexities of visualizing the dynamics of news data streams are addressed by Krstajić et al. [KBMK10]. The system displays the evolution of news in real time by converting the stream into threads comprised of similar articles. In addition to showing recent threads, the system computes the threads' relevance on the fly — based on the items' age and their relationships — to determine which threads to keep on screen and which ones to remove.

The development of news stories and their relationships through time is also explored by Story Tracker [KNAMK13]. The application represents the evolution of stories over time, and how they merge and split. Story clusters are displayed as rectangles whose size corresponds to the number of articles and which include labels for the story title and the most important keywords. Related clusters have the same color, are edge-connected, and can be zoomed to the level of the individual articles that compose them.

The NewsStream service [NGSM15] provides several interactive tools to visually explore a continuously updated collection of financial articles, published via the RSS feeds of multiple news and blog sources. The system displays occurrences and co-occurrences of financial and geographic entities in the news, the related sentiment, a summary of the linked content through tag clouds, and temporal country co-occurrence networks displayed on a world map.

4 The MediaViz Platform

The MediaViz platform [DNR15] aims to assist in gaining insight from a large archive of news through interactive visualization tools. It comprises two components. The first is a back-end application that fetches and stores articles published via the RSS feeds of multiple online news sources and provides access to the data through an API. The second is a client application which retrieves the data provided by the API and allows its exploration through interactive visualization tools. Our approach is based on open technologies and was built with extensibility in mind: the client application is decoupled from the back-end, so it can be configured to work with different datasets with minimal effort. For this paper, we stored the Signal 1M Dataset in a relational database and built a simple API. No major modifications were required for the existing visualization tools to work with the new API. However, a new tool was developed to explore topic diversity over time for news and blog articles. A fully functional demo is available online2.

2 http://irlab.fe.up.pt/p/mediaviz/newsir/

5 MediaViz Visualization Tools

Rather than focusing on individual sources, we opted to explore the two types of media sources that comprise the corpus — news and blogs — as they allow a macro-level analysis and comparison of the dataset.

5.1 Variations in Coverage

The dynamics of the coverage that each source type granted to different themes over time are displayed by the Keywords tool. Users can insert multiple search terms and see how many articles (in absolute terms or as a percentage of all articles published on the respective day) with those keywords were published daily during the selected period. Additional context can be obtained by clicking the data points, which displays a list of all related articles. Each list item includes the title, summary, publication date and the source's name, and can be clicked to display the full text.

Figure 1 displays the daily percentage of articles published between Sep 1st 2015 and Sep 30th 2015 by each source type with the terms 'star wars', 'chile earthquake', 'tsipras', and 'stampede saudi arabia'. These particular terms were chosen because they are related to some relevant global events — identified after consulting several online resources — that took place in September 2015. The visualization's peaks highlight the selected events: the merchandise for the latest Star Wars movie was released on Sep 4th; an earthquake in Chile, which led to the evacuation of millions of people, took place on Sep 16th; on Sep 20th, Alexis Tsipras was reelected as Prime Minister of Greece after resigning and calling a snap election; and, on Sep 24th, hundreds of people died in a stampede during the annual pilgrimage to Mecca, in Saudi Arabia. As shown in Figure 1, the attention given to these events differed greatly between the two source types. News sources (top) gave similar attention to each event, while in blogs (bottom) the primacy belongs to articles mentioning Star Wars.

Figure 1: MediaViz Keywords tool. Top: Daily percentage of articles published by all news sources containing the given terms. Bottom: Daily percentage of articles published by all blog sources containing the same terms.

5.2 Publication Patterns

The Sources tool allows the comparison of publication patterns (count and percentage of articles) for multiple sources according to distinct temporal granularities: weekly, monthly and 24-hour cycles. To obtain comparable results, publication times are converted to the UTC time standard. The ability to compare several sources on the same screen can thus provide meaningful perspectives regarding their production cycles.
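The normalization to UTC and the bucketing by weekday and by hour of day behind this kind of comparison can be sketched in a few lines. This is a minimal illustration using only the standard library, not the MediaViz implementation itself, and it assumes ISO-8601 timestamps with an explicit UTC offset.

```python
from collections import Counter
from datetime import datetime, timezone

def publication_profile(timestamps):
    """Bucket ISO-8601 publication times (with UTC offsets) by UTC
    weekday and hour. Weekdays are 0=Monday .. 6=Sunday."""
    weekday_counts, hour_counts = Counter(), Counter()
    for ts in timestamps:
        # Parse the offset-aware timestamp and normalize it to UTC.
        dt = datetime.fromisoformat(ts).astimezone(timezone.utc)
        weekday_counts[dt.weekday()] += 1
        hour_counts[dt.hour] += 1
    return weekday_counts, hour_counts
```

Comparing the two resulting histograms for the news and blog streams gives exactly the weekday and 24-hour profiles discussed next.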
This can be seen in Figure 2. News sources published a higher percentage of articles than blogs during business days, a behavior that is reversed during the weekend. While this pattern might be expected, given the particularities of each media type, the Sources tool quantitatively shows that the assertion is indeed true.

Figure 2: MediaViz Sources tool. Percentage of articles published by both source types for each day of the week.

When looking at a 24-hour cycle, news and blog sources exhibit similar patterns. As Figure 3 displays, publications follow a typical working schedule: the most active publication period occurs between 08:00 and 16:00 UTC, after which activity gradually decreases. One possible explanation for this overlap is the growing professionalization and influence of blogs, which often compete with traditional news sources for online eyeballs. The most significant difference between the two patterns, the news sources' peak at 07:00, can potentially be explained by the publication of early morning news.

Figure 3: MediaViz Sources tool. Percentage of articles published by both source types during a 24-hour cycle.

5.3 Diversity Explorer

The Diversity Explorer tool was developed specifically for this work. Below we describe our strategy for detecting topics and measuring topical diversity between the news and blog streams.

5.3.1 Topic Detection

Our topic detection strategy was based on the clustering of text documents using n-grams of size n = 2 (bigrams) and n = 3 (trigrams) as features. The base strategy consisted of, for a given day, transforming each document into a bag of n-grams and then running k-means [HW79] using the n-gram frequencies as features.
The value of k was selected based on the Silhouette method [Rou87], by testing successive values of k ∈ [2, 15] on a random sample of 100 or fewer documents — fewer in case less than 100 documents were available. Constraining the value of k indirectly enforced the number of topics to range between 2 and 15. The result of this process was a set of k topics, represented by the centroid of each cluster and associated with the documents of each day.

Prior to the clustering phase, and in order to ensure performance, we reduced the number of features by removing n-grams that were over 99.6% sparse, i.e., features with more than 99.6% zeros, which were less useful in distinguishing documents. The sparsity threshold of 99.6% was determined empirically, by experimenting with the largest daily document set and ensuring that the number of features would not explode (a 99% decrease, from 1,834,310 to 350 features, for the largest daily document set), but also with smaller daily document sets, to ensure that the number of features would not become too small (a decrease of nearly 0% for daily document sets with less than 100 documents). After completing the feature reduction process, we repeated the previously described clustering process on the smaller matrix, obtaining k topic vectors that illustrated the different directions followed by the contents of daily news.

5.3.2 Measuring Topic Diversity

In order to measure topic diversity within a corpus, we took the topic vectors for a given day and performed an element-wise aggregation based on the maximum weight of each n-gram. This resulted in a set of daily vectors, describing the overall topical direction of news and blog articles per day.

Our approach to measuring topic diversity was based on a combined distance metric between all n-gram daily vectors, for a given corpus — the more distant the topics are from every other topic, the higher the diversity. We computed the normalized cosine distances X for each pair of n-gram daily vectors, separately for the news and blog corpora. Next, we calculated the mean and standard deviation of the obtained values, and combined the mean E[X] and standard deviation σ(X) into a diversity score, as described in Equation 1.

score(X) = E[X] − 2 × (F(E[X]; 0.5, 1/50) − 0.5) × (1 − E[X]) × (E[X] × σ(X))    (1)

F(x; µ, s) = 1 / (1 + e^(−(x−µ)/s))    (2)

The idea was for the variance to affect the mean cosine distance in the following way: for a low mean, a low variance would result in a small increase, while a high variance would result in a large increase; for a high mean, a low variance would result in a small decrease, while a high variance would result in a large decrease. For example, given a mean cosine distance of 0.9 with a 0.9 standard deviation, we know that there are several values below the mean and that, since we are using a normalized cosine distance, its maximum is one. Thus, it makes sense to decrease (negative sign) the diversity score, with the intuition that a subset of documents would be less diverse among themselves than average. On the other hand, for a mean cosine distance of 0.1, it would only make sense to increase (positive sign) the value based on the standard deviation. To determine the sign, we took advantage of a logistic distribution (Equation 2), centered on µ = 0.5 and scaled by s = 1/50. We used it as a sign function by shifting the result by −0.5 and multiplying by 2, which gave us a value in the interval [−1, 1] with a sigmoidal behavior. We then combined the mean and standard deviation to obtain the absolute value of the increase or decrease, and multiplied it by the sign function.

We repeated this process for news, blogs, and the concatenated n-gram daily vectors of both corpora, for an overall topic diversity measurement. This resulted in a diversity score between zero and one, where zero meant that all the topics were exactly the same, while one meant that all the topics were completely distinct. Based on our results, topics have, overall for the combined samples, a diversity score of 0.970, a value that is as high as 0.986 for blogs, and as low as 0.976 for news. Topic diversity is similarly high in either case, despite blogs having a slightly higher diversity score.

5.3.3 Exploring Diversity Over Time

We also measured topic diversity over time, for small temporal windows, comparing news and blogs. Figure 4 shows the resulting diversity score for a sequence of 5-day windows starting at the given date (x-axis), from Sep 1st to Sep 30th 2015, with news in green and blogs in red. As we can see, both corpora have a diversity behavior that is similar over time, with the exception of the temporal windows from Sep 15th to Sep 19th 2015. The correlation between the two diversity score distributions is 28.9% for the whole month of September, but rises to 69.3% when ignoring the period of Sep 15-19. We calculated the differences between diversity scores over time and found that the temporal window starting at Sep 19th 2015 represented the largest break in consistency between news and blogs, with a difference in diversity of 0.205.
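Equations 1 and 2 translate directly into code. The sketch below uses our own variable names and the population standard deviation; it is an illustration of the scoring scheme, not the authors' implementation.

```python
import math
from statistics import mean, pstdev

def logistic(x, mu=0.5, s=1 / 50):
    """Equation 2: the logistic function used to determine the sign."""
    return 1 / (1 + math.exp(-(x - mu) / s))

def diversity_score(distances):
    """Equation 1: combine the mean and standard deviation of the pairwise
    normalized cosine distances into a diversity score."""
    m, sd = mean(distances), pstdev(distances)
    sign = 2 * (logistic(m) - 0.5)   # sigmoidal value in [-1, 1]
    magnitude = (1 - m) * (m * sd)   # shrinks near the extremes of the mean
    return m - sign * magnitude
```

As the surrounding text explains, a high mean with a high spread pulls the score down, while a low mean with a high spread pushes it up.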
We analyzed the n-grams of the topics, for each corpus, within this temporal window. For the news corpus, we found 111 unique n-grams out of 175 total n-grams, meaning that 63.43% of the n-grams are unique, which indicates a high diversity. On the other hand, for the blog corpus, we found 64 unique n-grams out of 164 total n-grams, meaning that 39.02% of the n-grams are unique, which indicates a lower diversity. This is consistent with our diversity score. We also calculated the Jaccard index for the sets of n-grams of each corpus, for the Sep 19th 2015 temporal window, finding that 15.89% of the total number of unique n-grams appear in both news and blogs.

Figure 4: MediaViz Diversity Explorer. Top: diversity over time for windows of 5 days, starting at the given date. Bottom: number of documents for windows of 5 days, starting at the given date.

6 Conclusion

In this paper we presented an exploration of the Signal 1M Dataset, which comprises a large collection of news and blog articles, using distinct visualization tools. The visual analysis of the corpus provided interesting perspectives that would be much more difficult to obtain without the assistance of such tools. The Keywords tool allowed us to see that news and blog sources granted different levels of importance to a given set of keywords related to major global events that took place in September 2015. It was also evident, using the Sources tool, that the temporal publication patterns of these two media behaved differently — blogs published a higher percentage of content during the weekend than news sources — but also similarly — both source types followed an identical curve during a 24-hour cycle. Finally, through the Diversity Explorer tool, we were able to visualize variations in the dynamics of topical diversity over time for each media type's content stream.

Acknowledgements

Project 'NORTE-01-0145-FEDER-000020' is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).

References

[DNR15] Tiago Devezas, Sérgio Nunes, and María Teresa Rodríguez. MediaViz: An interactive visualization platform for online media studies. In Proceedings of the 2015 International Workshop on Human-centric Independent Computing, pages 7-11. ACM, 2015.

[GLYR07] Mohammad Ghoniem, Dongning Luo, Jing Yang, and William Ribarsky. NewsLab: Exploratory broadcast news video analysis. In IEEE Symposium on Visual Analytics Science and Technology (VAST 2007), pages 123-130. IEEE, 2007.

[HHN02] Susan Havre, Beth Hetzler, and Lucy Nowell. ThemeRiver: In search of trends, patterns, and relationships. IEEE Transactions on Visualization and Computer Graphics, 8(1):9-20, 2002.

[HW79] J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. Journal of the Royal Statistical Society, 28(1):100-108, 1979.

[KBK11] Miloš Krstajić, Enrico Bertini, and Daniel A. Keim. CloudLines: Compact display of event episodes in multiple time-series. IEEE Transactions on Visualization and Computer Graphics, 17(12):2432-2439, 2011.

[KBMK10] Miloš Krstajić, Enrico Bertini, Florian Mansmann, and Daniel A. Keim. Visual analysis of news streams with article threads. In Proceedings of the First International Workshop on Novel Data Stream Pattern Mining Techniques, pages 39-46. ACM, 2010.

[Kei01] Daniel A. Keim. Visual exploration of large data sets. Communications of the ACM, 44(8):38-44, 2001.

[KNAMK13] Miloš Krstajić, Mohammad Najm-Araghi, Florian Mansmann, and Daniel A. Keim. Story Tracker: Incremental visual text analytics of news story development. Information Visualization, 12(3-4):308-323, 2013.

[NGSM15] Petra Kralj Novak, Miha Grcar, Borut Sluban, and Igor Mozetic. Analysis of financial news with NewsStream. Technical report IJS-DP-11965, CoRR, abs/1508.00027, 2015.

[Rou87] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65, 1987.