An Analysis of Novelty Dynamics in News Media Coverage

Ronaldo Cristiano Prati
Universidade Federal do ABC
Santo André, São Paulo, Brazil
ronaldo.prati@ufabc.edu.br

Walter Teixeira Lima Júnior
Universidade Federal do Amapá
Macapá, Amapá, Brazil
contato@walterlima.net

Copyright © 2016 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20-March-2016, published at http://ceur-ws.org

Abstract

Computer Science has affected almost all fields of human knowledge, contributing to scientific advances in many branches of the Natural and Social Sciences. Journalism is one of the fields benefiting from advances in computer science. Among the journalistic concepts that can be analyzed computationally is News Value, and Novelty is one of the most important news values. One possible approach to extracting novelty elements from a story relies on word frequency, exploiting the capacity to collect and analyze massive amounts of data. In this paper, we use the News Coverage Index (NCI) dataset, maintained by the Pew Research Center, to analyze the novelty dynamics of news coverage, using the novelty signatures proposed by [12]. As a definition of novelty, we use the first appearance of a new lead newsmaker. Results show a good fit of the model to the dataset. Furthermore, an analysis by media sector and broad topic yields interesting insights into media coverage.

1 Introduction

Computational science has affected almost all fields of human knowledge, contributing to scientific advances in many branches of the Natural and Social Sciences. For instance, the capacity to collect and analyze massive amounts of data has profoundly transformed fields such as biology and physics [7]. In Social Science, despite the difficulty of formalizing many aspects of human behavior computationally, "a computational social science is emerging that leverages the capacity to collect and analyze data with an unprecedented breadth and depth and scale" [7]. Advances in this area have progressed at a much slower pace, but substantial barriers that might limit progress have been overcome in recent years. The emergence of this powerful new field of data analysis in Social Science has also influenced research in one of its branches, Journalism. Journalism is an important social practice. Therefore, finding non-trivial information in content produced by journalism requires the support of current technologies to advance analytical techniques: "Computation can advance journalism by drawing on innovations in topic detection, video analysis, personalization, aggregation, visualization, and sense making" [10].

Among the journalistic concepts that can be analyzed computationally is News Value. News value as a concept was introduced in Johan Galtung and Mari Holmboe Ruge's seminal 1965 publication in the Journal of Peace Research. The paper suggested a range of attributes that establish news values in the discursive elements of newspapers and broadcast news. Galtung and Ruge established the news value elements as Frequency; Threshold; Unambiguity; Meaningfulness; Consonance; Unexpectedness; Continuity; Composition; Reference to Elite Nations; Reference to Elite People; Reference to Persons; and Reference to Something Negative [3]. These factors form the basis of the theory of newsworthiness. The theory is based on the psychology of individual perception and explains which factors influence the newsworthiness of an event [6].

News values are studied by considering a range of attributes contained in discursive elements. It is also possible to verify news value through a range of "more specific cognitive constraints that define news values (Novelty, Recency, Presupposition, Consonance, Relevance, Deviance and Negativity, Proximity)" [2]. The news value named Novelty can be analyzed through words such as reveal or revelation. These words semantically announce "unexpected aspects of an event. News stories are frequently about happenings that surprise us, that are unusual or rare" [1].

Novelty can be understood through concepts such as out of the ordinary, least expected, or not predicted: news values relating to the novelty, newness or unexpectedness of an event or happening [2]. The quality of being interesting enough to the public (newsworthiness) also depends on unexpectedness: a journalistic fact that is out of the ordinary has a greater effect than an everyday occurrence. The power of attraction of unexpectedness lies in the fact that "there is new information that has been uncovered and evaluations of importance can make the eliteness of a source explicit" [2]. This means that readers or viewers learn about facts or people that are unusual in their daily lives; however, "this is the old man-bites-dog syndrome which needs little more explanation" [9, 2]. When a fact or term first comes up, human attention is captured, but "the novelty of a story tends to fade with time and thus the attention that people pay for it. This can be due to either habituation or competition from other new stories" [13].

As previously observed, novelty is also elaborated "mainly through using evaluative language, references to surprise/expectations and comparisons" [1]. This way of perceiving novelty in the construction of journalistic content is based on analyses produced by reading the news. However, it is also possible to extract novelty elements from a story by considering word frequency, exploiting the capacity to collect and analyze massive amounts of data. Over the years, there has been a massive increase in the availability of journalistic data and in the creation of new tools to extract value from data, which are helping us understand our lives, organizations, and societies.

In this paper, we use a recent model of novelty dynamics to analyze news coverage. The main idea is to analyze whether different news sources present different novelty dynamics. This paper is organized as follows: Section 2 presents novelty signatures that emerge in some dynamical processes. Section 3 describes the data set used in our study. Section 4 presents the results of applying the novelty signatures to the NCI dataset, and Section 5 concludes the paper.

2 Novelty in dynamical processes

Tria et al. [12] have recently analyzed novelties as new events occurring in a dynamical process evolving over time. Given a sequence of events, a novelty occurs whenever a new element first appears in the sequence. They analyzed four different data sets: books from the Project Gutenberg corpus, annotations in the social bookmarking platform Delicious, songs and singers at the Last.fm streaming portal, and the appearance of entries in the English Wikipedia. The novelties in these data are, respectively, the occurrence of new words in books, the use of new annotation tags in the bookmarks, the inclusion in a playlist of a new artist or song the user had never listened to, and the first edit of a page in the collaborative encyclopedia.

They were able to describe novelty with a simple mathematical model based on random draws, with replacement, from a Pólya urn [4] whose contents expand whenever a novel item is observed. The model predicts statistical laws for the rate at which novelties happen (Heaps' law [5]) and for the probability distribution on the space explored (Zipf's law [14]), as well as signatures of the process by which one novelty sets the stage for another.

The first signature quantifies the rate at which novelties occur in a temporally ordered sequence of elements of length $N$ by analyzing the growth of the number $D(N)$ of distinct elements in this sequence. This relation follows Heaps' law, $D(N) \sim N^{\beta}$ with $\beta < 1$, which implies that the rate at which novelties occur decreases over time as $t^{\beta - 1}$, where $\beta$ is the exponent of a power law of $D(N)$ over $N$ fitted to the data.

The second signature is related to the frequency of occurrence of the different elements in the data. The frequency-rank distribution follows an approximate Zipfian distribution (Zipf's law). In this distribution, the frequency of any element is inversely proportional to a power of its rank in the frequency table, i.e., the frequency $F(R)$ of the element at rank $R$ is proportional to $R^{-\alpha}$, where $\alpha$ is the exponent of a power law of $F(R)$ over $R$ fitted to the data.

It is well known that $\alpha$ and $\beta$ are inversely correlated [8]. The larger the $\beta$ coefficient, the higher the frequency of appearance of new elements in the sequence, and thus the higher the propensity for novelty. On the other hand, the larger the $\alpha$ coefficient, the more the sequence is dominated by its most frequent elements. The key result reported in [12] is that, in the four data sets analyzed, the model was able to capture the novelty behavior in the data. An interesting research question is then whether news delivery also shows these novelty signatures. This paper is an initial attempt towards such an analysis.
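The two signatures described in this section can be illustrated with a minimal sketch. It assumes the input is a time-ordered list of hashable labels (for example, lead newsmaker names) and estimates both exponents by ordinary least squares on log-log axes. The fitting procedure and all function names here are illustrative assumptions; the text does not specify how the coefficients were fitted.

```python
import math
from collections import Counter


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x); needs >= 2 distinct x values."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx


def heaps_curve(sequence):
    """D(N) for N = 1..len(sequence): number of distinct elements seen so far."""
    seen, curve = set(), []
    for item in sequence:
        seen.add(item)  # a novelty occurs when this grows the set
        curve.append(len(seen))
    return curve


def rank_frequency(sequence):
    """Element frequencies in decreasing order, so index 0 is rank 1."""
    return sorted(Counter(sequence).values(), reverse=True)


def novelty_signatures(sequence):
    """Return (beta, alpha): Heaps' exponent and Zipf's exponent (assumed fit)."""
    d = heaps_curve(sequence)
    beta = loglog_slope(range(1, len(d) + 1), d)
    f = rank_frequency(sequence)
    alpha = -loglog_slope(range(1, len(f) + 1), f)  # F(R) ~ R^(-alpha)
    return beta, alpha


# Hypothetical toy sequence; the labels are placeholders, not NCI data.
toy = ["A", "B", "A", "C", "A", "B", "D", "A", "E", "A"]
beta, alpha = novelty_signatures(toy)
print(f"beta={beta:.2f} alpha={alpha:.2f}")
```

A plain log-log regression weights the tail of the rank distribution heavily, so on real data a binned or maximum-likelihood fit may be preferable; the sketch is only meant to make the definitions of $D(N)$, $F(R)$, $\beta$ and $\alpha$ concrete.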
3 News Coverage Index dataset

In our analysis, we used the data gathered by the Pew Research Center (formerly the Project for Excellence in Journalism, PEJ). Every week, this institution produced the News Coverage Index (NCI) by identifying and annotating the main subjects covered by the U.S. mainstream media. The dataset used in this research is the most recent release (2013) published by the Pew Research Center. To date, no other similar dataset is available to update the data or to serve as a comparison.

    The NCI captured and analyzed 52 news outlets in real time to determine what was being covered and what was not in the U.S. news media. The analysis was conducted weekly, Monday - Sunday. The key variables included source, story date, big story, broad story topic, placement, format, geographic focus, story word count, duration of broadcast story and lead newsmaker. The outlets studied came from print, network TV, cable, online, and radio. They included evening and morning network news, several hours of daytime and prime time cable news each day, newspapers from around the country, the top online news sites, and radio, including headlines, the long form programs and talk [11].

By focusing on the topic of the story, the index measures what percentage of the analyzed news hole is about that topic. Data were collected from January 2007 to May 2012. Table 1 presents the number of news stories collected per year. Note that 2012 has fewer stories because its collection period ranges from January to May, rather than January to December.

Table 1: Number of news stories per year

Year        2007   2008   2009   2010   2011   2012
CableTV     22823  21892  18856  17087  15324  6472
NetworkTV   21320  19796  19427  13016  11858  5186
Newspaper   6559   7350   7370   5626   5190   1977
Online      6520   6539   7830   7818   7744   3242
Radio       13515  14365  15234  9067   8439   3570
All         70737  69942  68717  52614  48555  20447

The codebook includes variable names, definitions, applicable procedures and changes that were made to certain variables. Each story was annotated with the date, source, broadcast start time (morning, noon, afternoon, evening and night, or not broadcast), duration in seconds, word count, placement prominence, story format, big story, geographic focus (local, US national, US international, non-US international), broad story topic, media sector (cable TV, network TV, newspaper, online and radio), and lead newsmaker. The number of outlets and individual programs varies considerably within each media sector, as do the number of stories and the size of the audience.

The index is a good source for analyzing, through time, how stories emerge and sink. Other possibilities include analyzing how the character or narrative focus of a story changes and which broad topic categories get more coverage than others. However, the index does not provide information for additional possible questions, such as tone, sourcing or other matters.

The key variable chosen in this study was "lead newsmaker", a variable that "determines the person whose actions or statements constitute the main subject matter of the story". In the NCI, the "lead newsmaker" variable was derived by a coding team that examined the outlets daily. The researchers established the following definition: the lead newsmaker is the person whose actions or statements constitute the main subject matter of the story and who is discussed in at least 50% of the story (in time or space).

Therefore, in our analysis, a news story is flagged as a novelty whenever a new lead newsmaker appears for the first time, considering the sequence of NCI stories ordered by date. Obviously, this approach does not capture all aspects of novelty in news coverage. It is perfectly possible (and indeed very common) that some new fact is published about a lead newsmaker who has appeared before. However, this approach does capture some aspects of novelty, in the sense that different subjects are being noticed in the media. Furthermore, the approach yields some interesting insights, as discussed next.

4 Results and Discussion

In this section we present the results of applying the novelty signatures proposed by [12] to the NCI dataset. As the main variable used in this study was lead newsmaker, we removed from the dataset all stories where the lead newsmaker was not identified, resulting in a total of 135,205 entries.

Figure 1 shows the two novelty signatures for all stories in the NCI dataset, for Heaps' law and Zipf's law, respectively. The graphs show a very good fit (the blue line in the graphs), indicating that the dynamics of novelties in the NCI dataset also follow the model proposed in [12].

[Figure 1: Novelty signatures for lead newsmaker over all stories in the NCI dataset. The blue line is the best fit to the data. (a) Heaps' law, D(N) versus N: All Stories β=0.81. (b) Zipf's law, f(R) versus R: All Stories α=1.13.]

This is an interesting result per se, but we can move beyond it by conditioning the analysis on groups of news stories. Figure 2 does this by splitting the analysis by media sector (newspaper, online, radio, broadcast TV and cable TV). Figure 2(a) shows how novel lead newsmakers appear in the news sequence for each media sector collected by the NCI. The interpretation of these results is that the steeper the line, the more novelty the media sector has (according to the definition of novelty used in this paper). Surprisingly, newspapers are the media sector with the largest ratio of lead newsmakers per story, followed by online portals, radio, network TV and cable TV. Figure 2(b) offers a complementary view of this result, showing the rank distribution of lead newsmakers for each sector. As Heaps' law and Zipf's law are inversely correlated, the interpretation here is that the steeper the line, the more a media sector concentrates its coverage on a few lead newsmakers. Cable TV repeats lead newsmakers more often than other sectors, and (proportionally) uses fewer lead newsmakers than the other media sectors. Newspapers, on the other hand, proportionally use the top-ranked lead newsmakers less often, and have a larger number of stories with different lead newsmakers.

[Figure 2: Novelty signatures for lead newsmaker in the NCI dataset grouped by media sector. (a) Heaps' law, D(N) versus N: Cable TV β=0.76, Network TV β=0.85, Radio β=0.80, Online β=0.83, Newspaper β=0.83. (b) Zipf's law, f(R) versus R: Cable TV α=1.07, Network TV α=0.98, Radio α=0.97, Online α=0.92, Newspaper α=0.89.]

We can speculate that the higher frequency of novel lead newsmakers in newspapers arises because this medium needs to stay competitive with other media (digital and electronic), which are characterized by the dissemination of news in real time. As the newspaper is a daily medium, it always needs to present something different from what was published on the previous day on TV, radio and the Internet. Despite lagging behind the events of the day (it generally publishes stories from the previous day), the newspaper continues to be a source for rival media because it always aims to have something new in its pages. On the other hand, TV channels have a rotating audience and focus on a narrow range of topics. Thus, their stories focus on a few lead newsmakers. Radio and online media fall somewhere between these two extremes.

To gain some insight into the online versus offline scenario, we break down the analysis into online versus offline media, as shown in Figure 3. The interpretation of the graphs is the same as for Figure 2. Figure 3(a) shows that the online sector introduces more lead newsmakers in its stories, and Figure 3(b) indicates that the same lead newsmaker appears less often in online media. As can be seen from the graphs, online media have stronger novelty signatures. Therefore, online media have a bias towards introducing more distinct lead newsmakers, and a lower tendency to echo the same lead newsmaker in future stories.

[Figure 3: Novelty signatures for lead newsmaker in the NCI dataset grouped by online versus offline media. (a) Heaps' law, D(N) versus N: Non-online β=0.81, Online β=0.84. (b) Zipf's law, f(R) versus R: Online α=0.92, Non-online α=1.12.]

A possible reason for this is that online outlets have a high propensity to show new stories because of differences in the media consumption of their target audience. In general, the audience for online news sources consists of younger people (as discussed in the previous section). These users are less inclined to read in-depth stories and focus on the headlines. They are also more connected and access the news more often, hence the need for novelty in the news stories.

We did a similar analysis, conditioning on the five most frequent broad story topics. The broad story topic variable identifies which of the broad topic categories is addressed by a story. The NCI has 32 broad story categories, but most of them have low frequencies, which makes an analysis difficult due to a lack of data. Figure 4 shows these results. The interpretation of the graphs is the same as for Figure 2, except that the graphs show topics instead of media sectors. Figure 4(a) shows how novel lead newsmakers appear in the news sequence for each of the five most frequent topics collected by the NCI. In these figures, one traditional attribute of news value, Negativity (any reference that is negative), emerges through the Crime broad story topic. Crime is the topic with the largest rate of novel lead newsmakers, followed by economy/economics, US foreign affairs, government agencies/legislatures and, with the lowest rate, campaigns/elections/politics. Figure 4(b) indicates that the most frequent lead newsmakers in crime stories appear proportionally less often in the news than the most frequent lead newsmakers in campaigns/elections/politics. Furthermore, crime is the topic with the largest proportion of lead newsmakers that appear in only a few stories.

[Figure 4: Novelty signatures for lead newsmaker in the NCI dataset grouped by broad story topic (top 5 topics). (a) Heaps' law, D(N) versus N: Campaigns/Elections/Politics β=0.65, Government agencies/Legislatures β=0.81, Crime β=0.91, U.S. foreign affairs β=0.77, Economy/Economics β=0.75. (b) Zipf's law, f(R) versus R: Campaigns/Elections/Politics α=1.35, Government agencies/Legislatures α=1.19, Crime α=1.02, U.S. foreign affairs α=1.18, Economy/Economics α=1.08.]

A similar analysis was performed by start time of the program, as shown in Figure 5. The interpretation of the graphs is the same as for Figure 2, except that the graphs show the program start time instead of media sectors. Figure 5(a) shows that, in general, morning programs introduce new lead newsmakers more often, while night programs have few novel lead newsmakers. On the other hand, Figure 5(b) shows that night programs cite the most prominent lead newsmakers more often than morning programs do. We believe this is also related to the target audience, which in the evening and at night has a higher prevalence of elderly people, who are more interested in in-depth coverage.

[Figure 5: Novelty signatures for lead newsmaker in the NCI dataset grouped by starting time. (a) Heaps' law, D(N) versus N: Not Broadcast β=0.84, Morning Program β=0.83, Noon Program β=0.65, Afternoon Program β=0.79, Evening Program β=0.82, Night Program β=0.69. (b) Zipf's law, f(R) versus R: Afternoon Program α=0.94, Evening Program α=1.06, Morning Program α=0.97, Night Program α=1.02, Noon Program α=1.03, Not Broadcast α=1.03.]

5 Concluding Remarks

In this paper, we examined the dynamics of novelties in the NCI dataset. We used the lead newsmaker as the main variable to define the concept of novelty in our framework. We verified a very good fit of these data to the two novelty signatures discussed in [12].

We obtained interesting insights when conditioning the analysis on media sector and broad story topic. Regarding media sector, we verified that newspapers are the sector with the largest novelty, in terms of the introduction of new lead newsmakers. Furthermore, online media have larger novelty than non-online media. In terms of story topic, crime is the topic with the most novelty, also in terms of lead newsmakers.

We believe these patterns tend to follow the interests of the public in order to get its attention. Thus, there is a need to provide news on topics that better reach a target audience, tailoring the content to that audience. An interesting direction for future work is to analyze whether these patterns will remain similar in the coming years, as the young online audience becomes a more mature generation. Will the behavior stay the same, so that the sectors have to adapt to the news consumption patterns of this generation, or will this generation change its tastes and behave like the previous generation, turning to in-depth stories?

This research has two obvious limitations. First, our adopted definition of novelty does not capture all aspects of novelty, as new information can be published about lead newsmakers who have already appeared in the sequence. However, we believe this definition does capture some aspects of novelty, and it was able to provide some interesting insights on the topic. Second, the data set has a bias towards U.S. media coverage. An interesting direction for further research is to broaden this study to different sources.

References

[1] M. Bednarek and H. Caple. 'Value added': Language, image and news values. Discourse, Context & Media, 1(2–3):103–113, 2012.

[2] H. Caple and M. Bednarek. Delving into the discourse: Approaches to news values in journalism studies and beyond. Technical report, Reuters Institute for the Study of Journalism, 2013.

[3] J. Galtung and M. H. Ruge. The structure of foreign news: The presentation of the Congo, Cuba and Cyprus crises in four Norwegian newspapers. Journal of Peace Research, 2(1):64–90, 1965.

[4] J. Haigh. Polya urn models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(4):942–942, 2009.

[5] H. Heaps. Information Retrieval: Computational and Theoretical Aspects. Academic Press, New York, 1978.

[6] H. Kwak and J. An. Understanding news geography and major determinants of global news coverage of disasters. In Computer+Journalism Symposium, New York, USA, 2014.

[7] D. Lazer, A. Pentland, L. Adamic, S. Aral, A. L. Barabasi, D. Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann, T. Jebara, G. King, M. Macy, D. Roy, and M. Van Alstyne. Life in the network: the coming age of computational social science. Science, 323(5915):721–723, 2009.

[8] L. Lü, Z.-K. Zhang, and T. Zhou. Zipf's law leads to Heaps' law: Analyzing their relation in finite-size systems. PLoS ONE, 5(12):e14139, 2010.

[9] M. Masterton. Asian journalists seek values worth preserving. Asia Pacific Media Educator, 1(16):41–48, 2005.

[10] B. O'Connor, D. Bamman, and N. A. Smith. Computational text analysis for social science: Model assumptions and complexity. In Second NIPS Workshop on Computational Social Science and the Wisdom of Crowds, 2011.

[11] Pew Research Center. News coverage index methodology, 2013. http://www.journalism.org/news_index_methodology/99/.

[12] F. Tria, V. Loreto, V. D. P. Servedio, and S. H. Strogatz. The dynamics of correlated novelties. Sci. Rep., 4, 2014.

[13] F. Wu and B. A. Huberman. Novelty and collective attention. Proceedings of the National Academy of Sciences, 104(45):17599–17601, 2007.

[14] G. K. Zipf. The psycho-biology of language. Language, 12(3):196–210, 1935.