On the Decaying Utility of News Recommendation Models

Benjamin Kille, Technische Universität Berlin, Ernst-Reuter-Platz 7, 10587 Berlin, Germany, benjamin.kille@tu-berlin.de
Sahin Albayrak, Technische Universität Berlin, Ernst-Reuter-Platz 7, 10587 Berlin, Germany, sahin.albayrak@dai-labor.de

ABSTRACT
For how long will a recommendation model provide adequate recommendations? The answer to this question depends on the kind of model, its underlying data, and the domain, among other factors. We analyse how the predictive performance of four types of models changes in the news domain. Our observations show that replacing or updating models is necessary to maintain high predictive performance. The evaluation suggests that an exponential decay model describes the changing predictive performance accurately.

CCS CONCEPTS
• Information systems → Data stream mining; Recommender systems;

KEYWORDS
news recommendation, cold-start, model update, time-awareness, decaying utility

ACM Reference format:
Benjamin Kille and Sahin Albayrak. 2017. On the Decaying Utility of News Recommendation Models. In Proceedings of Workshop on Temporal Reasoning in Recommender Systems, Como, Italy, 31 August, 2017 (TempRec'17), 6 pages. https://doi.org/

Copyright © 2017 for the individual papers by the papers' authors. TempRec'17, 31 August, 2017, Como, Italy

1 INTRODUCTION
Content providers compete to attract and retain information consumers in what can be described as an "attention economy". Therein, consumers trade their attention in exchange for information and entertainment. Brynjolfsson and Oh (2012) stress the difficulty of quantifying the value of such exchanges. Their estimate puts the collective annual value of such exchanges in the United States at 100 billion dollars. Ciampaglia et al. (2015) emphasise the limited attention span for newly published contents. Publishers employ recommender systems to provide consumers better information access (Billsus and Pazzani 2007). Recommender systems reduce vast collections of items to manageable subsets. In dynamic environments, they seek to maximise the number of interactions, thus connecting users and items. The rate at which interactions occur is directly linked to business success. The more users engage with the collection of items, the more advertisements they encounter. The more they enjoy the service, the less likely they are to quit using it. As a result, successful recommender systems represent a competitive advantage.

Research on recommender systems has produced a myriad of methods. These methods take data related to users, items, or the interactions between them. Subsequently, they learn regularities and create a model capturing the essential information. The models include global rankings, sets of rules, and latent factor representations, among others. Consequently, businesses continuously contemplate which model to use to generate recommendations. Ideally, they would choose the model maximising users' attention. However, determining the utility of recommendation models has proven a difficult task. Shani and Gunawardana (2010) point to a variety of properties linked to the performance of recommender systems. These include accuracy, novelty, and diversity. In other words, recommender systems ought to provide relevant, new, and different items.

Frequently, the interaction data is split into disjoint partitions. One partition, the training set, is used to learn a model describing the relation between users and items. The remaining partitions can be used to (a) optimise parameters, and (b) assess the utility. Cross-validation, a procedure wherein random partitions are permutatively used for training or testing, helps to limit the risk of randomly selecting an unrepresentative sample. Still, using the described methodology, we merely obtain information about what the best model would have been at some point in time.

We frame the problem from a slightly different perspective. Suppose we have a set of recommendation models available. Suppose further that we measure utility by the models' ability to predict with which items users will interact in the future. We focus on how the utility of a set of recommendation models changes over time. In particular, we posit the hypothesis that the utility change can be modelled in the form of an exponential decay function. We use part of the data set released for CLEF NewsREEL 2017 to conduct our evaluation (Lommatzsch et al. 2017). The data set comprises logs of various news publishers. News represents a particularly suited domain for our analysis. Publishers publish news articles at high rates. Simultaneously, readers favour novel news. Consequently, we expect models' utility to change rapidly.

This work entails two contributions. First, we formalise the concept of decaying utility of recommender models in the news domain. Second, we conduct experiments for four selected models.

The remainder of this paper commences with Section 2 introducing the notion of decaying utility. Section 3 describes the experimental design used to analyse the changes in utility over time. Section 4 presents our observations. Section 5 notes limitations and discusses our findings. Section 6 relates our work to previously published results and ideas. Section 7 summarises our findings and points to directions for future work.

2 DECAYING UTILITY
Recommender systems provide lists of suggestions upon request. The selection follows a set of rules represented in the form of a model. Models are derived from previously recorded data.
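As a concrete illustration of such a model, a global popularity ranking derived from recorded interactions can be sketched as follows. This is a hypothetical Python interface, not the paper's implementation; all names are illustrative.

```python
from collections import Counter

# Sketch of the model abstraction: a model is learnt from recorded
# (user, item) interactions and, upon request, returns a ranked list
# of suggested items. Names here are hypothetical.

class PopularityModel:
    """Suggests items in decreasing order of how often they were read."""

    def __init__(self, interactions):
        # interactions: iterable of (user, item) pairs recorded during training
        counts = Counter(item for _user, item in interactions)
        self.ranking = [item for item, _count in counts.most_common()]

    def recommend(self, request, k=5):
        # A global model: the requesting interaction (user, item) is ignored.
        return self.ranking[:k]

model = PopularityModel([("u1", "a"), ("u2", "a"), ("u2", "b")])
print(model.recommend(("u3", "b"), k=2))  # ['a', 'b']
```

The next section formalises this abstraction: a model is a function from an observed interaction to a ranked list of suggestions.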
We define the utility of such a model as its ability to correctly predict future interactions between users and items. Formally, let U = {u_1, …, u_M} and I = {i_1, …, i_N} refer to the sets of users and items. The recommender system monitors interactions between users and items, r = (u_m, i_n). Thereby, the system collects a set of interactions R_τ = {r_1, …, r_A}, where interactions occurred in a closed time interval τ = [t_0, T], and interactions are chronologically ordered, t_α < t_{α+1}. A recommendation model M_{R_τ} is a function that takes an interaction r_α and returns a list of suggestions (i_k, i_{k+1}, …, i_K). Let t = [t′_0, T′] with t′_0 > T refer to a subsequent time interval. The utility of M_{R_τ} with respect to t refers to the number of interactions r_α ∈ R_t where u_m had previously been suggested reading i_n. We normalise the utility by dividing by the number of requests. A request refers to each interaction occurring in t. Thereby, we obtain a utility measure which we refer to as the response rate. In practice, the response rate can be monitored by keeping track of which items have been recommended by the model.

We hypothesise that the utility, or more concretely the response rate, follows an exponential decay. Similar to radioactive decay, readers perceive an article as particularly interesting close to its publication. As time progresses, the news has spread and the article attracts fewer readers. Exponential decay is characterised by the function f(t) = U · e^{Vt}, wherein U and V are the parameters. The function describes a decay if V < 0. Alternatively, the half-life t_{1/2} = ln 2 / (−V) describes the time it takes to arrive at half the initial quantity.

3 EXPERIMENT
We conducted experiments to measure the change of utility in terms of response rates for a selection of models. We consider the four publishers whose characteristics are shown in Table 1. The data correspond to one week of the NewsREEL 2017 data set. We notice that sessions include few articles. Publisher B observes merely 3.3 articles per session on average. This impedes using models which rely heavily on sufficiently expressive user profiles, such as collaborative filtering. For each publisher, we consider the time between 1–9 February, 2016. We learn four types of models, each with the data of 1 February, 2016. First, the random model takes all articles and suggests a random subset. Second, the freshness model suggests the articles in chronologically reversed order of publication. Third, the popularity model suggests articles proportional to how frequently they had been read. Fourth, the sequence model uses the frequency of reading sequences. In other words, given an article i_n, the model suggests another item proportional to the frequency with which it had been read after i_n. We apply each model to all requests in the time 2–8 February, 2016. We determine whether readers subsequently read any of the suggested articles. With this information, we compute the average response rate for each hour. In addition, we monitor newly added articles and derive the coverage of the models. The coverage is defined as the proportion of known articles covered by a model. The coverage naturally shrinks as the publishers release more and more articles unknown to the models. We repeat this procedure shifting the period by one day at a time. Thereby, we can compare the differences in response rates for the same day given different models.

Figure 1: For each publisher, we consider the response rates for four types of models. The response rate is plotted on a logarithmic scale to prevent cluttering.

Table 1: We consider four publishers, each referred to by a character label. Content refers to the category of news the publisher offers. The data has been collected during 1–9 February 2016. Sessions refers to the number of unique session cookies observed. Articles refers to the number of unique articles which users read at least once. Interactions refers to the total number of reads. For each publisher, we present the mean number as well as the standard deviation of reads per session. Likewise, we include the mean number and standard deviation of new articles added per hour. Note that besides additions, publishers can change articles to include new information.

Label  Content                 Sessions    Articles  Interactions  Interactions per Session  Articles per Hour
A      general news              616 539     74 172     3 860 115   6.2 ± 12.2                19.8 ± 12.6
B      information technology     24 643      2 735        82 540   3.3 ± 4.0                  1.0 ± 0.2
C      general news              815 260     58 392     5 772 802   7.1 ± 16.9                 7.0 ± 4.5
D      sports                  1 437 161     12 028    20 227 882  14.1 ± 30.5                 7.2 ± 4.5

4 EVALUATION
We consider the change in response rate as an appropriate proxy for the utility of a recommender model over time. Figure 1 shows the change in response rates over time for all combinations of publishers and models. The response rates are plotted on a logarithmic scale. For all models and publishers, we observe a decreasing trend in response rates. The sequences model exhibits the highest response rate for publishers A, B, and C. The popularity model exhibits the highest response rate for publisher D. The random model performs worst in the initial phase and mostly stagnates at this level. The popularity model overtakes the sequences model over time. This implies that businesses need to carefully monitor performances. Figure 2 shows the relation between response rates and coverage for publisher A. We observe that as coverage decreases, all models lose predictive accuracy. The effect is most apparent for the freshness model.

We analyse how much we could gain by retraining the models on a daily schedule. We focus on the sequences model. Figure 3 contrasts the response rate with the number of requests and the coverage. The top part of each subfigure shows the number of requests. At times with fewer requests, response rates are based on a smaller set of interactions. We observe this phenomenon particularly at night time. The bottom part of each subfigure shows the coverage. The retrained models are shown in varying colours. Similarly, the centre part shows the response rates in corresponding colour schemes. Initially, models have a relatively high predictive quality. The predictive performance subsequently decreases and stabilises at a noticeably lower level compared to the initial performance. We observe a noticeable difference in predictive performance between the retrained models and their predecessors. This effect appears closely linked to the coverage, which shows a similar trend. The observations are consistent for all four publishers and affirm the expectation of an exponential decay phenomenon. Publisher B attracts fewer visitors and exhibits higher variance compared to the other publishers. Retraining models appears particularly beneficial for publisher D, for which the decline in predictive performance quickly renders models useless.

Figure 4 illustrates the loss in predictive performance incurred when using the initial model as opposed to learning a new model on the second day. We observe that the loss is highest on the first day for all publishers. The differences in utility level off over time. For publisher B, we observe that the older model occasionally performs better than the new model.

We have fitted an exponential function to our results using the least squares method. Table 2 conveys the exponential fits to the response rates for combinations of publishers and models. We observe that the initial response rates (U) vary considerably. The random model has particularly low initial response rates. Conversely, the sequences model scores highest with respect to initial response rates. All fits exhibit decay, V < 0, with the exception of the random model for publisher B. Recall that publisher B observed fewer interactions than the other publishers. This could cause higher levels of variance.
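The least-squares fit of f(t) = U · e^{Vt} to hourly response rates can be linearised (ln f = ln U + Vt) and solved in closed form. The paper does not specify its exact fitting procedure, so the following is an illustrative sketch under that log-linear assumption, using synthetic data generated from the fit reported for publisher A's sequences model; it also assumes strictly positive response rates.

```python
import math

# Illustrative log-linear least-squares fit of f(t) = U * exp(V * t).
# Assumes all observed rates are strictly positive.

def fit_exponential(times, rates):
    logs = [math.log(r) for r in rates]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    # Ordinary least squares on (t, ln rate): slope V, intercept ln U.
    v = (sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
         / sum((t - mean_t) ** 2 for t in times))
    u = math.exp(mean_y - v * mean_t)
    return u, v

def half_life(v):
    """Time until the response rate halves; defined only for decay (v < 0)."""
    return math.log(2) / -v

# Synthetic hourly data from 0.44 * exp(-0.0088 * t) over one week.
times = list(range(0, 168, 4))
rates = [0.44 * math.exp(-0.0088 * t) for t in times]
u, v = fit_exponential(times, rates)
print(round(u, 4), round(v, 4))  # 0.44 -0.0088
print(round(half_life(v), 1))    # ~78.8 hours
```

With these parameters, the half-life ln 2 / 0.0088 ≈ 79 hours means such a model would lose half of its initial response rate in roughly three days.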
5 DISCUSSION AND LIMITATIONS
The evaluation indicates that exponential decay models represent a suited first attempt to mathematically describe how the utility of recommendation models changes over time. The parameters vary among publishers and models. Still, Figure 3 shows similar trends for the sequence models across all publishers, regardless of which day we picked. The coverage appears highly related to the decaying response rates. Figure 3 and Figure 2 illustrate this relation. As time passes, publishers add new articles to their collections. Unless we update the models used to provide recommendations, they cover an ever smaller proportion of articles. The distribution of requests over the course of the day affects the response rates. Figure 3 illustrates the differences in requests for all four publishers. We observe a periodic pattern with more requests during the day and fewer requests at night. In addition, we observe that as the coverage arrives at 50 %, the response rates level off for the sequence models and all four publishers. Figure 4 shows that switching to a retrained model is most beneficial on the first day. This suggests that publishers should replace or update their models at least once a day.

Additional experimentation is necessary to analyse how the choice of data used to create the model affects its utility. We have kept the training data set to the length of one day in our experiments. Using more data and/or different types of models represents a direction to explore further. Our experiments used recorded data and inferred the utility rather than observing actual interactions resulting from recommendations generated by our models. Joachims et al. (2017) discuss how counterfactual reasoning facilitates using logged information more effectively. Unfortunately, we lack the required information on the internal parameters of the recommender systems to apply their method. Our experiments are based on part of the NewsREEL data set. In order to verify our findings, we have to conduct experiments with the feedback of actual readers. This will confirm whether the selection of publishers or the time period may have biased the findings.

Figure 2: We consider the relation between coverage and response rate for publisher A.

Figure 3: Evaluation Results Overview. Each subfigure refers to a single publisher. Each subfigure contains three parts: at the top, the frequency of requests; at the centre, the response rate (RR) referring to the sequence model; and at the bottom, the coverage. For response rate and coverage, a colour scheme refers to the day at which the model has been created. Night times are highlighted in light blue.
Table 2: Exponential fit to the response rates observed for combinations of publishers and models.

Publisher  random                 freshness              popularity             sequences
A          0.0000 · e^(−1.8058t)  0.0069 · e^(−0.0712t)  0.0292 · e^(−0.0376t)  0.4406 · e^(−0.0088t)
B          0.0018 · e^(0.0028t)   0.0778 · e^(−0.0206t)  0.0655 · e^(−0.0106t)  0.3498 · e^(−0.0039t)
C          0.0005 · e^(−0.0176t)  0.0400 · e^(−0.0505t)  0.0590 · e^(−0.0140t)  0.4853 · e^(−0.0107t)
D          0.0027 · e^(−0.0531t)  0.0877 · e^(−0.0875t)  0.0646 · e^(−0.0057t)  0.2887 · e^(−0.1079t)

6 RELATED WORK
The decreasing predictive performance of models has been discussed by Jambor et al. (2012) for the domain of movies. They employed methods from Control Theory to devise an optimised updating strategy. Movies exhibit different characteristics than news. In particular, people tend to revisit movies much more frequently than news, thus impeding comparisons to our work. Koren (2009) focused on collaborative filtering. He introduced a latent factor model which captures the temporal development of preferences. Thereby, he could more accurately predict how users rate movies. Collaborative filtering requires expressive user profiles with sufficiently clearly stated preferences. News consumption happens anonymously, disallowing the creation of such profiles. As Table 1 illustrates, publishers generally get to know readers' preferences for few articles. News recommender systems have to work in conditions in which little information is available about user preferences. Baltrunas and Amatriain (2009) extended time-aware collaborative filtering to implicit feedback. Implicit feedback can be derived from log files such as the ones used in our experiment. Still, they apply their method to movies, which again exhibit characteristics different from news. Campos et al. (2014) discussed time-aware evaluation protocols. They introduce a scheme to categorise evaluation protocols focussing on rating prediction. Their scheme assigns our work to the time-dependent cross-validation category. Much of the work on time-aware evaluation of recommender systems has focused on movies and rating prediction.

Das et al. (2007) present the news personalisation system used for Google's news aggregator. Their system employs covisitation counts similar to our sequence model. In addition, they use probabilistic latent semantic indexing and MinHash clustering to improve their response rates. The news aggregator has access to much more comprehensive user profiles for the subset of users reading news while logged in with their Google accounts. Li et al. (2010) represent news recommendation as a contextual-bandit problem. Therein, the system has a set of choices, modelled figuratively as the arms of bandits found in casinos. The system learns how to choose depending on the context. Garcin et al. (2013) introduce the notion of context trees to news recommendation. Context trees capture particularities of a situation and use them to select a better set of articles to be recommended.

7 CONCLUSION AND FUTURE WORK
We have introduced the notion of utility decay for news recommender systems. The utility decay refers to a model's decreasing ability to correctly anticipate future interactions between users and items. Experiments with data from four publishers have confirmed that exponential decay functions can be used to describe the changes of response rates over time. We observed a similar pattern for the coverage, the proportion of articles a model can potentially suggest. We conjecture that there is a strong relation between the two quantities. The relation depends on factors including the publisher and the type of model. Further evaluation is necessary to improve the understanding of utility decay in news recommendation. First, we will consider varying the time span used to learn a model.
This will show whether reducing or increasing the amount of data describes the changes of response rates more accurately. Second, we will consider additional types of models. With little information concerning users, we plan to evaluate an item-based latent factor model. We intend to participate in the next edition of NewsREEL to verify our findings with the feedback of actual news readers. Finally, we will evaluate additional time periods to verify that the observed pattern is not due to choosing a particular time.

Figure 4: Comparison of response rates for the sequence models learnt on 1 February (t − 1) and 2 February (t) in the period 2–9 February, 2016. The highlighted areas show the loss in predictive performance by using the older model.

REFERENCES
Linas Baltrunas and Xavier Amatriain. 2009. Towards Time-dependant Recommendation based on Implicit Feedback. In Workshop on Context-aware Recommender Systems.
Daniel Billsus and Michael J. Pazzani. 2007. Adaptive News Access. The Adaptive Web (2007), 550–570.
Erik Brynjolfsson and JooHee Oh. 2012. The Attention Economy: Measuring the Value of Free Digital Services on the Internet. In ICIS.
Pedro G. Campos, Fernando Díez, and Iván Cantador. 2014. Time-aware Recommender Systems: a Comprehensive Survey and Analysis of Existing Evaluation Protocols. User Modeling and User-Adapted Interaction 24, 1-2 (2014), 67–119.
Giovanni Luca Ciampaglia, Alessandro Flammini, and Filippo Menczer. 2015. The production of information in the attention economy. Scientific Reports 5 (May 2015).
Abhinandan Das, Mayur Datar, Ashutosh Garg, and Shyamsundar Rajaram. 2007. Google news personalization: scalable online collaborative filtering. In WWW. ACM, New York, NY, USA, 271–280.
Florent Garcin, Christos Dimitrakakis, and Boi Faltings. 2013. Personalized news recommendation with context trees. In RecSys. ACM, New York, NY, USA, 105–112.
Tamas Jambor, Jun Wang, and Neal Lathia. 2012. Using Control Theory for Stable and Efficient Recommender Systems. In WWW. ACM, New York, NY, USA, 11–20.
Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In WSDM. ACM, New York, NY, USA, 781–789.
Yehuda Koren. 2009. Collaborative Filtering with Temporal Dynamics. In KDD. 447–456.
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW. ACM, New York, NY, USA, 661–670.
Andreas Lommatzsch, Benjamin Kille, Frank Hopfgartner, Martha Larson, Torben Brodt, Jonas Seiler, and Özlem Özgöbek. 2017. CLEF 2017 NewsREEL Overview: A Stream-based Evaluation Task for Evaluation and Education. Springer.
Guy Shani and Asela Gunawardana. 2010. Evaluating Recommendation Systems.