On the Decaying Utility of News Recommendation Models

Benjamin Kille, Technische Universität Berlin, Ernst-Reuter-Platz 7, 10587 Berlin, Germany, benjamin.kille@tu-berlin.de
Sahin Albayrak, Technische Universität Berlin, Ernst-Reuter-Platz 7, 10587 Berlin, Germany, sahin.albayrak@dai-labor.de

ABSTRACT
For how long will a recommendation model provide adequate recommendations? The answer to this question depends on the kind of model, its underlying data, and the domain, among other factors. We analyse how the predictive performance of four types of models changes in the news domain. Our observations show that replacing or updating models is necessary to maintain high predictive performance. The evaluation suggests that an exponential decay model describes the changing predictive performance accurately.

CCS CONCEPTS
• Information systems → Data stream mining; Recommender systems;

KEYWORDS
news recommendation, cold-start, model update, time-awareness, decaying utility

ACM Reference format:
Benjamin Kille and Sahin Albayrak. 2017. On the Decaying Utility of News Recommendation Models. In Proceedings of Workshop on Temporal Reasoning in Recommender Systems, Como, Italy, 31 August, 2017 (TempRec'17), 6 pages. https://doi.org/

Copyright © 2017 for the individual papers by the papers' authors. TempRec'17, 31 August, 2017, Como, Italy

1 INTRODUCTION
Content providers compete to attract and retain information consumers in what can be described as an "attention economy". Therein, consumers trade their attention in exchange for information and entertainment. Brynjolfsson and Oh (2012) stress the difficulty of quantifying the value of such exchanges. Their estimate puts the collective annual value of such exchanges in the United States at 100 billion dollars. Ciampaglia et al. (2015) emphasise the limited attention span for newly published contents. Publishers employ recommender systems to provide consumers better information access (Billsus and Pazzani 2007). Recommender systems reduce vast collections of items to manageable subsets. In dynamic environments, they seek to maximise the number of interactions, thus connecting users and items. The rate at which interactions occur is directly linked to business success. The more users engage with the collection of items, the more advertisements they encounter. The more they enjoy the service, the less likely they are to quit using it. As a result, successful recommender systems represent a competitive advantage.

Research on recommender systems has produced a myriad of methods. These methods take data related to users, items, or the interactions between them. Subsequently, they learn regularities and create a model capturing the essential information. The models include global rankings, sets of rules, and latent factor representations, among others. Consequently, businesses continuously contemplate which model to use to generate recommendations. Ideally, they would choose the model maximising users' attention. However, determining the utility of recommendation models has proven a difficult task. Shani and Gunawardana (2010) point to a variety of properties linked to the performance of recommender systems. These include accuracy, novelty, and diversity. In other words, recommender systems ought to provide relevant, new, and different items.

Frequently, the interaction data is split into disjoint partitions. One partition, the training set, is used to learn a model describing the relation between users and items. The remaining partitions can be used to (a) optimise parameters, and (b) assess the utility. Cross-validation, a procedure wherein random partitions are permutatively used for training or testing, helps to limit the risk of randomly selecting an unrepresentative sample. Still, using the described methodology, we merely obtain information about what the best model would have been at some point in time.

We frame the problem from a slightly different perspective. Suppose we have a set of recommendation models available. Suppose further that we measure utility by the models' ability to predict with which items users will interact in the future. We focus on how the utility of a set of recommendation models changes over time. In particular, we posit the hypothesis that the utility change can be modelled in the form of an exponential decay function. We use part of the data set released for CLEF NewsREEL 2017 to conduct our evaluation (Lommatzsch et al. 2017). The data set comprises logs of various news publishers. News represents a particularly suited domain for our analysis. Publishers publish news articles at high rates. Simultaneously, readers favour novel news. Consequently, we expect models' utility to change rapidly.

This work entails two contributions. First, we formalise the concept of decaying utility of recommender models in the news domain. Second, we conduct experiments for four selected models.

The remainder of this paper commences with Section 2 introducing the notion of decaying utility. Section 3 describes the experimental design used to analyse the changes in utility over time. Section 4 presents our observations. Section 5 notes limitations and discusses our findings. Section 6 relates our work to previously published results and ideas. Section 7 summarises our findings and points to directions for future work.

2 DECAYING UTILITY
Recommender systems provide lists of suggestions upon request. The selection follows a set of rules represented in the form of a model. Models are derived from previously recorded data.
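As a concrete illustration of such a model, a global popularity ranking derived from recorded interactions can be sketched as follows. This is a hypothetical Python interface, not the paper's implementation; all names are illustrative.

```python
from collections import Counter

# Sketch of the model abstraction: a model is learnt from recorded
# (user, item) interactions and, upon request, returns a ranked list
# of suggested items. Names here are hypothetical.

class PopularityModel:
    """Suggests items in decreasing order of how often they were read."""

    def __init__(self, interactions):
        # interactions: iterable of (user, item) pairs recorded during training
        counts = Counter(item for _user, item in interactions)
        self.ranking = [item for item, _count in counts.most_common()]

    def recommend(self, request, k=5):
        # A global model: the requesting interaction (user, item) is ignored.
        return self.ranking[:k]

model = PopularityModel([("u1", "a"), ("u2", "a"), ("u2", "b")])
print(model.recommend(("u3", "b"), k=2))  # ['a', 'b']
```

The next section formalises this abstraction: a model is a function from an observed interaction to a ranked list of suggestions.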
We define the utility of such a model as its ability to correctly predict future interactions between users and items. Formally, let U = {u_1, …, u_M} and I = {i_1, …, i_N} refer to the sets of users and items. The recommender system monitors interactions between users and items, r = (u_m, i_n). Thereby, the system collects a set of interactions R_τ = {r_1, …, r_A}, where interactions occurred in a closed time interval τ = [t_0, T], and interactions are chronologically ordered, t_α < t_{α+1}. A recommendation model M_{R_τ} is a function that takes an interaction r_α and returns a list of suggestions (i_k, i_{k+1}, …, i_K). Let t = [t′_0, T′] with t′_0 > T refer to a subsequent time interval. The utility of M_{R_τ} with respect to t refers to the number of interactions r_α ∈ R_t where u_m had previously been suggested reading i_n. We normalise the utility by dividing by the number of requests. A request refers to each interaction occurring in t. Thereby, we obtain a utility measure which we refer to as the response rate. In practice, the response rate can be monitored by keeping track of which items have been recommended by the model.

We hypothesise that the utility, or more concretely the response rate, follows an exponential decay. Similar to radioactive decay, readers perceive an article as particularly interesting close to its publication. As time progresses, the news has spread and the article attracts fewer readers. Exponential decay is characterised by the function f(t) = U · e^{Vt}, wherein U and V are the parameters. The function describes a decay if V < 0. Alternatively, the half-life t_{1/2} = ln 2 / (−V) describes the time it takes to arrive at half the initial quantity.

3 EXPERIMENT
We conducted experiments to measure the change of utility in terms of response rates for a selection of models. We consider the four publishers whose characteristics are shown in Table 1. The data correspond to one week of the NewsREEL 2017 data set. We notice that sessions include few articles. Publisher B observes merely 3.3 articles per session on average. This impedes using models which rely heavily on sufficiently expressive user profiles, such as collaborative filtering. For each publisher, we consider the time between 1–9 February, 2016. We learn four types of models, each with the data of 1 February, 2016. First, the random model takes all articles and suggests a random subset. Second, the freshness model suggests the articles in chronologically reversed order of publication. Third, the popularity model suggests articles proportional to how frequently they had been read. Fourth, the sequence model uses the frequency of reading sequences. In other words, given an article i_n, the model suggests another item proportional to the frequency with which it had been read after i_n. We apply each model to all requests in the time 2–8 February, 2016. We determine whether readers subsequently read any of the suggested articles. With this information, we compute the average response rate for each hour. In addition, we monitor newly added articles and derive the coverage of the models. The coverage is defined as the proportion of known articles covered by a model. The coverage naturally shrinks as the publishers release more and more articles unknown to the models. We repeat this procedure shifting the period by one day at a time. Thereby, we can compare the differences in response rates for the same day given different models.

Figure 1: For each publisher, we consider the response rates for four types of models. The response rate is plotted on a logarithmic scale to prevent cluttering.

Table 1: We consider four publishers, each referred to by a character label. Content refers to the category of news the publisher offers. The data has been collected during 1–9 February 2016. Sessions refers to the number of unique session cookies observed. Articles refers to the number of unique articles which users read at least once. Interactions refers to the total number of reads. For each publisher, we present the mean number as well as the standard deviation of reads per session. Likewise, we include the mean number and standard deviation of new articles added per hour. Note that besides additions, publishers can change articles to include new information.

Label  Content                 Sessions    Articles  Interactions  Interactions per Session  Articles per Hour
A      general news              616 539     74 172     3 860 115   6.2 ± 12.2                19.8 ± 12.6
B      information technology     24 643      2 735        82 540   3.3 ± 4.0                  1.0 ± 0.2
C      general news              815 260     58 392     5 772 802   7.1 ± 16.9                 7.0 ± 4.5
D      sports                  1 437 161     12 028    20 227 882  14.1 ± 30.5                 7.2 ± 4.5

4 EVALUATION
We consider the change in response rate as an appropriate proxy for the utility of a recommender model over time. Figure 1 shows the change in response rates over time for all combinations of publishers and models. The response rates are plotted on a logarithmic scale. For all models and publishers, we observe a decreasing trend in response rates. The sequences model exhibits the highest response rate for publishers A, B, and C. The popularity model exhibits the highest response rate for publisher D. The random model performs worst in the initial phase and mostly stagnates at this level. The popularity model overtakes the sequences model over time. This implies that businesses need to carefully monitor performances. Figure 2 shows the relation between response rates and coverage for publisher A. We observe that as coverage decreases, all models lose predictive accuracy. The effect is most apparent for the freshness model.

We analyse how much we could gain by retraining the models on a daily schedule. We focus on the sequences model. Figure 3 contrasts the response rate with the number of requests and the coverage. The top part of each subfigure shows the number of requests. At times with fewer requests, response rates are based on a smaller set of interactions. We observe this phenomenon particularly at night time. The bottom part of each subfigure shows the coverage. The retrained models are shown in varying colours. Similarly, the centre part shows the response rates in corresponding colour schemes. Initially, models have a relatively high predictive quality. The predictive performance subsequently decreases and stabilises at a noticeably lower level compared to the initial performance. We observe a noticeable difference in predictive performance between the retrained models and their predecessors. This effect appears closely linked to the coverage, which shows a similar trend. The observations are consistent for all four publishers and affirm the expectation of an exponential decay phenomenon. Publisher B attracts fewer visitors and exhibits higher variance compared to the other publishers. Retraining models appears particularly beneficial for publisher D, for which the decline in predictive performance quickly renders models useless.

Figure 4 illustrates the loss in predictive performance incurred when using the initial model as opposed to learning a new model on the second day. We observe that the loss is highest on the first day for all publishers. The differences in utility level off over time. For publisher B, we observe that the older model occasionally performs better than the new model.

We have fitted an exponential function to our results using the least squares method. Table 2 conveys the exponential fits to the response rates for combinations of publishers and models. We observe that the initial response rates (U) vary considerably. The random model has particularly low initial response rates. Conversely, the sequences model scores highest with respect to initial response rates. All fits exhibit decay, V < 0, with the exception of the random model for publisher B. Recall that publisher B observed fewer interactions than the other publishers. This could cause higher levels of variance.
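The least-squares fit of f(t) = U · e^{Vt} to hourly response rates can be linearised (ln f = ln U + Vt) and solved in closed form. The paper does not specify its exact fitting procedure, so the following is an illustrative sketch under that log-linear assumption, using synthetic data generated from the fit reported for publisher A's sequences model; it also assumes strictly positive response rates.

```python
import math

# Illustrative log-linear least-squares fit of f(t) = U * exp(V * t).
# Assumes all observed rates are strictly positive.

def fit_exponential(times, rates):
    logs = [math.log(r) for r in rates]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    # Ordinary least squares on (t, ln rate): slope V, intercept ln U.
    v = (sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
         / sum((t - mean_t) ** 2 for t in times))
    u = math.exp(mean_y - v * mean_t)
    return u, v

def half_life(v):
    """Time until the response rate halves; defined only for decay (v < 0)."""
    return math.log(2) / -v

# Synthetic hourly data from 0.44 * exp(-0.0088 * t) over one week.
times = list(range(0, 168, 4))
rates = [0.44 * math.exp(-0.0088 * t) for t in times]
u, v = fit_exponential(times, rates)
print(round(u, 4), round(v, 4))  # 0.44 -0.0088
print(round(half_life(v), 1))    # ~78.8 hours
```

With these parameters, the half-life ln 2 / 0.0088 ≈ 79 hours means such a model would lose half of its initial response rate in roughly three days.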
5 DISCUSSION AND LIMITATIONS
The evaluation indicates that exponential decay models represent a suited first attempt to mathematically describe how the utility of recommendation models changes over time. The parameters vary among publishers and models. Still, Figure 3 shows similar trends for the sequence models across all publishers, regardless of which day we picked. The coverage appears highly related to the decaying response rates. Figure 3 and Figure 2 illustrate this relation. As time passes, publishers add new articles to their collections. Unless we update the models used to provide recommendations, they cover an ever smaller proportion of articles. The distribution of requests over the course of the day affects the response rates. Figure 3 illustrates the differences in requests for all four publishers. We observe a periodic pattern with more requests during the day and fewer requests at night. In addition, we observe that as the coverage arrives at 50 %, the response rates level off for the sequence models and all four publishers. Figure 4 shows that switching to a retrained model is most beneficial on the first day. This suggests that publishers should replace or update their models at least once a day.

Additional experimentation is necessary to analyse how the choice of data used to create the model affects its utility. We have kept the training data set to the length of one day in our experiments. Using more data and/or different types of models represents a direction to explore further. Our experiments used recorded data and inferred the utility rather than observing actual interactions resulting from recommendations generated by our models. Joachims et al. (2017) discuss how counterfactual reasoning facilitates using logged information more effectively. Unfortunately, we lack the required information on the internal parameters of the recommender systems to apply their method. Our experiments are based on part of the NewsREEL data set. In order to verify our findings, we have to conduct experiments with the feedback of actual readers. This will confirm whether the selection of publishers or the time period may have biased the findings.

Figure 2: We consider the relation between coverage and response rate for publisher A.

Figure 3: Evaluation Results Overview. Each subfigure refers to a single publisher. Each subfigure contains three parts: at the top, the frequency of requests; at the centre, the response rate (RR) referring to the sequence model; and at the bottom, the coverage. For response rate and coverage, a colour scheme refers to the day at which the model has been created. Night times are highlighted in light blue.
Table 2: Exponential fit to the response rates observed for combinations of publishers and models.

Publisher  random                 freshness              popularity             sequences
A          0.0000 · e^(−1.8058t)  0.0069 · e^(−0.0712t)  0.0292 · e^(−0.0376t)  0.4406 · e^(−0.0088t)
B          0.0018 · e^(0.0028t)   0.0778 · e^(−0.0206t)  0.0655 · e^(−0.0106t)  0.3498 · e^(−0.0039t)
C          0.0005 · e^(−0.0176t)  0.0400 · e^(−0.0505t)  0.0590 · e^(−0.0140t)  0.4853 · e^(−0.0107t)
D          0.0027 · e^(−0.0531t)  0.0877 · e^(−0.0875t)  0.0646 · e^(−0.0057t)  0.2887 · e^(−0.1079t)

6 RELATED WORK
The decreasing predictive performance of models has been discussed by Jambor et al. (2012) for the domain of movies. They employed methods from Control Theory to devise an optimised updating strategy. Movies exhibit different characteristics than news. In particular, people tend to revisit movies much more frequently than news, thus impeding comparisons to our work. Koren (2009) focused on collaborative filtering. He introduced a latent factor model which captures the temporal development of preferences. Thereby, he could more accurately predict how users rate movies. Collaborative filtering requires expressive user profiles with sufficiently clearly stated preferences. News consumption happens anonymously, disallowing the creation of such profiles. As Table 1 illustrates, publishers generally get to know readers' preferences for few articles. News recommender systems have to work in conditions in which little information is available about user preferences. Baltrunas and Amatriain (2009) extended time-aware collaborative filtering to implicit feedback. Implicit feedback can be derived from log files such as the ones used in our experiment. Still, they apply their method to movies, which again exhibit characteristics different from news. Campos et al. (2014) discussed time-aware evaluation protocols. They introduce a scheme to categorise evaluation protocols focussing on rating prediction. Their scheme assigns our work to the time-dependent cross-validation category. Much of the work on time-aware evaluation of recommender systems has focused on movies and rating prediction.

Das et al. (2007) present the news personalisation system used for Google's news aggregator. Their system employs covisitation counts similar to our sequence model. In addition, they use probabilistic latent semantic indexing and MinHash clustering to improve their response rates. The news aggregator has access to much more comprehensive user profiles for the subset of users reading news while logged in with their Google accounts. Li et al. (2010) represent news recommendation as a contextual-bandit problem. Therein, the system has a set of choices, modelled figuratively as the arms of bandits found in casinos. The system learns how to choose depending on the context. Garcin et al. (2013) introduce the notion of context trees to news recommendation. Context trees capture particularities of a situation and use them to select a better set of articles to be recommended.

7 CONCLUSION AND FUTURE WORK
We have introduced the notion of utility decay for news recommender systems. The utility decay refers to a model's decreasing ability to correctly anticipate future interactions between users and items. Experiments with data from four publishers have confirmed that exponential decay functions can be used to describe the changes of response rates over time. We observed a similar pattern for the coverage, the proportion of articles a model can potentially suggest. We conjecture that there is a strong relation between the two quantities. The relation depends on factors including the publisher and the type of model. Further evaluation is necessary to improve the understanding of utility decay in news recommendation. First, we will consider varying the time span used to learn a model.
This will show whether reducing or increasing the amount of data describes the changes of response rates more accurately. Second, we will consider additional types of models. With little information concerning users, we plan to evaluate an item-based latent factor model. We intend to participate in the next edition of NewsREEL to verify our findings with the feedback of actual news readers. Finally, we will evaluate additional time periods to verify that the observed pattern is not due to choosing a particular time.

Figure 4: Comparison of response rates for the sequence models learnt on 1 February (t − 1) and 2 February (t) in the period 2–9 February, 2016. The highlighted areas show the loss in predictive performance by using the older model.

REFERENCES
Linas Baltrunas and Xavier Amatriain. 2009. Towards Time-dependant Recommendation based on Implicit Feedback. In Workshop on Context-aware Recommender Systems.
Daniel Billsus and Michael J. Pazzani. 2007. Adaptive News Access. The Adaptive Web (2007), 550–570.
Erik Brynjolfsson and JooHee Oh. 2012. The Attention Economy: Measuring the Value of Free Digital Services on the Internet. In ICIS.
Pedro G. Campos, Fernando Díez, and Iván Cantador. 2014. Time-aware Recommender Systems: a Comprehensive Survey and Analysis of Existing Evaluation Protocols. User Modeling and User-Adapted Interaction 24, 1-2 (2014), 67–119.
Giovanni Luca Ciampaglia, Alessandro Flammini, and Filippo Menczer. 2015. The production of information in the attention economy. Scientific Reports 5 (May 2015).
Abhinandan Das, Mayur Datar, Ashutosh Garg, and Shyamsundar Rajaram. 2007. Google news personalization: scalable online collaborative filtering. In WWW. ACM, New York, NY, USA, 271–280.
Florent Garcin, Christos Dimitrakakis, and Boi Faltings. 2013. Personalized news recommendation with context trees. In RecSys. ACM, New York, NY, USA, 105–112.
Tamas Jambor, Jun Wang, and Neal Lathia. 2012. Using Control Theory for Stable and Efficient Recommender Systems. In WWW. ACM, New York, NY, USA, 11–20.
Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In WSDM. ACM, New York, NY, USA, 781–789.
Yehuda Koren. 2009. Collaborative Filtering with Temporal Dynamics. In KDD. 447–456.
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW. ACM, New York, NY, USA, 661–670.
Andreas Lommatzsch, Benjamin Kille, Frank Hopfgartner, Martha Larson, Torben Brodt, Jonas Seiler, and Özlem Özgöbek. 2017. CLEF 2017 NewsREEL Overview: A Stream-based Evaluation Task for Evaluation and Education. Springer.
Guy Shani and Asela Gunawardana. 2010. Evaluating Recommendation Systems.