Time-dependent Evaluation of Recommender Systems

Teresa Scheidt¹ and Joeran Beel²
¹ Lund University, Box 117, 221 00 Lund, Sweden (teresascheidt@gmail.com)
² Siegen University, Adolf-Reichwein-Straße 2, 57076 Siegen, Germany (joeran.beel@uni-siegen.de)

Perspectives on the Evaluation of Recommender Systems Workshop (PERSPECTIVES 2021), September 25th, 2021, co-located with the 15th ACM Conference on Recommender Systems, Amsterdam, The Netherlands.

Abstract
Evaluation of recommender systems is an actively discussed topic in the recommender-system community. However, some aspects of evaluation have received little to no attention, one of them being whether evaluating recommender-system algorithms with single-number metrics is sufficient. When results are presented as a single number, the only possible assumption is a stable performance over time regardless of changes in the dataset, while it intuitively seems more likely that the performance changes over time. We suggest presenting results over time, making it possible to identify trends and changes in performance as the dataset grows and changes. In this paper, we analyze 6 algorithms on 10 datasets over time to assess the need for a time-dependent evaluation. To enable this evaluation over time, we split the datasets into smaller subsets based on the provided timestamps. At every tested timepoint we use all available data up to that timepoint, simulating a growing dataset as encountered in the real world. Our results show that for 90% of the datasets the performance changes over time, and for 60% even the ranking of algorithms changes over time.

Keywords
Recommender Systems, Evaluation, Time-dependent Evaluation

1. Introduction

Recommender-system evaluation is an actively discussed topic in the community. Discussions include the advantages and disadvantages of evaluation methods such as online evaluations, offline evaluations, and user studies [7, 11], or the ideal metrics to measure recommendation effectiveness [1]. An issue that has received little attention is the question of whether presenting results as a single number (e.g. recall@10 = 0.64) is sufficient, an issue first raised by Beel in a research proposal [4], on which this work is based.

Typically, researchers present results as single-number metrics, e.g. precision, normalized discounted cumulative gain (nDCG), recall, or root mean square error (RMSE). These metrics are computed over the whole dataset and carry no information about whether an algorithm performs the same over the whole time period or whether it improved or worsened over time. Consequently, there is some uncertainty about how the algorithms will perform in the future when the dataset changes or grows. When presented with a single number, the only possible assumption is that the performance is the same over the whole time period in which the dataset was collected and will stay the same in the future (Figure 1a). Reality could, however, be different: for example, the performance of one algorithm could increase steadily while another algorithm's performance decreases (Figure 1b, c). This could lead to a 'crossing' of the performance lines, which changes the conclusion of which algorithm performs better. This is especially problematic if the crossing point lies in the future (Figure 1c): the best-performing algorithm at the time of evaluation (Alg A) will likely not be the best-performing algorithm in the future. Another possibility is that the performance fluctuates over time and two algorithms cross multiple times (Figure 1d). In that case, the conclusion of which algorithm is better depends strongly on the time point at which the algorithms are tested.
Figure 1: When presented with a single-number metric, it is not obvious how different algorithms perform over time. There are several possibilities for how the performance changes over time that might change the conclusion of which algorithm is better: (a) the performance is constant over time (Alg A is better); (b) the performance changes linearly and the crossing point of the two algorithms is in the past (Alg A is better); (c) the performance changes linearly and the crossing point of the two algorithms is in the future (Alg B is better in the future); (d) the performance fluctuates, and which algorithm is better changes often over time.

Sometimes, researchers do report metrics over time intervals. For instance, He et al. [13] report changing performance of algorithms over a 24h interval, and Feely et al. [10] evaluate predictions over several weeks. Beel et al. [5] report click-through rates on a monthly basis over two years and try to select the best algorithm based, among other factors, on time [3]. Lathia [14] demonstrates how the performance of algorithms changes over time and shows that performance can be improved by using different algorithms depending on the timepoint. Barreau and Carlier [2] argued for their new algorithm by showing that it is more stable over time while others constantly decrease in performance.

However, the majority of researchers report single-number metrics. This is evidenced by a small ad-hoc analysis that we conducted for this paper. We analyzed all full and short papers of the ACM RecSys 2020 conference (n=67). Of those 67 papers, 55 evaluated algorithms, and of these 55, 89% presented single-number metrics and only 11% presented metrics over time.

While researchers sporadically present metrics over time, there is no comprehensive analysis of how recommender-system evaluation metrics change over time. We found only two partly related studies [14, 16] that, among other things, studied the evolution of performance over time on one or two datasets respectively. They reported some interesting results; however, from just one or two datasets, general conclusions on the necessity of evaluation over time can barely be drawn.

We hypothesize that, instead of a single number, recommender-system research would benefit from presenting metrics over time, i.e. each metric should be calculated multiple times at different time points, e.g. every week, month or year. This would provide more information about an algorithm's effectiveness over time, make it possible to identify trends, and help choose the best algorithm. With this paper, we systematically evaluate how performance changes over time across several datasets and examine to what extent the community would benefit from evaluation over time. To the best of our knowledge, this is the first study that explicitly focuses on metrics over time, and the first research paper to present such results on a relatively large number of datasets.

2. Methodology

2.1. Algorithms, Metrics and Datasets

Overall, we test six algorithms on four datasets in a total of ten variations, which we split based on the provided timestamps. We observe the performance at every time-step and evaluate how the performance changes over time.
To identify the effects of time on the performance and differences of common recommendation algorithms, we compare three model-based and three memory-based algorithms. The algorithms used in this paper are from the LensKit library [9]: funkSVD, biasedMF, Bias, UserKNN, ItemKNN, and Most Popular. We evaluate the performance with nDCG, recall and RMSE.

We chose four of the most common datasets [6, 17] in recommender-system research, i.e. MovieLens [15], Netflix [8], Amazon (downloaded from the Amazon review data hosted at ucsd.edu [12]) and Yelp (downloaded from the Yelp Dataset), in their different variations (MovieLens 100k, MovieLens 1M, MovieLens 10M, Amazon Books, Amazon Instant Video, Amazon Toys and Games, Amazon Music, Amazon Electronics, Yelp, Netflix), totaling 10 datasets (Table 1). The choice of datasets for our research question is limited, as the datasets need to include timestamps so that the data can be split and evaluated based on time. At every timestep we filter the dataset to only include users with more than 2 ratings to make sure predictions are possible and meaningful. The code for our evaluation over time is available online.

Table 1: Overview of used datasets.

| Dataset | Number of Ratings | Timespan | Split (# of subsets) |
| MovieLens 100k | 100 thousand | 1995-1998 | Monthly (8) |
| MovieLens 1M | 1 million | 2000-2003 | Yearly (4) |
| MovieLens 10M | 10 million | 1996-2009 | Yearly (14) |
| Netflix Prize¹ | 51 million | 1998-2005 | Yearly (7) |
| Amazon Books | 14.8 million | 1997-2013 | Yearly (18) |
| Amazon Instant Video | 135 thousand | 2007-2014 | Yearly (8) |
| Amazon Toys and Games | 878 thousand | 2000-2014 | Yearly (14) |
| Amazon Music | 579 thousand | 1998-2014 | Yearly (17) |
| Amazon Electronics | 3.6 million | 2000-2013 | Yearly (14) |
| Yelp | 192 thousand | 2005-2013 | Yearly (8) |

¹ Only combined_data_1 and combined_data_2 are used.

2.2. Evaluation over time

To evaluate how the performance changes over time as more and more data becomes available, the datasets have to be split based on the provided timestamps. We split the datasets on a monthly or yearly basis (see Table 1), leading to 4-18 subsets per dataset. In general, there are two ways to define the subsets: each subset could include only the data of that month or year, or each subset could contain all data up to a certain timepoint. We define the subsets according to the second option, so with advancing time the dataset grows, and the last subset consists of the whole dataset. This splitting is closer to the 'real world', as practitioners would probably rather use all the available data than only the last month of it. The splitting process we implemented is visualized in Figure 2.

Figure 2: Time-based split of datasets. At every timepoint, all data up to that timepoint is included in the set.

At every timestep, we set aside a test set consisting of the last 20% of the ratings of each user for evaluation. Each algorithm is then optimized using grid search with 5-fold cross-validation for 5 iterations on the remaining 80% of the subset (this relatively short grid search does not guarantee finding the global optimum; for our purposes, a rough optimum is sufficient, as we want to compare the evolution over time for different algorithms rather than compare perfectly optimized algorithms). The algorithms are optimized two times, once w.r.t. nDCG and once w.r.t. RMSE. The optimized algorithms are then applied to the subset and evaluated on the corresponding test set. We do not use any information from previous timesteps for the optimization or model-fitting process; the models are optimized and retrained from scratch at every timestep. We exclude subsets with fewer than 500 ratings, as too few ratings lead to very different (and worse) performance or even to algorithms not working (e.g. UserKNN cannot find enough neighbors). If the first subset has fewer than 500 ratings, we evaluate the next larger subset as the first subset.
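To make this procedure concrete, the sketch below shows one possible implementation of the growing time-based split and the per-user hold-out in Python/pandas. It is only an illustration, not the code used for this paper: the column names, the `ratings.csv` input file and the helper functions are assumptions, and the per-timestep tuning and evaluation with LensKit is only indicated in a comment.

```python
import pandas as pd

MIN_RATINGS_PER_USER = 3   # keep only users with more than 2 ratings (Section 2.1)
MIN_SUBSET_SIZE = 500      # subsets with fewer ratings are skipped (Section 2.2)


def growing_subsets(ratings: pd.DataFrame, freq: str = "YS"):
    """Yield (cutoff, subset) pairs with all data up to each cutoff (cf. Figure 2).

    `ratings` is assumed to have 'user', 'item', 'rating' columns and a Unix
    'timestamp' column; yearly cutoffs by default, monthly with freq="MS".
    """
    ts = pd.to_datetime(ratings["timestamp"], unit="s")
    cutoffs = list(pd.date_range(ts.min(), ts.max(), freq=freq)[1:]) + [ts.max()]
    for cutoff in cutoffs:
        subset = ratings[ts <= cutoff]
        # filter out users with too few ratings for a meaningful train/test split
        counts = subset.groupby("user")["item"].transform("size")
        subset = subset[counts >= MIN_RATINGS_PER_USER]
        if len(subset) >= MIN_SUBSET_SIZE:
            yield cutoff, subset


def per_user_split(subset: pd.DataFrame, test_frac: float = 0.2):
    """Hold out the chronologically last 20% of each user's ratings as test data."""
    subset = subset.sort_values("timestamp")
    test = subset.groupby("user", group_keys=False).apply(
        lambda user_df: user_df.tail(max(1, int(len(user_df) * test_frac)))
    )
    train = subset.drop(test.index)
    return train, test


# Skeleton of the per-timestep evaluation loop (tuning and metric calls omitted).
if __name__ == "__main__":
    ratings = pd.read_csv("ratings.csv")  # hypothetical input file
    for cutoff, subset in growing_subsets(ratings, freq="YS"):
        train, test = per_user_split(subset)
        # ... tune each LensKit algorithm on `train` (grid search, 5-fold CV),
        # fit it, predict/recommend for the test users, and record
        # nDCG, recall and RMSE for this timestep ...
        print(cutoff.date(), len(train), len(test))
```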
3. Results

Our analysis shows that the performance of algorithms often changes over time (for 90% of the datasets). For instance, on the MovieLens 10M dataset (Figure 3a), algorithms achieve an RMSE between 0.81 (SVD) and 0.85 (Bias) in the first year. The performance then decreases steadily for five years to an RMSE between 0.85 and 0.91 before it increases again to an RMSE between 0.80 and 0.86 in the latest year.

We found no evidence of algorithms following different trends over time, i.e. some algorithms improving while others decrease. In most cases (90%), the performance develops roughly the same over time (Figure 4). For instance, nDCG on the Amazon Toys and Games dataset worsens over time for all algorithms (Figure 3e). While all algorithms reach an nDCG in the range of 0.02 to 0.09 in the first year, the nDCG decreases to 0.001-0.01 in 2014. For the Netflix dataset (Figure 3c), the RMSE improves over time for all algorithms, from 1.09-1.24 in 1998 to 0.86-0.93 in 2005.

Even though we did not observe algorithms following different trends over time, the ranking of algorithms does change over time. Especially at the beginning of data collection, the ranking of algorithms changes frequently (for 60% of the datasets). At later time-steps (and consequently with more data) the ranking remains more stable. A change of rank can, for instance, be seen on the Amazon Toys and Games dataset (Figure 3e): at the first time-step, 'Most Popular' is the best algorithm measured by nDCG, followed by the 'Bias' algorithm, while at the second time-step 'ItemKNN' performs best and 'Bias' is the second-worst performing algorithm. The ranking keeps changing until the 10th year and then stays the same for the last 5 years. A similar behavior can be observed for the Netflix dataset (Figure 3c), where the ranking of algorithms changes in the first 3 time-steps and afterwards stays the same until the end. How often algorithms crossed lines, i.e. how often the ranking of algorithms changed, can be seen in Table 2.

Table 2: Overview of the evolution of the algorithms over time on the datasets MovieLens (ML), Netflix, Amazon (A), Yelp, and their variations. Recall is omitted due to space restrictions; the results are very similar to nDCG. If several trends were observed over time (mixed), the latest trend is named in parentheses. The trend and range are given for the algorithm that performed best on the whole dataset (i.e. the last subset). The 'ranking changes' columns give the number of ranking changes and, in parentheses, whether the best algorithm changed (Y/N).
| Dataset | Ranking changes, nDCG | Ranking changes, RMSE | Trend of best algorithm, nDCG | Trend of best algorithm, RMSE | Range of best algorithm (difference in %), nDCG | Range of best algorithm (difference in %), RMSE |
| ML 100k | 1 (N) | 2 (Y) | Stable | Mixed (Stable) | 0.174-0.21 (17%) | 0.975-1.09 (6%) |
| ML 1M | 0 (N) | 0 (N) | Stable | Stable | 0.138-0.143 (3%) | 0.87-0.88 (0.5%) |
| ML 10M | 1 (N) | 1 (N) | Stable | Mixed (Improving) | 0.132-0.246 (46%) | 0.80-0.85 (5%) |
| Netflix | 2 (N) | 2 (N) | Mixed (Decrease) | Improving | 0.138-0.239 (34%) | 0.87-1.09 (20%) |
| A - Books | 5 (Y) | 2 (N) | Stable | Improving | 0.01-0.016 (37%) | 0.98-1.07 (8%) |
| A - Video | 5 (N) | >5 (Y) | Decrease | Mixed (Stable) | 0.102-0.284 (64%) | 0.95-1.09 (13%) |
| A - Toys | >5 (Y) | >5 (Y) | Decrease | Decrease | 0.012-0.096 (87%) | 0.71-1.05 (32%) |
| A - Music | >5 (N) | >5 (Y) | Decrease | Mixed (Increase) | 0.004-0.06 (93%) | 0.85-1.26 (32%) |
| A - Elec. | >5 (N) | >5 (Y) | Decrease | Decrease | 0.012-0.09 (86%) | 0.77-1.2 (35%) |
| Yelp | 3 (N) | 4 (Y) | Decrease | Decrease | 0.044-0.09 (53%) | 0.95-1.2 (20%) |

It should be noted that the results vary based on the metric. While we observed similar evolutions over time for recall and nDCG, the results for RMSE differed. This can be seen in Table 2, where the observed trends differ for 40% of the datasets depending on which metric is used. This becomes evident, for example, when looking at the MovieLens 10M dataset (Figure 3a, d): while the performance reaches a stable state after a few timesteps when looking at nDCG, the performance measured by RMSE first decreases until 2001 and then starts increasing. Additionally, the observed range of the metrics is different for nDCG/recall compared to RMSE. For nDCG, the results sometimes differ by up to 90% over time (e.g. Amazon Music), while for RMSE the largest observed range is 35% (Amazon Electronics). For all datasets, the range for nDCG is bigger than for RMSE (see Table 2).

We observed distinct differences between datasets, especially between the Amazon and the MovieLens datasets. The MovieLens datasets show a more stable behavior over time, with few changes in the ranking of algorithms and small ranges of nDCG and RMSE (less than 10% for all three variations). The Amazon datasets, on the other hand, show many changes in ranking and a stronger decrease over time. A factor contributing to these differences might be the pruning of the datasets: the MovieLens datasets include only users with 20 or more ratings, while all other datasets include users with 2 or more ratings. The bigger factor, however, appears to be the size of the datasets. Generally, the bigger datasets seem to behave more stably over time, as can be seen for Netflix (Figure 3c) or MovieLens 10M (Figure 3d), for example, where the performance and the ranking of algorithms stay stable after the first two timesteps. Similarly, Amazon Books, the biggest of the Amazon datasets, has the smallest range of nDCG and RMSE values over time compared to the other Amazon datasets. The smaller datasets show more variation in performance, especially in the early subsets, but also exhibit more stable behavior towards the end with more data (e.g. Amazon Toys and Games in Figure 3b, e), which might also be explained by the dataset size.

Figure 3: Examples of the development of the algorithms' performance over time for three datasets: RMSE (a-c) and nDCG (d-f) for MovieLens 10M (a, d), Amazon Toys and Games (b, e) and Netflix (c, f). Plots over time for all datasets can be seen in the Appendix.
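As a side note on how numbers such as the ranking changes in Table 2 can be derived from an evaluation over time: the sketch below counts, for a hypothetical per-timestep results table, how often the ordering of algorithms by a metric differs between consecutive timesteps. The column names ('timestep', 'algorithm') and the function itself are illustrative assumptions, not the code used for this paper.

```python
import pandas as pd


def count_ranking_changes(results: pd.DataFrame, metric: str,
                          higher_is_better: bool = True) -> int:
    """Count how often the algorithm ranking changes between consecutive timesteps.

    `results` is assumed to hold one row per (timestep, algorithm) with a
    column named after the metric, e.g. 'ndcg' or 'rmse'.
    """
    changes = 0
    previous_order = None
    for _, step in results.groupby("timestep", sort=True):
        # rank algorithms at this timestep, best first
        order = tuple(step.sort_values(metric, ascending=not higher_is_better)["algorithm"])
        if previous_order is not None and order != previous_order:
            changes += 1
        previous_order = order
    return changes


# Example usage (higher nDCG is better, lower RMSE is better):
# count_ranking_changes(results, "ndcg")
# count_ranking_changes(results, "rmse", higher_is_better=False)
```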
4. Discussion & Outlook

We hope our work initiates a discussion on whether the practice of presenting recommender-system evaluation results as single numbers should be changed. Given our analysis, we suggest presenting the performance of algorithms over time. In many cases, the performance as well as the ranking of algorithms changes over time, making conclusions time-dependent. Our results show that the performance of algorithms changes over time in 90% of the datasets and the ranking of algorithms changes in 60%. Especially at the beginning of the data-collection phase, the ranking of algorithms changes frequently, which should be considered when evaluating algorithms. For larger datasets, the performance of algorithms still sometimes changes over time, but the ranking is relatively stable. In those cases, single-number metrics might be sufficient to present the results. Nonetheless, the evaluation over time reveals trends and holds more information than a single-number metric. Consequently, for the development of new algorithms it could be useful to evaluate them over time, to see, for example, whether they follow different trends or behave more stably than the benchmark algorithms.

In the future, it should be further investigated which factors influence the performance over time. While we found that the development over time varies for different datasets and metrics, it remains unclear from our analysis which factors have the biggest influence on the performance. Factors that should be investigated include the dataset size, the number of users and items over time, data pruning, the number of ratings per user, and the impact of the 'cold-start problem'. With a deeper understanding of what influences the performance over time, new and better algorithms can be developed that consider these changes over time and adapt to them, and more informed decisions can be made about which algorithms to use.

5. References

[1] Aggarwal, C.C. 2016. Evaluating Recommender Systems. In: Recommender Systems. Springer International Publishing, 225–254.
[2] Barreau, B. and Carlier, L. 2020. History-Augmented Collaborative Filtering for Financial Recommendations. RecSys 2020 - 14th ACM Conference on Recommender Systems (2020), 492–497.
[3] Beel, J. et al. 2019. Darwin & Goliath: A White-Label Recommender-System As-a-Service with Automated Algorithm-Selection. RecSys 2019 - 13th ACM Conference on Recommender Systems (Sep. 2019), 534–535.
[4] Beel, J. 2017. It's Time to Consider "Time" when Evaluating Recommender-System Algorithms [Proposal]. (Aug. 2017).
[5] Beel, J. et al. 2019. RARD II: The 94 Million Related-Article Recommendation Dataset. CEUR Workshop Proceedings (2019).
[6] Beel, J. and Brunel, V. 2019. Data Pruning in Recommender Systems Research: Best-Practice or Malpractice? CEUR Workshop Proceedings (2019), 26–30.
[7] Beel, J. and Langer, S. 2015. A Comparison of Offline Evaluations, Online Evaluations, and User Studies in the Context of Research-Paper Recommender Systems. Lecture Notes in Computer Science (2015), 153–168.
[8] Bennett, J. and Lanning, S. 2007. The Netflix Prize. KDD Cup and Workshop (2007), 3–6.
[9] Ekstrand, M.D. 2020. LensKit for Python: Next-Generation Software for Recommender Systems Experiments. International Conference on Information and Knowledge Management (CIKM) Proceedings (Oct. 2020), 2999–3006.
[10] Feely, C. et al. 2020. Providing Explainable Race-Time Predictions and Training Plan Recommendations to Marathon Runners. RecSys 2020 - 14th ACM Conference on Recommender Systems (2020), 539–544.
[11] Gunawardana, A. and Shani, G. 2015. Evaluating Recommender Systems. In: Recommender Systems Handbook, Second Edition. Springer US, 265–308.
[12] He, R. and McAuley, J. 2016. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. DOI: https://doi.org/10.1145/2872427.2883037.
[13] He, X. et al. 2020. Contextual User Browsing Bandits for Large-Scale Online Mobile Recommendation. RecSys 2020 - 14th ACM Conference on Recommender Systems (2020), 63–72.
[14] Lathia, N.K. 2010. Evaluating Collaborative Filtering Over Time. PhD thesis, University College London.
[15] Harper, F.M. and Konstan, J.A. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS). 5, 4 (Dec. 2015). DOI: https://doi.org/10.1145/2827872.
[16] Soto, P.G.C. 2011. Temporal Models in Recommender Systems: An Exploratory Study on Different Evaluation Dimensions.
[17] Sun, Z. et al. 2020. Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison. RecSys 2020 - 14th ACM Conference on Recommender Systems (2020), 23–32.

6. Appendix

Figure 4: Evolution of nDCG over time for all datasets. (The evolution of recall is very similar and omitted due to space restrictions.)

Figure 5: Evolution of RMSE over time for all datasets.