<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Time-dependent Evaluation of Recommender Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Teresa Scheidt</string-name>
          <email>teresascheidt@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joeran Beel</string-name>
          <email>joeran.beel@uni-siegen.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Recommender Systems, Evaluation, Time-dependent Evaluation</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lund University</institution>
          ,
          <addr-line>Box 117, 221 00 Lund</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Siegen University</institution>
          ,
          <addr-line>Adolf-Reichwein-Straße 2, 57076 Siegen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Evaluation of recommender systems is an actively discussed topic in the recommender system community. However, some aspects of evaluation have received little to no attention, one of them being whether evaluating recommender system algorithms with single-number metrics is sufficient. When presenting results as a single number, the only possible assumption is a stable performance over time regardless of changes in the datasets, while it intuitively seems more likely that the performance changes over time. We suggest presenting results over time, making it possible to identify trends and changes in performance as the dataset grows and changes. In this paper, we conduct an analysis of 6 algorithms on 10 datasets over time to identify the need for a time-dependent evaluation. To enable this evaluation over time, we split the datasets into smaller subsets based on the provided timestamps. At every tested timepoint we use all available data up to this timepoint, simulating a growing dataset as encountered in the real world. Our results show that for 90% of the datasets the performance changes over time, and for 60% even the ranking of algorithms changes over time.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Time-dependent Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>2021 Copyright for this paper by its authors.</p>
      <p>Sometimes, researchers do report metrics over time intervals. For instance, He et al. [13] report
the changing performance of algorithms over a 24-hour interval, and Feely et al. [10] evaluate predictions over
several weeks. Beel et al. [5] report click-through rate on a monthly basis over two years and try to
select the best algorithm based on, among other factors, time [3]. Lathia [14] demonstrates how the
performance of algorithms changes over time and shows that the performance can be improved by
using different algorithms depending on the timepoint. Barreau and Carlier [2] argue for their new
algorithm by showing that it is more stable over time, while others constantly
decrease in performance. However, the majority of researchers report single-number metrics. This is
evidenced by a small ad-hoc analysis that we conducted for this paper. We analyzed all full and
short papers of the ACM RecSys 2020 conference (n=67). Of those 67 papers, 55 evaluated algorithms,
and of these 55, 89% presented single-number metrics, and only 11% presented metrics over time.</p>
      <p>While researchers sporadically present metrics over time, there is no comprehensive analysis of
how recommender-system evaluation metrics change over time. We found only two partly related
studies [14, 16] that, among other things, studied the evolution of performance over time on one or two
datasets, respectively. They reported some interesting results. However, with just one or two
datasets studied, general conclusions on the necessity of evaluation over time can barely be drawn.</p>
      <p>We hypothesize that, instead of a single number, recommender system research would benefit from
presenting metrics over time, i.e. each metric should be calculated multiple times at different time
points, e.g. every week, month or year. This would make it possible to gain more information about an algorithm’s
effectiveness over time, identify trends and help choose the best algorithm. With this paper, we
systematically evaluate how performance changes over time on several datasets and examine to what
extent the community would benefit from evaluation over time. To the best of our knowledge, this is
the first study that explicitly focuses on metrics over time, and the first research paper to present such
results on a relatively large number of datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Algorithms, Metrics and Datasets</title>
      <p>Overall, we test six algorithms on four datasets in a total of ten variations, which we split based on the
timestamp. We observe the performance at every timestep and evaluate how it changes
over time.</p>
      <p>To identify the effects of time on the performance and differences of common recommendation
algorithms, we compare three model-based and three memory-based algorithms. The algorithms used
in this paper are from the LensKit library [9]. We chose funkSVD, biasedMF, Bias, UserKNN,
ItemKNN, and Most Popular. We evaluate the performance with nDCG, recall and RMSE.</p>
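      <p>For reference, the three metrics can be sketched in plain Python as follows. This is a minimal illustration and not the LensKit implementation used in the experiments; the log2 discount, the cut-off k, and the use of held-out ratings as gains are assumptions made for this sketch.</p>
      <preformat><![CDATA[
```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def dcg(gains):
    """Discounted cumulative gain with a log2 discount (assumed variant)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(recommended, relevance, k=10):
    """nDCG@k: DCG of the recommended list divided by the ideal DCG.

    `relevance` maps item -> gain (e.g. the held-out rating); unknown items get 0.
    """
    top_k = recommended[:k]
    gains = [relevance.get(item, 0.0) for item in top_k]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


def recall(recommended, relevant, k=10):
    """Recall@k: share of the relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)
```
]]></preformat>
      <p>Note that lower values are better for RMSE, while higher values are better for nDCG and recall, which matters when rankings of algorithms are compared across metrics.</p>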
      <p>We chose four of the most common datasets [6, 17] in recommender system research, i.e. MovieLens
[15], Netflix [8], Amazon<sup>3</sup> and Yelp<sup>4</sup> in their different variations (MovieLens 100k, MovieLens 1M,
MovieLens 10M, Amazon Books, Amazon Instant Video, Amazon Toys and Games, Amazon Music,
Amazon Electronics, Yelp, Netflix), for a total of 10 datasets (Table 1). The choice of datasets for our
research question is limited, as the datasets need to include timestamps so that the data can be
split and evaluated based on time. At every timestep we filter the dataset to only include users with more
than 2 ratings to make sure predictions are possible and meaningful.</p>
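      <p>The user filter applied at every timestep can be sketched as follows. This is an illustrative sketch, not the code used for the experiments; the tuple layout (user, item, rating, timestamp) and the function name are assumptions, and the default threshold of 3 encodes "more than 2 ratings".</p>
      <preformat><![CDATA[
```python
from collections import Counter


def filter_users(ratings, min_ratings=3):
    """Keep only ratings from users with at least `min_ratings` ratings.

    `ratings` is a list of (user, item, rating, timestamp) tuples
    (an assumed layout for this sketch).
    """
    counts = Counter(user for user, _, _, _ in ratings)
    return [(u, i, r, t) for (u, i, r, t) in ratings if counts[u] >= min_ratings]
```
]]></preformat>
      <p>Because the filter is applied to each subset separately, a user who is excluded at an early timestep can enter the data at a later timestep once enough of their ratings have accumulated.</p>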
      <p>The code for our evaluation over time can be found here.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Evaluation over Time</title>
      <p>To evaluate how the performance changes over time as more and more data becomes available,
the datasets have to be split based on the provided timestamp. We split the datasets on a monthly or
yearly basis (see Table 1), leading to 4-18 subsets per dataset. In general, there are two ways to define
the subsets: each subset could include only the data of that month or year, or each subset could contain
all data up to a certain timepoint. We define the subsets according to the second option, so with
advancing time the dataset grows, and the last subset consists of the whole dataset. This splitting is
closer to the ‘real world’, as practitioners would probably rather use all the available data than only the
last month of the available data. The splitting process we implemented is visualized in Figure 2.</p>
      <p><sup>3</sup> Downloaded from Amazon review data (ucsd.edu) [12]</p>
      <p><sup>4</sup> Downloaded from Yelp Dataset</p>
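      <p>The cumulative splitting described above can be sketched as follows for a yearly split. This is a minimal sketch under assumptions, not the published evaluation code: ratings are assumed to be (user, item, rating, timestamp) tuples with Unix timestamps in seconds, years are cut in UTC, and the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
from datetime import datetime, timezone


def cumulative_subsets(ratings):
    """Return {year: all ratings with a timestamp up to the end of that year}.

    Each subset contains every earlier subset, so the data grows over time
    and the last subset is the whole dataset.
    """
    ordered = sorted(ratings, key=lambda row: row[3])
    years = sorted(
        {datetime.fromtimestamp(row[3], tz=timezone.utc).year for row in ordered}
    )
    subsets = {}
    for year in years:
        # Everything strictly before the start of the following year.
        cutoff = datetime(year + 1, 1, 1, tzinfo=timezone.utc).timestamp()
        subsets[year] = [row for row in ordered if row[3] < cutoff]
    return subsets
```
]]></preformat>
      <p>A monthly split would follow the same pattern with month boundaries as cut-offs. The alternative definition, in which each subset holds only the data of one period, would replace the single cut-off with a lower and an upper bound.</p>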
      <p>At every timestep, we set aside a test set consisting of the last 20% of the ratings from each user for
evaluation. Each algorithm is then optimized using grid search with 5-fold cross-validation for 5
iterations<sup>5</sup> on the remaining 80% of the subset. The algorithms are optimized twice, once w.r.t.
nDCG and once w.r.t. RMSE. The optimized algorithms are then applied to the subset and evaluated
on the corresponding test set. We do not use any information from previous timesteps for the
optimization or model-fitting process; the models are optimized and retrained from scratch at every
timestep.</p>
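      <p>The per-user hold-out of the most recent 20% of ratings can be sketched as follows. This is an illustrative sketch rather than the code used in the experiments; the tuple layout (user, item, rating, timestamp), the rounding down of the test-set size, and the minimum of one test rating per user are assumptions.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def split_last_fraction(ratings, test_fraction=0.2):
    """Split ratings into (train, test), holding out each user's latest ratings."""
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, test = [], []
    for user_rows in by_user.values():
        user_rows.sort(key=lambda row: row[3])  # oldest first
        # Assumption: round down, but always hold out at least one rating.
        n_test = max(1, int(len(user_rows) * test_fraction))
        train.extend(user_rows[:-n_test])
        test.extend(user_rows[-n_test:])
    return train, test
```
]]></preformat>
      <p>The hyperparameter search would then run only on the returned training part, so that the held-out ratings of a timestep never influence the optimization.</p>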
      <p>We exclude subsets with fewer than 500 ratings, as too few ratings lead to very different (and worse)
performance or even to algorithms not working (e.g. UserKNN cannot find enough neighbors). If
the first subset has fewer than 500 ratings, we evaluate the next bigger subset as the first subset.</p>
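      <p>The exclusion rule can be sketched as a simple filter over the subsets. This is a hypothetical helper for illustration; it assumes the subsets are stored in a dictionary that maps each timepoint to its list of ratings.</p>
      <preformat><![CDATA[
```python
def usable_subsets(subsets, min_ratings=500):
    """Return (timepoint, subset) pairs in time order, skipping small subsets.

    `subsets` maps a sortable timepoint (e.g. a year) to a list of ratings.
    """
    return [
        (timepoint, rows)
        for timepoint, rows in sorted(subsets.items())
        if len(rows) >= min_ratings
    ]
```
]]></preformat>
      <p>With cumulative subsets, sizes never shrink over time, so the first subset that passes the threshold becomes the first evaluated timestep and all later subsets pass as well.</p>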
    </sec>
    <sec id="sec-5">
      <title>3. Results</title>
      <p>Our analysis shows that the performance of algorithms often changes over time (for 90% of the
datasets). For instance, on the MovieLens 10M dataset (Figure 3a), algorithms achieve an RMSE
between 0.81 (SVD) and 0.85 (Bias) in the first year. The performance then decreases steadily for five
years to an RMSE between 0.85 and 0.91 before it improves again to an RMSE between 0.80 and 0.86 in
the latest year.</p>
      <p>We found no evidence of algorithms following different trends over time, i.e. some algorithms
improving while others worsen over time. In most cases (90%) the performance develops roughly the
same over time (Figure 4). For instance, nDCG on the Amazon-toys dataset worsens over time for all
algorithms (Figure 3e). While all algorithms reach an nDCG in the range of 0.02 to 0.09 in the first
year, the nDCG decreases to 0.001-0.01 in 2014. For the Netflix dataset (Figure 3c) the RMSE improves
over time for all algorithms from 1.09-1.24 in 1998 to 0.86-0.93 in 2005.</p>
      <p>Even though we did not observe algorithms following different trends over time, the ranking of
algorithms does change over time. Especially in the beginning of data collection, the ranking of
algorithms changes frequently (for 60% of the datasets). At later timesteps (and consequently with more
data), however, the ranking remains more stable. A change of rank can, for instance, be seen on the
Amazon-toys dataset (Figure 3e), where at the first timestep ‘Most Popular’ is the best algorithm,
measured by nDCG, followed by the ‘Bias’ algorithm, while at the second timestep ‘ItemKNN’
performs best, and ‘Bias’ is the second-worst performing algorithm. The ranking keeps changing until
the 10th year and then stays the same for the last 5 years. A similar behavior can be observed for the
Netflix dataset (Figure 3c), where the ranking of algorithms changes in the first 3 timesteps and
afterwards stays the same until the end. How often algorithms crossed lines, i.e. the ranking of
algorithms changed, can be seen in Table 2.</p>
      <p><sup>5</sup> This relatively short grid search is not guaranteed to find the global optimum. For our purposes, however, it is
sufficient to find a rough optimum, as we just want to compare the evolution over time for different algorithms
rather than compare perfectly optimized algorithms.</p>
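      <p>One way to count ranking changes over time is sketched below. This is only one possible definition and an assumption of this sketch: it counts the transitions between consecutive timesteps at which the ordering of algorithms differs, which is not necessarily how the crossings in Table 2 were counted. The algorithm names and scores in the usage lines are made-up toy values.</p>
      <preformat><![CDATA[
```python
def ranking(scores, higher_is_better=True):
    """Order algorithm names from best to worst for one timestep.

    `scores` maps algorithm name -> metric value.
    """
    return tuple(sorted(scores, key=lambda name: scores[name], reverse=higher_is_better))


def count_ranking_changes(scores_over_time, higher_is_better=True):
    """Count consecutive timestep pairs whose rankings differ."""
    rankings = [ranking(scores, higher_is_better) for scores in scores_over_time]
    return sum(1 for prev, cur in zip(rankings, rankings[1:]) if prev != cur)


# Toy values for three hypothetical algorithms A, B and C (not results from the paper).
toy_ndcg = [
    {"A": 0.09, "B": 0.05, "C": 0.02},
    {"A": 0.03, "B": 0.01, "C": 0.04},
    {"A": 0.02, "B": 0.005, "C": 0.03},
]
changes = count_ranking_changes(toy_ndcg)
```
]]></preformat>
      <p>For error metrics such as RMSE, where lower values are better, the same functions would be called with higher_is_better set to False.</p>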
      <p>It should be noted that the results vary based on the metric. While we observed similar evolutions
over time for recall and nDCG, the results for RMSE differed. This can be seen in Table 2, where the
observed trends differ in 40% of the cases depending on which metric is used. The MovieLens 10M
dataset (Figure 3a,d) illustrates this. While the performance measured by nDCG reaches a stable state
after a few timesteps, the performance measured by RMSE first decreases until
2001 and then starts increasing. Additionally, the observed range of the metrics is different for
nDCG/recall compared to RMSE. For nDCG the results sometimes differ by up to 90% over time (e.g.
Amazon-music), while for RMSE the largest observed range is 35% (Amazon-electronics). For all
datasets, the range for nDCG is bigger than for RMSE (see Table 2).</p>
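      <p>A relative range of a metric over time can be computed as sketched below. The exact definition behind the percentages in Table 2 is not spelled out in the text, so the formula used here, the spread between the largest and smallest value relative to the largest value, is an assumption of this sketch.</p>
      <preformat><![CDATA[
```python
def relative_range_percent(values):
    """Spread of a metric over time, in percent of its largest value.

    Assumed definition: (max - min) / max * 100.
    """
    lowest, highest = min(values), max(values)
    if highest == 0:
        return 0.0
    return (highest - lowest) / highest * 100.0
```
]]></preformat>
      <p>Under this definition, a metric that halves between its best and worst timestep has a range of 50%, regardless of the absolute scale of the metric, which makes ranges for nDCG and RMSE comparable.</p>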
      <p>We observed distinct differences between datasets, especially between the Amazon and the
MovieLens datasets. The MovieLens datasets show a more stable behavior over time, with few changes
in the ranking of algorithms and small ranges of nDCG and RMSE (less than 10% for all three variations).
The Amazon datasets, on the other hand, have many changes in rankings and a larger decrease over
time. A factor contributing to these differences might be the pruning of the datasets: the MovieLens
datasets include only users with 20 or more ratings, while all other datasets include users with 2 or more
ratings. The bigger factor, however, appears to be the size of the datasets. Generally, the bigger datasets
seem to behave more stably over time, which can be seen for Netflix (Figure 3c) or MovieLens 10M
(Figure 3d), for example, where the performance and the ranking of algorithms stay stable after the first
two timesteps. Similarly, Amazon Books, the biggest set within the Amazon database, has the smallest
range of nDCG and RMSE values over time compared to the other Amazon datasets. The smaller
datasets have more variation in performance, especially in the early subsets, but also exhibit more stable
behavior towards the end with more data (e.g. Amazon Toys and Games in Figure 3b,e), which might
also be explained by the dataset size.</p>
      <p>Figure 3: (a) RMSE over time for ML-10M. (b) RMSE over time for Amazon-toys. (c) RMSE over time for Netflix. (d) nDCG over time for ML-10M. (e) nDCG over time for Amazon-toys. (f) nDCG over time for Netflix.</p>
    </sec>
    <sec id="sec-6">
      <title>4. Discussion &amp; Outlook</title>
      <p>We hope our work initiates a discussion on whether the practice of presenting results of recommender-system evaluations
as single numbers should be changed. Given our analysis, we suggest presenting the performance of
algorithms over time. In many cases the performance as well as the ranking of algorithms changes over
time, making conclusions time-dependent. Our results show that the performance of algorithms changes
over time in 90% of the datasets and the ranking of algorithms changes in 60%. Especially in the
beginning of the data-collection phase the ranking of algorithms changes frequently, which should be
considered when evaluating algorithms. For larger datasets, the performance of algorithms still sometimes changes
over time, but the ranking is relatively stable. In those cases, single-number metrics might
be sufficient to present the results. Nonetheless, the evaluation over time reveals trends and holds more
information than a single-number metric. Consequently, for the development of new algorithms, it
could be useful to evaluate them over time, to see for example whether they follow different trends or behave
more stably than the benchmark algorithms.</p>
      <p>In the future, it should be further investigated which factors influence the performance over time.
While we found that the development over time varies for different datasets and metrics, it remains
unclear from our analysis which factors have the biggest influence on the performance. Factors that
should be investigated include the dataset size, the number of users and items over time, data pruning, the
number of ratings per user and the impact of the ‘cold-start problem’. With a deeper understanding of what
influences the performance over time, new and better algorithms can be developed that consider these
changes over time and adapt to them, and more informed decisions can be made about which algorithms
to use.</p>
    </sec>
    <sec id="sec-7">
      <title>5. References</title>
      <p>Recommendations. RecSys 2020 - 14th ACM Conference on Recommender Systems (2020),
492–497.</p>
      <p>[3] Beel, J. et al. 2019. Darwin &amp; goliath: A white-label recommender-system as-a-service with
automated algorithm-selection. RecSys 2019 - 13th ACM Conference on Recommender Systems
(Sep. 2019), 534–535.</p>
      <p>[4] Beel, J. 2017. It’s Time to Consider “Time” when Evaluating Recommender-System Algorithms
[Proposal]. (Aug. 2017).</p>
      <p>[5] Beel, J. et al. 2019. Rard II: The 94 million related-article recommendation dataset. CEUR
Workshop Proceedings (2019).</p>
      <p>[6] Beel, J. and Brunel, V. 2019. Data pruning in recommender systems research: Best-practice or
malpractice? CEUR Workshop Proceedings (2019), 26–30.</p>
      <p>[7] Beel, J. and Langer, S. 2015. A comparison of offline evaluations, online evaluations, and user
studies in the context of research-paper recommender systems. Lecture Notes in Computer
Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics) (2015), 153–168.</p>
      <p>[8] Bennett, J. and Lanning, S. 2007. The Netflix Prize. KDD Cup and Workshop. (2007), 3–6.</p>
      <p>[9] Ekstrand, M.D. 2020. LensKit for Python: Next-Generation Software for Recommender
Systems Experiments. International Conference on Information and Knowledge Management,
Proceedings (Oct. 2020), 2999–3006.</p>
      <p>[10] Feely, C. et al. 2020. Providing Explainable Race-Time Predictions and Training Plan
Recommendations to Marathon Runners. RecSys 2020 - 14th ACM Conference on
Recommender Systems (2020), 539–544.</p>
      <p>[11] Gunawardana, A. and Shani, G. 2015. Evaluating recommender systems. Recommender Systems
Handbook, Second Edition. Springer US. 265–308.</p>
      <p>[12] He, R. and McAuley, J. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with
One-Class Collaborative Filtering. DOI:https://doi.org/10.1145/2872427.2883037.</p>
      <p>[13] He, X. et al. 2020. Contextual User Browsing Bandits for Large-Scale Online Mobile
Recommendation. RecSys 2020 - 14th ACM Conference on Recommender Systems (2020), 63–72.</p>
      <p>[14] Lathia, N.K. 2010. Evaluating collaborative filtering over time. Methodology. (2010), 1–140.
DOI:https://doi.org/citeulike-article-id:7853161.</p>
      <p>[15] Harper, F.M. and Konstan, J.A. 2015. The MovieLens Datasets. ACM Transactions on Interactive
Intelligent Systems (TiiS). 5, 4 (Dec. 2015). DOI:https://doi.org/10.1145/2827872.</p>
      <p>[16] Soto, P.G.C. 2011. Temporal Models in Recommender Systems: An Exploratory Study on
Different Evaluation Dimensions. Time. (2011).</p>
      <p>[17] Sun, Z. et al. 2020. Are We Evaluating Rigorously? Benchmarking Recommendation for
Reproducible Evaluation and Fair Comparison. RecSys 2020 - 14th ACM Conference on
Recommender Systems (2020), 23–32.</p>
    </sec>
    <sec id="sec-8">
      <title>6. Appendix</title>
      <p>Figure 5: Evolution of RMSE over time for all datasets.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>