<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM Conference on Recommender Systems, Seattle, WA, USA.
$ robin.verachtert@froomle.com (R. Verachtert); lien.michiels@froomle.com (L. Michiels);
bart.goethals@uantwerpen.be (B. Goethals)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Are We Forgetting Something? Correctly Evaluate a Recommender System With an Optimal Training Window</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Robin Verachtert</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lien Michiels</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bart Goethals</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Froomle N.V.</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Belgium</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Monash University</institution>
          ,
          <addr-line>Melbourne</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Antwerp</institution>
          ,
          <addr-line>Antwerp</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Recommender systems are deployed in dynamic environments with constantly changing interests and availability of items, articles and products. The hyperparameter optimisation of such systems usually happens on a static dataset, extracted from a live system Although it is well known that the quality of a computed model highly depends on the quality of the data it was trained on, this is largely neglected in these optimisations. For example, when concept drift occurs in the data, the model is likely to learn patterns that are not aligned with the target prediction data. Interestingly, most scientific articles on recommender systems typically perform their evaluation on entire datasets, without considering their intrinsic quality or that of their parts. First, we show that using only the more recent parts of a dataset can drastically improve the performance of a recommendation system, and we pose that it should be a standard hyperparameter to be tuned prior to evaluation and deployment. Second, we find that comparing the performance of well-known baseline algorithms before and after optimising the training data window significantly changes the performance ranking.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recommendation systems are widely used to help users find the most relevant products and
articles from the large catalogues available on most websites, like news websites and e-commerce
shops. The environments in which they are deployed generate large volume information streams
on which the models need to be trained. Barring online learning methods and incremental
models, the usual approach is to take a static slice of this data stream and train the model on
this slice. Determining the optimal width of this slice is a challenging engineering problem.
Using too little data can cause the model to starve, and not learn anything relevant. Using more
data often results in longer training times, and larger models that take longer to predict.</p>
      <p>
        In academic research, however, this is usually not considered to be such an issue. The typical
datasets used for experimental evaluation are static, and they are almost always used in their
entirety. Important steps have been taken to correctly evaluate recommendation techniques
through temporal or leave-last-one-out splits [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2, 3, 4</xref>
        ]. By using all historic events to train
models, however, these evaluations place an implicit trust on the earliest interactions in the
dataset to add useful information. Challenging this trust, algorithms have been designed
to diminish the impact of older interactions [5]. In the evaluation of algorithms, we show
that disregarding earlier interactions entirely during training can significantly improve the
performance of a recommender system in multiple settings. Intuitively, this is true for a simple
popularity baseline: Items that were popular in the past week are more predictive of next
week than items that were popular in the past year [6]. But is this also true for more complex,
personalised recommendation algorithms?
      </p>
      <p>In this paper, we consider the maximum ‘age’ of an interaction, i.e. the time since it occurred,
used to build the model an additional hyperparameter during model training. In the remainder of
this paper, we will refer to the maximal age of an event used in training as the hyperparameter  .</p>
      <p>We investigate and answer the following three questions:
• RQ1: How does the optimisation of  impact the individual performance of an algorithm?
• RQ2: Does the optimisation of  change the relative performance between the algorithms?
• RQ3: How does the choice of  impact secondary metrics such as run time and coverage of
the item catalogue?</p>
      <p>Additionally, through our experiments we show that the optimal  has a significant impact
on model accuracy across algorithms and datasets. The biggest improvements are found for
algorithms that are agnostic of time, especially when deployed in highly dynamic environments
such as online news. Our findings strengthen our conviction that the hyperparameter  is an
important consideration when determining which model performs best, both in future academic
research and production settings. We leave a comprehensive benchmark of algorithms with
optimal values of hyperparameter  to future work.</p>
      <p>In Section 2 we highlight relevant related work. Section 3 describes how  should be considered
a hyperparameter, and how to set up an evaluation to mimic a real-world scenario. Also in
Section 3 we present the chosen algorithms, datasets and evaluation metrics. Finally, Section 4
discusses the experimental results with regards to the three research questions and presents the
results from two trials on news websites confirming our results. We also use our experiments
to give suggestions for the selection of values for  .</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Research in data science has recognised that data drift is an import factor in training high
quality models for several decades [7, 8, 9, 10] More specifically, Fan [11] raised awareness for
the issues associated with the blind usage of older data in the context of binary classification.
As they conclude: “[...] using old data unselectively is like gambling”. When a dataset contains
drift and an algorithm is not equipped to deal with this drift, using only more recent data, i.e.
explicitly defining  , is a straightforward way to avoid training poorly performing models [11].</p>
      <p>
        Recommender systems are used in highly dynamic environments and so naturally have to
deal with data drift. We can distinguish between two research directions related to handling data
drift, i.e. measuring accuracy under data drift and recommendation algorithms that perform
well under data drift. Regarding the former, improved data splitting techniques that better
reflect realistic recommendation scenarios have been proposed, e.g. timed splits [ 12, 13], a
sequential last item prediction split [14] and repeated time-aware splitting [
        <xref ref-type="bibr" rid="ref1">15, 1</xref>
        ]. In relation to
the latter, a large amount of time- and sequence-aware algorithms have been proposed over
the years. For a comprehensive overview we refer the interested reader to Campos et al. [12],
Ludewig and Jannach [16], Quadrana et al. [17] and Bogina et al. [5].
      </p>
      <p>Relevant to our work, Vinagre and Jorge [18] summarised two generic methods for dealing
with concept drift in a data-stream. The first is to utilise a predetermined  and use it as a sliding
window over the data. The second is to utilise fading factors such that older interactions have
less influence on the similarities. Ludmann [19] used a contextual popularity algorithm, with 
equal to five minutes, thirty minutes and one hour, to great success in the CLEF Initiative in
2017. Similarly Ji et al. [6] showed that computing popularity using a small  or using fading
factors provided a much stronger baseline. Jannach and Ludewig [20] and Jannach et al. [14] find
similar indications that the recency of training data is important in a retail context. Our work is
inspired by these earlier eforts and aims to further anchor and broaden their findings regarding
popularity and similarity-based algorithms to other types of recommendation algorithms,
such as time- and session-aware algorithms. Examples of such time-aware algorithms are
neighbourhood-based models that use fading factors [21, 22, 23, 24, 25, 26], similar to Vinagre
and Jorge [18]. More recently, we see sequence- and session-aware algorithms that learn
sequential models utilising the order in user histories. Examples of such methods are STAN [27],
Sequential Rules [20], VS-KNN [20], and GRU4Rec [28]. In the wake of GRU4Rec, more and
more deep learning approaches have been proposed that incorporate sequential and/or temporal
information. [e.g. 29, 30, 2].</p>
      <p>Recent reproducibility studies have challenged the performance of these complex deep
learning methods in a variety of domains. In two recent works, Dacrema et al. [31, 32] found
that “11 of the 12 reproducible neural approaches can be outperformed by conceptually simple
methods”, such as ItemKNN or UserKNN. Ludewig et al. [33] investigated the performance of
deep learning approaches compared to simpler baselines in a session context. They found that
“In the majority of the cases [...] it turned out that simple techniques outperform recent neural
approaches”. We follow their results, and focus on simpler baselines in our experiments.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Recommendation Scenario</title>
        <p>
          In many real-world applications, recommendation systems are used to generate
recommendations for users while they are looking at other articles or products. In these use cases, the
interest of the user is often captured mostly by their most recent interactions. A standard
evaluation protocol to model this situation is to perform either leave-last-one-out splits [
          <xref ref-type="bibr" rid="ref2">2, 3, 4</xref>
          ],
or iterative revealing [33].
        </p>
        <p>We modify the leave-last-one-out evaluation to best approximate a (repeated) train-and-serve
architecture typically used in production settings and avoid leaking future information into our
model training [34]. Only data before a timestamp  , the time at which the model is supposedly
retrained before serving, is used for training. Given the significant computational cost to
running our experiments, we use a single evaluation window and leave repeated evaluation as
suggested by Scheidt and Beel [15] for future work.</p>
        <p>Formally, given a set of users  and a set of items , let  = {(, , ) :  ∈ ,  ∈ ,  ∈ N}
be the dataset of interactions, where  is the timestamp on which user  last interacted with
item .</p>
        <p>To obtain a training dataset, we split the dataset on timestamp  ; data before  (&lt; ) is used
as training data for the algorithms. In addition to the other hyperparameters of each algorithm,
we introduced the hyperparameter  and so further limit our dataset used in training to data
after  −  , i.e. {(, , ) ∈ &lt; | −  &lt;  &lt;  }. Small values for  limit the training data to
only interactions that occurred close to the cut-of timestamp  . The larger  becomes, the
more data is used to train the models.</p>
        <p>To create the test dataset we extract only users that have at least 1 event after  and use all
but their last interaction (also those before  ) as history to predict their last interaction, as in a
classical leave-last-one-out scenario.</p>
        <p>To properly tune hyperparameters we introduce a second cut-of timestamp  &lt;  ,
such that our training dataset during hyperparameter optimisation are the events {(, , ) ∈
&lt; | −  &lt;  &lt;  }. To obtain the validation evaluation dataset, we extract users
that have an interaction between  &lt;  &lt;  and, analogous to the test dataset, use all prior
interactions to predict these users’ final interaction.</p>
        <p>When predicting items for a user, both during validation and testing, we remove their
previously visited items from the recommendations, as is frequently done in real world settings
as well. We consider evaluating re-consumption in this setting as future work.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Datasets</title>
        <p>For our experiments, we use five datasets, two from the news domain, and three from the
retail domain. We chose these two domains, because they are stereotypical real-world
recommendation use cases, and we expect the domains to exhibit diferent behavioural patterns.
News has a pronounced concept drift as articles become irrelevant quickly, while in retail,
product relevance is typically stable for a longer period. Intuitively we expect this to result
in retail datasets benefiting from larger  values as they experience weaker data drift, while
performance on news datasets sufers more drastically when  is too large. In our selection of
datasets, we required them to be of suficient size ( &gt; 1M interactions) and to contain timestamp
information for the item view events, which will be used to train models. For news, we use
the Adressa dataset [35] as well as a proprietary dataset, extracted from a live recommender
system, which we’ll call NEWS. Both of these datasets were collected over 7 days. In splitting
these datasets, we used the second to last day 12:00 to 23:59 as the source for the validation
target dataset, and the last day from 12:00 to 23:59 for the test target dataset. For retail, we
use the Yoochoose dataset from the Recsys Challenge in 2015 [36], the CosmeticsShop Kaggle
dataset [37] and a second proprietary dataset, extracted from a live recommender system, which
we’ll call RETAIL. All three of these datasets span a longer period than the two news datasets,
with CosmeticsShop collected over 152 days, Yoochoose over 182 days and RETAIL over 98
days. For the CosmeticsShop and Yoochoose datasets, we used validation and test sets of 14
days, for the slightly shorter but denser RETAIL dataset we used consecutive 7-day windows.
By using proprietary datasets as well as public datasets, we can link the ofline experimentation
results to our online trials.</p>
        <p>The properties of the datasets can be found in Table 1. We report the number of events (||),
number of users (| |), number of items (||), the period during which data was collected, and
the Gini coeficient comparing visits per item [ 38]. The Gini coeficient is a statistical measure
of dispersion, and a high Gini coeficient indicates that a few items have the most interactions,
and all the others are interacted with much less frequently. News datasets typically have a
higher Gini coeficient, because every day only a few articles are relevant for all users.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Algorithms</title>
        <p>We selected a combination of time-agnostic baseline algorithms, sequence-aware algorithms
and a time-aware algorithm to compare the impact resulting from optimising  for each of them.
Popularity The most visited items are recommended to each user. Recommendations are
only minimally personalised since items a user has interacted with before are removed from
the recommendations as per the scenario (See Section 3.1).</p>
        <p>Item-kNN One of the most well-known and frequently used baseline algorithms for
neighbourhoodbased collaborative filtering [ 39, 40]. The model consists of a single matrix multiplication with
an item-item matrix  ∈ R||×| |: () = . Where, , holds the similarity between
items  and . The similarity metric to use is considered a hyperparameter. In our work we use
cosine similarity and conditional probability as defined in Deshpande and Karypis [40]. Recent
work on neural news recommendation highlights the remarkable competitiveness of simple
neighbourhood-based methods compared to more complex alternatives [16, 41].
Item-kNN with fading factors We use two versions of ItemKNN methods with fading
factors. The first,proposed by Ding and Li [42], applies an exponential decay on user histories
when using them for prediction. The item-item similarity matrix is computed exactly as it
would be for the ItemKNN algorithm. The prediction function is changed to () = (),
where  applies an exponential decay on the interaction matrix . The decayed value for a
user item interaction is , = −  (0− ,), with 0 representing now, and , the last time
user  visited item  and  is a hyperparameter. Despite the fading factor applied during
prediction, we consider this algorithm time-agnostic in our discussion, because the trained
model is time-agnostic. We will call this method “IKNN Ding” in the remainder of the paper.</p>
        <p>The second method presented by Liu et al. [22] applies an exponential decay function on the
binary interaction matrix before computing the similarities . The similarities  are computed
as the cosine similarity between the columns of the decayed interaction matrix: (). The
decay function  is identical to the one used in IKNN, as is the prediction function . The
hyperparameters  fit and  predict of the decay function can be chosen independently for training
and prediction, allowing for more flexibility. We’ll reference this algorithm as “IKNN Liu”.
EASEr This model was proposed as an extension of the well-known SLIM method [43, 44].
In EASEr, the item-item matrix  is found through a least-squares optimisation problem that
allows for a closed-form solution. This makes the model much more eficient to compute than
iteratively optimised alternatives like Neural Networks, whilst yielding highly competitive
results. As the optimisation requires inverting the Gramian item-item matrix, EASEr becomes
more costly to compute as the size of the item catalogue grows.</p>
        <p>GRU4Rec The first deep learning model for recommendations to utilise a GRU component to
model sequential patterns in a session or user’s history [45]. The model was inspired by text
analysis methods and aims to capture relations between words that frequently appear together
in a particular order. In our experiments, we use the variant that with Bayesian Personalised
Ranking (BPR) loss to optimise the model, rather than using cross-entropy loss. BPR is more
suited for our scenario, because it solves a ranking problem and does not approach the problem
as a binary classification task. In addition this loss is also more eficient to compute, so training
times are lower.</p>
        <p>Sequential Rules (SR) Baseline algorithm using sequential association rules between items.
The model recommends items related to the last item a user has seen: () = . 
is the binary last visit matrix, (, ) = 1 only if  is the last item visited by user . The
asymmetric similarity between item i and j: , , is computed as ∑︀ ga1p((,,,,) . Where 1(, , )
∈
is an indicator function, that returns 1 only if user  has seen item  after item , and gap
returns the number of steps required to go from  to . A hyperparameter max_steps, specifies
how big this gap can maximally be before the co-occurrence is ignored. Ludewig et al. [33]
found that despite the simplicity of the algorithm, it performed competitively in sequential
recommendation tasks.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation metric(s)</title>
        <p>We consider the problem of optimal ranking of items, also known as the Top-K recommendation
problem. We use Normalised Discounted Cumulative Gain (NDCG) [46], Catalog Coverage
(Coverage) [47], Recall [46] and Mean Reciprocal Rank (MRR) [40] as metrics. The metrics
were evaluated on the top K recommendations, with  ∈ [10, 20, 50]. The goal we set for our
experiments is to generate an optimal ranking of items to be shown to the user as a list of
recommendations. Due to space considerations we report only NDCG@10 and Coverage@10
in this paper. Other results can be found in the public code repository1.</p>
        <p>Our primary metric is NDCG. We choose this metric because it rewards models that put the
correct items higher in the list. Besides this primary metric, we also report the coverage of the
algorithms because the amount of items recommended is often seen as a secondary goal for
recommendation [48].</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Parameter Optimisation</title>
        <p>We determine the optimal hyperparameters for each algorithm and dataset combination by
performing a search over the hyperparameter space and evaluating performance on the validation
dataset.</p>
        <p>Using a grid search, even one with coarse settings, would not be feasible given the large
amount of parameters for some of the algorithms, and the further addition of  to be inspected
over a large range of potential values.</p>
        <p>Rather than using a random search we utilised the Tree-structure Parzen Estimator [49] as
implemented in the Python hyperopt library2 [50]. While none of our hyperparameter spaces
contains dependent hyperparameters, the approach still manages to find optimal hyperparameter
combinations in fewer trials than a random search would.</p>
        <p>We don’t set a fixed amount of trials but give each algorithm-dataset pair a fixed amount
of time to run trials in order to find the best parameters. All algorithms were given six hours
to find the optimal hyperparameters, however, only GRU4Rec was unable to find convergence
within this timeframe. All other methods converged much sooner, often in less than two hours.
This way all experiments can be run in under a week without parallel computation on an 8-core
virtual machine with 52 GB of RAM, and a single NVIDIA Tesla T4 GPU. Due to insuficient
RAM, we could not train the EASE algorithm on Yoochoose and RETAIL datasets, and GRU4Rec
on the RETAIL dataset.</p>
        <p>In order to enable the exploration of more hyperparameters for GRU4Rec we did not train
it to full convergence during optimisation. This might lead to a loss of performance in the
optimisation results, however, the loss will be similar for every parameter combination so we
can find the optimal parameter combination while saving time on each trial. For the final results
on the test dataset, we train the GRU4Rec models for 20 epochs, resulting in convergence.</p>
        <sec id="sec-3-5-1">
          <title>1https://github.com/verachtertr/short-intent 2https://hyperopt.github.io/hyperopt/</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this Section we share the results of our experiments and answer the three research questions.
To enable reproduction and reuse of our experiments, we have made the code repository public3.</p>
      <sec id="sec-4-1">
        <title>4.1. RQ1: “How does the optimisation of  impact the individual performance</title>
        <p>of an algorithm?"
In Table 2 we present the optimal values for delta found during optimisation, and in Table 3 we
present the corresponding NDCG@10 values. We compute an NDCG value for both the model
trained on all training data ( = ∞) and on the optimised  ( = optim).</p>
        <p>The optimal choice of  depends on the combination of the dataset and the algorithm.</p>
        <p>A popularity algorithm works best with only the most recent data. Its optimal training
window is on most datasets smaller than a day, with only CosmeticsShop exhibiting stable
enough behaviour for 10 days to be optimal. On the news datasets we find the most drastic
improvements, up to 30 times on the Adressa dataset. The extraordinary performance of the
Popularity algorithm on news datasets and Adressa in particular is explained by the extreme
popularity bias present in these datasets. In Table 1 you find that for Adressa the Gini coeficient
of the items is 0.92, and on the test dataset, the Gini coeficient is even higher: 0.98. This
indicates that almost all events happen on a very small group of popular items.</p>
        <p>On the news datasets, the relevance of recent data is reflected in the optimal  values, the
time agnostic methods perform optimally using the last few hours to train. Only the time-aware
ItemKNN model (IKNN Liu) and GRU4Rec manage to use more than a day of data without
losing quality on the Adressa dataset. For both datasets, we see noticeable improvements in
performance for the time agnostic algorithms trained on only recent data. For the NEWS dataset,
with even more rapidly changing relevance, we see that all algorithms, even the time-aware
algorithm, perform optimally using only the last few hours of data.</p>
        <p>On the retail datasets, we see their stability reflected in the optimal  values. CosmeticsShop
is a very stable dataset, and most algorithms perform optimally using almost all of the data
(The maximal value for  is 124 * 24 = 2976 hours). For RETAIL, we note that the optimal  is
usually smaller than on CosmeticsShop, but the performance gains are minimal. This implies</p>
        <sec id="sec-4-1-1">
          <title>3https://github.com/verachtertr/short-intent</title>
          <p>that we can build a good model using less data, but adding the additional data does not hurt
performance as much as it did in the news use case. Yoochoose is the retail dataset where
optimisation of  has the largest impact. Most algorithms perform best using somewhere around
the last 10 days of data, only GRU4Rec requires a month of data to get the best model.</p>
          <p>The GRU4Rec algorithm shows the most inconsistent behaviour between validation and
testing data. The optimal values found during optimisation do not seem to translate to optimal
performance during testing. One possible reason for this, is that the model takes much longer
to train, and so far fewer parameter combinations could be checked.</p>
          <p>Choosing  right is important to get the optimal performance for an algorithm given a dataset.
In some cases, the dataset will be stable enough that using all data is optimal. In others, however,
it’s only the last few hours that hold relevant events to build a model for the imminent future.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. RQ2: "Does the optimisation of  change the relative performance</title>
        <p>between the algorithms?"
We compare how the rankings of algorithms sorted by NDCG change if we go from  = ∞ to
an optimised  . For this comparison we use Kendall’s Tau correlation between the two rankings
of the algorithms [51]. We report these correlations at the bottom of Table 3. On the two news
datasets, we note a strong dissonance between the rankings. Both have a correlation value
below zero, indicating that the rankings have drastically changed. When  = ∞, the time- and
sequence-aware approaches show superior performance, however, this is no longer the case
given an optimal  . The baseline methods surpass the deep learning methods and now perform
the best.</p>
        <p>For the retail datasets we don’t see this efect, either  = ∞ is optimal (CosmeticsShop
and RETAIL), or the time agnostic algorithms were already outperforming the deep learning
methods, and their improvement only further established their rank (Yoochoose). There is
however no guarantee that the rankings will always remain the same, we can imagine that
for some combinations of algorithms this ranking would change. Especially when comparing
time-aware models with time agnostic baselines. The time-aware models will have a higher
performance when using the whole datasets, and the baselines manage to close the gap when
their training window is optimised. We can see this happen on Yoochoose, IKNN Ding’s
performance almost matches that of IKNN Liu with an optimised  , when it was outperformed
on the  = ∞ setting.</p>
        <p>In most scientific articles, the results would be compared using a  = ∞ setting, and so
the time agnostic algorithms can be handily beaten by methods that do manage to take into
account the order and/or time of the interactions. However, simple baselines trained on the
more relevant - recent - part of the data, become much harder to improve on and even perform
best in some of our experiments. This highlights why it is so important to optimise  . If we do
not, we risk making the wrong conclusions.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. RQ3: How does the choice of  impact secondary metrics such as run time</title>
        <p>and coverage?
In Table 4 we present the Coverage@10 results for the algorithm-dataset pairs. We see that in
general coverage is lower for the optimal  . This is to be expected, because one of the side efects
of using less data is that older articles have no events, and so will not get recommended. Only
for ItemKNN and IKNN Ding on Adressa do we see an inverse efect: Shrinking the training
window increased the number of items recommended. This behaviour occurs when the historic
data drowns more recent interactions, such that even given the recent history of the user, the
model still mostly recommends a select group of older items. Reducing  levels the playing
ifeld for the more recent items, and so more of them can get recommended depending on the
interests of the users.</p>
        <p>A third metric impacted by the selection of  is the run time of the algorithms. Training a
model on less data usually leads to lower training and prediction times. We compute run time
as the sum of training and prediction time, thus accounting for both slow training and slow
prediction. Both are impacted by the amount of data used and both contribute to problematic
situations in production settings. In Table 5, the run time for the optimisation trials with optimal
 and maximal  are reported. Using less data leads to lower run times. For production settings,
this is an important insight. For example, on the Yoochoose dataset using the SR algorithm,
there is a small increase in performance when changing to an optimal  but also a 10-time
reduction in run time. This means that models can be updated more frequently and with lower
computational cost.</p>
        <p>This highlights a final reason why using less data should be considered. When using as much
data as is available, we not only risk lower performance, we are also incurring higher
computational costs and creating larger delays when building models and generating recommendations.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Online Tests</title>
        <p>Complementing the ofline results, we also performed two online trials on diferent news
websites. The goal for these trials was to optimise recommendation boxes that serve a list
of popular items to the users. Before the use of an automated optimisation of  , the training
window was chosen manually by engineers with some input from editors. By performing the
optimisation of  as suggested in this work, we found that the original values were not optimal,
and could be improved by using smaller  values.</p>
        <p>In a first test, using the website from which the NEWS dataset was extracted, the box was
found on the homepage. The manual setting was to train every three hours. Thus  = 3ℎ was
used as our control treatment. During the ofline experiments, we found that  = 1ℎ performed
optimally, and so for the test group we used this as the training window. The results of the AB
test showed that the optimised  = 1ℎ training window resulted in an improvement in CTR (on
the box) of 7% during a period of three days. After which we concluded the test, and enabled
the new setting for all users. We could use a short testing window thanks to high trafic; Three
million recommendation lists were generated for both groups combined. The 7% improvement
we found online, is similar to the 10% improvement we found ofline.</p>
        <p>In a second test on a diferent news website, we found an optimal window of  = 2ℎ after
parameter tuning. In this more extensive test, we deployed a similar recommendation list in
multiple locations on the website to make sure the positive efects were consistent. Furthermore,
the test was run for two weeks to allow for variations between days. We used two control
groups, one with training window  = 6ℎ, and a second with  = 10ℎ. Depending on the
location of the box we found an improvement in CTR of 7% to 8% over both control groups,
which performed nearly identical.</p>
        <p>Even though these experiments were only done using a popularity-based algorithm, they
show the value in optimising the  parameter before deploying the algorithms in a production
setting. The improvements we find in our ofline experiments for this algorithm were reflected
in our online experiments.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>“Are we forgetting something?” we wrote in the title, and our answer is clearly: yes! When
training and evaluating recommender systems, we typically forget to take the quality of the
data into account, or even consider the use of only (the most) recent parts of the given datasets.
As we have presented in this paper, the performance of state-of-the-art algorithms drastically
changes when training only on a recent part of the data. Moreover, the performance ranking
of state-of-the-art (both baseline and neural) algorithms changes significantly when using the
optimal training window size  . We believe that we have clearly shown that the choice of the
 matters, both to find the optimal performance of individual algorithms and to make a fair
comparison between algorithms. Optimising  for each algorithm should be standard practice
in the evaluation of recommender systems. Not optimising  will favour only the algorithms
that account for drift.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations and Future Work</title>
      <p>In this work we focused on News and Retail datasets, as well as a selection of baseline algorithms.
In future work we want to extend the experiments, by including additional relevant domains,
such as entertainment, tourism and music, as well as using more recently presented
state-ofthe-art sequential recommendation methods. In doing so we want to provide a comprehensive
benchmark of the state of the art in sequential recommendation.</p>
      <p>Due to run time concerns we did not consider repeated evaluations. To solidify our findings,
and make sure that they hold on more than one split, we aim to report results over time in
future experiments.</p>
      <p>This work’s experiments focus on short term efects, extending these results to long term
efects such as user retention is an interesting avenue for future research.
[3] I. Bayer, X. He, B. Kanagal, S. Rendle, A generic coordinate descent framework for learning
from implicit feedback, in: Proc. of the 26th International Conference on World Wide Web,
2017, pp. 1341–1350.
[4] W.-C. Kang, J. McAuley, Self-attentive sequential recommendation, in: 2018 IEEE
International Conference on Data Mining (ICDM), IEEE, 2018, pp. 197–206.
[5] V. Bogina, T. Kuflik, D. Jannach, M. Bielikova, M. Kompan, C. Trattner, Considering
temporal aspects in recommender systems: a survey, User Modeling and User-Adapted
Interaction (2022) 1–39.
[6] Y. Ji, A. Sun, J. Zhang, C. Li, A re-visit of the popularity baseline in recommender
systems, in: Proceedings of the 43rd International ACM SIGIR Conference on Research
and Development in Information Retrieval, 2020, pp. 1749–1752.
[7] A. Bifet, R. Gavalda, Learning from time-changing data with adaptive windowing, in:</p>
      <p>Proc. of the 2007 SIAM international conference on data mining, SIAM, 2007, pp. 443–448.
[8] R. Klinkenberg, I. Renz, Adaptive information filtering: Learning in the presence of concept
drifts, Learning for Text Categorization (1998) 33–40.
[9] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift
adaptation, ACM Computing Surveys 46 (2014).
[10] G. Widmer, M. Kubat, Learning in the presence of concept drift and hidden contexts,</p>
      <p>Machine Learning 23 (1996) 69–101.
[11] W. Fan, Systematic data selection to mine concept-drifting data streams, in: Proc. of the
10th ACM SIGKDD international conference on Knowledge discovery and data mining,
2004, pp. 128–137.
[12] P. G. Campos, F. Díez, I. Cantador, Time-aware recommender systems: a comprehensive
survey and analysis of existing evaluation protocols, User Modeling and User-Adapted
Interaction 24 (2014) 67–119.
[13] U. Panniello, M. Gorgoglione, C. Palmisano, Comparing pre-filtering and post-filtering
approach in a collaborative contextual recommender system: an application to e-commerce,
in: International Conference on Electronic Commerce and Web Technologies, Springer,
2009, pp. 348–359.
[14] D. Jannach, M. Ludewig, L. Lerche, Session-based item recommendation in e-commerce:
on short-term intents, reminders, trends and discounts, User Modeling and User-Adapted
Interaction 27 (2017) 351–392.
[15] T. Scheidt, J. Beel, Time-dependent evaluation of recommender systems., in: Perspectives@</p>
      <p>RecSys, 2021.
[16] M. Ludewig, D. Jannach, Evaluation of session-based recommendation algorithms, User</p>
      <p>Modeling and User-Adapted Interaction 28 (2018) 331–390.
[17] M. Quadrana, P. Cremonesi, D. Jannach, Sequence-aware recommender systems, ACM</p>
      <p>Computing Surveys (2018).
[18] J. Vinagre, A. Jorge, Forgetting mechanisms for scalable collaborative filtering, Journal of
the Brazilian Computer Society 18 (2012) 271–282.
[19] C. A. Ludmann, Recommending news articles in the clef news recommendation evaluation
lab with the data stream management system odysseus., in: CLEF (Working Notes), 2017.
[20] D. Jannach, M. Ludewig, When recurrent neural networks meet the neighborhood for
session-based recommendation, in: Proc. of the 11th ACM Conference on Recommender
Systems, RecSys ’17, ACM, 2017, pp. 306–310.
[21] T. Q. Lee, Y. Park, Y. Park, A time-based approach to efective recommender systems using
implicit feedback, Expert systems with applications 34 (2008) 3055–3062.
[22] N. Liu, M. Zhao, E. Xiang, Q. Yang, Online evolutionary collaborative filtering, in: Proc. of
the 4th ACM Conference on Recommender Systems, RecSys ’10, ACM, 2010, pp. 95–102.
[23] C. Xia, X. Jiang, S. Liu, Z. Luo, Z. Yu, Dynamic item-based recommendation algorithm with
time decay, in: 2010 Sixth International Conference on Natural Computation, volume 1,
IEEE, 2010, pp. 242–247.
[24] Y. Liu, Z. Xu, B. Shi, B. Zhang, Time-based k-nearest neighbor collaborative filtering, in:
2012 IEEE 12th International Conference on Computer and Information Technology, IEEE,
2012, pp. 1061–1065.
[25] P. C. Vaz, R. Ribeiro, D. M. De Matos, Understanding the temporal dynamics of
recommendations across diferent rating scales., in: UMAP Workshops, 2013.
[26] V. W. Anelli, T. Di Noia, E. Di Sciascio, A. Ragone, J. Trotta, Local popularity and time in
top-n recommendation, in: European Conference on Information Retrieval, Springer, 2019,
pp. 861–868.
[27] D. Garg, P. Gupta, P. Malhotra, L. Vig, G. Shrof, Sequence and time aware neighborhood
for session-based recommendations: Stan, in: Proc. of the 42nd International ACM SIGIR
Conference on Research and Development in Information Retrieval, 2019, pp. 1069–1072.
[28] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with
recurrent neural networks, arXiv preprint arXiv:1511.06939 (2015).
[29] J. Tang, K. Wang, Personalized top-n sequential recommendation via convolutional
sequence embedding, in: Proc. of the 11th ACM international conference on web search
and data mining, 2018, pp. 565–573.
[30] Q. Liu, Y. Zeng, R. Mokhosi, H. Zhang, Stamp: short-term attention/memory priority
model for session-based recommendation, in: Proceedings of the 24th ACM SIGKDD
international conference on knowledge discovery &amp; data mining, 2018, pp. 1831–1839.
[31] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? a
worrying analysis of recent neural recommendation approaches, in: Proc. of the 13th
ACM Conference on Recommender Systems, RecSys ’19, ACM, 2019, pp. 101–109.
[32] M. F. Dacrema, S. Boglio, P. Cremonesi, D. Jannach, A troubling analysis of reproducibility
and progress in recommender systems research, ACM Transactions on Information
Systems (TOIS) 39 (2021) 1–49.
[33] M. Ludewig, N. Mauro, S. Latifi, D. Jannach, Performance comparison of neural and
non-neural approaches to session-based recommendation, in: Proc. of the 13th ACM
conference on recommender systems, RecSys ’19, 2019, pp. 462–466.
[34] Y. Ji, A. Sun, J. Zhang, C. Li, A critical study on data leakage in recommender system
ofline evaluation, arXiv preprint arXiv:2010.11060 (2020).
[35] J. A. Gulla, L. Zhang, P. Liu, O. Özgöbek, X. Su, The adressa dataset for news
recommendation, in: Proc. of the International Conference on Web Intelligence, WI ’17, ACM, 2017,
p. 1042–1048.
[36] D. Ben-Shimon, A. Tsikinovsky, M. Friedmann, B. Shapira, L. Rokach, J. Hoerle, Recsys
challenge 2015 and the yoochoose dataset, in: Proc. of the 9th ACM Conference on
Recommender Systems, RecSys ’15, ACM, 2015, pp. 357–358.
[37] M. Kechinov, Cosmeticsshop e-commerce dataset, 2020. URL: https://www.kaggle.com/
mkechinov/ecommerce-events-history-in-cosmetics-shop, accessed: 2022-07-26.
[38] J. Y. Chin, Y. Chen, G. Cong, The datasets dilemma: How much do we really know about
recommendation datasets?, in: Proc. of the 15th ACM International Conference on Web
Search and Data Mining, 2022, pp. 141–149.
[39] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering
recommendation algorithms, in: Proc. of the 10th International Conference on World Wide Web,
WWW ’01, ACM, 2001, pp. 285–295.
[40] M. Deshpande, G. Karypis, Item-based top-n recommendation algorithms, ACM
Transactions on Information Systems (TOIS) 22 (2004) 143–177.
[41] G. D. S. P. Moreira, D. Jannach, A. M. da Cunha, Contextual hybrid session-based news
recommendation with recurrent neural networks, IEEE Access 7 (2019) 169185–169203.
[42] Y. Ding, X. Li, Time weight collaborative filtering, in: Proc. of the 14th ACM international
conference on Information and knowledge management, 2005, pp. 485–492.
[43] X. Ning, G. Karypis, Slim: Sparse linear methods for top-n recommender systems, in:
Proc. of the 2011 IEEE 11th International Conference on Data Mining, ICDM ’11, IEEE
Computer Society, 2011, pp. 497–506.
[44] H. Steck, Embarrassingly shallow autoencoders for sparse data, in: Proc. of the World</p>
      <p>Wide Web Conference, WWW ’19, ACM, 2019, p. 3251–3257.
[45] C. Wu, A. Ahmed, A. Beutel, A. J. Smola, H. Jing, Recurrent recommender networks, in:
Proc. of the 10th ACM international conference on web search and data mining, 2017, pp.
495–503.
[46] G. Shani, A. Gunawardana, Evaluating recommendation systems, in: Recommender
systems handbook, Springer, 2011, pp. 257–297.
[47] S. A. Puthiya P., N. Usunier, Y. Grandvalet, A coverage-based approach to
recommendation diversity on similarity graph, in: Proceedings of the 10th ACM Conference on
Recommender Systems, 2016, pp. 15–22.
[48] M. Ge, C. Delgado-Battenfeld, D. Jannach, Beyond accuracy: evaluating recommender
systems by coverage and serendipity, in: Proc. of the fourth ACM conference on Recommender
systems, RecSys ’10, 2010, pp. 257–260.
[49] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization,</p>
      <p>Advances in neural information processing systems 24 (2011).
[50] J. Bergstra, D. Yamins, D. Cox, Making a science of model search: Hyperparameter
optimization in hundreds of dimensions for vision architectures, in: International conference
on machine learning, volume 28, PMLR, 2013, pp. 115–123.
[51] M. G. Kendall, The treatment of ties in ranking problems, Biometrika 33 (1945) 239–251.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Jeunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Verstrepen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Goethals</surname>
          </string-name>
          ,
          <article-title>Fair ofline evaluation methodologies for implicitfeedback recommender systems with MNAR data</article-title>
          ,
          <source>in: Proc. of the ACM RecSys Workshop on Ofline Evaluation for Recommender Systems, REVEAL '18</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lonjarret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auburtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Robardet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plantevit</surname>
          </string-name>
          ,
          <article-title>Sequential recommendation with metric models based on frequent sequences</article-title>
          ,
          <source>Data Mining and Knowledge Discovery</source>
          <volume>35</volume>
          (
          <year>2021</year>
          )
          <fpage>1087</fpage>
          -
          <lpage>1133</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>