Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context

Milena Filipovic1,2, Blagoj Mitrevski2,3, Diego Antognini2, Emma Lejal Glaude1, Boi Faltings2 and Claudiu Musat1
1 Swisscom, Switzerland
2 Ecole Polytechnique Fédérale de Lausanne, Switzerland
3 Symphony, North Macedonia

Abstract
Recommender systems research tends to evaluate model performance offline and on randomly sampled targets, yet the same systems are later used to predict user behavior sequentially from a fixed point in time. Simulating online recommender system performance is notoriously difficult, and the discrepancy between online and offline behaviors is typically not accounted for in offline evaluations. This disparity permits weaknesses to go unnoticed until the model is deployed in a production setting. In this paper, we first demonstrate how omitting temporal context when evaluating recommender system performance leads to false confidence. To overcome this, we postulate that offline evaluation protocols can only model real-life use-cases if they account for temporal context. Next, we propose a training procedure to further embed the temporal context in existing models. We use a multi-objective approach to introduce temporal context into traditionally time-unaware recommender systems and confirm its advantage via the proposed evaluation protocol. Finally, we validate that the Pareto Fronts obtained with the added objective dominate those produced by state-of-the-art models that are only optimized for accuracy on three real-world publicly available datasets. The results show that including our temporal objective can improve recall@20 by up to 20%.

Keywords
evaluation, recommendation, offline and online evaluation, multi-objective optimization

Perspectives on the Evaluation of Recommender Systems Workshop (PERSPECTIVES 2021), September 25th, 2021, co-located with the 15th ACM Conference on Recommender Systems, Amsterdam, The Netherlands
milena.filipovic1@swisscom.com (M. Filipovic); blagoj.mitrevski@symphony.is (B. Mitrevski); diego.antognini@epfl.com (D. Antognini); emma.lejalglaude@swisscom.com (E. L. Glaude); boi.faltings@epfl.com (B. Faltings); claudiu.musat@swisscom.com (C. Musat)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In an increasingly digital world, recommender systems are a staple of our daily routines. They influence how we perceive our environment, from media content to human relationships. Traditional methods of evaluation that entail random sampling over a long period of time are perfect for a system that is designed to remain unchanged for an equally long and predefined period. However, if the system is to be used in a dynamic setting, e.g. recommending the next song to play in a playlist, the way it is evaluated must reflect that. Inadequate evaluation techniques can lead to false confidence, which is especially detrimental in commercial settings.

Recommender system evaluation can be done online or offline. Online evaluation implies deployment of the recommender in the real world, often in a commercial setting. While this may be ideal for measuring the real-life impact of a system, it is also costly, both in terms of resources and time, and is therefore rarely used in research and benchmarking. In offline evaluation, historical data is utilized.
Some portion of the data is selected to train on, while another subset is used for performance testing. Since all data points are available beforehand, the evaluation costs and overall timeline are significantly reduced.

Irrespective of whether they are evaluated online or offline, many existing recommenders ignore temporal information. Most recommender systems fall into one of two main categories: content-based or collaborative filtering [1, 2, 3, 4]. The former relies on recommended items having similar attributes to those that the user has previously interacted with, while the latter methods base their recommendations on items bought by similar users. However, many models completely ignore temporal information, with the notable exception of time-aware recommender systems [5]. These systems introduce additional context to interactions: their temporal dimension.

In this work, we first focus on the importance of temporal dynamics in recommender system evaluation. We draw attention to the lack of standardization in the evaluations, and the differences between research settings and the systems' ultimate applications. Then, we highlight two temporal evaluation protocols and show how they attain a closer approximation of the real-life conditions in which recommender systems are deployed. Second, we present a multi-objective approach [6] for incorporating temporal context into time-unaware recommender systems without any change in model architecture. We introduce a naive recency objective as a means to include temporal dynamics in typically time-independent recommender systems. We also provide a measure of recency in the form of a performance metric. Through experiments on three real-world publicly available datasets, we show that the addition of the naive temporal objective yields improvements not only in recency but also in relevance. Finally, we demonstrate that the Pareto Fronts obtained with the added objective dominate those produced by state-of-the-art models.

To the best of our knowledge, this is the first study quantifying the difference in recommender system performance when evaluated using methods that model real-world environments, as opposed to traditional techniques. We also show that a recommender system can be optimized for both relevance and recency objectives simultaneously. To summarize, the main contributions of this paper are as follows:

• We demonstrate how commonly used evaluation protocols do not provide adequate modeling of real-world deployment settings. To combat this, we show two evaluation techniques that facilitate offline modeling of online production environments and inherently incorporate temporal dynamics;
• We introduce a "naive" recency function that can be utilized to create a temporal objective. We show that optimizing for both temporal context and relevance [6] leads to solutions that dominate those optimized just for relevance in both dimensions.

2. Related Work

2.1. Evaluating Recommender Systems

2.1.1. Traditional Recommender Systems.

Inputs and outputs share similarities with classification and regression modeling: a class variable is predicted from a set of given features. Given that recommendation tasks can be seen as a generalization of these, some evaluation techniques used for classification are transferable to recommender systems. In collaborative filtering research, recommenders are generally evaluated either through strong or weak generalization, as characterized by [7].
In both approaches, models are trained on observed interactions and validated or tested on those that are held out. However, there exist some key differences. Weak generalization is introduced in [8], where the held-out set is created through random sampling of the available interactions of all users. Strong generalization differs by taking disjoint sets of users for the training, validation, and testing sets. Following this, some interactions are held out from the validation and test sets and then approximated using the recommender. Methods that encode user representation cannot apply strong generalization, as they cannot generate outputs for previously unseen users. An example of the strong generalization approach can be seen in [3], whereas [9, 10, 11] all use weak generalization. Several of these works emphasize that the application of their recommender system would be in predicting future user actions, yet all validation and testing is done with randomly selected interactions. This can break time linearity, as knowledge of future interactions during training can help predict a randomly sampled past interaction. While much effort is directed towards establishing the importance of proper evaluation design, it is generally focused on implementing relevant metrics to avoid under- or over-estimating real-world performance [12], and not on the evaluation procedures themselves.

2.1.2. Temporal Recommender Systems.

These are time-aware recommender systems (TARS), which incorporate time explicitly or implicitly [5]. Temporal recommender systems can be taken to include sequence-aware recommender systems (SARS), as a special form of time-adaptive recommenders that focus on ordering rather than specific time instances [5]. It is however important to note that while they can be evaluated using similar techniques, SARS approach temporal dynamics from a different perspective; the resulting models can therefore differ greatly from typical TARS [13]. [5] provide an extensive overview of possible evaluation techniques, which served as an inspiration and point of reference for this work. While traditional evaluation protocols may be used on temporal recommenders, it is more representative to preserve the temporal ordering between interactions, since this is something that the recommender aims to learn. By extension, train, validation, and test splits should also be ordered. [13] state that they were unable to find a consensus among evaluation protocols used in recent sequence-aware recommender work, which is mirrored in our findings. Yet we did determine that most recent SARS focus only on next-item prediction, meaning they output one recommendation. They also typically employ certain target item conditions to decrease computational cost [5]. The target item conditions determine the (sub)set of items for which a recommender should produce predictions and are specific to top-N recommendation evaluation. The reduction of computational cost is generally achieved through conditions that rank one ground-truth item against a set of other, false items. Examples can be found in [14, 15, 16]. We return to the problem of subsampling in Section 3.

2.2. Temporal Context in Recommender Systems

In this paper, we introduce the concept of recency. An important note is that there are multiple definitions of recency in the recommender systems literature; in fact, this lack of consensus has persisted for years. [17, 18] treat the recency of an item as an attribute that is user-dependent.
The value is determined by the last time the user interacted with a given item. [19, 20] also claim to incorporate recency into their research: when recommending news articles, they measure recency as the age of the item on the platform. Our analysis follows the latter definition. This is in line with our desire to explore the effects of a light-weight temporal addition on the performance of traditional RS. Further work to determine the "ideal" definition of recency, while undoubtedly invaluable, is outside the scope of this work.

3. Evaluation Protocols

We propose that the temporal dimension should be considered when evaluating the performance of any recommender. While random sampling may be an appropriate target selection technique for some classification or regression tasks, we argue that this is not the case when it comes to predicting a user's subsequent move. Unlike the vast majority of evaluation methods applied to traditional recommenders, the temporal recommender systems literature does model the passage of time. However, as stated above, performance is often computed over a subset of the itemset and the user's true chosen item. The argument is that subsampling is necessary due to the complexity of the ranking task. While this has some validity, itemsets of around 10,000 datapoints can be ranked highly efficiently, especially when taking into consideration recent advancements in machine learning libraries and GPU programming. Therefore, we do not utilize subsampling in our work.

The adoption of a recommender system in real scenarios has two major phases. The first, called the development phase, is purely offline and theoretical. In this part, three separate sets of data must be created: a training set that the model will use to learn item and user representations, a validation set for hyperparameter tuning, and a test set to evaluate how well the model performs. The second, called the deployment-ready phase, includes interactions with end-users. The maximum amount of data is leveraged to train a model with as much information as possible, evaluate its performance, and then deploy it into production. In this case, only two sets are needed: training and validation sets.

One downside of collaborative filtering methods is that most models are incapable of incorporating new items without retraining. While ways to alleviate this problem have been explored [21], the issue remains widespread and worthy of more study, but lies outside the scope of this paper. Therefore, we assume an industry-like environment: the recommender system will be retrained regularly and will be exposed to clients for a relatively short period, ranging from a couple of days to a few months. We postulate that the performance of the recommender on the last portion of historically available data is most indicative of how it will behave when deployed.

Figure 1: Two methods for temporal validation set target selection: (a) proportional temporal selection and (b) strict temporal cutoff. [Both panels plot users against time.]

Our protocols focus on set creation. When selecting the target values in a validation set, we take two possible approaches. The first, proportional selection, depicted in Figure 1a, selects the final X% of each user's interactions and uses these to create target items. Here we preserve the time ordering of the input and target interactions, maintaining similarity with the real-life use-case. The second approach, shown in Figure 1b, is based on a strict time cutoff to select the target items of the validation set. This method is even closer to the real-world use-case; however, it does suffer from certain drawbacks, as user interactions are not necessarily evenly distributed through time, leading to some users being more represented than others in the target set. While these approaches are similar to the suggestions developed in [5], we underline that they should not be limited to evaluating TARS. It is crucial to approximate with maximum precision the performance of a model when developing a novel system, before it is released into production. The second approach directly models the real-world context and contains user-item interaction sequences created after a specific strict time cutoff.
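To make the two protocols concrete, the sketch below selects validation targets under each of them. It is a minimal illustration rather than the paper's implementation: we assume the interaction log is a pandas frame with user_id, item_id, and timestamp columns, and both the representation and the column names are our own.

```python
import pandas as pd

def proportional_split(interactions: pd.DataFrame, holdout_frac: float = 0.2):
    """Proportional temporal selection (Figure 1a): hold out the final
    `holdout_frac` of each user's interactions as targets."""
    interactions = interactions.sort_values(["user_id", "timestamp"])
    rank = interactions.groupby("user_id").cumcount() + 1             # 1..n per user
    size = interactions.groupby("user_id")["item_id"].transform("size")
    is_target = rank / size > 1.0 - holdout_frac                      # last X% in time
    return interactions[~is_target], interactions[is_target]         # (inputs, targets)

def strict_cutoff_split(interactions: pd.DataFrame, cutoff):
    """Strict temporal cutoff (Figure 1b): every interaction after a global
    cutoff becomes a target, regardless of which user it belongs to."""
    is_target = interactions["timestamp"] >= cutoff
    return interactions[~is_target], interactions[is_target]         # (inputs, targets)
```

The strict cutoff variant makes the drawback noted above visible in code: the number of targets per user depends entirely on how that user's activity is distributed around the cutoff.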
4. Recency to Improve Recommendation

The main task of a recommender system is to anticipate users' future desires and suggest relevant content. The relevance objective is the one most commonly found in the recommender systems literature and accounts for accuracy or correctness. It actively focuses the recommender on selecting the item(s) with which the users will most likely interact. However, just recommending the most relevant items does not always satisfy all the concerns of those building the system, and it is not the only objective used in practice. We distinguish two types of objectives: correlated and uncorrelated to relevance. The former correspond to those whose optimization is linked to the relevance objective; examples are novelty [22], serendipity [23], and utility-based objectives, such as revenue. For the latter, uncorrelated to relevance, examples can be found in diversity and fairness. In this work, we introduce a simple utility-based objective used to inject temporal information alongside the relevance objective. While the exploration of uncorrelated objectives is essential for the future of recommender systems, we leave it for future work.

4.1. Adding Temporal Context

Based on our experience with real-life use-cases, we observed that users seem to gravitate towards content that has more recently been added to a given platform. While we cannot disclose internal facts and figures, the temporal objective described below was motivated by behaviour exhibited across many months of user interactions observed internally. Building on these findings, and works such as [19] and [20], we chose to explore the effects of incorporating recency as an objective during the learning phase. Given an item $x$ with a timestamp $t_x$, we define the recency function $f(\cdot)$ as:

$$f(x) = \begin{cases} 1 & \text{if } \frac{t_x - t_{min}}{t_{max} - t_{min}} \geq 0.8 \\ 0.3^{\left(0.8 - \frac{t_x - t_{min}}{t_{max} - t_{min}}\right) \times \frac{10}{3}} & \text{otherwise} \end{cases} \quad (1)$$

where $t_{max}$ and $t_{min}$ are the maximum (most recent) and minimum (oldest) timestamps over the itemset. In $f(\cdot)$, we first scale all timestamps to $[0, 1]$ using the min-max scaler, and then apply a transformation inspired by [24].

We underline that this is a naive function which we found to best approximate user interactions observed in-house. Works such as [25] rely on a power-law distribution to model temporal effects, highlighting that further exploration into use-case-specific temporal context approximations may yield exciting results. The recency objective is formulated as a loss that stimulates the recommendation of recent items. Each item in the itemset is assigned a recency weight, and the resulting vector is then used to weigh item importance when calculating the loss. Adding weights to a traditional loss does not affect the differentiability of the function.
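For reference, here is a small sketch of the recency scores as we read Eq. (1), producing the per-item weight vector used in the loss; the item_timestamps array (one timestamp per item) is an assumed input format.

```python
import numpy as np

def recency_scores(item_timestamps: np.ndarray) -> np.ndarray:
    """Recency f(x) of Eq. (1) for every item in the itemset."""
    t_min, t_max = item_timestamps.min(), item_timestamps.max()
    scaled = (item_timestamps - t_min) / (t_max - t_min)   # min-max scaling to [0, 1]
    # Items in the most recent 20% of the time range receive full weight;
    # older items decay smoothly (both branches agree at scaled = 0.8).
    return np.where(scaled >= 0.8, 1.0, 0.3 ** ((0.8 - scaled) * 10.0 / 3.0))
```

These scores double as the weight vector of the recency objective: because they are constants with respect to the model parameters, multiplying the per-item loss terms by them leaves the loss differentiable.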
To illustrate how our temporal objective can be easily integrated into a traditionally time-unaware RS, we take as a use-case the state-of-the-art variational autoencoder Mult-VAE^PR of [3]. For the sake of brevity, we refer the reader to [3] for more details about the model. We thus propose an extension of Mult-VAE^PR, where the loss function for user $u$ is modified to:

$$\mathcal{L}_\beta(x_u; \theta, \phi) = \mathbb{E}_{q_\phi(z_u \mid x_u)}\left[\log p_\theta(f(x_u) \odot x_u \mid z_u)\right] - \beta \cdot \mathrm{KL}\left(q_\phi(z_u \mid x_u) \,\|\, p(z_u)\right) \quad (2)$$

where the expected negative log-likelihood is modified to include the element-wise multiplication of the input vector $x_u$ by $f(x_u)$, which corresponds to the recency scores of the given items in $x_u$. $\beta$ controls how much importance is given to the KL term, $z_u$ is the latent variable of the variational distribution $q_\phi$, and $\theta$ and $\phi$ are model parameters.

4.2. Multi-Objective Optimization

Optimizing a recommender on multiple objectives is non-trivial. Thanks to the recent work of [6], we employ the proposed multi-gradient descent algorithm for multiple objectives to train our recommenders. After a standard forward pass, the loss and gradient are computed for each objective, and the weights of the gradients are computed by solving a quadratically constrained optimization problem [26]. This can be solved analytically for two objectives, or as a constrained optimization problem as proposed in [27] for more than two objectives. Solving it allows us to obtain the common descent vector and update the parameters. This training procedure enables us to incorporate both the temporal context and the relevance objectives to retrieve time-aware recommendations. The algorithm adapts the weight repartition between the two objectives throughout training to optimize both.
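The following sketch shows one such training step for the two-objective case, using the closed-form gradient weighting of [26, 27]; model, optimizer, relevance_loss, and recency_loss are placeholders, and the code is our simplified reading of the multi-gradient descent of [6], not its reference implementation.

```python
import torch

def two_objective_step(model, optimizer, relevance_loss, recency_loss):
    """One multi-gradient descent update combining two objective losses."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Per-objective gradients, flattened into single vectors.
    g1 = torch.autograd.grad(relevance_loss, params, retain_graph=True)
    g2 = torch.autograd.grad(recency_loss, params, retain_graph=True)
    g1 = torch.cat([g.flatten() for g in g1])
    g2 = torch.cat([g.flatten() for g in g2])

    # Closed-form minimizer of ||a*g1 + (1-a)*g2||^2 over a in [0, 1].
    alpha = torch.clamp((g2 - g1).dot(g2) / ((g1 - g2).dot(g1 - g2) + 1e-12),
                        0.0, 1.0)

    # Back-propagating the weighted sum gives every parameter the common
    # descent direction alpha*g1 + (1-alpha)*g2.
    optimizer.zero_grad()
    (alpha * relevance_loss + (1.0 - alpha) * recency_loss).backward()
    optimizer.step()
```

For two objectives the weighting reduces to a single scalar, which is why no external solver is needed; with more objectives, the Frank-Wolfe-based solver of [27] takes its place.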
5. Experiments

5.1. Datasets

We study the performance of various models on three real-world publicly available datasets.

MovieLens-20M. Contains about 20 million ratings (https://grouplens.org/datasets/movielens/20m/), with values between 1 and 5. We binarize the user-item interaction matrix, keeping ratings of 4 and above as positive feedback, to transform it into implicit feedback. We filter out all users with fewer than five ratings, and all movies rated by fewer than five users. We focus on the last ten years of available data (2005-2015). The preprocessed dataset contains 46,295 users, 9,479 items, and 3.76M interactions, with a density of 0.86%.

Steam. Contains review information from the gaming platform Steam (https://cseweb.ucsd.edu/~jmcauley/datasets.html). We converted user-item interactions into positive feedback signals. The dataset contains reviews from 2010 to 2018; however, the platform only sees an uptick in review activity after 2014, therefore we use 2014-2018 for our analysis. After preprocessing, the dataset contains 471,457 users, 13,018 items, and 3.14M interactions, with a density of 0.09%.

Netflix. The well-known Netflix Prize Competition dataset (https://www.kaggle.com/netflix-inc/netflix-prize-data). It consists of over 100 million ratings. We filter these ratings in the same way as the MovieLens ratings, and take the last two years of activity (2003-2005). Because of low performance on certain baselines, we denote two variants for the implicit feedback: one with a threshold of 4 and above (Netflix≥4), the other with a threshold of 5 (Netflix≥5). After preprocessing, the dataset contains 257,775 users, 13,995 items, and 38.87M interactions, with a density of 0.59%.

5.2. Recommendation Techniques

We conduct experiments with the following well-known recommender systems:

Mult-VAE^PR. The MAMO framework (https://github.com/swisscom/ai-research-mamo-framework) and the setup from the original paper [3] are utilized.

SVD. The PyTorch implementation of Singular Value Decomposition [28] (https://pytorch.org/docs/stable/generated/torch.svd.html) is used, taking the top 100 dimensions.

NCF. For Neural Collaborative Filtering [2], we take the implementation from https://github.com/guoyang9/NCF, sample 4 negative instances for every existing user-item interaction, set the predictive factor to 64, and the number of hidden layers of the multilayer perceptron (MLP) to three. We do not present results obtained using pre-trained NeuMF, as they exhibited the same patterns as generalized matrix factorization (GMF) and MLP, but did not give a significant improvement. To resolve difficulties in obtaining good results on Netflix≥4 for the GMF and MLP models, we used the Netflix≥5 dataset instead.

BERT4Rec. We implement this sequence-aware recommender system from [14] in PyTorch and integrate it with the MAMO framework. We take this model to show how directly encoding temporal information in the model impacts performance; in this case, the ordering represents the temporal information. Hyperparameters were mostly taken from the original paper, and otherwise selected based on a simple grid search. The number of transformer layers is set to 2, the number of heads to 4, the head dimensionality to 64, and the dropout to 0.1. We use a sequence length of 100, while the proportion of masked inputs is 0.2. The model is trained using the Adam optimizer with a learning rate of 1e-4.

All other models were trained with the Adam optimizer, with a learning rate of 0.001.

5.3. Experimental Setup

We explore whether validation set formation in the deployment-ready phase may lead to false confidence in the performance of the evaluated model. In the deployment-ready phase, what we call the validation set is not necessarily used for hyperparameter tuning, but to assess the performance of the model before it is deployed. There are minor differences in the datasets used for models with and without user representation: models without user representation require some input interactions to be able to predict targets, while those with it simply need to be passed a user identifier. We divide our experiments into three sets, corresponding to the type of evaluation.

5.3.1. Traditional Evaluation.

Similarly to [3], we divide the users 80:10:10 to form the train, validation, and test sets. The target interactions are selected by randomly sampling 20% of the user-item interactions in the validation and test sets. We show that if a model is evaluated on and then applied to a task that entails predicting randomly held-out interactions, the performance achieved on both validation and test sets is comparable. This traditional approach is typically used to report model performance. We then contrast performance on randomly held-out interactions in the validation set against temporally held-out interactions in the test set. We take 5% of the users from the train set to create the validation set and randomly hold out 20% of their interactions. The train and validation sets contain user-item interactions up to a specific point in time. The test set contains the interactions and users from the train and validation sets as inputs, with the temporally held-out interactions as targets, to simulate deployment in a commercial setting.
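For contrast with the temporal splits sketched in Section 3, the following is a minimal version of this traditional protocol under the same assumed pandas representation; the 80:10:10 proportions and the 20% random holdout follow the description above.

```python
import numpy as np
import pandas as pd

def traditional_split(interactions: pd.DataFrame, seed: int = 0):
    """Strong-generalization-style split: disjoint user sets, with 20% of
    each evaluation user's interactions sampled at random as targets."""
    rng = np.random.default_rng(seed)
    users = interactions["user_id"].unique()
    rng.shuffle(users)
    n = len(users)
    groups = {"train": users[: int(0.8 * n)],
              "val":   users[int(0.8 * n): int(0.9 * n)],
              "test":  users[int(0.9 * n):]}
    train = interactions[interactions["user_id"].isin(groups["train"])]
    splits = {}
    for name in ("val", "test"):
        rows = interactions[interactions["user_id"].isin(groups[name])]
        is_target = rng.random(len(rows)) < 0.2    # random, time-agnostic holdout
        splits[name] = (rows[~is_target], rows[is_target])   # (inputs, targets)
    return train, splits["val"], splits["test"]
```

The only difference from the temporal protocols is the target selection: here it ignores timestamps entirely, which is exactly what lets future interactions inform the prediction of past ones.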
5.3.2. Temporal Evaluation.

We show that when evaluated with either a proportional or a strict temporal cutoff, the model's performance is closer to what would be observed in a real-life setting. However, it is important to note that the ideal evaluation technique is heavily domain-dependent. We divide the train, validation, and test sets as follows: 5% of the users from the train set form the validation set. In the first approach, we hold out the last 20% of user-item interactions from each user in the validation set, while in the other, we hold out the last couple of months of activity and evaluate the model's ability to predict these interactions. The test set contains the interactions and users from the two other sets as inputs, with the temporally held-out interactions as targets.

5.3.3. Temporal Evaluation with Added Temporal Context.

We introduce temporal context into the traditionally time-independent Mult-VAE^PR by using the work from [6] to optimize the model for accuracy and recency. To calculate the recency score, we take the timestamp of the moment the item first became available, or the first recorded instance of any user interacting with the given item; this timestamp is $t_x$ in the recency function (1). The strict temporal cutoff validation set is utilized, as well as the temporal test set described previously.

5.4. Evaluation Metrics

We evaluate models using three ranking metrics, as RS can often only show a predefined number of recommendations. We ensured that the items that the user had previously interacted with were removed from the output before the top-k results were selected for metric calculation. A sketch of all three metrics follows the list below.

• Precision@K: calculates how many of the recommended items are relevant to the user;
• Recall@K: quantifies the proportion of relevant items in the top-k recommended items by calculating how many of the desirable items are suggested to the end-user. We take our definition from [3];
• Recency@K: assigns a recency score to each item, calculating the rating of the top-k recommended and relevant items. For user $u$ with relevant items $I_u$, we define $\omega(k)$ as the item at rank $k$, where $\mathbb{I}$ is the indicator function:

$$\mathrm{Recency@}K(u, \omega, f) = \sum_{k=1}^{K} \mathbb{I}[\omega(k) \in I_u] \times f(\omega(k)) \quad (3)$$
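The sketch below computes the three metrics for a single user, following the definitions above: ranked_items is the model's ranking with previously seen items already removed, relevant is the user's set of held-out target items, and f is the recency function of Eq. (1).

```python
def metrics_at_k(ranked_items, relevant, f, k=20):
    """Precision@K, Recall@K (per [3]), and Recency@K (Eq. 3) for one user."""
    topk = ranked_items[:k]
    hits = [item in relevant for item in topk]       # indicator I[w(k) in I_u]
    precision = sum(hits) / k
    recall = sum(hits) / min(k, len(relevant))       # normalization follows [3]
    recency = sum(f(item) for item, hit in zip(topk, hits) if hit)
    return precision, recall, recency
```

In the experiments, these per-user values would then be averaged over all evaluation users.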
6. Results

6.1. Traditional Evaluation.

This experiment aims to show that the traditional way of evaluating recommender systems, shown in Table 1, is not a faithful representation of the environments in which they are actually deployed. The good performance achieved by evaluating in this way can provide a false sense of security.

Table 1: Results of initial Mult-VAE^PR experiments, evaluated on a traditional evaluation protocol. We report Recall / Precision at k = 20.

Dataset      Val_trad      Test_trad
ML-20M       0.31 / 0.17   0.31 / 0.17
Steam        0.20 / 0.02   0.20 / 0.02
Netflix≥4    0.35 / 0.19   0.35 / 0.19

Our claim is supported by the values highlighted in Table 2. Even though the validation sets are not identical to the previous ones, the performance observed is very similar. However, it degrades on the time-delayed test set, or to be more precise, when we simulate what would happen in a production setting. Drops in Recall@20 of -65.63%, -35.00%, and -71.43% can be observed. We postulate that this discrepancy leads to significant dissonance between the results of certain recommenders as reported in the literature and those observed in their real-life application.

Table 2: Results of Mult-VAE^PR, SVD, GMF, and MLP evaluated on a traditional, a proportionally selected temporal, and a strict cutoff validation set, as well as on a temporally held-out test set. Results of BERT4Rec evaluated on a strict cutoff validation set and a time-delayed test set. We report Recall / Precision at k = 20.

Dataset      Model         Val_trad      Val_prop      Val_cutoff    Test_temp
ML-20M       Mult-VAE^PR   0.32 / 0.18   0.26 / 0.13   0.11 / 0.06   0.11 / 0.07
             SVD           0.25 / 0.22   0.14 / 0.11   0.07 / 0.03   0.11 / 0.07
             GMF           0.25 / 0.22   0.11 / 0.10   0.08 / 0.03   0.10 / 0.07
             MLP           0.25 / 0.23   0.12 / 0.10   0.07 / 0.03   0.11 / 0.07
             BERT4Rec      -             -             0.20 / 0.09   0.15 / 0.08
Steam        Mult-VAE^PR   0.20 / 0.02   0.14 / 0.02   0.11 / 0.01   0.13 / 0.01
             SVD           0.10 / 0.02   0.10 / 0.02   0.09 / 0.01   0.08 / 0.01
             BERT4Rec      -             -             0.21 / 0.02   0.17 / 0.02
Netflix≥4    Mult-VAE^PR   0.35 / 0.18   0.22 / 0.10   0.12 / 0.05   0.10 / 0.05
             SVD           0.23 / 0.16   0.23 / 0.16   0.09 / 0.05   0.07 / 0.04
             BERT4Rec      -             -             0.24 / 0.13   0.20 / 0.05
Netflix≥5    SVD           0.23 / 0.10   0.23 / 0.11   0.12 / 0.05   0.09 / 0.03
             GMF           0.31 / 0.14   0.30 / 0.14   0.14 / 0.05   0.12 / 0.04
             MLP           0.31 / 0.14   0.30 / 0.14   0.14 / 0.05   0.12 / 0.04

6.2. Temporal Evaluation.

The results shown in Table 2 depict what happens when using traditional validation as opposed to our proposed evaluation sets. The table illustrates how the strict cutoff validation set approximates the deployment behavior: for all datasets, this approach yields a closer estimate of the "real-life" performance. For example, the drop in performance is reduced from -71.43% to -16.67% on the Netflix≥4 dataset for the Mult-VAE^PR model. The proportionally selected validation set seems to work well for the Steam dataset, and we know from industry experience that it can be good on others; however, this appears to be highly dataset-specific. Table 2 also shows that this phenomenon is not isolated to the Mult-VAE^PR, but can be repeated with the SVD, GMF, and MLP models. As mentioned before, we were unable to conduct experiments on Netflix≥4 with the GMF and MLP models; therefore we report their results on Netflix≥5. It is important to note that simpler methods, especially those based on matrix factorization, do not deal well with the Steam dataset. This is the sparsest dataset that we work with, which seems to make it difficult to learn anything meaningful. Based on this, we exclude the Steam dataset for GMF and MLP, but keep the results for SVD. We strongly recommend that these evaluation methods be taken into account when presenting novel achievements in the field; when feasible, we recommend applying both protocols.

6.3. Temporal Evaluation and Temporal Models.

The results presented so far were achieved using traditional recommender architectures, with no way of learning temporal dynamics. To show that it is possible to achieve better results on the given datasets, we incorporate the temporal dynamics into the training process by utilizing BERT4Rec. The results, shown in Table 2, dominate all traditional solutions. This confirms our hypothesis that temporal dynamics should be accounted for in both evaluation design and model architecture in order to attain the best possible recommenders.
With BERT4Rec, the interaction ordering is encoded in the model. We acknowledge that the naive recency objective does not reproduce this effect when added to traditional RS. However, the goal of the following subsection is to illustrate that the simple addition of a cheap time-dependent weight affects performance in a meaningful way.

6.4. Temporal Evaluation with Added Temporal Context.

To integrate the temporal context into the traditional models, our following contribution includes recency as an objective influencing the optimization. We refer to the multi-objective Mult-VAE^PR as the Multi-Objective Recency Enriched Mult-VAE^PR (MOREVAE). We present both the Pareto Fronts obtained during training and the results of the best models on the test sets. These results were obtained through more intense training than those shown in the previous sections, in an attempt to extract the best possible performance from the Mult-VAE^PR. The Pareto Fronts were generated by evaluating on the strict cutoff validation sets during training, and the best models were chosen by selecting those with the highest Recall@20 and applying them to the time-delayed test sets.

Figure 2 shows that the multi-objective approach not only dominates the single-objective one in terms of recency, but that optimizing for recency also increases the relevance of the recommendations, validating our initial intuition. The results of the best models over the test sets are shown in Table 3. The improvements obtained are 18.18%, 0.00%, and 20.00% for Recall@20, and 14.29%, 0.00%, and 25.00% for Precision@20. The improvements in Recency@20 are 104.35%, 20.00%, and 94.12%.

Table 3: Comparison of Mult-VAE^PR and MOREVAE results obtained on temporally held-out test sets. We report Recall (R), Precision (P), and Recency (Re) at k = 20.

Dataset      Model         R      P      Re
ML-20M       Mult-VAE^PR   0.11   0.07   0.23
             MOREVAE       0.13   0.08   0.47
Steam        Mult-VAE^PR   0.13   0.01   0.15
             MOREVAE       0.13   0.01   0.18
Netflix≥4    Mult-VAE^PR   0.10   0.04   0.34
             MOREVAE       0.12   0.05   0.66

Figure 2: Pareto Fronts (Recency@20 vs. Recall@20) obtained by optimizing for one objective (accuracy) versus two objectives (accuracy and recency), on (a) the ML-20M dataset, (b) the Steam dataset, and (c) the Netflix≥4 dataset.

7. Conclusion

Following standard offline recommendation evaluations during development may lead to false confidence when deploying models in real-life scenarios. Utilizing random sampling to hold out data is not an adequate approximation of many real-life use-cases. Previous research generally focuses on developing better metrics to reflect real-world performance, but still omits temporal context. We highlight this lack of standardization and propose two temporal evaluation protocols that empirically better approximate real-life conditions. Our second contribution is to leverage a multi-objective approach and train models on relevance and recency simultaneously. We show that a naive recency objective can be used to integrate temporal information into existing time-unaware recommenders. Experiments on three real-world publicly available datasets demonstrate that our method produces solutions that strictly dominate those obtained with a model trained with single-objective optimization.
We explored datasets that are frequently used in recommender systems research, all related to digital media content. Digital media content is consumed frequently and generally without much repetition. The importance of recency and capturing transient behavioral trends may not be equivalent in other recommender systems applications, such as grocery or clothes shopping. The influence of temporal dynamics on these sectors is an exciting topic, and we leave it to future academic and commercial research.

References

[1] K. Zhou, H. Wang, W. X. Zhao, Y. Zhu, S. Wang, F. Zhang, Z. Wang, J.-R. Wen, S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization, Proceedings of the 29th ACM International Conference on Information and Knowledge Management (2020). URL: http://dx.doi.org/10.1145/3340531.3411954. doi:10.1145/3340531.3411954.
[2] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua, Neural collaborative filtering, in: Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 173–182.
[3] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 689–698.
[4] D. Antognini, C. Musat, B. Faltings, Interacting with explanations through critiquing, in: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI 2021), 2021.
[5] P. G. Campos, F. Díez, I. Cantador, Time-aware recommender systems: a comprehensive survey and analysis of existing evaluation protocols, User Modeling and User-Adapted Interaction 24 (2014) 67–119.
[6] N. Milojkovic, D. Antognini, G. Bergamin, B. Faltings, C. Musat, Multi-gradient descent for multi-objective recommender systems, Proceedings of the AAAI 2020 Workshop on Interactive and Conversational Recommendation Systems (WICRS) (2020).
[7] B. Marlin, Collaborative filtering: A machine learning perspective, University of Toronto, Toronto, 2004.
[8] J. S. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, in: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI'98, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998, pp. 43–52.
[9] X. Ning, G. Karypis, SLIM: Sparse linear methods for top-n recommender systems, in: 2011 IEEE 11th International Conference on Data Mining, IEEE, 2011, pp. 497–506.
[10] Y. Wu, C. DuBois, A. X. Zheng, M. Ester, Collaborative denoising auto-encoders for top-n recommender systems, in: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 2016, pp. 153–162.
[11] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, arXiv preprint arXiv:1205.2618 (2012).
[12] C. C. Aggarwal, et al., Recommender systems, volume 1, Springer, 2016.
[13] M. Quadrana, P. Cremonesi, D. Jannach, Sequence-aware recommender systems, ACM Computing Surveys (CSUR) 51 (2018) 1–36.
[14] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1441–1450.
[15] W.-C. Kang, J. McAuley, Self-attentive sequential recommendation, in: 2018 IEEE International Conference on Data Mining (ICDM), IEEE, 2018, pp. 197–206.
[16] B. Hidasi, A. Karatzoglou, Recurrent neural networks with top-k gains for session-based recommendations, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 843–852.
[17] Y. Ding, X. Li, M. E. Orlowska, Recency-based collaborative filtering, in: Proceedings of the 17th Australasian Database Conference - Volume 49, 2006, pp. 99–107.
[18] J. Vinagre, A. M. Jorge, J. Gama, Collaborative filtering with recency-based negative feedback, in: Proceedings of the 30th Annual ACM Symposium on Applied Computing, 2015, pp. 963–965.
[19] A. Chakraborty, S. Ghosh, N. Ganguly, K. P. Gummadi, Optimizing the recency-relevancy trade-off in online news recommendations, in: Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 837–846.
[20] G. de Souza Pereira Moreira, D. Jannach, A. M. da Cunha, Contextual hybrid session-based news recommendation with recurrent neural networks, IEEE Access 7 (2019) 169185–169203.
[21] X. Luo, Y. Xia, Q. Zhu, Incremental collaborative filtering recommender based on regularized matrix factorization, Knowledge-Based Systems 27 (2012) 271–280.
[22] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems, 2011, pp. 109–116.
[23] M. Ge, C. Delgado-Battenfeld, D. Jannach, Beyond accuracy: evaluating recommender systems by coverage and serendipity, in: Proceedings of the Fourth ACM Conference on Recommender Systems, 2010, pp. 257–260.
[24] C.-L. Huang, M.-C. Chen, W.-C. Huang, S.-H. Huang, Incorporating frequency, recency and profit in sequential pattern based recommender systems, Intelligent Data Analysis 17 (2013) 899–916.
[25] D. Kowald, S. C. Pujari, E. Lex, Temporal effects on hashtag reuse in twitter: A cognitive-inspired hashtag recommendation approach, in: Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 1401–1410.
[26] J.-A. Désidéri, Multiple-gradient descent algorithm (MGDA) for multiobjective optimization, Comptes Rendus Mathematique 350 (2012) 313–318.
[27] O. Sener, V. Koltun, Multi-task learning as multi-objective optimization, in: Advances in Neural Information Processing Systems, 2018, pp. 527–538.
[28] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Application of dimensionality reduction in recommender system - a case study, Technical Report, Minnesota Univ Minneapolis Dept of Computer Science, 2000.