Sequence or Pseudo-Sequence? An Analysis of Sequential Recommendation Datasets

Daniel Woolridge¹, Sean Wilner¹ and Madeleine Glick¹
¹ Vody LLC, Los Angeles, CA, USA

Abstract
Sequential recommendation aims to model a user's preferences by looking at the order of interactions in a user's history. The evaluation of such algorithms requires robust datasets with genuine sequential information. In this work we analyze the timestamp information of several commonly used datasets and show that reported timestamps are not indicative of meaningful sequential order. In the datasets explored, significant numbers of users have interactions occurring at identical timestamps. The actual order of these interactions is therefore unknowable; the interaction history is pseudo-sequential. We find that randomly shuffling the order of interactions has minimal impact on the performance of a leading sequential recommender. Particular attention is paid to MovieLens because of its frequency of use in the field of sequential recommendation. Our findings motivate the necessity for new datasets with more meaningful ordering for the evaluation of sequential recommenders.

Keywords
Datasets, Recommendation, Sequential Recommendation, MovieLens, CEUR-WS

Perspectives on the Evaluation of Recommender Systems Workshop (PERSPECTIVES 2021), September 25th, 2021, co-located with the 15th ACM Conference on Recommender Systems, Amsterdam, The Netherlands
dan@vody.com (D. Woolridge); sean@vody.com (S. Wilner); madeliene@vody.com (M. Glick)
https://www.vody.com (D. Woolridge)
ORCID: 0000-0003-2415-5528 (D. Woolridge); 0000-0002-8382-3488 (S. Wilner); 0000-0003-3042-2039 (M. Glick)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

The ubiquity of and necessity for recommendation systems has given rise to an explosion of research in the field. Recommendation algorithms are studied and implemented for use in domains spanning media, e-commerce, social networks and more. The landscape of recommendation research consists of a plethora of techniques, most of which attempt to model the interactions between users and items in order to predict the items with which a user is most likely to interact. Sequential recommendation, an increasingly popular trend in the field, works by taking into account not only the users' interaction history but the order of those interactions as well. The goal of a sequential recommender is to utilize and exploit sequential patterns in historical user behavior [1, 2, 3, 4].

As with any burgeoning field, the ability to accurately benchmark and compare results across various datasets, models, and metrics is essential. Benchmarking the relative performance of various recommendation algorithms on publicly available datasets is a core part of the research and development of such systems. It is therefore of utmost importance to ensure the validity of various benchmark datasets for accurately conducting research. Throughout the machine learning community, emphasis is increasingly being placed on verifying all aspects of the datasets used to benchmark models.
Recent work on biases in image datasets [5, 6], language datasets [7, 8], and the models trained on them focuses primarily on the ethical and societal impacts of these biases and how their presence and lack of mitigation can affect the wider world. Others have looked at the frequency of label errors in various commonly used datasets spanning multiple domains [9]. Misunderstanding and misuse of input data can lead to erroneous conclusions that become tainted seeds for subsequent works. One such example, brought to attention by Li et al. [10], shows how experimental procedure can imbue datasets with non-signal correlations that models will pick up on and utilize. In an analogous manner, our work analyzes time-sequence data in various recommendation datasets and calls into question their validity for assessing the performance of sequential recommendation algorithms.

There are two main approaches to assessing the performance of a recommendation system: online and offline tests. In the case of online testing, researchers are able to assess the performance of two or more systems by using those systems to provide recommendations for different user segments [11]. Often, researchers do not have access to online tests for recommendation and thus have to rely on offline tests [12]. A core assumption for offline tests is that the dataset on which they are conducted accurately reflects a real-world recommendation scenario, or, if it does not, that the differences are established and understood.

In this work, we begin by exploring the timestamp information in several popular datasets used for evaluating sequential recommendation algorithms. We find that for some of these datasets the mere presence of timestamp information is not indicative of task-relevant sequential information. We discuss the construction of the datasets and explore issues therein. Additionally, we show the impact of these issues by conducting experiments with a popular sequential recommendation algorithm. Our findings call into question the validity of using these datasets for the evaluation of sequential recommenders.

2. Related Work

2.1. Sequential Recommendation

The task of sequential recommendation is to provide relevant items to users by using the users' historical interaction data and exploiting patterns between subsequent items. There has been much work done in this area, which we briefly summarize below.¹

¹ We refer to Quadrana et al. [13] and Campos et al. [14] for deeper studies on the subject.

2.1.1. Baselines and Earlier Models

Simple baselines are often used to compare against more sophisticated methods. One frequently used baseline is to rank items by their popularity and provide these as recommendations for each user. Another simple yet performant baseline is ItemKNN, a K-Nearest Neighbors approach on the items [15].

One particularly successful class of approaches has been to explore historical sequences using K-th order Markov models, modeling stochastic transitions between items to exploit sequential patterns [16]. Rendle et al. [3] combine Markov Chains (MC) with Matrix Factorization in their work Factorized Personalized Markov Chains, which, although promising, struggles with sparsity issues. To address this problem, He and McAuley [4] introduce Fossil, a model that fuses item similarity models with Markov Chain models.
2.1.2. Recurrent Models

Another class of approaches uses Recurrent Neural Networks (RNNs) and their extensions to tackle the problem of sequential recommendation. Hidasi et al. [17] use an RNN architecture for session recommendation. Quadrana et al. [18] extend this work by incorporating user interests via a Gated Recurrent Unit (GRU) layer that carries information across user sessions. Zhu et al. [19] use a time-interval-aware Long Short-Term Memory (LSTM) model in an attempt to better capture both long- and short-term interactions in a user's history. A key drawback of these models is that they require large amounts of dense data to perform well.

2.1.3. Attentional Models

With the success of attentional methods and transformers in multiple domains with sequential information, it is only natural that they would be applied to the problem of recommendation. Kang and McAuley [1] present SASRec, a self-attention model that is able both to capture long-term semantics and to provide recommendations based on a few salient actions. The success of this model has prompted many derivatives and extensions. One such extension, BERT4Rec, has been developed by Sun et al. [2]. Inspired by the popular language model BERT, BERT4Rec uses bi-directional self-attention to better model users' behavior sequences and deal with potentially noisy input sequences [20]. Ying et al. [21] introduce a two-level hierarchical attention network in an attempt to better capture the long- and short-term interests of users.

2.2. Evaluation and Benchmarking

The lack of effective benchmarks for evaluation of recommendation algorithms is a critical issue facing the community. Although there have been recent dedicated works dissecting vision, language and audio datasets [9] and the effect of errors therein on benchmark validity, the field of recommendation still lags behind in this area. Conference tracks are being dedicated to datasets and benchmarks²,³ and efforts are being made to properly review and badge artifacts [22]. Various attempts to standardize dataset versioning have been proposed [23, 24, 25] but have yet to be widely adopted.

Evaluating recommendation algorithms is difficult [26], and despite many attempts to standardize frameworks [27, 28, 29, 30, 31] the field as a whole still lacks the consistency desired. Said and Bellogín [32] explore four aspects of recommendation contributions pertaining to their reproducibility: the dataset, the evaluation framework, data details, and algorithmic details. Within this structure, our work focuses primarily on the dataset and data details aspects.

A recent and vitally important publication by Rendle et al. [26] shows how several baselines can be tuned to outperform reported results, thereby calling into question many results from previous years. Sun et al. [29] provide an extensive look at many prominent recommendation contributions. As shown in their Figure 1B, MovieLens-1M is the most frequently used dataset for evaluating recommendation algorithms, appearing in just over 30% of the 85 papers studied. The other MovieLens datasets explored (100K, 10M, and 20M) appear less frequently but all are in the top 15 datasets by popularity. In Figure 5A of [29], results are shown on baseline models using the MovieLens dataset split randomly or split by timestamp.

² https://neurips.cc/Conferences/2021/CallForDatasetsBenchmarks
³ https://recsys.acm.org/recsys21/perspectives/
The authors claim that the time-aware split better simulates the real recommendation scenario, which may be true if the timestamps represent a realistic interaction sequence.

Gruson et al. [12] discuss the challenges of offline recommendation evaluation and specifically point out that some datasets include biases introduced in their construction, whether through the user interface, an internal recommendation algorithm or otherwise. Ji et al. [33] raise concerns with offline evaluations on datasets which ignore the global timeline of interaction sequences in the data. When datasets are collected over many years, some items may not be available for the entire duration of the data collection, thereby introducing biases that should be accounted for in evaluation. In a similar vein, our work takes a deeper look at the local timestamp information present in some leading recommendation datasets.

3. Datasets

Offline evaluation of recommendation systems requires robust datasets that are applicable to the task under consideration. For sequential recommendation this means that datasets should contain genuine order information that serves as a close proxy to the real-world scenario being simulated. Our primary focus in this work is on the timestamp information that most recommendation datasets include and that is often used to infer the order of interactions. The timestamps for the datasets we explore are provided either in a date format or in Unix time. When presented in Unix time, meaningful relations are obscured as the scale between timestamps is less apparent to visual inspection. To understand the nature of timestamps in the field we explore six established recommendation datasets:

• MovieLens 1M and 25M: Our primary focus and one of the most widely used datasets for benchmarking recommendation performance. We explore both the ML-1M and ML-25M versions [34]. In these datasets the interactions represent a user rating a movie and the timestamps indicate when the rating was submitted, to the nearest second.
• Amazon Beauty: A dataset with reviews of beauty products from Amazon.com introduced in [35]. The interactions are reviews of the products and the timestamps represent the date of review.
• Amazon Video Games 2014 and 2018: Datasets of video game product reviews. As both were available, we look at the 2014 version introduced in McAuley et al. [35] and the 2018 version introduced in Ni et al. [36]. As with Amazon Beauty, the interactions in these datasets represent a user reviewing an item and the timestamp is the date of the review.
• Steam: Introduced in Kang and McAuley [1], this dataset captures interactions between users and video games on Steam. The interactions are reviews of the video games, and the timestamps are the date of the review.

Table 1: Dataset statistics after pre-processing steps.

| Dataset | Users | Items | Avg. Interactions Per User | Avg. Interactions Per Item | Interactions |
|---|---|---|---|---|---|
| Amazon Beauty 2014 | 22,363 | 12,101 | 8.9 | 16.4 | 0.2M |
| Amazon Video Games 2014 | 24,303 | 10,672 | 9.5 | 21.7 | 0.23M |
| Amazon Video Games 2018 | 50,688 | 16,898 | 9.0 | 26.9 | 0.45M |
| Steam | 281,455 | 11,961 | 12.6 | 297 | 3.5M |
| MovieLens-1M | 6,040 | 3,416 | 165.5 | 292.6 | 1.0M |
| MovieLens-25M | 162,541 | 32,720 | 153.5 | 762.4 | 24.9M |

For all datasets we apply the same pre-processing steps as detailed in Kang and McAuley [1] and Sun et al. [2], wherein we remove duplicate interactions and keep only users and items with at least 5 interactions.
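A minimal pandas sketch of how this pre-processing could be applied, assuming an interaction table with user_id, item_id and timestamp columns (the column names, and the choice to filter repeatedly until every remaining user and item has at least five interactions, are our assumptions rather than the authors' exact procedure):

```python
import pandas as pd

def preprocess(df: pd.DataFrame, min_interactions: int = 5) -> pd.DataFrame:
    """Drop duplicate user-item interactions, then keep only users and items
    with at least `min_interactions` interactions."""
    # Keep one row per (user, item) pair.
    df = df.drop_duplicates(subset=["user_id", "item_id"])
    # Filter repeatedly, since dropping sparse users can make items sparse
    # (and vice versa); a single filtering pass is another common convention.
    while True:
        user_counts = df["user_id"].map(df["user_id"].value_counts())
        item_counts = df["item_id"].map(df["item_id"].value_counts())
        keep = (user_counts >= min_interactions) & (item_counts >= min_interactions)
        if keep.all():
            return df
        df = df[keep]

# Hypothetical usage with a CSV of (user_id, item_id, timestamp) rows:
# interactions = preprocess(pd.read_csv("interactions.csv"))
```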
All datasets include timestamp data, and all datasets have a maximum timestamp resolution of days, except for the two MovieLens datasets whose maximum resolution is at the level of seconds. As seen in Table 1, ML-1M is the most dense of the datasets explored, with the three Amazon datasets being the most sparse. ML-1M also has much longer interaction histories on average. This makes it of particular interest to researchers looking to explore longer-range dynamics in user sequences for recommendation. However, as we show in the following sections, the use of MovieLens is especially problematic for evaluating sequentially driven methods.

3.1. Timestamps

[Figure 1: The percentage of users in each dataset whose interactions happened on x unique days. Panels: (a) ML-1M, (b) ML-25M, (c) Steam, (d) Video 2014, (e) Video 2018, (f) Beauty; x-axis: number of unique days with interactions; y-axis: percentage of users.]

In all datasets, each user history exists as a timestamp-ordered sequence of interactions with items. In this and the following sections we show that although the datasets include timestamps, this does not necessarily mean that these timestamps convey interaction order. One way to clearly see the lack of true ordering in the ML-1M dataset is to look at the number of interactions for each user that happen at unique timestamps. Figure 1 shows the difference between the MovieLens datasets and the others when viewed through the lens of interactions on unique days. For both MovieLens datasets the majority of users (59% and 56.4% for ML-1M and ML-25M respectively) have all their interactions occurring on a single date. When taken in conjunction with an average sequence length of over 150, these facts make it clear that the MovieLens datasets are not representative of a realistic sequence of movie watching. Steam is the only dataset with no users whose interactions are all on one day, and although there are single-day interaction sequences in the other datasets, these users represent a much smaller percentage of the total users than they do in the MovieLens datasets.

The timestamp information in the MovieLens datasets is at the level of seconds, not days. Naturally then, it may be the case that although all interactions for a user occur on one or two days, their ordering still provides some estimate of a genuine interaction history. Ideally, all users would have fully distinct interaction histories and therefore a discernible order would exist between the interactions. However, this is not the case in the MovieLens datasets, with zero-second intervals accounting for 53.2% of timestamp intervals in ML-1M (17.4% in ML-25M). Further details are provided in Appendices A.1 & A.2. This further highlights the importance of dataset inspection and special care with regards to the application of timestamps for ordering in recommendation.
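The per-user unique-day statistic behind Figure 1 can be reproduced with a short pandas sketch; the column names and the assumption that timestamps are Unix seconds are ours:

```python
import pandas as pd

def unique_day_distribution(df: pd.DataFrame) -> pd.Series:
    """Percentage of users whose interactions span exactly x unique calendar
    days -- the quantity plotted in Figure 1."""
    # Timestamps assumed to be Unix seconds; truncate them to calendar days.
    days = pd.to_datetime(df["timestamp"], unit="s").dt.normalize()
    unique_days_per_user = days.groupby(df["user_id"]).nunique()
    return unique_days_per_user.value_counts(normalize=True).sort_index() * 100

# Example: the share of users whose entire history falls on a single day
# (reported as 59% for ML-1M and 56.4% for ML-25M in Section 3.1).
# distribution = unique_day_distribution(ratings)
# single_day_share = distribution.loc[1]
```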
3.2. Intervals

In order to better understand the sequences in these datasets we look at each user's sequence of n interactions as a list of n − 1 consecutive intervals. For example, if a user interacted with item A on 2018-01-02 and item B on 2018-01-03, the interval would be one day. Some key statistics of this interval information are presented in Table 2. The mean-mode interval is the mean over users of each user's most frequent (modal) interval, in days or seconds. The mode-mode interval is the mode over users of each user's modal interval, in days or seconds. We define the unique interaction ratio for a dataset as the number of interactions at distinct timestamps divided by the total number of interactions for each user, averaged over all users. Of key importance is that if two interactions have an interval of zero, then the order in which those interactions occurred is unknowable.

Table 2: Interval statistics.

| Dataset | Mean-Mode Interval (Days/Seconds) | Mode-Mode Interval (Days/Seconds) | Unique Interaction Ratio (Days/Seconds) |
|---|---|---|---|
| Beauty 2014 | 4 | 0 | 0.61 |
| Video 2014 | 9 | 0 | 0.69 |
| Video 2018 | 7.8 | 0 | 0.6 |
| Steam | 14.7 | 0 | 0.86 |
| ML-1M | ~0 / 0.05 | 0 / 0 | 0.03 / 0.48 |
| ML-25M | ~0 / 3.24 | 0 / 0 | 0.04 / 0.8 |

As can be seen in Table 2, the mean-mode interval for ML-1M is nearly 0 over both days and seconds. This means that, on average, users' interaction sequences contain more interactions happening at the same second as another interaction in the same user's sequence than not. Therefore the order of items in such a sequence is ill-posed and, given the time-scale under consideration, divorced from a realistic viewing pattern. Also of note is that for all datasets the mode-mode interval is 0 days or seconds. The significant amount of indistinct regions of interactions implied by this value means that interaction orders derived from all these datasets are contaminated with some amount of noise.

Figure 2 visualizes the percentage of intervals for each dataset that are of zero days in length. While all datasets have large amounts of zero-day intervals, ranging from 19.4% for Steam to 98.5% for ML-1M, the MovieLens datasets stand out as severely divergent from a real-world scenario. These distributions of timestamps do not reflect natural interaction behavior where, for example, a person is unlikely to watch more than two or three movies in a single day. In fact, the 59% of users in the ML-1M dataset whose entire interaction sequence is on a single day have a median sequence length of 62 interactions. A large percentage of interactions in MovieLens share timestamps with 'adjacent' interactions. This behavior is problematic when using the ordering of these interactions as input information for modelling. Given these overlaps, the ordering only partly exists, and in the case of ML-1M mostly does not. See Appendix A.1 for further analysis.

[Figure 2: Percentage of intervals between consecutive interactions of length 0 days. Pie charts for (a) ML-1M, (b) ML-25M, (c) Steam, (d) Video 2014, (e) Video 2018, (f) Beauty, each split into "0 Days" versus "All Other Intervals".]

The data presented in Table 2 and Figure 2 raise interesting questions about the validity of these datasets as offline proxies for genuine sequential interaction data. By presenting Figures 1 and 2 as well as Table 2 we hope to convey that although the timestamp information in the MovieLens datasets is particularly problematic, the other datasets all suffer from the same affliction to lesser and varying degrees. We propose that the above analysis, or something analogous to it, become standard procedure when datasets are used for evaluating sequential recommenders.
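A rough sketch of how these interval statistics could be computed, under the same assumed column names as above (the tie-breaking when a user has several modal intervals is our choice):

```python
import pandas as pd

def interval_statistics(df: pd.DataFrame) -> dict:
    """Sketch of the Section 3.2 statistics over per-user timestamp intervals,
    computed here in seconds (divide by 86400 for the day-level figures)."""
    df = df.sort_values(["user_id", "timestamp"])
    by_user = df.groupby("user_id")["timestamp"]

    intervals = by_user.diff().dropna()            # n - 1 intervals per user
    # Each user's modal interval; ties are broken by taking the smallest mode.
    user_modes = intervals.groupby(df["user_id"]).agg(lambda s: s.mode().iloc[0])
    unique_ratio = by_user.apply(lambda s: s.nunique() / len(s))

    return {
        "mean_mode_interval": user_modes.mean(),
        "mode_mode_interval": user_modes.mode().iloc[0],
        "unique_interaction_ratio": unique_ratio.mean(),
        "zero_interval_share": (intervals == 0).mean(),   # cf. Figure 2 and A.2
    }
```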
3.3. MovieLens for Sequential Recommendation

The MovieLens datasets are well established for evaluating recommendation engines [37, 38] and continue to be used by many of the leading models as one of the benchmarks for sequential recommendation [39, 1, 2, 40, 41, 42]. While the MovieLens datasets are incredibly valuable for assessing general recommendation algorithms, we find that they are not good datasets for assessing the performance of sequential recommendation. In fact, the originators of the datasets raise concerns regarding the evidentiary value of the timestamps in their 2015 paper [34].

An argument could be made that although the ordered interaction data present in the dataset does not closely represent a real-world scenario, the sequences nonetheless represent some estimate of order of interest. This is, however, in conflict with our finding that over 50% of interactions in user sequences from ML-1M share a timestamp with another interaction in the same user sequence. For ML-25M this percentage is 17.4%, showing that despite improvements to MovieLens, regions of ambiguous order still exist in a meaningful portion of the dataset.

It may be true that some user interactions that share timestamps were added to the dataset in such a way that the order of interaction was preserved. Following this, it may seem reasonable to infer that they present a valid source of order. However, if we sort the datasets by timestamp, as is often done during pre-processing, we are left to the whims of the sorting algorithm and how it decides to order the equal-valued entries. This allows two researchers using the same dataset to end up with different interaction orders for the same users. This problem could be remedied by an unambiguous 'interaction order' field included alongside the timestamps at construction.

An additional factor mentioned by Harper and Konstan [34] is that the movies rated by users in the MovieLens datasets are often prompted by an internal recommendation engine. This means that the items users are served to rate depend on the items the user has previously rated. For the larger MovieLens datasets (10M, 20M and 25M) the interface and underlying recommendation engine have changed over the course of the dataset collection [34]. The presence of an internal recommendation engine introduces an implicit signal into a user's interaction sequence, since it controls the scope of which items a user is served for rating at any given position in the sequence.

The examination of the datasets throughout this section points towards several issues with their applicability as evaluation benchmarks for sequential recommendation. In the following section we perform a set of experiments aimed at exploring the breadth of impact of these pseudo-sequences on a sequential recommender model.

4. Experiments

We aim to explore the impact, if any, that the pervasive ordering pathologies of several common recommendation datasets have on sequential recommendation performance. Specifically, the goals of our experiments are to answer the following research questions: Given the questionable timestamp-informed sequences in the datasets, how much impact does shuffling this information have on performance? Does the explicit construction of MovieLens by an internal recommendation engine introduce a signal that sequential recommenders inadvertently exploit? If we adapt the experimental design to control for the conflated signal in MovieLens, what impact does shuffling the data have on performance?
4.1. Implementation Details

We choose to use SASRec as the basis for our experiments because it is an established model with a left-to-right architecture whose training paradigm predicts the next item for each position in the sequence. We re-implement SASRec from the ground up in TensorFlow 2. For all datasets used we pull fresh copies from the sources.⁴ We omit ML-25M from our experiments due to computational and time constraints, and leave this for future work. We adopt the same hyperparameters for each dataset as in Kang and McAuley [1].

Following the lead of Kang and McAuley [1], the metrics evaluated are Hit Rate (HR) and Normalized Discounted Cumulative Gain (NDCG) [43] at 10. Evaluation is done by taking 100 items from the dataset that the user did not interact with and calculating relevance scores for these items. The relevance score for the held-out item, belonging to either the validation or the test set, is then compared to and ranked against the negative items. These metrics assume that the user has equal opportunity to pick any item in the dataset to interact with next. The MovieLens rating interface and internal recommendation engine explicitly invalidate this assumption.

4.2. Shuffled vs. Unshuffled

Our first experiment aims to determine the impact of explicitly randomizing the order of the training items for each user. We follow the training paradigm in Kang and McAuley [1]. The input for each user is the last N items in their history, where the second-to-last item is held out for validation and the last item is held out for test evaluation. To avoid biases from random seed selection, we perform twenty total runs for each dataset, ten shuffled and ten unshuffled, with both sets sharing the same ten random seeds. In the shuffled cases we randomly re-order the user's interaction history but keep the last two items untouched (for validation and test).

⁴ We will provide all model and data cleaning code upon publication.
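A minimal sketch of this per-user split and shuffle, assuming each history is already a time-ordered list of item ids (how seeds are wired into the per-user shuffle is our assumption, not the authors' exact setup):

```python
import random
from typing import List, Tuple

def split_user_sequence(history: List[int], shuffle: bool,
                        seed: int) -> Tuple[List[int], int, int]:
    """Split one user's time-ordered item sequence into (train, valid, test)
    following the Section 4.2 protocol: the last item is the test target, the
    second-to-last is the validation target, and in the shuffled condition
    only the remaining training prefix is randomly re-ordered."""
    train, valid_item, test_item = list(history[:-2]), history[-2], history[-1]
    if shuffle:
        # Deterministic shuffle under the run-level seed (an assumption; the
        # paper only states that shuffled and unshuffled runs share seeds).
        random.Random(seed).shuffle(train)
    return train, valid_item, test_item

# Hypothetical usage for one user with at least 5 interactions:
# train, valid_item, test_item = split_user_sequence(user_history, shuffle=True, seed=0)
```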
In Table 3 we report the results of our experiments as well as the reported results from Kang and McAuley [1]. Although all differences between shuffled and unshuffled results are statistically significant (except in the case of Steam, where shuffling had no effect on performance), the differences are not qualitatively substantial outside of ML-1M, where the ranges of shuffled and unshuffled are non-overlapping (and Beauty, though there the difference favors shuffled and will be explored briefly later). From a replication standpoint our unshuffled results align well with the SASRec reported scores.⁵ The focus of our subsequent analysis is on the difference in performance between the shuffled and unshuffled cases.

Table 3: Mean scores over ten random initializations. All runs use the same parameters as in [1] and were run for 200 epochs.

| Dataset | Metric | SASRec | Unshuffled | Shuffled | Delta | Percentage |
|---|---|---|---|---|---|---|
| ML-1M | HR@10 | 0.825 | 0.814 | 0.681 | -0.133 | -16.3% |
| ML-1M | NDCG@10 | 0.591 | 0.583 | 0.408 | -0.175 | -30.0% |
| Video Games 2014 | HR@10 | 0.741 | 0.729 | 0.715 | -0.014 | -2.0% |
| Video Games 2014 | NDCG@10 | 0.536 | 0.506 | 0.486 | -0.020 | -4.0% |
| Video Games 2018 | HR@10 | - | 0.710 | 0.705 | -0.005 | -0.7% |
| Video Games 2018 | NDCG@10 | - | 0.494 | 0.482 | -0.012 | -2.5% |
| Steam | HR@10 | 0.873 | 0.836 | 0.839 | 0.003 | 0.4% |
| Steam | NDCG@10 | 0.631 | 0.587 | 0.586 | -0.001 | -0.2% |
| Beauty | HR@10 | 0.485 | 0.484 | 0.508 | 0.024 | 5.0% |
| Beauty | NDCG@10 | 0.322 | 0.325 | 0.332 | 0.007 | 2.0% |

The largest such difference on both HR@10 and NDCG@10 occurs for ML-1M. We propose that this difference is caused by the shuffling process destroying most of the detectable signal provided by the internal recommendation engine used in the construction of the MovieLens dataset. Notably, shuffling has little effect on the performance of the other datasets. One factor that may explain how robust SASRec is to shuffling of the input sequences is that recommendations are made by looking at few items in the history due to the attentional mechanism [1]. Whether this robustness to input-sequence shuffling is a general property of sequential recommenders or is specific to attention-based sequential recommenders exclusively is an interesting question and one we leave for future work. Another factor may be in the dataset construction itself, given that the other datasets come from processes that better proxy a realistic interaction scenario than MovieLens.

Although small, the differences in NDCG@10 for Video 2014, Video 2018 and Beauty are all statistically significant. While shuffling negatively impacts the performance for the two Video datasets, interestingly, shuffling seems to improve the performance on Beauty. Statistically, Beauty is not too dissimilar to Video 2018, as shown in Section 3. We propose that the difference in response to shuffling may be due to domain-specific differences in how users interact with items. Analysis of the loss characteristics for these runs suggests that shuffling the Beauty dataset improves generalization of the model. Learning is slower but eventually crosses the plateau reached by the model on the unshuffled data.

⁵ There is a noticeable variation on the Steam dataset. We suspect this is due to differences in pre-processing.

4.3. Rating Prediction Experiments

For this experiment we modify SASRec to predict the ratings of the held-out items rather than predicting the items themselves. Our goal is to remove the impact of the internal recommendation engine in ML-1M by limiting the scope of the model to only items that users interacted with. This separation of ratings from item identification allows for a fairer comparison between the shuffled and unshuffled cases. That is, by asking the model to rate items rather than suggest items, we remove the benefit of being able to predict which items the internal recommendation engine would have suggested.⁶ We modify the loss function of SASRec to have two parts: a mean squared error component that penalizes distance from the true rating and a cross-entropy component that penalizes incorrect labels.

⁶ It has been noted that rating prediction is a poor measure for recommendation algorithm evaluation [44]. We nonetheless find it useful for teasing apart our confounded data-sequence.
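A hedged TensorFlow 2 sketch of such a two-part objective is shown below; the weighting between the two terms, the tensor shapes and the label definition are our assumptions rather than the authors' exact implementation:

```python
import tensorflow as tf

def rating_prediction_loss(true_ratings: tf.Tensor, predicted_ratings: tf.Tensor,
                           true_labels: tf.Tensor, label_logits: tf.Tensor,
                           alpha: float = 1.0) -> tf.Tensor:
    """Two-part objective as described in Section 4.3: a mean squared error
    term on the predicted rating plus a cross-entropy term on the labels."""
    mse = tf.reduce_mean(tf.square(true_ratings - predicted_ratings))
    cross_entropy = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=true_labels, logits=label_logits))
    # The equal weighting (alpha = 1.0) is our assumption.
    return mse + alpha * cross_entropy
```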
Table 4 shows the results of shuffled and unshuffled runs on ML-1M for the rating-prediction version of SASRec. We present accuracy and root mean squared error (RMSE) for two values of holdout N, the number of items held out for test and validation.

Table 4: Results of the rating prediction experiment. Delta is the difference between shuffled and unshuffled. The relative performance drop reframes delta as a percentage of the unshuffled score for a given metric.

| Holdout N | Metric | Unshuffled | Shuffled | Delta | Relative Performance Drop |
|---|---|---|---|---|---|
| 3 | Accuracy | 0.416 | 0.411 | 0.005 | -1.2% |
| 3 | RMSE | 1.156 | 1.168 | -0.012 | -1% |
| 1 | Accuracy | 0.421 | 0.412 | 0.009 | -2.1% |
| 1 | RMSE | 1.141 | 1.167 | -0.026 | -2.3% |

By changing the evaluation method to one that does not depend on scores for items with which the user did not interact, we are able to close the performance gap considerably, from tens of percent to single digits. These results provide further evidence that the difference in performance between shuffled and unshuffled on the ML-1M dataset, shown in Section 4.2, is due to a sequential bias that constrains the possibility space of items that can appear later in a user's sequence. That being the case, this suggests that a model's ability to utilize the order information in ML-1M does not translate beyond the dataset, and thus ML-1M is an especially poor benchmark for sequential recommenders in general.

5. Discussion

We have shown that the MovieLens datasets, while important contributions and useful benchmarks for recommendation in general, are inappropriate for use in the subfield of sequential recommendation. Of note is that, while the presence of a sequential signal in the data makes MovieLens an effective benchmark for sequential pattern recognition, the detachment of that signal from real-world watching habits makes it unsuitable as a benchmark for sequential recommendation specifically. That is, any pattern embedded into a sequence of data would make a valid test case for sequential pattern recognition, but recommendation, while related, has different constraints and more specific goals. Facts pertaining to these issues were acknowledged in the original release of the datasets, but seem to be often overlooked and/or forgotten in the field.

We performed a rigorous analysis on two of the MovieLens datasets and compared them to other benchmark datasets to further elucidate the inherent issues therein. Though the MovieLens datasets clearly stood out, the other datasets all contain significant amounts of ratings with indistinct timestamps for users as well, suggesting that ordering issues may be generally pervasive across benchmarks in the field.

To directly examine the impact of ordering information in these datasets for sequential recommendation, we conducted two experiments. The first explicitly destroyed any sequential information in the datasets by randomly shuffling the training sequence. The difference between shuffled and unshuffled was found to be small in most cases, with the notable exception being ML-1M. Shuffling caused a large drop in performance for ML-1M, which we propose is a symptom of the dataset's construction, namely the contribution to a sequential signal introduced by the internal recommendation engine. Our second experiment aimed to remove the impact of this internal recommendation system by attempting to predict the rating given by users for items they interacted with. We showed that doing this greatly reduced the performance gap between shuffled and unshuffled, providing further evidence that a large portion of the performance of SASRec on ML-1M comes from modelling the internal recommendation engine of the MovieLens system and not genuine sequential information.

Assessing the value of offline testing requires a precise understanding of the differences between benchmarks and the real-world scenarios we aim to emulate. What constitutes a true sequence looks different depending on the domain. For example, you may purchase multiple beauty items at the same time but you are unlikely to watch more than one movie at once.
Furthermore, when the timestamp and ordering information is distanced from a natural interaction sequence, the ability of sequential recommendation models to generalize is called into question. Our evaluation does not speak to the performance of such algorithms, rather that the true performance may be obscured by a lack of appropriate datasets. 6. Conclusion Our work highlights the importance of further scrutiny for dataset creation methodologies when using those datasets as benchmarks for tasks beyond their initial scope. Specifically, we find there is a necessity for further exploration and creation of new datasets for evaluating sequential recommendation, with special attention paid to temporal tagging and order information. Acknowledgments References [1] W. Kang, J. McAuley, Self-attentive sequential recommendation, in: 2018 IEEE International Conference on Data Mining (ICDM), IEEE Computer Society, Los Alamitos, CA, USA, 2018, pp. 197–206. URL: https://doi.ieeecomputersociety.org/10.1109/ICDM.2018.00035. doi:10.1109/ICDM.2018.00035. [2] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 1441–1450. URL: https://doi.org/10.1145/3357384.3357895. doi:10.1145/3357384.3357895. [3] S. Rendle, C. Freudenthaler, L. Schmidt-Thieme, Factorizing personalized markov chains for next-basket recommendation, in: Proceedings of the 19th International Conference on World Wide Web, WWW ’10, Association for Computing Machinery, New York, NY, USA, 2010, p. 811–820. URL: https://doi.org/10.1145/1772690.1772773. doi:10.1145/1772690. 1772773. [4] R. He, J. McAuley, Fusing similarity models with markov chains for sparse sequential 12 Daniel Woolridge et al. CEUR Workshop Proceedings 1–16 recommendation, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, IEEE, Barcelona, Spain, 2016, pp. 191–200. [5] A. Torralba, A. A. Efros, Unbiased look at dataset bias, in: CVPR 2011, IEEE, Colorado Springs, CO, USA, 2011, pp. 1521–1528. doi:10.1109/CVPR.2011.5995347. [6] T. Tommasi, N. Patricia, B. Caputo, T. Tuytelaars, A Deeper Look at Dataset Bias, Springer International Publishing, Cham, 2017, pp. 37–55. URL: https://doi.org/10.1007/ 978-3-319-58347-1_2. doi:10.1007/978-3-319-58347-1_2. [7] H. He, S. Zha, H. Wang, Unlearn dataset bias in natural language inference by fitting the residual, in: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 132–142. URL: https://www.aclweb.org/anthology/D19-6115. doi:10. 18653/v1/D19-6115. [8] T. Sun, A. Gaut, S. Tang, Y. Huang, M. ElSherief, J. Zhao, D. Mirza, E. Belding, K.-W. Chang, W. Y. Wang, Mitigating gender bias in natural language processing: Literature review, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1630–1640. URL: https://www.aclweb.org/anthology/P19-1159. doi:10.18653/v1/P19-1159. [9] C. G. Northcutt, A. Athalye, J. Mueller, Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, 2021. arXiv:2103.14749. [10] R. Li, J. S. Johansen, H. Ahmed, T. V. Ilyevsky, R. B. Wilbur, H. M. Bharadwaj, J. M. 
Siskind, The perils and pitfalls of block design for eeg classification experiments, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2021) 316–333. doi:10.1109/TPAMI. 2020.2973153. [11] P. Kouki, I. Fountalis, N. Vasiloglou, X. Cui, E. Liberty, K. Al Jadda, From the lab to production: A case study of session-based recommendations in the home-improvement domain, in: Fourteenth ACM Conference on Recommender Systems, RecSys ’20, As- sociation for Computing Machinery, New York, NY, USA, 2020, p. 140–149. URL: https: //doi.org/10.1145/3383313.3412235. doi:10.1145/3383313.3412235. [12] A. Gruson, P. Chandar, C. Charbuillet, J. McInerney, S. Hansen, D. Tardieu, B. Carterette, Offline evaluation to make decisions about playlistrecommendation algorithms, in: Pro- ceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 420–428. URL: https://doi.org/10.1145/3289600.3291027. doi:10.1145/3289600.3291027. [13] M. Quadrana, P. Cremonesi, D. Jannach, Sequence-aware recommender systems, ACM Computing Surveys (CSUR) 51 (2018) 1–36. [14] P. G. Campos, F. Díez, I. Cantador, Time-aware recommender systems: a comprehensive survey and analysis of existing evaluation protocols, User Modeling and User-Adapted Interaction 24 (2014) 67–119. [15] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommenda- tion algorithms, in: Proceedings of the 10th International Conference on World Wide Web, WWW ’01, Association for Computing Machinery, New York, NY, USA, 2001, p. 285–295. URL: https://doi.org/10.1145/371920.372071. doi:10.1145/371920.372071. [16] B. Mobasher, H. Dai, T. Luo, M. Nakagawa, Using sequential and non-sequential patterns in predictive web usage mining tasks, in: 2002 IEEE International Conference on Data 13 Daniel Woolridge et al. CEUR Workshop Proceedings 1–16 Mining, 2002. Proceedings., IEEE, IEEE, Maebashi City, Japan, 2002, pp. 669–672. [17] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with recurrent neural networks, 2015. [18] M. Quadrana, A. Karatzoglou, B. Hidasi, P. Cremonesi, Personalizing session-based recom- mendations with hierarchical recurrent neural networks, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys ’17, Association for Computing Machinery, New York, NY, USA, 2017, p. 130–137. URL: https://doi.org/10.1145/3109859. 3109896. doi:10.1145/3109859.3109896. [19] Y. Zhu, H. Li, Y. Liao, B. Wang, Z. Guan, H. Liu, D. Cai, What to do next: Modeling user behaviors by time-lstm, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, AAAI Press, Palo Alto, California, 2017, p. 3602–3608. [20] P. Covington, J. Adams, E. Sargin, Deep neural networks for youtube recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, Association for Computing Machinery, New York, NY, USA, 2016, p. 191–198. URL: https: //doi.org/10.1145/2959100.2959190. doi:10.1145/2959100.2959190. [21] H. Ying, F. Zhuang, F. Zhang, Y. Liu, G. Xu, X. Xie, H. Xiong, J. Wu, Sequential recommender system based on hierarchical attention network, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, AAAI Press, Palo Alto, California, 2018, p. 3926–3932. [22] N. Ferro, D. Kelly, Sigir initiative to implement acm artifact review and badging, SI- GIR Forum 52 (2018) 4–10. 
URL: https://doi.org/10.1145/3274784.3274786. doi:10.1145/ 3274784.3274786. [23] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, K. Craw- ford, Datasheets for datasets, 2018. [24] S. Holland, A. Hosny, S. Newman, J. Joseph, K. Chmielinski, The dataset nutrition label: A framework to drive higher data quality standards, 2018. [25] E. M. Bender, B. Friedman, Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science, Transactions of the Association for Computational Linguistics 6 (2018) 587–604. URL: https://doi.org/10.1162/tacl_a_00041. doi:10.1162/tacl_a_00041. [26] S. Rendle, L. Zhang, Y. Koren, On the difficulty of evaluating baselines: A study on recommender systems, 2019. [27] M. D. Ekstrand, Lenskit for python: Next-generation software for recommender systems experiments, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 2999–3006. URL: https://doi.org/10.1145/3340531.3412778. doi:10.1145/ 3340531.3412778. [28] Z. Gantner, S. Rendle, C. Freudenthaler, L. Schmidt-Thieme, Mymedialite: A free rec- ommender system library, in: Proceedings of the Fifth ACM Conference on Recom- mender Systems, RecSys ’11, Association for Computing Machinery, New York, NY, USA, 2011, p. 305–308. URL: https://doi.org/10.1145/2043932.2043989. doi:10.1145/2043932. 2043989. [29] Z. Sun, D. Yu, H. Fang, J. Yang, X. Qu, J. Zhang, C. Geng, Are we evaluating rigorously? benchmarking recommendation for reproducible evaluation and fair comparison, in: 14 Daniel Woolridge et al. CEUR Workshop Proceedings 1–16 Fourteenth ACM Conference on Recommender Systems, RecSys ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 23–32. URL: https://doi.org/10.1145/ 3383313.3412489. doi:10.1145/3383313.3412489. [30] W. X. Zhao, S. Mu, Y. Hou, Z. Lin, K. Li, Y. Chen, Y. Lu, H. Wang, C. Tian, X. Pan, Y. Min, Z. Feng, X. Fan, X. Chen, P. Wang, W. Ji, Y. Li, X. Wang, J.-R. Wen, Recbole: Towards a unified, comprehensive and efficient framework for recommendation algorithms, 2020. [31] V. W. Anelli, A. Bellogín, A. Ferrara, D. Malitesta, F. A. Merra, C. Pomo, F. M. Donini, T. Di Noia, Elliot: a comprehensive and rigorous framework for reproducible recommender systems evaluation, 2021. arXiv:2103.02590. [32] A. Said, A. Bellogín, Comparative recommender system evaluation: Benchmarking recommendation frameworks, in: Proceedings of the 8th ACM Conference on Recom- mender Systems, RecSys ’14, Association for Computing Machinery, New York, NY, USA, 2014, p. 129–136. URL: https://doi.org/10.1145/2645710.2645746. doi:10.1145/2645710. 2645746. [33] Y. Ji, A. Sun, J. Zhang, C. Li, A critical study on data leakage in recommender system offline evaluation, 2021. arXiv:2010.11060. [34] F. M. Harper, J. A. Konstan, The movielens datasets: History and context, Acm transactions on interactive intelligent systems (tiis) 5 (2015) 1–19. [35] J. McAuley, C. Targett, Q. Shi, A. van den Hengel, Image-based recommendations on styles and substitutes, in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, Association for Computing Machinery, New York, NY, USA, 2015, p. 43–52. URL: https://doi.org/10.1145/2766462. 2767755. doi:10.1145/2766462.2767755. [36] J. Ni, J. Li, J. 
McAuley, Justifying recommendations using distantly-labeled reviews and fine-grained aspects, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 188–197. URL: https://www.aclweb.org/anthology/D19-1018. doi:10. 18653/v1/D19-1018. [37] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for col- laborative filtering, in: Proceedings of the 2018 World Wide Web Conference, WWW ’18, International World Wide Web Conferences Steering Committee, Republic and Can- ton of Geneva, CHE, 2018, p. 689–698. URL: https://doi.org/10.1145/3178876.3186150. doi:10.1145/3178876.3186150. [38] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua, Neural collaborative filtering, in: Proceedings of the 26th International Conference on World Wide Web, WWW ’17, International World Wide Web Conferences Steering Committee, Republic and Can- ton of Geneva, CHE, 2017, p. 173–182. URL: https://doi.org/10.1145/3038912.3052569. doi:10.1145/3038912.3052569. [39] Q. Liu, S. Wu, D. Wang, Z. Li, L. Wang, Context-aware sequential recommendation, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, Barcelona, Spain, 2016, pp. 1053–1058. doi:10.1109/ICDM.2016.0135. [40] A. Yan, S. Cheng, W.-C. Kang, M. Wan, J. McAuley, Cosrec: 2d convolutional neural networks for sequential recommendation, in: Proceedings of the 28th ACM International 15 Daniel Woolridge et al. CEUR Workshop Proceedings 1–16 Conference on Information and Knowledge Management, CIKM ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 2173–2176. URL: https://doi.org/10. 1145/3357384.3358113. doi:10.1145/3357384.3358113. [41] P. Zhao, T. Shui, Y. Zhang, K. Xiao, K. Bian, Adversarial oracular seq2seq learning for sequential recommendation, in: C. Bessiere (Ed.), Proceedings of the Twenty-Ninth Interna- tional Joint Conference on Artificial Intelligence, IJCAI-20, International Joint Conferences on Artificial Intelligence Organization, San Francisco, CA, USA, 2020, pp. 1905–1911. URL: https://doi.org/10.24963/ijcai.2020/264. doi:10.24963/ijcai.2020/264, main track. [42] M. Ji, W. Joo, K. Song, Y.-Y. Kim, I.-C. Moon, Sequential recommendation with relation- aware kernelized self-attention, Proceedings of the AAAI Conference on Artificial Intelli- gence 34 (2020) 4304–4311. URL: https://ojs.aaai.org/index.php/AAAI/article/view/5854. doi:10.1609/aaai.v34i04.5854. [43] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of ir techniques, ACM Transactions on Information Systems (TOIS) 20 (2002) 422–446. [44] S. M. McNee, J. Riedl, J. A. Konstan, Being accurate is not enough: How accuracy metrics have hurt recommender systems, in: CHI ’06 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’06, Association for Computing Machinery, New York, NY, USA, 2006, p. 1097–1101. URL: https://doi.org/10.1145/1125451.1125659. doi:10.1145/ 1125451.1125659. A. Appendices - Additional Figures A.1. Indistinct Sequences Figure A.1 (a) shows that the user sequences in ML-1M are particularly contaminated with large regions of indistinct interactions. These regions represent areas where the actual interaction order is unknowable, and thus unusable by sequential recommenders. While less pronounced, this issue persists for ML-25M. A.2. 
Per Second Data

The timestamps associated with interactions in the MovieLens datasets are distinct to the level of seconds. Figure A.2 shows both the percentages of zero-second intervals in the ML-1M (a) and ML-25M (b) datasets as well as the cumulative percentage of users plotted against the number of interactions with unique timestamps. Figure A.2 (c) shows that over two-thirds of the users in both datasets have interaction histories containing fewer than 100 unique timestamps. Only 2 of the 6,040 users in ML-1M have all their interactions at distinct timestamps. The average interaction history contains 165.5 and 153.5 interactions for ML-1M and ML-25M respectively.

A.3. Shuffled Experiment Visualization

Figure A.3 shows the performance difference between shuffled and unshuffled runs. Figure A.3 (a) shows a significantly larger difference than the others, illustrating the presence of a strong sequential signal in ML-1M.

[Figure A.1: The last 20 interactions for 10 randomly selected users from each of the six datasets: (a) ML-1M, (b) ML-25M, (c) Steam, (d) Video 2014, (e) Video 2018, (f) Beauty. Each row represents a single user's last 20 interactions and each box represents a single interaction. All interactions with a unique timestamp are presented in blue. Interactions that share a timestamp with a neighboring interaction are colored one of two shades of grey; we use the two shades to separate neighboring chunks of same-timestamp interactions. Uncolored sections represent users with fewer than 20 total interactions.]

[Figure A.2: a, b: Percentage of intervals between consecutive interactions of length 0 seconds for ML-1M and ML-25M respectively. c: Cumulative percentage of users in each dataset whose interactions happened on x unique timestamps.]

[Figure A.3: NDCG@10 results of ten separate runs on each of the datasets: (a) ML-1M, (b) Video 2014, (c) Video 2018, (d) Steam, (e) Beauty. Points show the distribution of scores, with box plots showing the inter-quartile range and whiskers spanning the minimum and maximum values. Red points and box plots are shuffled results and blue represent unshuffled. Though overlapping, only the difference between shuffled and unshuffled for Steam is not statistically significant.]
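As a small illustration of how the indistinct regions in Figure A.1 could be identified, the following hedged Python sketch groups a user's most recent interactions into runs of identical timestamps (the function name and the choice of the last 20 interactions mirror the figure, but this is our reconstruction, not the authors' plotting code):

```python
from itertools import groupby
from typing import List

def indistinct_runs(timestamps: List[int], last_n: int = 20) -> List[int]:
    """Group a user's last `last_n` interactions into runs of identical
    timestamps, as visualized in Figure A.1. Runs longer than one cover
    interactions whose true order is unknowable."""
    tail = sorted(timestamps)[-last_n:]
    return [len(list(group)) for _, group in groupby(tail)]

# Example: three interactions at the same second form one indistinct run.
# indistinct_runs([10, 10, 10, 25, 99])  ->  [3, 1, 1]
```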