<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM Conference on Recommender Systems, Amsterdam, The Netherlands
dan@vody.com (D. Woolridge); sean@vody.com (S. Wilner); madeliene@vody.com (M. Glick)
https://www.vody.com (D. Woolridge)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Sequence or Pseudo-Sequence? An Analysis of Sequential Recommendation Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Woolridge</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sean Wilner</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Madeleine Glick</string-name>
        </contrib>
        <aff>Vody LLC, Los Angeles</aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Sequential recommendation aims to model a user's preferences by looking at the order of interactions in a user's history. The evaluation of such algorithms requires robust datasets with genuine sequential information. In this work we analyze the timestamp information of several commonly used datasets and show that reported timestamps are not indicative of meaningful sequential order. In the datasets explored, significant numbers of users have interactions occurring at identical timestamps. The actual order of these interactions is therefore unknowable; the interaction history is pseudo-sequential. We find that randomly shuffling the order of interactions has minimal impact on the performance of a leading sequential recommender. Particular attention is paid to MovieLens because of its frequency of use in the field of sequential recommendation. Our findings motivate the need for new datasets with more meaningful ordering for the evaluation of sequential recommenders.</p>
      </abstract>
      <kwd-group>
        <kwd>Datasets</kwd>
        <kwd>Recommendation</kwd>
        <kwd>Sequential Recommendation</kwd>
        <kwd>MovieLens</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The ubiquity of and necessity for recommendation systems have given rise to an explosion of
research in the field. Recommendation algorithms are studied and implemented for use in
domains spanning media, e-commerce, social networks and more. The landscape of
recommendation research consists of a plethora of techniques, most of which attempt to model the
interactions between users and items in order to predict the items with which a user is most
likely to interact. Sequential recommendation, an increasingly popular trend in the field, works
by taking into account not only the users’ interaction history but the order of those interactions
as well. The goal of a sequential recommender is to utilize and exploit sequential patterns in
historical user behavior [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ].
      </p>
      <p>
        As with any burgeoning field, the ability to accurately benchmark and compare results across
various datasets, models, and metrics is essential. Benchmarking the relative performance of
various recommendation algorithms on publicly available datasets is a core part of the research
and development of such systems. It is therefore of utmost importance to ensure the validity
of various benchmark datasets for accurately conducting research. Throughout the machine
learning community emphasis is being increasingly placed on verifying all aspects of the datasets
used to benchmark models. Recent work on biases in image datasets [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], language datasets
[
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], and the models trained on them focuses primarily on the ethical and societal impacts of these
biases and how their presence and lack of mitigation can affect the wider world. Others have
looked at the frequency of label errors in various commonly used datasets spanning multiple
domains [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Misunderstanding and misuse of input data can lead to erroneous conclusions that
become tainted seeds for subsequent works. One such example, brought to attention by Li et al.
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], shows how experimental procedure can imbue datasets with non-signal correlations that
models will pick up on and utilize. In an analogous manner, our work analyzes time-sequence
data in various recommendation datasets and calls into question their validity for assessing the
performance of sequential recommendation algorithms.
      </p>
      <p>
        There are two main approaches to assessing the performance of a recommendation system:
online and offline tests. In the case of online testing, researchers are able to assess the
performance of two or more systems by using those systems to provide recommendations for different
user segments [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Often, researchers do not have access to online tests for recommendation
and thus have to rely on offline tests [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. A core assumption for offline tests is that the dataset
on which they are conducted accurately reflects a real-world recommendation scenario, and if
not, that the differences are established and understood.
      </p>
      <p>In this work, we begin by exploring the timestamp information in several popular datasets
used for evaluating sequential recommendation algorithms. We find that for some of these
datasets the mere presence of timestamp information is not indicative of task-relevant sequential
information. We discuss the construction of the datasets and explore issues therein. Additionally,
we show the impact of these issues by conducting experiments with a popular sequential
recommendation algorithm. Our findings call into question the validity of using these datasets
for the evaluation of sequential recommenders.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Sequential Recommendation</title>
        <p>The task of sequential recommendation is to provide relevant items to users by using the users’
historical interaction data and exploiting patterns between subsequent items. There has been
much work done in this area, which we briefly summarize below (see footnote 1).</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Baselines and Earlier Models</title>
          <p>
            Simple baselines are often used to compare against more sophisticated methods. One frequently
used baseline is to rank items by their popularity and provide these as recommendations for
each user. Another simple yet performant baseline is to use ItemKNN, a K-Nearest Neighbors
approach on the items [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ].
          </p>
          <p>Footnote 1: We refer to Quadrana et al. [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] and Campos et al. [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] for deeper studies on the subject.</p>
          <p>
            One particularly successful class of approaches has been to explore historical sequences using
K-th order Markov models to model stochastic transitions between items by utilizing sequential
patterns [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. Rendle et al. [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] have combined Markov Chains (MC) with Matrix Factorization
in their work Factorized Personalized Markov Chains which, although promising, struggles to
deal with sparsity issues. As an attempt to address this problem, He and McAuley [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] introduce
Fossil, a model that fuses item similarity models with Markov Chain models.
          </p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Recurrent Models</title>
          <p>
            Another class of approaches uses Recurrent Neural Networks (RNNs) and their extensions to
tackle the problem of sequential recommendation. Hidasi et al. [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] use an RNN architecture for
session recommendation. Quadrana et al. [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] extend this work by incorporating user interests
via a Gated Recurrent Unit (GRU) layer across user sessions. Zhu et al. [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] use a time-interval-aware
Long Short-Term Memory (LSTM) model in an attempt to better capture both long- and
short-term interactions in a user’s history. A key drawback of these models is that they require
large amounts of dense data to perform well.
          </p>
        </sec>
        <sec id="sec-2-1-3">
          <title>2.1.3. Attentional Models</title>
          <p>
            With the success of attentional methods and transformers in multiple domains with sequential
information, it is only natural that they would be applied to the problem of recommendation.
Kang and McAuley [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] present SASRec, a self-attention model that is able both to capture long-term
semantics and to provide recommendations based on a few salient actions. The success of
this model has prompted many derivatives and extensions. One such extension, BERT4Rec, has
been developed by Sun et al. [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. Inspired by the popular language model BERT, BERT4Rec uses
bidirectional self-attention to better model users’ behavior sequences and deal with potentially
noisy input sequences [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. Ying et al. [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] introduce a two-level hierarchical attention network in
an attempt to better capture the long- and short-term interests of users.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluation and Benchmarking</title>
        <p>
          The lack of effective benchmarks for evaluation of recommendation algorithms is a critical issue
facing the community. Although there have been recent dedicated works dissecting vision,
language and audio datasets [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and the effect of errors therein on benchmark validity, the
field of recommendation still lags behind in this area. Conference tracks are being dedicated to
datasets and benchmarks (see footnotes 2 and 3), and efforts are being made to properly review and badge artifacts
[
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Various attempts to standardize dataset versioning have been proposed [
          <xref ref-type="bibr" rid="ref23 ref24 ref25">23, 24, 25</xref>
          ] but
have yet to be widely adopted.
        </p>
        <p>
          Evaluating recommendation algorithms is difficult [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], and despite many attempts to
standardize frameworks [
          <xref ref-type="bibr" rid="ref27 ref28 ref29 ref30 ref31">27, 28, 29, 30, 31</xref>
          ] the field as a whole still lacks the consistency desired.
Said and Bellogín [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] explore four aspects of recommendation contributions pertaining to their
reproducibility: the dataset, the evaluation framework, data details, and algorithmic details.
Within this structure, our work focuses primarily on the dataset and data details aspects.
        </p>
        <p>Footnote 2: https://neurips.cc/Conferences/2021/CallForDatasetsBenchmarks</p>
        <p>Footnote 3: https://recsys.acm.org/recsys21/perspectives/</p>
        <p>
          A recent and vitally important publication by Rendle et al. [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] shows how several baselines
can be tuned to outperform reported results, thereby calling into question many results from the
previous years. Sun et al. [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] provide an extensive look at many prominent recommendation
contributions. As shown in their Figure 1B, MovieLens-1M is the most frequently used dataset
for evaluating recommendation algorithms, appearing in just over 30% of the 85 papers studied.
The other MovieLens datasets explored (100K, 10M, and 20M) appear less frequently but all are
in the top 15 datasets by popularity. In Figure 5A of [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], results are shown on baseline models
using the MovieLens dataset split randomly or split by timestamp. The authors claim that the
time-aware split better simulates the real recommendation scenario, which may be true if the
timestamps represent a realistic interaction sequence.
        </p>
        <p>
          Gruson et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] discuss the challenges of offline recommendation evaluation and specifically
point out that some datasets include biases introduced in their construction, whether through
the user interface, internal recommendation algorithm or otherwise. Ji et al. [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] raise concerns
with offline evaluations on datasets which ignore the global timeline of interaction sequences in
the data. When datasets are collected over many years some items may not be available for the
entire duration of the data collection, thereby introducing biases that should be accounted for
in evaluation. In a similar vein, our work takes a deeper look at the local timestamp information
present in some leading recommendation datasets.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>Offline evaluation of recommendation systems requires robust datasets that are applicable to
the task under consideration. For sequential recommendation this means that datasets should
contain genuine order information that serves as a close proxy to the real-world scenario
being simulated. Our primary focus in this work is on the timestamp information that most
recommendation datasets include and that is often used to infer the order of interactions.</p>
      <p>
        The timestamps for the datasets we explore are provided either in a date format or in Unix
time. When presented in Unix time, meaningful relations are obscured as the scale between
timestamps is less apparent to a visual inspection. To understand the nature of timestamps in
the field we explore six established recommendation datasets:
• MovieLens 1M and 25M: Our primary focus and one of the most widely used datasets
for benchmarking recommendation performance. We explore both the ML-1M and
ML-25M versions [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]. In these datasets the interactions represent a user rating a movie and
the timestamps indicate when the rating was submitted to the nearest second.
• Amazon Beauty: A dataset with reviews of beauty products from Amazon.com
introduced in [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]. The interactions are reviews of the products and the timestamps represent
the date of review.
• Amazon Video Games 2014 and 2018: Datasets of video game product reviews. As
both were available, we look at the 2014 version introduced in McAuley et al. [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] and
2018 version introduced in Ni et al. [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. As with Amazon Beauty, the interactions in this
dataset represent a user reviewing an item and the timestamp is the date of the review.
• Steam: Introduced in Kang and McAuley [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], this dataset captures interactions between
users and video games on Steam. The interactions are reviews of the video games, and
the timestamps are the date of the review.
      </p>
      <p>[Table 1. Columns: Avg. Interactions Per User; Avg. Interactions Per Item; Interactions.]</p>
      <p>
        For all datasets we apply the same pre-processing steps as detailed in Kang and McAuley [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and Sun et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] wherein we remove duplicate interactions and keep only users and items with
at least 5 interactions. All datasets include timestamp data, and all datasets have a maximum
resolution of days except for the two MovieLens datasets whose maximum resolution is at the
level of seconds.
      </p>
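      <p>The pre-processing described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function name filter_min_interactions, the (user, item, timestamp) tuple layout, the choice to keep the earliest copy of a duplicate pair, and the optional repeat flag are all assumptions, since the text does not say whether the threshold is applied once or repeatedly.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def filter_min_interactions(interactions, k=5, repeat=False):
    """Deduplicate (user, item) pairs, then keep users and items with >= k interactions.

    `interactions` is an iterable of (user, item, timestamp) tuples (assumed layout).
    With repeat=True the filter is re-applied until no row is dropped (assumption).
    """
    # Keep the earliest record of every (user, item) pair.
    first_seen = {}
    for user, item, ts in sorted(interactions, key=lambda row: row[2]):
        first_seen.setdefault((user, item), (user, item, ts))
    rows = list(first_seen.values())

    while True:
        user_counts = Counter(user for user, _item, _ts in rows)
        item_counts = Counter(item for _user, item, _ts in rows)
        kept = [
            (user, item, ts)
            for user, item, ts in rows
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if not repeat or len(kept) == len(rows):
            return kept
        rows = kept
```
]]></preformat>
      <p>A single pass can leave users or items that fall below the threshold once other rows are removed, which is why the sketch exposes the repeat option.</p>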
      <p>As seen in Table 1, ML-1M is the most dense of the datasets explored with the three Amazon
datasets being the most sparse. ML-1M also has much longer interaction histories on average.
This makes it of particular interest to researchers looking to explore longer range dynamics in
user sequences for recommendation. However, as we show in the following sections, the use of
MovieLens is especially problematic for evaluating sequentially driven methods.</p>
      <p>3.1. Timestamps</p>
      <p>In all datasets, each user history exists as a timestamp-ordered sequence of interactions with
items. In this and the following sections we show that although the datasets include timestamps,
this does not necessarily mean that these timestamps convey interaction order.</p>
      <p>One way to clearly see the lack of true ordering in the ML-1M dataset is to look at the number
of interactions for each user that happen at unique timestamps. Figure 1 shows the difference
between the MovieLens datasets and others when viewed via the lens of interactions on unique
days. For both MovieLens datasets the majority of users (59% and 56.4% for ML-1M
and 25M respectively) have all their interactions occurring on a single date. When taken in
conjunction with an average sequence length of over 150, these facts make it clear that the
MovieLens datasets are not representative of a realistic sequence of movie watching. Steam is
the only dataset with no users whose interactions are all on one day, and although there are
single-day interaction sequences in the other datasets, these users represent a much smaller
percentage of the total users than they do in the MovieLens datasets.</p>
      <p>The timestamp information in the MovieLens datasets is at the level of seconds, not days.
Naturally then, it may be the case that although all interactions for a user occur on one or
two days their ordering still provides some estimate of a genuine interaction history. Ideally,
all users would have fully distinct interaction histories and therefore discernible order would
exist between the interactions. However, this is not the case in the MovieLens datasets:
0-second intervals account for 53.2% of timestamp intervals in ML-1M (17.4% in ML-25M).
Further details are provided in Appendices A.1 &amp; A.2. This further highlights the importance of
dataset inspection and special care with regard to the use of timestamps for ordering
in recommendation.</p>
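      <p>The two checks discussed in this section can be sketched as below. The function names and the (user, item, timestamp) layout are hypothetical, timestamps are assumed to be Unix seconds, and calendar days are taken in UTC, which is an assumption because the text does not state which day boundary was used.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from datetime import datetime, timezone


def single_day_user_fraction(interactions):
    """Fraction of users whose interactions all fall on one UTC calendar date."""
    days = defaultdict(set)
    for user, _item, ts in interactions:
        days[user].add(datetime.fromtimestamp(ts, tz=timezone.utc).date())
    return sum(len(d) == 1 for d in days.values()) / len(days)


def zero_interval_fraction(interactions):
    """Fraction of consecutive within-user intervals that are exactly 0 seconds."""
    by_user = defaultdict(list)
    for user, _item, ts in interactions:
        by_user[user].append(ts)
    zero = total = 0
    for stamps in by_user.values():
        stamps.sort()
        for earlier, later in zip(stamps, stamps[1:]):
            total += 1
            zero += later == earlier
    return zero / total if total else 0.0
```
]]></preformat>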
      <sec id="sec-3-1">
        <title>3.2. Intervals</title>
        <p>In order to better understand the sequences in these datasets we look at each user’s sequence of
n interactions as a list of n − 1 consecutive intervals. For example, if a user interacted with item
A on 2018-01-02 and item B on 2018-01-03, the interval would be one day. Some key statistics of
this interval information are presented in Table 2. The mean-mode interval is the mean, taken over
the dataset, of each user’s modal interval in days or seconds. The mode-mode interval is the mode, taken over
the dataset, of each user’s modal interval in days or seconds. We define the unique interaction ratio for a dataset as
the number of interactions at distinct timestamps divided by the total number of interactions
for each user, averaged over the dataset. Of key importance is that if two interactions have an
interval of zero, then the order in which those interactions occurred is unknowable.</p>
        <p>[Table 2. Columns: Mean-Mode Interval (Days/Seconds); Mode-Mode Interval (Days/Seconds); Unique Interaction Ratio (Days/Seconds).]</p>
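        <p>One possible reading of the three interval statistics is sketched below. The names interval_stats and _mode are hypothetical, ties for the mode are broken toward the smallest value, and the unique interaction ratio is computed as distinct timestamps over total interactions per user; each of these is an assumption rather than a detail given in the text. Timestamps may be in days or seconds.</p>
        <preformat><![CDATA[
```python
from collections import Counter, defaultdict
from statistics import mean


def _mode(values):
    """Most common value; ties go to the smallest value (an assumption)."""
    counts = Counter(values)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)


def interval_stats(interactions):
    """Mean-mode interval, mode-mode interval and unique interaction ratio."""
    by_user = defaultdict(list)
    for user, _item, ts in interactions:
        by_user[user].append(ts)

    user_modes, unique_ratios = [], []
    for stamps in by_user.values():
        stamps.sort()
        unique_ratios.append(len(set(stamps)) / len(stamps))
        intervals = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        if intervals:  # a user with one interaction has no intervals
            user_modes.append(_mode(intervals))

    return {
        "mean_mode": mean(user_modes),
        "mode_mode": _mode(user_modes),
        "unique_interaction_ratio": mean(unique_ratios),
    }
```
]]></preformat>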
        <p>As can be seen in Table 2, the mean-mode interval for ML-1M is nearly 0 over both days
and seconds. This means that on average, users’ interaction sequences contain more interactions
happening at the same second as another interaction in the same user’s sequence than not.
Therefore the order of items in such a sequence is ill-defined and, given the time-scale under
consideration, divorced from a realistic viewing pattern. Also of note is that for all datasets
the mode-mode interval is 0 days or seconds. The significant number of indistinct regions of
interactions implied by this value means that interaction orders derived from all these datasets
are contaminated with some amount of noise.</p>
        <p>Figure 2 visualizes the percentage of intervals for each dataset that are of zero days in length.
While all datasets have large amounts of zero-day intervals, ranging from 19.4% for Steam to
98.5% for ML-1M, the MovieLens datasets stand out as severely divergent from a real-world
scenario. These distributions of timestamps do not reflect a natural interaction behavior where,
for example, a person is unlikely to watch more than two or three movies in a single day. In
fact, the 59% of users in the ML-1M dataset whose entire interaction sequence is on a single
day have a median sequence length of 62 interactions.</p>
        <p>A large percentage of interactions in MovieLens share timestamps with ‘adjacent’ interactions.
This behavior is problematic when using the ordering of these interactions as input information
for modelling. Given these overlaps, the ordering only partly exists, and in the case of ML-1M
mostly does not. See Appendix A.1 for further analysis.</p>
        <p>The data presented in Table 2 and Figure 2 raise questions about the validity of these
datasets as offline proxies for genuine sequential interaction data. By presenting Figures 1 and 2
as well as Table 2 we hope to convey that although the timestamp information in the MovieLens
datasets is particularly problematic, the other datasets all suffer from the same affliction to
lesser and varying degrees. We propose that the above analysis, or something analogous to it,
become standard procedure for datasets used to evaluate sequential
recommenders.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. MovieLens for Sequential Recommendation</title>
        <p>
          The MovieLens datasets are well established for evaluating recommendation engines [
          <xref ref-type="bibr" rid="ref37 ref38">37, 38</xref>
          ]
and continue to be used by many of the leading models as one of the benchmarks for sequential
recommendation [
          <xref ref-type="bibr" rid="ref1 ref2 ref39 ref40 ref41 ref42">39, 1, 2, 40, 41, 42</xref>
          ]. While the MovieLens datasets are incredibly valuable
for assessing general recommendation algorithms, we find that they are not good datasets for
assessing the performance of sequential recommendation. In fact, the originators of the datasets
raise evidentiary concerns regarding the value of timestamps in their 2015 paper [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ].
        </p>
        <p>An argument could be made that although the ordered interaction data present in the dataset
does not closely represent a real-world scenario, the sequences nonetheless do represent some
estimate of order of interest. This is, however, in conflict with our findings that over 50% of
interactions in user sequences from ML-1M share a timestamp with another interaction in the
same user sequence. For ML-25M this percentage is 17.4%, showing that despite improvements
to MovieLens, regions of ambiguous order still exist in a meaningful portion of the dataset.</p>
        <p>It may be true that some user interactions that share timestamps were added to the dataset in
such a way that the order of interaction was preserved. Following this, it may seem reasonable
to infer that they present a valid source of order. However, if we sort the datasets by timestamp,
as is often done during pre-processing, we are left to the whims of the sorting algorithm and how
it decides to order the equal-valued entries. This allows two researchers using
the same dataset to obtain different interaction orders for the same users. This problem could be
remedied by an unambiguous ‘interaction order’ field included alongside the timestamps at
construction.</p>
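        <p>The ambiguity can be illustrated with a small sketch. Python's built-in sort is stable, so items with equal timestamps keep whatever order they had on input, and two copies of the same data stored in a different row order yield different sequences. The function names and the explicit position field are hypothetical; no such field exists in the datasets discussed.</p>
        <preformat><![CDATA[
```python
def user_sequence(rows):
    """Order one user's (item, timestamp) rows by timestamp only.

    Rows with equal timestamps keep their input order (stable sort),
    so the result depends on how the file happened to be laid out.
    """
    return [item for item, _ts in sorted(rows, key=lambda row: row[1])]


def user_sequence_with_order(rows):
    """Order (item, timestamp, position) rows by timestamp, then position.

    `position` is a hypothetical explicit interaction-order field.
    """
    ordered = sorted(rows, key=lambda row: (row[1], row[2]))
    return [item for item, _ts, _position in ordered]
```
]]></preformat>
        <p>With only timestamps as the key, the rows (A, 100), (B, 100), (C, 200) and the same rows with A and B swapped produce two different sequences; with the additional position field both layouts produce the same one.</p>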
        <p>
          An additional factor mentioned by Harper and Konstan [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] is that the movies rated by users
in the MovieLens datasets are often prompted by an internal recommendation engine. This
means that the items users are served to rate depend on the items the user has previously
rated. For the larger MovieLens datasets (10M, 20M and 25M) the interface and underlying
recommendation engine have changed over the course of the dataset collection [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. The
presence of an internal recommendation engine introduces an implicit signal into a user’s
interaction sequence since it controls the scope of which items a user is served for rating at any
given position in the sequence.
        </p>
        <p>The examination of the datasets throughout this section points towards several issues with
their applicability as evaluation benchmarks for sequential recommendation. In the following
section we perform a set of experiments aimed to explore the breadth of impact of these
pseudo-sequences on a sequential recommender model.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We aim to explore the impact that the pervasive ordering pathologies of several common
recommendation datasets have, if any, on sequential recommendation performance. Specifically,
the goals of our experiments are to answer the following research questions: (1) Given the
questionable timestamp-informed sequences in the datasets, how much impact does shuffling this
information have on performance? (2) Does the explicit construction by internal recommendation
for MovieLens introduce a signal that sequential recommenders inadvertently exploit? (3) If we
adapt the experimental design to control for the conflated signal in MovieLens, what impact
does shuffling the data have on performance?</p>
      <sec id="sec-4-1">
        <title>4.1. Implementation Details</title>
        <p>
          We choose to use SASRec as the basis for our experiments because it is an established model
with a left-to-right architecture whose training paradigm predicts the next item for each position
in the sequence. We re-implement SASRec from the ground up in TensorFlow 2. For all datasets
used we pull fresh copies from the sources (see footnote 4). We omit ML-25M from our experiments due
to computational and time constraints, and leave this for future work. We adopt the same
hyperparameters for each dataset as in Kang and McAuley [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          Following the lead of Kang and McAuley [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], the metrics evaluated are Hit Rate (HR) and
Normalized Discounted Cumulative Gain (NDCG) [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ] at 10. Evaluation is done by taking 100
items from the dataset that the user did not interact with and calculating relevance scores for
these items. The relevance score for the held-out item, belonging to either the validation or the test
set, is then compared to and ranked with those of the negative items. These metrics assume that the
user has equal opportunity to pick any item in the dataset to interact with next. The MovieLens
rating interface and internal recommendation engine explicitly invalidate this assumption.
        </p>
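        <p>For a single held-out item ranked against sampled negatives, the two metrics reduce to simple expressions, sketched below. The function names are hypothetical, and counting ties against the held-out item is an assumption. Because there is exactly one relevant item, the ideal DCG is 1, so NDCG@k equals 1 / log2(rank + 2) for a 0-based rank below k.</p>
        <preformat><![CDATA[
```python
import math


def rank_of_target(target_score, negative_scores):
    """0-based rank of the held-out item among the sampled negatives.

    Ties are counted against the held-out item (a pessimistic assumption).
    """
    return sum(score >= target_score for score in negative_scores)


def hit_and_ndcg_at_k(rank, k=10):
    """HR@k and NDCG@k for one held-out item at the given 0-based rank."""
    if rank >= k:
        return 0.0, 0.0
    return 1.0, 1.0 / math.log2(rank + 2)
```
]]></preformat>
        <p>Dataset-level HR@10 and NDCG@10 would then be the averages of these per-user values.</p>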
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Shuffled vs. Unshuffled</title>
        <p>
          Our first experiment aims to determine the impact of explicitly randomizing the order of the
training items for each user. We follow the training paradigm in Kang and McAuley [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The
input for each user is the last n items in their history, where the second to last item is held
out for validation and the last item is held out for test evaluation. To avoid biases from random
seed selection, we perform twenty total runs for each dataset, ten shuffled and ten unshuffled,
with both sets sharing the same ten random seeds. In the shuffled cases we randomly re-order
the user’s interaction history but keep the last two items untouched (for validation and test).
        </p>
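        <p>The shuffling step can be sketched as follows. The function name shuffle_history and its parameters are hypothetical, and the use of a per-call seeded generator is an assumption about how seeds might be applied; the sketch only shows the idea of permuting the training prefix while leaving the validation and test items in place.</p>
        <preformat><![CDATA[
```python
import random


def shuffle_history(sequence, seed, holdout=2):
    """Return a copy with all but the last `holdout` items randomly re-ordered."""
    rng = random.Random(seed)  # seeded so shuffled runs are repeatable
    if holdout > 0:
        head, tail = list(sequence[:-holdout]), list(sequence[-holdout:])
    else:
        head, tail = list(sequence), []
    rng.shuffle(head)
    return head + tail
```
]]></preformat>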
        <p>
          In Table 3 we report the results of our experiments as well as the reported results from
Kang and McAuley [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Although all differences between shuffled and unshuffled results are
statistically significant (except in the case of Steam, where shuffling had no effect on performance),
the differences are not qualitatively substantial outside of ML-1M, where the ranges of shuffled
and unshuffled are non-overlapping (and Beauty, though there the difference favors shuffled
and will be explored briefly later). From a replication standpoint our unshuffled results align
well with the SASRec reported scores (see footnote 5). The focus of our subsequent analysis is on the difference
in performance between the shuffled and unshuffled cases. The largest such difference on
both HR@10 and NDCG@10 occurs for ML-1M. We propose that this difference is caused
by the shuffling process destroying most of the detectable signal provided by the internal
recommendation engine introduced by the MovieLens dataset construction.
        </p>
        <p>Footnote 4: We will provide all model and data cleaning code upon publication.</p>
        <p>Footnote 5: There is a noticeable variation on the Steam dataset. We suspect this is due to differences in pre-processing.</p>
        <p>
          Notably, shuffling has little effect on the performance of the other datasets. One factor that
may explain how robust SASRec is to shuffling of the input sequences is that recommendations
are made by looking at few items in the history due to the attentional mechanism [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Whether
this robustness to input-sequence shuffling is a general property of sequential recommenders or
is specific to attention-based sequential recommenders is an interesting question and
one we leave for future work. Another factor may be in the dataset construction itself, given
that the other datasets come from processes that better proxy a realistic interaction scenario
than MovieLens.
        </p>
        <p>Although small, the differences in NDCG@10 for Video 2014, Video 2018 and Beauty are all
statistically significant. While shuffling negatively impacts the performance for the two Video
datasets, interestingly, shuffling seems to improve the performance on Beauty. Statistically,
Beauty is not too dissimilar to Video 2018 as shown in Section 3. We propose that the difference
in response to shuffling may be due to domain-specific differences in how users interact with
items. Analysis of the loss characteristics for these runs suggests that shuffling the Beauty dataset
improves generalization of the model. Learning is slower but eventually crosses the plateau
reached by the model on the unshuffled data.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Rating Prediction Experiments</title>
        <p>For this experiment we modify SASRec to predict the ratings of the held-out items rather than
predicting the items themselves. Our goal is to remove the impact of the internal
recommendation engine in ML-1M by limiting the scope of the model to only items that users interacted
with. This separation of ratings from item identification allows for a fairer comparison
between the shuffled and unshuffled cases. That is, by asking the model to rate items rather
than suggest items, we remove the benefit of being able to predict which items the internal
recommendation engine would have suggested.<sup>6</sup> We modify the loss function of SASRec to
have two parts: a mean squared error component that penalizes the distance of predictions from the
true rating, and a cross-entropy component that penalizes wrong labels.</p>
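        <p>As a rough illustration of such a two-part objective, the sketch below combines a squared-error term with a cross-entropy term for a single prediction. The details are assumptions made for illustration only: that the model emits one logit per rating level, that the numeric prediction is the probability-weighted rating, and that the two terms are mixed with the hypothetical weights <monospace>mse_weight</monospace> and <monospace>ce_weight</monospace>. The paper does not specify these choices.</p>
        <preformat>
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rating_loss(logits, true_rating, ratings=(1, 2, 3, 4, 5),
                mse_weight=1.0, ce_weight=1.0):
    """Two-part loss for one held-out item (illustrative assumption).

    - squared error between the expected rating and the true rating
    - cross-entropy of the true rating's class under the softmax
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, ratings))
    squared_error = (expected - true_rating) ** 2
    cross_entropy = -math.log(probs[ratings.index(true_rating)])
    return mse_weight * squared_error + ce_weight * cross_entropy


def batch_rating_loss(batch):
    """Mean of rating_loss over (logits, true_rating) pairs."""
    return sum(rating_loss(lg, r) for lg, r in batch) / len(batch)


# Uniform logits with true rating 3: the expected rating is 3, so only
# the cross-entropy term, log(5), contributes.
uniform = rating_loss([0.0] * 5, true_rating=3)
```
        </preformat>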
        <p>
          Table 4 shows the results of shuffled and unshuffled runs on ML-1M for the rating prediction
version of SASRec. We present accuracy and root mean squared error (RMSE) for two values
of holdout N, the number of items held out for test and validation. By
changing the evaluation method to one that does not depend on scores for items with which the
user did not interact, we are able to close the performance gap considerably, from tens of percent
to single digits. These results provide further evidence that the difference in performance on the
ML-1M dataset between shuffled and unshuffled shown in Section 4.2 is due to a sequential bias
that constrains the possibility space of items that can appear later in a user’s sequence. This
in turn suggests that a model’s ability to utilize the order information in ML-1M
does not translate beyond the dataset, and thus that ML-1M is an especially poor benchmark for
sequential recommenders in general.
        </p>
        <p><sup>6</sup> It has been noted that rating prediction is a poor measure for recommendation algorithm evaluation [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ]. We
nonetheless find it useful for teasing apart our confounded data-sequence.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>We have shown that the MovieLens datasets, while important contributions and useful
benchmarks for recommendation in general, are inappropriate for use in the subfield of sequential
recommendation. Notably, while the presence of a sequential signal in the data implies
that MovieLens is an effective benchmark for sequential pattern recognition, the detachment
of that signal from real-world watch habits makes it unsuitable as a benchmark for sequential
recommendation specifically. That is, any pattern embedded into a sequence of data would
make a valid test case for sequential pattern recognition, but recommendation, while related,
has different constraints and more specific goals. These issues were
acknowledged in the original release of the datasets, but seem to be often overlooked or forgotten
in the field. We performed a rigorous analysis of two of the MovieLens datasets and compared
them to other benchmark datasets to further elucidate these issues. Though
the MovieLens datasets clearly stood out, the other datasets also contain significant numbers of
ratings with indistinct timestamps for users, suggesting that ordering issues may be
pervasive across benchmarks in the field.</p>
      <p>To directly examine the impact of ordering information in these datasets on sequential
recommendation, we conducted two experiments. The first explicitly destroyed
any sequential information in the datasets by randomly shuffling the training sequence. The
difference between shuffled and unshuffled was found to be small in most cases, with the notable
exception of ML-1M. Shuffling caused a large drop in performance for ML-1M, which we
propose is a symptom of the dataset’s construction, namely the sequential
signal introduced by the internal recommendation engine. Our second experiment aimed to
remove the impact of this internal recommendation system by predicting the rating
given by users to items they interacted with. We showed that doing this greatly reduced the
performance gap between shuffled and unshuffled, providing further evidence that a large share
of the performance of SASRec on ML-1M comes from modelling the internal recommendation
engine of the MovieLens system and not from genuine sequential information.</p>
      <p>Assessing the value of offline testing requires a precise understanding of the differences
between benchmarks and the real-world scenarios we aim to emulate. What constitutes a
true sequence looks different depending on the domain. For example, you may purchase
multiple beauty items at the same time, but you are unlikely to watch more than one movie at
once. Furthermore, when the timestamp and ordering information is distanced from a natural
interaction sequence, the ability of sequential recommendation models to generalize is called
into question. Our evaluation does not speak to the performance of such algorithms; rather, it
suggests that their true performance may be obscured by a lack of appropriate datasets.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Our work highlights the importance of further scrutiny for dataset creation methodologies when
using those datasets as benchmarks for tasks beyond their initial scope. Specifically, we find
there is a necessity for further exploration and creation of new datasets for evaluating sequential
recommendation, with special attention paid to temporal tagging and order information.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Appendices - Additional Figures</title>
      <sec id="sec-8-1">
        <title>A.1. Indistinct Sequences</title>
        <p>Figure A.1 (a) shows that the user sequences in ML-1M are particularly contaminated with large
regions of indistinct interactions. These regions represent areas where the actual interaction
order is unknowable, and thus unusable by sequential recommenders. While less pronounced,
this issue persists for ML-25M.</p>
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Per Second Data</title>
        <p>The timestamps associated with interactions in the MovieLens datasets are resolved to the level
of seconds. Figure A.2 shows the percentages of zero-second intervals in the ML-1M (a) and
ML-25M (b) datasets, as well as the cumulative percentage of users plotted against the number
of interactions with unique timestamps. Figure A.2 (c) shows that over two-thirds of the users
in both datasets have interaction histories containing fewer than 100 unique timestamps. Only
2 of the 6040 users in ML-1M have all their interactions at distinct timestamps. The average
interaction history contains 165.5 and 153.5 interactions for ML-1M and ML-25M, respectively.</p>
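        <p>The per-user quantities discussed above can be computed along the following lines. This is a sketch under assumptions rather than the analysis code behind the figures: interactions are taken to be (user, item, Unix-second timestamp) tuples, and the function and key names are hypothetical.</p>
        <preformat>
```python
from collections import defaultdict


def timestamp_stats(interactions):
    """Summarize timestamp distinctness per user.

    interactions: iterable of (user_id, item_id, unix_seconds) tuples
    (an assumed input layout, not the datasets' file format).
    """
    by_user = defaultdict(list)
    for user, _item, ts in interactions:
        by_user[user].append(ts)

    stats = {}
    for user, stamps in by_user.items():
        stamps.sort()
        # Gaps between consecutive interactions; a gap of 0 means the
        # two interactions share a timestamp and their order is unknown.
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        zero_gaps = sum(1 for g in gaps if g == 0)
        stats[user] = {
            "n_interactions": len(stamps),
            "n_unique_timestamps": len(set(stamps)),
            "zero_gap_fraction": zero_gaps / len(gaps) if gaps else 0.0,
        }
    return stats


# Toy log: u1 rates three items in the same second, u2 never does.
log = [
    ("u1", "a", 100), ("u1", "b", 100), ("u1", "c", 100), ("u1", "d", 160),
    ("u2", "a", 5), ("u2", "e", 9),
]
stats = timestamp_stats(log)
fully_distinct = [u for u, s in stats.items()
                  if s["n_unique_timestamps"] == s["n_interactions"]]
```
        </preformat>
        <p>In this toy log only <monospace>u2</monospace> has a fully ordered history; the same check applied to a full dataset yields the count of users whose interactions all carry distinct timestamps.</p>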
      </sec>
      <sec id="sec-8-3">
        <title>A.3. Shuffled Experiment Visualization</title>
        <p>Figure A.1: The last 20 interactions for 10 randomly selected users for each of the six datasets. Each
row represents a single user’s last 20 interactions and each box represents a single interaction. All
interactions with a unique timestamp are presented in blue. Interactions that share a timestamp with
a neighboring interaction are colored one of two shades of grey. We use the two shades to separate
neighboring chunks of same-timestamp interactions. Uncolored sections represent users with fewer
than 20 total interactions. Panels: (a) ML-1M; (d) Steam.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>Self-attentive sequential recommendation</article-title>
          ,
          <source>in: 2018 IEEE International Conference on Data Mining (ICDM)</source>
          , IEEE Computer Society, Los Alamitos, CA, USA,
          <year>2018</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>206</lpage>
          . URL: https://doi.ieeecomputersociety.org/10.1109/ICDM.2018.00035. doi:10.1109/ICDM.2018.00035.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sun</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ou</surname>
          </string-name>
          , P. Jiang,
          <article-title>Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '19,
          Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , pp.
          <fpage>1441</fpage>
          -
          <lpage>1450</lpage>
          . URL: https://doi.org/10.1145/3357384.3357895. doi:10.1145/3357384.3357895.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Freudenthaler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          ,
          <article-title>Factorizing personalized markov chains for next-basket recommendation</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on World Wide Web, WWW '10</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2010</year>
          , pp.
          <fpage>811</fpage>
          -
          <lpage>820</lpage>
          . URL: https://doi.org/10.1145/1772690.1772773. doi:10.1145/1772690.1772773.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>Fusing similarity models with markov chains for sparse sequential recommendation</article-title>
          ,
          <source>in: 2016 IEEE 16th International Conference on Data Mining (ICDM)</source>
          , IEEE, Barcelona, Spain,
          <year>2016</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Efros</surname>
          </string-name>
          ,
          <article-title>Unbiased look at dataset bias</article-title>
          ,
          <source>in: CVPR 2011</source>
          , IEEE, Colorado Springs, CO, USA,
          <year>2011</year>
          , pp.
          <fpage>1521</fpage>
          -
          <lpage>1528</lpage>
          . doi:10.1109/CVPR.2011.5995347.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tommasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Patricia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          , A Deeper Look at Dataset Bias, Springer International Publishing, Cham,
          <year>2017</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>55</lpage>
          . URL: https://doi.org/10.1007/978-3-319-58347-1_2. doi:10.1007/978-3-319-58347-1_2.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Unlearn dataset bias in natural language inference by fitting the residual</article-title>
          ,
          <source>in: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)</source>
          , Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>132</fpage>
          -
          <lpage>142</lpage>
          . URL: https://www.aclweb.org/anthology/D19-6115. doi:10.18653/v1/D19-6115.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gaut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>ElSherief</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Belding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Mitigating gender bias in natural language processing: Literature review</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>1630</fpage>
          -
          <lpage>1640</lpage>
          . URL: https://www.aclweb.org/anthology/P19-1159. doi:10.18653/v1/P19-1159.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Northcutt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Athalye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <source>Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks</source>
          ,
          <year>2021</year>
          . arXiv:2103.14749.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Johansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. V.</given-names>
            <surname>Ilyevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Wilbur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Bharadwaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Siskind</surname>
          </string-name>
          ,
          <article-title>The perils and pitfalls of block design for eeg classification experiments</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>43</volume>
          (
          <year>2021</year>
          )
          <fpage>316</fpage>
          -
          <lpage>333</lpage>
          . doi:10.1109/TPAMI.2020.2973153.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kouki</surname>
          </string-name>
          , I. Fountalis,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasiloglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Liberty</surname>
          </string-name>
          , K. Al Jadda,
          <article-title>From the lab to production: A case study of session-based recommendations in the home-improvement domain</article-title>
          ,
          <source>in: Fourteenth ACM Conference on Recommender Systems</source>
          , RecSys '20,
          Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , pp.
          <fpage>140</fpage>
          -
          <lpage>149</lpage>
          . URL: https://doi.org/10.1145/3383313.3412235. doi:10.1145/3383313.3412235.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gruson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chandar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Charbuillet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McInerney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tardieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
          ,
          <article-title>Offline evaluation to make decisions about playlist recommendation algorithms</article-title>
          ,
          <source>in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining</source>
          , WSDM '19,
          Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , pp.
          <fpage>420</fpage>
          -
          <lpage>428</lpage>
          . URL: https://doi.org/10.1145/3289600.3291027. doi:10.1145/3289600.3291027.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Quadrana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <article-title>Sequence-aware recommender systems</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 51</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Díez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cantador</surname>
          </string-name>
          ,
          <article-title>Time-aware recommender systems: a comprehensive survey and analysis of existing evaluation protocols</article-title>
          ,
          <source>User Modeling and User-Adapted Interaction</source>
          <volume>24</volume>
          (
          <year>2014</year>
          )
          <fpage>67</fpage>
          -
          <lpage>119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sarwar</surname>
          </string-name>
          , G. Karypis,
          <string-name>
            <given-names>J.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          ,
          <article-title>Item-based collaborative filtering recommendation algorithms</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on World Wide Web, WWW '01</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2001</year>
          , pp.
          <fpage>285</fpage>
          -
          <lpage>295</lpage>
          . URL: https://doi.org/10.1145/371920.372071. doi:10.1145/371920.372071.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mobasher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          ,
          <article-title>Using sequential and non-sequential patterns in predictive web usage mining tasks</article-title>
          ,
          <source>in: 2002 IEEE International Conference on Data Mining, 2002. Proceedings.</source>
          , IEEE, Maebashi City, Japan,
          <year>2002</year>
          , pp.
          <fpage>669</fpage>
          -
          <lpage>672</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hidasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karatzoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baltrunas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tikk</surname>
          </string-name>
          ,
          <article-title>Session-based recommendations with recurrent neural networks</article-title>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Quadrana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karatzoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hidasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <article-title>Personalizing session-based recommendations with hierarchical recurrent neural networks</article-title>
          ,
          <source>in: Proceedings of the Eleventh ACM Conference on Recommender Systems</source>
          , RecSys '17,
          Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>130</fpage>
          -
          <lpage>137</lpage>
          . URL: https://doi.org/10.1145/3109859.3109896. doi:10.1145/3109859.3109896.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>What to do next: Modeling user behaviors by time-lstm</article-title>
          ,
          <source>in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17</source>
          , AAAI Press, Palo Alto, California,
          <year>2017</year>
          , p.
          <fpage>3602</fpage>
          -
          <lpage>3608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Covington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Adams</surname>
          </string-name>
          , E. Sargin,
          <article-title>Deep neural networks for youtube recommendations</article-title>
          ,
          <source>in: Proceedings of the 10th ACM Conference on Recommender Systems</source>
          , RecSys '16,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2016</year>
          , p.
          <fpage>191</fpage>
          -
          <lpage>198</lpage>
          . URL: https://doi.org/10.1145/2959100.2959190. doi:10.1145/2959100.2959190.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Sequential recommender system based on hierarchical attention network</article-title>
          ,
          <source>in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI'18</source>
          , AAAI Press, Palo Alto, California,
          <year>2018</year>
          , p.
          <fpage>3926</fpage>
          -
          <lpage>3932</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <article-title>SIGIR initiative to implement ACM artifact review and badging</article-title>
          ,
          <source>SIGIR Forum</source>
          <volume>52</volume>
          (
          <year>2018</year>
          )
          <fpage>4</fpage>
          -
          <lpage>10</lpage>
          . URL: https://doi.org/10.1145/3274784.3274786. doi:10.1145/3274784.3274786.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morgenstern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vecchione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
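          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Daumé</surname>
            <suffix>III</suffix>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Crawford</surname>
          </string-name>
          ,
          <article-title>Datasheets for datasets</article-title>
          ,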
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Holland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hosny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Joseph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chmielinski</surname>
          </string-name>
          ,
          <article-title>The dataset nutrition label: A framework to drive higher data quality standards</article-title>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <article-title>Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>6</volume>
          (
          <year>2018</year>
          )
          <fpage>587</fpage>
          -
          <lpage>604</lpage>
          . URL: https://doi.org/10.1162/tacl_a_00041. doi:10.1162/tacl_a_00041.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <article-title>On the difficulty of evaluating baselines: A study on recommender systems</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Ekstrand</surname>
          </string-name>
          ,
          <article-title>LensKit for Python: Next-generation software for recommender systems experiments</article-title>
          ,
          <source>in: Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management, CIKM '20</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>2999</fpage>
          -
          <lpage>3006</lpage>
          . URL: https://doi.org/10.1145/3340531.3412778. doi:10.1145/3340531.3412778.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gantner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Freudenthaler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          ,
          <article-title>MyMediaLite: A free recommender system library</article-title>
          ,
          <source>in: Proceedings of the Fifth ACM Conference on Recommender Systems</source>
          , RecSys '11, Association for Computing Machinery, New York, NY, USA,
          <year>2011</year>
          , p.
          <fpage>305</fpage>
          -
          <lpage>308</lpage>
          . URL: https://doi.org/10.1145/2043932.2043989. doi:10.1145/2043932.2043989.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <article-title>Are we evaluating rigorously? benchmarking recommendation for reproducible evaluation and fair comparison</article-title>
          ,
          <source>in: Fourteenth ACM Conference on Recommender Systems</source>
          , RecSys '20, Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>23</fpage>
          -
          <lpage>32</lpage>
          . URL: https://doi.org/10.1145/3383313.3412489. doi:10.1145/3383313.3412489.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Anelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bellogín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malitesta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Merra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Donini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Di Noia</surname>
          </string-name>
          ,
          <article-title>Elliot: a comprehensive and rigorous framework for reproducible recommender systems evaluation</article-title>
          ,
          <year>2021</year>
          . arXiv:2103.02590.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>A.</given-names>
            <surname>Said</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bellogín</surname>
          </string-name>
          ,
          <article-title>Comparative recommender system evaluation: Benchmarking recommendation frameworks</article-title>
          ,
          <source>in: Proceedings of the 8th ACM Conference on Recommender Systems</source>
          , RecSys '14, Association for Computing Machinery, New York, NY, USA,
          <year>2014</year>
          , p.
          <fpage>129</fpage>
          -
          <lpage>136</lpage>
          . URL: https://doi.org/10.1145/2645710.2645746. doi:10.1145/2645710.2645746.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A critical study on data leakage in recommender system offline evaluation</article-title>
          ,
          <year>2021</year>
          . arXiv:2010.11060.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Harper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <article-title>The MovieLens datasets: History and context</article-title>
          ,
          <source>ACM Transactions on Interactive Intelligent Systems (TiiS)</source>
          <volume>5</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Targett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>van den Hengel</surname>
          </string-name>
          ,
          <article-title>Image-based recommendations on styles and substitutes</article-title>
          ,
          <source>in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '15, Association for Computing Machinery, New York, NY, USA,
          <year>2015</year>
          , p.
          <fpage>43</fpage>
          -
          <lpage>52</lpage>
          . URL: https://doi.org/10.1145/2766462.2767755. doi:10.1145/2766462.2767755.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>Justifying recommendations using distantly-labeled reviews and fine-grained aspects</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>188</fpage>
          -
          <lpage>197</lpage>
          . URL: https://www.aclweb.org/anthology/D19-1018. doi:10.18653/v1/D19-1018.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>D.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jebara</surname>
          </string-name>
          ,
          <article-title>Variational autoencoders for collaborative filtering</article-title>
          ,
          <source>in: Proceedings of the 2018 World Wide Web Conference</source>
          , WWW '18, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE,
          <year>2018</year>
          , p.
          <fpage>689</fpage>
          -
          <lpage>698</lpage>
          . URL: https://doi.org/10.1145/3178876.3186150. doi:10.1145/3178876.3186150.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-S.</given-names>
            <surname>Chua</surname>
          </string-name>
          ,
          <article-title>Neural collaborative filtering</article-title>
          ,
          <source>in: Proceedings of the 26th International Conference on World Wide Web</source>
          , WWW '17, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE,
          <year>2017</year>
          , p.
          <fpage>173</fpage>
          -
          <lpage>182</lpage>
          . URL: https://doi.org/10.1145/3038912.3052569. doi:10.1145/3038912.3052569.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Context-aware sequential recommendation</article-title>
          ,
          <source>in: 2016 IEEE 16th International Conference on Data Mining (ICDM)</source>
          , IEEE, Barcelona, Spain,
          <year>2016</year>
          , pp.
          <fpage>1053</fpage>
          -
          <lpage>1058</lpage>
          . doi:10.1109/ICDM.2016.0135.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yan</surname>
          </string-name>
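          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,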
          <string-name>
            <given-names>M.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>CosRec: 2D convolutional neural networks for sequential recommendation</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '19, Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , p.
          <fpage>2173</fpage>
          -
          <lpage>2176</lpage>
          . URL: https://doi.org/10.1145/3357384.3358113. doi:10.1145/3357384.3358113.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bian</surname>
          </string-name>
          ,
          <article-title>Adversarial oracular seq2seq learning for sequential recommendation</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Bessiere</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20</source>
          , International Joint Conferences on Artificial Intelligence Organization
          , San Francisco, CA, USA,
          <year>2020</year>
          , pp.
          <fpage>1905</fpage>
          -
          <lpage>1911</lpage>
          . URL: https://doi.org/10.24963/ijcai.2020/264. doi:10.24963/ijcai.2020/264, main track.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Joo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-C.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <article-title>Sequential recommendation with relation-aware kernelized self-attention</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>34</volume>
          (
          <year>2020</year>
          )
          <fpage>4304</fpage>
          -
          <lpage>4311</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/5854. doi:10.1609/aaai.v34i04.5854.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          ,
          <article-title>Cumulated gain-based evaluation of IR techniques</article-title>
          ,
          <source>ACM Transactions on Information Systems (TOIS)</source>
          <volume>20</volume>
          (
          <year>2002</year>
          )
          <fpage>422</fpage>
          -
          <lpage>446</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>McNee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <article-title>Being accurate is not enough: How accuracy metrics have hurt recommender systems</article-title>
          ,
          <source>in: CHI '06 Extended Abstracts on Human Factors in Computing Systems, CHI EA '06</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2006</year>
          , p.
          <fpage>1097</fpage>
          -
          <lpage>1101</lpage>
          . URL: https://doi.org/10.1145/1125451.1125659. doi:10.1145/1125451.1125659.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [Figure: panels (b) Video 2014 and (c) Video 2018; x-axis: Sequence Order (Shuffled, Unshuffled).]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>