A Comparison of Frequent Pattern Techniques and a Deep Learning Method for Session-Based Recommendation

Iman Kamehkhosh, Dietmar Jannach, Malte Ludewig
TU Dortmund
iman.kamehkhosh@tu-dortmund.de, dietmar.jannach@tu-dortmund.de, malte.ludewig@tu-dortmund.de

ABSTRACT
Making session-based recommendations, i.e., recommending items solely based on the users' last interactions without having access to their long-term preference profiles, is a challenging problem in various application fields of recommender systems. Using a coarse classification scheme, the proposed algorithmic approaches to this problem in the research literature can be categorized into frequent pattern mining algorithms and approaches that are based on sequence modeling. Within the latter class, recent works suggest the application of recurrent neural networks (RNN) for the problem. However, the lack of established algorithmic baselines for session-based recommendation problems makes the assessment of such novel approaches difficult.

In this work, we therefore compare a state-of-the-art RNN-based approach with a number of (heuristics-based) frequent pattern mining methods, both with respect to the accuracy of their recommendations and with respect to their computational complexity. The results obtained for a variety of different datasets show that in every single case a comparably simple frequent pattern method can be found that outperforms the recent RNN-based method. At the same time, the much simpler methods are also computationally less expensive and can be applied within the narrow time constraints of online recommendation.

CCS CONCEPTS
• General and reference → Evaluation; • Information systems → Recommender systems; • Computing methodologies → Neural networks; Rule learning;

KEYWORDS
Session-Based Recommendations; Deep Learning; Frequent Pattern Mining; Benchmarking

1 INTRODUCTION
Making recommendations solely based on a user's current session and most recent interactions is a nontrivial problem for recommender systems. On an e-commerce website, for instance, when a visitor is new (or not logged in), there are no long-term user models that can be applied to determine suitable recommendations for this user. Furthermore, recent work shows that considering the user's short-term intent often has more effect on the accuracy of the recommendations than the choice of the method used to build the long-term user profiles [20]. In general, such types of problems are common on e-commerce sites, e.g., when returning users do not log in every time they use the site. The same challenges can, however, also be observed in other application domains, in particular for news and media (music and video) recommendation [21, 33].

The problem of predicting the next actions of users based solely on their sequence of actions in the current session is referred to in the literature as session-based recommendation. A number of algorithmic approaches have been proposed over the years to deal with the problem. Early academic approaches, for example, rely on the detection of sequential patterns in the session data of a larger user community. In principle, even simpler methods can be applied. Amazon's "Customers who bought . . . also bought" feature represents an example that relies on simple co-occurrence patterns to generate recommendations, in that case in the context of the very last user interaction (an item view event). A number of later works then explored the use of Markov models [30, 35, 39], and most recently, researchers explored the use of recurrent neural networks (RNN) for the session-based next-item recommendation problem [16, 17, 38, 42].

Today, RNNs can be considered one of the state-of-the-art methods for sequence learning tasks. They have been successfully explored for various sequence-based prediction problems in the past [5, 9, 11, 18], and in a recent work, Hidasi et al. [16] investigated an RNN variant based on gated recurrent units (GRU) for the session-based recommendation problem. In their work, they benchmarked their RNN-based method gru4rec against different baseline methods on two datasets. Their results showed that gru4rec is able to outperform the baseline approaches in terms of accuracy for top-20 recommendation lists.

While these results indicate that RNNs can be successfully applied to the given recommendation task, we argue that the experimental evaluation in [16] does not fully inform us about different aspects of the effectiveness and the practicability of the proposed method. First, regarding the effectiveness, it is unclear if the methods to which gru4rec was compared are competitive. Second, as the evaluation was based on one single training-test split and only on accuracy measures, further investigations are necessary to assess, for example, if some algorithms exhibit certain biases, e.g., to recommend mostly popular items. Third, even if the RNN method is effective, questions regarding the scalability of the method should be discussed, in particular as hyper-parameter optimization for the complex networks can become very challenging in practice.

The goal of this work is to shed light on these questions, and in the remainder of this paper we report the detailed results of comparing a state-of-the-art RNN-based method with a number of computationally more efficient pattern mining approaches along different dimensions.

Workshop on Temporal Reasoning in Recommender Systems, collocated with ACM RecSys'17, Como, Italy.
Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.
2 PREVIOUS WORKS
In session-based recommendation problems, we are given a sequence of the most recent actions of a user, and the goal is to find items that are relevant in the context of the user's specific short-term intent. One traditional way to determine recommendations given a set of recent items of interest is to apply frequent pattern mining techniques, e.g., based on association rules (AR) [1]. AR are often applied for market basket analysis with the goal of finding sets of items that are bought together with some probability [14]. The order of the items or actions in a session is irrelevant for AR-based approaches. Sequential pattern mining (SP) [2] techniques, in contrast, consider the order of the elements in sessions when identifying frequent patterns. In one of the earlier works, Mobasher et al. [32] used frequent pattern mining methods to predict a user's next navigation action. In another work, Yap et al. [47] propose a sequential-pattern-mining-based next-item recommendation framework, which weights the patterns according to their estimated relevance for the individual user. In the domain of music recommendation, Hariri et al. [15] more recently propose to mine sequential patterns of latent topics based on the tags attached to the tracks to predict the context of the next song.

A different way of finding item-to-item correlations is to look for sessions that are similar to the current one (neighbors) and to determine frequent item co-occurrence patterns that can be used in the prediction phase. Such neighborhood-based approaches were, for example, applied in the domains of e-commerce and music in [4] or [26]. In some cases and application domains, simple co-occurrence patterns are, despite their simplicity, quite effective; see, e.g., [20, 40] or [44].

Differently from such pattern- and co-occurrence-based techniques, a number of recent approaches are based on sequence modeling using, e.g., Markov models. The main assumption of Markov-model-based approaches in the context of session-based recommendation is that the selection of the next item in a session depends on a limited number of previous actions. Shani et al. [35] were among the first who applied first-order Markov chains (MC) for session-based recommendation and showed the superiority of sequential models over non-sequential ones. In the music domain, McFee and Lanckriet [30] proposed a music playlist generation algorithm based on MCs that – given a seed song – selects the next track from uniform and weighted distributions as well as from k-nearest-neighbor graphs. Generally, a main issue when applying Markov chains in session-based recommendation is that the state space quickly becomes unmanageable when all possible sequences of user selections should be considered [12, 16].

More recent approaches to sequence modeling for session-based recommendation utilize recurrent neural networks (RNN). RNNs process sequential data one element at a time and are able to selectively pass information across sequence steps [28]. Zhang et al. [49], for example, successfully applied RNNs to predict advertisement clicks based on the users' browsing behavior in a sponsored search scenario. For session-based recommendations, Hidasi et al. [16] investigated a customized RNN variant based on gated recurrent units (GRU) [5] to model the users' transactions within sessions. They also tested several ranking loss functions in their solutions. Later on, RNN-based approaches were proposed in [17] and [42] which leverage additional item features to achieve higher accuracy. For the problem of news recommendation, Song et al. [36] proposed a temporal deep semantic structured model for the combination of long-term static and short-term temporal user preferences. They considered different levels of granularity in their model to process both fast and slow temporal changes in the users' preferences. In general, neural networks have been used for a number of recommendation-related tasks in recent years. Often, such networks are used to learn embeddings of content features in compact fixed-size latent vectors, e.g., for music, for images, for video data, for documents, or to represent the user [3, 6–8, 13, 25, 29, 46]. These representations are then integrated, e.g., in content-based approaches, in variations of latent factor models, or as part of new methods for computing recommendations [7, 8, 13, 27, 37, 43, 45].

In the work presented in this paper, we will compare different existing and novel pattern-mining-based approaches with a state-of-the-art RNN-based algorithm.

3 EXPERIMENT CONFIGURATIONS
3.1 Algorithms
3.1.1 RNN Baseline. gru4rec, proposed in [16], is an RNN-based algorithm that uses gated recurrent units to deal with the vanishing or exploding gradient problem. In our experiments, we used the Python implementation that is shared by the authors online.1

1 https://github.com/hidasib/GRU4Rec

3.1.2 Session-based kNN – knn. The knn method searches for the k most similar past sessions ("neighbors") in the training data based on the set of items in the current session. Since the process of determining the neighbor sessions becomes very time-consuming as the number of sessions increases, we use a special in-memory index data structure (cache) in our implementation. Technically, in the training phase, we create one data structure that maps the training sessions to their sets of items and one structure that maps the items to the sessions in which they appear. To make recommendations for the current session s, we first create a union of the sessions in which the items of s appear. This union is the set of possible neighbors of the current session. This is a fast operation as it only involves a cache lookup and set operations. To further reduce the computational complexity of the prediction process, we select a subsample of these possible neighbors using a heuristic. In this work, we took the m most recent sessions, as focusing on recent trends has been shown to be effective for recommendations in e-commerce [23]. We then compute the similarity of these m most recent possible neighbors and the current session and select the k most similar sessions as the neighbor sessions N_s of the current session. Again through lookup and set union operations, we create the set of recommendable items R that contains the items that appear in one of the k sessions. For each recommendable item i in R, we then compute the knn score as the sum of the similarity values of s and its neighbor sessions n ∈ N_s that contain i (Equation 1). The indicator function 1_n(i) returns 1 if n contains i and 0 otherwise; see also [4].

    score_knn(i, s) = Σ_{n ∈ N_s} sim(s, n) × 1_n(i)    (1)

In our experiments, we tested different distance measures to determine the similarity of sessions. The best results were achieved when the sessions were encoded as binary vectors of the item space and when using cosine similarity. In our implementation, the set operations, similarity computations, and the final predictions can be done very efficiently, as will be discussed later in Section 4.2.2. Our algorithm has only two parameters: the number of neighbors k and the number of sampled sessions m. For the large e-commerce dataset used in [16], the best results were, for example, achieved with k = 500 and m = 1000. Note that the kNN method used in [16] is based on item-to-item similarities, while our kNN method aims to identify similar sessions.
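To make the described data structures and scoring concrete, the following minimal Python sketch shows one possible realization of this scheme. All identifiers are ours, sessions are assumed to be given as ordered lists of item ids, and we use increasing session ids as a proxy for recency; this is an illustration of the described approach, not the authors' actual code.

```python
from collections import defaultdict
from math import sqrt

class SessionKNN:
    """Sketch of the described session-based kNN (all names are ours)."""

    def __init__(self, k=500, m=1000):
        self.k, self.m = k, m
        self.session_items = {}                # session id -> set of item ids
        self.item_sessions = defaultdict(set)  # item id -> ids of sessions containing it

    def fit(self, sessions):
        # sessions: dict of session id -> ordered list of item ids;
        # we assume larger ids denote more recent sessions.
        for sid, items in sessions.items():
            self.session_items[sid] = set(items)
            for item in items:
                self.item_sessions[item].add(sid)

    def recommend(self, current_items):
        s = set(current_items)                 # assumed non-empty
        # Possible neighbors: union of the sessions sharing an item with s.
        possible = set()
        for item in s:
            possible |= self.item_sessions.get(item, set())
        recent = sorted(possible, reverse=True)[:self.m]  # m most recent sessions
        # Cosine similarity of binary item vectors: |A ∩ B| / sqrt(|A| * |B|).
        sims = sorted(((len(s & self.session_items[n]) /
                        sqrt(len(s) * len(self.session_items[n])), n)
                       for n in recent), reverse=True)[:self.k]
        scores = defaultdict(float)            # Equation (1): sum of similarities
        for sim, n in sims:
            for item in self.session_items[n]:
                scores[item] += sim
        return sorted(scores, key=scores.get, reverse=True)
```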
3.1.3 kNN Temporal Extension – tknn. The knn method, when using cosine similarity as a distance measure, does not consider the temporal sequence of the events in a session. To be able to leverage the temporal information within the knn technique, we designed an additional temporal-filtering heuristic for it. The proposed tknn method uses the same scoring scheme as the knn method (Equation 1). The only difference is that, given the current session s, we consider item i as being recommendable only if it appears in the neighbor session n directly after a certain item. In our implementation, that certain item is the last item of the current session s. Technically, we therefore use a slightly different implementation of the indicator function of Equation 1: 1_n(i) = 1 if neighbor session n contains i and (j, i) is a subsequence of n, where j is the last item of the current session and thus the basis to predict the next item.
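As an illustration, the modified indicator function could be sketched as follows, where `neighbor` is the ordered item list of session n (function and parameter names are ours):

```python
def indicator_tknn(item, neighbor, last_item):
    """1_n(i) for tknn: i only counts if it appears in the neighbor
    session directly after the current session's last item j."""
    return any(a == last_item and b == item
               for a, b in zip(neighbor, neighbor[1:]))
```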
3.1.4 Simple Association Rules – ar. To assess the strength of simple two-element co-occurrence patterns, we included a method named ar, which can be considered an association rule technique with a maximum rule size of two. Technically, we create a rule r_{p,q} for every two items p and q that appear together in the training sessions. We determine the weight w_{p,q} of each rule simply as the number of times p and q appear together in past sessions. Given the current session s, the ar score of a target item i is then computed as

    score_ar(i, s) = w_{i,j} × 1_AR(r_{i,j})    (2)

where j is the last item of the current session s for which we want to predict the successor, and AR is the set of rules and their weights as determined on the training data. The indicator function 1_AR(r_{i,j}) = 1 when AR contains r_{i,j} and 0 otherwise.
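A minimal sketch of this rule mining and scoring could look as follows (function and variable names are ours; we keep the co-occurrence counts in a dictionary keyed by item pairs, so the dictionary lookup plays the role of the indicator function):

```python
from collections import defaultdict
from itertools import permutations

def train_ar(sessions):
    """Count how often two items co-occur in a session (rule weight w_pq);
    the order of the items within the session is deliberately ignored."""
    weights = defaultdict(int)
    for items in sessions:                        # session = list of item ids
        for p, q in permutations(set(items), 2):  # both directions of each pair
            weights[(p, q)] += 1
    return weights

def score_ar(weights, current_session, candidate):
    # Equation (2): weight of the rule between the session's last item j
    # and the candidate item i; 0 if no such rule was mined.
    j = current_session[-1]
    return weights.get((j, candidate), 0)
```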
3.1.5 Simple Sequential Rules – sr. The sr method is a variant of ar which aims to take the order of the events into account. Similar to the ar method, we create a sequential rule r_{p,q} for the co-occurrence of every two items p and q in the training data. This time, however, we consider the distance between p and q in the session when computing the weight of the rules. In our implementation, we use the multiplicative inverse as a weight function and set w_{p,q} = 1/x, where x is the number of items that appear between p and q in a session. Other heuristics, such as a linear or a logarithmic function, can also be used. In case those two items appear together in another session in the training data, the weight of the rule in that session is added to the current weight. We finally normalize the weight and divide it by the total number of sessions that contributed to it. Given the current session s, the sr score of a target item i is then computed as

    score_sr(i, s) = w_{j,i} × 1_SR(r_{j,i})    (3)

where j is the last item of session s and SR is the set of sequential rules. The indicator function 1_SR(r_{j,i}) = 1 when SR contains r_{j,i} and 0 otherwise.
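The following sketch illustrates one possible reading of this weighting scheme (all names are ours). Since w = 1/x is undefined for directly adjacent items (x = 0), we count the distance in steps here, so that a directly following item receives weight 1; this interpretation, and adding one contribution per occurrence within a session, are our assumptions.

```python
from collections import defaultdict

def train_sr(sessions):
    """Distance-weighted sequential rules p -> q, normalized by the
    number of sessions that contributed to each rule."""
    totals, n_contrib = defaultdict(float), defaultdict(int)
    for items in sessions:
        contributed = set()
        for idx, p in enumerate(items):
            for steps, q in enumerate(items[idx + 1:], start=1):
                totals[(p, q)] += 1.0 / steps   # steps - 1 items in between
                contributed.add((p, q))
        for rule in contributed:                # count each session once
            n_contrib[rule] += 1
    return {rule: w / n_contrib[rule] for rule, w in totals.items()}

def score_sr(rules, current_session, candidate):
    # Equation (3): weight of the rule from the last item j to candidate i.
    return rules.get((current_session[-1], candidate), 0.0)
```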
3.1.6 Hybrid Approaches. We additionally made experiments with several hybrids that combine different algorithms. In the end, a weighted combination of the two normalized prediction scores of the algorithms led to the best results in our experiments, as sketched below.
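Such a weighted hybrid could be sketched as follows (names are ours; the min-max normalization of the two score lists is our assumption, and the weights correspond to the wh(a,b:w_a,w_b) notation used in Section 4.1.6):

```python
def hybrid_scores(scores_a, scores_b, w_a, w_b):
    """Weighted combination of two normalized score dicts, e.g.,
    wh(knn,sr:0.3,0.7) combines knn and sr scores with weights 0.3/0.7."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0                  # avoid division by zero
        return {item: (v - lo) / span for item, v in scores.items()}
    a, b = norm(scores_a), norm(scores_b)
    return {item: w_a * a.get(item, 0.0) + w_b * b.get(item, 0.0)
            for item in set(a) | set(b)}
```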
3.2 Datasets and Evaluation Protocol
We performed experiments using datasets from two different domains in which session-based recommendation is relevant, namely e-commerce and next-track music recommendation. The source code and the public datasets can be found online.2

2 http://ls13-www.cs.tu-dortmund.de/homepage/rectemp2017

3.2.1 E-commerce Datasets. For the e-commerce domain, we chose the ACM RecSys 2015 Challenge dataset (RSC) as used in [16]. The RSC dataset is a collection of sequences of click events in shopping sessions. The second e-commerce dataset is a public dataset published for the TMall competition. This dataset contains shopping logs of the users of the Tmall.com website.

3.2.2 Music Datasets. We used (a) two datasets that contain listening logs of several thousand users and (b) two datasets that comprise thousands of manually created playlists.

Listening logs: The used datasets are (almost one-year-long) sub-samples of two public datasets. First, we created a subset of the #nowplaying dataset [48], which contains music-related tweets on Twitter. Second, we used the recent 30Music dataset [41], which contains listening sessions retrieved from Internet radio stations through the Last.fm API.

Playlists: Generally, music playlists are different in nature from listening logs and e-commerce user logs in various ways. Nonetheless, they are designed to be consumed in a listening session, and the tracks are often arranged in a specific sequence. The used playlist datasets come from two different music platforms. The Art-of-the-Mix dataset (AotM) was published by [31] and contains playlists created by music enthusiasts. The 8tracks dataset was shared with us by the 8tracks platform. A particularity of the 8tracks dataset is that each public playlist may contain at most two tracks per artist.

The dataset statistics are shown in Table 1. The total number of sessions is larger for the e-commerce datasets. However, the number of unique items in the music datasets, which corresponds to the number of tracks included in the playlists or the number of played tracks in the listening sessions, is higher than the number of items in the e-commerce datasets.

Table 1: Dataset characteristics (E/S = events per session, I/S = unique items per session).

            RSC    TMall   #nowplaying  30Music  AotM   8tracks
Sessions    8M     4.6M    95K          170K     83K    500K
Events      32M    46M     1M           2.9M     1.2M   5.8M
Items       38K    620K    115K         450K     140K   600K
Avg. E/S    3.97   9.77    10.37        17.03    14.12  11.40
Avg. I/S    3.17   6.92    9.62         14.20    14.11  11.38

The music sessions are on average longer than the e-commerce sessions.3 The last row of Table 1 shows the average number of unique items in each session ("Avg. I/S"). Comparing this value with the average session length ("Avg. E/S") indicates what we call the item repetition rate in each dataset. Including the same track more than once in a playlist is comparably uncommon. Listening to a track more than once during a listening session is, however, common. The difference between the average session length and the average number of unique items per session for the e-commerce datasets indicates that the re-occurrence of the same item within a session is also common in the e-commerce domain.

3 Note that each session in the TMall dataset is defined as the sequence of actions of a user during one day, which results in a relatively larger average session length.
3.2.3 Evaluation Protocol. The general task of the session-based recommendation techniques in our experiment is to predict the next item-view event in a shopping session, or to predict the next track that is played in a listening session or that is included in a playlist. To evaluate the session-based algorithms, we use the same evaluation scheme as in [16]: we incrementally add events to the sessions in the test set and report the average hit rate (HR), which corresponds to recall in this evaluation setting, and the mean reciprocal rank (MRR), which takes the position of the hit into account. We tested list lengths of 1, 2, 3, 5, 7, 10, 15, and 20. While the experiments in [16] are done without cross-validation, we additionally apply a fivefold sliding-window validation protocol as in [24] to minimize the risk that the obtained results are specific to a single train-test split. We therefore created five train-test splits for each dataset. For the listening logs, we used 3 months of training data and the next 5 days as the test data; for the playlists, we used randomized splits, as they have no timestamps assigned.
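A compact sketch of this incremental protocol, under the assumption that each test session is revealed one event at a time and the immediately following event is the prediction target (all names are ours):

```python
def evaluate(recommend, test_sessions, cutoff=20):
    """Incremental next-item evaluation: HR@cutoff and MRR@cutoff,
    averaged over all prediction events in the test sessions."""
    hits, reciprocal_ranks, trials = 0, 0.0, 0
    for items in test_sessions:
        for pos in range(1, len(items)):
            recs = recommend(items[:pos])[:cutoff]  # top-N for the revealed prefix
            trials += 1
            if items[pos] in recs:
                hits += 1
                reciprocal_ranks += 1.0 / (recs.index(items[pos]) + 1)
    return hits / trials, reciprocal_ranks / trials  # (HR, MRR)
```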
4 RESULTS
4.1 Accuracy Results
Our first experiment used the exact same setup as described in [16], i.e., we used only one training-test split when comparing gru4rec with our methods. As done in [16], we trained the algorithms using 6 months of data containing 7,966,257 sessions of 31,637,239 clicks on 37,483 items and tested them on the sessions of the next day.

In the subsequent sections, we then report the results of our comparison using the sliding-window validation scheme described above, with recommendation list lengths varying from 1 to 20. In all experiments, we tuned the parameters for the different algorithms using grid search, optimizing for HR@20 on validation sets (subsets of the training sets). gru4rec was only optimized with 100 hidden units, as done in [16], due to the computational complexity of the method. To test for statistical significance, we use the Wilcoxon signed-rank test with α = 0.05.

4.1.1 Results Using the Original Evaluation Setup. Table 2 shows the results, ordered by hit rate (HR@20), when using the original setup. We could reproduce the hit rate and MRR results from [16] (using their optimal parameters) for gru4rec(1000,bpr) and gru4rec(1000,top1), which use 1000 hidden units and the TOP1 and BPR's pairwise ranking loss functions, respectively. In Table 2, we additionally report the results for recommendation list length ten, which might be more important for some application domains.

Table 2: Results when using the evaluation scheme of [16].

Method               HR@10   MRR@10   HR@20   MRR@20
sr                   0.568   0.290    0.672   0.297
tknn                 0.545   0.251    0.670   0.260
ar                   0.543   0.273    0.655   0.280
knn                  0.521   0.242    0.641   0.250
gru4rec(1000,bpr)    0.517   0.235    0.636   0.243
gru4rec(1000,top1)   0.517   0.261    0.623   0.268

The best accuracy results were achieved by the sr method, both for the hit rate and the MRR and for both list lengths. In terms of the hit rate, every single frequent pattern method used in the experiment was better than the gru4rec methods. A similar observation can be made for the MRR, with the exception that the knn-based methods consistently performed worse than the gru4rec(1000,top1) method on this measure.

4.1.2 E-commerce Datasets. Figure 1 shows the MRR results for the algorithms on the two e-commerce datasets, RSC and TMall. For both datasets, we can observe that most of the frequent pattern methods lead to higher or at least similar MRR values as gru4rec. There is, however, no clear "winner" across both datasets. The sr method works best for the RSC dataset. On the TMall dataset, the knn method outperforms the others, an effect which might be caused by the longer session lengths for this dataset.4 In both cases, however, the difference between the winning method and the best-performing gru4rec configuration is statistically significant. This is indicated by a star symbol in Figure 1.

[Figure omitted: MRR@1 to MRR@20 curves for sr, ar, knn, tknn, gru4rec(1000,top1), and gru4rec(1000,bpr) on the RSC (top) and TMall (bottom) datasets.]
Figure 1: MRR results for the e-commerce datasets (* indicates statistical significance).

4.1.3 Listening Logs Datasets. Figure 2 shows the accuracy performance of the algorithms on the two selected listening logs datasets.

4 Remember that the sessions of the TMall dataset cover the events of one day, as the time stamps in this dataset are given only at the granularity of days.
[Figure omitted: MRR@1 to MRR@20 curves for sr, ar, knn, tknn, gru4rec(100,top1), and gru4rec(100,bpr) on the #nowplaying (top) and 30Music (bottom) datasets.]
Figure 2: MRR results for the listening log datasets.
Similar to the e-commerce datasets, in all measurements a frequent pattern approach, namely the sr method, outperforms gru4rec. Here again, for MRR@20, the recommendations of sr are significantly more accurate than the recommendations of gru4rec. Note that on the music datasets we apply gru4rec(100,top1) and gru4rec(100,bpr), which use 100 hidden units and the TOP1 and BPR's pairwise ranking loss function, respectively.5

The tknn method – the time-aware extension of knn – always works significantly better than the knn method on the listening logs datasets. tknn also outperforms both gru4rec configurations on the #nowplaying dataset for list lengths larger than 1.

Another observation on the listening logs datasets is that the sequence-based approaches (sr, tknn, and gru4rec) work significantly better than the methods that do not consider the temporal information in the data (knn and ar).

5 Repeating the experiments with 1000 hidden units for the gru4rec methods did not lead to any better results on the music datasets.
4.1.4 Playlists Datasets. Figure 3 shows the MRR results of the algorithms on the playlists datasets. On both datasets, the temporal extension of kNN, tknn, leads to the best results across all recommendation list sizes and significantly outperforms both variants of gru4rec. The performance of all other methods, however, seems to depend on the specifics of the dataset. The sr method works well on both datasets. The relative performance of the ar method, however, depends on the dataset and the list length at which the measurement is made.

[Figure omitted: MRR@1 to MRR@20 curves for sr, ar, knn, tknn, gru4rec(100,top1), and gru4rec(100,bpr) on the AotM (top) and 8tracks (bottom) datasets.]
Figure 3: MRR results for the playlist datasets.
One interesting observation that we made for the music datasets is that the relative performance of knn strongly improves in terms of the hit rate6 when the recommendation list length is increased. This can, for example, be seen in Figure 4, which shows the hit rate results for the #nowplaying dataset. The hit rate of knn on the #nowplaying dataset, which is about 3% for list length one, increases to 24% for list length 20. At the same time, the hit rate of some of the other methods only slightly increases, e.g., from 6% to 15%. As a result, across all four investigated music datasets, knn outperforms all other algorithms in terms of HR@20. A similar trend can also be seen for ar, the other non-sequential approach.

[Figure omitted: HR@1 to HR@20 curves for sr, ar, knn, tknn, gru4rec(100,top1), and gru4rec(100,bpr) on the #nowplaying dataset.]
Figure 4: HR results for the #nowplaying dataset.

6 Generally, the hit rate results for the experiments, which we do not include here for space reasons, are similar to the MRR results.
4.1.5 Aggregated Ranking of Algorithms. To determine the ranking of the different algorithms based on their accuracy results (MRR@20) across all six datasets, we applied the Borda Count (BC) rank aggregation strategy [10]. The results show that sr and tknn are both ranked first (30 points), followed by ar as the second-best algorithm (20 points). The gru4rec method with TOP1 ranking loss is ranked third (18 points). Finally, knn and gru4rec with BPR ranking loss are ranked fourth (15 points) and fifth (13 points), respectively.
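Borda Count simply sums rank-based points per dataset. A minimal sketch (our naming, assuming the common convention that with n algorithms the top-ranked one receives n points, the second n-1, and so on):

```python
def borda_count(rankings):
    """Aggregate per-dataset rankings: with n algorithms, rank 1 earns
    n points, rank 2 earns n-1, etc.; points are summed over datasets."""
    points = {}
    for ranking in rankings:          # one ordered list of algorithms per dataset
        n = len(ranking)
        for rank, algo in enumerate(ranking):
            points[algo] = points.get(algo, 0) + (n - rank)
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)
```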
4.1.6 Hybrid Approaches. We conducted a variety of additional experiments with different hybridization methods, as described in Section 3.1.6, to analyze the effect of combining the algorithms. In general, a weighted combination of the two normalized prediction scores of a neighborhood-based and a sequence-based method led to the best results in our experiments. For instance, the combination of knn and sr with a weight ratio of 3 to 7, wh(knn,sr:0.3,0.7), outperformed all other individual algorithms on the 30Music dataset. Another example is combining the normalized scores of knn and gru4rec(100,top1), which can outperform the other algorithms in terms of HR@20. The differences between the winning hybrid approaches in Table 3 and the best-performing individual methods in each measurement were statistically significant. Similar results were also achieved for the other datasets, which we do not include here for space reasons.

Table 3: Results of the hybrid methods for 30Music.

Method                HR@5    MRR@5   HR@20   MRR@20
sr                    0.285   0.233   0.332   0.238
knn                   0.142   0.069   0.342   0.089
gru                   0.275   0.222   0.315   0.226
wh(knn,sr:0.3,0.7)    0.298   0.243   0.386   0.252
wh(knn,gru:0.6,0.4)   0.261   0.144   0.396   0.159
4.2 Additional Analyses
Since prediction accuracy might not be the only relevant quality criterion in a domain [19], we made a number of additional analyses, as shown in Figure 5.

4.2.1 Popularity Bias and Catalog Coverage. As in [22], we first measured the average popularity of the top-20 recommendations of the algorithms to assess possible recommendation biases. The popularity of an item is computed based on its number of occurrences in the training dataset. Overall, the recommendations of the non-sequential approaches (knn and ar) show the highest bias towards popular items. The sequence-based approaches (sr and gru4rec), in contrast, recommend comparably less popular items.

Additionally, we analyzed the catalog coverage of each algorithm by counting the number of different items that appear in the top-20 recommendation lists of all sessions in the test set. Overall, the recommendation lists of gru4rec and sr include more different items than those of the other algorithms. The recommendations of the neighborhood methods, knn and tknn, on the other hand, focus on smaller sets of items and show a higher concentration bias. This can be explained by the sampling strategy of knn, which focuses on a smaller subset of the sessions, e.g., those of the last few days.

[Figure omitted: bar charts of Popularity@20 (for the 8tracks, #nowplaying, and RSC datasets) and Coverage@20 (for the AotM, #nowplaying, and TMall datasets) for sr, ar, knn, tknn, gru4rec(top1), and gru4rec(bpr).]
Figure 5: Popularity biases and catalog coverages of the algorithms on three sample datasets.
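The two measures can be sketched as follows (our naming; normalizing item popularity by the maximum occurrence count in the training data is our assumption):

```python
def popularity_and_coverage(rec_lists, train_counts, cutoff=20):
    """Average popularity of the recommended items and catalog coverage,
    i.e., the number of distinct items appearing in any top-N list."""
    max_count = max(train_counts.values())
    pop_sum, n_lists, covered = 0.0, 0, set()
    for recs in rec_lists:                    # one recommendation list per session
        top = recs[:cutoff]                   # assumed non-empty
        covered.update(top)
        pop_sum += sum(train_counts.get(i, 0) for i in top) / (max_count * len(top))
        n_lists += 1
    return pop_sum / n_lists, len(covered)    # (Popularity@N, Coverage@N)
```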
4.2.2 Computational Complexity and Memory Usage. We measured the training time as well as the memory and time needed to generate recommendations for each algorithm. On a desktop computer with an Intel i7-4790k processor, training gru4rec on one split of the RSC dataset with almost 4 million sessions and in its best configuration takes more than 12 hours. This time can be reduced to 4 hours when the calculations are performed on the GPU (Nvidia GeForce GTX 960).7 The knn method needs about 27 seconds to build the needed in-memory maps (see Section 3.1.2). The well-performing sr method needs about 48 seconds to determine the rule weights. A specific advantage of the latter two methods is that they support incremental updates, i.e., new events can be immediately incorporated into the algorithms. Creating one recommendation list with gru4rec needed, on average, about 12 ms; knn needs about 26 ms for this task and sr only 3 ms.

The raw data used for training the algorithms in this specific experiment (one split of the RSC dataset) occupies about 540 MB of main memory. The data structures used for training sr and knn occupy about 50 MB and 3.2 GB, respectively. The model created by gru4rec needs about 510 MB. Note that the memory demand of gru4rec depends on the algorithm parameters and significantly increases with the number of items. For the music and TMall datasets, the memory demand of gru4rec exceeded the capacity of our graphics card, and running gru4rec on the CPU is multiple times slower than when a graphics card is used.

7 Training the model on 6 months of data using the GPU takes about 8 hours.

5 CONCLUSION AND FUTURE WORKS
Our work indicates that comparably simple frequent-pattern-based approaches can represent strong baselines when evaluating session-based recommendation problems. In the end, for each dataset we could find at least one pattern-based approach that was significantly better than a recent RNN-based method. In particular, the sr method was surprisingly effective, while both learning and applying the rules is very fast.

Our results also indicate that the "winning" strategy seems to depend strongly on the characteristics of the datasets, such as the average session lengths or repetition rates. Further research is still required to understand this relationship. In our future work, we will investigate the performance of additional session-based algorithms. These include both algorithms that are based on Markov models, e.g., Rendle et al.'s factorized Markov chains [34], as well as recently proposed improvements to gru4rec, e.g., by Tan et al. [38]. We expect that continuously improved RNN-based methods will be able to outperform the frequent-pattern-based baselines used in the evaluation reported in this paper. These methods can, however, be computationally quite expensive. From a practical perspective, one therefore has to assess, depending on the application domain, whether the obtained gains in accuracy justify the use of these complex models, which cannot be easily updated online and whose predictions can be difficult to explain.
REFERENCES
[1] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. 1993. Mining Association Rules between Sets of Items in Large Databases. In SIGMOD '93. 207–216.
[2] Rakesh Agrawal and Ramakrishnan Srikant. 1995. Mining Sequential Patterns. In ICDE '95. 3–14.
[3] Trapit Bansal, David Belanger, and Andrew McCallum. 2016. Ask the GRU: Multi-task Learning for Deep Text Recommendations. In RecSys '16. 107–114.
[4] Geoffray Bonnin and Dietmar Jannach. 2014. Automated Generation of Music Playlists: Survey and Experiments. ACM Computing Surveys 47, 2 (2014), 26:1–26:35.
[5] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR abs/1412.3555 (2014).
[6] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In RecSys '16. 191–198.
[7] Sander Dieleman. 2016. Deep Learning for Audio-Based Music Recommendation. In DLRS '16 Workshop. 1–1.
[8] Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems. In WWW '15. 278–288.
[9] Jeffrey L. Elman. 1990. Finding Structure in Time. Cognitive Science 14, 2 (1990), 179–211.
[10] Peter Emerson. 2013. The Original Borda Count and Partial Voting. Social Choice and Welfare 40, 2 (2013), 353–358.
[11] Alex Graves. 2013. Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013). http://arxiv.org/abs/1308.0850
[12] Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing Machines. CoRR abs/1410.5401 (2014).
[13] Yupeng Gu, Bo Zhao, David Hardtke, and Yizhou Sun. 2016. Learning Global Term Weights for Content-based Recommender Systems. In WWW '16. 391–400.
[14] Jiawei Han and Micheline Kamber. 2006. Data Mining: Concepts and Techniques (Second Edition). Morgan Kaufmann.
[15] Negar Hariri, Bamshad Mobasher, and Robin Burke. 2012. Context-Aware Music Recommendation Based on Latent Topic Sequential Patterns. In RecSys '12. 131–138.
[16] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based Recommendations with Recurrent Neural Networks. CoRR abs/1511.06939 (2015).
[17] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. 2016. Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations. In RecSys '16. 241–248.
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (Nov. 1997), 1735–1780.
[19] Dietmar Jannach and Gedas Adomavicius. 2016. Recommendations with a Purpose. In RecSys '16. 7–10.
[20] Dietmar Jannach, Lukas Lerche, and Michael Jugovac. 2015. Adaptation and Evaluation of Recommendations for Short-term Shopping Goals. In RecSys '15. 211–218.
[21] Dietmar Jannach, Lukas Lerche, and Iman Kamehkhosh. 2015. Beyond "Hitting the Hits": Generating Coherent Music Playlist Continuations with the Right Tracks. In RecSys '15. 187–194.
[22] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What Recommenders Recommend: An Analysis of Recommendation Biases and Possible Countermeasures. User Modeling and User-Adapted Interaction (2015), 1–65.
[23] Dietmar Jannach and Malte Ludewig. 2017. Determining Characteristics of Successful Recommendations from Log Data – A Case Study. In SAC '17.
[24] Dietmar Jannach and Malte Ludewig. 2017. When Recurrent Neural Networks Meet the Neighborhood for Session-Based Recommendation. In RecSys '17. (forthcoming).
[25] Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. 2016. Convolutional Matrix Factorization for Document Context-Aware Recommendation. In RecSys '16. 233–240.
[26] Lukas Lerche, Dietmar Jannach, and Malte Ludewig. 2016. On the Value of Reminders within E-Commerce Recommendations. In UMAP '16. 27–35.
[27] Sheng Li, Jaya Kawale, and Yun Fu. 2015. Deep Collaborative Filtering via Marginalized Denoising Auto-encoder. In CIKM '15. 811–820.
[28] Zachary Chase Lipton. 2015. A Critical Review of Recurrent Neural Networks for Sequence Learning. CoRR abs/1506.00019 (2015).
[29] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-Based Recommendations on Styles and Substitutes. In SIGIR '15. 43–52.
[30] Brian McFee and Gert R. G. Lanckriet. 2011. The Natural Language of Playlists. In ISMIR '11. 537–542.
[31] Brian McFee and Gert R. G. Lanckriet. 2012. Hypergraph Models of Playlist Dialects. In ISMIR '12. 343–348.
[32] Bamshad Mobasher, Honghua Dai, Tao Luo, and Miki Nakagawa. 2002. Using Sequential and Non-Sequential Patterns in Predictive Web Usage Mining Tasks. In ICDM '02. 669–672.
[33] Ozlem Ozgobek, Jon A. Gulla, and Riza C. Erdur. 2014. A Survey on Challenges and Methods in News Recommendation. In WEBIST '14. 278–285.
[34] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing Personalized Markov Chains for Next-basket Recommendation. In WWW '10. 811–820.
[35] Guy Shani, David Heckerman, and Ronen I. Brafman. 2005. An MDP-Based Recommender System. The Journal of Machine Learning Research 6 (Dec. 2005), 1265–1295.
[36] Yang Song, Ali Mamdouh Elkahky, and Xiaodong He. 2016. Multi-Rate Deep Learning for Temporal Recommendation. In SIGIR '16. 909–912.
[37] Florian Strub, Romaric Gaudel, and Jérémie Mary. 2016. Hybrid Recommender System based on Autoencoders. In DLRS '16 Workshop. 11–16.
[38] Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved Recurrent Neural Networks for Session-based Recommendations. In DLRS '16 Workshop. 17–22.
[39] Maryam Tavakol and Ulf Brefeld. 2014. Factored MDPs for Detecting Topics of User Sessions. In RecSys '14. 33–40.
[40] Roberto Turrin, Andrea Condorelli, Paolo Cremonesi, Roberto Pagano, and Massimo Quadrana. 2015. Large Scale Music Recommendation. In LSRS '15 Workshop at ACM RecSys.
[41] Roberto Turrin, Massimo Quadrana, Andrea Condorelli, Roberto Pagano, and Paolo Cremonesi. 2015. 30Music Listening and Playlists Dataset. In Poster Proceedings RecSys '15.
[42] Bartłomiej Twardowski. 2016. Modelling Contextual Information in Session-Aware Recommender Systems with Neural Networks. In RecSys '16. 273–276.
[43] Flavian Vasile, Elena Smirnova, and Alexis Conneau. 2016. Meta-Prod2Vec: Product Embeddings Using Side-Information for Recommendation. In RecSys '16. 225–232.
[44] Koen Verstrepen and Bart Goethals. 2014. Unifying Nearest Neighbors Collaborative Filtering. In RecSys '14. 177–184.
[45] Jeroen B. P. Vuurens, Martha Larson, and Arjen P. de Vries. 2016. Exploring Deep Space: Learning Personalized Ranking in a Semantic Space. In DLRS '16 Workshop. 23–28.
[46] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative Deep Learning for Recommender Systems. In KDD '15. 1235–1244.
[47] Ghim-Eng Yap, Xiao-Li Li, and Philip S. Yu. 2012. Effective Next-items Recommendation via Personalized Sequential Pattern Mining. In DASFAA '12. 48–64.
[48] Eva Zangerle, Martin Pichl, Wolfgang Gassler, and Günther Specht. 2014. #Nowplaying Music Dataset: Extracting Listening Behavior from Twitter. In WISMM '14 Workshop at MM '14. 21–26.
[49] Yuyu Zhang, Hanjun Dai, Chang Xu, Jun Feng, Taifeng Wang, Jiang Bian, Bin Wang, and Tie-Yan Liu. 2014. Sequential Click Prediction for Sponsored Search with Recurrent Neural Networks. In AAAI '14. 1369–1375.