=Paper=
{{Paper
|id=Vol-1922/paper10
|storemode=property
|title=A Comparison of Frequent Pattern Techniques and a Deep Learning Method for Session-Based Recommendation
|pdfUrl=https://ceur-ws.org/Vol-1922/paper10.pdf
|volume=Vol-1922
|authors=Iman Kamehkhosh,Dietmar Jannach,Malte Ludewig
|dblpUrl=https://dblp.org/rec/conf/recsys/KamehkhoshJL17
}}
==A Comparison of Frequent Pattern Techniques and a Deep Learning Method for Session-Based Recommendation==
Iman Kamehkhosh (TU Dortmund, iman.kamehkhosh@tu-dortmund.de), Dietmar Jannach (TU Dortmund, dietmar.jannach@tu-dortmund.de), Malte Ludewig (TU Dortmund, malte.ludewig@tu-dortmund.de)

ABSTRACT

Making session-based recommendations, i.e., recommending items solely based on the users’ last interactions without having access to their long-term preference profiles, is a challenging problem in various application fields of recommender systems. Using a coarse classification scheme, the proposed algorithmic approaches to this problem in the research literature can be categorized into frequent pattern mining algorithms and approaches that are based on sequence modeling. In the context of methods of the latter class, recent works suggest the application of recurrent neural networks (RNN) for the problem. However, the lack of established algorithmic baselines for session-based recommendation problems makes the assessment of such novel approaches difficult.

In this work, we therefore compare a state-of-the-art RNN-based approach with a number of (heuristics-based) frequent pattern mining methods, both with respect to the accuracy of their recommendations and with respect to their computational complexity. The results obtained for a variety of different datasets show that in every single case a comparably simple frequent pattern method can be found that outperforms the recent RNN-based method. At the same time, the proposed much simpler methods are also computationally less expensive and can be applied within the narrow time constraints of online recommendation.

CCS CONCEPTS

• General and reference → Evaluation; • Information systems → Recommender systems; • Computing methodologies → Neural networks; Rule learning;

KEYWORDS

Session-Based Recommendations; Deep Learning; Frequent Pattern Mining; Benchmarking

1 INTRODUCTION

Making recommendations solely based on a user’s current session and most recent interactions is a nontrivial problem for recommender systems. On an e-commerce website, for instance, when a visitor is new (or not logged in), there are no long-term user models that can be applied to determine suitable recommendations for this user. Furthermore, recent work shows that considering the user’s short-term intent often has more effect on the accuracy of the recommendations than the choice of the method used to build the long-term user profiles [20]. In general, such types of problems are common on e-commerce sites, e.g., when returning users do not log in every time they use the site. The same challenges can, however, also be observed for other application domains, in particular for news and media (music and video) recommendation [21, 33].

The problem of predicting the next actions of users based solely on their sequence of actions in the current session is referred to in the literature as session-based recommendation. A number of algorithmic approaches have been proposed over the years to deal with the problem. Early academic approaches, for example, rely on the detection of sequential patterns in the session data of a larger user community. In principle, even simpler methods can be applied. Amazon’s “Customers who bought . . . also bought” feature represents an example that relies on simple co-occurrence patterns to generate recommendations, in that case in the context of the very last user interaction (an item view event). A number of later works then explored the use of Markov models [30, 35, 39], and most recently, researchers explored the use of recurrent neural networks (RNN) for the session-based next-item recommendation problem [16, 17, 38, 42].

Today, RNNs can be considered one of the state-of-the-art methods for sequence learning tasks. They have been successfully explored for various sequence-based prediction problems in the past [5, 9, 11, 18], and in a recent work, Hidasi et al. [16] investigated an RNN variant based on gated recurrent units (GRU) for the session-based recommendation problem. In their work, they benchmarked their RNN-based method gru4rec against different baseline methods on two datasets. Their results showed that gru4rec is able to outperform the baseline approaches in terms of accuracy for top-20 recommendation lists.

While these results indicate that RNNs can be successfully applied to the given recommendation task, we argue that the experimental evaluation in [16] does not fully inform us about different aspects of the effectiveness and the practicability of the proposed method. First, regarding the effectiveness, it is unclear if the methods to which gru4rec was compared are competitive. Second, as the evaluation was based on one single training-test split and only used accuracy measures, further investigations are necessary to assess, for example, if some algorithms exhibit certain biases, e.g., to recommend mostly popular items. Third, even if the RNN method is effective, questions regarding the scalability of the method should be discussed, in particular as hyper-parameter optimization for the complex networks can become very challenging in practice.

The goal of this work is to shed light on these questions, and in the remainder of this paper we will report the detailed results of comparing a state-of-the-art RNN-based method with a number of computationally more efficient pattern mining approaches in different dimensions.

Workshop on Temporal Reasoning in Recommender Systems, collocated with ACM RecSys’17, Como, Italy.
Copyright © 2017 for this paper by its authors.
Copying permitted for private and academic purposes.

2 PREVIOUS WORKS

In session-based recommendation problems, we are given a sequence of the most recent actions of a user, and the goal is to find items that are relevant in the context of the user’s specific short-term intent. One traditional way to determine recommendations given a set of recent items of interest is to apply frequent pattern mining techniques, e.g., based on association rules (AR) [1]. AR are often applied for market basket analysis with the goal of finding sets of items that are bought together with some probability [14]. The order of the items or actions in a session is irrelevant for AR-based approaches. Sequential pattern mining (SP) [2] techniques, in contrast, consider the order of the elements in sessions when identifying frequent patterns. In one of the earlier works, Mobasher et al. [32] used frequent pattern mining methods to predict a user’s next navigation action. In another work, Yap et al. [47] propose a sequential-pattern-mining-based next-item recommendation framework, which weights the patterns according to their estimated relevance for the individual user. In the domain of music recommendation, Hariri et al. more recently [15] propose to mine sequential patterns of latent topics based on the tags attached to the tracks to predict the context of the next song.

A different way of finding item-to-item correlations is to look for sessions that are similar to the current one (neighbors) and to determine frequent item co-occurrence patterns that can be used in the prediction phase. Such neighborhood-based approaches were for example applied in the domains of e-commerce and music in [4] or [26]. In some cases and application domains, simple co-occurrence patterns are, despite their simplicity, quite effective, see, e.g., [20, 40] or [44].

Differently from such pattern- and co-occurrence-based techniques, a number of recent approaches are based on sequence modeling using, e.g., Markov models. The main assumption of Markov-model-based approaches in the context of session-based recommendation is that the selection of the next item in a session is dependent on a limited number of previous actions. Shani et al. [35] were among the first who applied first-order Markov chains (MC) to session-based recommendation and showed the superiority of sequential models over non-sequential ones. In the music domain, McFee and Lanckriet [30] proposed a music playlist generation algorithm based on MCs that – given a seed song – selects the next track from uniform and weighted distributions as well as from k-nearest-neighbor graphs. Generally, a main issue when applying Markov chains to session-based recommendation is that the state space quickly becomes unmanageable when all possible sequences of user selections should be considered [12, 16].

More recent approaches to sequence modeling for session-based recommendation utilize recurrent neural networks (RNN). RNNs process sequential data one element at a time and are able to selectively pass information across sequence steps [28]. Zhang et al. [49], for example, successfully applied RNNs to predict advertisement clicks based on the users’ browsing behavior in a sponsored search scenario. For session-based recommendations, Hidasi et al. [16] investigated a customized RNN variant based on gated recurrent units (GRU) [5] to model the users’ transactions within sessions. They also tested several ranking loss functions in their solutions. Later on, in [17] and [42], RNN-based approaches were proposed which leverage additional item features to achieve higher accuracy. For the problem of news recommendation, Song et al. [36] proposed a temporal deep semantic structured model for the combination of long-term static and short-term temporal user preferences. They considered different levels of granularity in their model to process both fast and slow temporal changes in the users’ preferences. In general, neural networks have been used for a number of recommendation-related tasks in recent years. Often, such networks are used to learn embeddings of content features in compact fixed-size latent vectors, e.g., for music, for images, for video data, for documents, or to represent the user [3, 6–8, 13, 25, 29, 46]. These representations are then integrated, e.g., in content-based approaches, in variations of latent factor models, or are part of new methods for computing recommendations [7, 8, 13, 27, 37, 43, 45].

In the work presented in this paper, we will compare different existing and novel pattern-mining-based approaches with a state-of-the-art RNN-based algorithm.

3 EXPERIMENT CONFIGURATIONS

3.1 Algorithms

3.1.1 RNN Baseline. gru4rec is an RNN-based algorithm proposed in [16] that uses gated recurrent units to deal with the vanishing or exploding gradient problem. In our experiments, we used the Python implementation that is shared by the authors online.¹

3.1.2 Session-based kNN – knn. The knn method searches for the k most similar past sessions (“neighbors”) in the training data based on the set of items in the current session. Since the process of determining the neighbor sessions becomes very time-consuming as the number of sessions increases, we use a special in-memory index data structure (cache) in our implementation. Technically, in the training phase, we create a data structure that maps the training sessions to their set of items and one structure that maps the items to the sessions in which they appear. To make recommendations for the current session s, we first create a union of the sessions in which the items of s appear. This union will be the set of possible neighbors of the current session. This is a fast operation as it only involves a cache lookup and set operations. To further reduce the computational complexity of the prediction process, we select a subsample of these possible neighbors using a heuristic. In this work, we took the m most recent sessions, as focusing on recent trends has been shown to be effective for recommendations in e-commerce [23]. We then compute the similarity of these m most recent possible neighbors and the current session and select the k most similar sessions as the neighbor sessions of the current session. Again through lookup and set union operations, we create the set of recommendable items R that contains the items that appear in one of the k sessions. For each recommendable item i in R, we then compute the knn score as the sum of the similarity values of s and its neighbor sessions n ∈ N_s which contain i (Equation 1). The indicator function 1_n(i) returns 1 if n contains i and 0 otherwise, see also [4].

score_knn(i, s) = Σ_{n ∈ N_s} sim(s, n) × 1_n(i)   (1)

In our experiments, we tested different distance measures to determine the similarity of sessions. The best results were achieved when the sessions were encoded as binary vectors of the item space and when using cosine similarity. In our implementation, the set operations, similarity computations, and the final predictions can be done very efficiently, as will be discussed later in Section 4.2.2. Our algorithm has only two parameters: the number of neighbors k and the number of sampled sessions m. For the large e-commerce dataset used in [16], the best results were, for example, achieved with k = 500 and m = 1000. Note that the kNN method used in [16] is based on item-to-item similarities, while our kNN method aims to identify similar sessions.

3.1.3 kNN Temporal Extension – tknn. The knn method, when using cosine similarity as a distance measure, does not consider the temporal sequence of the events in a session. To be able to leverage the temporal information within the knn technique, we designed an additional temporal-filtering heuristic for it. The proposed tknn method uses the same scoring scheme as the knn method (Equation 1). The only difference is that, given the current session s, we consider item i as being recommendable only if it appears in the neighbor session n directly after a certain item. In our implementation, that certain item is the last item of the current session s. Technically, we therefore use a slightly different implementation of the indicator function of Equation 1: 1_n(i) = 1 if neighbor session n contains i and (j, i) is a subsequence of n, where j is the last item of the current session and thus the basis to predict the next item.

3.1.4 Simple Association Rules – ar. To assess the strength of simple two-element co-occurrence patterns, we included a method named ar, which can be considered an association rule technique with a maximum rule size of two. Technically, we create a rule r_{p,q} for every two items p and q that appear together in the training sessions. We determine the weight, w_{p,q}, of each rule simply as the number of times p and q appear together in past sessions. Given the current session s, the ar score of a target item i is then computed as

score_ar(i, s) = w_{i,j} × 1_AR(r_{i,j})   (2)

where j is the last item of the current session s for which we want to predict the successor, and AR is the set of rules and their weights as determined based on the training data. The indicator function 1_AR(r_{i,j}) = 1 when AR contains r_{i,j} and 0 otherwise.

3.1.5 Simple Sequential Rules – sr. The sr method is a variant of ar which aims to take the order of the events into account. Similar to the ar method, we create a sequential rule r_{p,q} for the co-occurrence of every two items p and q in the training data. This time, however, we consider the distance between p and q in the session when computing the weight of the rules. In our implementation, we use the multiplicative inverse as a weight function and set w_{p,q} = 1/x, where x is the number of items that appear between p and q in a session. Other heuristics such as a linear or a logarithmic function can also be used. In case those two items appear together in another session in the training data, the weight of the rule in that session will be added to the current weight. We finally normalize the weight and divide it by the total number of sessions that contributed to the weight. Given the current session s, the sr score of a target item i is then computed as

score_sr(i, s) = w_{j,i} × 1_SR(r_{j,i})   (3)

where j is the last item of session s and SR is the set of sequential rules. The indicator function 1_SR(r_{j,i}) = 1 when SR contains r_{j,i} and 0 otherwise.

3.1.6 Hybrid Approaches. We made additional experiments with several hybrids that combine different algorithms. In the end, a weighted combination of the two normalized prediction scores of the algorithms led to the best results in our experiments.

3.2 Datasets and Evaluation Protocol

We performed experiments using datasets from two different domains in which session-based recommendation is relevant, namely e-commerce and next-track music recommendation. The source code and the public datasets can be found online.²

3.2.1 E-commerce Datasets. For the e-commerce domain, we chose the ACM RecSys 2015 Challenge dataset (RSC) as used in [16]. The RSC dataset is a collection of sequences of click events in shopping sessions. The second e-commerce dataset is a public dataset published for the TMall competition. This dataset contains the shopping logs of the users on the Tmall.com website.

3.2.2 Music Datasets. We used (a) two datasets that contain listening logs of several thousand users and (b) two datasets that comprise thousands of manually created playlists.

Listening logs: The used datasets are (almost one-year-long) sub-samples of two public datasets. First, we created a subset of the #nowplaying dataset [48], which contains music-related tweets on Twitter. Second, we used the recent 30Music dataset [41], which contains listening sessions retrieved from Internet radio stations through the Last.fm API.

Playlists: Generally, music playlists are different in nature from listening logs and e-commerce user logs in various ways. Nonetheless, they are designed to be consumed in a listening session, and the tracks are often arranged in a specific sequence. The used playlist datasets come from two different music platforms. The Art-of-the-Mix dataset (AotM) was published by [31] and contains playlists by music enthusiasts. The 8tracks dataset was shared with us by the 8tracks platform. A particularity of the 8tracks dataset is that each public playlist can only contain two tracks per artist.

Table 1: Dataset characteristics.

            RSC    TMall  #nowplaying  30Music  AotM   8tracks
Sessions    8M     4.6M   95K          170K     83K    500K
Events      32M    46M    1M           2.9M     1.2M   5.8M
Items       38K    620K   115K         450K     140K   600K
Avg. E/S    3.97   9.77   10.37        17.03    14.12  11.40
Avg. I/S    3.17   6.92   9.62         14.20    14.11  11.38

The dataset statistics are shown in Table 1. The total number of sessions is larger for the e-commerce datasets. However, the number of unique items in the music datasets, which corresponds to the number of tracks included in the playlists or the number of played tracks in the listening sessions, is higher than the number of items in the e-commerce datasets.

The music sessions are on average longer than the e-commerce sessions.³ The last row of Table 1 shows the average number of unique items in each session (“Avg. I/S”). Comparing this value with the average session length (“Avg. E/S”) indicates what we call the item repetition rate in each dataset. Including the same track more than once in a playlist is comparably uncommon. Listening to a track more than once during a listening session is, however, common. The difference between the average session length and the average number of items in each session for the e-commerce datasets indicates that re-occurrence of the same item in a session is common in the e-commerce domain.

Table 2: Results when using the evaluation scheme of [16].

Method              HR@10  MRR@10  HR@20  MRR@20
sr                  0.568  0.290   0.672  0.297
tknn                0.545  0.251   0.670  0.260
ar                  0.543  0.273   0.655  0.280
knn                 0.521  0.242   0.641  0.250
gru4rec(1000,bpr)   0.517  0.235   0.636  0.243
gru4rec(1000,top1)  0.517  0.261   0.623  0.268

3.2.3 Evaluation Protocol. The general task of the session-based recommendation techniques in our experiment is to predict the next item view event in a shopping session, or to predict the next track that is played in a listening session or is included in a playlist. To evaluate the session-based algorithms, we use the same evaluation scheme as in [16]. We incrementally add events to the sessions in the test set and report the average hit rate (HR), which corresponds to recall in this evaluation setting, and the mean reciprocal rank (MRR), which takes the position of the hit into account. We tested list lengths of 1, 2, 3, 5, 7, 10, 15, and 20. While the experiments in [16] are done without cross-validation, we additionally apply a fivefold sliding-window validation protocol as in [24] to minimize the risk that the obtained results are specific to the single train-test split. We, therefore, created five train-test splits for each dataset.

¹ https://github.com/hidasib/GRU4Rec
² http://ls13-www.cs.tu-dortmund.de/homepage/rectemp2017
For the listening logs, we used 3 months of training data and the next 5 days as the test data, and randomized splits for the playlists as they have no timestamps assigned.

4 RESULTS

4.1 Accuracy Results

Our first experiment used the exact same setup as described in [16], i.e., we use only one training-test split when comparing gru4rec with our methods. As done in [16], we trained the algorithms using 6 months of data containing 7,966,257 sessions of 31,637,239 clicks on 37,483 items and tested them on the sessions of the next day.

In the subsequent sections, we then report the results of our comparison using the sliding-window validation scheme described above with recommendation list lengths varying from 1 to 20. In all experiments, we tuned the parameters for the different algorithms using grid search and optimized for HR@20 on validation sets (subsets of the training sets). gru4rec was only optimized with 100 layers as done in [16] due to the computational complexity of the method. To test for statistical significance, we use the Wilcoxon signed-rank test with α = 0.05.

4.1.1 Results Using the Original Evaluation Setup. Table 2 shows the results ordered by the hit rate (HR@20) when using the original setup. We could reproduce the hit rate and MRR results from [16] (using their optimal parameters) for gru4rec(1000,bpr) and gru4rec(1000,top1), which use 1000 hidden units and the TOP1 and BPR pairwise ranking loss functions, respectively. In Table 2, we additionally report the results for recommendation list length ten, which might be more important for different application domains.

The best accuracy results were achieved by the sr method, both for the hit rate and MRR and for both list lengths. In terms of the hit rate, every single frequent pattern method used in the experiment was better than the gru4rec methods. A similar observation can also be made for the MRR, with the exception that the knn-based methods consistently performed worse than the gru4rec(1000,top1) method on this measure.

4.1.2 E-commerce Datasets. Figure 1 shows the MRR results for the algorithms on the two e-commerce datasets, RSC and TMall. For both datasets, we can observe that most of the frequent pattern methods lead to higher or at least similar MRR values as gru4rec. There is, however, no clear “winner” across both datasets. The sr method works best for the RSC dataset. On the TMall dataset, the knn method outperforms the others, an effect which might be caused by the longer session lengths in this dataset.⁴ In both cases, however, the difference between the winning method and the best-performing gru4rec configuration is statistically significant. This is indicated by a star symbol in Figure 1.

Figure 1: MRR results for the e-commerce datasets (* indicates statistical significance).

4.1.3 Listening Logs Datasets. Figure 2 shows the accuracy performance of the algorithms on two selected listening logs datasets.

³ Note that each session in the TMall dataset is defined as the sequence of actions of a user during one day, which results in a relatively larger average session length.
⁴ Remember that the sessions of the TMall dataset cover the events of one day, as the time stamps in this dataset are given only in the granularity of days.
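The incremental next-item evaluation scheme described in Section 3.2.3 can be sketched as follows. This is a simplified sketch: `recommend` is an assumed callable that maps a session prefix to a ranked list of items, and each revealed prefix of a test session yields one prediction case.

```python
def evaluate_next_item(test_sessions, recommend, top_n=20):
    # Incrementally reveal each test session: after every prefix, predict
    # the next event and check it against the top-n recommendations.
    hits, rr_sum, n_cases = 0, 0.0, 0
    for session in test_sessions:
        for cut in range(1, len(session)):
            prefix, target = session[:cut], session[cut]
            ranked = recommend(prefix, top_n)
            n_cases += 1
            if target in ranked:
                hits += 1                                   # hit (recall)
                rr_sum += 1.0 / (ranked.index(target) + 1)  # reciprocal rank
    return hits / n_cases, rr_sum / n_cases  # HR@top_n, MRR@top_n
```

With a dummy recommender that always returns ["x", "y"], the two test sessions [["a", "x"], ["a", "y"]] produce reciprocal ranks 1 and 1/2, i.e., HR@20 = 1.0 and MRR@20 = 0.75.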
SR AR KNN TKNN GRU4REC(1000,TOP1) GRU4REC(1000,BPR) 0.12 0.014 # nowplaying AotM * 0.012 * 0.1 0.01 0.08 0.008 0.06 0.006 0.004 0.04 0.002 0.02 0 MRR@1 MRR@2 MRR@3 MRR@5 MRR@7 MRR@10 MRR@15 MRR@20 MRR@1 MRR@2 MRR@3 MRR@5 MRR@7 MRR@10 MRR@15 MRR@20 0.3 SR AR KNN TKNN GRU4REC(100,TOP1) GRU4REC(100,BPR) 0.008 SR AR KNN TKNN GRU4REC(100,TOP1) GRU4REC(100,BPR) 30Music 8tracks * 0.007 0.25 * 0.006 0.2 0.005 0.15 0.004 0.003 0.1 0.002 0.05 0.001 0 0 MRR@1 MRR@2 MRR@3 MRR@5 MRR@7 MRR@10 MRR@15 MRR@20 MRR@1 MRR@2 MRR@3 MRR@5 MRR@7 MRR@10 MRR@15 MRR@20 SR AR KNN TKNN GRU4REC(100,TOP1) GRU4REC(100,BPR) SR AR KNN TKNN GRU4REC(100,TOP1) GRU4REC(100,BPR) Figure 2: MRR results for the listening log datasets. Figure 3: MRR results for the playlist datasets. Similar to the e-commerce datasets, in all measurements, a frequent 0.3 #nowplaying pattern approach, namely the sr method, outperforms gru4rec. 0.25 * Here again, for MRR@20, the recommendations of sr are sig- 0.2 nificantly more accurate than the recommendations of gru4rec. 0.15 Note that on the music datasets, we apply gru4rec(100,top1) and 0.1 gru4rec(100,bpr), which use 100 hidden units and the TOP1 and BPR’s pairwise ranking loss function, respectively.5 0.05 The tknn method – the time-aware extension of knn– works 0 always significantly better than the knn method on the listening HR@1 HR@2 HR@3 HR@5 HR@7 HR@10 HR@15 HR@20 logs datasets. tknn also outperforms both gru4rec configurations SR AR KNN TKNN GRU4REC(100,TOP1) GRU4REC(100,BPR) on the #nowplaying dataset for list lengths larger than 1. Another observation on the listening logs datasets is that the Figure 4: HR results for the #nowplaying dataset. sequence-based approaches (sr, tknn and gru4rec) work signif- to 24% for list length 20. At the same time, the hit rate of some of icantly better than methods that do not consider the temporal the other methods only slightly increases, e.g., from 6% to 15%. As a information in data (knn and ar). 
result, across all four investigated music datasets, knn outperforms 4.1.4 Playlists Datasets. Figure 3 shows the MRR results of the all other algorithms in terms of HR@20. A similar trend can also algorithms on the playlists datasets. On both datasets, the temporal be seen for ar, the other non-sequential approach. extension of kNN, tknn, leads to the best results across all recom- 4.1.5 Aggregated Ranking of Algorithms. To determine the rank- mendation list sizes and significantly outperforms both variants of ing of different algorithms based on their accuracy results (MRR@20) gru4rec. The performance of all other methods, however, seems across all six datasets, we applied the Borda Count (BC) rank aggre- to depend on the specifics of the datset. The sr method works gation strategy [10]. The results show that sr and tknn are both well on both datasets. The relative performance of the ar method, ranked first (30 points), followed by ar as the second best algorithm however, depends on the dataset and the list length at which the (20 points). The gru4rec method with TOP1 ranking loss is ranked measurement is made. third (18 points). Finally, knn and gru4rec with BPR ranking loss One interesting observation that we made for the music datasets are ranked fourth (15 points) and fifth (13 points), respectively. is that the relative performance of knn strongly improves in terms of the hit rate6 when the recommendation list length is increased. 4.1.6 Hybrid Approaches. We conducted a variety of additional This can, for example, be seen in Figure 4, which shows the hit rate experiments with different hybridization methods as described in results for the #nowplaying dataset. The hit rate of knn on the Section 3.1.6 to analyze the effect of combining the algorithms. 
… #nowplaying dataset that is about 3% for list length one increases …

In general, a weighted combination of the two normalized prediction scores of a neighborhood-based and a sequence-based method led to the best results in our experiments. For instance, the combination of knn and sr with a weight ratio of 3 to 7, wh(knn,sr:0.3,0.7), outperformed all other individual algorithms on the 30Music dataset. Another example is the combination of the normalized scores of knn and gru4rec(100,top1), which outperforms all other algorithms in terms of HR@20. The differences between the winning hybrid approaches (printed in bold face in Table 3) and the best-performing individual methods on each measure were statistically significant. Similar results were also achieved for the other datasets, which we do not include here for space reasons.

Footnote 5: Repeating the experiments with 1000 hidden units for the gru4rec methods did not lead to any better results on the music datasets.
Footnote 6: Generally, the hit rate results for the experiments, which we do not include here for space reasons, are similar to the MRR results.

Table 3: Results of the hybrid methods for 30Music.

Method               HR@5   MRR@5  HR@20  MRR@20
sr                   0.285  0.233  0.332  0.238
knn                  0.142  0.069  0.342  0.089
gru                  0.275  0.222  0.315  0.226
wh(knn,sr:0.3,0.7)   0.298  0.243  0.386  0.252
wh(knn,gru:0.6,0.4)  0.261  0.144  0.396  0.159

4.2 Additional Analyses

Since prediction accuracy might not be the only relevant quality criterion in a domain [19], we made a number of additional analyses, as shown in Figure 5.

Figure 5: Popularity biases and catalog coverages of the algorithms on three sample datasets. [Bar charts: Popularity@20 for 8tracks, #nowplaying, and RSC; Coverage@20 for AotM, #nowplaying, and TMall; bars for SR, AR, KNN, TKNN, GRU4REC(TOP1), and GRU4REC(BPR).]

4.2.1 Popularity Bias and Catalog Coverage. As in [22], we first measured the average popularity of the top-20 recommendations of the algorithms to assess possible recommendation biases. The popularity of an item is computed based on its number of occurrences in the training dataset. Overall, the recommendations of the non-sequential approaches (knn and ar) show the highest bias towards popular items. The sequence-based approaches (sr and gru4rec), in contrast, recommend comparably less popular items.

Additionally, we analyzed the catalog coverage of each algorithm by counting the number of different items that appear in the top-20 recommendation lists of all sessions in the test set. Overall, the recommendation lists of gru4rec and sr include more different items than those of the other algorithms. The recommendations of the neighborhood-based methods, knn and tknn, on the other hand, focus on smaller sets of items and show a higher concentration bias. This can be explained by the sampling strategy of knn, which focuses on a smaller subset of the sessions, e.g., those of the last few days.

4.2.2 Computational Complexity and Memory Usage. We measured the training time as well as the memory and the time needed to generate recommendations for each algorithm. On a desktop computer with an Intel i7-4790k processor, training gru4rec in its best configuration on one split of the RSC dataset with almost 4 million sessions takes more than 12 hours. This time can be reduced to 4 hours when the calculations are performed on the GPU (Nvidia GeForce GTX 960). The knn method needs about 27 seconds to build the needed in-memory maps (see Section 3.1.2), and the well-performing sr method needs about 48 seconds to determine the rule weights. A specific advantage of these two methods is that they support incremental updates, i.e., new events can be immediately incorporated into the algorithms. Creating one recommendation list with gru4rec needed, on average, about 12 ms; knn needs about 26 ms for this task, and sr only 3 ms.

The raw data used for training the algorithms in this specific experiment (one split of the RSC dataset) occupies about 540 MB of main memory. The data structures used for training sr and knn occupy about 50 MB and 3.2 GB, respectively. The model created by gru4rec needs about 510 MB. Note that the memory demand of gru4rec depends on the algorithm parameters and increases significantly with the number of items. For the music and TMall datasets, the memory demand of gru4rec exceeded the capacity of our graphics card, and running gru4rec on the CPU is multiple times slower than with a graphics card.

5 CONCLUSION AND FUTURE WORKS

Our work indicates that comparably simple frequent-pattern-based approaches can represent a strong baseline when evaluating session-based recommendation problems. In the end, we could identify at least one pattern-based approach that was significantly better than a recent RNN-based method. In particular, the sr method was surprisingly effective, even though both learning and applying its rules are very fast.

Our results also indicate that the "winning" strategy seems to depend strongly on the characteristics of the datasets, such as average session lengths or repetition rates. Further research is still required to understand these relationships. In our future work, we will investigate the performance of additional session-based algorithms, including both approaches based on Markov models, e.g., Rendle et al.'s factorized Markov chains [34], and recently proposed improvements to gru4rec, e.g., by Tan et al. [38]. We expect that continuously improved RNN-based methods will be able to outperform the frequent-pattern-based baselines used in the evaluation reported in this paper. These methods can, however, be computationally quite expensive. From a practical perspective, one therefore has to assess, depending on the application domain, whether the obtained gains in accuracy justify the usage of these complex models, which cannot be easily updated online and whose predictions can be difficult to explain.
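The fast training and incremental-update property attributed to sr above can be illustrated with a minimal sequential-rules sketch. This is our own sketch under assumptions: the paper does not give the exact weighting scheme here, and the 1/distance weight for each item pair observed in a session, as well as all class and method names, are illustrative.

```python
from collections import defaultdict

# Minimal sketch of a sequential-rules recommender (assumed scheme, not the
# paper's exact sr implementation). Each pair (x, y), with y occurring after
# x in a session, adds a weight of 1/distance to the rule x -> y.

class SequentialRules:
    def __init__(self):
        # item -> {successor item -> accumulated rule weight}
        self.rules = defaultdict(lambda: defaultdict(float))

    def update(self, session):
        """Incrementally fold one session (a list of item ids) into the
        rules; new events need no retraining, just another update()."""
        for i, x in enumerate(session):
            for j in range(i + 1, len(session)):
                self.rules[x][session[j]] += 1.0 / (j - i)

    def recommend(self, current_item, k=20):
        """Top-k successors of the session's last item by rule weight."""
        successors = self.rules.get(current_item, {})
        return [item for item, _ in
                sorted(successors.items(), key=lambda t: -t[1])[:k]]
```

Both learning (one pass over the sessions) and prediction (one sort over a single item's successors) are cheap, which is consistent with the training and per-recommendation times reported above for rule-based methods.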
Footnote 7: Training the model for 6 months of data using the GPU lasts about 8 hours.

REFERENCES

[1] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. 1993. Mining Association Rules between Sets of Items in Large Databases. In SIGMOD ’93. 207–216.
[2] Rakesh Agrawal and Ramakrishnan Srikant. 1995. Mining Sequential Patterns. In ICDE ’95. 3–14.
[3] Trapit Bansal, David Belanger, and Andrew McCallum. 2016. Ask the GRU: Multi-task Learning for Deep Text Recommendations. In RecSys ’16. 107–114.
[4] Geoffray Bonnin and Dietmar Jannach. 2014. Automated Generation of Music Playlists: Survey and Experiments. ACM Computing Surveys 47, 2 (2014), 26:1–26:35.
[5] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR abs/1412.3555 (2014).
[6] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In RecSys ’16. 191–198.
[7] Sander Dieleman. 2016. Deep Learning for Audio-Based Music Recommendation. In DLRS ’16 Workshop. 1–1.
[8] Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems. In WWW ’15. 278–288.
[9] Jeffrey L. Elman. 1990. Finding Structure in Time. Cognitive Science 14, 2 (1990), 179–211.
[10] Peter Emerson. 2013. The Original Borda Count and Partial Voting. Social Choice and Welfare 40, 2 (2013), 353–358.
[11] Alex Graves. 2013. Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013). http://arxiv.org/abs/1308.0850
[12] Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing Machines. CoRR abs/1410.5401 (2014).
[13] Yupeng Gu, Bo Zhao, David Hardtke, and Yizhou Sun. 2016. Learning Global Term Weights for Content-based Recommender Systems. In WWW ’16. 391–400.
[14] Jiawei Han and Micheline Kamber. 2006. Data Mining: Concepts and Techniques (Second Edition). Morgan Kaufmann.
[15] Negar Hariri, Bamshad Mobasher, and Robin Burke. 2012. Context-Aware Music Recommendation Based on Latent Topic Sequential Patterns. In RecSys ’12. 131–138.
[16] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based Recommendations with Recurrent Neural Networks. CoRR abs/1511.06939 (2015).
[17] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. 2016. Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations. In RecSys ’16. 241–248.
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (Nov. 1997), 1735–1780.
[19] Dietmar Jannach and Gedas Adomavicius. 2016. Recommendations with a Purpose. In RecSys ’16. 7–10.
[20] Dietmar Jannach, Lukas Lerche, and Michael Jugovac. 2015. Adaptation and Evaluation of Recommendations for Short-term Shopping Goals. In RecSys ’15. 211–218.
[21] Dietmar Jannach, Lukas Lerche, and Iman Kamehkhosh. 2015. Beyond “Hitting the Hits”: Generating Coherent Music Playlist Continuations with the Right Tracks. In RecSys ’15. 187–194.
[22] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What Recommenders Recommend: An Analysis of Recommendation Biases and Possible Countermeasures. User Modeling and User-Adapted Interaction (2015), 1–65.
[23] Dietmar Jannach and Malte Ludewig. 2017. Determining Characteristics of Successful Recommendations from Log Data – A Case Study. In SAC ’17.
[24] Dietmar Jannach and Malte Ludewig. 2017. When Recurrent Neural Networks meet the Neighborhood for Session-Based Recommendation. In RecSys ’17. (forthcoming).
[25] Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. 2016. Convolutional Matrix Factorization for Document Context-Aware Recommendation. In RecSys ’16. 233–240.
[26] Lukas Lerche, Dietmar Jannach, and Malte Ludewig. 2016. On the Value of Reminders within E-Commerce Recommendations. In UMAP ’16. 27–25.
[27] Sheng Li, Jaya Kawale, and Yun Fu. 2015. Deep Collaborative Filtering via Marginalized Denoising Auto-encoder. In CIKM ’15. 811–820.
[28] Zachary Chase Lipton. 2015. A Critical Review of Recurrent Neural Networks for Sequence Learning. CoRR abs/1506.00019 (2015).
[29] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-Based Recommendations on Styles and Substitutes. In SIGIR ’15. 43–52.
[30] Brian McFee and Gert R. G. Lanckriet. 2011. The Natural Language of Playlists. In ISMIR ’11. 537–542.
[31] Brian McFee and Gert R. G. Lanckriet. 2012. Hypergraph Models of Playlist Dialects. In ISMIR ’12. 343–348.
[32] Bamshad Mobasher, Honghua Dai, Tao Luo, and Miki Nakagawa. 2002. Using Sequential and Non-Sequential Patterns in Predictive Web Usage Mining Tasks. In ICDM ’02. 669–672.
[33] Ozlem Ozgobek, Jon A. Gulla, and Riza C. Erdur. 2014. A Survey on Challenges and Methods in News Recommendation. In WEBIST ’14. 278–285.
[34] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing Personalized Markov Chains for Next-basket Recommendation. In WWW ’10. 811–820.
[35] Guy Shani, David Heckerman, and Ronen I. Brafman. 2005. An MDP-Based Recommender System. The Journal of Machine Learning Research 6 (Dec. 2005), 1265–1295.
[36] Yang Song, Ali Mamdouh Elkahky, and Xiaodong He. 2016. Multi-Rate Deep Learning for Temporal Recommendation. In SIGIR ’16. 909–912.
[37] Florian Strub, Romaric Gaudel, and Jérémie Mary. 2016. Hybrid Recommender System based on Autoencoders. In DLRS ’16 Workshop. 11–16.
[38] Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved Recurrent Neural Networks for Session-based Recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS ’16). ACM, 17–22.
[39] Maryam Tavakol and Ulf Brefeld. 2014. Factored MDPs for Detecting Topics of User Sessions. In RecSys ’14. 33–40.
[40] Roberto Turrin, Andrea Condorelli, Paolo Cremonesi, Roberto Pagano, and Massimo Quadrana. 2015. Large Scale Music Recommendation. In LSRS 2015 Workshop at ACM RecSys.
[41] Roberto Turrin, Massimo Quadrana, Andrea Condorelli, Roberto Pagano, and Paolo Cremonesi. 2015. 30Music Listening and Playlists Dataset. In Poster Proceedings RecSys ’15.
[42] Bartłomiej Twardowski. 2016. Modelling Contextual Information in Session-Aware Recommender Systems with Neural Networks. In RecSys ’16. 273–276.
[43] Flavian Vasile, Elena Smirnova, and Alexis Conneau. 2016. Meta-Prod2Vec: Product Embeddings Using Side-Information for Recommendation. In RecSys ’16. 225–232.
[44] Koen Verstrepen and Bart Goethals. 2014. Unifying Nearest Neighbors Collaborative Filtering. In RecSys ’14. 177–184.
[45] Jeroen B. P. Vuurens, Martha Larson, and Arjen P. de Vries. 2016. Exploring Deep Space: Learning Personalized Ranking in a Semantic Space. In DLRS ’16 Workshop. 23–28.
[46] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative Deep Learning for Recommender Systems. In KDD ’15. 1235–1244.
[47] Ghim-Eng Yap, Xiao-Li Li, and Philip S. Yu. 2012. Effective Next-items Recommendation via Personalized Sequential Pattern Mining. In DASFAA ’12. Berlin, Heidelberg, 48–64.
[48] Eva Zangerle, Martin Pichl, Wolfgang Gassler, and Günther Specht. 2014. #Nowplaying Music Dataset: Extracting Listening Behavior from Twitter. In WISMM ’14 Workshop at MM ’14. 21–26.
[49] Yuyu Zhang, Hanjun Dai, Chang Xu, Jun Feng, Taifeng Wang, Jiang Bian, Bin Wang, and Tie-Yan Liu. 2014. Sequential Click Prediction for Sponsored Search with Recurrent Neural Networks. In AAAI ’14. 1369–1375.