=Paper=
{{Paper
|id=Vol-1922/paper6
|storemode=property
|title=Online Ranking Prediction in Non-stationary Environments
|pdfUrl=https://ceur-ws.org/Vol-1922/paper6.pdf
|volume=Vol-1922
|authors=Erzsébet Frigó,Róbert Pálovics,Domokos Kelen,Levente Kocsis,András A. Benczúr
|dblpUrl=https://dblp.org/rec/conf/recsys/FrigoPKKB17a
}}
==Online Ranking Prediction in Non-stationary Environments==
Erzsébet Frigó, Róbert Pálovics, Domokos Kelen, Levente Kocsis, András A. Benczúr
Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)
{rpalovics, kelen.domokos.miklos, kocsis, benczur}@sztaki.hu

ABSTRACT
Recommender systems have to serve in online environments that can be highly non-stationary.¹ Traditional recommender algorithms may periodically rebuild their models, but they cannot adjust to quick changes in trends caused by timely information. In our experiments, we observe that even a simple but online trained recommender model can perform significantly better than its batch version. We investigate online learning based recommender algorithms that can efficiently handle non-stationary data sets. We evaluate our models over seven publicly available data sets. Our experiments are available as an open source project².

¹ Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.
² https://github.com/rpalovics/Alpenglow

1 INTRODUCTION
The research of recommender systems became popular with the Netflix Prize [4]. As an effect of the competition, batch rating prediction is considered the standard recommender evaluation task, with one part of the data used for model training and the other for evaluation. In contrast to the Netflix Prize task, recommender systems are typically not required to predict ratings; rather, they are required to present a ranked top list of relevant items for the user. Also, most users give no explicit ratings, and we have to infer their preferences from implicit feedback [17]. Most importantly, users request one or a few items at a time, and may get exposed to new information that can change their needs and taste before they return to the service the next time. In a real application, top item recommendation by online learning is therefore usually more relevant than batch rating prediction. There is vast literature on the batch performance of collaborative filtering algorithms, while the more realistic sequential or online evaluation has received much less attention. Information on recommendation performance is scarce when collaborative filtering is evaluated online, sequentially.
In this work we consider top recommendation in highly non-stationary environments, similarly to the conceptually simpler classification task [22]. Our goal is to promptly update recommender models after user interactions using online learning methods.
There are some indications [19] that the performance gains in batch experiments do not necessarily carry over to online learning environments. One result of the Netflix Prize is that matrix factorization [18] techniques provide strong baseline results on non-temporal data sets. In our work we compare the online and batch versions of the same matrix factorization algorithm. Due to the highly non-stationary properties of the data sets, we apply incremental stochastic gradient descent (SGD) based matrix factorization. We show that the online version strongly outperforms the batch version on non-stationary data sets.
In [23], the online DCG measure is defined as a side result to track the performance of recommender algorithms over time. In our present experiments, we heavily rely on online DCG, introduce another measure, the online MRR, and show their importance for evaluating streaming recommenders. Summarized, our main results are the following:
• We provide measures suited to non-stationary environments that are predictive of which algorithm would work well on a particular data set.
• We show that simpler algorithms that can be updated online—and therefore use the most recent data as well—often perform better than more complex algorithms that can only be updated periodically. This is in contrast to the case when the testing is batch or the data is stationary.
• We also show that even though initialization by large batch models may be required for optimal performance, online matrix factorization combined with a simple sampling strategy can perform comparably to more computationally intensive models.
1.1 Related Results
The difficulty of evaluating streaming recommenders was first mentioned in [19], although they evaluated models by an offline training and testing split. Ideas for online evaluation metrics appeared first in [23, 24, 33]. In [33], incremental algorithms are evaluated using recall, which is a batch quality measure. In the other two results, online evaluation appears as a side result of investigating social influence. As the starting point of our results, the last paper notices that some online methods perform better than their batch counterpart.
Since our goal is to recommend different items at different times, our evaluation must be based on the quality of the top list produced by the recommender. This so-called top-k recommender task is known to be hard [7]. A recent result on evaluating top-k recommenders is found in [6].
In our experiments, we apply widely used methods of matrix factorization. The highly successful gradient descent matrix factorization recommender is described among others by [11, 30]. Even though it typically performs marginally better than gradient descent, we do not investigate alternating least squares [18, 27], since it is an inherently batch algorithm that cannot be easily adapted to the online setting.
We also evaluate basic item-to-item recommenders. Recent works [26] show that most industry recommendation tasks are item-to-item, since the only information available is the present user session. The first item-to-item recommender methods [20, 29] were using similarity information to directly find nearest neighbor transactions [8]. Nearest neighbor methods were criticized for two reasons. First, similarity metrics typically have no mathematical justification. Second, the confidence of the similarity values is often not taken into account when finding the nearest neighbor, which leads to overfitting in sparse cases. A new method to give session recommendations was described by Rendle et al. [28], which models the users by factorizing personal Markov chains. Koenigstein and Koren [17] improve the basic transition model by computing latent item factors to represent all items in a low dimensional space. While certainly giving performance gains in batch testing, these algorithms are too costly to be updated online.
Finally, we mention that item-to-item recommendation was also considered as a special context-aware recommendation problem. In [14], sequentiality as context is handled by Hidasi et al. using pairwise associations as features in an alternating least squares model. They mention that they face the sparsity problem in setting the minimum support, confidence and lift of the associations, and they use the category of the last purchased item as fallback. In a follow-up result [15], they use the same context-aware ALS algorithm; however, they only consider seasonality as context in that paper. Our result can be used independently of the ALS based methods and can easily be combined with user personalization. Most recently, context information learning was also performed by recurrent neural networks [13].
2 TEMPORAL EVALUATION
In the implicit top-k recommendation task [6], the goal is not to rate individual items, but to recommend the best candidates. In a time sensitive or online recommender that potentially re-trains the model after each new item, we have to generate a new top list for every recommendation request. The online top-k task is therefore different from the standard recommender evaluation settings, since there is always only a single relevant item. In an online setting, as seen in Figure 1, we
(1) request a top-k recommendation for the active user,
(2) evaluate against the single relevant item, and
(3) train the model on the revealed user-item pair.

[Figure 1: Temporal evaluation of the online ranking prediction problem.]

Next we introduce natural evaluation techniques for the online ranking prediction problem, extending the methods of [23]. In our setting, model training and evaluation happen simultaneously, iterating over the data set only once, in chronological order. Whenever we see a new user-item pair, we assume that the user becomes active and requests a recommendation. The recommendation is online, hence it may depend on all events before the exact time of the interaction. If a user u views item i at time t, our models predict a score r̂(u, i′, t) for each item i′ that appears in the data so far, and recommend the k items with the largest values from those that u has not seen before.
One possible measure for the quality of a recommended top list could be precision and recall [35, 36]. Note that we evaluate against a single item: both the number of relevant items (1) and the number of retrieved items (K) are fixed. Precision is 1/K if we retrieve the single item viewed and 0 otherwise. Recall is 0 if we do not retrieve the single relevant item and 1 otherwise. Hence, up to a constant factor K, precision and recall are identical and are binary indicators of whether the item viewed is on the recommended list.
Recently, measures other than precision and recall became preferred for measuring the quality of top-k recommendation [2]. The most common measure is NDCG, the normalized version of the discounted cumulative gain (DCG). Since the decrease of DCG as a function of the rank is smoother than the decrease of precision and recall, it is more advantageous, since we have a large number of items of potential interest to each user. Our choice is in accordance with the observations in [2] as well.
DCG computed individually for each event and averaged in time is an appropriate measure for real-time recommender evaluation. If i is the next item consumed by the user, the online DCG@K is defined as the following function of the rank of i returned by the recommender system:

    DCG@K(i) = 0                             if rank(i) > K;
    DCG@K(i) = 1 / log2(rank(i) + 1)         otherwise.        (1)

The overall evaluation of a model is the average of the DCG@K values over all events of the evaluation period. Note that in our unusual temporal setting of DCG evaluation, there is a single relevant item and hence no normalization is needed, as opposed to the original NDCG measure.
Another possible ranking evaluation measure with a natural online extension is Mean Reciprocal Rank (MRR), the average of 1/rank of the first relevant item [34]. In online evaluation, there is a single relevant item i, therefore online MRR@K is equal to 1/rank(i) if rank(i) ≤ K and is 0 otherwise. Hence the difference between online DCG and MRR is the rank discount function, which is reciprocal for MRR and inverse logarithmic for DCG.
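To make the evaluation protocol concrete, the following minimal Python sketch implements the prequential loop of Figure 1 together with the online DCG@K of equation (1) and the online MRR@K. The `Model` interface with `rank` and `update` methods is a hypothetical stand-in for illustration, not the API of the released Alpenglow code.

```python
import math

def evaluate_online(model, events, K=100):
    """Single chronological pass: score, evaluate, then train on each event.

    events: iterable of (user, item, time) triples in time order.
    model:  any object with
              rank(user, item, time) -> 1-based rank of `item` in the current top list (or None)
              update(user, item, time) -> train on the revealed interaction.
    Returns average online DCG@K and MRR@K over all events.
    """
    dcg_sum, mrr_sum, n = 0.0, 0.0, 0
    for user, item, time in events:
        r = model.rank(user, item, time)      # (1) request a top-K recommendation
        if r is not None and r <= K:          # (2) evaluate against the single relevant item
            dcg_sum += 1.0 / math.log2(r + 1)
            mrr_sum += 1.0 / r
        model.update(user, item, time)        # (3) train on the revealed user-item pair
        n += 1
    return dcg_sum / n, mrr_sum / n
```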
3 ALGORITHMS
In this section, we describe the algorithms used in our experiments, in both batch and online settings. The simplest algorithms, for example the count based ones, are online by nature. Most of the algorithms discussed have more or less natural online counterparts. Our goal is to include a variety of count, nearest neighbor, session and matrix factorization based methods to see how these algorithms can adapt to non-stationary environments. Many of the algorithms will have parameters to handle decay in time, i.e. forget older events and emphasize new ones.
Online recommenders seem more restricted, since they may not iterate over the data several times, hence we would expect inferior quality. Online methods, however, have the advantage of putting much more emphasis on recent events. In an online setting [1], the model needs to be retrained after each new event, and hence reiterating over the earlier parts of the data is ruled out. We may implement an online recommender algorithm by allowing a single iteration over the training data only, with this single iteration processing the events in the order of time.

3.1 Popularity (POP)
In this non-personalized method, we recommend the most popular items. The method may be both batch and online depending on the granularity of the item frequency measurement update. For each data set, we selected the best time frame, which had a wide range of variety between 10 minutes and a year.

3.2 Time sensitive nearest neighbor (t-NN)
One of the earliest and most popular collaborative filtering algorithms in practice is the item-based nearest neighbor [29]. For these algorithms, similarity scores are computed between item pairs based on the co-occurrence of the pairs in the preferences of users. Non-stationarity of the data can be accounted for, e.g., with the introduction of a time decay [9].
Describing the algorithm more formally, let us denote by U_i the set of users that visited item i, by I_u the set of items visited by user u, and by s_ui the index of item i in the sequence of interactions of user u. The frequency based time-weighted similarity function is defined by

    sim(j, i) = ( sum over u in U_j ∩ U_i of f(s_ui − s_uj) ) / |U_j|,        (2)

where f(τ) = γ^τ is the time decaying function. For non-stationary data we sum only over users that visit item j before item i, setting f(τ) = 0 if τ < 0. For stationary data, the absolute value of τ is used. The score assigned to item i for user u is

    score(u, i) = sum over j in I_u of f(|I_u| − s_uj) · sim(j, i).        (3)

The model is represented by the similarity scores. Since computing the model is time consuming, it is done periodically. Moreover, only the most similar items are stored for each item. When the prediction scores are computed for a particular user, all items visited by the user can be considered, including the most recent ones. Hence, the algorithm can be considered semi-online in that it uses the most recent interactions of the current user, but not of the other users. We note that the time decay function is used here to quantify the strength of the connection between pairs of items depending on how closely they are located in the sequence of a user, and not as a way to forget old data as in [9].
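A minimal sketch of equations (2) and (3), assuming interactions are kept as per-user item sequences in time order; the function and variable names (and the decay value) are illustrative assumptions, not the paper's released implementation.

```python
from collections import defaultdict

GAMMA = 0.9  # time decay base gamma, a hypothetical value

def f(tau):
    """Time decaying function f(tau) = gamma^tau, zero for tau < 0 (non-stationary case)."""
    return GAMMA ** tau if tau >= 0 else 0.0

def tnn_scores(history, user):
    """history: dict user -> list of items in interaction order (defines s_ui).
    Returns score(user, i) of equation (3) for every candidate item i."""
    # position of each item in every user's sequence, playing the role of s_ui
    s = {u: {item: pos for pos, item in enumerate(seq)} for u, seq in history.items()}
    # U_i: users that visited each item
    users_of = defaultdict(set)
    for u, seq in history.items():
        for item in seq:
            users_of[item].add(u)

    def sim(j, i):
        common = users_of[j] & users_of[i]
        return sum(f(s[u][i] - s[u][j]) for u in common) / len(users_of[j])   # equation (2)

    seq_u = history.get(user, [])
    scores = defaultdict(float)
    for j in seq_u:
        weight = f(len(seq_u) - s[user][j])      # f(|I_u| - s_uj)
        for i in users_of:                       # candidate items
            scores[i] += weight * sim(j, i)      # equation (3)
    return scores
```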
3.3 Item-to-item transition model (TRANS)
A simple algorithm that focuses on the sequence of items a user has visited is one that records how often users visited item i after visiting another item j. This can be viewed as a particular form of the item-to-item nearest neighbor with a time decay function that is non-zero only for the immediately preceding item. While the algorithm is more simplistic, it is fast to update the transition frequencies after each interaction, thus all recent information is taken into account.

3.4 Batch and online matrix factorization (MF)
One of our methods is the popular gradient descent matrix factorization recommender described among others by [11, 30]. The original algorithm builds a batch model by iterating over a fixed training data set a certain number I of times in random order, performing stochastic gradient descent until convergence. The algorithm builds F latent factors, P_uf for user u and Q_if for item i, to predict the rating by

    r̂(u, i) = sum over f = 1..F of P_uf · Q_if.        (4)

Given an actual rating r(u, i), the factors P_uf are updated by regularized gradient descent as

    P_uf(t+1) = P_uf(t) + η · (r(u, i) − r̂(u, i)) · Q_if − η · λ · P_uf(t),        (5)

where η is the learning rate and λ is the regularization coefficient. Item factors Q_if are updated similarly.
In our tasks, all of the ratings are implicit. Whenever we observe a user-item interaction (u, i), we may only infer positive feedback r(u, i) = 1. We follow the approach of [27] by treating all unknown elements (u′, i′) of the matrix as negative feedback, i.e. we assume r(u′, i′) = 0. In order to avoid the model fitting to r ≡ 1, we generate negative samples for each item interaction. In other words, after training on (u, i), we train according to equation (5) for a fixed negative rate nRate number of (u, i′) pairs. Item i′ is drawn uniformly nRate times for user u from the set of items that u has not interacted with.
Online matrix factorization takes the same steps, including negative sample generation, but strictly in the order of the events. In the update equation (5), we may consider the superscript t as time, and process interactions (u, i, t) in the order of t. For negative samples, we generate items i′ such that no (u, i′, t′) interaction exists with t′ ≤ t.
A natural option to combine the advantages of the batch and the online model is to periodically (e.g. weekly) build batch factors P and Q and continue from the same P and Q with the online gradient descent method. We call this method batch & online.
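The following Python sketch shows the online variant of equations (4) and (5) with implicit feedback and uniform negative sampling. The class name, helper structure and default parameter values are illustrative assumptions (e.g., nRate is set far below the tuned values of Section 5.1), not the released implementation.

```python
import random
import numpy as np

class OnlineMF:
    """Online SGD matrix factorization with negative sampling (a sketch of Section 3.4)."""

    def __init__(self, n_factors=10, lr=0.2, reg=0.001, n_rate=20, init_scale=0.01):
        self.F, self.lr, self.reg, self.n_rate = n_factors, lr, reg, n_rate
        self.init_scale = init_scale
        self.P, self.Q = {}, {}       # latent factors P_uf (users) and Q_if (items)
        self.seen = {}                # user -> items interacted with so far
        self.items, self.item_ids = [], set()

    def _factors(self, table, key):
        if key not in table:
            table[key] = np.random.normal(0.0, self.init_scale, self.F)
        return table[key]

    def predict(self, u, i):
        """Equation (4): r_hat(u, i) = sum_f P_uf * Q_if."""
        return float(self._factors(self.P, u).dot(self._factors(self.Q, i)))

    def _sgd_step(self, u, i, r):
        """One regularized SGD step of equation (5) on the pair (u, i) with target rating r."""
        p, q = self._factors(self.P, u), self._factors(self.Q, i)
        err = r - float(p.dot(q))
        p_old = p.copy()
        p += self.lr * (err * q - self.reg * p)        # update user factors P_uf
        q += self.lr * (err * p_old - self.reg * q)    # analogous update of item factors Q_if

    def update(self, u, i):
        """Process one interaction (u, i), strictly in the order of events."""
        seen_u = self.seen.setdefault(u, set())
        if i not in self.item_ids:
            self.item_ids.add(i)
            self.items.append(i)
        self._sgd_step(u, i, 1.0)                      # implicit positive feedback r(u, i) = 1
        unseen = [j for j in self.items if j not in seen_u and j != i]
        for j in random.sample(unseen, min(self.n_rate, len(unseen))):
            self._sgd_step(u, j, 0.0)                  # nRate uniform negative samples, r = 0
        seen_u.add(i)
```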
3.5 Sampling from the past for matrix factorization
While batch & online matrix factorization naturally combines the advantages of the two methods, the periodic re-training of the batch model may be computationally inefficient. We propose an online, efficient technique that approximates the batch and online combination. Similar to the online MF model, we only allow a single iteration over the data in temporal order. However, for each interaction, we generate not only negative but also positive samples. The positive samples are selected randomly from past interactions, i.e. we allow the model to re-learn the past. We generate pRate positive samples along with nRate negative samples, hence for t events, we only take (1 + nRate + pRate) · t gradient steps.
The samples are not drawn uniformly from the past, but selected randomly from a pool S with maximum size s. This avoids oversampling interactions that happened at the beginning of the data set. More specifically, as seen in Fig. 2, for each observed new training instance, we
• update the model by pRate random samples from S,
• delete the selected samples from S if |S| > s,
• and add the new instance pRate times to S.

[Figure 2: Sampling from the past.]
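A minimal sketch of the pool-based positive sampling described above, layered on the hypothetical OnlineMF class from the previous sketch; pRate and the pool size here are illustrative, not the tuned values reported in Section 5.1.

```python
import random

class SamplingMF(OnlineMF):
    """Online MF that also re-learns randomly selected past interactions (Section 3.5)."""

    def __init__(self, p_rate=3, pool_size=3000, **kwargs):
        super().__init__(**kwargs)
        self.p_rate, self.pool_size = p_rate, pool_size
        self.pool = []                                   # pool S of past (user, item) instances

    def update(self, u, i):
        # update the model by pRate random samples from S (re-learned as positive feedback)
        k = min(self.p_rate, len(self.pool))
        picked = random.sample(range(len(self.pool)), k)
        for idx in picked:
            pu, pi = self.pool[idx]
            self._sgd_step(pu, pi, 1.0)
        # delete the selected samples from S if |S| > s
        if len(self.pool) > self.pool_size:
            for idx in sorted(picked, reverse=True):
                self.pool.pop(idx)
        # process the new instance: positive step plus nRate negative samples (Section 3.4)
        super().update(u, i)
        # add the new instance pRate times to S
        self.pool.extend([(u, i)] * self.p_rate)
```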
3.6 Asymmetric Matrix Factorization (AMF)
Asymmetric Matrix Factorization [25] computes item-to-item similarity with the help of latent vectors for items. In contrast to Section 3.4, both latent matrices P and Q correspond to items. Using the notations and the time decaying function from Section 3.2, the score assigned to item i for user u is

    score(u, i) = sum over j in I_u of f(|I_u| − s_uj) · P_j · Q_i.        (6)

Sampling negative instances and updating the latent vectors online using stochastic gradient descent can be done in a similar way to the one described in Section 3.4.
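A short sketch of the AMF score of equation (6); the latent item vectors are assumed to be trained online with SGD as in the MF sketch above, and all names and the decay value are illustrative.

```python
import numpy as np

def amf_score(user_history, P, Q, item, gamma=0.9):
    """score(u, i) of equation (6): time-decayed dot products between the latent vectors
    of the user's past items (rows of P) and the candidate item (row of Q).

    user_history: list of item ids visited by u, in interaction order (defines s_uj).
    P, Q: dicts mapping item id -> numpy latent vector.
    """
    n = len(user_history)
    score = 0.0
    for pos, j in enumerate(user_history):          # pos plays the role of s_uj
        decay = gamma ** (n - pos)                  # f(|I_u| - s_uj) with f(tau) = gamma^tau
        score += decay * float(P[j].dot(Q[item]))
    return score
```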
4 DATA
We compare the performance of the batch and the online algorithms of Section 3 over seven data sets described next. For each data set, we discard items that appear less than ten times in the interactions. We apply no filtering for the users.
Twitter. We use 400M tweets over four months between February 1 and May 30, 2012 from a crawl introduced in [10]. We recommend new hashtags that the user has not used before.
Last.fm. We recommend artists and tracks in the 30Music data set [31]. New tracks appear more frequently than new artists. We expect that user-track events are highly non-stationary, more than user-artist events.
MovieLens data sets [12], first released in 1998, contain timestamped explicit movie ratings. We use the ML10M data set, consisting of 10,000,054 ratings of 69,878 users for 10,681 movies. In our experiments, we consider these records as implicit interactions.
Amazon data sets [21] are browsing and co-purchasing data crawled from Amazon. We use three categories: Books, Electronics, as well as Movies and TV.

4.1 Statistics
Before turning to explaining the findings, we show two statistics that highlight the non-stationary properties of the data sets.
Burstiness. The inter-event time (Fig. 3) is the time elapsed between consecutive events of the same user. In accordance with several results [3, 16, 32], the distribution is highly skewed for all data sets, which shows their burstiness, in the approximate order from strongest to weakest: Twitter, Amazon, Last.fm and finally MovieLens (Fig. 3, right to left).
Item-to-item transitions. Users of online music services very often follow playlists, which results in artificially very strong performance of the item-to-item methods. When listening to playlists, users do not actually request recommendations, and hence evaluating user sequences for recommendation quality is somewhat artificial. Next we examine whether the data sets contain trivial item-to-item transitions by computing the item-to-item transition matrix, i.e. the number of times j followed i in the listening session of the same user. For each item i we count the least number of unique i → j item transitions needed to cover 90% of the item's transitions. In Figure 3 we plot the fraction of these against the total number of unique i → j transitions. We observe that for the Last.fm track data, the MovieLens data set, and for the Twitter data lower fractions are possible, hence these data sets may involve trivial transitions.
Shuffling the temporal order. In order to assess the importance of the time dimension in the streaming recommendation task, we compare our results over the same data sets but with a randomly shuffled order of time. In other words, we evaluate over the shuffled stationary variant of each data set, where we expect that the temporal effects disappear.

[Figure 3: Statistics over the 7 data sets. Top: inter-event time distribution (fraction of intervals of length t vs. time t in days). Bottom: transition matrix statistics (fraction of top different following items needed to cover 90% of item views vs. number of item views excluding the last item of each user).]
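As an illustration of the transition statistic plotted in Figure 3, the following sketch computes, for each item, the smallest number of distinct outgoing transitions that covers 90% of its observed transitions; the input data layout is an assumption made for the example.

```python
from collections import Counter, defaultdict

def transition_coverage(sessions, coverage=0.9):
    """sessions: iterable of per-user item sequences in time order.
    Returns {item i: least number of unique i -> j transitions covering `coverage`
    of all transitions out of i}."""
    counts = defaultdict(Counter)
    for seq in sessions:
        for i, j in zip(seq, seq[1:]):       # consecutive items of the same user
            counts[i][j] += 1
    needed = {}
    for i, ctr in counts.items():
        total = sum(ctr.values())
        covered, k = 0, 0
        for _, c in ctr.most_common():       # greedily take the most frequent followers
            covered += c
            k += 1
            if covered >= coverage * total:
                break
        needed[i] = k
    return needed
```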
5 RESULTS
In this section, we describe our results, including the performance of the algorithms described in Section 3 and our experiments with matrix factorization training strategies. In Table 1, we summarize the performance of our models by online DCG. Online MRR behaves almost identically and is not shown due to space limitations.

Table 1: Online DCG@100 of different algorithms. Rows marked "from 10th" are re-evaluations over items past the 10th interaction only, for each user. Top: original data. Bottom: shuffled data. Columns, in order: POP, TRANS, t-NN, batch MF, online MF (F = 10, 100, 1,000), batch & online, AMF, SVD++, full sample.

original data
Twitter all              0.350  0.149  0.027  0.010  0.006  0.335  0.337  0.340  0.269  0.332  0.348
Twitter from 10th        0.416  0.267  0.036  0.008  0.005  0.523  0.527  0.532  0.457  0.516  0.516
Last.fm track            0.006  0.276  0.298  0.007  0.009  0.022  0.066  0.010  0.021  0.030  0.028
Last.fm artist           0.024  0.051  0.064  0.032  0.031  0.048  0.042  0.004  0.049  0.049  0.051
MovieLens10M             0.086  0.148  0.148  0.012  0.012  0.140  0.142  0.038  0.133  0.167  0.172
Books all                0.026  0.033  0.042  0.008  0.007  0.025  0.032  0.009  0.016  0.025  0.026
Books from 10th          0.017  0.046  0.074  0.010  0.012  0.042  0.053  0.018  0.021  0.040  0.042
Amazon Electronics       0.035  0.015  0.014  0.006  0.006  0.016  0.018  0.019  0.011  0.018  0.019
Electronics from 10th    0.023  0.024  0.035  0.017  0.018  0.046  0.049  0.050  0.023  0.048  0.050
Movies                   0.066  0.029  0.034  0.023  0.007  0.035  0.039  0.014  0.024  0.035  0.037
Movies from 10th         0.048  0.037  0.062  0.040  0.016  0.064  0.068  0.037  0.038  0.060  0.064

shuffled data
Twitter all              0.020  0.041  0.055  0.050  0.045  0.034  0.039  0.045  0.050  0.044  0.044
Twitter from 10th        0.017  0.047  0.091  0.073  0.082  0.060  0.068  0.074  0.073  0.066  0.067
Last.fm track            0.004  0.002  0.043  0.010  0.021  0.008  0.009  0.002  0.010  0.006  0.008
Last.fm artist           0.023  0.014  0.036  0.043  0.041  0.035  0.025  0.005  0.043  0.033  0.037
MovieLens10M             0.059  0.058  0.065  0.088  0.085  0.080  0.052  0.004  0.088  0.072  0.080
Books all                0.009  0.016  0.027  0.009  0.010  0.007  0.009  0.008  0.011  0.010  0.010
Books from 10th          0.004  0.010  0.042  0.013  0.016  0.012  0.015  0.013  0.014  0.013  0.013
Amazon Electronics       0.020  0.010  0.011  0.009  0.007  0.007  0.008  0.009  0.009  0.011  0.011
Electronics from 10th    0.013  0.008  0.025  0.020  0.012  0.019  0.020  0.020  0.020  0.021  0.021
Movies                   0.023  0.018  0.023  0.016  0.014  0.011  0.012  0.013  0.016  0.015  0.015
Movies from 10th         0.007  0.010  0.034  0.024  0.024  0.019  0.021  0.021  0.024  0.019  0.020

Twitter data is very sparse and bursty, with the highest fraction of small inter-event times (Fig. 3). As a result, factorization methods perform worse and popularity models perform well. The Last.fm track data set is similar, but it incorporates playlist and album listening data, hence transition and similarity models perform better than factorization and popularity. The number of items in the Last.fm artist data is significantly lower than for the track data; as a result, factorization gets close to the transition and similarity models on this data set. In Fig. 3 we can observe that the Last.fm track data, the MovieLens data set, and the Twitter data contain several predictable item-to-item transitions. This observation is consistent with the performance of the transition model, as it provides the strongest DCG values in case of these three data sets.
For batch rating prediction, MF is known in the literature to perform very strongly. For ranking prediction, however, our results show the superiority of item-to-item nearest neighbor methods. Furthermore, the performance of batch MF models drops considerably when the data is non-stationary. For example, for MovieLens, MF is the best performing algorithm after shuffling, but the worst for the original temporal order. Popularity, the simplest algorithm, performs well for many data sets in the non-stationary setting, although the strong performance is mainly due to users with few interactions. AMF performs between MF and NN, slightly better for non-stationary data. The transition model is consistently inferior to NN, but despite its simplicity, it is competitive with more complex factorization models for non-stationary data.

5.1 Online training strategies for MF
The high performance of transition models indicates the existence of trivial item-to-item transitions. In music data, such as the Last.fm data set, the listening of predefined playlists or albums results in predictable item-to-item transitions. These are easy to predict, however they are not valuable for the actual user, therefore they bias recommender evaluation. We filtered the Last.fm artist data before comparing different factor model training strategies to reduce the above effects. We only considered records after at least 30 minutes of user inactivity, i.e. the start of each user session. Statistics of the filtered 30M data are shown in Fig. 4.

[Figure 4: Statistics over the 30M data set. Top: number of records per week. Bottom: number of weekly new users and items.]

We compared multiple different learning strategies of MF models: batch, online, batch & online, and sampling methods. In the experiments we use the following best parameters. We train batch MF weekly with η = 0.05, nRate = 69 and 9 iterations in random order. For online MF we set η = 0.2, nRate = 99. The parameters are the same for batch & online. For the sampling model, we use a lower learning rate η = 0.05 and nRate = 39. Best results are achieved by generating pRate = 3 positive samples from the past for each record, with pool size 3,000.
In Fig. 5 we plot the average weekly DCG over a one year period. While sampling appears to keep up with batch & online in the beginning, its performance drops in the second part. If we start each factor model from the batch model of week 18, sampling produces similar results to batch & online.

[Figure 5: Performance of the sampling method compared to three MF variants: batch, online and batch & online. Top: each model trained individually from the beginning of time. Bottom: the sampling and online models are started from a batch model at week 18.]

In Fig. 6, as a function of k, we start with the batch model of week k. Then, we only use online updates with or without further sampling. It can be seen that sampling performs roughly the same as batch & online for k > 10. Note that the number of weekly incoming new users and items drops after the first 10 weeks, as seen in Fig. 4. The single iteration online model produces comparable, but slightly worse results than the sampling version.

[Figure 6: Performance of the online and sampling methods started from a batch model at week k. Averages are taken between weeks 25 and 45.]

In summary, in a non-stationary environment, multiple passes (i.e. batch models) are required over the data to fully incorporate many new users and items into the system. However, if a large component of the user-item matrix is previously learned by a robust batch model, a simple online sampling algorithm is sufficient to keep up in performance with periodic batch re-training, while requiring only 3 additional positive samples from the past per iteration and thus being much more efficient.

6 CONCLUSIONS
Despite the fact that a practical recommender system processes events and generates top lists in an online sequence, the literature has paid relatively little attention to designing and evaluating online learning methods in highly non-stationary environments. We proposed and evaluated a variety of algorithms as well as measures that are predictive of which algorithm would work well on particular data. We showed that simpler algorithms that can use the most recent data by updating their models online perform better than more complex algorithms that can be updated only periodically. We also showed that sampling from past events may completely replace batch modeling needs in a real time recommender system, thus reducing latency. We released our code as an open source project³.

³ https://github.com/rpalovics/Alpenglow

REFERENCES
[1] J. Abernethy, K. Canini, J. Langford, and A. Simma. Online collaborative filtering. University of California at Berkeley, Tech. Rep., 2007.
[2] A. Al-Maskari, M. Sanderson, and P. Clough. The relationship between IR effectiveness measures and user satisfaction. In Proc. SIGIR, pages 773–774. ACM, 2007.
[3] A.-L. Barabasi. The origin of bursts and heavy tails in human dynamics. Nature, 435(7039):207–211, 2005.
[4] J. Bennett and S. Lanning. The Netflix Prize. In KDD Cup and Workshop in conjunction with KDD, 2007.
[5] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[6] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proc. RecSys, pages 39–46. ACM, 2010.
[7] M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM TOIS, 22(1):143–177, 2004.
[8] C. Desrosiers and G. Karypis. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook, pages 107–144. Springer, 2011.
[9] Y. Ding and X. Li. Time weight collaborative filtering. In Proc. CIKM, pages 485–492. ACM, 2005.
[10] L. Dobos, J. Szule, T. Bodnar, T. Hanyecz, T. Sebok, D. Kondor, Z. Kallus, J. Stéger, I. Csabai, and G. Vattay. A multi-terabyte relational database for geo-tagged social network data. In Proc. CogInfoCom, pages 289–294. IEEE, 2013.
[11] S. Funk. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html, 2006.
[12] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM TiiS, 5(4):19, 2016.
[13] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk. Session-based recommendations with recurrent neural networks. In ICLR, 2016.
[14] B. Hidasi and D. Tikk. Fast ALS-based tensor factorization for context-aware recommendation from implicit feedback. In Machine Learning and Knowledge Discovery in Databases, pages 67–82. Springer, 2012.
[15] B. Hidasi and D. Tikk. Context-aware item-to-item recommendation within the factorization framework. In Proc. 3rd Workshop on Context-awareness in Retrieval and Recommendation, pages 19–25, 2013.
[16] M. Kivelä and M. A. Porter. Estimating inter-event time distributions from finite observation periods in communication networks. arXiv preprint arXiv:1412.8388, 2014.
[17] N. Koenigstein and Y. Koren. Towards scalable and accurate item-oriented recommendations. In Proc. RecSys, pages 419–422. ACM, 2013.
[18] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[19] N. Lathia, S. Hailes, and L. Capra. Temporal collaborative filtering with adaptive neighbourhoods. In Proc. SIGIR, pages 796–797. ACM, 2009.
[20] G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.
[21] J. McAuley, R. Pandey, and J. Leskovec. Inferring networks of substitutable and complementary products. In Proc. SIGKDD, pages 785–794. ACM, 2015.
[22] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. Ad click prediction: a view from the trenches. In Proc. SIGKDD, pages 1222–1230. ACM, 2013.
[23] R. Pálovics and A. A. Benczúr. Temporal influence over the Last.fm social network. In Proc. ASONAM, pages 486–493. ACM, 2013.
[24] R. Pálovics, A. A. Benczúr, L. Kocsis, T. Kiss, and E. Frigó. Exploiting temporal influence in online recommendation. In Proc. RecSys, pages 273–280. ACM, 2014.
[25] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proc. SIGKDD, pages 39–42. ACM, 2007.
[26] I. Pilászy, A. Serény, G. Dózsa, B. Hidasi, A. Sári, and J. Gub. Neighbor methods vs. matrix factorization - case studies of real-life recommendations. In LSRS 2015 at RecSys, 2015.
[27] I. Pilászy, D. Zibriczky, and D. Tikk. Fast ALS-based matrix factorization for explicit and implicit feedback datasets. In Proc. RecSys, pages 71–78. ACM, 2010.
[28] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In Proc. WWW, pages 811–820. ACM, 2010.
[29] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. WWW, pages 285–295, 2001.
[30] G. Takács, I. Pilászy, B. Németh, and D. Tikk. Investigation of various matrix factorization methods for large recommender systems. In Proc. 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition, pages 1–8. ACM, 2008.
[31] R. Turrin, M. Quadrana, A. Condorelli, R. Pagano, and P. Cremonesi. 30Music listening and playlists dataset. In RecSys Posters, 2015.
[32] A. Vazquez, B. Racz, A. Lukacs, and A.-L. Barabasi. Impact of non-Poissonian activity patterns on spreading processes. Physical Review Letters, 98(15):158702, 2007.
[33] J. Vinagre, A. M. Jorge, and J. Gama. Evaluation of recommender systems in streaming environments. In Workshop on Recommender Systems Evaluation: Dimensions and Design (REDD), in conjunction with RecSys, 2014.
[34] E. M. Voorhees et al. The TREC-8 question answering track report. In TREC, volume 99, pages 77–82, 1999.
[35] X. Yang, H. Steck, Y. Guo, and Y. Liu. On top-k recommendation using social networks. In Proc. RecSys, pages 67–74. ACM, 2012.
[36] Q. Yuan, L. Chen, and S. Zhao. Factorization vs. regularization: fusing heterogeneous social relationships in top-n recommendation. In Proc. RecSys, pages 245–252. ACM, 2011.