       Online ranking prediction in non-stationary environments
         Erzsébet Frigó         Róbert Pálovics Domokos Kelen Levente Kocsis András A. Benczúr
                                      Institute for Computer Science and Control
                                    Hungarian Academy of Sciences (MTA SZTAKI)
                          {rpalovics, kelen.domokos.miklos, kocsis, benczur}@sztaki.hu
ABSTRACT

Recommender systems have to serve in online environments that can be highly non-stationary.¹ Traditional recommender algorithms may periodically rebuild their models, but they cannot adjust to quick changes in trends caused by timely information. In our experiments, we observe that even a simple but online trained recommender model can perform significantly better than its batch version. We investigate online learning based recommender algorithms that can efficiently handle non-stationary data sets. We evaluate our models over seven publicly available data sets. Our experiments are available as an open source project.²

¹ Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.
² https://github.com/rpalovics/Alpenglow

1 INTRODUCTION

Research on recommender systems became popular with the Netflix Prize [4]. As an effect of the competition, batch rating prediction is considered the standard recommender evaluation task, with one part of the data used for model training and the other for evaluation. In contrast to the Netflix Prize task, recommender systems are typically not required to predict ratings; rather, they are required to present a ranked top list of relevant items to the user. Also, most users give no explicit ratings, and we have to infer their preferences from implicit feedback [17]. Most importantly, users request one or a few items at a time, and may get exposed to new information that can change their needs and taste before they return to the service the next time. In a real application, top item recommendation by online learning is therefore usually more relevant than batch rating prediction. There is a vast literature on the batch performance of collaborative filtering algorithms, while the more realistic sequential or online evaluation has received much less attention. Information on recommendation performance is scarce when collaborative filtering is evaluated online, sequentially.

In this work, we consider top recommendation in highly non-stationary environments, similarly to the conceptually simpler classification task [22]. Our goal is to promptly update recommender models after user interactions using online learning methods.

There are indications [19] that performance gains in batch experiments do not necessarily carry over to online learning environments. One result of the Netflix Prize is that matrix factorization [18] techniques provide strong baseline results on non-temporal data sets. In our work, we compare the online and batch versions of the same matrix factorization algorithm. Due to the highly non-stationary properties of the data sets, we apply incremental stochastic gradient descent (SGD) based matrix factorization. We show that the online version strongly outperforms the batch version on non-stationary data sets.

In [23], the online DCG measure is defined as a side result to track the performance of recommender algorithms over time. In our present experiments, we rely heavily on online DCG, introduce another measure, the online MRR, and show their importance for evaluating streaming recommenders. In summary, our main results are the following:

• We provide measures suited to non-stationary environments that are predictive of which algorithm will work well on a particular data set.
• We show that simpler algorithms that can be updated online—and therefore use the most recent data as well—often perform better than more complex algorithms that can only be updated periodically. This is in contrast to the case when the testing is batch or the data is stationary.
• We also show that even though initialization by large batch models may be required for optimal performance, online matrix factorization combined with a simple sampling strategy can perform comparably to more computationally intensive models.

1.1 Related Results

The difficulty of evaluating streaming recommenders was first mentioned in [19], although they evaluated models by an offline training and testing split. Ideas for online evaluation metrics first appeared in [23, 24, 33]. In [33], incremental algorithms are evaluated using recall, which is a batch quality measure. In the other two results, online evaluation appears as a side result of investigating social influence. As the starting point of our results, the last paper notices that some online methods perform better than their batch counterparts.

Since our goal is to recommend different items at different times, our evaluation must be based on the quality of the top list produced by the recommender. This so-called top-k recommender task is known to be hard [7]. A recent result on evaluating top-k recommenders is found in [6].

In our experiments, we apply widely used matrix factorization methods. The highly successful gradient descent matrix factorization recommender is described, among others, in [11, 30]. Even though alternating least squares [18, 27] typically performs marginally better than gradient descent, we do not investigate it, since it is an inherently batch algorithm that cannot easily be adapted to the online setting.

We also evaluate basic item-to-item recommenders. Recent works [26] show that most industry recommendation tasks are item-to-item, since the only information available is the present
user session. The first item-to-item recommender methods [20, 29] used similarity information to directly find nearest neighbor [8] transactions. Nearest neighbor methods were criticized for two reasons. First, similarity metrics typically have no mathematical justification. Second, the confidence of the similarity values is often not taken into account when finding the nearest neighbor, which leads to overfitting in sparse cases. A new method for session recommendations was described by Rendle et al. [28], who model the users by factorizing personal Markov chains. Koenigstein and Koren [17] improve the basic transition model by computing latent item factors to represent all items in a low dimensional space. While certainly giving performance gains in batch testing, these algorithms are too costly to be updated online.

Finally, we mention that item-to-item recommendation has also been considered as a special context aware recommendation problem. In [14], Hidasi et al. handle sequentiality as context, using pairwise associations as features in an alternating least squares model. They mention that they face the sparsity problem in setting minimum support, confidence and lift of the associations, and they use the category of the last purchased item as fallback. In a follow-up result [15], they use the same context-aware ALS algorithm, however they only consider seasonality as context in that paper. Our result can be used independently of the ALS based methods and can easily be combined with user personalization. Most recently, context information learning has also been performed by recurrent neural networks [13].

Figure 1: Temporal evaluation of the online ranking prediction problem.
2 TEMPORAL EVALUATION

In the implicit top-k recommendation task [6], the goal is not to rate individual items, but to recommend the best candidates. In a time sensitive or online recommender that potentially re-trains the model after each new item, we have to generate a new top list for every recommendation request. The online top-k task is therefore different from the standard recommender evaluation settings, since there is always only a single relevant item. In an online setting, as seen in Figure 1, we
(1) request a top-k recommendation for the active user,
(2) evaluate against the single relevant item,
(3) train the model on the revealed user-item pair.
Next, we introduce natural evaluation techniques for the online ranking prediction problem, extending the methods of [23]. In our setting, model training and evaluation happen simultaneously, iterating over the data set only once, in chronological order. Whenever we see a new user-item pair, we assume that the user becomes active and requests a recommendation. The recommendation is online, hence it may depend on all events before the exact time of the interaction. If a user u views item i at time t, our models predict a score r̂(u, i′, t) for each item i′ that has appeared in the data so far, and recommend the k items with the largest scores among those that u has not seen before.

One possible measure of the quality of a recommended top list could be precision and recall [35, 36]. Note that we evaluate against a single item: both the number of relevant items (1) and the number of retrieved items (K) are fixed. Precision is 1/K if we retrieve the single item viewed and 0 otherwise. Recall is 0 if we do not retrieve the single relevant item and 1 otherwise. Hence, up to a constant factor K, precision and recall are identical, and both are binary indicators of whether the item viewed is on the recommended list.

Recently, measures other than precision and recall have become preferred for measuring the quality of top-k recommendation [2]. The most common measure is NDCG, the normalized version of the discounted cumulative gain (DCG). Since DCG decreases more smoothly as a function of the rank than precision and recall, it is more advantageous when there is a large number of items of potential interest to each user. Our choice is in accordance with the observations in [2] as well.

DCG computed individually for each event and averaged in time is an appropriate measure for real-time recommender evaluation. If i is the next item consumed by the user, the online DCG@K is defined as the following function of the rank of i returned by the recommender system:

    DCG@K(i) = 1 / log_2(rank(i) + 1) if rank(i) ≤ K, and 0 otherwise.    (1)

The overall evaluation of a model is the average of the DCG@K values over all events of the evaluation period. Note that in our unusual temporal setting of DCG evaluation, there is a single relevant item and hence no normalization is needed, as opposed to the original NDCG measure.

Another possible ranking evaluation measure with a natural online extension is the Mean Reciprocal Rank (MRR), the average of 1/rank of the first relevant item [34]. In online evaluation, there is a single relevant item i, therefore online MRR@K is equal to 1/rank(i) if rank(i) ≤ K and 0 otherwise. Hence, the difference between online DCG and MRR is the rank discount function, which is reciprocal for MRR and inverse logarithmic for DCG.
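To make the protocol concrete, the following Python sketch implements the online evaluation loop together with both measures. It is an illustration only: model.rank and model.update are hypothetical placeholders for an arbitrary online recommender interface, not the API of our released code.

    import math

    def online_dcg_at_k(rank, k):
        # Eq. (1): single relevant item, logarithmic rank discount.
        return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

    def online_mrr_at_k(rank, k):
        # Online MRR@K: reciprocal rank discount instead of the logarithmic one.
        return 1.0 / rank if rank <= k else 0.0

    def evaluate_online(model, events, k=100):
        # events: (user, item, time) triples in chronological order.
        dcg_sum, mrr_sum, n = 0.0, 0.0, 0
        for user, item, t in events:
            # (1) request a top-k list; rank is 1-based, math.inf if unranked
            rank = model.rank(user, item, t)
            # (2) evaluate against the single relevant item
            dcg_sum += online_dcg_at_k(rank, k)
            mrr_sum += online_mrr_at_k(rank, k)
            n += 1
            # (3) train the model on the revealed user-item pair
            model.update(user, item, t)
        return dcg_sum / n, mrr_sum / n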
3 ALGORITHMS

In this section, we describe the algorithms used in our experiments, in both batch and online settings. The simplest algorithms, for example the count based ones, are online by nature. Most of the algorithms discussed have more or less natural online counterparts. Our goal is to include a variety of count, nearest neighbor, session and matrix factorization based methods to see how these algorithms can adapt to non-stationary environments. Many of the algorithms have parameters to handle decay in time, i.e. to forget older events and emphasize new ones.

Online recommenders seem more restricted, since they may not iterate over the data several times, hence we would expect inferior quality. Online methods, however, have the advantage of putting much more emphasis on recent events. In an online setting [1], the model needs to be retrained after each new event, and
hence reiterating over the earlier parts of the data is ruled out. We may implement an online recommender algorithm by allowing only a single iteration over the training data, with this single iteration processing the events in the order of time.

3.1 Popularity (POP)

In this non-personalized method, we recommend the most popular items. The method may be both batch and online, depending on the granularity of the item frequency measurement update. For each data set, we selected the best time frame, which varied widely between 10 minutes and a year.
3.2 Time sensitive nearest neighbor (t-NN)

One of the earliest and most popular collaborative filtering algorithms in practice is the item-based nearest neighbor [29]. For these algorithms, similarity scores are computed between item pairs based on the co-occurrence of the pairs in the preferences of users. Non-stationarity of the data can be accounted for, e.g., by the introduction of a time decay [9].

Describing the algorithm more formally, let us denote by U_i the set of users that visited item i, by I_u the set of items visited by user u, and by s_{ui} the index of item i in the sequence of interactions of user u. The frequency based time-weighted similarity function is defined by

    sim(j, i) = ( Σ_{u ∈ U_j ∩ U_i} f(s_{ui} − s_{uj}) ) / |U_j|,    (2)

where f(τ) = γ^τ is the time decaying function. For non-stationary data, we sum only over users that visit item j before item i, setting f(τ) = 0 if τ < 0. For stationary data, the absolute value of τ is used. The score assigned to item i for user u is

    score(u, i) = Σ_{j ∈ I_u} f(|I_u| − s_{uj}) · sim(j, i).    (3)

The model is represented by the similarity scores. Since computing the model is time consuming, it is done periodically. Moreover, only the most similar items are stored for each item. When the prediction scores are computed for a particular user, all items visited by the user can be considered, including the most recent ones. Hence, the algorithm can be considered semi-online in that it uses the most recent interactions of the current user, but not those of the other users. We note that the time decay function is used here to quantify the strength of the connection between pairs of items, depending on how closely they are located in the sequence of a user, and not as a way to forget old data as in [9].
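A direct, unoptimized Python sketch of equations (2) and (3) for the non-stationary case, assuming the interaction history is given as per-user item sequences; all names are ours and chosen for illustration:

    from collections import defaultdict

    def tnn_similarities(user_histories, gamma=0.9):
        # user_histories: dict user -> list of items in interaction order.
        # Non-stationary variant of Eq. (2): only pairs where j precedes i count.
        sim = defaultdict(lambda: defaultdict(float))
        n_users = defaultdict(int)                        # |U_j|
        for items in user_histories.values():
            for j in set(items):
                n_users[j] += 1
            for si, i in enumerate(items):
                for sj, j in enumerate(items[:si]):       # f(tau) = 0 for tau < 0
                    if i != j:
                        sim[j][i] += gamma ** (si - sj)   # f(tau) = gamma^tau
        for j in sim:
            for i in sim[j]:
                sim[j][i] /= n_users[j]
        return sim

    def tnn_score(sim, history, gamma=0.9):
        # Eq. (3): score(u, i) = sum over past items j, discounted by recency.
        scores = defaultdict(float)
        for s, j in enumerate(history):
            w = gamma ** (len(history) - s)               # f(|I_u| - s_uj)
            for i, s_ji in sim.get(j, {}).items():
                scores[i] += w * s_ji
        return scores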
3.3 Item-to-item transition model (TRANS)

A simple algorithm that focuses on the sequence of items a user has visited is one that records how often users visited item i after visiting another item j. This can be viewed as a particular form of the item-to-item nearest neighbor with a time decay function that is non-zero only for the immediately preceding item. While the algorithm is more simplistic, the transition frequencies are fast to update after each interaction, thus all recent information is taken into account.
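The transition model reduces to a table of counters over consecutive item pairs that can be updated in constant time per event; a minimal sketch with hypothetical naming:

    from collections import defaultdict

    class TransitionModel:
        # Counts how often item i followed item j for the same user.
        def __init__(self):
            self.counts = defaultdict(lambda: defaultdict(int))
            self.last_item = {}                    # last item seen per user

        def update(self, user, item):
            prev = self.last_item.get(user)
            if prev is not None:
                self.counts[prev][item] += 1       # transition prev -> item
            self.last_item[user] = item

        def scores(self, user):
            # Rank candidates by transition frequency from the user's last item.
            return dict(self.counts.get(self.last_item.get(user), {}))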
Figure 2: Sampling from the past.

3.4 Batch and online matrix factorization (MF)

One of our methods is the popular gradient descent matrix factorization recommender described, among others, in [11, 30]. The original algorithm builds a batch model by iterating over a fixed training data set a certain number I of times in random order, performing stochastic gradient descent until convergence. The algorithm builds F latent factors, P_{uf} for user u and Q_{if} for item i, to predict the rating by

    r̂(u, i) = Σ_{f=1}^{F} P_{uf} Q_{if}.    (4)

Given an actual rating r(u, i), the factors P_{uf} are updated by regularized gradient descent as

    P_{uf}^{(t+1)} = P_{uf}^{(t)} + η · (r(u, i) − r̂(u, i)) · Q_{if} − η λ P_{uf}^{(t)},    (5)

where η is the learning rate and λ is the regularization coefficient. Item factors Q_{if} are updated similarly.

In our tasks, all of the ratings are implicit. Whenever we observe a user-item interaction (u, i), we may only infer positive feedback r(u, i) = 1. We follow the approach of [27] by treating all unknown elements (u′, i′) of the matrix as negative feedback, i.e. we assume r(u′, i′) = 0. In order to avoid the model fitting r ≡ 1, we generate negative samples for each item interaction. In other words, after training on (u, i), we train according to equation (5) on a fixed number nRate of (u, i′) pairs. Item i′ is drawn uniformly nRate times for user u from the set of items that u has not interacted with.

Online matrix factorization takes the same steps, including negative sample generation, but strictly in the order of the events. In the update equation (5), we may consider the superscript t as time, and process interactions (u, i, t) in the order of t. For negative samples, we generate items i′ such that no (u, i′, t′) interaction exists with t′ ≤ t.

A natural option to combine the advantages of the batch and the online model is to periodically (e.g. weekly) build batch factors P and Q and continue from the same P and Q with the online gradient descent method. We call this method batch & online.
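The following Python sketch illustrates the online update of equation (5) with negative sampling, processing events strictly in temporal order. It is a simplification under our own naming (n_rate mirrors nRate); the released implementation may differ:

    import numpy as np

    class OnlineMF:
        def __init__(self, n_factors=10, lr=0.2, reg=0.001, n_rate=99, seed=0):
            self.F, self.lr, self.reg, self.n_rate = n_factors, lr, reg, n_rate
            self.P, self.Q = {}, {}          # user and item factor vectors
            self.seen = set()                # (user, item) pairs observed so far
            self.items = []                  # items that have appeared so far
            self.rng = np.random.default_rng(seed)

        def _vec(self, table, key):
            if key not in table:
                table[key] = 0.1 * self.rng.standard_normal(self.F)
            return table[key]

        def _sgd_step(self, u, i, r):
            p, q = self._vec(self.P, u), self._vec(self.Q, i)
            err = r - p.dot(q)                       # r(u,i) - r_hat(u,i)
            p += self.lr * (err * q - self.reg * p)  # Eq. (5)
            q += self.lr * (err * p - self.reg * q)  # item factors, analogously

        def update(self, u, i):
            if i not in self.Q:
                self.items.append(i)
            self.seen.add((u, i))
            self._sgd_step(u, i, 1.0)                # implicit positive feedback
            # nRate negative samples: items u has not interacted with so far
            negatives, attempts = 0, 0
            while negatives < self.n_rate and attempts < 10 * self.n_rate:
                attempts += 1
                j = self.items[self.rng.integers(len(self.items))]
                if (u, j) not in self.seen:
                    self._sgd_step(u, j, 0.0)
                    negatives += 1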
3.5 Sampling from the past for matrix factorization

While batch & online matrix factorization naturally combines the advantages of the two methods, the periodic re-training of the batch model may be computationally inefficient. We propose an online, efficient technique that approximates the batch and online combination. Similarly to the online MF model, we only allow a single iteration for the model, in temporal order. However, for each interaction, we generate not only negative but also positive samples. The positive samples are selected randomly from past interactions, i.e. we allow the model to re-learn the past. We generate pRate positive samples along with nRate negative samples, hence for t events, we only take (1 + nRate + pRate) · t gradient steps.

The samples are not drawn uniformly from the past, but selected randomly from a pool S with maximum size s. This avoids oversampling interactions that happened at the beginning of the data set. More specifically, as seen in Fig. 2, for each observed new training instance, we take the following steps (sketched in code below):
• update the model by pRate random samples from S,
• delete the selected samples from S if |S| > s,
• and add the new instance pRate times to S.
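A possible realization of the pool logic, reusing the hypothetical OnlineMF.update of the previous sketch as the per-sample training step (p_rate and pool_size mirror pRate and the pool size s):

    import random

    class PastSampler:
        # Pool-based positive sampling from the past (Section 3.5), pool = S.
        def __init__(self, p_rate=3, pool_size=3000, seed=0):
            self.p_rate, self.pool_size = p_rate, pool_size
            self.pool = []                   # the sampling set S
            self.rng = random.Random(seed)

        def process(self, model, user, item):
            # Update the model by p_rate random positive samples from S.
            for _ in range(min(self.p_rate, len(self.pool))):
                k = self.rng.randrange(len(self.pool))
                u, i = self.pool[k]
                model.update(u, i)           # one positive (plus negative) step
                if len(self.pool) > self.pool_size:
                    # Delete the selected sample if |S| > s; swap-remove is O(1).
                    self.pool[k] = self.pool[-1]
                    self.pool.pop()
            # Add the new instance p_rate times to S, then train on it.
            self.pool.extend([(user, item)] * self.p_rate)
            model.update(user, item)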
3.6 Asymmetric Matrix Factorization (AMF)

Asymmetric Matrix Factorization [25] computes item-to-item similarity with the help of latent vectors for items. In contrast to Section 3.4, both latent matrices P and Q correspond to items. Using the notations and the time decaying function from Section 3.2, the score assigned to item i for user u is

    score(u, i) = Σ_{j ∈ I_u} f(|I_u| − s_{uj}) · P_j Q_i.    (6)

Sampling negative instances and updating the latent vectors online using stochastic gradient descent can be done in a way similar to the one described in Section 3.4.
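Under equation (6), prediction is a recency-weighted sum of dot products between the latent vectors of the history items and of the candidate item; a minimal sketch, assuming P and Q map items to numpy vectors:

    import numpy as np

    def amf_score(P, Q, history, i, gamma=0.9):
        # Eq. (6): score(u, i) = sum_j f(|I_u| - s_uj) * (P_j . Q_i),
        # with history = user u's item sequence and f(tau) = gamma^tau.
        if not history:
            return 0.0
        n = len(history)
        weighted = sum(gamma ** (n - s) * P[j] for s, j in enumerate(history))
        return float(np.dot(weighted, Q[i]))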
4 DATA

We compare the performance of the batch and the online algorithms of Section 3 over the seven data sets described next. For each data set, we discard items that appear fewer than ten times in interactions. We apply no filtering for the users.

Twitter. We use 400M tweets over the four months between February 1 and May 30, 2012 from a crawl introduced in [10]. We recommend new hashtags that the user has not used before.

Last.fm. We recommend artists and tracks in the 30Music data set [31]. New tracks appear more frequently than new artists. We expect that user-track events are highly non-stationary, more so than user-artist events.

MovieLens data sets [12], first released in 1998, contain timestamped explicit movie ratings. We use the ML10M data set, consisting of 10,000,064 ratings of 69,878 users for 10,681 movies. In our experiments, we consider these records as implicit interactions.

Amazon data sets [21] are browsing and co-purchasing data crawled from Amazon. We use four categories: Books, Electronics, as well as Movies and TV.

4.1 Statistics

Before turning to our findings, we show two statistics that highlight the non-stationary properties of the data sets.

Burstiness. The inter-event time (Fig. 3) is the time elapsed between consecutive events of the same user. In accordance with several results [3, 16, 32], the distribution is highly skewed for all data sets, which show their burstiness in the approximate order from strongest to weakest as Twitter, Amazon, Last.fm and finally MovieLens (Fig. 3, right to left).

Figure 3: Statistics over the 7 data sets. Top: inter-event time distribution. Bottom: transition matrix statistics.

Item-to-item transitions. Users of online music services very often follow playlists, which results in artificially very strong performance of the item-to-item methods. When listening to playlists, users do not actually request recommendations, hence evaluating user sequences for recommendation quality is somewhat artificial. Next, we examine whether the data sets contain trivial item-to-item transitions by computing the item-to-item transition matrix, i.e. the number of times j followed i in the sessions of the same user. For each item i, we count the least number of unique i → j item transitions needed to cover 90% of the item's transitions. In Figure 3, we plot the fraction of these against the total number of unique i → j transitions. We observe that for the Last.fm track data, the MovieLens data set, and the Twitter data, lower fractions are possible, hence these data sets may involve trivial transitions.

Shuffling the temporal order. In order to assess the importance of the time dimension in the streaming recommendation task, we compare our results over the same data sets but with randomly shuffled order of time. In other words, we evaluate over the shuffled
stationary variant of each data set, where we expect the temporal effects to disappear.

5 RESULTS

In this section, we describe our results, including the performance of the algorithms described in Section 3 and our experiments with matrix factorization training strategies. In Table 1, we summarize the performance of our models by online DCG. Online MRR behaves almost identically and is not shown due to space limitations.

The Twitter data is very sparse and bursty, with the highest fraction of small inter-event times (Fig. 3). As a result, factorization methods perform worse and popularity models perform well. The Last.fm track data set is similar, but it incorporates playlist and album listening data, hence transition and similarity models perform better than factorization and popularity. The number of items in the Last.fm artist data is significantly lower than in the track data: as a result, factorization gets close to the transition and similarity models on this data set. In Fig. 3 we can observe that the Last.fm track data, the MovieLens data set, and the Twitter data contain several predictable item-to-item transitions. This observation is consistent with the performance of the transition model, as it provides the strongest DCG values in case of these three data sets.

For batch rating prediction, MF is known in the literature to perform very strongly. For ranking prediction, however, our results show the superiority of item-to-item nearest neighbor methods. Furthermore, the performance of batch MF models drops considerably when the data is non-stationary. For example, for MovieLens, MF is the best performing algorithm after shuffling, but the worst for the original temporal order. Popularity, the simplest algorithm, performs well for many data sets in the non-stationary setting, although the strong performance is mainly due to users with few interactions. AMF performs between MF and NN, slightly better for non-stationary data. The transition model is constantly inferior to NN, but despite its simplicity, it is competitive with the more complex factorization models for non-stationary data.

5.1 Online training strategies for MF

The high performance of the transition models indicates the existence of trivial item-to-item transitions. In music data, such as the Last.fm data set, the listening of predefined playlists or albums results in predictable item-to-item transitions. These are easy to predict, however they are not valuable for the actual user, therefore they bias recommender evaluation. We filtered the Last.fm artist data before comparing different factor model training strategies to reduce the above effects. We only considered records after at least 30 minutes of user inactivity, i.e. the start of each user session. Statistics of the filtered 30M data are shown in Fig. 4.

Figure 4: Statistics over the 30M data set. Top: number of records per week. Bottom: number of weekly new users and items.

We compared multiple different learning strategies of MF models: batch, online, batch & online, and sampling methods. In the experiments we use the following best parameters. We train batch MF weekly with η = 0.05, nRate = 69 and 9 iterations in random order. For online MF we set η = 0.2 and nRate = 99. The parameters are the same for batch & online. For the sampling model, we use a lower learning rate η = 0.05 and nRate = 39. Best results are achieved by generating pRate = 3 positive samples from the past for each record, with pool size 3,000.

Figure 5: Performance of the sampling method compared to three MF variants: batch, online and batch & online. Top: each model trained individually from the beginning of time. Bottom: the sampling and online models are started from a batch model at week 18.
Original data:

                                  POP    TRANS   t-NN   batch MF       online MF, F =         batch     AMF    SVD++
                                                        full   sample  10     100    1,000    & online
    Twitter all                   0.350  0.149   0.027  0.010  0.006   0.335  0.337  0.340    0.269     0.332  0.348
    Twitter from 10th             0.416  0.267   0.036  0.008  0.005   0.523  0.527  0.532    0.457     0.516  0.516
    Last.fm track                 0.006  0.276   0.298  0.007  0.009   0.022  0.066  0.010    0.021     0.030  0.028
    Last.fm artist                0.024  0.051   0.064  0.032  0.031   0.048  0.042  0.004    0.049     0.049  0.051
    MovieLens10M                  0.086  0.148   0.148  0.012  0.012   0.140  0.142  0.038    0.133     0.167  0.172
    Amazon Books all              0.026  0.033   0.042  0.008  0.007   0.025  0.032  0.009    0.016     0.025  0.026
    Amazon Books from 10th        0.017  0.046   0.074  0.010  0.012   0.042  0.053  0.018    0.021     0.040  0.042
    Amazon Electronics            0.035  0.015   0.014  0.006  0.006   0.016  0.018  0.019    0.011     0.018  0.019
    Amazon Electronics from 10th  0.023  0.024   0.035  0.017  0.018   0.046  0.049  0.050    0.023     0.048  0.050
    Amazon Movies                 0.066  0.029   0.034  0.023  0.007   0.035  0.039  0.014    0.024     0.035  0.037
    Amazon Movies from 10th       0.048  0.037   0.062  0.040  0.016   0.064  0.068  0.037    0.038     0.060  0.064

Shuffled data:

                                  POP    TRANS   NN     batch MF       online MF, F =         batch     AMF    SVD++
                                                        full   sample  10     100    1,000    & online
    Twitter all                   0.020  0.041   0.055  0.050  0.045   0.034  0.039  0.045    0.050     0.044  0.044
    Twitter from 10th             0.017  0.047   0.091  0.073  0.082   0.060  0.068  0.074    0.073     0.066  0.067
    Last.fm track                 0.004  0.002   0.043  0.010  0.021   0.008  0.009  0.002    0.010     0.006  0.008
    Last.fm artist                0.023  0.014   0.036  0.043  0.041   0.035  0.025  0.005    0.043     0.033  0.037
    MovieLens10M                  0.059  0.058   0.065  0.088  0.085   0.080  0.052  0.004    0.088     0.072  0.080
    Amazon Books all              0.009  0.016   0.027  0.009  0.010   0.007  0.009  0.008    0.011     0.010  0.010
    Amazon Books from 10th        0.004  0.010   0.042  0.013  0.016   0.012  0.015  0.013    0.014     0.013  0.013
    Amazon Electronics            0.020  0.010   0.011  0.009  0.007   0.007  0.008  0.009    0.009     0.011  0.011
    Amazon Electronics from 10th  0.013  0.008   0.025  0.020  0.012   0.019  0.020  0.020    0.020     0.021  0.021
    Amazon Movies                 0.023  0.018   0.023  0.016  0.014   0.011  0.012  0.013    0.016     0.015  0.015
    Amazon Movies from 10th       0.007  0.010   0.034  0.024  0.024   0.019  0.021  0.021    0.024     0.019  0.020

Table 1: Online DCG@100 of different algorithms. Data marked "from 10th" are re-evaluated over items past the 10th interaction only, for each user. Top: original data. Bottom: shuffled data.
In Fig. 5 we plot the average weekly DCG over a one year period. While sampling appears to keep up with batch & online in the beginning, its performance drops in the second part. If we start each factor model from the batch model of week 18, sampling produces results similar to batch & online. In Fig. 6, as a function of k, we start with the batch model of week k. Then, we only use online updates, with or without further sampling. It can be seen that sampling performs roughly the same as batch & online for k > 10. Note that the number of weekly incoming new users and items drops after the first 10 weeks, as seen in Fig. 4. The single iteration online model produces comparable, but slightly worse results than the sampling version.

Figure 6: Performance of the online and sampling methods started from a batch model at week k. Averages are taken between weeks 25 and 45.

In summary, in a non-stationary environment, multiple passes (i.e. batch models) over the data are required to fully incorporate many new users and items into the system. However, if a large component of the user-item matrix has previously been learned by a robust batch model, a simple online sampling algorithm is sufficient to keep up in performance with periodic batch re-training, while requiring only 3 additional positive samples from the past per iteration and thus being much more efficient.

6 CONCLUSIONS

Despite the fact that a practical recommender system processes events and generates top lists in an online sequence, the literature has paid relatively little attention to designing and evaluating online learning methods in highly non-stationary environments. We proposed and evaluated a variety of algorithms, as well as measures that are predictive of which algorithm will work well on a particular data set. We showed that simpler algorithms that can use the most recent data by updating their models online perform better than more complex algorithms that can be updated only periodically. We also showed that sampling from past events may completely replace batch modeling needs in a real time recommender system, thus reducing latency. We released our code as an open source project.³

³ https://github.com/rpalovics/Alpenglow
REFERENCES
 [1] J. Abernethy, K. Canini, J. Langford, and A. Simma. Online collaborative filtering. University of California at Berkeley, Tech. Rep., 2007.
 [2] A. Al-Maskari, M. Sanderson, and P. Clough. The relationship between IR effectiveness measures and user satisfaction. In Proc. SIGIR, pages 773–774. ACM, 2007.
 [3] A.-L. Barabási. The origin of bursts and heavy tails in human dynamics. Nature, 435(7039):207–211, 2005.
 [4] J. Bennett and S. Lanning. The Netflix Prize. In KDD Cup and Workshop in conjunction with KDD, 2007.
 [5] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge Univ
     Pr, 2006.
 [6] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms
     on top-n recommendation tasks. In Proc. RecSys, pages 39–46. ACM, 2010.
 [7] M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms.
     ACM TOIS, 22(1):143–177, 2004.
 [8] C. Desrosiers and G. Karypis. A comprehensive survey of neighborhood-based
     recommendation methods. In Recommender systems handbook, pages 107–144.
     Springer, 2011.
 [9] Y. Ding and X. Li. Time weight collaborative filtering. In Proc. CIKM, pages
     485–492. ACM, 2005.
[10] L. Dobos, J. Szule, T. Bodnar, T. Hanyecz, T. Sebok, D. Kondor, Z. Kallus, J. Stéger,
     I. Csabai, and G. Vattay. A multi-terabyte relational database for geo-tagged
     social network data. In Proc. CogInfoCom, pages 289–294. IEEE, 2013.
[11] S. Funk. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html, 2006.
[12] F. M. Harper and J. A. Konstan. The movielens datasets: History and context.
     ACM TiiS, 5(4):19, 2016.
[13] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk. Session-based recommenda-
     tions with recurrent neural networks. In ICLR, 2016.
[14] B. Hidasi and D. Tikk. Fast als-based tensor factorization for context-aware
     recommendation from implicit feedback. In Machine Learning and Knowledge
     Discovery in Databases, pages 67–82. Springer, 2012.
[15] B. Hidasi and D. Tikk. Context-aware item-to-item recommendation within the
     factorization framework. In Proc. 3rd Workshop on Context-awareness in Retrieval
     and Recommendation, pp. 19–25. 2013.
[16] M. Kivelä and M. A. Porter. Estimating inter-event time distributions from finite observation periods in communication networks. arXiv preprint arXiv:1412.8388, 2014.
[17] N. Koenigstein and Y. Koren. Towards scalable and accurate item-oriented
     recommendations. In Proc RecSys, pages 419–422. ACM, 2013.
[18] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recom-
     mender systems. Computer, 42(8):30–37, 2009.
[19] N. Lathia, S. Hailes, and L. Capra. Temporal collaborative filtering with adaptive
     neighbourhoods. In Proc SIGIR, pages 796–797. ACM, 2009.
[20] G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item
     collaborative filtering. Internet Computing, IEEE, 7(1):76–80, 2003.
[21] J. McAuley, R. Pandey, and J. Leskovec. Inferring networks of substitutable and
     complementary products. In Proc SIGKDD, pages 785–794. ACM, 2015.
[22] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips,
     E. Davydov, D. Golovin, et al. Ad click prediction: a view from the trenches. In
     Proc SIGKDD, pages 1222–1230. ACM, 2013.
[23] R. Pálovics and A. A. Benczúr. Temporal influence over the last.fm social network.
     In Proc. ASONAM, pages 486–493. ACM, 2013.
[24] R. Pálovics, A. A. Benczúr, L. Kocsis, T. Kiss, and E. Frigó. Exploiting temporal
     influence in online recommendation. In Proc. RecSys, pages 273–280. ACM, 2014.
[25] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proc. SIGKDD, pages 39–42. ACM, 2007.
[26] I. Pilászy, A. Serény, G. Dózsa, B. Hidasi, A. Sári, and J. Gub. Neighbor methods
     vs matrix factorization - case studies of real-life recommendations. In LSRS2015
     at RECSYS, 2015.
[27] I. Pilászy, D. Zibriczky, and D. Tikk. Fast als-based matrix factorization for explicit
     and implicit feedback datasets. In Proc. RecSys, pages 71–78. ACM, 2010.
[28] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized
     markov chains for next-basket recommendation. In Proc. WWW, pages 811–820.
     ACM, 2010.
[29] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering
     recommendation algorithms. In Proc. WWW, pages 285–295, 2001.
[30] G. Takács, I. Pilászy, B. Németh, and D. Tikk. Investigation of various matrix
     factorization methods for large recommender systems. In Proc. 2nd KDD Workshop
     on Large-Scale Recommender Systems and the Netflix Prize Competition, pages 1–8.
     ACM, 2008.
[31] R. Turrin, M. Quadrana, A. Condorelli, R. Pagano, and P. Cremonesi. 30music
     listening and playlists dataset. In RecSys Posters, 2015.
[32] A. Vazquez, B. Racz, A. Lukacs, and A.-L. Barabási. Impact of non-Poissonian activity patterns on spreading processes. Physical Review Letters, 98(15):158702, 2007.
[33] J. Vinagre, A. M. Jorge, and J. Gama. Evaluation of recommender systems in streaming environments. In Workshop on Recommender Systems Evaluation: Dimensions and Design (REDD), in conjunction with RecSys, 2014.
[34] E. M. Voorhees et al. The TREC-8 question answering track report. In TREC, volume 99, pages 77–82, 1999.
[35] X. Yang, H. Steck, Y. Guo, and Y. Liu. On top-k recommendation using social networks. In Proc. RecSys, pages 67–74. ACM, 2012.
[36] Q. Yuan, L. Chen, and S. Zhao. Factorization vs. regularization: fusing heterogeneous social relationships in top-n recommendation. In Proc. RecSys, pages 245–252. ACM, 2011.