=Paper=
{{Paper
|id=Vol-2069/STREAMEVOLV2
|storemode=property
|title=Online Bagging for Recommendation with Incremental Matrix Factorization
|pdfUrl=https://ceur-ws.org/Vol-2069/STREAMEVOLV2.pdf
|volume=Vol-2069
|authors=João Vinagre,Alípio Mário Jorge,João Gama
|dblpUrl=https://dblp.org/rec/conf/pkdd/VinagreJG16
}}
==Online Bagging for Recommendation with Incremental Matrix Factorization==
<pdf width="1500px">https://ceur-ws.org/Vol-2069/STREAMEVOLV2.pdf</pdf>
<pre>
      Online bagging for recommendation with
          incremental matrix factorization

           João Vinagre1,2 , Alı́pio Mário Jorge1,2 , and João Gama1,3
                     1
                     LIAAD - INESC TEC, Porto, Portugal
                     2
                     Faculty of Sciences, University of Porto
                  3
                    Faculty of Economics, University of Porto
           jnsilva@inesctec.pt, amjorge@fc.up.pt, jgama@fep.up.pt


      Abstract. Online recommender systems often deal with continuous, po-
      tentially fast and unbounded flows of data. Ensemble methods for rec-
      ommender systems have been used in the past in batch algorithms, how-
      ever they have never been studied with incremental algorithms, that are
      capable of processing those data streams on the fly. We propose online
      bagging, using an incremental matrix factorization algorithm for positive-
      only data streams. Using prequential evaluation, we show that bagging
      is able to improve accuracy more than 20% over the baseline with small
      computational overhead.


Keywords: Recommender systems, bagging, matrix factorization, data streams


1   Introduction
In many real world recommender systems, user feedback is continuously gener-
ated at unpredictable rates and order, and is potentially unbounded. In large
scale systems, the rate at which user feedback is generated can be very fast.
Building predictive models from these continuous flows of data is a problem ac-
tively studied in the field of data stream mining. Ideally, algorithms that learn
from data streams should be able to process data at least as fast as it arrives, in
a single pass, while maintaining an always-available model [3]. Most incremen-
tal algorithms naturally have these properties, and are thus a viable solution,
including in recommendation problems [12]. Incremental algorithms for recom-
mendation treat user feedback data as a data stream, immediately incorporating
new data in the recommendation model. In most recommendation tasks this is
a desirable feature, since the task of a recommender system is to find the most
relevant items – such as books, music tracks, restaurants or movies – to each
user, individually. Naturally, users are human beings, whose preferences change
over time. Moreover, in large scale systems, new users and items are permanently
entering the system. A model that is immediately updated with fresh data has
the capability of adjusting faster to such changes.
    Ensemble methods in machine learning are convenient techniques to improve
the accuracy of algorithms. Typically, this is achieved by combining results from
a number of weaker sub-models. Bagging [1], Boosting [4] and Stacking [13]
are three well-known ensemble methods used with recommendation algorithms.
Boosting is experimented in [2,9,7,10], bagging is studied also in [7,10], and
stacking in [11]. In all of these contributions, ensemble methods work with batch
learning algorithms only.
    In this paper we propose online bagging for incremental recommendation al-
gorithms designed to deal with streams of positive user feedback. To our best
knowledge this is the first ensemble method proposed for incremental recom-
mender systems in the literature.


2   Online bagging

Bagging [1] is an ensemble technique that takes a number of bootstrap samples
of a dataset and trains a model on each one of the samples. Predictions from the
various sub-models are then aggregated in a final prediction. This is known to
improve the performance of algorithms by reducing variance, which is especially
useful with unstable algorithms that are very sensitive to small changes in the
data. The diversity offered by training several models with slightly different
bootstrap samples of the data helps in giving more importance to the main
concepts being learned – since they must be present in most bootstrap samples
of the data – , and less importance to noise or irrelevant phenomena that may
mislead the learning algorithm.
    To obtain a bootstrap sample of a dataset with size N , we perform N trials,
sampling a random example with replacement from the dataset. Each example
has probability of 1/N to be sampled at each trial. The resulting dataset will
have the same size of the original dataset, however some examples will not be
present whereas some others will occur multiple times. To obtain M samples,
we simply repeat the process M times.
    In its original proposal [1], bagging is a batch procedure requiring N × M
passes through the dataset. However, it has been shown in [8] that this can be
done incrementally in a single pass, if the number of examples is very large –
a natural assumption when learning from data streams. Looking at the batch
method above, we observe that each bootstrap sample contains K occurrences
of each example, with K ∈ {0, 1, 2, . . .}, and:
                                  k       N −k
                                N   1        1
                   P (K = k) =            1−                                  (1)
                                k   N        N
    In an incremental setting, one could just initialize M sub-models – or boot-
strap nodes – and then use (1) to train new examples K times, redrawing K for
each node. The problem is that this would still require knowing N beforehand.
However, if we assume that N → ∞, then the distribution of K tends to a
P oisson(1) distribution, and therefore

                                             e−1
                               P (K = k) =                                    (2)
                                              k!
      eliminating the need of any prior knowledge about the data.

2.1     Online recommendation with bagging
We use incremental bagging with ISGD [12], an online matrix factorization al-
gorithm for positive-only data. ISGD (Algorithm 1) uses Stochastic Gradient
Descent in one pass through the data, which is convenient for data stream pro-
cessing. It is designed for positive-only streams of user-item pairs (u, i). Each
pair indicates a positive interaction between user u and item i.


    Algorithm 1: ISGD - Incremental SGD for positive-only ratings [12]
      Data: a finite set or a data stream D = {(u, i)1 , (u, i)2 , . . .}
      input : no. of latent features k, no. of iterations iter, regularization factor λ,
              learn rate η
      output: user and item factor matrices A and B
      for (u, i) ∈ D do
          if u 6∈ Rows(A) then
               Au ← Vector(size : k)
               Au ∼ N (0, 0.1)
          if i 6∈ Rows(B) then
               Bi ← Vector(size : k)
               Bi ∼ N (0, 0.1)
          for n ← 1 to iter do
               errui ← 1 − Au · Bi
               Au ← Au + η(errui Bi − λAu )
               Bi ← Bi + η(errui Au − λBi )


   By applying the online bagging approach described in Section 2, we obtain
Algorithm 2 – BaggedISGD.
   BaggedISGD learns a model based on M bootstrap nodes. To perform the
actual list of recommendations for a user u, items i are sorted by a function
f = |1 − R̂ui | as with ISGD. The scores R̂ui are the average score of all nodes:
                                           PM    m    m
                                            m=1 Au · Bi
                                  R̂ui =                                     (3)
                                           M
    At training time, this algorithm requires at least m times the computational
resources needed for ISGD, with m bootstrap nodes. Recommendation also has
the overhead of aggregating m predictions from the submodels.

3      Evaluation
To simulate a streaming environment we need datasets that maintain the natural
order of the data points, as they were generated. Additionally, we need positive-
only data, since the tested algorithm is not designed to deal with ratings. We use
    Algorithm 2: BaggedISGD - Bagging version of ISGD (training algorithm)
     Data: a finite set or a data stream of user-item pairs D = {(u, i)1 , (u, i)2 , . . .}
     input : no. of latent features k, no. of iterations iter, regularization factor λ,
             learn rate η, no. of bootstrap nodes M
     output: M user and item factor matrices Am and B m
     for (u, i) ∈ D do
         for m ← 1 to M do
             k ∼ Poisson(1)                                                     // eq. (2)
             if k > 0 then
                 for l ← 1 to k do
                     if u 6∈ Rows(Am ) then
                          Amu ← Vector(size : k)
                          Amu ∼ N (0, 0.1)
                     if i 6∈ Rows(B m ) then
                          Bim ← Vector(size : k)
                          Bim ∼ N (0, 0.1)
                     for n ← 1 to iter do
                          errui ← 1 − Am u · Bi
                                               m

                          Amu ←  Am
                                  u  +  η(err     m     m
                                              ui Bi − λAu )
                          Bi ← Bi + η(errui Au − λBim )
                            m      m              m


4 datasets that conciliate these two requirements – positive-only and naturally
ordered –, described in Table 1. ML1M is based on the Movielens-1M movie
rating dataset4 . To obtain the YHM-6KU, we sample 6000 users randomly from
the Yahoo! Music dataset5 . LFM-50U is a subset consisting of a random sample
of of 50 users taken from the Last.fm6 dataset7 . PLC-STR consists of the music
streaming history taken from Palco Principal8 , a portuguese social network for
non-mainstream artists and fans. All of the 4 datasets consist of a chronologically
ordered sequence of positive user-item interactions. However, ML1M and YHM-
50U are obtained from ratings datasets. To use them as positive-only data, we
retain the user-item pairs for which the rating is in the top 20% of the rating
scale. This means retaining only the rating 5 in ML1M and rating of 80 or more
in the YHM-6KU dataset. Naturally, only single accurrences of user-item pairs
are available in these datasets, since users do not rate the same item more than
once. PLC-STR and LFM-50 have multiple occurrences of the same user-item
pairs.
    We run a set of experiments using the prequential approach [5] as described
in [12]. Each observation in the dataset consists of a simple user-item pair (u, i)

4
  http://www.grouplens.org/data [Jan 2013]
5
  https://webscope.sandbox.yahoo.com/catalog.php?datatype=r [Jan 2013]
6
  http://last.fm/
7
  http://ocelma.net/MusicRecommendationDataset [Jan 2013]
8
  http://www.palcoprincipal.com/
                   Dataset      Events Users Items Sparsity
                   PLC-STR     588 851 7 580 30 092 99.74%
                   LFM-50U 1 121 520      50 159 208 85.91%
                   YHM-6KU 476 886 6 000 127 448 99.94%
                   ML1M        226 310 6 014 3 232 98.84%
                          Table 1. Dataset description


that indicates a positive interaction between user u and item i. The following
steps are performed in the prequential evaluation process:

1. If u is a known user, use the current model to recommend a list of items to
   u, otherwise go to step 3;
2. Score the recommended list given the observed item i;
3. Update the model with (u, i) (optionally);
4. Proceed to – or wait for – the next observation

    This process is entirely applicable to algorithms that learn either incremen-
tally or in batch mode. This is the reason why step 3. is annotated as optional.
For example, instead of performing this step, the system can store the data to
perform batch retraining periodically.


       Dataset    M     Rec@1 Rec@5 Rec@10 Rec@20 Upd. (ms) Rec. (ms)
                  ISGD 0.127 0.241       0.277   0.302      0.237   21.736
                  8       0.076 0.194    0.257   0.316      2.563   64.793
       PLC-STR
                  16      0.081 0.215    0.284   0.349      4.732 132.812
                  32      0.088 0.229    0.302   0.370      9.508 264.846
                  64      0.092 0.237 0.313 0.384          18.012 517.479
                  ISGD 0.034 0.049       0.052   0.055      2.625   94.177
                  8       0.023 0.044    0.052   0.058     21.449 241.452
       LFM-50U
                  16      0.026 0.050    0.059   0.066     43.094 491.689
                  32      0.028 0.055    0.064   0.071     84.536 984.060
                  64      0.030 0.057 0.067 0.075         168.781   1.958 s
                  ISGD 0.030 0.063       0.082   0.103      4.462   89.321
                  8       0.011 0.033    0.051   0.076     28.529 347.422
       YHM-6KU
                  16      0.012 0.037    0.058   0.086     54.723 667.898
                  32      0.019 0.055    0.082   0.117    158.744 990.551
                  64      0.021 0.059 0.087 0.123         328.924   1.934 s
                  ISGD 0.005 0.021       0.034   0.055      0.069     2.557
                  8       0.005 0.019    0.033   0.056      0.517     7.208
       ML1M
                  16      0.006 0.022    0.038   0.063      1.390   21.816
                  32      0.006 0.025    0.042   0.071      1.866   33.496
                  64     0.007 0.026 0.045 0.074            3.999   41.090
Table 2. Average performance of ISGD with and without bagging. M is the number
of bootstrap nodes. The last two columns contain the average update times and the
average recommendation times.
    To kickstart the evaluation process we use 10% the available data to train a
base model in batch, and use the remaining 90% to perform incremental training
and evaluation. We do this initial batch training to avoid cold-start problems,
which are not the subject of our research.
    In our setting, the items that users have already co-occurred with – i.e. items
that users know – are not recommended. This has one important implication in
the prequential evaluation process, specifically on datasets that have multiple
occurrences of the same user-item pair. Evaluation at these points is necessarily
penalized, since the observed item will be not be within the recommendations.
In such cases, we bypass the scoring step, but still use the observation to update
the model.
    We measure two dimensions on the evaluation process: accuracy and time.
In the prequential process described above, we need to make a prediction and
evaluate it at every new user-item pair (u, i) that arrives in the data stream.
To do this, we use the current model to recommend a list of items to user u.
We then score this recommendation list, by matching it to the actually observed
item i. We use a recommendation list with at most 20 items, and then score this
list as 1 if i is within the recommended items, and 0 otherwise, using Recall@C
with cutoffs C ∈ {1, 5, 10, 20}. Because only one item is tested against the list,
Recall@C can only take the values {0, 1}. We can calculate the overall Recall@C
by averaging the scores at every step. Additionally, we can also depict it using
a moving average. Time is measured in milliseconds at every step and we depict
it using the same techniques we use with accuracy.
    All experiments were run in Intel Haswell 4-core machines, with CentOS
Linux 7 64 bit. The algorithms and prequential evaluation code is implemented
on top of MyMediaLite [6]. The recommendation step is implemented with multi-
core code – predictions from nodes are computed in parallel.


3.1   Results

To evaluate bagging, we experiment with four levels of bootstrapping M ∈
{8, 16, 32, 64}. Table 2 summarizes the results of our experiments. Values in
Table 2 are obtained by averaging Recall and time obtained at all prequential
evaluation steps. With all datasets except YHM-6KU, bagging improves the Re-
call, especially with M ≥ 32. One interesting observation is that bagging has a
bigger influence on higher Recall cutoffs, which suggests that improvements of
the predictive ability are typically not obtained in the top 5 recommended items.
    The model update times increase approximately in proportion to the number
of bootstrap nodes M , which is not surprising, since the algorithm performs the
update operations one time (in average) in each one of the M bootstrap nodes.
However, since the baseline update time is small, this overhead is also small.
The last column of Table 2 contains the recommendation time, specifically the
average time required to produce a recommendation list. This is important, be-
cause the bagging algorithm needs to aggregate predictions coming from all M
nodes, which is obviously an overhead. Results show that both the update times
and recommendation times increase proportionally to M . However, the recom-
mendation step is a far more costly operation, even when computed in parallel.
For example, using M = 64 with LFM-50U and YHM-6KU, recommendations
are computed in nearly two seconds in average, in 4-core machines.
    A useful feature of prequential evaluation is that it allows us also to depict the
evolution of Recall@209 in Figure 1. This visualization reveals how the predictive
ability of the algorithm performs over time, as the incremental learning process
occurs.


Fig. 1. Prequential evaluation of Recall@20 with ISGD with and without bagging.
Lines are drawn using a moving average of Recall@20 with n = 10000. The first 10 000
points are drawn using the accumulated average.


3.2     Discussion
Results in Table 2 and Figure 1 show that bagging clearly improves the accuracy
of ISGD, with accuracy improvements of 20% over the baseline (see Table 2 LFM-
50U and ML1M). This improvement is mainly observable with cutoffs C ≥ 5 of
Recall. Given that bagging reduces variance [1], this suggests that the variance
of ISGD is lower in the top few recommendations. Another observation is that
improvements are not consistent with all datasets. With LFM-50U, for example,
bagging only slightly outperforms the baseline ISGD – and only with M ≥ 32 –,
while with PLC-STR, the improvement is much higher in proportion, even with
lower M .
9
    For the sake of space, we omit other cutoffs.
    It is also clear that the time overheads grow linearly with the number of
bootstrap models. However, the overhead in model update times is not very rel-
evant in practice, given that the baseline update times are very low in ISGD
– with M = 64 the highest update time falls below 400ms. The overhead at
recommendation time is more evident, when aggregating results from the M
bootstrap nodes. Fortunately, as with most ensemble techniques, parallel pro-
cessing can be trivially used to alleviate this overhead. Additionally, there may
be room for code optimization or approximate methods that require less and/or
more efficient computations.


4   Conclusions
Bagging is a an ensemble technique successfully used with many machine learn-
ing algorithms, however it has not been thoroughly studied in recommendation
problems, and particularly with incremental algorithms. In this paper, we exper-
iment online bagging with an incremental matrix factorization algorithm that
learns from unbounded streams of positive-only data. Our results suggest that
with manageable overheads, accuracy clearly improves – more than 20% in some
cases –, especially as the number of recommended items increases. In the near
future, we intend to experiment this and other online ensemble methods in a
larger number of stream-based recommendation algorithms.


5   Acknowledgments
Project “TEC4Growth – Pervasive Intelligence, Enhancers and Proofs of Con-
cept with Industrial Impact/NORTE-01-0145-FEDER-000020” is financed by
the North Portugal Regional Operational Programme (NORTE 2020), under
the PORTUGAL 2020 Partnership Agreement, and through the European Re-
gional Development Fund (ERDF). This work is partially funded by the Eu-
ropean Commission through project MAESTRA (Grant no. ICT-2013-612944).
We thank Ubbin Labs, Lda. for kindly providing data from Palco Principal.


References
 1. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996),
    http://dx.doi.org/10.1007/BF00058655
 2. Chowdhury, N., Cai, X., Luo, C.: Boostmf: Boosted matrix factorisation for col-
    laborative ranking. In: Proc. of the European Conf. on Machine Learning and
    Knowledge Discovery in Databases, ECML PKDD 2015, Part II. LNCS, vol. 9285,
    pp. 3–18. Springer (2015), http://dx.doi.org/10.1007/978-3-319-23525-7 1
 3. Domingos,     P.,    Hulten,   G.:   Catching     up  with    the   data:   Re-
    search    issues    in    mining    data     streams.   In:   DMKD       (2001),
    http://www.cs.cornell.edu/johannes/papers/dmkd2001-papers/p8 domingos.pdf
 4. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Proc.
    of the 13th Intl. Conference on Machine Learning ICML ’96. pp. 148–156. Morgan
    Kaufmann (1996)
 5. Gama, J., Sebastião, R., Rodrigues, P.P.: On evaluating stream
    learning      algorithms.     Machine      Learning     90(3),    317–346      (2013),
    http://dx.doi.org/10.1007/s10994-012-5320-9
 6. Gantner, Z., Rendle, S., Freudenthaler, C., Schmidt-Thieme, L.: Mymedialite: a
    free recommender system library. In: Proc. of the 2011 ACM Conference on Rec-
    ommender Systems, RecSys 2011. pp. 305–308. ACM (2011)
 7. Jahrer, M., Töscher, A., Legenstein, R.A.: Combining predictions for accurate rec-
    ommender systems. In: Proc. of the 16th ACM SIGKDD International Conference
    on Knowledge Discovery and Data Mining, KDD 2010. pp. 693–702. ACM (2010),
    http://doi.acm.org/10.1145/1835804.1835893
 8. Oza, N.C., Russell, S.J.: Experimental comparisons of online and batch versions of
    bagging and boosting. In: Proc. of the 7th ACM SIGKDD international conference
    on Knowledge discovery and data mining, KDD 2001. pp. 359–364. ACM (2001),
    http://portal.acm.org/citation.cfm?id=502512.502565
 9. Schclar, A., Tsikinovsky, A., Rokach, L., Meisels, A., Antwarg, L.: Ensemble meth-
    ods for improving the performance of neighborhood-based collaborative filtering.
    In: Proc. of the 2009 ACM Conference on Recommender Systems, RecSys 2009.
    pp. 261–264. ACM (2009), http://doi.acm.org/10.1145/1639714.1639763
10. Segrera, S., Moreno, M.N.: An experimental comparative study of web mining
    methods for recommender systems. In: Proc. of the 6th WSEAS Intl. Conf. on
    Distance Learning and Web Engineering. pp. 56–61. WSEAS (2006)
11. Sill, J., Takács, G., Mackey, L.W., Lin, D.: Feature-weighted linear stacking. CoRR
    abs/0911.0460 (2009), http://arxiv.org/abs/0911.0460
12. Vinagre, J., Jorge, A.M., Gama, J.: Fast incremental matrix factorization for rec-
    ommendation with positive-only feedback. In: Proc. of the 22nd Intl. Conference on
    User Modeling, Adaptation, and Personalization, UMAP 2014. LNCS, vol. 8538,
    pp. 459–470. Springer (2014), http://dx.doi.org/10.1007/978-3-319-08786-3 41
13. Wolpert, D.H.: Stacked generalization. Neural Networks 5(2), 241–259 (1992),
    http://dx.doi.org/10.1016/S0893-6080(05)80023-1

</pre>