           An Initial Investigation of Multi-Cyclic Training Regimen for
                    Collaborative Filtering Models in GraphChi


                                           James W. Curnalia and Alina Lazar
                                    Department of Computer Science and Information Systems
                                                  Youngstown State University
                                                   Youngstown, Ohio 44555
                                                  jwcurnalia@student.usu.edu
                                                       alazar@ysu.edu


                              Abstract

  More and more outlets are utilizing collaborative filtering techniques to make sense of the sea of data generated by our hyper-connected world. How a collaborative filtering model is generated can be the difference between accurate and flawed predictions. This study aims to determine the impact of a cyclical training regimen on the algorithms presented in the Collaborative Filtering Toolkit for GraphChi.


              Introduction to Collaborative Filtering

With the growth of on-line services and e-commerce, the amount of data that users need to make sense of has increased exponentially. The difficult task facing these service providers is to sift through a sea of options and find what a user wants before their competitors do. One of the tools they have at their disposal is collaborative filtering.

One of the central assumptions of collaborative filtering is that people with similar tastes, or shopping histories, are better predictors of a user's future behavior than a random person. Collaborative filtering techniques sift through large datasets to identify patterns and similarities between users, and then make recommendations. These processes are not limited to the retail sphere; collaborative filtering has also been applied to financial, geological, and other endeavors. An examination of a small example of collaborative filtering, used to produce song recommendations for customers, will help illustrate some of these ideas.

   Table 1: Example User Song History and Ratings

   User   song1   song2   song3   song4   song5   song6   song7
   A      3       X       X       4       1       5       2
   B      1       2       X       2       4       X       4
   C      5       X       4       X       1       4       X
   D      2       X       1       X       5       X       5

The example data covers four different customers (A-D) and their listening history over a catalog of seven songs. In addition to the fact that a user has listened to a particular song, there is an explicit rating (1-5, 5 being the highest). If a user has not listened to a song, this fact is highlighted with an 'X'. A predicted rating needs to be generated for these unknown songs so that they can be suggested to the appropriate users.

The first step in the process is to identify which users have similar tastes. User A and User C have three points of similarity. They both responded very positively to song6 (ratings of A:5 and C:4) and very negatively to song5 (both had ratings of 1). The third point of similarity, song1, is less significant: User C responded very favorably to the song, while User A had a neutral reaction. A similar process can generate a taste profile for Users B and D (song1 B:1/D:2, song5 B:4/D:5, and song7 B:4/D:5). Using these pairings we can begin to identify likely reactions to some of the users' unknown songs.
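As a minimal illustration of this similarity step (not part of the original study), the sketch below scores each user pair by the average absolute difference of their ratings on the songs both have rated; the dictionary, the helper name, and the choice of mean absolute difference are illustrative assumptions rather than the measure used by the CFT.

    # Ratings from Table 1; None marks an unknown rating ('X').
    ratings = {
        "A": [3, None, None, 4, 1, 5, 2],
        "B": [1, 2, None, 2, 4, None, 4],
        "C": [5, None, 4, None, 1, 4, None],
        "D": [2, None, 1, None, 5, None, 5],
    }

    def disagreement(u, v):
        """Mean absolute rating difference over songs both users rated (lower = more alike)."""
        shared = [(a, b) for a, b in zip(ratings[u], ratings[v])
                  if a is not None and b is not None]
        return sum(abs(a - b) for a, b in shared) / len(shared)

    for u, v in [("A", "C"), ("B", "D"), ("A", "B")]:
        print(u, v, round(disagreement(u, v), 2))
    # A-C and B-D disagree by 1.0 on average, while A-B disagree by 2.25,
    # matching the like-minded pairings identified above.

Running the sketch reproduces the pairings used in the text: A with C, and B with D.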
   Table 2: Suggestions Based on Similarities

   User   song1   song2   song3   song4   song5   song6   song7
   A      3       X       +       4       1       5       2
   B      1       2       -       2       4       X       4
   C      5       X       4       +       1       4       -
   D      2       -       1       -       5       X       5

Six of the unknown songs have been replaced with a suggested reaction (+ for positive or - for negative). An exact score would require further analysis, and most likely a great deal more information, but a more simplistic measure of attitude can be inferred.

This still leaves four unknown songs in the table. A quick look over the table shows that the two pairs of users have completely opposite tastes in music. It would be reasonable to assume that what one pair likes the other will dislike, and vice versa. In this way we can utilize not only the similarity between users to determine attitudes, but also the reactions of diametrically opposed users.

Now all that remains is to scale the entire process to handle millions of users and products, while maintaining accuracy and minimizing costs (both financial and temporal).
                            Related Work

Collaborative filtering has traditionally included methods such as Bayesian networks and clustering. A Bayesian network uses probability and causal relationships to classify new observations (Pearl 1994). Clustering algorithms attempt to represent observations as "points" in a multi-dimensional space; the closer together two points are, the more similar the underlying observations (Witten, Frank, and Hall 2011). These and other traditional methods are, and will continue to be, powerful and valid collaborative filtering methods.

However, in the wake of events like the Netflix Prize, a new series of algorithms was developed to deal with a new phenomenon: "Big Data". These new algorithms sought to incorporate the concepts of the traditional methods into new frameworks capable of dealing with data that was increasingly large, complex, and often extremely sparse.

This study will concern itself with thirteen of these modern collaborative filtering algorithms:
  • Alternating Least Squares (ALS) (Zhou et al. 2008)
  • Stochastic Gradient Descent (SGD) (Koren 2009)
  • Bias Stochastic Gradient Descent (BSGD) (Koren 2008)
  • Koren's Singular Value Decomposition (SVD++) (Koren 2008)
  • Weighted ALS (WALS) (Hu, Koren, and Volinsky 2008)
  • Non-Negative Matrix Factorization (NMF) (Lee and Seung 2001)
  • Singular Value Decomposition (SVD) (Hernandez, Roman, and Tomas 2007)
  • One-Sided SVD (Hernandez, Roman, and Tomas 2007)
  • Tensor ALS (TALS) (Comon, Luciani, and de Almeida 2009)
  • Restricted Boltzmann Machines (RBM) (Hinton 2010)
  • Time-SVD++ (TSVD++) (Koren 2009)
  • libFM (Rendle 2010)
  • Probabilistic Matrix Factorization (PMF) (Salakhutdinov and Mnih 2008)

Each of these modern collaborative filtering algorithms seeks to either reduce or capitalize on the complexity of large datasets. Algorithms utilizing decomposition, factorization, and SGD seek to reduce the dimensionality of the data in order to expose underlying relationships. Least squares methods treat recommendations as linear equations and attempt to find the best estimate of the parameters necessary to calculate an accurate rating. TALS and TSVD++ try to leverage additional information, in this case time, in order to more accurately model behaviors.
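As a point of reference (this is the standard matrix factorization formulation from the literature rather than anything specific to the CFT), the factorization-based methods above approximate each observed rating r_ui by the inner product of a user factor vector p_u and an item factor vector q_i, chosen to minimize a regularized squared error:

    \min_{P, Q} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2
                + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)

where K is the set of observed ratings and \lambda controls regularization. ALS alternates between solving for the user factors with the item factors held fixed and vice versa, while SGD nudges both factor sets after each observed rating.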
This study will focus on graph-based implementations of these algorithms, given their recent popularity and their ability to be executed on smaller machines. Recently, graph-based algorithms have been adopted by many large commercial websites, including Amazon and YouTube (Walia 2008). While it is important to note the adoption of techniques like this by powerful and influential corporations, it is admittedly the latter quality that was the driving force in adopting a graph-based approach.

The GraphLab Project was developed in order to facilitate distributed, parallel, graph-based algorithms in an efficient and reliable manner (Low et al. 2010). GraphChi is an offshoot of the GraphLab Project that seeks to leverage the power of graph-based algorithms on a single machine, while maintaining high performance standards (Kyrola, Blelloch, and Guestrin 2012). Bickson, one of the original developers of GraphLab, has ported a number of collaborative filtering algorithms from GraphLab to GraphChi in the form of the Collaborative Filtering Toolkit (CFT). Thirteen algorithms were supported at the time this study was run, in December 2012, but two additional algorithms had already been added as this paper was being written, in January 2013.

In addition to developing the toolkit, Bickson has written a blog entry which serves as an introduction to the CFT and its underlying algorithms (Bickson 2012). In the tutorial, Bickson identifies a number of algorithms which have an element of fault tolerance: these algorithms allow the user to save the model to disk and then resume training from that exact state.

Experimentation with the fault-tolerant algorithms seemed to yield an additional benefit, in that the accuracy of the model would often jump between executions of the training epochs (this paper will use the terms "epoch" and "cycle" interchangeably to refer to a group of training iterations). This would seem to suggest that there is an advantage to using multiple cycles beyond simple fault tolerance. The cumulative value of this inter-cycle bump in accuracy could be quite significant.

As these algorithms train a model, they attempt to reduce the sample space in an attempt to converge on an optimal answer. Given that these algorithms become more restrictive and focused the longer they run, it is reasonable to assume that restarting an algorithm could have a significant positive impact on the final model. By loosening the bounds placed on the algorithm, it is possible to discover that the current parameters have converged on a local rather than a global minimum.


                              Proposal

The observed increase in inter-epoch accuracy of models being trained with fault-tolerant algorithms suggests that there is significant value to be gained from such a training regimen. This study will first seek to establish whether there is in fact a boost in model accuracy between training cycles. If this is successful, we will attempt to identify a cycle size that results in the best final model.

It is our belief that a fault-tolerant algorithm, utilizing an epoch size of no more than 20 iterations, should result in a model that is significantly more accurate than those trained using a single-epoch algorithm.
                            Methodology

Since this study is seeking to understand the impact of epoch size on the accuracy of the resulting model, no attempt was made to tune any of the algorithms. The default settings were used in order to limit the impact of user proficiency on the resulting models (Bickson 2012).

The CFT reports the accuracy of the generated model using root mean squared error (RMSE). This statistic is produced for every complete pass over the training set, and is reported in terms of both training and validation. The training RMSE will not be used in this study, as it is only a reflection of how well the model performs on the training data. Instead, the validation RMSE will be the only measure reported, since it demonstrates how well the model handled the test data. Additionally, the validation RMSE will identify problems with the model, such as overfitting, which are ignored by the training algorithm.
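For clarity, the validation RMSE reported by the toolkit is the standard root mean squared error computed over the held-out ratings:

    \mathrm{RMSE} = \sqrt{ \frac{1}{|V|} \sum_{(u,i) \in V} \left( r_{ui} - \hat{r}_{ui} \right)^2 }

where V is the validation set, r_ui is the observed rating, and \hat{r}_ui is the model's prediction; the training RMSE is the same quantity computed over the training ratings.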
Fault-Tolerant Algorithms (ALS, WALS, TALS, NMF, SGD, BSGD, and SVD++)

The initial epoch size will be set to the maximum number of iterations specified in the tutorial, usually 6. After the first training cycle, the argument --load_factors_from_file=1 will be added to the invocation so that training resumes from the current state. Training cycles will continue until one of the following three conditions is met:
    1. The validation RMSE no longer improves with subsequent training cycles (a minimum is reached).
    2. The validation RMSE increases with additional training cycles (overfitting).
    3. Multiple training cycles result in an improvement of the validation RMSE of less than .00005 per 10 iterations (diminishing returns).

Upon reaching one of the above criteria, the current state of the model will be saved, and the validation RMSE and total number of training cycles recorded.

Next, the epoch size will be increased and the entire process repeated for a new model. The training epoch will be increased on the following schedule: 6, 20, 40, 80, 100, 120, 140, 180, 200. After 200 iterations, the size of the epoch will be incremented by 50 for every subsequent increase. The training of new models on this schedule will continue until the resulting model has a higher validation RMSE than the previously generated model.
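The inner loop of this regimen can be pictured with the following sketch. The --training and --max_iter flags follow the invocations shown in Appendix 1, and --load_factors_from_file=1 is the resume flag described above; the sgd binary path, the dataset name, and the way the validation RMSE is parsed from the program's output are illustrative assumptions rather than details taken from the CFT.

    # Sketch of one multi-cyclic run for a fault-tolerant algorithm (here SGD).
    import re
    import subprocess

    BIN = "./toolkits/collaborative_filtering/sgd"   # assumed binary path
    EPOCH_SIZE = 6                                   # initial epoch size from the tutorial

    def run_cycle(resume):
        """Run one training epoch and return the last validation RMSE it reported."""
        cmd = [BIN, "--training=smallnetflix_mm", f"--max_iter={EPOCH_SIZE}"]
        if resume:
            cmd.append("--load_factors_from_file=1")  # continue from the saved factors
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        # Placeholder parsing; the real log format would need to be matched here.
        return float(re.findall(r"alidation RMSE[:=\s]+([0-9.]+)", out)[-1])

    best_rmse, cycles = float("inf"), 0
    while True:
        rmse = run_cycle(resume=(cycles > 0))
        cycles += 1
        if rmse > best_rmse:              # condition 2: overfitting
            break
        if best_rmse - rmse < 0.00005:    # conditions 1 and 3 (simplified): minimum
            break                         # reached or diminishing returns
        best_rmse = rmse
    print(cycles, best_rmse)

The outer loop of the experiment simply repeats this procedure with the next epoch size from the schedule above, stopping once the final model is worse than the one produced by the previous epoch size.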
After the complete training of an algorithm, an average starting RMSE is selected and recorded as well. The starting RMSE is defined as the validation RMSE after a single iteration of the algorithm. Since there is a degree of variation inherent in all of these algorithms, it is necessary to choose a representative initial state. The starting RMSE allows a model's training regimen to be judged both by its final accuracy and by the degree of accuracy that is a result of training.

Remaining Algorithms (SVD, One-Sided SVD, RBM, TSVD++, libFM, and PMF)

Even though these algorithms do not support a multi-cyclic training regimen, they will be trained on the same epoch schedule to provide additional context.

Additional Note on the Alternating Least Squares (ALS) Algorithms

Overfitting was a significant problem with the ALS algorithms, and as such some modifications to the method were made for those algorithms. Instead of stopping the training schedule with the first model resulting in a higher RMSE, all schedules were run to at least a training epoch of 80 iterations. This was an attempt to see whether overfitting could be overcome with additional training. Overfitting also led to the inclusion of two smaller epoch sizes on the training schedule, two and ten, in order to identify whether the optimal size was located at this smaller scale.

Control Groups

One final piece of information is necessary to determine the effectiveness of a multi-cyclic training regimen: a control group. Each control group was trained using a single epoch consisting of a large number of iterations. The control groups were generated after the experimental group in an attempt to limit the number of unnecessary training cycles, since large epochs require 8+ hours to run.

Waiting until after the experiment had the additional benefit of identifying a number of algorithms which did not need to be included in the control group. All of the algorithms utilizing ALS already had runs which demonstrated that even moderately sized single epochs were outperformed by multi-cyclic regimens. This meant that it was only necessary to run control groups for SGD, BSGD, NMF, and SVD++.

Each of the control groups was trained using the same parameters utilized in the experimental phase; only the progression of epoch size was altered. A starting size of 200 iterations was selected since it is within the bounds of each selected algorithm's total number of iterations run during the experiment. Each succeeding epoch is doubled in size up to 1,600 iterations. The next epoch is then increased to 2,000 and incremented by 1,000 for every epoch after that (see the sketch below). Training continues until the RMSE no longer decreases between runs, and actually begins to increase.
   Table 3: Netflix Control Group Results

   Algorithm      # of Iterations         RMSE
   SGD                        200     1.123820
   BSGD                       400     1.117690
   SVD++                      400     0.982024
   NMF                       2000     2.370640
                             The Data

This study uses the same dataset featured in Bickson's blog, which is a synthetic Netflix dataset created using an anonymized sample of the original. Although the dataset from the Netflix challenge is unavailable because of copyright, the general characteristics of the data are well established by the competition's creators (Bennett and Lanning 2007). The GraphLab Netflix sample was constructed to preserve these characteristics (i.e., sparsity, user-to-movie ratio, user-to-rating ratio, etc.) while ensuring the anonymity of the users.

The Netflix subset has the following general characteristics: 95,526 unique users, 3,561 movies, and 3,298,163 ratings (non-zeroes). This is a very sparse dataset, with less than 0.97% of the resulting matrix containing ratings.
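As a quick check of that figure, the fill rate of the rating matrix is simply the number of observed ratings over the number of possible user-movie pairs:

    \frac{3{,}298{,}163}{95{,}526 \times 3{,}561} \approx 0.0097,

i.e., roughly 0.97% of the matrix is populated.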
                       Historic Benchmarks

In order to provide some context in which to view the results of this study, the following results from the Netflix Prize (Netflix 2009) were retrieved:
  • Cinematch (2006): RMSE 0.9525
  • 2007 Progress Prize, "KorBell": RMSE 0.8723
  • 2008 Progress Prize, "BellKor in BigChaos": RMSE 0.8627
  • Winners (2009), "BellKor's Pragmatic Chaos": RMSE 0.8567

Additionally, KorBell reported that the best result they could get from a single method was an improvement of 6.57% over Cinematch, an RMSE of roughly 0.8882 (Bell and Koren 2007).


                              Results

The initial impressions of the CFT suggested that the apparent boost in accuracy between training epochs would favor a training regimen consisting of a large number of very small cycles (no more than 20 iterations per epoch). This type of training would lead to results superior to those generated without it.

The first thing of note about Table 4 is that only eleven algorithms are included. Both SVD and One-Sided SVD have been left off of the table intentionally. No modification to the number of training iterations yielded any variation in how these algorithms performed; every run of these methods resulted in both an identical process and an identical model. Additionally, these algorithms utilize a different error metric than the rest of the CFT, reporting an error estimate for each of the features generated. For these reasons these algorithms were left out of the rest of the discussion of this study's results, but an example of the output of each has been included at the end of this paper as Appendix 1.

   Table 4: Algorithms Ordered by Final RMSE (Multi-Cyclic
   Algorithms Marked with *)

   Algorithm      Initial RMSE     Final RMSE
   PMF                2.498400       0.914566
   RBM                0.979169       0.926279
   SVD++ *            1.124420       0.931921
   BSGD *             1.363540       0.952970
   SGD *              1.240700       0.959890
   TSVD++             1.041220       0.995435
   LibFM              1.090030       1.025770
   TALS *             1.244250       1.147030
   ALS *              1.251550       1.159920
   NMF *              1.580120       2.375430
   WALS *             5.522080       5.325280

Looking at the results purely in terms of accuracy suggests that the fault-tolerant algorithms are generally of inferior quality. However, this view of the results is misleading. There was no effort made to tune these algorithms, or even to check whether their current settings were conducive to producing good models, before these results were generated. So while it is interesting which algorithms handled the data best, it does not really show how well these models developed over the course of training.

Ordering the results by how much a model's RMSE improved over the course of training reveals a far different picture of the fault-tolerant algorithms. The improvement in final RMSE would suggest that these algorithms are more effective at training, but it is unclear whether this is a product of the algorithm itself or of the training regimen. A closer look at all of the results, as well as at the effect of the restart boost, should provide a clearer picture of the factors at work.

   Table 5: Results Ordered by Percent Improved Over Initial RMSE

   Algorithm      % Improvement
   PMF                   63.39%
   BSGD                  30.11%
   SGD                   22.63%
   SVD++                 17.12%
   TALS                   7.81%
   ALS                    7.32%
   LibFM                  5.90%
   RBM                    5.40%
   TSVD++                 4.40%
   WALS                   3.56%
   NMF                  -50.33%
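The percentages in Table 5 follow directly from the two RMSE columns of Table 4,

    \%\ \mathrm{improvement} = \frac{\mathrm{RMSE}_{\mathrm{initial}} - \mathrm{RMSE}_{\mathrm{final}}}{\mathrm{RMSE}_{\mathrm{initial}}} \times 100,

so, for example, PMF's 63.39% corresponds to (2.498400 - 0.914566) / 2.498400. The Improvement column of Table 7 below is computed the same way, relative to the control RMSE.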
         Figure 1: Algorithmic Spark Lines

Figure 1 shows the percentage difference between all of the training regimens for a given algorithm and its starting RMSE. The spark lines illustrate that while the effectiveness of the algorithms may vary, their general training behaviors are very similar. From this limited sample it would seem that there is no evidence that the restart boost creates a more effective training regimen. But this is not evidence that it has no effect.

While the cyclical training regimen fails to outperform the single-epoch algorithms, it is clearly not without its benefits. Most of the cyclically generated models have significantly higher accuracy than models learned over the course of a lone epoch with the same algorithm. All of the algorithms in the CFT suffer from the same design flaw: they fail to take the validation RMSE into account. This results in either overfitting or the algorithm becoming trapped in a local minimum, as the parameters of the algorithm become more and more restrictive. Restarting the algorithm loosens the bounds on the program, allowing it to move beyond erroneous assumptions about the data. So while the hypothesized accuracy failed to materialize, there are definitely significant advantages to this style of training with SGD, BSGD, and SVD++.

   Table 7: Final RMSE of Control and Experimental Models

   Algorithm      Control RMSE      Test RMSE      Improvement
   SGD                1.123820       0.959890           14.59%
   BSGD               1.117690       0.952970           14.74%
   SVD++              0.982024       0.931921            5.10%
   NMF                2.370640       2.375430           -0.20%

   Table 6: Epoch Size and Number of Cycles Trained for Each
   Algorithm (Multi-Cyclic Algorithms Marked with *)

   Algorithm      Epoch Size      Training      Total Running
                  (Iterations)      Cycles         Time (sec)
   PMF                     180           -         2,501.9900
   RBM                      80           -           692.3140
   SVD++ *                  40           7           691.6511
   BSGD *                   40          92         6,860.8908
   SGD *                    40          59         3,127.5310
   TSVD++                   80           -           251.9750
   LibFM                   200           -           908.0320
   TALS *                   10           1            74.6055
   ALS *                     2           4            47.4264
   NMF *                    40          56         9,045.0080
   WALS *                   10           1            66.1071

The idea that a cyclical training regimen is superior to a single-epoch system has been thoroughly disproved, but what about the ideal size for these epochs? Table 6 shows, rather conclusively, that the ideal number of iterations is larger than the 20-iteration ceiling that had been hypothesized. Each cycle needs to be large enough to take advantage of as many positive iterations as possible (those that reduce the validation RMSE), while minimizing the number of overfitted iterations (a common occurrence in later training cycles).


                   Conclusion and Further Study
This study is only meant to serve as an initial foray into the algorithms represented by the CFT, and a great deal remains to be researched. However, there are questions that relate directly to the results of this study which would be logical next steps. The first is to determine whether the results here are a product of the algorithms or of the data. This study needs to be repeated on a number of other datasets to see if similar results are generated. Similar training patterns would suggest that the findings of this study are a result of the algorithms used and not of some feature of the dataset. Secondly, a study should determine whether tuning the algorithms has an effect on training patterns.

The CFT represents a powerful tool for bringing compact, easily executed data analysis to a variety of ventures. Further experimentation with the package will illuminate how it, and the algorithms it represents, can be put to the best use.

Appendix 1: SVD and One-Sided SVD Output

./toolkits/collaborative_filtering/svd --training=smallnetflix_mm --nsv=3 --nv=10 --max_iter=5 --quiet=1 --tol=1e-1
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
[training] => [smallnetflix_mm]
[nsv] => [3]
[nv] => [10]
[max_iter] => [5]
[quiet] => [1]
[tol] => [1e-1]
Load matrix smallnetflix_mm
Starting iteration: 1 at time: 2.71863
Starting step: 1 at time: 3.42417
Starting step: 2 at time: 4.6307
Starting step: 3 at time: 5.88049
Starting step: 4 at time: 7.11613
Starting step: 5 at time: 8.36737
Starting step: 6 at time: 9.63118
Starting step: 7 at time: 10.9138
Starting step: 8 at time: 12.2115
Starting step: 9 at time: 13.5747
set status to tol
set status to tol
set status to tol
set status to tol
set status to tol
set status to tol
set status to tol
set status to tol
Number of computed signular values 5
Singular value 0      3276.69 Error estimate: 0.000305186
Singular value 1      1064.07 Error estimate: 1.18507e-13
Singular value 2      956.541 Error estimate: 0.00162432
Singular value 3      889.028 Error estimate: 0.00841469
Singular value 4       710.42 Error estimate: 0.0551811
Going to save output vectors U and V
Lanczos finished 22.5142

./toolkits/collaborative_filtering/svd_onesided --training=smallnetflix_mm --nsv=3 --nv=10 --max_iter=5 --quiet=1 --tol=1e-1
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
[training] => [smallnetflix_mm]
[nsv] => [3]
[nv] => [10]
[max_iter] => [5]
[quiet] => [1]
[tol] => [1e-1]
Load matrix smallnetflix_mm
Starting iteration: 1 at time: 0.560262
Starting step: 1 at time: 1.22546
Starting step: 2 at time: 4.76345
Starting step: 3 at time: 8.32286
Starting step: 4 at time: 11.8127
Starting step: 5 at time: 15.4576
Starting step: 6 at time: 19.1342
Starting step: 7 at time: 22.8375
Starting step: 8 at time: 26.5084
Starting step: 9 at time: 30.1965
set status to tol
set status to tol
set status to tol
set status to tol
set status to tol
set status to tol
set status to tol
set status to tol
Number of computed signular values 5
Singular value 0      3276.69 Error estimate: 3.8991e-14
Singular value 1      1064.07 Error estimate: 0.00146017
Singular value 2      956.541 Error estimate: 0.00782078
Singular value 3      889.028 Error estimate: 0.0440951
Singular value 4       710.42 Error estimate: 0.0864268
Lanczos finished in 47.1228


                            References

Bell, R. M.; Koren, Y. (2007). "Lessons from the Netflix Prize Challenge." SIGKDD Explorations Vol. 9, Issue 2, 75-79. ACM, New York, NY.

Bennett, J.; Lanning, S. (2007). "The Netflix Prize." In Proceedings of KDD Cup and Workshop 2007, 3-6. ACM, New York, NY.

Bickson, D. (2012). "Collaborative filtering with GraphChi." Large Scale Machine Learning and Other Animals. Accessed on December 5, 2012. http://bickson.blogspot.com/2012/08/collaborative-filtering-with-demographical

Comon, P.; Luciani, X.; de Almeida, A. L. F. (2009). "Tensor Decompositions, Alternating Least Squares and other Tales." Special issue, Journal of Chemometrics. In memory of R. Harshman.
Hernandez, V.; Roman, J. E.; Tomas, A. (2007). "Restarted Lanczos Bidiagonalization for the SVD in SLEPc." SLEPc Technical Report STR-8.

Hinton, G. (2010). "A Practical Guide to Training Restricted Boltzmann Machines." University of Toronto Tech Report UTML TR 2010-003.

Hu, Y.; Koren, Y.; Volinsky, C. (2008). "Collaborative Filtering for Implicit Feedback Datasets." In IEEE International Conference on Data Mining (ICDM 2008). IEEE, Washington, DC.

Koren, Y. (2009). "Collaborative Filtering with Temporal Dynamics." In Proceedings of the 15th ACM SIGKDD, 447-456. ACM, New York, NY.

Koren, Y. (2008). "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model." ACM SIGKDD. ACM, New York, NY.

Koren, Y.; Bell, R.; Volinsky, C. (2009). "Matrix Factorization Techniques for Recommender Systems." IEEE Computer, Vol. 42, No. 8, 30-37. IEEE, Washington, DC.

Kyrola, A.; Blelloch, G.; Guestrin, C. (2012). "GraphChi: Large-Scale Graph Computation on Just a PC." In Proceedings of the Tenth USENIX Symposium on Operating Systems Design and Implementation, 31-46. OSDI Press, Hollywood, CA.

Lee, D. D.; Seung, H. S. (2001). "Algorithms for Non-negative Matrix Factorization." Advances in Neural Information Processing Systems 13, 556-562.

Low, Y.; Gonzalez, J.; Kyrola, A.; Bickson, D.; Guestrin, C.; Hellerstein, J. M. (2010). "GraphLab: A New Parallel Framework for Machine Learning." In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence. AAAI Press, Palo Alto, CA.

Netflix (2009). "Leaderboard." NetflixPrize.com. Accessed on January 5, 2013. http://www.netflixprize.com//leaderboard

Pearl, J. (1994). "A Probabilistic Calculus of Actions." In UAI'94 Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence, 454-462. Morgan Kaufmann, San Mateo, CA.

Rendle, S. (2010). "Factorization Machines." In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), Sydney, Australia. IEEE, Washington, DC.

Salakhutdinov, R.; Mnih, A. (2008). "Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo." In Proceedings of the International Conference on Machine Learning.

Walia, R. R. (2008). "Collaborative Filtering: A Comparison of Graph-Based Semi-Supervised Learning Methods and Memory-Based Methods." Strengthening the Role of ICT in Development, 70-82.

Witten, I. H.; Frank, E.; Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. Morgan Kaufmann, San Mateo, CA.

Zhou, Y.; Wilkinson, D.; Schreiber, R.; Pan, R. (2008). "Large-Scale Parallel Collaborative Filtering for the Netflix Prize." In Proceedings of the 4th International Conference on Algorithmic Aspects in Information and Management, 337-348. Shanghai, China.