An Empirical Comparison of Collaborative Filtering Approaches on Netflix Data

Nicola Barbieri, Massimo Guarascio, Ettore Ritacco
ICAR-CNR, Via Pietro Bucci 41/c, Rende, Italy
{barbieri,guarascio,ritacco}@icar.cnr.it

ABSTRACT
Recommender systems are widely used in E-Commerce to make automatic suggestions of new items that could meet the interests of a given user. Collaborative Filtering approaches compute recommendations by assuming that users who have shown similar behavior in the past will share a common behavior in the future. According to this assumption, the most effective collaborative filtering techniques try to discover groups of similar users in order to infer the preferences of the group members. The purpose of this work is to present an empirical comparison of the main collaborative filtering approaches, namely Baseline, Nearest Neighbors, Latent Factor and Probabilistic models, focusing on their strengths and weaknesses. The data used for the analysis are a sample of the well-known Netflix Prize database.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining

Keywords
Recommender Systems, Collaborative Filtering, Netflix

1. INTRODUCTION
The exponential growth of products, services and information makes the adoption of intelligent systems essential to guide users' navigation on the Web. The goal of Recommender Systems is to profile a user in order to suggest contents and products of interest. Such systems are adopted by the major E-commerce companies, for example Amazon.com¹, to provide each user with a customized view of the system. Usually, a recommendation is a list of items that the system considers the most attractive for the customer. User profiling is performed through the analysis of a set of the user's evaluations of purchased/viewed items, typically a numerical score called rating. Most recommender systems are based on Collaborative Filtering (CF) techniques [6], which analyze the past behavior of users, in terms of previously given ratings, in order to foresee their future choices and discover their preferences. The main advantage of CF techniques lies in their simplicity: only the users' past ratings are used in the learning process; no further information, such as demographic data or item descriptions, is needed (techniques that use this knowledge are called Content Based [10, 14]). Four different families of techniques have been studied: Baseline, Neighborhood based, Latent Factor analysis and Probabilistic models. This work provides an empirical comparison of a set of well-known CF approaches, in terms of prediction quality, over a real (non-synthetic) dataset. Several works have focused on the analysis and performance evaluation of single techniques (i.e., excluding ensemble approaches), but to the best of our knowledge no previous work has performed such a deep analysis comparing different approaches.

2. BACKGROUND
The following notation is used: u is a user, m is a movie, $\hat{r}_m^u$ is the rating (stored in the dataset) expressed by user u for movie m (zero if missing) and, given a CF model, $r_m^u$ is the predicted rating of user u for movie m. In October 2006, Netflix², leader in the American movie-rental market, released a dataset containing more than 100 million ratings and promoted a competition, the Netflix Prize³, whose goal was to produce a 10% improvement on the prediction quality achieved by its own recommender system, Cinematch. The competition lasted three years and was attended by several research groups from all over the world. The dataset is a set of tuples $(u, m, \hat{r}_m^u)$, and the model comparison is performed over a portion of the entire Netflix data⁴. This portion is a random sample of the data, divided into two sets: a training set D and a test set T. D contains 5,714,427 ratings of 435,659 users on 2,961 movies; T consists of 3,773,781 ratings (independent from the training set) of a subset of the training users (389,305) on the same set of movies.
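To make the notation concrete, the tuple representation of such a sample can be sketched as follows (a minimal Python sketch; the tab-separated format and the toy values are illustrative assumptions, not the actual dump format):

```python
from collections import defaultdict

# Hypothetical tab-separated "user  movie  rating" lines standing in
# for the (u, m, rating) tuples of the sample.
RAW = """\
u1\tm1\t5
u1\tm2\t3
u2\tm1\t4
"""

def parse_ratings(text):
    """Turn raw (u, m, rating) lines into the two lookup structures
    most CF models need: per-user and per-movie rating maps."""
    by_user, by_movie = defaultdict(dict), defaultdict(dict)
    for line in text.strip().splitlines():
        u, m, r = line.split("\t")
        by_user[u][m] = int(r)    # stored rating for (u, m); absent key = missing
        by_movie[m][u] = int(r)
    return by_user, by_movie

by_user, by_movie = parse_ratings(RAW)
```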
The evaluation criterion chosen is the Root Mean Squared Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{(u,m) \in T} (r_m^u - \hat{r}_m^u)^2}{|T|}} \tag{1}$$

Footnotes:
1. http://amazon.com/
2. http://www.netflix.com/
3. http://www.netflixprize.com/
4. http://repository.icar.cnr.it/sample_netflix/
Cinematch achieves (over the entire Netflix test set) an RMSE value equal to 0.9525, while the team BellKor's Pragmatic Chaos, which won the prize, achieved an RMSE of 0.8567. This score was produced using an ensemble of several predictors.

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors.

3. COLLABORATIVE FILTERING MODELS
The studied models belong to four algorithm families: Baseline, Nearest Neighbor, Latent Factor and Probabilistic models. A detailed description of all the analyzed techniques follows.

3.1 Baseline Models
Baseline algorithms are the simplest approaches to rating prediction. This section focuses on the analysis of the following algorithms: OverallMean, MovieAvg, UserAvg and DoubleCentering. OverallMean computes the mean of all ratings in the training set; this value is returned as the prediction for each pair (u, m). MovieAvg predicts the rating of a pair (u, m) as the mean of all ratings received by m in the training set. Similarly, UserAvg predicts the rating of a pair (u, m) as the mean of all ratings given by u. Given a pair (u, m), DoubleCentering computes separately the mean $\bar{r}_m$ of the ratings of the movie and the mean $\bar{r}_u$ of all the ratings given by the user. The prediction is a linear combination of these means:

$$r_m^u = \alpha \bar{r}_m + (1 - \alpha) \bar{r}_u \tag{2}$$

where 0 ≤ α ≤ 1. Experiments on T have shown that the best value for α is 0.6 (see Fig. 1).

Figure 1: RMSE vs. α
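The four baseline predictors are simple enough to sketch directly (an illustrative Python sketch, not the authors' implementation; falling back to the overall mean for unseen users or movies is an added assumption):

```python
from collections import defaultdict

def fit_baselines(ratings):
    """Compute the statistics behind the four baseline predictors
    from a list of (user, movie, rating) tuples."""
    movie_sum, movie_cnt = defaultdict(float), defaultdict(int)
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    total, n = 0.0, 0
    for u, m, r in ratings:
        movie_sum[m] += r; movie_cnt[m] += 1
        user_sum[u] += r; user_cnt[u] += 1
        total += r; n += 1
    overall = total / n                         # OverallMean
    movie_avg = {m: movie_sum[m] / movie_cnt[m] for m in movie_sum}  # MovieAvg
    user_avg = {u: user_sum[u] / user_cnt[u] for u in user_sum}      # UserAvg
    return overall, movie_avg, user_avg

def double_centering(u, m, overall, movie_avg, user_avg, alpha=0.6):
    """Eq. (2): linear combination of the movie and user means;
    backing off to the overall mean for unseen keys is our addition."""
    r_m = movie_avg.get(m, overall)
    r_u = user_avg.get(u, overall)
    return alpha * r_m + (1 - alpha) * r_u
```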
3.2 Nearest Neighbor Models
Neighborhood based approaches compute the prediction based on a chosen portion of the data. The most common formulation of the neighborhood approach is the K-Nearest-Neighbors (K-NN), in which $r_m^u$ is computed in a few simple steps. A similarity function associates a numerical coefficient to each pair of users; K-NN then finds the neighborhood of u by selecting the K users most similar to him, called neighbors. The rating prediction is computed as the average of the ratings in the neighborhood, weighted by the similarity coefficients. The user-based K-NN algorithm is intuitive but does not scale, because it requires the computation of similarity coefficients for each pair of users. A more scalable formulation can be obtained by considering an item-based approach [15]: the predicted rating for the pair (u, m) can be computed by aggregating the ratings given by u on the K movies most similar to m: {m₁, ..., m_K}. The underlying assumption is that the user might prefer movies more similar to the ones he liked before, because they share similar features. In this approach the number of similarity coefficients ({s₁, ..., s_K}, respectively) depends on the number of movies, which is much smaller than the number of users. The prediction is computed as:

$$r_m^u = \frac{\sum_{i=1}^{K} s_i \, \hat{r}_{m_i}^u}{\sum_{i=1}^{K} s_i} \tag{3}$$

In the rest of the paper, only item-based K-NN algorithms will be considered. The similarity function plays a central role: its coefficients are necessary for the identification of the neighbors, and they act as weights in the prediction. Two functions commonly used for CF are the Pearson Correlation and Adjusted Cosine [15] coefficients; preliminary studies proved that Pearson Correlation is more effective than Adjusted Cosine in detecting similarities. Moreover, as discussed in [9], similarity coefficients based on a larger support are more reliable than those computed using few rating values, so it is common practice to weight the similarity coefficients by the support size, a technique often called shrinkage. Shrinkage is performed as follows. Let U(mᵢ, mⱼ) be the set of users who rated both movies mᵢ and mⱼ, and let $s_{m_i,m_j}$ be the similarity coefficient between these two movies:

$$s'_{m_i,m_j} = \frac{s_{m_i,m_j} \, |U(m_i, m_j)|}{|U(m_i, m_j)| + \alpha} \tag{4}$$

where α is an empirical value. Experiments showed that the best value for α is 100, so in the following, K-NN algorithms with Pearson Correlation and shrinkage with α = 100 will be considered. This first model will be called SimpleK-NN. An improved version can be obtained by considering the difference of preference of u with respect to the movies in the neighborhood {m₁, ..., m_K} of m. Formally:

$$r_m^u = b_m^u + \frac{\sum_{i=1}^{K} s_i \,(\hat{r}_{m_i}^u - b_{m_i}^u)}{\sum_{i=1}^{K} s_i} \tag{5}$$

where {s₁, ..., s_K} are the similarity coefficients between m and its neighbors, and $b_m^u$ and $b_{m_i}^u$ are baseline values computed using Eq. 2. In this case the model is named BaselineK-NN; if the baseline values are instead computed according to the so-called User Effect Model [2], the model will be called K-NN (user effect). An alternative way to estimate item-to-item interpolation weights is to solve a least squares problem minimizing the error of the prediction rule. This strategy, proposed in [1, 3], defines the Neighborhood Relationship Model, one of the most effective approaches applied during the Netflix Prize. $r_m^u$ is computed as:

$$r_m^u = \sum_{i=1}^{K} w_{m_i}^m \, \hat{r}_{m_i}^u \tag{6}$$

where mᵢ is a generic movie in the neighborhood of m, and the $w_{m_i}^m$ are weights representing the similarity between m and mᵢ, computed as the solution of the following optimization problem:

$$\min_{w} \sum_{v \neq u} \left( \hat{r}_m^v - \sum_{j=1}^{K} w_{m_j}^m \, \hat{r}_{m_j}^v \right)^2 \tag{7}$$

Fig. 2 shows the behavior of the K-NN models for different values of K. The best performance is achieved by the Neighborhood Relationship Model.

Figure 2: RMSE vs. K
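The item-based pipeline above, shrunk Pearson similarity (Eq. 4) plugged into the weighted average of Eq. 3, can be sketched as follows (a brute-force toy sketch assuming a small dense matrix with 0 for missing ratings; discarding non-positive similarities is a simplification of ours, not a detail from the paper):

```python
import numpy as np

def pearson_shrunk(R, i, j, alpha=100.0):
    """Shrunk Pearson correlation between item columns i and j of a
    dense ratings matrix R (0 = missing), as in Eq. (4)."""
    common = (R[:, i] > 0) & (R[:, j] > 0)   # users who rated both items
    n = int(common.sum())
    if n < 2:
        return 0.0
    x, y = R[common, i], R[common, j]
    if x.std() == 0 or y.std() == 0:         # correlation undefined
        return 0.0
    s = np.corrcoef(x, y)[0, 1]
    return s * n / (n + alpha)               # shrinkage, Eq. (4)

def simple_knn_predict(R, u, m, K=2):
    """Eq. (3): weighted average of u's ratings on the K items most
    similar to m (only items u actually rated are candidates)."""
    rated = [j for j in range(R.shape[1]) if j != m and R[u, j] > 0]
    sims = [(pearson_shrunk(R, m, j), j) for j in rated]
    sims = [(s, j) for s, j in sims if s > 0]     # simplification: keep positives
    top = sorted(sims, reverse=True)[:K]
    if not top:
        return 0.0
    return sum(s * R[u, j] for s, j in top) / sum(s for s, _ in top)
```

In a production setting the item-item similarities would be precomputed once, which is exactly the scalability advantage of the item-based formulation noted above.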
3.3 Latent Factor Models via Singular Value Decomposition (SVD)
The assumption behind Latent Factor models is that the rating value can be expressed as a set of contributions representing the interaction between the user and the target item over a set of features. Let A be a [|users| × |movies|] matrix, where $A_{u,m}$ is the rating given by user u to movie m. A can be approximated as the product of two matrices, A ≈ U × M, where U is a [|users| × K] matrix, M is a [K × |movies|] matrix, and K is an input parameter of the model representing the number of features to be considered. Intuitively, A is generated by a combination of users (U) and movies (M) with respect to a certain number of features. Once the number of features K is fixed, SVD algorithms estimate the values of U and M and give the prediction of $r_m^u$ as:

$$r_m^u = \sum_{i=1}^{K} U_{u,i} M_{i,m} \tag{8}$$

where $U_{u,i}$ is the response of user u to feature i, and $M_{i,m}$ is the response of movie m to feature i. Several approaches have been proposed to overcome the sparsity of the original rating matrix A and to determine a good approximation by solving the following optimization problem:

$$(U, M) = \arg\min_{U,M} \sum_{(u,m) \in D} \left( \hat{r}_m^u - \sum_{i=1}^{K} U_{u,i} M_{i,m} \right)^2 \tag{9}$$

Funk [5] proposed an incremental procedure, based on gradient descent, to minimize the error of the model on observed ratings. User and movie feature values are randomly initialized and updated as follows:

$$U'_{u,i} = U_{u,i} + \eta \,(2 e_{u,m} \cdot M_{i,m}) \tag{10}$$
$$M'_{i,m} = M_{i,m} + \eta \,(2 e_{u,m} \cdot U_{u,i}) \tag{11}$$

where $e_{u,m} = \hat{r}_m^u - r_m^u$ is the prediction error on the pair (u, m) and η is the learning rate. The initial model can be further improved by considering a regularization coefficient λ. The updating rules become:

$$U'_{u,i} = U_{u,i} + \eta \,(2 e_{u,m} \cdot M_{i,m} - \lambda \cdot U_{u,i}) \tag{12}$$
$$M'_{i,m} = M_{i,m} + \eta \,(2 e_{u,m} \cdot U_{u,i} - \lambda \cdot M_{i,m}) \tag{13}$$

An extension of this model can be obtained by considering user and movie bias vectors, which define a parameter for each user and each movie:

$$r_m^u = c_u + d_m + \sum_{i=1}^{K} U_{u,i} M_{i,m} \tag{14}$$

where c is the user bias vector and d is the movie bias vector. An interesting version of the SVD model was proposed in [13]. According to this formulation, known as Asymmetric SVD, each user is modeled through the items she rated:

$$U_{u,i} = \frac{1}{\sqrt{|M(u)| + 1}} \sum_{m \in M(u)} w_{i,m} \tag{15}$$

where M(u) is the set of all the movies rated by user u. A slightly different version, called SVD++ and proposed in [9], models each user by using both a user-features vector and the corresponding implicit feedback component (the movies rated by each user in the training set and the ones for which the prediction is asked in the test set). Latent factor models based on the SVD decomposition vary according to the number of considered features and the structure of the model, characterized by the presence of bias and baseline contributions. The optimization procedure used in the learning phase plays an important role: learning can be incremental (one feature at a time) or batch (all features updated during the same iteration over the data). Incremental learning usually achieves better performance, at the cost of learning time. Several versions of SVD models have been tested, considering batch learning with learning rate 0.001. Feature values have been initialized with the value $\sqrt{\mu/K} + rand(-0.005, 0.005)$, where µ is the overall rating average and K is the number of considered features. The regularization coefficient, where needed, has been set to 0.02. To avoid overfitting, the training set has been partitioned into two parts: the first is used as the actual training set, while the second, called validation set, is used to evaluate the model. The learning procedure is stopped as soon as the error on the validation set increases. The performance of the different SVD models is summarized in Tab. 1, while Fig. 3 shows the accuracy of the main SVD approaches. An interesting property of the analyzed models is that they reach convergence after almost the same number of iterations, no matter how many features are considered. Better performance is achieved if the model includes bias or baseline components; the regularization factors decrease the overall learning rate but are characterized by a high accuracy. In the worst case, the learning time for the regularized versions is about 60 minutes. The SVD++ model with 20 features obtains the best performance, with a relative improvement on the Cinematch score of about 5%.
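The regularized batch updates of Eqs. (12)–(13), together with the $\sqrt{\mu/K}$ initialization described above, can be sketched as follows (a toy Python sketch, not the experimental code; the hyper-parameter defaults are illustrative, and the factor 2 is kept explicit to match the equations):

```python
import numpy as np

def train_svd(ratings, n_users, n_movies, K=2, eta=0.01, lam=0.02,
              epochs=200, seed=0):
    """Batch gradient descent for the factor model of Eq. (8),
    using the regularized updates of Eqs. (12)-(13).
    `ratings` is a list of (u, m, r) tuples: the observed entries of A."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])
    # init around sqrt(mu/K) plus small noise, as described in the text
    U = np.sqrt(mu / K) + rng.uniform(-0.005, 0.005, (n_users, K))
    M = np.sqrt(mu / K) + rng.uniform(-0.005, 0.005, (K, n_movies))
    for _ in range(epochs):
        dU, dM = np.zeros_like(U), np.zeros_like(M)
        for u, m, r in ratings:
            e = r - U[u] @ M[:, m]                  # e_{u,m}
            dU[u] += 2 * e * M[:, m] - lam * U[u]   # Eq. (12) gradient term
            dM[:, m] += 2 * e * U[u] - lam * M[:, m]  # Eq. (13) gradient term
        U += eta * dU                               # batch update: all features at once
        M += eta * dM
    return U, M
```

A real run would also monitor a held-out validation set and stop as soon as its error increases, as described above.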
Model                      Best RMSE    Avg. #Iter.
SVD                        0.9441       43
SVD with biases            0.9236       45
SVD with baseline          0.9237       45
Reg. SVD                   0.9388       32
Reg. SVD with biases       0.9053       186
Reg. SVD with baseline     0.9062       190
SVD++                      0.9039       8

Table 1: Performance of SVD Models

Figure 3: SVD Models Performance

3.4 Probabilistic Approaches
Several probabilistic methods have been proposed for CF; they try to estimate the relations between users or products through probabilistic clustering techniques. The Aspect Model [8, 7], also called pLSA, is the main probabilistic model used in CF, and belongs to the class of Multinomial Mixture Models. Such models assume that the data were independently generated, and introduce a latent (also called hidden) variable Z that can take K values. Fixed a value of Z, u and m are conditionally independent. The hidden variable is able to detect the hidden structure within the data in terms of user communities, assuming that the Z associated with an observation $(u, m, \hat{r}_m^u)$ models the reason why user u voted for movie m with rating $\hat{r}_m^u$. Formally, assuming the user-community version, the posterior probability of $\hat{r}_m^u = v$ is:

$$P(\hat{r}_m^u = v \mid u, m) = \sum_{z=1}^{K} P(\hat{r}_m^u = v \mid m, z)\, P(Z = z \mid u) \tag{16}$$

where P(Z = z | u) represents the participation of u in a pattern of interest, and $P(\hat{r}_m^u = v \mid m, z)$ is the probability that a user belonging to pattern z gives rating v to movie m. A simplified version of the Aspect Model is the Multinomial Mixture Model, which assumes there is only one type of user [11]:

$$P(\hat{r}_m^u = v \mid u, m) = \sum_{z=1}^{K} P(\hat{r}_m^u = v \mid m, z)\, P(Z = z) \tag{17}$$

The standard learning procedure for the Multinomial Mixture Model is the Expectation Maximization algorithm [12]. Fig. 4 shows the RMSE achieved by the Multinomial Mixture Model with different numbers of latent classes. The model has been initialized randomly, and the learning phase required about 40 iterations over the training set; however, within the first 10 iterations the model already reaches 90% of its potential. The best result (0.9662) is obtained considering 10 latent states for Z.

Figure 4: RMSE - Multinomial Mixture

The pLSA model was tested assuming a Gaussian distribution for the rating probability given the state of the hidden variable and the considered movie m, in the user-community version. The model was tested for different numbers of user communities, as shown in Fig. 5. To avoid overfitting, the early stopping strategy described in the previous section was implemented. The best pLSA model produces an improvement of around 1% on Cinematch. The drawback of the model is the learning process: a few iterations (3 to 5) over the data are sufficient to overfit the model.

Figure 5: RMSE - pLSA
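The EM procedure for the Multinomial Mixture Model (Eq. 17) can be sketched as follows (an illustrative sketch under the usual user-level mixture reading of the model; the small smoothing prior and all parameter names are our assumptions, not details from the paper):

```python
import numpy as np

def em_multinomial_mixture(ratings_by_user, n_movies, n_values=5, K=3,
                           iters=20, seed=0):
    """EM for a multinomial mixture over rating profiles: each user draws
    a latent class z ~ pi, then each rating v on movie m is drawn from
    phi[z, m, v]. Rating values are 0-indexed in {0, ..., n_values-1}.
    ratings_by_user: list (one entry per user) of lists of (movie, value)."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                                  # P(Z = z)
    phi = rng.dirichlet(np.ones(n_values), size=(K, n_movies))  # P(v | m, z)
    for _ in range(iters):
        # E-step: class responsibilities per user, computed in log space
        log_resp = np.zeros((len(ratings_by_user), K))
        for u, obs in enumerate(ratings_by_user):
            for z in range(K):
                log_resp[u, z] = np.log(pi[z]) + sum(
                    np.log(phi[z, m, v]) for m, v in obs)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi and phi (tiny smoothing prior for stability)
        pi = resp.sum(axis=0) / resp.sum()
        counts = np.full((K, n_movies, n_values), 1e-2)
        for u, obs in enumerate(ratings_by_user):
            for m, v in obs:
                counts[:, m, v] += resp[u]
        phi = counts / counts.sum(axis=2, keepdims=True)
    return pi, phi

def predict_rating(pi, phi, obs, m):
    """Expected rating of movie m for a user with history `obs`,
    mixing the per-class predictives by the class posterior."""
    K, _, n_values = phi.shape
    logp = np.log(pi) + np.array(
        [sum(np.log(phi[z, mm, v]) for mm, v in obs) for z in range(K)])
    w = np.exp(logp - logp.max()); w /= w.sum()
    dist = w @ phi[:, m, :]              # P(v | m, history), Eq. (17) mixed by posterior
    return dist @ np.arange(n_values)
```

As the text notes, such models can overfit in very few passes over the data, so in practice the number of EM iterations would itself be validated on held-out ratings.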
4. MODEL COMPARISON
This section presents a comparative analysis of the models described above. Each model is tuned with its best parameter settings. As stated before, Cinematch, Netflix's recommender system, achieves an RMSE equal to 0.9525. Figure 6 shows the RMSE of all the Baseline models mentioned. The best of them is DoubleCentering, but none outperforms the accuracy of Cinematch.

Figure 6: Baseline models

Figure 7 shows the performance of the mentioned K-NN models, which is considerably better than that of the baselines. Except for SimpleK-NN, all approaches improve on Cinematch's precision, especially the Neighborhood Relationship Model.

Figure 7: K-NN models

The quality of the SVD models is shown in Figure 8. The SVD models show the best performance, notably SVD++.

Figure 8: SVD models

Figure 9 shows the behavior of the two proposed probabilistic models. Only pLSA outperforms Cinematch.

Figure 9: Probabilistic models

Finally, Figure 10 compares the best models of each algorithm family. In this experimentation, SVD++ turns out to be the best model among all the proposed ones.

Figure 10: Best models

5. CONCLUSIONS AND FUTURE WORK
This work has presented an empirical comparison of some of the most effective individual CF approaches applied to the Netflix dataset, with their best settings. The best performances are achieved by the Neighborhood Relationship and SVD++ models. Moreover, the symbiosis of standard approaches with simple baseline or bias models improved performance, obtaining a considerable gain with respect to Cinematch. From a theoretical point of view, probabilistic models should be the most promising, since the underlying generative process should in principle summarize the benefits of latent modeling and neighborhood influence. However, these approaches seem to suffer from overfitting issues: experiments showed that their RMSE values are not comparable to those achieved by the SVD or K-NN models. Future work will focus on the study of Latent Dirichlet Allocation (LDA) [4], which extends the pLSA model reducing the risk of overfitting, and on the integration of baseline/bias contributions into probabilistic approaches.

6. REFERENCES
[1] R. Bell, Y. Koren, and C. Volinsky. Modeling relationships at multiple scales to improve accuracy of large recommender systems. In KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 95–104, New York, NY, USA, 2007. ACM.
[2] R. M. Bell and Y. Koren. Improved neighborhood-based collaborative filtering. In Proc. of KDD-Cup and Workshop at the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 7–14, 2007.
[3] R. M. Bell and Y. Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In ICDM '07: Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, pages 43–52, Washington, DC, USA, 2007. IEEE Computer Society.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
[5] S. Funk. Netflix update: Try this at home. URL: http://sifter.org/~simon/Journal/20061211.html.
[6] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35:61–70, 1992.
[7] T. Hofmann. Collaborative filtering via Gaussian probabilistic latent semantic analysis. In SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 259–266, New York, NY, USA, 2003. ACM.
[8] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89–115, January 2004.
[9] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–434. ACM, 2008.
[10] H. Lieberman. Letizia: an agent that assists web browsing. In Proc. of the International Joint Conference on Artificial Intelligence, pages 924–929, 1995.
[11] B. Marlin. Modeling user rating profiles for collaborative filtering. In NIPS 17, 2003.
[12] T. K. Moon. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.
[13] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, pages 39–42, 2007.
[14] M. J. Pazzani and D. Billsus. Content-based recommendation systems. In P. Brusilovsky, A. Kobsa, and W. Nejdl, editors, The Adaptive Web, volume 4321 of Lecture Notes in Computer Science, pages 325–341. Springer, 2007.
[15] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW '01: Proceedings of the 10th International Conference on World Wide Web, pages 285–295. ACM, 2001.