Large-scale Ordinal Collaborative Filtering

Ulrich Paquet, Blaise Thomson, and Ole Winther

Microsoft Research Cambridge, University of Cambridge, Technical University of Denmark
ulripa@microsoft.com, brmt2@cam.ac.uk, owi@imm.dtu.dk

Abstract. This paper proposes a hierarchical probabilistic model for ordinal matrix factorization which explicitly models the ordinal nature of ranking data, typical of large-scale collaborative filtering tasks. Two algorithms are presented for inference, one based on Gibbs sampling and one based on variational Bayes. The model is evaluated on a collaborative filtering task, where users have rated a collection of movies and the system is asked to predict their ratings for other movies. The Netflix data set, which consists of around 100 million ratings, is used for evaluation. Using root mean-squared error (RMSE) as an evaluation metric, results show that the suggested model improves on similar factorization techniques. Results also show that Gibbs sampling outperforms variational Bayes on this task, despite the large number of ratings and model parameters.

Keywords: Large-scale machine learning, collaborative filtering, ordinal regression, low-rank matrix decomposition, hierarchical modelling, Bayesian inference, variational Bayes, Gibbs sampling

1 Introduction

Matrix factorization is a highly effective technique for both predictive and explanatory data modelling. An observed data set is represented as an M × N matrix of results, R, where rows represent the observations and columns represent the variables of interest. This matrix is then decomposed as R = U^⊤ V + ε, where U is a K × M matrix, V is a K × N matrix, and ε is a noise term. U and V represent the values of explanatory variables which, when multiplied and added, give a predictor of the values in R.

This paper discusses a model for matrix factorization when the observed data is a collection of ranks, also known as ordinal data. It contributes:

– An extended probabilistic model for ordinal matrix factorization. The additional parameters express a rich model, through which the prior distributions over factors are flexible but remain coupled through shared hyper-priors. Salakhutdinov and Mnih's hierarchical framework for matrix factorization [7] is combined with a likelihood function that models the ordinal nature of the data [2].
– An efficient variational approach for inference in the model.
– An efficient Gibbs sampling algorithm for inference in the model.

The model is tested on the Netflix data set, which is a collection of around 100 million movie ratings. The matrix is sparse in that many of the movie–user pairs have no rating. Performance is computed by estimating missing values and computing the root mean-squared error (RMSE) on a held-out collection of ratings. The model is shown to improve on or equal alternative matrix factorization approaches on this metric.

There exists a rich ensemble of similar probabilistic factorization models for large-scale collaborative filtering [4, 7, 9]. Alternative approaches to collaborative filtering are equally common, such as those using user rating profiles [5], restricted Boltzmann machines [8], nearest neighbours [1], and non-parametric models [10, 11]. The best performance on this data has been achieved by averaging ensembles of many different models [3].
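As a minimal, self-contained illustration of the decomposition R = U^⊤ V + ε described above (not part of the original paper; the dimensions and random factors below are arbitrary), the following sketch shows how a predicted entry of R is formed from the latent factors:

```python
import numpy as np

# Toy illustration of the low-rank model R ≈ U^T V.
# K latent traits, M items (rows of R), N users (columns of R).
K, M, N = 3, 5, 4
rng = np.random.default_rng(0)

U = rng.normal(size=(K, M))   # item factors u_m, one column per item
V = rng.normal(size=(K, N))   # user factors v_n, one column per user

R_hat = U.T @ V               # M x N matrix of noise-free predictions

# The predicted score for item m and user n is the dot product u_m^T v_n.
m, n = 2, 1
assert np.isclose(R_hat[m, n], U[:, m] @ V[:, n])
```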
2 Probabilistic model

Ordinal regression arises when the possible observed values in R are ranked. For example, in a collaborative filtering task an item with a five-star rating (r = 5) is regarded as superior to one with a four-star rating (r = 4), which in turn is better than one with a three-star rating. Instead of directly modelling the ranks, r = 1, . . . , R, we model a collection of hidden variables, h. The probability distribution of the ranks depends on the position of h relative to a collection of fixed boundaries b_r,

−∞ = b_1 < b_2 < . . . < b_{R+1} = +∞ .

Let r_{mn} denote the rank for row m = 1, . . . , M and column n = 1, . . . , N, that is, user n's rank for item m. The probability of r_{mn} in terms of a hidden variable h_{mn} is

p(r_{mn} | h_{mn}) = Φ(h_{mn} − b_r) − Φ(h_{mn} − b_{r+1}) ,   (1)

where Φ(x) = ∫_{−∞}^{x} N(z; 0, 1) dz is the cumulative Gaussian distribution (probit) function. The hidden variables h_{mn} are modelled by a Gaussian distribution, with mean equal to the dot product of a row factor u_m and a column factor v_n,

p(h_{mn} | u_m, v_n, γ) = N(h_{mn}; u_m^⊤ v_n, γ^{−1}) .   (2)

Factors u_m and v_n can be viewed as item m and user n's latent traits, each of which has a Normal prior distribution. The u_m's are conditionally independent given a shared mean and covariance,

p(u_m | µ_u, Ψ_u) = N(u_m; µ_u, Ψ_u^{−1}) .   (3)

A completely analogous model is used for the user factors v_n. The mean and precision matrix (inverse covariance) is modelled with a conjugate Normal–Wishart prior

p(µ_u, Ψ_u) = N(µ_u; µ_0, (β_0 Ψ_u)^{−1}) W(Ψ_u; W_0, ν_0) .   (4)

Fig. 1. Graphical model of the Bayesian factor-model hierarchy, as described in the main text.

The Wishart distribution W(Ψ; W, ν) ∝ |Ψ|^{(ν−K−1)/2} exp(−(1/2) tr[W^{−1} Ψ]) over symmetric positive definite matrices is parameterized with a scale matrix W and ν degrees of freedom. The inverse variance parameter γ is non-negative and is modelled with its conjugate Gamma distribution p(γ; a_0, b_0) = Γ(γ; a_0, b_0), where Γ(γ; a, b) ∝ γ^{a−1} exp(−γ/b). The hyperparameters {β_0, W_0, ν_0} and {a_0, b_0} have to be specified by the user. Figure 1 summarizes the joint distribution of the data and model parameters θ = {H, U, V, γ, µ_u, Ψ_u, µ_v, Ψ_v} in a Bayesian network. In Section 3 we will sample from, and in Section 4 approximate, the posterior distribution p(θ | D).

3 Gibbs sampling

Gibbs sampling is an MCMC method that sequentially samples from the conditional distributions of the model. We briefly describe two sampling steps here.

Factors. With Ω(m) being the set of users that rated item m, the conditional distribution for each item factor u_m is Gaussian,

u_m ∼ N( u_m ; Σ_m [ Ψ_u µ_u + γ ∑_{n∈Ω(m)} h_{mn} v_n ] , Σ_m ) ,

with covariance Σ_m = ( Ψ_u + γ ∑_{n∈Ω(m)} v_n v_n^⊤ )^{−1}. Notice that the distribution for factor m only requires knowledge of γ and the variables directly connected with m: {(v_n, h_{mn}) : n ∈ Ω(m)}.

Latent variables. Sampling the conditional for the latent variable,

p(h_{mn} | r_{mn}, u_m, v_n, γ) ∝ p(r_{mn} | h_{mn}) p(h_{mn} | u_m, v_n, γ) ,

requires one uniform random number on the unit interval, one random Normal number, two evaluations of Φ, and one evaluation of Φ^{−1}. To derive the sampler we introduce a "noise-free" latent variable f through Φ(h − b) = ∫ N(f; h, 1) Θ(f − b) df, where Θ denotes the step function. For any m and n, which are omitted here for brevity, the joint distribution of r, f, and h, given µ = u^⊤ v and γ, is

p(r | f) p(f | h) p(h | µ, γ) = [ Θ(b_{r+1} − f) − Θ(b_r − f) ] N(f; h, 1) N(h; µ, γ^{−1}) .   (5)

The density of f and h given r, µ, and γ in (5) can be sampled from in two steps, f | r, µ, γ and h | f, µ, γ. The distribution of f | r, µ, γ is a truncated Normal,

p(f | r, µ, γ) = N(f; µ, 1 + γ^{−1}) [ Θ(b_{r+1} − f) − Θ(b_r − f) ] / ( Φ_max − Φ_min ) ,

with Φ_max = Φ((b_{r+1} − µ)/√(1 + γ^{−1})) and Φ_min = Φ((b_r − µ)/√(1 + γ^{−1})). A sample can be drawn from p(f | r, µ, γ) by inverting the cumulative distribution,

f = µ + √(1 + γ^{−1}) Φ^{−1}( Φ_min + u (Φ_max − Φ_min) ) ,

where u is a uniform random number between zero and one. The desired sample is then obtained from p(h | f, µ, γ) = N( h ; (f + γµ)/(1 + γ), (1 + γ)^{−1} ).
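The two conditional distributions above translate almost directly into code. The following Python sketch is an illustrative reimplementation under stated assumptions, not the authors' code: the function names, toy boundaries b, and toy data are ours. It draws one item factor u_m and one latent variable h_mn using exactly the formulas of this section:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def sample_item_factor(Psi_u, mu_u, gamma, V_omega, h_omega):
    """Draw u_m ~ N(Sigma_m [Psi_u mu_u + gamma * sum_n h_mn v_n], Sigma_m).

    V_omega : (K, |Omega(m)|) columns are the factors v_n of users who rated item m
    h_omega : (|Omega(m)|,)   latent variables h_mn for those users
    """
    Sigma_inv = Psi_u + gamma * V_omega @ V_omega.T
    Sigma = np.linalg.inv(Sigma_inv)
    mean = Sigma @ (Psi_u @ mu_u + gamma * V_omega @ h_omega)
    return rng.multivariate_normal(mean, Sigma)

def sample_latent(r, mu, gamma, b):
    """Draw h_mn | r_mn, u_m, v_n, gamma in two steps via the noise-free f.

    r  : observed rank (1-based), mu = u_m^T v_n
    b  : boundaries b_1..b_{R+1} with b_1 = -inf and b_{R+1} = +inf
    """
    s = np.sqrt(1.0 + 1.0 / gamma)
    Phi_min = norm.cdf((b[r - 1] - mu) / s)   # Phi((b_r     - mu) / sqrt(1 + 1/gamma))
    Phi_max = norm.cdf((b[r] - mu) / s)       # Phi((b_{r+1} - mu) / sqrt(1 + 1/gamma))
    u = rng.uniform()
    f = mu + s * norm.ppf(Phi_min + u * (Phi_max - Phi_min))   # truncated-Normal draw
    # h | f, mu, gamma is Gaussian with mean (f + gamma*mu)/(1 + gamma), precision 1 + gamma
    return rng.normal((f + gamma * mu) / (1.0 + gamma), np.sqrt(1.0 / (1.0 + gamma)))

# Toy usage: K = 2 traits, three users rated item m, ranks on a 1..5 star scale.
K = 2
Psi_u = np.eye(K)                                        # item-factor precision
mu_u = np.zeros(K)                                       # item-factor prior mean
gamma = 0.1                                              # observation precision
b = np.array([-np.inf, -1.5, -0.5, 0.5, 1.5, np.inf])    # b_1..b_6 for R = 5 ranks
V_omega = rng.normal(size=(K, 3))
h_omega = rng.normal(size=3)
u_m = sample_item_factor(Psi_u, mu_u, gamma, V_omega, h_omega)
h_mn = sample_latent(r=4, mu=float(u_m @ V_omega[:, 0]), gamma=gamma, b=b)
```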
4 Variational Bayes

Variational Bayes (VB) is a popular algorithm for obtaining an approximation q(θ) to p(θ | D). At each stage of the algorithm one of the approximating factors is chosen, all other factors are held fixed, and the Kullback–Leibler divergence KL(q(θ) || p(θ | D)) is minimized by varying the chosen factor. An approximating family

q(θ) = ∏_{(m,n)} q(h_{mn}) ∏_m q(u_m) ∏_n q(v_n) q(µ_u, Ψ_u) q(µ_v, Ψ_v) q(γ)

is chosen, with the item and user factors having Gaussian approximations q(u_m) = N(u_m; ⟨u_m⟩, Σ_m) and q(v_n) = N(v_n; ⟨v_n⟩, Ξ_n). The factorized approximations for the hierarchical parameters follow the prior's Normal–Wishart form, while q(γ) is chosen to be a Gamma density. We present two example VB updates:

Factors. The full update for each of the item factors is

q(u_m) = N( u_m ; Σ_m [ ⟨Ψ_u µ_u⟩ + ⟨γ⟩ ∑_{n∈Ω(m)} ⟨h_{mn}⟩ ⟨v_n⟩ ] , Σ_m ) ,

with Σ_m = ( ⟨Ψ_u⟩ + ⟨γ⟩ ∑_{n∈Ω(m)} (Ξ_n + ⟨v_n⟩⟨v_n⟩^⊤) )^{−1}.

Latent variables. The mean and variance of h_{mn} are determined when needed in other updates from

q(h_{mn}) ∝ p(r_{mn} | h_{mn}) exp( ⟨ log p(h_{mn} | u_m, v_n, γ) ⟩_{q(u_m) q(v_n) q(γ)} ) .

Define µ = ⟨u_m⟩^⊤ ⟨v_n⟩, let γ be short for ⟨γ⟩, and let N(z) be short for N(z; 0, 1). If b_r and b_{r+1} are the boundaries associated with r_{mn}, and z_r = (µ − b_r)/√(1 + γ^{−1}), the explicit expression for ⟨h_{mn}⟩ is (⟨h_{mn}²⟩ can be evaluated similarly)

⟨h_{mn}⟩ = µ + ( γ^{−1} / √(1 + γ^{−1}) ) · ( N(z_r) − N(z_{r+1}) ) / ( Φ(z_r) − Φ(z_{r+1}) ) .
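The expectation above has a closed form in terms of the standard-normal density and distribution function. The helper below is an illustrative sketch (the function name and toy boundary values are our assumptions, not from the paper) that evaluates ⟨h_mn⟩ exactly as written:

```python
import numpy as np
from scipy.stats import norm

def expected_h(r, mu, gamma, b):
    """Posterior mean <h_mn> under the VB latent-variable update of Section 4.

    r     : observed rank (1-based)
    mu    : <u_m>^T <v_n>
    gamma : <gamma>, the expected observation precision
    b     : boundaries b_1..b_{R+1} with b_1 = -inf and b_{R+1} = +inf
    """
    s = np.sqrt(1.0 + 1.0 / gamma)
    z_r = (mu - b[r - 1]) / s          # z_r     = (mu - b_r)     / sqrt(1 + 1/gamma)
    z_r1 = (mu - b[r]) / s             # z_{r+1} = (mu - b_{r+1}) / sqrt(1 + 1/gamma)
    num = norm.pdf(z_r) - norm.pdf(z_r1)
    den = norm.cdf(z_r) - norm.cdf(z_r1)
    return mu + (1.0 / gamma) / s * num / den

# Example: a rank of 4 out of 5 pulls <h_mn> upward relative to mu.
b = np.array([-np.inf, -1.5, -0.5, 0.5, 1.5, np.inf])
print(expected_h(r=4, mu=0.3, gamma=0.1, b=b))
```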
5 Evaluation

The proposed model is evaluated by testing its predictive performance on the Netflix data set. The data set contains around 100 million ratings from N = 480,189 users on M = 17,770 movie titles. Each rating is a number of stars from 1 to 5, and is used as the ranked observation at the relevant entry of R. A test or "qualifying" set consists of almost three million user–movie pairs for which the ratings are withheld. Algorithms are benchmarked by their RMSE, computed over an unknown half of the test set.

Performance results. We compare inference with Gibbs sampling and VB for a number of latent dimensions, K = 50, 100, 200, and γ settings γ = 0.08, 0.09, 0.1, as well as one setting in which γ is also inferred. The results are summarized in Figure 2. When the latent factor dimensionality K is increased, a higher precision γ gives better performance. The Gibbs-sampled results provide further evidence that proper regularization of models with far more parameters than the data set size |D| is possible with Bayesian averaging [6].

Fig. 2. RMSE on the qualifying set as a function of K for different γ settings. The four lower lines are for Gibbs sampling and the two upper lines for VB.

6 Conclusion and outlook

This paper has proposed a hierarchical model for ordinal matrix factorization. Comparing our results to the regular factor model of [7], we see that the ordinal likelihood and hierarchical priors give substantial improvements (compare, for example, the 0.8958, 0.8928, and 0.8913 RMSE for K = 50, 100 and 200 to their Gaussian likelihood's 0.8989, 0.8965 and 0.8954 RMSE for K = 60, 150 and 300). It is therefore clear that the use of ordinal likelihoods and hierarchical models is important when modelling ordinal data.

Two standard methods of inference were compared: Gibbs sampling and variational Bayes. The comparison of the two approaches on such a large task gives rise to several conclusions that are applicable to other, similar settings (factor models with relatively high noise levels). Variational inference for the smaller factor sizes (K = 50) showed promising results, but unfortunately overfitted for larger models. On this problem, Gibbs sampling performed better and would generally be recommended.

In this paper a minimal model was used to keep the message as clean as possible. The results can clearly be improved by blending with other models, as was shown by [3]. Future work will evaluate the extent to which the gains obtained from this model affect the performance of a blended model.

References

1. Bell, R.M., Koren, Y.: Improved neighborhood-based collaborative filtering. In: Proceedings of KDD Cup and Workshop (2007)
2. Chu, W., Ghahramani, Z.: Gaussian processes for ordinal regression. Journal of Machine Learning Research 6, 1019–1041 (2005)
3. Koren, Y.: The BellKor solution to the Netflix Grand Prize. Tech. rep. (2009)
4. Lim, Y.J., Teh, Y.W.: Variational Bayesian approach to movie rating prediction. In: Proceedings of KDD Cup and Workshop (2007)
5. Marlin, B.: Modeling user rating profiles for collaborative filtering. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) NIPS (2004)
6. Neal, R.: Bayesian Learning for Neural Networks. Springer-Verlag (1996)
7. Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In: ICML (2008)
8. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: ICML (2007)
9. Stern, D.H., Herbrich, R., Graepel, T.: Matchbox: large scale online Bayesian recommendations. In: WWW. pp. 111–120 (2009)
10. Yu, K., Lafferty, J., Zhu, S., Gong, Y.: Large-scale collaborative prediction using a nonparametric random effects model. In: ICML (2009)
11. Zhu, S., Yu, K., Gong, Y.: Stochastic relational models for large-scale dyadic data using MCMC. In: NIPS (2009)