Large-scale Ordinal Collaborative Filtering

Ulrich Paquet, Blaise Thomson, and Ole Winther

Microsoft Research Cambridge, University of Cambridge, Technical University of Denmark
ulripa@microsoft.com, brmt2@cam.ac.uk, owi@imm.dtu.dk

Abstract. This paper proposes a hierarchical probabilistic model for ordinal matrix factorization which explicitly models the ordinal nature of ranking data, typical of large-scale collaborative filtering tasks. Two algorithms are presented for inference, one based on Gibbs sampling and one based on variational Bayes. The model is evaluated on a collaborative filtering task, where users have rated a collection of movies and the system is asked to predict their ratings for other movies. The Netflix data set, which consists of around 100 million ratings, is used for evaluation. Using root mean-squared error (RMSE) as an evaluation metric, results show that the suggested model improves on similar factorization techniques. Results also show that Gibbs sampling outperforms variational Bayes on this task, despite the large number of ratings and model parameters.

Keywords: Large-scale machine learning, collaborative filtering, ordinal regression, low-rank matrix decomposition, hierarchical modelling, Bayesian inference, variational Bayes, Gibbs sampling

1 Introduction

Matrix factorization is a highly effective technique for both predictive and explanatory data modelling. An observed data set is represented as an M × N matrix of results, R, where rows represent the observations and columns represent the variables of interest. This matrix is then decomposed as R = U^⊤ V + ε, where U is a K × M matrix, V is a K × N matrix, and ε is a noise term. U and V represent the values of explanatory variables which, when multiplied and added, give a predictor of the values in R.

This paper discusses a model for matrix factorization when the observed data is a collection of ranks, also known as ordinal data. It contributes:

– An extended probabilistic model for ordinal matrix factorization. The additional parameters express a rich model, through which the prior distributions over factors are flexible but remain coupled through shared hyper-priors. Salakhutdinov and Mnih's hierarchical framework for matrix factorization [7] is combined with a likelihood function that models the ordinal nature of the data [2].
– An efficient variational approach for inference in the model.
– An efficient Gibbs sampling algorithm for inference in the model.

The model is tested on the Netflix data set, which is a collection of around 100 million movie ratings. The matrix is sparse in that many of the movie–user pairs have no rating. Performance is computed by estimating missing values and computing the root mean-squared error (RMSE) on a held-out collection of ratings. The model is shown to improve on or equal alternative matrix factorization approaches on this metric.

There exists a rich ensemble of similar probabilistic factorization models for large-scale collaborative filtering [4, 7, 9]. Alternative approaches to collaborative filtering are equally common, such as those using user rating profiles [5], restricted Boltzmann machines [8], nearest neighbours [1], and non-parametric models [10, 11]. The best performance on this data has been achieved by averaging ensembles of many different models [3].
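As a minimal, self-contained illustration of the decomposition R = U^⊤ V + ε described above (not part of the original paper; the dimensions and random factors below are arbitrary), the following sketch shows how a predicted entry of R is formed from the latent factors:

```python
import numpy as np

# Toy illustration of the low-rank model R ≈ U^T V.
# K latent traits, M items (rows of R), N users (columns of R).
K, M, N = 3, 5, 4
rng = np.random.default_rng(0)

U = rng.normal(size=(K, M))   # item factors u_m, one column per item
V = rng.normal(size=(K, N))   # user factors v_n, one column per user

R_hat = U.T @ V               # M x N matrix of noise-free predictions

# The predicted score for item m and user n is the dot product u_m^T v_n.
m, n = 2, 1
assert np.isclose(R_hat[m, n], U[:, m] @ V[:, n])
```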
2 Probabilistic model

Ordinal regression arises when the possible observed values in R are ranked. For example, in a collaborative filtering task an item with a five-star rating (r = 5) is regarded as superior to one with a four-star rating (r = 4), which in turn is better than one with a three-star rating. Instead of directly modelling the ranks, r = 1, . . . , R, we model a collection of hidden variables, h. The probability distribution of the ranks depends on the position of h relative to a collection of fixed boundaries b_r,

−∞ = b_1 < b_2 < . . . < b_{R+1} = +∞ .

Let r_{mn} denote the rank for row m = 1, . . . , M and column n = 1, . . . , N, that is, user n's rank for item m. The probability of r_{mn} in terms of a hidden variable h_{mn} is

p(r_{mn} | h_{mn}) = Φ(h_{mn} − b_r) − Φ(h_{mn} − b_{r+1}) ,   (1)

where Φ(x) = ∫_{−∞}^{x} N(z; 0, 1) dz is the cumulative Gaussian distribution (probit) function. The hidden variables h_{mn} are modelled by a Gaussian distribution, with mean equal to the dot product of a row factor u_m and a column factor v_n,

p(h_{mn} | u_m, v_n, γ) = N(h_{mn}; u_m^⊤ v_n, γ^{−1}) .   (2)

Factors u_m and v_n can be viewed as item m and user n's latent traits, each of which has a Normal prior distribution. The u_m's are conditionally independent given a shared mean and covariance,

p(u_m | µ_u, Ψ_u) = N(u_m; µ_u, Ψ_u^{−1}) .   (3)

A completely analogous model is used for the user factors v_n. The mean and precision matrix (inverse covariance) is modelled with a conjugate Normal–Wishart prior

p(µ_u, Ψ_u) = N(µ_u; µ_0, (β_0 Ψ_u)^{−1}) W(Ψ_u; W_0, ν_0) .   (4)

Fig. 1. Graphical model of the Bayesian factor-model hierarchy, as described in the main text.

The Wishart distribution W(Ψ; W, ν) ∝ |Ψ|^{(ν−K−1)/2} exp(−(1/2) tr[W^{−1} Ψ]) over symmetric positive definite matrices is parameterized with a scale matrix W and ν degrees of freedom. The inverse variance parameter γ is non-negative and is modelled with its conjugate Gamma distribution p(γ; a_0, b_0) = Γ(γ; a_0, b_0), where Γ(γ; a, b) ∝ γ^{a−1} exp(−γ/b). The hyperparameters {β_0, W_0, ν_0} and {a_0, b_0} have to be specified by the user. Figure 1 summarizes the joint distribution of the data and model parameters θ = {H, U, V, γ, µ_u, Ψ_u, µ_v, Ψ_v} in a Bayesian network. In Section 3 we will sample from, and in Section 4 approximate, the posterior distribution p(θ | D).

3 Gibbs sampling

Gibbs sampling is an MCMC method that sequentially samples from the conditional distributions of the model. We briefly describe two sampling steps here.

Factors. With Ω(m) being the set of users that rated item m, the conditional distribution for each item factor u_m is Gaussian,

u_m ∼ N( u_m ; Σ_m [ Ψ_u µ_u + γ ∑_{n∈Ω(m)} h_{mn} v_n ] , Σ_m ) ,

with covariance Σ_m = ( Ψ_u + γ ∑_{n∈Ω(m)} v_n v_n^⊤ )^{−1}. Notice that the distribution for factor m only requires knowledge of γ and the variables directly connected with m: {(v_n, h_{mn}) : n ∈ Ω(m)}.

Latent variables. Sampling the conditional for the latent variable,

p(h_{mn} | r_{mn}, u_m, v_n, γ) ∝ p(r_{mn} | h_{mn}) p(h_{mn} | u_m, v_n, γ) ,

requires one uniform random number on the unit interval, one random Normal number, two evaluations of Φ, and one evaluation of Φ^{−1}. To derive the sampler we introduce a "noise-free" latent variable f through Φ(h − b) = ∫ N(f; h, 1) Θ(f − b) df, where Θ denotes the step function. For any m and n, which are omitted here for brevity, the joint distribution of r, f, and h, given µ = u^⊤ v and γ, is

p(r | f) p(f | h) p(h | µ, γ) = [ Θ(b_{r+1} − f) − Θ(b_r − f) ] N(f; h, 1) N(h; µ, γ^{−1}) .   (5)

The density of f and h given r, µ, and γ in (5) can be sampled from in two steps, f | r, µ, γ and h | f, µ, γ. The distribution of f | r, µ, γ is a truncated Normal,

p(f | r, µ, γ) = N(f; µ, 1 + γ^{−1}) [ Θ(b_{r+1} − f) − Θ(b_r − f) ] / ( Φ_max − Φ_min ) ,

with Φ_max = Φ((b_{r+1} − µ)/√(1 + γ^{−1})) and Φ_min = Φ((b_r − µ)/√(1 + γ^{−1})). A sample can be drawn from p(f | r, µ, γ) by inverting the cumulative distribution,

f = µ + √(1 + γ^{−1}) Φ^{−1}( Φ_min + u (Φ_max − Φ_min) ) ,

where u is a uniform random number between zero and one. The desired sample is then obtained from p(h | f, µ, γ) = N( h ; (f + γµ)/(1 + γ), (1 + γ)^{−1} ).
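The two conditional distributions above translate almost directly into code. The following Python sketch is an illustrative reimplementation under stated assumptions, not the authors' code: the function names, toy boundaries b, and toy data are ours. It draws one item factor u_m and one latent variable h_mn using exactly the formulas of this section:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def sample_item_factor(Psi_u, mu_u, gamma, V_omega, h_omega):
    """Draw u_m ~ N(Sigma_m [Psi_u mu_u + gamma * sum_n h_mn v_n], Sigma_m).

    V_omega : (K, |Omega(m)|) columns are the factors v_n of users who rated item m
    h_omega : (|Omega(m)|,)   latent variables h_mn for those users
    """
    Sigma_inv = Psi_u + gamma * V_omega @ V_omega.T
    Sigma = np.linalg.inv(Sigma_inv)
    mean = Sigma @ (Psi_u @ mu_u + gamma * V_omega @ h_omega)
    return rng.multivariate_normal(mean, Sigma)

def sample_latent(r, mu, gamma, b):
    """Draw h_mn | r_mn, u_m, v_n, gamma in two steps via the noise-free f.

    r  : observed rank (1-based), mu = u_m^T v_n
    b  : boundaries b_1..b_{R+1} with b_1 = -inf and b_{R+1} = +inf
    """
    s = np.sqrt(1.0 + 1.0 / gamma)
    Phi_min = norm.cdf((b[r - 1] - mu) / s)   # Phi((b_r     - mu) / sqrt(1 + 1/gamma))
    Phi_max = norm.cdf((b[r] - mu) / s)       # Phi((b_{r+1} - mu) / sqrt(1 + 1/gamma))
    u = rng.uniform()
    f = mu + s * norm.ppf(Phi_min + u * (Phi_max - Phi_min))   # truncated-Normal draw
    # h | f, mu, gamma is Gaussian with mean (f + gamma*mu)/(1 + gamma), precision 1 + gamma
    return rng.normal((f + gamma * mu) / (1.0 + gamma), np.sqrt(1.0 / (1.0 + gamma)))

# Toy usage: K = 2 traits, three users rated item m, ranks on a 1..5 star scale.
K = 2
Psi_u = np.eye(K)                                        # item-factor precision
mu_u = np.zeros(K)                                       # item-factor prior mean
gamma = 0.1                                              # observation precision
b = np.array([-np.inf, -1.5, -0.5, 0.5, 1.5, np.inf])    # b_1..b_6 for R = 5 ranks
V_omega = rng.normal(size=(K, 3))
h_omega = rng.normal(size=3)
u_m = sample_item_factor(Psi_u, mu_u, gamma, V_omega, h_omega)
h_mn = sample_latent(r=4, mu=float(u_m @ V_omega[:, 0]), gamma=gamma, b=b)
```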
4 Variational Bayes

Variational Bayes (VB) is a popular algorithm for obtaining an approximation q(θ) to p(θ | D). At each stage of the algorithm one of the approximating factors is chosen, all other factors are held fixed, and the Kullback–Leibler divergence KL(q(θ) || p(θ | D)) is minimized by varying the chosen factor. An approximating family

q(θ) = ∏_{(m,n)} q(h_{mn}) ∏_m q(u_m) ∏_n q(v_n) q(µ_u, Ψ_u) q(µ_v, Ψ_v) q(γ)

is chosen, with the item and user factors having Gaussian approximations q(u_m) = N(u_m; ⟨u_m⟩, Σ_m) and q(v_n) = N(v_n; ⟨v_n⟩, Ξ_n). The factorized approximations for the hierarchical parameters follow the prior's Normal–Wishart form, while q(γ) is chosen to be a Gamma density. We present two example VB updates:

Factors. The full update for each of the item factors is

q(u_m) = N( u_m ; Σ_m [ ⟨Ψ_u µ_u⟩ + ⟨γ⟩ ∑_{n∈Ω(m)} ⟨h_{mn}⟩ ⟨v_n⟩ ] , Σ_m ) ,

with Σ_m = ( ⟨Ψ_u⟩ + ⟨γ⟩ ∑_{n∈Ω(m)} (Ξ_n + ⟨v_n⟩⟨v_n⟩^⊤) )^{−1}.

Latent variables. The mean and variance of h_{mn} are determined when needed in other updates from

q(h_{mn}) ∝ p(r_{mn} | h_{mn}) exp( ⟨ log p(h_{mn} | u_m, v_n, γ) ⟩_{q(u_m) q(v_n) q(γ)} ) .

Define µ = ⟨u_m⟩^⊤ ⟨v_n⟩, let γ be short for ⟨γ⟩, and let N(z) be short for N(z; 0, 1). If b_r and b_{r+1} are the boundaries associated with r_{mn}, and z_r = (µ − b_r)/√(1 + γ^{−1}), the explicit expression for ⟨h_{mn}⟩ is (⟨h_{mn}²⟩ can be evaluated similarly)

⟨h_{mn}⟩ = µ + ( γ^{−1} / √(1 + γ^{−1}) ) · ( N(z_r) − N(z_{r+1}) ) / ( Φ(z_r) − Φ(z_{r+1}) ) .
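The expectation above has a closed form in terms of the standard-normal density and distribution function. The helper below is an illustrative sketch (the function name and toy boundary values are our assumptions, not from the paper) that evaluates ⟨h_mn⟩ exactly as written:

```python
import numpy as np
from scipy.stats import norm

def expected_h(r, mu, gamma, b):
    """Posterior mean <h_mn> under the VB latent-variable update of Section 4.

    r     : observed rank (1-based)
    mu    : <u_m>^T <v_n>
    gamma : <gamma>, the expected observation precision
    b     : boundaries b_1..b_{R+1} with b_1 = -inf and b_{R+1} = +inf
    """
    s = np.sqrt(1.0 + 1.0 / gamma)
    z_r = (mu - b[r - 1]) / s          # z_r     = (mu - b_r)     / sqrt(1 + 1/gamma)
    z_r1 = (mu - b[r]) / s             # z_{r+1} = (mu - b_{r+1}) / sqrt(1 + 1/gamma)
    num = norm.pdf(z_r) - norm.pdf(z_r1)
    den = norm.cdf(z_r) - norm.cdf(z_r1)
    return mu + (1.0 / gamma) / s * num / den

# Example: a rank of 4 out of 5 pulls <h_mn> upward relative to mu.
b = np.array([-np.inf, -1.5, -0.5, 0.5, 1.5, np.inf])
print(expected_h(r=4, mu=0.3, gamma=0.1, b=b))
```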
5 Evaluation

The proposed model is evaluated by testing its predictive performance on the Netflix data set. The data set contains around 100 million ratings from N = 480,189 users on M = 17,770 movie titles. Each rating is a number of stars from 1 to 5, and is used as the ranked observation at the relevant entry of R. A test or "qualifying" set consists of almost three million user–movie pairs for which the ratings are withheld. Algorithms are benchmarked by their RMSE, computed over an unknown half of the test set.

Performance results. We compare inference with Gibbs sampling and VB for a number of latent dimensions, K = 50, 100, 200, and γ settings γ = 0.08, 0.09, 0.1, as well as one setting in which γ is also inferred. The results are summarized in Figure 2. When the latent factor dimensionality K is increased, a higher precision γ gives better performance. The Gibbs-sampled results provide further evidence that proper regularization of models with far more parameters than the data set size |D| is possible with Bayesian averaging [6].

Fig. 2. RMSE on the qualifying set as a function of K for different γ settings. The four lower lines are for Gibbs sampling and the two upper lines for VB.

6 Conclusion and outlook

This paper has proposed a hierarchical model for ordinal matrix factorization. Comparing our results to the regular factor model of [7], we see that the ordinal likelihood and hierarchical priors give substantial improvements (compare, for example, the 0.8958, 0.8928, and 0.8913 RMSE for K = 50, 100 and 200 to their Gaussian likelihood's 0.8989, 0.8965 and 0.8954 RMSE for K = 60, 150 and 300). It is therefore clear that the use of ordinal likelihoods and hierarchical models is important when modelling ordinal data.

Two standard methods of inference were compared: Gibbs sampling and variational Bayes. The comparison of the two approaches on such a large task gives rise to several conclusions that are applicable to other, similar settings (factor models with relatively high noise levels). Variational inference for the smaller factor sizes (K = 50) showed promising results, but unfortunately overfitted for larger models. On this problem, Gibbs sampling performed better and would generally be recommended.

In this paper a minimal model was used to keep the message as clean as possible. The results can clearly be improved by blending with other models, as was shown by [3]. Future work will evaluate the extent to which the gains obtained from this model affect the performance of a blended model.

References

1. Bell, R.M., Koren, Y.: Improved neighborhood-based collaborative filtering. In: Proceedings of KDD Cup and Workshop (2007)
2. Chu, W., Ghahramani, Z.: Gaussian processes for ordinal regression. Journal of Machine Learning Research 6, 1019–1041 (2005)
3. Koren, Y.: The BellKor solution to the Netflix Grand Prize. Tech. rep. (2009)
4. Lim, Y.J., Teh, Y.W.: Variational Bayesian approach to movie rating prediction. In: Proceedings of KDD Cup and Workshop (2007)
5. Marlin, B.: Modeling user rating profiles for collaborative filtering. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) NIPS (2004)
6. Neal, R.: Bayesian Learning for Neural Networks. Springer-Verlag (1996)
7. Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In: ICML (2008)
8. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: ICML (2007)
9. Stern, D.H., Herbrich, R., Graepel, T.: Matchbox: large scale online Bayesian recommendations. In: WWW. pp. 111–120 (2009)
10. Yu, K., Lafferty, J., Zhu, S., Gong, Y.: Large-scale collaborative prediction using a nonparametric random effects model. In: ICML (2009)
11. Zhu, S., Yu, K., Gong, Y.: Stochastic relational models for large-scale dyadic data using MCMC. In: NIPS (2009)