<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large-scale Ordinal Collaborative Filtering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ulrich Paquet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Blaise Thomson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ole Winther</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Microsoft Research Cambridge; University of Cambridge; Technical University of Denmark</institution>
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper proposes a hierarchical probabilistic model for ordinal matrix factorization that explicitly models the ordinal nature of ranking data, which is typical of large-scale collaborative filtering tasks. Two algorithms are presented for inference, one based on Gibbs sampling and one based on variational Bayes. The model is evaluated on a collaborative filtering task, where users have rated a collection of movies and the system is asked to predict their ratings for other movies. The Netflix data set, which consists of around 100 million ratings, is used for evaluation. Using root mean-squared error (RMSE) as an evaluation metric, results show that the suggested model improves on similar factorization techniques. Results also show that Gibbs sampling outperforms variational Bayes on this task, despite the large number of ratings and model parameters.</p>
      </abstract>
      <kwd-group>
        <kwd>Large scale machine learning</kwd>
        <kwd>collaborative filtering</kwd>
        <kwd>ordinal regression</kwd>
        <kwd>low rank matrix decomposition</kwd>
        <kwd>hierarchical modelling</kwd>
        <kwd>Bayesian inference</kwd>
        <kwd>variational Bayes</kwd>
        <kwd>Gibbs sampling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Matrix factorization is a highly effective technique for both predictive and
explanatory data modeling. An observed data set is represented as an M × N matrix
of results, R, where rows index the observations and columns
the variables of interest. This matrix is then decomposed as R = U⊤V + ε,
where U is a K × M matrix, V is a K × N matrix, and ε is a noise term. U
and V represent the values of explanatory variables which, when multiplied and
added, give a predictor of the values in R.</p>
      <p>
        This paper discusses a model for matrix factorization when the observed data
is a collection of ranks, also known as ordinal data. It contributes:
– An extended probabilistic model for ordinal matrix factorization. The
additional parameters express a rich model, through which the prior distributions
over factors are flexible but remain coupled through shared hyper-priors.
Salakhutdinov and Mnih’s hierarchical framework for matrix factorization [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
is coupled with a likelihood function that models the ordinal nature of the
data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
– An efficient variational approach for inference in the model.
– An efficient Gibbs sampling algorithm for inference in the model.
      </p>
      <p>
        The model is tested on the Netflix data set, which is a collection of around
100 million movie ratings. The matrix is sparse in that many of the movie-user
pairs have no rating. Performance is computed by estimating missing values
and computing the root mean-squared error (RMSE) on a held-out collection of
ratings. The model is shown to improve on or equal alternative matrix factorization
approaches on this metric. There exists a rich ensemble of similar probabilistic
factorization models for large-scale collaborative filtering [
        <xref ref-type="bibr" rid="ref4 ref7 ref9">4, 7, 9</xref>
        ]. Alternative
approaches to collaborative filtering are equally common, such as those using
user rating profiles [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], restricted Boltzmann machines [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], nearest neighbours
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and non-parametric models [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. The best performance on this data has
been achieved by averaging ensembles of many different models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
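      <p>As a minimal illustration of the decomposition, the following sketch (in Python/NumPy, with toy dimensions chosen here purely for illustration) builds the predictor U⊤V and reads off a single predicted entry:</p>
      <preformat>
import numpy as np

# Toy dimensions: K latent traits, M rows (observations), N columns (variables).
K, M, N = 2, 4, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(K, M))   # one K-dimensional factor per row of R
V = rng.normal(size=(K, N))   # one K-dimensional factor per column of R

R_hat = U.T @ V               # predictor of the M x N result matrix R
m, n = 1, 2                   # a single entry: row m, column n
assert np.isclose(R_hat[m, n], U[:, m] @ V[:, n])
</preformat>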
    </sec>
    <sec id="sec-2">
      <title>Probabilistic model</title>
      <p>Ordinal regression arises when the possible observed values in R are ranked. For
example, in a collaborative filtering task an item with a five-star rating (r = 5)
is regarded as superior to one with a four-star rating (r = 4), which in turn is
better than one with a three-star rating. Instead of directly modelling the ranks,
r = 1, . . . , R, we will model a collection of hidden variables, h. The probability
distribution of the ranks will depend on the position of h compared to a collection
of fixed boundaries br,</p>
      <p>−∞ = b1 &lt; b2 &lt; . . . &lt; bR+1 = +∞ .</p>
      <p>Let rmn denote the rank for row m = 1, . . . , M and column n = 1, . . . , N , that is,
user n’s rank for item m. The probability of rmn in terms of a hidden variable
hmn is</p>
      <p>p(rmn|hmn) = Φ(hmn − br) − Φ(hmn − br+1) ,   (1)
where Φ(x) = ∫_{−∞}^{x} N (z; 0, 1) dz is the cumulative Gaussian density or probit
function. The hidden variables, hmn, are modelled by a Gaussian distribution,
with mean equal to the dot product of a row factor um and a column factor vn,
p(hmn|um, vn, γ) = N (hmn ; um⊤vn, γ⁻¹) .   (2)
Factors um and vn can be viewed as item m and user n’s latent traits, each of
which has a Normal prior distribution. The um’s are conditionally independent
given a shared mean and covariance,</p>
      <p>p(um|μu, Ψu) = N (um; μu, Ψu⁻¹) .</p>
      <p>A completely analogous model is used for the user factors vn. The mean and
precision matrix (inverse covariance) are modelled with a conjugate Normal-Wishart
prior
p(μu, Ψu) = N (μu; μ0, (β0Ψu)⁻¹) W(Ψu; W0, ν0) .   (3)</p>
      <p>The Wishart distribution W(Ψ; W, ν) ∝ |Ψ|^{(ν−K−1)/2} exp(−½ tr[W⁻¹Ψ]) over
symmetric positive definite matrices is parameterized by a scale matrix W
and ν degrees of freedom. The inverse variance parameter γ is non-negative and
is modelled with its conjugate Gamma distribution p(γ; a0, b0) = Γ(γ; a0, b0),
where Γ(γ; a, b) ∝ γ^{a−1} exp(−γ/b). The hyperparameters {β0, W0, ν0} and {a0, b0}
have to be specified by the user. Figure 1 summarizes the joint distribution of
the data and model parameters θ = {H, U, V, γ, μu, Ψu, μv, Ψv} in a Bayesian
network. In Section 3 we will sample from, and in Section 4 approximate, the
posterior distribution p(θ|D).</p>
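      <p>To make the likelihood concrete, the sketch below evaluates equation (1) in Python/NumPy. The interior boundary values are assumptions for illustration only; the paper treats the br as fixed, but this excerpt does not list them.</p>
      <preformat>
import numpy as np
from scipy.stats import norm

# Illustrative rank boundaries for R = 5 ranks: b_1 = -inf, ..., b_6 = +inf.
# The interior values below are assumptions for this sketch, not from the paper.
b = np.array([-np.inf, -1.5, -0.5, 0.5, 1.5, np.inf])

def ordinal_likelihood(r, h):
    """Equation (1): p(r | h) = Phi(h - b_r) - Phi(h - b_{r+1}).

    r : integer ranks in 1..5 (scalar or array)
    h : hidden variables (same shape as r)
    """
    return norm.cdf(h - b[r - 1]) - norm.cdf(h - b[r])

# Example: the probabilities of all five ranks for a single hidden value h.
h = 0.3
probs = ordinal_likelihood(np.arange(1, 6), np.full(5, h))
assert np.isclose(probs.sum(), 1.0)  # the ranks partition the real line
</preformat>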
    </sec>
    <sec id="sec-3">
      <title>Gibbs sampling</title>
      <p>Gibbs sampling is an MCMC method that sequentially samples from the
conditional distributions of the model. We briefly describe two sampling steps here.
Factors. With Ω(m) being the set of users that rated item m, the conditional
distribution for each item factor um is Gaussian
um ∼ N (um; Σm [Ψu μu + γ Σ_{n∈Ω(m)} hmn vn], Σm) ,   (4)
with covariance Σm = (Ψu + γ Σ_{n∈Ω(m)} vn vn⊤)⁻¹. Notice that the distribution
for factor m only requires knowledge of γ and the variables directly connected
with m: {(vn, hmn) | n ∈ Ω(m)}.</p>
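      <p>As a concrete sketch of this step (array names and shapes are assumptions for illustration, not the authors’ implementation), one Gibbs draw of an item factor looks as follows:</p>
      <preformat>
import numpy as np

def sample_item_factor(V_rated, h_rated, Psi_u, mu_u, gamma, rng):
    """One Gibbs draw of u_m from its Gaussian conditional (4).

    V_rated : (|Omega(m)|, K) user factors v_n for n in Omega(m)
    h_rated : (|Omega(m)|,)  current latent variables h_mn for those users
    """
    # Psi_u + gamma * sum_n v_n v_n^T
    precision = Psi_u + gamma * V_rated.T @ V_rated
    Sigma_m = np.linalg.inv(precision)            # conditional covariance
    # Sigma_m [Psi_u mu_u + gamma * sum_n h_mn v_n]
    mean = Sigma_m @ (Psi_u @ mu_u + gamma * V_rated.T @ h_rated)
    return rng.multivariate_normal(mean, Sigma_m)
</preformat>
      <p>In practice one would solve the linear system rather than form the inverse explicitly; the inverse is kept here to match the equation.</p>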
      <p>Latent variables. Sampling the conditional for the latent variable,
p(hmn|rmn, um, vn, γ) ∝ p(rmn|hmn) p(hmn|um, vn, γ) ,
requires one uniform random number on the unit interval, one Normal random
number, two evaluations of Φ and one of Φ⁻¹. To derive the sampler we introduce
a “noise-free” latent variable f through Φ(h − b) = ∫ N (f ; h, 1) Θ(f − b) df ,
where Θ is the Heaviside step function. For any m
and n, with indices omitted for brevity, the joint marginal distribution of r, f ,
and h, given μ = u⊤v and γ, is
p(r|f ) p(f |h) p(h|μ, γ) = [Θ(br+1 − f ) − Θ(br − f )] N (f ; h, 1) N (h; μ, γ⁻¹) .   (5)
The density of f, h|r, μ, γ in (5) can be sampled from in two steps, f |r, μ, γ and
h|f, μ, γ. The distribution of f |r, μ, γ is a truncated Normal
p(f |r, μ, γ) = N (f ; μ, 1 + γ⁻¹) [Θ(br+1 − f ) − Θ(br − f )] / (Φmax − Φmin) ,
with Φmax = Φ((br+1 − μ)/√(1 + γ⁻¹)) and Φmin = Φ((br − μ)/√(1 + γ⁻¹)). A
sample can be drawn from p(f |r, μ, γ) using the cumulative distribution, f =
μ + √(1 + γ⁻¹) Φ⁻¹(Φmin + rand (Φmax − Φmin)), where “rand” gives a uniform
random number between zero and its argument. The desired sample is obtained
from p(h|f, μ, γ) = N (h ; (f + γμ)/(1 + γ), (1 + γ)⁻¹) .</p>
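      <p>The two-step draw translates directly into code. The sketch below reuses the illustrative boundary array b from Section 2, with scipy supplying Φ and Φ⁻¹ (norm.cdf and norm.ppf):</p>
      <preformat>
import numpy as np
from scipy.stats import norm

def sample_h(r, mu, gamma, b, rng):
    """Two-step Gibbs draw of h given rank r, mean mu = u^T v, and precision gamma.

    First draws f | r, mu, gamma from a truncated Normal by inverse CDF,
    then h | f, mu, gamma from its Gaussian conditional.
    """
    s = np.sqrt(1.0 + 1.0 / gamma)              # sqrt(1 + gamma^-1)
    phi_min = norm.cdf((b[r - 1] - mu) / s)     # Phi((b_r - mu) / s)
    phi_max = norm.cdf((b[r] - mu) / s)         # Phi((b_{r+1} - mu) / s)
    u = rng.uniform(0.0, 1.0)                   # "rand" in the text
    f = mu + s * norm.ppf(phi_min + u * (phi_max - phi_min))  # truncated draw
    # h | f, mu, gamma ~ N((f + gamma*mu)/(1 + gamma), (1 + gamma)^-1)
    return rng.normal((f + gamma * mu) / (1.0 + gamma),
                      np.sqrt(1.0 / (1.0 + gamma)))
</preformat>
      <p>Note the operation count matches the text: one uniform draw, one Normal draw, two evaluations of Φ and one of Φ⁻¹.</p>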
    </sec>
    <sec id="sec-4">
      <title>Variational Bayes</title>
      <p>Variational Bayes (VB) is a popular algorithm for obtaining an
approximation q(θ) to p(θ|D). At each stage in the algorithm, one of the approximating
factors is chosen, all other factors are fixed, and the algorithm minimizes the
Kullback-Leibler divergence KL(q(θ) ‖ p(θ|D)) by varying the given factor. An
approximating family
q(θ) = Π_{(m,n)} q(hmn) · Π_m q(um) · Π_n q(vn) · q(μu, Ψu) q(μv, Ψv) q(γ)
is chosen, with the item and user factors having Gaussian approximations q(um) =
N (um; ⟨um⟩, Σm) and q(vn) = N (vn; ⟨vn⟩, Ξn). The factorized approximations
for the hierarchical parameters follow the prior’s Normal-Wishart form, while
q(γ) is chosen to be a Gamma density. We present two example VB updates:
Factors. The full update for each of the item factors is</p>
      <p>q(um) = N (um; Σm [⟨Ψu μu⟩ + ⟨γ⟩ Σ_{n∈Ω(m)} ⟨hmn⟩⟨vn⟩], Σm) ,
with Σm = (⟨Ψu⟩ + ⟨γ⟩ Σ_{n∈Ω(m)} (Ξn + ⟨vn⟩⟨vn⟩⊤))⁻¹.</p>
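      <p>In code, this update mirrors the Gibbs step for um but replaces sampled quantities with posterior moments. A sketch under assumed array names, where Xi_rated stacks the covariances Ξn of the raters in Ω(m):</p>
      <preformat>
import numpy as np

def vb_update_item_factor(V_mean, Xi_rated, h_mean, E_Psi_u, E_Psi_mu, E_gamma):
    """VB update of q(u_m): returns its mean and covariance Sigma_m.

    V_mean   : (|Omega(m)|, K) posterior means of the raters' v_n
    Xi_rated : (|Omega(m)|, K, K) posterior covariances Xi_n
    h_mean   : (|Omega(m)|,) posterior means of h_mn
    E_Psi_u  : expectation of the precision matrix Psi_u under q
    E_Psi_mu : expectation of the product Psi_u mu_u under q
    """
    # E[Psi_u] + E[gamma] * sum_n (Xi_n + E[v_n] E[v_n]^T)
    second_moments = Xi_rated.sum(axis=0) + V_mean.T @ V_mean
    Sigma_m = np.linalg.inv(E_Psi_u + E_gamma * second_moments)
    mean = Sigma_m @ (E_Psi_mu + E_gamma * V_mean.T @ h_mean)
    return mean, Sigma_m
</preformat>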
      <p>Latent variables. The mean and variance of hmn are determined when needed
in other updates from
q(hmn) ∝ p(rmn|hmn) exp ⟨log p(hmn|um, vn, γ)⟩_{q(um) q(vn) q(γ)} .
Define μ = ⟨um⟩⊤⟨vn⟩, and let γ be short for ⟨γ⟩, and N (z) short for N (z; 0, 1). If
br and br+1 are the boundaries associated with rmn, and zr = (μ − br)/√(1 + γ⁻¹),
the explicit expression for ⟨hmn⟩ is (⟨h²mn⟩ can be similarly evaluated)
⟨hmn⟩ = μ + (γ⁻¹/√(1 + γ⁻¹)) · (N (zr) − N (zr+1)) / (Φ(zr) − Φ(zr+1)) .</p>
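      <p>The posterior mean ⟨hmn⟩ is then a direct transcription of the expression above (a sketch, again reusing the illustrative boundary array b):</p>
      <preformat>
import numpy as np
from scipy.stats import norm

def h_posterior_mean(r, mu, gamma, b):
    """Posterior mean of h_mn under q, per the expression above.

    mu    : E[u_m]^T E[v_n]
    gamma : E[gamma] under q(gamma)
    """
    s = np.sqrt(1.0 + 1.0 / gamma)
    z_r  = (mu - b[r - 1]) / s      # z_r     = (mu - b_r)     / sqrt(1 + gamma^-1)
    z_r1 = (mu - b[r]) / s          # z_{r+1} = (mu - b_{r+1}) / sqrt(1 + gamma^-1)
    ratio = (norm.pdf(z_r) - norm.pdf(z_r1)) / (norm.cdf(z_r) - norm.cdf(z_r1))
    return mu + (1.0 / gamma) / s * ratio
</preformat>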
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>The proposed model is evaluated by testing its predictive performance on the
Netflix data set. The data set contains around 100 million ratings from N =
480, 189 users on M = 17, 770 movie titles. Each rating is a number of stars from 1
to 5, and is used as the ranked observation at the relevant point in R. A test or
“qualifying” set consists of almost three million user–movie pairs for which the ratings
are withheld. Algorithms are benchmarked by their RMSE computed over an
unknown half of the test set.</p>
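      <p>RMSE requires a real-valued point prediction for each withheld pair. One natural choice, assumed here since this excerpt does not spell it out, is the expected rating under the predictive distribution obtained by marginalizing h out of equation (1):</p>
      <preformat>
import numpy as np
from scipy.stats import norm

def expected_rating(mu, gamma, b):
    """Real-valued prediction E[r] = sum_r r * p(r); an assumed point estimate.

    Marginalizing h out of (1) with h ~ N(mu, gamma^-1) gives
    p(r | mu, gamma) = Phi((b_{r+1} - mu)/s) - Phi((b_r - mu)/s),
    with s = sqrt(1 + gamma^-1), matching Phi_max - Phi_min in Section 3.
    """
    s = np.sqrt(1.0 + 1.0 / gamma)
    cdf = norm.cdf((b - mu) / s)     # Phi((b_r - mu)/s) for r = 1..R+1
    p = np.diff(cdf)                 # p(r) for r = 1..R
    return np.dot(np.arange(1, p.size + 1), p)

def rmse(predicted, actual):
    return np.sqrt(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2))
</preformat>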
      <p>
        Performance results. We compare inference with Gibbs sampling and VB for a
number of latent dimensions, K = 50, 100, 200, and γ settings γ = 0.08, 0.09, 0.10,
and one setting in which γ is also inferred. The results are summarized in Figure
2. When the latent factor dimensionality K is increased, a higher precision γ
gives better performance. The Gibbs-sampled results provide further evidence
that proper regularization in models with far more parameters than the data set
size |D| is possible with Bayesian averaging [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and outlook</title>
      <p>
        This paper has proposed a hierarchical model for ordinal matrix factorization.
Comparing our results to the regular factor model of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we see that the ordinal
likelihood and hierarchical priors give substantial improvements (compare for
example the 0.8958, 0.8928, and 0.8913 RMSE for K = 50, 100 and 200 to their
Gaussian likelihood’s 0.8989, 0.8965 and 0.8954 RMSE for K = 60, 150 and 300).
It is therefore clear that the use of ordinal likelihoods and hierarchical models is
important when modeling ordinal data.
      </p>
      <p>Two standard methods of inference were compared: Gibbs sampling and
variational Bayes. The comparison of the two approaches on such a large task gives
rise to several conclusions that are applicable to other similar settings (factor
models with relatively high noise levels). Variational inference for the smaller
factor sizes (K = 50) showed promising results, but unfortunately overfitted for
larger models. On this problem, Gibbs sampling performed better and would
generally be recommended.</p>
      <p>[Figure 2: RMSE results; Gibbs sampling panel over K = 100, 150, 200, with curves for γ = 0.09 and γ = 0.1.]</p>
      <p>
        In this paper a minimal model was used to keep the message as clean as
possible. The results can clearly be improved by blending with other models,
as was shown by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Future work will evaluate the extent to which the gains
obtained from this model affect the performance of a blended model.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Improved neighborhood-based collaborative filtering</article-title>
          .
          <source>In: Proceedings of KDD Cup and Workshop</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Gaussian processes for ordinal regression</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>6</volume>
          ,
          <fpage>1019</fpage>
          -
          <lpage>1041</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Koren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>The BellKor solution to the Netflix Grand Prize</article-title>
          .
          <source>Tech. rep.</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>Y.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teh</surname>
            ,
            <given-names>Y.W.</given-names>
          </string-name>
          :
          <article-title>Variational Bayesian approach to movie rating prediction</article-title>
          .
          <source>In: Proceedings of KDD Cup and Workshop</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Marlin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Modeling user rating profiles for collaborative filtering</article-title>
          . In: Thrun, S., Saul, L., Schölkopf, B. (eds.)
          <source>NIPS</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Neal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>Bayesian Learning for Neural Networks</source>
          . Springer-Verlag (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Bayesian probabilistic matrix factorization using Markov chain Monte Carlo</article-title>
          . In: ICML (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Restricted Boltzmann machines for collaborative filtering</article-title>
          . In: ICML (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Stern</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herbrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graepel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Matchbox: large scale online Bayesian recommendations</article-title>
          .
          <source>In: WWW</source>
          . pp.
          <fpage>111</fpage>
          -
          <lpage>120</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Large-scale collaborative prediction using a nonparametric random effects model</article-title>
          . In: ICML (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Stochastic relational models for large-scale dyadic data using MCMC</article-title>
          . In: NIPS (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>