=Paper= {{Paper |id=None |storemode=property |title=A Generalized Probabilistic Framework and its Variants for Training Top-k Recommender Systems |pdfUrl=https://ceur-ws.org/Vol-676/paper5.pdf |volume=Vol-676 |dblpUrl=https://dblp.org/rec/conf/recsys/SteckX10 }} ==A Generalized Probabilistic Framework and its Variants for Training Top-k Recommender Systems== https://ceur-ws.org/Vol-676/paper5.pdf
A Generalized Probabilistic Framework and its Variants for
Training Top-k Recommender Systems

Harald Steck                               Yu Xin∗
Bell Labs, Alcatel-Lucent                  CSAIL MIT
Murray Hill, NJ                            Cambridge, MA
Harald.Steck@alcatel-lucent.com            YuXin@mit.edu


ABSTRACT

Accounting for missing ratings in available training data was recently shown [3, 17] to lead to large improvements in the top-k hit rate of recommender systems, compared to state-of-the-art approaches optimizing the popular root-mean-square-error (RMSE) on the observed ratings. In this paper, we take a Bayesian approach, which lends itself naturally to incorporating background knowledge concerning the missing-data mechanism. The resulting log posterior distribution is very similar to the objective function in [17]. We conduct elaborate experiments with real-world data, testing several variants of our approach under different hypothetical scenarios concerning the missing ratings. In the second part of this paper, we provide a generalized probabilistic framework for dealing with possibly multiple observed rating values for a user-item pair. Several practical applications are subsumed by this generalization, including aggregate recommendations (e.g., recommending artists based on ratings concerning their songs) as well as collaborative filtering of sequential data (e.g., recommendations based on TV consumption over time). We present promising preliminary experimental results on IP-TV data.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications–Data Mining

General Terms
Algorithms

Keywords
Recommender Systems

∗This work was done while an intern at Bell Labs, Alcatel-Lucent.

Copyright is held by the author/owner(s). Workshop on the Practical Use of Recommender Systems, Algorithms and Technologies (PRSAT 2010), held in conjunction with RecSys 2010. September 30, 2010, Barcelona, Spain.

1. INTRODUCTION

The idea of recommender systems is to automatically suggest items to each user that s/he may find appealing. The quality of recommender systems can be assessed with respect to various criteria, including accuracy, diversity, surprise / serendipity, and explainability of recommendations.

This paper is concerned with accuracy. The root mean squared error (RMSE) has become the most popular accuracy measure in the literature on recommender systems, both for training and testing. Its computational efficiency is one of its main advantages. Impressive progress has been made in predicting rating values with small RMSE, and it is impossible to name all approaches, e.g., [4, 6, 7, 11, 13]. There is, however, also some work on optimizing the ranking of items, e.g., measured in terms of normalized Discounted Cumulative Gain (nDCG) [18]. Despite their differences, these approaches have in common that they were trained and tested on observed ratings only. Obviously, these measures cannot immediately be evaluated if some items have unobserved ratings.

In this paper, we consider the top-k hit rate (based on all unrated items) as the natural accuracy measure for recommender systems, as only a few out of all unrated items can be recommended to a user in practice (see Section 3 for the exact definition of the top-k hit rate). While this measure is computationally tractable for testing the predictions of recommender systems, unfortunately it is computationally very costly for training recommender systems. For training, we thus resort to appropriate surrogate objective functions that are computationally efficient.

In recent work [3, 17], it was shown that the top-k hit rate can be significantly improved on large real-world data by accounting for the fact that the observed ratings provide a skewed picture of the (unknown) distribution concerning all (unrated) items and users.

Motivated by the results of [17], as the first contribution of this paper, in Section 2 we present a probabilistic approach that allows us to naturally include background knowledge concerning the (unknown) distribution of all items and users into our training objective function, the posterior probability of the model.

In our second contribution, we conduct elaborate experiments on the Netflix Prize data [1] and test several models under different hypothetical scenarios concerning the missing ratings in Section 4. These different scenarios serve as a sensitivity analysis, as the ground truth of the missing-data mechanism is unknown due to lack of data. These experiments are based on our popularity-stratified recall measure, which we define in Section 3.
As the third contribution of this paper, we generalize this probabilistic approach so as to account for possibly multiple observed rating values for a user-item pair in Section 5. This general framework subsumes several applications in addition to the one outlined in Section 2. Two of these are outlined in Section 5: while the training objective function for a recommender system concerning TV programs seems motivated in an ad-hoc manner in [5], we show that it can be understood and improved naturally in a Bayesian framework; apart from that, we also provide a Bayesian approach for making aggregate recommendations, e.g., recommending an artist or concert to a user based on the ratings given to individual songs.

2. MODEL TRAINING

In this section, we outline a probabilistic framework that allows us to incorporate background knowledge when training recommender systems on available data. The use of background knowledge in addition to the available training data can significantly improve the accuracy of recommender systems on performance measures like top-k hit rate, recall, precision, area under the ROC curve, etc. This was demonstrated for implicit feedback data in [5, 10], and for explicit feedback data in [3, 17]. Like in [17], we use the background knowledge that missing rating values tend to reflect negative feedback, as experimentally observed in [9, 8]; i.e., negative feedback tends to be missing from the available data with a larger probability than positive feedback does.

The Bayesian approach lends itself naturally to this task. We consider the rating matrix R as a matrix of random variables: each element R_{i,u} concerning item i = 1, ..., i_0 and user u = 1, ..., u_0 is a random variable with a normal distribution, where i_0 denotes the number of items and u_0 the number of users.

2.1 Model

We take a collaborative filtering approach, and use a low-rank matrix-factorization model, which has proven successful in many publications. Like the rating matrix, we consider our model as a matrix of random variables, M. Each random variable M_{i,u} corresponds to the rating of item i assigned by user u. In matrix notation, it reads

    M = r^{offset} + P Q^\top                                                (1)

where r^{offset} \in \mathbb{R} is an offset value, and P, Q are low-rank matrices of random variables with dimensions i_0 \times d_0 and u_0 \times d_0, respectively, where rank d_0 \ll i_0, u_0. We use upper-case symbols to denote random variables (with a Gaussian distribution), and lower-case symbols to denote values.

2.2 Prior over Matrix Elements

In our Bayesian approach, we first define the usual prior over model parameters, concerning each entry of the low-rank matrices P and Q (see also [12]):

    p(M | \sigma_P^2, \sigma_Q^2) = \Big\{ \prod_i \prod_d N(P_{i,d} | 0, \sigma_{P,i}^2) \Big\} \cdot \Big\{ \prod_u \prod_d N(Q_{u,d} | 0, \sigma_{Q,u}^2) \Big\}        (2)

The vectors of variances \sigma_Q^2 = (\sigma_{Q,u}^2)_{u=1,...,u_0} and \sigma_P^2 = (\sigma_{P,i}^2)_{i=1,...,i_0} for all users u = 1, ..., u_0 and items i = 1, ..., i_0 are free parameters of the zero-mean normal prior distribution, denoted by N. There are several ways of defining the standard deviations in Eq. 2, eventually resulting in different kinds of regularization. The obvious choice is to assume that \sigma_{P,i} = \sigma_{Q,u} = 1/\sqrt{2\lambda_0} for all i, u, with \lambda_0 \in \mathbb{R}. This results in the regularization term

    \log p(M | \sigma_P^2, \sigma_Q^2) = -\lambda_0 \big( ||P||_2^2 + ||Q||_2^2 \big) + c_1,

where || \cdot ||_2 denotes the Frobenius norm of a matrix, and c_1 is an irrelevant constant when training our model.

When optimizing the root mean square error on observed data (like in the Netflix Prize competition [1]), however, numerous experimental works reported significant improvements from using a different regularization. This is obtained by choosing the standard deviations \sigma_P and \sigma_Q as follows: \sigma_{P,i} = 1/\sqrt{2\lambda_0 \cdot u_0(i)}, \sigma_{Q,u} = 1/\sqrt{2\lambda_0 \cdot i_0(u)}, where i_0(u) denotes the number of items rated by user u, and u_0(i) is the number of users who rated item i. This results in the popular regularization term

    \log p(M | \sigma_P^2, \sigma_Q^2)
        = -\lambda_0 \Big( \sum_i u_0(i) \sum_d P_{i,d}^2 + \sum_u i_0(u) \sum_d Q_{u,d}^2 \Big) + c_2
        = -\lambda_0 \Big( \sum_{\text{observed } (i,u)} \sum_d \big( P_{i,d}^2 + Q_{u,d}^2 \big) \Big) + c_2,        (3)

where c_2 again denotes an irrelevant constant when training our model. Note that this choice increasingly regularizes the model parameters related to the items and users with a larger number of observed ratings. This may seem counterintuitive at first glance. A theoretical explanation for this empirical finding was recently provided in [14].

2.3 Informative Background Knowledge

We now incorporate the following background knowledge into our sequential Bayesian approach: absent rating values tend to be lower than the observed ratings on average (see [17]). We insert this knowledge into our approach by means of a virtual data point for each pair (i, u): a virtual rating value r_{i,u}^{prior} with small confidence (i.e., large variance \sigma_{prior,i,u}^2). Then the likelihood of our model in light of these virtual data points reads (assuming i.i.d. data):

    p(r^{prior} | M, \sigma_{prior}^2) = \prod_{\text{all } i} \prod_{\text{all } u} p(R_{i,u} = r_{i,u}^{prior} | M_{i,u}, \sigma_{prior,i,u}^2),        (4)

where r^{prior} denotes the matrix of virtual data points r_{i,u}^{prior}, and \sigma_{prior}^2 the matrix with elements \sigma_{prior,i,u}^2. We assume that the probabilities in this likelihood are determined by normal distributions with mean M_{i,u} and variance \sigma_{prior,i,u}^2. The log likelihood then reads

    \log p(r^{prior} | M, \sigma_{prior}^2) = - \sum_{\text{all } i} \sum_{\text{all } u} w_{i,u}^{prior} \big( r_{i,u}^{prior} - M_{i,u} \big)^2 + c_3        (5)

where we defined the weights of the virtual data points as w_{i,u}^{prior} = 1/(2\sigma_{prior,i,u}^2); c_3 is again an irrelevant constant when training our model.
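To make the two ingredients above concrete, the following NumPy sketch evaluates the popularity-weighted regularization of Eq. 3 (in both its closed form and its per-observed-pair form, which are algebraically identical) and the virtual-data log likelihood of Eq. 5 up to the constant c_3. All dimensions and parameter values (lam0, w_prior, r_prior) are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions in the paper's notation: i0 items, u0 users, rank d0.
i0, u0, d0 = 5, 8, 2
P = rng.normal(size=(i0, d0))
Q = rng.normal(size=(u0, d0))
M = P @ Q.T                          # model matrix (offset omitted for brevity)

# Observed-rating mask and 1..5 star values on the observed pairs.
observed = rng.random((i0, u0)) < 0.3
r_obs = np.where(observed, rng.integers(1, 6, size=(i0, u0)), 0.0)

lam0 = 0.1                           # lambda_0 in Eq. 3 (illustrative)
w_prior, r_prior = 0.05, 2.0         # virtual-data weight and rating value

# Eq. 3, closed form: lam0 * (sum_i u0(i) ||P_i||^2 + sum_u i0(u) ||Q_u||^2),
# where u0(i) / i0(u) count the observed ratings per item / per user.
u0_i = observed.sum(axis=1)          # number of users who rated each item
i0_u = observed.sum(axis=0)          # number of items rated by each user
reg = lam0 * ((u0_i * (P ** 2).sum(axis=1)).sum()
              + (i0_u * (Q ** 2).sum(axis=1)).sum())

# Eq. 3, per-pair form: the same quantity summed over observed (i, u).
reg_pairwise = lam0 * sum((P[i] ** 2).sum() + (Q[u] ** 2).sum()
                          for i, u in zip(*np.nonzero(observed)))
assert np.isclose(reg, reg_pairwise)

# Eq. 5: log likelihood of the virtual data points over ALL pairs (up to c_3).
loglik_prior = -w_prior * ((r_prior - M) ** 2).sum()
```

The assertion verifies the equality of the two sums in Eq. 3: summing (||P_i||^2 + ||Q_u||^2) over observed pairs simply counts each item row u_0(i) times and each user row i_0(u) times.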
With Bayes rule, we obtain the posterior distribution of the model in light of these virtual data points:

    p(M | r^{prior}, w^{prior}) = \frac{ p(r^{prior}, w^{prior} | M) \, p(M) }{ p(r^{prior}, w^{prior}) }.        (6)

This equation combines our prior concerning the elements in the matrices P and Q (for regularization) with our background knowledge on the expected rating values. This serves as our prior when observing the actual rating values in the training data.

2.4 Training Data

Now we use the rating values actually observed in the training data. The likelihood of the model in light of observed rating values r_{i,u}^{obs} reads (assuming i.i.d. data):

    p(r^{obs} | M, \sigma_{obs}^2) = \prod_{\text{observed } (i,u)} p(R_{i,u} = r_{i,u}^{obs} | M_{i,u}, \sigma_{obs,i,u}^2)

Again assuming a normal distribution, the log likelihood reads:

    \log p(r^{obs} | M, \sigma_{obs}^2) = - \sum_{\text{observed } (i,u)} w_{i,u}^{obs} \big( r_{i,u}^{obs} - M_{i,u} \big)^2 + c_4        (7)

where we defined the weights of the observed rating values as w_{i,u}^{obs} = 1/(2\sigma_{obs,i,u}^2); c_4 is again an irrelevant constant when training our model.

2.5 Posterior

The posterior after seeing the observed ratings is again obtained by Bayes rule (we omit the weights w^{obs}, w^{prior} for brevity of notation here):

    p(M | r^{obs}, r^{prior}) = \frac{ p(r^{obs} | M, r^{prior}) \, p(M | r^{prior}) }{ p(r^{obs}) }
                              \propto p(r^{obs} | M, r^{prior}) \, p(r^{prior} | M) \, p(M)        (8)

where we assumed in the denominator that the observed ratings are independent of the chosen prior ratings, i.e., p(r^{obs} | r^{prior}) = p(r^{obs}). Substituting Eqs. 3, 5, 6 and 7 into Eq. 8, we obtain the following log posterior:

    \log p(M | r^{obs}, r^{prior}, w^{obs}, w^{prior}, \lambda) =
        - \sum_{\text{obs. } (i,u)} \Big\{ w_{i,u}^{obs} \big( r_{i,u}^{obs} - M_{i,u} \big)^2 + \lambda \sum_d \big[ P_{i,d}^2 + Q_{u,d}^2 \big] \Big\}
        - \sum_{\text{all } (i,u)} \Big\{ w_{i,u}^{prior} \big( r_{i,u}^{prior} - M_{i,u} \big)^2 + \lambda \sum_d \big[ P_{i,d}^2 + Q_{u,d}^2 \big] \Big\}
        + c_5        (9)

We found that using a prior that also involves the regularization term of P and Q in the third line of Eq. 9 leads to a slight improvement in our experimental results. The weights w_{i,u}^{obs} and w_{i,u}^{prior} as well as \lambda_0 are absorbed in \lambda; this is a slight but straightforward generalization of the prior in Section 2.2; c_5 is again an irrelevant constant for training.

Eq. 9 serves as our training objective function. For simplicity, we choose the same value for all virtual rating values r^{prior}. For computational efficiency, we choose our model offset to equal the prior rating values: r^{offset} = r^{prior}. Its main effect is that this retains the sparsity of the observed rating matrix. Apart from that, it also leads to the simplification (r_{i,u}^{prior} - M_{i,u})^2 = ((P Q^\top)_{i,u})^2. For simplicity, we set w_{i,u}^{obs} = 1 for all observed pairs (i, u), and also choose all prior weights to be identical: w^{prior} = w_{i,u}^{prior} for all (i, u). In summary, the three tuning parameters in Eq. 9 are w^{prior}, r^{prior} and \lambda, which can be chosen so as to optimize the performance measure on cross-validation data.

2.6 MAP Estimate of Model

For computational efficiency, our training aims to find the maximum-a-posteriori (MAP) parameter estimate of our model, i.e., the MAP estimates \hat{P} and \hat{Q} of the matrices P and Q. We use the alternating least squares approach. The idea is that one matrix can be optimized exactly while the other one is assumed fixed. A local maximum of the log posterior can be found by alternating between the matrices \hat{P} and \hat{Q}. While local optima exist [16], we did not find this to cause major computational problems in our experiments. The update equation for each row i of \hat{P} is (for fixed \hat{Q}):

    \hat{P}_{i,\cdot} = ( \bar{r}_{i,\cdot} - r^{prior} ) ( \tilde{W}^{(i)} + w^{prior} I ) \hat{Q} \cdot
                        \Big[ \hat{Q}^\top ( \tilde{W}^{(i)} + w^{prior} I ) \hat{Q} + \lambda \big( \mathrm{tr}(\tilde{W}^{(i)}) + w^{prior} u_0 \big) I \Big]^{-1}        (10)

where \bar{r}_{i,u} = ( r_{i,u}^{obs} w_{i,u}^{obs} + r^{prior} w_{i,u}^{prior} ) / ( w_{i,u}^{obs} + w_{i,u}^{prior} ) denotes the average rating; we defined w_{i,u}^{obs} = 0 if the rating at (i, u) is missing; note that \bar{r}_{i,\cdot} - r^{prior} = 0 if the rating is missing for (i, u); \tilde{W}^{(i)} is a diagonal matrix containing the ith column of the weight matrix w^{obs}; the trace is \mathrm{tr}(\tilde{W}^{(i)}) = \sum_{u \in S_i} w_{i,u}^{obs}, where S_i is the set of users who rated item i; I denotes the identity matrix; and u_0 is the number of users. This equation can be re-written for efficient computation, e.g., see [17]. The update equation for \hat{Q} is analogous.

3. MODEL TESTING

A key challenge in testing recommender systems is that the observed ratings in the available data typically provide a skewed picture of the (unknown) true distribution concerning all ratings [9, 8]. This may be caused by the fact that users are free to choose which items to rate, and they tend to not rate items that would otherwise receive a low rating. If the ratings are missing not at random (MNAR), it is not guaranteed that correct or meaningful results are obtained from testing a recommender system on the observed ratings only. The latter is, however, common practice in the literature of recommender systems, using measures like root mean square error or nDCG [4, 6, 7, 11, 13, 18].

The top-k hit-rate / recall of relevant items is a particularly useful performance measure for assessing the accuracy of recommender systems [17]. An item i is relevant to user u if s/he finds this item interesting or appealing [17]. For instance, in the Netflix data [1] we consider items i with a 5-star rating, r_{i,u}^{obs} = 5, as relevant to user u.

Recall can be calculated for a user by ranking the items according to their scores predicted by the recommender system, and determining the fraction of relevant items that are among the top-k items, i.e., the k items with the highest scores. The value of k \in \mathbb{N} has to be chosen, e.g., as the number of items that can be recommended to a user. Only small values of k are important in practice. The goal is to maximize recall for the chosen value of k.

Recall has two interesting properties in this context [17]: it is proportional to precision on fixed data and fixed k when comparing different recommender systems with each other.
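To make the structure of the update in Eq. 10 concrete, here is a minimal NumPy sketch of a single row update of P-hat for fixed Q-hat, with w_obs = 1 on observed pairs as chosen above. It solves the equivalent linear system rather than forming the matrix inverse explicitly; the function name and the toy data are illustrative choices, not from the paper.

```python
import numpy as np

def update_P_row(i, r_obs, observed, Q, w_prior, r_prior, lam):
    """Sketch of the ALS row update of Eq. 10 for item i, with Q fixed.

    r_obs:    (i0, u0) rating matrix (entries at unobserved pairs are ignored)
    observed: (i0, u0) boolean mask; w_obs is 1 on observed pairs, 0 otherwise
    """
    u0, d0 = Q.shape
    w_obs = observed[i].astype(float)        # diagonal of W-tilde^(i)
    w = w_obs + w_prior                      # combined per-user weights
    # (r-bar - r_prior): zero wherever the rating is missing, hence sparse.
    target = w_obs * (r_obs[i] - r_prior) / w
    # Normal equations of Eq. 10:  A x = b  with symmetric A.
    A = (Q * w[:, None]).T @ Q \
        + lam * (w_obs.sum() + w_prior * u0) * np.eye(d0)
    b = (target * w) @ Q
    return np.linalg.solve(A, b)             # row i of P-hat
```

Sweeping this update over all items, and applying the analogous update over all users, performs one iteration of alternating least squares; setting w_prior = 0 recovers the standard weighted-ridge ALS update on the observed ratings alone.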
In other words, the recommender system with the larger recall also has the larger precision. More interestingly, however, recall can be calculated from the available MNAR data and provides an unbiased estimate for the recall concerning the (unknown) complete data (which comprises all rating values of all users) under the following assumption: the relevant ratings are missing at random, while an arbitrary missing-data mechanism may apply to all other rating values (as long as they are missing with a larger probability than the relevant ones) [17]. This assumption is much milder than the one underlying the popular approach of ignoring missing ratings, i.e., assuming that all ratings are missing at random.

The expected performance on the (unknown) complete data is important because it is directly related to user experience: the recommender system has to pick a few items from among all items the user has not rated yet (and which may hence be new to the user); one can expect the distribution on all unrated items to be well-approximated by the distribution on all (rated and unrated) items under the (mild) assumption that only a small fraction of the relevant ratings has been observed.

Given that ground truth (i.e., the complete data) is typically not available (at low cost), the validity of the assumption in [17] cannot be verified in practice. For this reason, we carry out a sensitivity analysis in the following. We relax this assumption even further and determine its effect on the recall test-results for different models. Note that the assumption in [17] allows for an arbitrary missing-data mechanism concerning all ratings, except for the relevant ratings; only the latter are assumed to be missing at random.

In the first scenario, p_{obs} may take only two values: it is small for unpopular items, and large for popular items. If the ratio of these two values approaches infinity, this results in s_i \to 0 for popular items. Removing the relevant ratings of the most popular items from the test set is indeed common practice, e.g., in [3]. In the second scenario, we consider the case where p_{obs} is a smooth function of the items' popularities. We assume the polynomial relationship

    p_{obs}(i) \propto \big( N^{+}_{complete,i} \big)^\gamma        (13)

with \gamma \in \mathbb{R}. This is consistent with the power-law behavior of the observed relevant ratings (see Figure 1 a) in the sense that also the (unknown) complete data then follows a power-law distribution concerning the relevant ratings. This results in the stratification weights

    s_i \propto 1 / \big( N^{+}_{obs,i} \big)^{\gamma/(\gamma+1)}.

Note that the unknown proportionality factor cancels out in Eq. 12, so that it provides an unbiased estimate of the recall concerning the complete data for the correct polynomial degree \gamma (and assuming that there are no additional factors underlying the missing-data mechanism). If \gamma = 0, the relevant ratings are missing at random; if \gamma = 1, the probability of observing relevant ratings increases linearly with item popularity (N^{+}_{complete,i}). The extreme case when \gamma \to \infty has several interesting properties: first, one observes in the available data only relevant ratings of the item with the largest popularity in the complete data, which does not agree with empirical evidence. Second, as \gamma/(\gamma+1) \to 1, the stratifi-
dom. For this reason, the following is concerned with the          cation weights si are inversely proportional to the number
                                                                                                      +
relevant ratings only.                                             of observed relevant ratings Nobs,i    ; this means that every
   We consider the case that the probability of observing          item has the same weight in the recall-measure in Eq. 12,
a relevant rating depends on the popularity of items. We           independent of its number of observed relevant ratings. This
                                                       +
define the popularity of an item by the number Ncomplete,i         means that, once an item obtains its first relevant rating, it
of relevant ratings it obtained in the (unknown) complete          is weighted the same as all other items that may have ob-
              +
data. Let Nobs,i  be the number of relevant ratings observed       tained thousands of relevant ratings. This obviously entails
in the available data; then the probability of observing a         low robustness against statistical noise as well as against
relevant rating regarding item i is                                manipulation and attacks. As this is the limiting case of
                                                                   γ, we nevertheless provide experimental results for this ex-
                                    +
                                  Nobs,i                           treme scenario in Section 4. The (unknown) realistic case
                    pobs (i) =               .             (11)
                                  +
                                 Ncomplete,i                       can be expected to be less extreme.
                                                                      If the test set is a random subset of the observed data,
Assuming that there are no additional (possibly hidden) fac-                +
                                                                   then Nobs,i   can be determined from the test set. Given that
tors underlying the missing data mechanism concerning the          the training set is typically much larger than the test set,
relevant ratings, we obviously obtain an unbiased estimate         it might be more robust to determine Nobs,i   +
                                                                                                                      based on all
for recall on the (unknown) complete data by calculating the       the available data, i.e., the training and test set combined.
popularity-stratified recall (for user u),                         We use the latter in our experiments reported in the next
                                                                                                          +
                                  P
                                        +,k si
                                                                   section, but the former choice of Nobs,i   leads to very similar
                                    i∈S
                    recallu (k) = P u                   (12)       results.
                                         + si
                                                                      Finally, we define recall(k) = u wu recallu (k) asP
                                                                                                      P
                                     i∈Su                                                                                  the aver-
on the available MNAR data; Su+ denotes the set of relevant        age recall over all users, with normalized weights, P u wu =
items of user u; Su+,k is the subset of relevant items ranked in   1, like in [17]. In our experiments, we choose wu ∝ i∈Su+ si ,
the top k items based on the predictions of the recommender        as a generalization of the definition in [7, 17].
system; the popularity-stratification weight for each item i          Obviously, stratification like in Eq. 12 carries over analo-
is                                                                 gously to other measures, like ATOP [17] or the area under
                                   1                               the ROC curve.
                           si =          .
                                pobs (i)
If we choose the stratification weights si = 1 for all items       4.   EXPERIMENTS
i, we obtain the usual recall measure. Given that complete           This section summarizes our results on the Netflix Prize
data is unavailable (at low cost), pobs in Eq. 11 cannot be        data [1]. These data contain 17,770 movies and almost half
calculated in practice. For this reason, we now examine dif-       a million users. About 100 million ratings are available.
ferent choices for pobs and their effects on recall in Eq. 12.     Ratings are observed for about 1% of all possible movie-
[Figure 1 appears here. Panel (a): # movies vs. # 5-star ratings, on log-log axes. Panel (b), legend of models: MF with w_prior=0.005, λ=0.04; MF with w_prior=0.0005, λ=0.04; MF with w_prior=0.00005, λ=0.06; MF with w_prior=0, λ=0.07; SVD; BS. Panels (c)-(f): recall vs. k.]


Figure 1: Netflix data [1]: (a) the number of relevant (i.e., 5-star) ratings per movie in the training data
shows a close to power-law distribution (i.e., straight line in the log-log plot); (b) legend of models (see text
for details); (c)-(f ) show recall on probe set for different hypothetical missing-data mechanisms concerning
the relevant (i.e., 5-star) ratings (while an arbitrary missing-data mechanism is allowed for the other ratings):
(c) relevant ratings are missing at random (γ = 0 in Eq. 13); (d) relevant ratings observed with probability
increasing linearly with item popularity (γ = 1 in Eq. 13); (e) unrealistic extreme case where γ → ∞ in
Eq. 13; (f ) relevant ratings of the 10% most popular items removed. As a result, for all these missing-data
mechanisms, recall test-results are improved by using an appropriate small prior weight wprior > 0 during
training, compared to the popular approach of ignoring the missing-data mechanism (wprior = 0) during
training.
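
The stratification weights and the recall of Eq. 12 are simple to compute once γ is fixed. The following is a minimal sketch (Python with NumPy; the function names and the toy counts are ours, not from the paper) of the smooth scenario s_i ∝ 1/(N^+_obs,i)^{γ/(γ+1)} together with the per-user stratified recall:

```python
import numpy as np

def stratification_weights(n_obs_plus, gamma):
    """s_i ∝ 1/(N^+_obs,i)^(gamma/(gamma+1)); gamma = 0 recovers the
    usual recall (uniform weights), gamma -> inf gives s_i ∝ 1/N^+_obs,i."""
    return 1.0 / np.power(n_obs_plus, gamma / (gamma + 1.0))

def stratified_recall_at_k(relevant_items, ranked_items, s, k):
    """Eq. 12: weighted fraction of the user's relevant items in the top k.
    The unknown proportionality factor of s cancels in this ratio."""
    top_k = set(ranked_items[:k])
    num = sum(s[i] for i in relevant_items if i in top_k)
    den = sum(s[i] for i in relevant_items)
    return num / den

# toy data: observed 5-star counts N^+_obs,i for 4 items
n_obs = np.array([100, 10, 1000, 5])
s = stratification_weights(n_obs, gamma=1.0)  # p_obs linear in popularity
print(stratified_recall_at_k([0, 3], [2, 0, 1, 3], s, k=2))
```

With γ = 1, the hit on the frequently rated item 0 counts for less than a hit on the rarely rated item 3 would, so the stratified recall falls below the unstratified (γ = 0) value of 0.5 in this toy case.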


user-pairs. The ratings take integer values from 1 (worst) to 5 (best). The provided data are already split into a training and a probe set. We removed the probe set from the provided data so as to obtain our training set.
   We consider 5-star ratings as relevant to a user (as defined above), and use the popularity-stratified recall, as outlined in Section 3, as our performance measure on the Netflix probe set. For all experiments, we chose rank d = 50 of our low-rank matrix factorization (MF) model (Eq. 1). In the following, we compared our MF model, trained with different prior weights w_prior > 0, against the popular MF approach that ignores the missing-data mechanism (i.e., w_prior = 0 in our notation). The latter achieved a root mean square error on the observed ratings in the probe set of 0.922. Additionally, we compared to singular value decomposition (SVD), where we used the svds function of Matlab, which implicitly imputes a zero value (with unit weight) for all missing ratings; and to the bestseller list (BS), which ranks the items according to the number of ratings in the training set. The values of tuning parameters in our training objective function (log posterior) in Eq. 9 are summarized in Figure 1 (b). Like in [17], we chose the prior rating value r_prior = 2 in Eq. 9.
   Figures 1 (c)-(f) show the performance of these models under different test scenarios, concerning our popularity-stratified recall measure for the practically important range of small k values. For computational efficiency, we computed recall by randomly sampling, for a user, 1,000 unrated items for each relevant rating in the test set, like in [3]. The only difference to the test procedure used in [7, 17] is that we sample from unrated items only, rather than from all items. This is more realistic. It also results in slightly higher recall values compared to the procedure used in [7, 17].
   When comparing the different graphs, it is obvious that the performance of all the models depends on the (unknown) missing-data mechanism concerning the relevant ratings. In particular, when p_obs of relevant ratings is assumed to increase more rapidly with growing popularity (N^+_complete,i),
the expected recall on the (unknown) complete data decreases for all models, cf. Figures 1 (c)→(d)→(e), and (c)→(f).
   As p_obs of relevant ratings increases more rapidly with item popularity (compare Figures 1 (c)→(d)→(e)), the difference in recall among the various MF models decreases. Training with smaller but positive weights w_prior > 0 results in the best recall on the test set, even in the unrealistic extreme limit in Figure 1 (e). This suggests that, compared to the popular approach of ignoring the missing-data mechanism when training MF models, recall can be improved by using a small prior weight w_prior > 0; its value is upper bounded by the value that optimizes the (usual) recall measure on the test set, i.e., under the assumption that the relevant ratings are missing at random, like in [17].
   The bestseller list (BS) and SVD perform surprisingly well if relevant ratings are missing at random, see Figure 1 (c), while the popular MF model with w_prior = 0 has low recall in comparison. This was also found in [17, 3]. BS and SVD perform rather poorly, however, if p_obs increases rapidly with item popularity, as shown in the extreme scenarios in Figures 1 (e) and (f). This suggests that not only BS, but also SVD tends to recommend items that are popular in the available training data. Their recommendations may hence result in a relatively low degree of serendipity or surprise, relative to our MF models trained with a small positive prior weight.

5. GENERALIZED APPROACH
   This section outlines a generalization of the Bayesian approach given above. In our probabilistic approach, we consider the rating matrix R as a matrix of random variables. As each entry R_i,u is a random variable (rather than a value), this naturally allows for possibly multiple values concerning each pair (i, u) in the data. This has several advantages over a matrix of values, which has typically been considered in the literature on recommender systems. After developing our generalized probabilistic framework, we outline three special cases / applications in Section 5.2.
   Let the given data set be D = {r_i,u,j}_{i,u,j}, where i = 1, ..., i0 is the index concerning items, u = 1, ..., u0 is the index regarding users, and j = 1, ... is the index over possibly multiple observed ratings for the same pair (i, u). The likelihood of the model in light of i.i.d. data reads

      p(D|M) = prod_{i,u,j} p(R_i,u = r_i,u,j | M).        (14)

Assuming again a normal distribution of the ratings (with standard deviations σ_i,u,j, or equivalently weights w_i,u,j = 1/(2σ^2_i,u,j)), the log likelihood of the model is

      log p(D|M) = − sum_{i,u} sum_j w_i,u,j (r_i,u,j − M_i,u)^2 + c5
                 = − sum_{i,u} sum_v w_i,u,v (v − M_i,u)^2 + c5        (15)

where the second line is obtained by switching, for each pair (i, u), from index j (over multiple ratings in the data) to the actual rating values v; the cumulative weight is w_i,u,v = sum_j w_i,u,j I_{r_i,u,j = v}, where the indicator function I_{r_i,u,j = v} = 1 if r_i,u,j = v and 0 otherwise.
   Combining this likelihood with the same kind of prior over the model parameters as in Eq. 2, we obtain the log posterior of our model:

      log p(M|D) = − sum_{i,u} sum_v w_i,u,v (v − M_i,u)^2
                   − sum_i 1/(2σ^2_P,i) sum_d P^2_i,d − sum_u 1/(2σ^2_Q,u) sum_d Q^2_u,d + c6        (16)

The standard deviations σ_P,i and σ_Q,u may be chosen so as to achieve the desired variant of regularization, as discussed in Section 2.2.

5.1 MAP Estimate of Model
   For computational efficiency, we focus on optimizing the log posterior in Eq. 16. The maximum-a-posteriori (MAP) parameter estimate of our model, i.e., the MAP estimates P̂ and Q̂ of the matrices P and Q, can be determined by alternating least squares, which alternately optimizes one matrix while the other one is assumed fixed. Using the usual necessary condition for the optimum of Eq. 16, we equate its partial derivative to zero, and obtain the following update equation for each row i of P̂ (for fixed Q̂):

      P̂_i,· = (v̄_i,· − r_offset) W̃(i) Q̂ [ Q̂^T W̃(i) Q̂ + 1/(2σ^2_P,i) I ]^{−1},        (17)

where I denotes the identity matrix, and W̃(i) is a diagonal matrix containing the ith row of the aggregate weight matrix with elements

      w_i,u = sum_v w_i,u,v,

and v̄_i,u is the average rating value

      v̄_i,u = ( sum_v v w_i,u,v ) / w_i,u.

Analogously, the update equation for each row u of Q̂ is:

      Q̂_u,· = (v̄_·,u − r_offset) W̃(u) P̂ [ P̂^T W̃(u) P̂ + 1/(2σ^2_Q,u) I ]^{−1},        (18)

where W̃(u) is the diagonal matrix containing the uth column of the aggregate weight matrix.
   This derivation shows that optimizing Eq. 16 is equivalent to optimizing

      log p(M|D) = − sum_{i,u} w_i,u (v̄_i,u − M_i,u)^2
                   − sum_i 1/(2σ^2_P,i) sum_d P^2_i,d − sum_u 1/(2σ^2_Q,u) sum_d Q^2_u,d + c6        (19)

where multiple rating values for an item-user pair are replaced by their mean value v̄_i,u and their aggregate weight w_i,u.

5.2 Applications
   This generalized probabilistic approach subsumes several applications as special cases. In addition to the use of virtual ratings in the prior, as outlined in Section 2, we present two additional applications in the following.

5.2.1 Aggregate Recommendations
   The universe of all items available for recommendation may have a structure that goes beyond a flat list. Items can
often be arranged in a hierarchical manner. For instance, songs may be grouped by artist, album, genre, etc. Possibly, there are several layers of hierarchy. Now let us consider the problem that data are available where users have rated individual songs, but the task is to recommend an artist to a user. This problem arises in several situations. For instance, the recommender system may want to suggest a concert to the user, based on the data on individual songs. Another scenario is the release of a new song by an artist: this cold-start problem may be overcome by recommending the new song to users who like the artist.
   The ratings matrix concerning songs and users can be used to construct a rating matrix regarding artists and users by aggregating the songs of each artist. As a user may have rated several songs of an artist, we now have possibly multiple rating values for an entry in the artist-user matrix. This is exactly the problem solved by our general approach in Section 5. Our framework also shows that, for each user, the rating of each artist can be determined as the weighted average of the ratings of his/her songs, and the weight is the sum of the weights of the songs, as one may have intuitively expected.

5.2.2 Recommendation of TV Shows
   IP-TV is much more interactive than traditional TV. Concerning recommender systems, it allows one to collect information on the users' TV consumption. This can be used to learn the preferences of users for TV programs, so as to make accurate recommendations of TV shows. Unlike the previous applications described in this paper, we now use implicit feedback data (time spent watching a TV show) in place of the ratings. Our approach carries over immediately. This problem can be cast in our general probabilistic framework as follows: we divide the length of each show into n_max time intervals (of equal length), where n_max is the same large integer for all shows. We consider each time interval associated with a random experiment; a show hence comprises n_max repetitions of the experiment. The random variable R_i,u takes the value 1 if user u watched show i for a time interval, and 0 otherwise. So, if user u watches n_i,u ∈ {1, ..., n_max} out of n_max intervals of show i, then we have n_i,u observations of value 1 for the random variable R_i,u, and n_max − n_i,u observations of value 0; as shown in Section 5, these multiple observations can be substituted equivalently in our least-squares objective function in Eq. 16 by the aggregate weight n_max and the averaged value of the observations: r̄_i,u = n_i,u / n_max, i.e., the fraction of the show watched. In addition, if a show comprises several (e.g., weekly) episodes, then we assume that the above applies to each episode; the total weight of the show is then t_i,u n_max, where t_i,u ∈ N is the number of episodes of show i watched by user u. Then the averaged observed value is r̄_i,u = sum_{j=1}^{t_i,u} r̄_i,u,j / t_i,u, where r̄_i,u,j is the fraction of episode j watched (analogous to above). This results in the log likelihood (with an irrelevant constant c7 in our optimization problem):

      log p(D|M) = − n_max sum_{(i,u) ∈ S} t_i,u (r̄_i,u − M_i,u)^2 + c7,

where the set S contains all pairs (i, u) with shows i that have been watched at least partially by user u. Concerning all shows, we additionally incorporate background knowledge that users tend to not like shows, with some small confidence / weight. This weight is small, as it can also be interpreted as the variance of our prior, which is large as there are many reasons for not watching a show. Analogously to Section 2, we use virtual observations of value 0, with weight w_prior. This results in the log likelihood in light of the virtual data points D_prior:

      log p(D_prior|M) = − n_max w_prior sum_{all (i,u)} (0 − M_i,u)^2 + c8

Combining these two likelihoods, together with the prior over the model parameters in Eq. 2 (first variant), we obtain the log posterior

      log p(M|D, D_prior) ∝ − sum_{(i,u) ∈ S} t_i,u (r̄_i,u − M_i,u)^2 − w_prior sum_{all (i,u)} (0 − M_i,u)^2
                            − λ sum_d [ sum_i P^2_i,d + sum_u Q^2_u,d ] + c9,        (20)

where we replaced the standard deviations in the prior by λ ∈ R; the proportionality is due to omitting n_max, which is an irrelevant constant when optimizing the posterior; c9 is an irrelevant additive constant in our optimization problem.
   In [5], the following objective function was experimentally found to result in the best recommendation accuracy from among several variants (re-written, but equivalent to Eq. 3 in [5]):

      sum_{(i,u) ∈ S} (1 + α n^t_i,u)(1 − M_i,u)^2 + sum_{(i,u) ∉ S} (0 − M_i,u)^2 + λ sum_d [ sum_i P^2_i,d + sum_u Q^2_u,d ],        (21)

where n^t_i,u = sum_j n_i,u,j is the total time spent by user u watching show i (including all episodes j); S denotes again the set of pairs (i, u) where show i is at least partially watched by user u; α takes essentially the role of w_prior.
   Interestingly, their objective function is similar to ours. The main difference, however, is in the least-squares term, where we fit our model to the fraction r̄_i,u of the show watched, while the indicator value 1 is used in [5]. We attribute to this difference the fact that our model performs slightly better in our preliminary experiments on our (proprietary) IP-TV data [2], see below. Besides the experimental improvement, our theoretical framework also provides a clear understanding of the assumptions underlying our approach, while the approach in [5] appears to have been found experimentally in a somewhat ad-hoc manner.
   Preliminary Experiments: We used a (proprietary) IP-TV data set [2] concerning TV consumption of N = 25,777 different shows by 14,731 users (living in 6,423 households) in the UK over a 6-month period in 2008/2009 (see also [15] for a more detailed description). In our collaborative filtering approach, we used only implicit feedback data concerning TV consumption (user ID, program ID, the length of the program, and the time the user spent on this program), and ignored the available explicit user profiles and content information for simplicity. We randomly split these data into a training and a disjoint test set, with 10 shows per user assigned to the test set. In the test set, we considered shows interesting or relevant to users if they watched at least
 model     k0 = 1%   k0 = 2%    k0 = 3%    k0 = 4%   k0 = 5%     Acknowledgements
 ours       0.671     0.763      0.819      0.856     0.882      We are greatly indebted to Tin Ho for her encouragement
 [5]        0.624     0.723      0.785      0.828     0.857      and support of this work. We are also very grateful to the
 Nbr        0.547     0.642      0.704      0.760     0.804      anonymous reviewers for their valuable feedback.

     Table 1: Recall(k0 ) test results on IP-TV data.
                                                                 7.   REFERENCES
                                                                  [1] J. Bennet and S. Lanning. The Netflix Prize. In
                                                                      Workshop at SIGKDD-07, ACM Conference on
80 % of them. We used these shows to evaluate the recom-              Knowledge Discovery and Data Mining, 2007.
mender systems w.r.t. the performance measures recall (as         [2] BARB: Broadcaster Audience Research Board.
defined in Section 3 with γ = 0). Table 1 summarizes our              http://www.barb.co.uk.
preliminary results in terms of recall(k0 ), where k0 = k/N       [3] P. Cremonesi, Y. Koren, and R. Turrin. Performance
is normalized regarding the number N of available shows.              of recommender algorithms on top-N recommendation
Again, we used rank d = 50 for the matrix factorization               tasks. In ACM Conference on Recommender Systems,
models. We find that our new approach (with λ = 0.005)                2010.
and the approach in [5] (with λ = 0.02) give similar results,     [4] S. Funk. Netflix update: Try this at home, 2006.
                                                       ri r 0
compared to the neighborhood model (Nbr), sij = kri kkrj j k          http://sifter.org/ simon/journal/20061211.html.
                                                                  [5] Y. Hu, Y. Koren, and C. Volinsky. Collaborative
           P
and r̂iu =   sij ruj , which was also used in [5] for compar-
ison. Concerning recall, small values of k0 are particularly          filtering for implicit feedback datasets. In ICDM, 2008.
important in practice, as only a small number of shows can        [6] R. Keshavan, A. Montanari, and S. Oh. Matrix
be recommended to a user; in this regime, our new approach            completion from noisy entries, 2009. arXiv:0906.2027.
seems to perform better than the approach of [5]. We are          [7] Y. Koren. Factorization meets the neighborhood: a
currently running more refined experiments to confirm this            multifaceted collaborative filtering model. In KDD,
finding.                                                              2008.
                                                                  [8] B. Marlin and R. Zemel. Collaborative prediction and
                                                                      ranking with non-random missing data. In ACM
                                                                      Conference on Recommender Systems (RecSys), 2009.
6.     CONCLUSIONS                                                [9] B. Marlin, R. Zemel, S. Roweis, and M. Slaney.
   This paper provides three contributions. First, we out-            Collaborative filtering and the missing at random
lined a Bayesian framework that naturally allowed us to               assumption. In Conference on Uncertainty in Artificial
insert background knowledge concerning the missing-data               Intelligence (UAI), 2007.
mechanism underlying the observed rating data. The ob-           [10] R. Pan, Y. Zhou, B. Cao, N. Liu, R. Lukose,
tained log posterior probability of the model is very similar         M. Scholz, and Q. Yang. One-class collaborative
to the training objective function outlined in [17].                  filtering. In ICDM, 2008.
   In our second contribution, we conducted experiments          [11] A. Paterek. Improving regularized singular value
where we considered several hypothetical missing-data mech-           decomposition for collaborative filtering. In KDDCup,
anisms underlying the observed real-world data. Given that            2007.
the true missing-data mechanism is unknown in the ab-
                                                                 [12] R. Salakhutdinov and A. Mnih. Probabilistic matrix
sence of ground truth data, this sensitivity analysis pro-
                                                                      factorization. In NIPS, 2008.
vided valuable information. Our key insight based on these
experiments is that the top-k hit-rate or recall can be im-      [13] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted
proved considerably by training recommender systems with              Boltzmann machines for collaborative filtering. In
an appropriately chosen small positive prior weight concern-          ICML, 2007.
ing background knowledge on the missing-data mechanism.          [14] R. Salakhutdinov and N. Srebro. Collaborative
This is in contrast to the popular approach in the literature,        filtering in a non-uniform world: Learning with the
which only considers observed ratings.                                weighted trace norm, 2010. arXiv:1002.2780.
   As third contribution, we provided a generalized prob-        [15] C. Senot, D. Kostadinov, M. Bouzid, J. Picault,
abilistic framework for factorizing a user-item-matrix that           A. Aghasaryan, and C. Bernier. Analysis of strategies
possibly has multiple observed rating values associated with          for building group profiles. In Conference on User
each user-item pair. We discussed three important special             Modeling, Adaption and Personalization (UMAP),
cases / applications: besides training a top-k recommender            2010.
system by using virtual data points, we outlined how ratings     [16] N. Srebro and T. Jaakkola. Weighted low-rank
can be aggregated when items are grouped in a hierarchical            approximations. In ICML, pages 720–7, 2003.
manner rather than in a flat list, and how recommendations       [17] H. Steck. Training and testing of recommender
can be made using this hierarchical structure. Addition-              systems on data missing not at random. In KDD,
ally, this framework enabled us to derive the training objec-         2010.
tive function for a recommender system on sequential data        [18] M. Weimer, A. Karatzoglou, Q. Le, and A. Smola.
concerning TV consumption. This derivation not only pro-              Cofi rank - maximum margin matrix factorization for
vides a clear understanding of the assumptions underlying             collaborative ranking. In Advances in Neural
the training objective function, but also led to improvements         Information Processing Systems (NIPS), 2008.
in the top-k hit rate over state-of-the-art approaches in our
preliminary experiments.
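To make the aggregation of per-episode observations and the resulting training objective concrete, the following minimal Python sketch (our own illustration, not the authors' code; all function names, variable names, and toy numbers are hypothetical) averages the per-episode watch fractions r̄_{i,u,j} into r̄_{i,u} and evaluates the weighted least-squares part of the negative log likelihood, n_max ∑ t_{i,u} (r̄_{i,u} − M_{i,u})^2:

```python
# Illustrative sketch of the aggregated weighted least-squares objective.
# All names and toy numbers are hypothetical, chosen only for clarity.

def aggregate_fractions(episode_fractions):
    """Average per-episode watch fractions r_bar_{i,u,j} into r_bar_{i,u}.

    Returns (r_bar_{i,u}, t_{i,u}); the aggregate weight of the show in
    the objective is then t_{i,u} * n_max.
    """
    t_iu = len(episode_fractions)
    r_bar_iu = sum(episode_fractions) / t_iu
    return r_bar_iu, t_iu

def neg_log_likelihood(observed, model, n_max):
    """-log p(D|M) up to the irrelevant constant c_7:

    n_max * sum over watched (i,u) in S of t_{i,u} * (r_bar_{i,u} - M_{i,u})^2

    `observed` maps (show, user) -> list of per-episode watch fractions
    (so its keys play the role of the set S); `model` maps
    (show, user) -> predicted value M_{i,u}.
    """
    total = 0.0
    for iu, fractions in observed.items():
        r_bar, t = aggregate_fractions(fractions)
        total += t * (r_bar - model[iu]) ** 2
    return n_max * total

# Toy example: user u watched two episodes of show A (fully and half)
# and one episode of show B (30% of it).
observed = {("A", "u"): [1.0, 0.5], ("B", "u"): [0.3]}
model = {("A", "u"): 0.8, ("B", "u"): 0.2}
loss = neg_log_likelihood(observed, model, n_max=10)
```

Note that, per the derivation, a show with many episodes contributes with a proportionally larger weight t_{i,u}, while the fitted target stays a fraction in [0, 1] rather than the indicator value 1 used in [5].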