          Scalable Bayesian Matrix Factorization

          Avijit Saha⋆,1 , Rishabh Misra⋆,2 , and Balaraman Ravindran1

          1 Department of CSE, Indian Institute of Technology Madras, India
            {avijit, ravi}@cse.iitm.ac.in
          2 Department of CSE, Thapar University, India
            rishabhmisra1994@gmail.com



         Abstract. Matrix factorization (MF) is the simplest and most well studied
         factor based model and has been applied successfully in several domains.
         One of the standard ways to solve MF is by finding the maximum a
         posteriori estimate of the model parameters, which is equivalent to
         minimizing the regularized objective function. Stochastic gradient descent
         (SGD) is a common choice to minimize the regularized objective function.
         However, SGD suffers from the problem of overfitting and entails the
         tedious job of finding the learning rate and regularization parameters. A
         fully Bayesian treatment of MF avoids these problems. However, the
         existing Bayesian matrix factorization method based on the Markov chain
         Monte Carlo (MCMC) technique has cubic time complexity with respect
         to the target rank, which makes it less scalable. In this paper, we propose
         the Scalable Bayesian Matrix Factorization (SBMF), an MCMC (Gibbs
         sampling) algorithm for MF that has linear time complexity with respect
         to the target rank and linear space complexity with respect to the number
         of non-zero observations. We also show through extensive experiments on
         three sufficiently large real world datasets that SBMF incurs only a small
         loss in performance and takes much less time than the baseline method
         for higher latent dimensions.

         Keywords: Recommender Systems, Matrix Factorization, Bayesian In-
         ference, Markov Chain Monte Carlo, Scalability.


1      Introduction
Factor based models have been used extensively in collaborative filtering. In a
factor based model, preferences of each user are represented by a latent factor
vector. Matrix factorization (MF) [1–6] is the simplest and most well studied
factor based model and has been applied successfully in several domains. Formally,
MF recovers a low-rank latent structure of a matrix by approximating it as a
product of two low-rank matrices.
⋆
     Both the authors contributed equally.
     Copyright © 2015 by the paper's authors. Copying permitted only for private and
     academic purposes. In: M. Atzmueller, F. Lemmerich (Eds.): Proceedings of the 6th
     International Workshop on Mining Ubiquitous and Social Environments (MUSE),
     co-located with ECML PKDD 2015. Published at http://ceur-ws.org

For delineation, consider a user-movie matrix R ∈ ℝ^{I×J}, where the cell r_{ij} represents
the rating provided to the j-th movie by the i-th user. MF decomposes the matrix R into
two low-rank matrices U = [u_1, u_2, ..., u_I]^T ∈ ℝ^{I×K} and V = [v_1, v_2, ..., v_J]^T ∈ ℝ^{J×K}
(K is the latent space dimension) such that:

                                     R ∼ U V^T.                                       (1)
   Probabilistic Matrix Factorization (PMF) [4] provides a probabilistic inter-
pretation for MF. In PMF, latent factor vectors are assumed to be marginally in-
dependent, whereas rating variables, given the latent factor vectors, are assumed
to be conditionally independent. PMF considers the conditional distribution of
the rating variables (the likelihood term) as:
                  p(R | U, V, τ^{-1}) = ∏_{(i,j)∈Ω} N(r_{ij} | u_i^T v_j, τ^{-1}),              (2)

where Ω is the set of all observed entries in R provided during the training and
τ is the model precision. Zero-mean spherical Gaussian priors are placed on the
latent factor vectors of users and movies. The main drawback of this model is
that inferring the posterior distribution over the latent factor vectors, given the
ratings, is intractable. PMF handles this intractability by providing a maximum
a posteriori estimation of the model parameters by maximizing the log-posterior
over the model parameters, which is equivalent to minimizing the regularized
square error loss defined as:
                  ∑_{(i,j)∈Ω} (r_{ij} − u_i^T v_j)^2 + λ (||U||_F^2 + ||V||_F^2),               (3)


where λ is the regularization parameter and ||X||_F denotes the Frobenius norm of
X. The optimization problem in Eq. (3) can be solved using stochastic gradient
descent (SGD) [2]. SGD is an online algorithm which obviates the need to store
the entire dataset in memory. Although SGD is scalable and enjoys a local
convergence guarantee [7], it often overfits the data and requires manual tuning
of the learning rate and regularization parameters. Hence, maximum a posteriori
estimation of MF suffers from the problem of overfitting and entails the tedious
job of finding the learning rate (if SGD is the choice of optimization) and
regularization parameters.
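
For concreteness, a minimal sketch of one SGD epoch for the objective in Eq. (3) is
given below; the learning rate eta, the regularizer lam, and the (i, j, r) triple format
for the observed ratings are illustrative assumptions, not prescribed by the model.

    import numpy as np

    def sgd_epoch(ratings, U, V, eta=0.005, lam=0.05):
        # One pass over the observed (i, j, r) triples, descending the loss in Eq. (3).
        for i, j, r in ratings:
            err = r - U[i].dot(V[j])                   # prediction error r_ij - u_i^T v_j
            u_old = U[i].copy()                        # keep u_i fixed while updating v_j
            U[i] += eta * (err * V[j] - lam * u_old)   # gradient step for u_i
            V[j] += eta * (err * u_old - lam * V[j])   # gradient step for v_j
        return U, V

Both eta and lam must be tuned by hand, which is precisely the burden that the fully
Bayesian treatment discussed next removes.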
    On the other hand, fully Bayesian methods [5, 8–10] for MF do not require
manual tuning of the learning rate and regularization parameters and are ro-
bust to overfitting. As direct evaluation of the posterior is intractable in practice,
approximate inference techniques are adopted to learn the posterior distribution.
One possible choice is the variational approximate inference technique [8, 9].
Bayesian MF based on the variational approximation [11–13, 10] considers a
simplified factorized distribution and assumes that the latent factors of users are
independent of the latent factors of items while approximating the posterior. But
this assumption often leads to oversimplification and can produce inaccurate
results, as shown in [5]. On the
other hand, Markov chain Monte Carlo (MCMC) based approximation method

[Figure 1 omitted: graphical model showing the hyperpriors (µ_0, ν_0, α_0, β_0, a_0, b_0, µ_g, σ_g),
the hyperparameters (µ_α, σ_α, µ_β, σ_β, µ_{u_k}, σ_{u_k}, µ_{v_k}, σ_{v_k}), the biases α_i and β_j,
the latent factors u_{ik} and v_{jk} (k = 1..K), the precision τ, the global bias µ, and the
observed ratings r_{ij} (i = 1..I, j = 1..J).]

                 Fig. 1. Graphical model representation of SBMF.


can produce exact results when provided with infinite resources. MCMC based
Bayesian Probabilistic Matrix Factorization (BPMF) [5] directly approximates
the posterior distribution using the Gibbs sampling technique and outperforms
the variational based approximation.
    In BPMF, user/item latent factor vectors are assumed to follow a multivariate
Gaussian distribution, which results in cubic time complexity with respect to the
latent factor vector dimension. Though BPMF performs well in many applications,
this cubic time complexity makes it difficult to apply BPMF to very large datasets.
In this paper, we propose the Scalable Bayesian Matrix Factorization (SBMF)
based on MCMC Gibbs sampling, where we assume univariate Gaussian priors on
each dimension of the latent factors. Due to this assumption, the complexity of
SBMF reduces to linear with respect to the latent
factor vector dimension. We also consider user and item bias terms in SBMF
which are missing in BPMF. These bias terms capture the variation in rating
values that are independent of any user-item interaction. Also, the proposed
SBMF algorithm is parallelized for multicore environments. We show through
extensive experiments on three large-scale real-world datasets that the adopted
univariate approximation in SBMF results in only a small performance loss and
provides a significant speedup over the baseline method BPMF for higher values
of the latent dimension.


2     Method

2.1   Model

Fig. 1 shows a graphical model representation of SBMF. Consider Ω as the set
of observed entries in R provided during the training phase. The observed data
rij is assumed to be generated as follows:

                          r_{ij} = µ + α_i + β_j + u_i^T v_j + ε_{ij},                       (4)

where (i, j) ∈ Ω, µ is the global bias, α_i is the bias associated with the i-th
user, β_j is the bias associated with the j-th item, u_i is the latent factor vector of
dimension K associated with the i-th user, and v_j is the latent factor vector of
dimension K associated with the j-th item. Uncertainty in the model is absorbed
by the noise ε_{ij}, which is generated as ε_{ij} ∼ N(0, τ^{-1}), where τ is the precision
parameter. Bias terms are particularly helpful in capturing the individual bias of a
user or an item: a user may have a tendency to rate all items higher than other
users do, or an item may get higher ratings if it is perceived as better than the
others [2].
    The conditional on the observed entries of R (the likelihood term) can be
written as follows:
                             Y
                p(R|Θ) =          N (rij |µ + αi + βj + uTi v j , τ −1 ),         (5)
                             (i,j)∈Ω

where Θ = {τ, µ, {αi }, {βj }, U , V }. We place independent univariate priors on
all the model parameters in Θ as follows:

                            p(µ) =N (µ|µg , σg−1 ),                                              (6)
                           p(αi ) =N (αi |µα , σα−1 ),                                           (7)
                           p(βj ) =N (βj |µβ , σβ−1 ),                                           (8)
                                       I Y
                                       Y K
                           p(U ) =                 N (uik |µuk , σu−1
                                                                    k
                                                                      ),                         (9)
                                       i=1 k=1
                                       J Y
                                       Y K
                           p(V ) =                 N (vjk |µvk , σv−1
                                                                    k
                                                                      ),                        (10)
                                       j=1 k=1

                            p(τ ) =N (τ |a0 , b0 ).                                             (11)

   We further place Normal-Gamma priors on all the hyperparameters Θ_H =
{µ_α, σ_α, µ_β, σ_β, {µ_{u_k}, σ_{u_k}}, {µ_{v_k}, σ_{v_k}}} as follows:

                  p(µ_α, σ_α) = NG(µ_α, σ_α | µ_0, ν_0, α_0, β_0),                      (12)
                  p(µ_β, σ_β) = NG(µ_β, σ_β | µ_0, ν_0, α_0, β_0),                      (13)
                  p(µ_{u_k}, σ_{u_k}) = NG(µ_{u_k}, σ_{u_k} | µ_0, ν_0, α_0, β_0),      (14)
                  p(µ_{v_k}, σ_{v_k}) = NG(µ_{v_k}, σ_{v_k} | µ_0, ν_0, α_0, β_0).      (15)

We denote {a_0, b_0, µ_g, σ_g, µ_0, ν_0, α_0, β_0} as Θ_0 for notational convenience. The
joint distribution of the observations and the hidden variables can be written as:

   p(R, Θ, Θ_H | Θ_0) = p(R | Θ) p(µ) ∏_{i=1}^{I} p(α_i) ∏_{j=1}^{J} p(β_j) p(U) p(V)
                        p(µ_α, σ_α) p(µ_β, σ_β) ∏_{k=1}^{K} p(µ_{u_k}, σ_{u_k}) p(µ_{v_k}, σ_{v_k}).        (16)
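
As a concrete illustration, the following sketch draws a single rating from the generative
model of Eqs. (4)-(15); the array sizes and the fixed hyperparameter values are arbitrary
placeholders, chosen only to make the snippet self-contained.

    import numpy as np

    rng = np.random.default_rng(0)
    I, J, K = 100, 80, 5                          # illustrative numbers of users, items, factors
    tau = rng.gamma(1.0, 1.0)                     # precision: tau ~ Gamma(a0, b0) with a0 = b0 = 1
    mu = rng.normal(0.0, np.sqrt(1 / 0.01))       # global bias: mu ~ N(mu_g, sigma_g^{-1})
    alpha = rng.normal(0.0, 1.0, size=I)          # user biases: alpha_i ~ N(mu_alpha, sigma_alpha^{-1})
    beta = rng.normal(0.0, 1.0, size=J)           # item biases: beta_j ~ N(mu_beta, sigma_beta^{-1})
    U = rng.normal(0.0, 1.0, size=(I, K))         # u_ik ~ N(mu_uk, sigma_uk^{-1}), one prior per dimension k
    V = rng.normal(0.0, 1.0, size=(J, K))         # v_jk ~ N(mu_vk, sigma_vk^{-1})
    i, j = 3, 7
    r_ij = mu + alpha[i] + beta[j] + U[i].dot(V[j]) + rng.normal(0.0, 1 / np.sqrt(tau))   # Eq. (4)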

2.2       Inference


Since evaluation of the joint distribution in Eq. (16) is intractable, we adopt a
Gibbs sampling based approximate inference technique. As all our model parameters
are conditionally conjugate [10], the Gibbs sampling updates can be written in
closed form using the joint distribution given in Eq. (16). Substituting
Eqs. (5)-(15) into Eq. (16), the sampling distribution of u_{ik} can be written as
follows:

                  p(u_{ik} | −) ∼ N(u_{ik} | µ*, σ*),                                          (17)

where

                  σ* = (σ_{u_k} + τ ∑_{j∈Ω_i} v_{jk}^2)^{-1},                                  (18)

                  µ* = σ* (σ_{u_k} µ_{u_k} + τ ∑_{j∈Ω_i} v_{jk} (r_{ij} − (µ + α_i + β_j + ∑_{l=1, l≠k}^{K} u_{il} v_{jl}))).   (19)

Here, Ω_i is the set of items rated by the i-th user in the training set. Now,
directly sampling u_{ik} from Eq. (17) requires O(K|Ω_i|) complexity. However, if
we precompute the quantity e_{ij} = r_{ij} − (µ + α_i + β_j + u_i^T v_j) for all (i, j) ∈ Ω
and write Eq. (19) as:

                  µ* = σ* (σ_{u_k} µ_{u_k} + τ ∑_{j∈Ω_i} v_{jk} (e_{ij} + u_{ik} v_{jk})),      (20)

then the sampling complexity of u_{ik} reduces to O(|Ω_i|). Table 1 shows the space
and time complexities of SBMF and BPMF. We sample model parameters in
parallel whenever they are independent of each other. Algorithm 1 describes the
detailed Gibbs sampling procedure.


                             Table 1. Complexity Comparison

                    Method        Time Complexity           Space Complexity
                    SBMF          O(|Ω|K)                   O((I + J)K)
                    BPMF          O(|Ω|K^2 + (I + J)K^3)    O((I + J)K)
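
A minimal sketch of this residual-based update (Eqs. (17), (18), and (20)) is given
below, assuming the residuals e_{ij} are kept in a dictionary, Ω_i is a list of item
indices, and the hyperparameters for the current sweep have already been sampled; all
names are illustrative.

    import numpy as np

    def sample_u_ik(i, k, U, V, e, Omega_i, tau, mu_uk, sigma_uk, rng):
        # Sample u_ik from Eq. (17) in O(|Omega_i|) time using the cached residuals e_ij.
        v_k = np.array([V[j, k] for j in Omega_i])
        res = np.array([e[(i, j)] for j in Omega_i])
        prec = sigma_uk + tau * np.sum(v_k ** 2)                      # 1 / sigma*, Eq. (18)
        mean = (sigma_uk * mu_uk + tau * np.sum(v_k * (res + U[i, k] * v_k))) / prec   # Eq. (20)
        u_new = rng.normal(mean, 1.0 / np.sqrt(prec))
        for j in Omega_i:                                             # keep e_ij = r_ij - prediction consistent
            e[(i, j)] += V[j, k] * (U[i, k] - u_new)
        U[i, k] = u_new

Since only the |Ω_i| residuals touched by u_{ik} are read and written, a full sweep over
all users and factors costs O(|Ω|K), in line with Table 1.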

Algorithm 1 Scalable Bayesian Matrix Factorization (SBMF)
Require: Θ_0; initialize Θ and Θ_H.
Ensure: Compute e_{ij} for all (i, j) ∈ Ω.
 1: for t = 1 to T do
 2:    // Sample hyperparameters
 3:    α* = α_0 + (I + 1)/2, β* = β_0 + (ν_0 (µ_α − µ_0)^2 + ∑_{i=1}^{I} (α_i − µ_α)^2)/2. Sample σ_α ∼ Γ(α*, β*).
 4:    σ* = (ν_0 σ_α + σ_α I)^{-1}, µ* = σ* (ν_0 σ_α µ_0 + σ_α ∑_{i=1}^{I} α_i). Sample µ_α ∼ N(µ*, σ*).
 5:    α* = α_0 + (J + 1)/2, β* = β_0 + (ν_0 (µ_β − µ_0)^2 + ∑_{j=1}^{J} (β_j − µ_β)^2)/2. Sample σ_β ∼ Γ(α*, β*).
 6:    σ* = (ν_0 σ_β + σ_β J)^{-1}, µ* = σ* (ν_0 σ_β µ_0 + σ_β ∑_{j=1}^{J} β_j). Sample µ_β ∼ N(µ*, σ*).
 7:    for k = 1 to K do in parallel
 8:       α* = α_0 + (I + 1)/2, β* = β_0 + (ν_0 (µ_{u_k} − µ_0)^2 + ∑_{i=1}^{I} (u_{ik} − µ_{u_k})^2)/2.
 9:       Sample σ_{u_k} ∼ Γ(α*, β*).
10:       σ* = (ν_0 σ_{u_k} + σ_{u_k} I)^{-1}, µ* = σ* (ν_0 σ_{u_k} µ_0 + σ_{u_k} ∑_{i=1}^{I} u_{ik}). Sample µ_{u_k} ∼ N(µ*, σ*).
11:       α* = α_0 + (J + 1)/2, β* = β_0 + (ν_0 (µ_{v_k} − µ_0)^2 + ∑_{j=1}^{J} (v_{jk} − µ_{v_k})^2)/2.
12:       Sample σ_{v_k} ∼ Γ(α*, β*).
13:       σ* = (ν_0 σ_{v_k} + σ_{v_k} J)^{-1}, µ* = σ* (ν_0 σ_{v_k} µ_0 + σ_{v_k} ∑_{j=1}^{J} v_{jk}). Sample µ_{v_k} ∼ N(µ*, σ*).
14:    end for
15:    a_0* = a_0 + |Ω|/2, b_0* = b_0 + (∑_{(i,j)∈Ω} e_{ij}^2)/2. Sample τ ∼ Γ(a_0*, b_0*).
16:    // Sample model parameters
17:    σ* = (σ_g + τ|Ω|)^{-1}, µ* = σ* (σ_g µ_g + τ ∑_{(i,j)∈Ω} (e_{ij} + µ)). Sample µ ∼ N(µ*, σ*).
18:    for (i, j) ∈ Ω do in parallel
19:       e_{ij} = e_{ij} + (µ_old − µ)
20:    end for
21:    for i = 1 to I do in parallel
22:       σ* = (σ_α + τ|Ω_i|)^{-1}, µ* = σ* (σ_α µ_α + τ ∑_{j∈Ω_i} (e_{ij} + α_i)). Sample α_i ∼ N(µ*, σ*).
23:       for j ∈ Ω_i do
24:          e_{ij} = e_{ij} + (α_{i,old} − α_i)
25:       end for
26:       for k = 1 to K do
27:          σ* = (σ_{u_k} + τ ∑_{j∈Ω_i} v_{jk}^2)^{-1}, µ* = σ* (σ_{u_k} µ_{u_k} + τ ∑_{j∈Ω_i} v_{jk} (e_{ij} + u_{ik} v_{jk})).
28:          Sample u_{ik} ∼ N(µ*, σ*).
29:          for j ∈ Ω_i do
30:             e_{ij} = e_{ij} + v_{jk} (u_{ik,old} − u_{ik})
31:          end for
32:       end for
33:    end for
34:    for j = 1 to J do in parallel
35:       σ* = (σ_β + τ|Ω_j|)^{-1}, µ* = σ* (σ_β µ_β + τ ∑_{i∈Ω_j} (e_{ij} + β_j)). Sample β_j ∼ N(µ*, σ*).
36:       for i ∈ Ω_j do
37:          e_{ij} = e_{ij} + (β_{j,old} − β_j)
38:       end for
39:       for k = 1 to K do
40:          σ* = (σ_{v_k} + τ ∑_{i∈Ω_j} u_{ik}^2)^{-1}, µ* = σ* (σ_{v_k} µ_{v_k} + τ ∑_{i∈Ω_j} u_{ik} (e_{ij} + u_{ik} v_{jk})).
41:          Sample v_{jk} ∼ N(µ*, σ*).
42:          for i ∈ Ω_j do
43:             e_{ij} = e_{ij} + u_{ik} (v_{jk,old} − v_{jk})
44:          end for
45:       end for
46:    end for
47: end for
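
As a sanity check on step 3, the conditional of σ_α follows directly from the
Normal-Gamma prior of Eq. (12) and the Gaussian terms of Eq. (7):

    p(σ_α | −) ∝ Gamma(σ_α | α_0, β_0) · N(µ_α | µ_0, (ν_0 σ_α)^{-1}) · ∏_{i=1}^{I} N(α_i | µ_α, σ_α^{-1})
               ∝ σ_α^{α_0 + (I+1)/2 − 1} exp[ −σ_α ( β_0 + (ν_0 (µ_α − µ_0)^2 + ∑_{i=1}^{I} (α_i − µ_α)^2)/2 ) ],

which is a Gamma density with shape α* = α_0 + (I + 1)/2 and rate
β* = β_0 + (ν_0 (µ_α − µ_0)^2 + ∑_{i=1}^{I} (α_i − µ_α)^2)/2, i.e., exactly the update in
step 3. The remaining Normal-Gamma updates in Algorithm 1 (steps 5, 8, and 11)
follow the same pattern.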

3     Experiments

3.1    Datasets

In this section, we show empirical results on three large real world movie-rating
datasets^{3,4} to validate the effectiveness of SBMF. The details of these datasets
are provided in Table 2. Both Movielens datasets are publicly available, and a
90:10 split is used to create their train and test sets. For Netflix, the probe data
is used as the test set.


3.2    Experimental Setup and Parameter Selection

All the experiments are run on an Intel i5 machine with 16GB RAM. We consider
both the serial and the parallel implementation of SBMF for all the experiments.
In the parallel implementation, SBMF is parallelized in a multicore environment
using the OpenMP library. Although BPMF can also be parallelized, the base
paper [5] and its publicly available code provide only the serial implementation.
So in our experiments, we compare only the serial implementation of BPMF
against the serial and the parallel implementations of SBMF. The serial and
parallel versions of SBMF are denoted as SBMF-S and SBMF-P, respectively.
Since the performance of both SBMF and BPMF depends on the dimension of the
latent factor vector (K), it is necessary to investigate how the models work with
different values of K. So three sets of experiments are run for each dataset,
corresponding to K = {50, 100, 200}, for SBMF-S, SBMF-P, and BPMF. As our
main aim is to validate that SBMF is more scalable than BPMF under the same
conditions, we choose 50 burn-in iterations for all the experiments of SBMF-S,
SBMF-P, and BPMF. In the Gibbs sampling process, burn-in refers to the practice
of discarding an initial portion of a Markov chain sample so that the effect of the
initial values on the posterior inference is minimized. Note that if SBMF takes
less time than BPMF for a particular burn-in period, then increasing the number
of burn-in iterations will make SBMF even more scalable compared to BPMF.
Additionally, we allow the methods to have 100 collection iterations.
    In SBMF, we initialize the parameters in Θ using a Gaussian distribution with 0
mean and 0.01 variance. All the parameters in Θ_H are set to 0. Also, a_0, b_0, ν_0, α_0,
and β_0 are set to 1, µ_0 and µ_g are set to 0, and σ_g is initialized to 0.01. In
BPMF, we use the standard parameter settings as provided in the paper [5]. We
collect samples of the user and item latent factor vectors and the bias terms from
the collection iterations and approximate a rating r_{ij} as:

                  r̂_{ij} = (1/C) ∑_{c=1}^{C} (µ^c + α_i^c + β_j^c + u_i^c · v_j^c),           (21)

3
    http://grouplens.org/datasets/movielens/
4
    http://www.netflixprize.com/

                             Table 2. Dataset Description

                  Dataset     No. of users No. of movies No. of ratings
                Movielens 10m    71567         10681         10m
                Movielens 20m 138493           27278         20m
                   Netflix      480189         17770        100m



where u_i^c and v_j^c are the c-th drawn samples of the i-th user and the j-th item
latent factor vectors, respectively, and µ^c, α_i^c, and β_j^c are the c-th drawn samples
of the global bias, the i-th user bias, and the j-th item bias, respectively. C is the
number of drawn samples. The Root Mean Square Error (RMSE) [2] is used as the
evaluation metric for all the experiments. The code for SBMF will be made publicly
available^5.
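
A minimal sketch of this averaging and of the RMSE computation follows, assuming each
collected sample is stored as a (mu, alpha, beta, U, V) tuple and the test set as a list of
(i, j, r) triples; both conventions are illustrative.

    import numpy as np

    def predict_and_rmse(samples, test):
        # Average the collected posterior samples as in Eq. (21), then compute the RMSE.
        preds = np.zeros(len(test))
        for mu, alpha, beta, U, V in samples:          # one tuple per collection iteration
            preds += np.array([mu + alpha[i] + beta[j] + U[i].dot(V[j]) for i, j, _ in test])
        preds /= len(samples)
        truth = np.array([r for _, _, r in test])
        return np.sqrt(np.mean((truth - preds) ** 2))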


3.3     Results

In all the graphs of Fig. 2, the X-axis represents the time elapsed since the start
of the experiment and the Y-axis presents the RMSE value. Since we allow 50
burn-in iterations for all the experiments and each iteration of BPMF takes more
time than SBMF-P, the collection iterations of SBMF-P begin earlier than those of
BPMF. Thus we get the initial RMSE value of SBMF-P earlier. Similarly, each
iteration of SBMF-S takes less time than BPMF (except for K = {50, 100} on the
Netflix dataset). We believe that on the Netflix dataset (for K = {50, 100}), BPMF
takes less time than SBMF-S because BPMF is implemented in Matlab, where
matrix computations are efficient, whereas SBMF is implemented in C++ with
unoptimized matrix storage. As the Netflix data is large with respect to the number
of entries and the number of users and items, it involves more matrix operations
than the other datasets. So for lower values of K, the cost of matrix operations for
SBMF-S dominates the cost incurred by the O(K^3) complexity of BPMF, and
BPMF takes less time than SBMF-S. However, for large values of K, BPMF starts
taking more time as its O(K^3) complexity becomes dominant. We leave the task
of optimizing the code of SBMF as future work.
    We can observe from Fig. 2 that SBMF-P takes much less time than BPMF in
all the experiments and incurs only a small loss in performance. Similarly, SBMF-S
also takes less time than BPMF (except for K = {50, 100} on the Netflix dataset)
and incurs only a small performance loss. An important point to note is that the
total time difference between both variants of SBMF and BPMF increases with
the dimension of the latent factor vector, and the speedup is significantly high for
K = 200. Table 3 shows the final RMSE values and the total time taken
corresponding to each dataset and K. We find that the RMSE values for SBMF-S
and SBMF-P are very close in all the experiments. We also observe that increasing
the latent space dimension reduces the RMSE value on the Netflix dataset.
5
     https://github.com/avijit1990, https://github.com/rishabhmisra




[Figure 2 omitted: nine RMSE-versus-elapsed-time plots, panels (a)-(i), comparing BPMF,
SBMF-S, and SBMF-P.]

Fig. 2. Left, middle, and right columns show results for K = 50, 100, and 200, respectively.
{a,b,c}, {d,e,f}, and {g,h,i} are results on Movielens 10m, Movielens 20m, and Netflix
datasets, respectively.



With high latent dimensions, the running time of BPMF is significantly high due
to its cubic time complexity with respect to the latent space dimension: it takes
approximately 150 hours on the Netflix dataset with K = 200. However, SBMF has
linear time complexity with respect to the latent space dimension, and SBMF-P
and SBMF-S take only about 35 and 90 hours, respectively, on the Netflix dataset
with K = 200. Thus SBMF is better suited for large datasets with a large latent
space dimension. Similar speedup patterns are observed on the other datasets as
well.

                           Table 3. Results Comparison

                          K = 50        K = 100       K = 200
     Dataset   Method RMSE Time(Hr) RMSE Time(Hr) RMSE Time(Hr)
                BPMF 0.8629 1.317 0.8638 3.517 0.8651 22.058
 Movielens 10m SBMF-S 0.8655  1.091 0.8667  2.316 0.8654  5.205
               SBMF-P 0.8646 0.462 0.8659 0.990 0.8657 2.214
                BPMF 0.7534 2.683 0.7513 6.761 0.7508 45.355
 Movielens 20m SBMF-S 0.7553  2.364 0.7545  5.073 0.7549 11.378
               SBMF-P 0.7553 1.142 0.7545 2.427 0.7551 5.321
                BPMF 0.9057 11.739 0.9021 28.797 0.8997 150.026
    Netflix    SBMF-S 0.9048 17.973 0.9028 40.287 0.9017 89.809
               SBMF-P 0.9047 7.902 0.9026 16.477 0.9017 34.934


4    Related Work

MF [1–6] is widely used in several domains because of its performance and
scalability. Stochastic gradient descent [2] is the simplest method to solve MF, but
it often suffers from the problem of overfitting and requires manual tuning of the
learning rate and regularization parameters. Thus many Bayesian methods [5, 11, 13]
have been developed for MF that automatically select all the model parameters
and avoid the problem of overfitting. Variational Bayesian approximation based
MF [11] considers a simplified distribution to approximate the posterior, but this
method does not scale well to large datasets. Consequently, scalable variational
Bayesian methods [12, 13] have been proposed to handle large datasets. However,
variational approximation based Bayesian methods might give inaccurate results [5]
because of their oversimplified assumptions. Thus, Gibbs sampling based MF [5]
has been proposed, which gives better performance than its variational Bayesian
MF counterpart.
     Since the performance of MF depends on the latent dimensionality, several
nonparametric MF methods [14–16] have been proposed that set the number of
latent factors automatically. Non-negative matrix factorization (NMF) [3] is a
variant of MF which recovers two low-rank matrices, each of which is non-negative.
Bayesian NMF [6, 17] considers a Poisson likelihood and different types of priors
and generates a family of MF models based on the prior imposed on the latent
factors. Also, in the real world, the preferences of users change over time. To
incorporate these dynamics into the model, several dynamic MF models [18, 19]
have been developed.


5    Conclusion and Future Work

We have proposed the Scalable Bayesian Matrix Factorization (SBMF), which
is a Markov chain Monte Carlo based Gibbs sampling algorithm for matrix fac-
torization and has linear time complexity with respect to the target rank and
linear space complexity with respect to the number of non-zero observations.

SBMF gives competitive performance in less time as compared to the baseline
method. Experiments on several real world datasets show the effectiveness of
SBMF. In the future, it would be interesting to extend this method to applications
like matrix factorization with side information, where the time complexity is cubic
with respect to the number of features (which can be very large in practice).


6    Acknowledgement

This project was supported by Ericsson India.


References
 1. N. Srebro and T. Jaakkola, “Weighted low-rank approximations,” in Proc. of
    ICML, pp. 720–727, AAAI Press, 2003.
 2. Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recom-
    mender systems,” Computer, vol. 42, pp. 30–37, Aug. 2009.
 3. D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in
    Proc. of NIPS, pp. 556–562, 2000.
 4. R. Salakhutdinov and A. Mnih, “Probabilistic matrix factorization,” in Proc. of
    NIPS, 2007.
 5. R. Salakhutdinov and A. Mnih, “Bayesian probabilistic matrix factorization using
    markov chain monte carlo,” in Proc. of ICML, pp. 880–887, 2008.
 6. P. Gopalan, J. Hofman, and D. Blei, “Scalable recommendation with Poisson fac-
    torization,” CoRR, vol. abs/1311.1704, 2013.
 7. M.-A. Sato, “Online model selection based on the variational Bayes,” Neural Com-
    putation, vol. 13, no. 7, pp. 1649–1681, 2001.
 8. M. J. Beal, “Variational algorithms for approximate Bayesian inference,” PhD thesis,
    Gatsby Computational Neuroscience Unit, University College London, 2003.
 9. D. Tzikas, A. Likas, and N. Galatsanos, “The variational approximation for
    Bayesian inference,” IEEE Signal Processing Magazine, vol. 25, pp. 131–146, Nov.
    2008.
10. M. Hoffman, D. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,”
    JMLR, vol. 14, pp. 1303–1347, May 2013.
11. Y. Lim and Y. Teh, “Variational bayesian approach to movie rating prediction,”
    in Proc. of KDDCup, 2007.
12. J. Silva and L. Carin, “Active learning for online bayesian matrix factorization,”
    in Proc. of KDD, pp. 325–333, 2012.
13. Y. Kim and S. Choi, “Scalable variational Bayesian matrix factorization with side
    information,” in Proc. of AISTATS, pp. 493–502, 2014.
14. M. Zhou and L. Carin, “Negative binomial process count and mixture modeling,”
    IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 99, no. PrePrints,
    p. 1, 2013.
15. M. Zhou, L. Hannah, D. B. Dunson, and L. Carin, “Beta-negative binomial process
    and poisson factor analysis,” in Proc. of AISTATS, pp. 1462–1471, 2012.
16. M. Xu, J. Zhu, and B. Zhang, “Nonparametric max-margin matrix factorization
    for collaborative prediction,” in Proc. of NIPS, pp. 64–72, 2012.

17. P. Gopalan, F. Ruiz, R. Ranganath, and D. Blei, “Bayesian nonparametric Poisson
    factorization for Recommendation Systems,” in Proc. of AISTATS, pp. 275–283,
    2014.
18. Y. Koren, “Factorization meets the neighborhood: A multifaceted collaborative
    filtering model,” in Proc. of KDD, pp. 426–434, 2008.
19. L. Xiong, X. Chen, T. K. Huang, J. Schneider, and J. G. Carbonell, “Temporal
    collaborative filtering with Bayesian probabilistic tensor factorization.,” in Proc.
    of SDM, pp. 211–222, SIAM, 2010.