<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Another View on Optimization as Probabilistic Inference</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felix Gonsior</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nico Piatkowski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katharina Morik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Dortmund, AI Group</institution>
          ,
          <addr-line>Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We convert an optimization model for Boolean Matrix Factorization (BMF) into a Bayesian probabilistic model by plugging it into a probabilistic context. We infer the parameter distribution using a state-of-the-art sampling method based on Langevin diffusions. A visual analysis of the sampled uncertainty values shows a connection to the model uncertainty.</p>
      </abstract>
      <kwd-group>
        <kwd>Matrix factorization</kwd>
        <kwd>Bayesian models</kwd>
        <kwd>Markov-Chain Monte Carlo</kwd>
        <kwd>Langevin diffusion</kwd>
        <kwd>uncertainty</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Boolean Matrix Factorization belongs to the more prominent machine learning
models. However, multiple issues with the BMF approach require a lot of work
from the user of existing methods to verify that a given solution can be trusted.
Therefore the focus of recent work by Hess et al. on the PAL-Tiling framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
lies on increasing the trustworthiness of BMF by increasing the interpretability
of its results.
      </p>
      <p>
        BMF is an unsupervised learning method which can discover meaningful
whole-parts relationships within Boolean databases. A Boolean database is
represented as a matrix $D \in \{0, 1\}^{m \times n}$ where the columns are labeled with items
(features) $I = \{1, \ldots, n\}$ and the rows are labeled with transactions (data cases)
$T = \{1, \ldots, m\}$. If an item occurs in a transaction, the corresponding matrix
entry is set to one. As an example, take the database of a movie rental service,
where transactions represent customers and items represent movies. For
customers that have rented a certain movie at least once, the corresponding matrix
entry is set to one. Subsets of users which have rented similar subsets of movies
form patterns (interpreted as different users having a similar taste in movies).
BMF assumes that the matrix $D$ can be constructed by
multiplying two smaller factor matrices. Assume factor matrices $X \in \mathbb{B}^{n \times r}$ and
$Y \in \mathbb{B}^{m \times r}$ with factorization rank $r > 0$ and $r \ll \min(m, n)$. The goal is
to minimize the reconstruction error $|D - YX^T|$ as follows [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
$$\min_{X, Y}\; |D - YX^T| + |X| + |Y| \quad \text{s.t.}\; X \in \mathbb{B}^{n \times r},\, Y \in \mathbb{B}^{m \times r} \tag{1}$$
      </p>
      <p>
        Having a small factorization rank reduces the amount of space available in the
factor matrices. If it is possible to construct the full matrix $D$ from the smaller
matrices, these must contain the same information as $D$ itself. In this way BMF
achieves a compression of $D$. Due to the nontrivial interactions within the model,
in many cases it is not clear if a given BMF solution can be trusted. It is unknown
whether the dataset factors as assumed by BMF, and with which factorization rank.
Non-separable patterns might yield many false positives. Column permutations
of the factor matrices do not change the objective value, blowing up the search
space. Finally, NP-completeness as well as APX-hardness have been proven [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
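      <p>To make the objective concrete, the following minimal sketch (ours, not part of the original paper; Python with NumPy assumed) evaluates eq. (1) for given Boolean factors on a toy database:</p>
      <preformat>
import numpy as np

def boolean_product(X, Y):
    # Boolean product: entry (t, i) is 1 iff Y[t, k] AND X[i, k] for some k
    return (Y.astype(int) @ X.T.astype(int)) > 0

def bmf_objective(D, X, Y):
    # reconstruction error |D - Y X^T| plus the regularizers |X| + |Y| of eq. (1)
    return np.sum(D != boolean_product(X, Y)) + X.sum() + Y.sum()

# toy database: two disjoint patterns, m = n = 4, rank r = 2
D = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=bool)
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=bool)  # items x rank
Y = X.copy()                                                # transactions x rank
print(bmf_objective(D, X, Y))  # 8: perfect reconstruction, |X| + |Y| = 8
      </preformat>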
    </sec>
    <sec id="sec-2">
      <title>Implementation</title>
      <p>
        With the formulation in eq. (1) we still have a hard-to-solve combinatorial
optimization problem. Hess et al. obtain an approximate solution by relaxing
the parameters to the interval $[0, 1]$, thereby obtaining the related Nonnegative
Matrix Factorization (NMF) problem [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. They solve this problem with gradient
descent, followed by reconstruction of binary factor matrices by thresholding. In
our work we follow this idea, but instead of optimizing we use a probabilistic
model over the factor matrices $X, Y$, from which we obtain a sample.
      </p>
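      <p>As an illustration of this relax-and-threshold idea, here is a minimal sketch (our own simplification: plain projected gradient descent on the relaxed problem stands in for the proximal scheme of Hess et al., and the 0.5 threshold is an assumption):</p>
      <preformat>
import numpy as np

def relaxed_nmf(D, r, steps=2000, lr=1e-3, seed=0):
    # gradient descent on F(X, Y; D) = 0.5 * ||D - Y X^T||_2^2
    # with the parameters relaxed to the interval [0, 1]
    rng = np.random.default_rng(seed)
    m, n = D.shape
    X = rng.uniform(0, 1, (n, r))
    Y = rng.uniform(0, 1, (m, r))
    for _ in range(steps):
        R = Y @ X.T - D          # residual of the relaxed reconstruction
        gX, gY = R.T @ Y, R @ X  # gradients dF/dX and dF/dY
        X -= lr * gX
        Y -= lr * gY
        np.clip(X, 0, 1, out=X)  # project back onto [0, 1]
        np.clip(Y, 0, 1, out=Y)
    return X, Y

# recover Boolean factors from the relaxed solution by thresholding
X, Y = relaxed_nmf(np.random.rand(64, 32) > 0.7, r=5)
Xb, Yb = X >= 0.5, Y >= 0.5
      </preformat>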
      <p>
        We assume a Gibbs distribution $p(\theta) = \frac{1}{Z} e^{-E(\theta)}$, where $E(\theta)$ is called the energy
function and $Z$ is the normalization constant (as an integral over the whole parameter
space, it is often intractable). To plug an optimization model
into this formula, we reinterpret the objective of the optimization problem as an
energy function over the parameters $X, Y$ given the data $D$. In our case we chose the
PANPAL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] objective for the reconstruction error $F(X, Y; D) = \frac{1}{2} \|D - YX^T\|_2^2$
without the regularizer. Using $F$ as the energy and adding a prior distribution
$p(X, Y) = p(x_{11}, \ldots, x_{nr}, y_{11}, \ldots, y_{mr}) = \mathcal{N}(0, I)$ in place of the regularizer we
have
$$p(X, Y \mid D) = \frac{1}{Z} e^{-F(X, Y; D)}\, p(X, Y).$$
Therefore, low reconstruction errors are associated with high probabilities.
Standard normal priors for each parameter model sparsity assumptions for the factor
matrices. Recent research has focused on gradient-based MCMC sampling
approaches inspired by physical models of motion. The Langevin diffusion
$$dX = \frac{1}{2} \nabla \log L_D(X)\, dt + dB$$
with $dB_t \sim \mathcal{N}(0, I\, dt)$, i.e. $B$ is a Brownian motion, models a particle moving in an external
potential $L_D(X)$ and influenced by the random effects $dB$. In our case, the
potential $L_D(X)$ is given by the log-likelihood function of our probabilistic model.
This continuous stochastic process is guaranteed to converge in its stationary
distribution [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to $P_D(X) \propto L_D(X)$. Starting with the work of Welling and
Teh[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], multiple samplers based on this principle have been constructed. Some use
minibatches $\tilde{D} \subset D$ to simulate the noise term $dB$ for resource-constrained
operation. Recent work by Mandt et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] gives an optimal step size for
log-quadratic likelihoods. They develop the Iterate Averaging Stochastic Gradient
(IASG) sampler with the update equation
      </p>
      <p>
$$X_t = X_{t-1} + \epsilon\, \nabla_X \log L_D(X_{t-1}; \tilde{D}_t) \tag{2}$$
Here $\epsilon = \frac{S}{N} F(X)^{-1}_{ii}$; the optimal step size depends on the number of data cases
$N$ and the batch size $S$ as well as on the diagonal of the empirical Fisher
information [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] $F(X) = \operatorname{cov}[\nabla_X \log L_D(X)]$. Some changes to eq. (2) are necessary
in our case; they are not presented here due to space constraints. The update
equation is iterated starting from a given $X_0$. Decorrelated samples are produced
by Polyak averaging over $N/S$ intermediate results $X_t$.
      </p>
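      <p>The sketch below illustrates the flavor of such a sampler (ours, simplified: it assumes a generic stochastic-gradient function of the log-likelihood that draws its own minibatch, estimates the diagonal Fisher information crudely from a handful of gradients at the start, and omits the problem-specific changes mentioned above):</p>
      <preformat>
import numpy as np

def iasg_sample(stoch_grad, theta0, N, S, n_samples, burn_in=100, seed=0):
    # stoch_grad(theta, rng): minibatch estimate of grad_X log L_D(theta)
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    # crude diagonal empirical Fisher estimate: per-coordinate gradient variance
    grads = np.stack([stoch_grad(theta, rng) for _ in range(32)])
    fisher_diag = grads.var(axis=0) + 1e-8
    eps = (S / N) / fisher_diag        # per-coordinate step size (S/N) F^-1_ii
    window = max(1, N // S)            # Polyak-average over N/S iterates
    for _ in range(burn_in):           # move towards the posterior mode first
        theta += eps * stoch_grad(theta, rng)
    samples = []
    for _ in range(n_samples):
        avg = np.zeros_like(theta)
        for _ in range(window):        # iterate the update of eq. (2) ...
            theta += eps * stoch_grad(theta, rng)
            avg += theta
        samples.append(avg / window)   # ... and average to decorrelate samples
    return np.array(samples)
      </preformat>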
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        We devised two scenarios to test our implementation. In one case the sampling
process begins at a random point in parameter space. In the other case sampling
starts at a known local optimum. For each scenario we generated test datasets
with data matrices of size $512 \times 512$ and factorization ranks $r \in \{20, 30, 40\}$
using the method given in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] leading to a parameter space of dimension 40960 for the
large instances. In the first test runs we drew samples of size 60k, which for the
random-start scenario did not yield proper estimates of the posterior parameter
distribution but still showed tendencies. With samples of size 600k and 2M we
observed improved sample quality. In figures 1a and 1b we have visualized
samples for different configurations as scatterplots. Each point in a diagram corresponds
to the mean and standard deviation of an entry within a factor matrix, with means on
the x-axis and standard deviations on the y-axis. Different colors mark samples of different sizes.
[Figure 1: (a) correct factorization rank (30); (b) reduced factorization rank (20)]
As the factors are sampled on the relaxed (NMF) problem, they take values in
$[0, 1]$. Mean values near 0.5 signal internal conflicts in the model, mostly due
to noisy data. In NMF this effect is known as fuzzy assignment between items
and transactions. In this way the parameter values in NMF themselves express
a type of uncertainty within the model. The standard deviation values on the
y-axis are obtained through the sampling process and express uncertainty in the
sense of our probabilistic model.
      </p>
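      <p>Such a scatterplot can be produced directly from the drawn samples; a minimal sketch (ours, assuming matplotlib and one flattened sample matrix per sample size):</p>
      <preformat>
import numpy as np
import matplotlib.pyplot as plt

def plot_mean_vs_std(samples_by_size):
    # samples_by_size: {label: array of shape (n_samples, n_parameters)}
    for label, chain in samples_by_size.items():
        plt.scatter(chain.mean(axis=0), chain.std(axis=0), s=2, label=str(label))
    plt.xlabel("mean of factor matrix entry")
    plt.ylabel("standard deviation")
    plt.legend(title="sample size")
    plt.show()
      </preformat>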
      <p>In the upper rows of figures 1a and 1b we show samples of different sizes
for which the element values of the initial parameters $X_0, Y_0$ were chosen uniformly at
random from $[0, 1]$. For each sample size (color) we observe a dense cluster
of mean values around zero (sparsity) and also a lower-density cluster of mean
values with high standard deviations. With increasing sample size this cluster
moves towards mean one and towards lower standard deviations. As there is an
abundance of zeros within the data and also a sparsity prior, it becomes clear
that the correct assignment of the crucial nonzero values is only validated after
many observations of the parameter space, i.e. after a long sampling period.</p>
      <p>A totally different situation presents itself in the lower rows of figures 1a
and 1b. In these plots the sampling process has been started at a known local
optimum for the factor matrices. Also, in figure 1a the factorization rank matches
the rank used when generating the data (30), while it is mismatched in figure 1b.
In figure 1a we observe a very tight clustering of mean values around zero and
one. In addition, for each cluster we observe an almost linear relationship between
the mean values and the corresponding standard deviations. With increasing
sample size the standard deviations diminish, expressing very low probabilistic
uncertainty, while the spread in the mean values is still visible. For a matched
factorization rank, both types of uncertainty express related facts about the
model fit. The situation is different with a mismatched factorization rank, as
shown in figure 1b. While we observe clustering at means zero and one, we also
observe values in between with large standard deviations. With increasing sample
size these large standard deviations prevail, meaning that some conflicts between
data and model are not resolvable in terms of fuzzy assignments. In this way
probabilistic uncertainty adds valuable information which cannot be obtained
by only looking at fuzzy assignments.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have "plugged in" the BMF/NMF problem into a probabilistic model and
analyzed different samples. With a mismatched factorization rank, the information
about the rank mismatch is carried in the probabilistic uncertainty but not in the
fuzzy assignment. Developing this insight is a target of further research. Realizing
sampling for BMF/NMF posed nontrivial challenges even with state-of-the-art
methods, requiring further research into sampling methods for high-dimensional
spaces.</p>
      <p>Acknowledgement This research has been funded by the Federal Ministry of
Education and Research of Germany (BMBF) as part of the competence center
for machine learning ML2R (01jS18038A).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Girolami</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Calderhead</surname>, <given-names>B.</given-names></string-name>:
          <article-title>Riemann manifold Langevin and Hamiltonian Monte Carlo methods</article-title>.
          <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source>
          <volume>73</volume>(<issue>2</issue>), <fpage>123</fpage>-<lpage>214</lpage> (Mar <year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Hess</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Morik</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Piatkowski</surname>, <given-names>N.</given-names></string-name>:
          <article-title>The PRIMPING routine – Tiling through proximal alternating linearized minimization</article-title>.
          <source>Data Mining and Knowledge Discovery</source>
          <volume>31</volume>(<issue>4</issue>), <fpage>1090</fpage>-<lpage>1131</lpage> (Jul <year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Mandt</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Hoffman</surname>, <given-names>M.D.</given-names></string-name>,
          <string-name><surname>Blei</surname>, <given-names>D.M.</given-names></string-name>:
          <article-title>Stochastic Gradient Descent As Approximate Bayesian Inference</article-title>.
          <source>J. Mach. Learn. Res.</source>
          <volume>18</volume>(<issue>1</issue>), <fpage>4873</fpage>-<lpage>4907</lpage> (Jan <year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Miettinen</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Mielikäinen</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Gionis</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Das</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Mannila</surname>, <given-names>H.</given-names></string-name>:
          <article-title>The Discrete Basis Problem</article-title>.
          In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD. pp.
          <fpage>335</fpage>-<lpage>346</lpage>. Springer (<year>2006</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Paatero</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Tapper</surname>, <given-names>U.</given-names></string-name>:
          <article-title>Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values</article-title>.
          <source>Environmetrics</source>
          <volume>5</volume>(<issue>2</issue>), <fpage>111</fpage>-<lpage>126</lpage> (Jun <year>1994</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Roberts</surname>, <given-names>G.O.</given-names></string-name>,
          <string-name><surname>Tweedie</surname>, <given-names>R.L.</given-names></string-name>:
          <article-title>Exponential Convergence of Langevin Distributions and Their Discrete Approximations</article-title>.
          <source>Bernoulli</source>
          <volume>2</volume>(<issue>4</issue>), <fpage>341</fpage> (Dec <year>1996</year>)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Welling</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Teh</surname>, <given-names>Y.W.</given-names></string-name>:
          <article-title>Bayesian Learning via Stochastic Gradient Langevin Dynamics</article-title>.
          In: Getoor, L., Scheffer, T. (eds.)
          <source>Proceedings of the 28th International Conference on Machine Learning (ICML)</source>. pp.
          <fpage>681</fpage>-<lpage>688</lpage>. ACM, New York, NY, USA (<year>2011</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>