<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Another View on Optimization as Probabilistic Inference</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felix Gonsior</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nico Piatkowski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katharina Morik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Dortmund, AI Group</institution>
          ,
          <addr-line>Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We convert an optimization model for Boolean Matrix Factorization (BMF) into a Bayesian probabilistic model by plugging it into a probabilistic context. We infer the parameter distribution using a state-of-the-art sampling method based on Langevin diffusions. A visual analysis of the sampled uncertainty values shows a connection to the model uncertainty.</p>
      </abstract>
      <kwd-group>
        <kwd>Matrix factorization</kwd>
        <kwd>Bayesian models</kwd>
        <kwd>Markov-Chain Monte Carlo</kwd>
        <kwd>Langevin diffusion</kwd>
        <kwd>uncertainty</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Boolean Matrix Factorization belongs to the more prominent machine learning
models. However, multiple issues with the BMF approach require a lot of work
from the user of existing methods to verify that a given solution can be trusted.
Therefore the focus of recent work by Hess et al. on the PAL-Tiling framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
lies on increasing the trustworthiness of BMF by increasing the interpretability
of its results.
      </p>
      <p>
        BMF is an unsupervised learning method which can discover meaningful
whole-parts relationships within Boolean databases. A Boolean database is
represented as a matrix $D \in \{0, 1\}^{m \times n}$ where the columns are labeled with items
(features) $I = \{1, \ldots, n\}$ and the rows are labeled with transactions (data cases)
$T = \{1, \ldots, m\}$. If an item occurs in a transaction, the corresponding matrix
entry is set to one. As an example, take the database of a movie rental service,
where transactions represent customers and items represent movies. For
customers that have rented a certain movie at least once, the corresponding matrix
entry is set to one. Subsets of users which have rented similar subsets of movies
form patterns (interpreted as different users having a similar taste in movies).
BMF assumes that the matrix $D$ can be constructed by
multiplying two smaller factor matrices. Assume factor matrices $X \in \mathbb{B}^{n \times r}$ and
$Y \in \mathbb{B}^{m \times r}$ with factorization rank $r > 0$ and $r \ll \min(m, n)$. The goal is
to minimize the reconstruction error $|D - YX^T|$ as follows [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
$$\min_{X, Y}\; |D - YX^T| + |X| + |Y| \quad \text{s.t.}\; X \in \mathbb{B}^{n \times r},\, Y \in \mathbb{B}^{m \times r} \tag{1}$$
      </p>
      <p>
        Having a small factorization rank reduces the amount of space available in the
factor matrices. If it is possible to construct the full matrix $D$ from the smaller
matrices, these must contain the same information as $D$ itself. In this way BMF
achieves a compression of $D$. Due to the nontrivial interactions within the model,
in many cases it is not clear if a given BMF solution can be trusted. It is unknown
whether the dataset factors as assumed by BMF, and with which factorization rank.
Non-separable patterns might yield many false positives. Column permutations
of the factor matrices do not change the objective value, blowing up the search
space. Finally, NP-completeness as well as APX-hardness have been proven [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
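      <p>To make the objective concrete, the following minimal sketch (ours, not part of the original paper; Python with NumPy assumed) evaluates eq. (1) for given Boolean factors on a toy database:</p>
      <preformat>
import numpy as np

def boolean_product(X, Y):
    # Boolean product: entry (t, i) is 1 iff Y[t, k] AND X[i, k] for some k
    return (Y.astype(int) @ X.T.astype(int)) > 0

def bmf_objective(D, X, Y):
    # reconstruction error |D - Y X^T| plus the regularizers |X| + |Y| of eq. (1)
    return np.sum(D != boolean_product(X, Y)) + X.sum() + Y.sum()

# toy database: two disjoint patterns, m = n = 4, rank r = 2
D = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=bool)
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=bool)  # items x rank
Y = X.copy()                                                # transactions x rank
print(bmf_objective(D, X, Y))  # 8: perfect reconstruction, |X| + |Y| = 8
      </preformat>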
    </sec>
    <sec id="sec-2">
      <title>Implementation</title>
      <p>
        With the formulation in eq. (1) we still have a hard-to-solve combinatorial
optimization problem. Hess et al. obtain an approximate solution by relaxing
the parameters to the interval $[0, 1]$, thereby obtaining the related Nonnegative
Matrix Factorization (NMF) problem [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. They solve this problem with gradient
descent, followed by reconstruction of binary factor matrices by thresholding. In
our work we follow this idea, but instead of optimizing we use a probabilistic
model over the factor matrices $X, Y$, from which we obtain a sample.
      </p>
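      <p>As an illustration of this relax-and-threshold idea, here is a minimal sketch (our own simplification: plain projected gradient descent on the relaxed problem stands in for the proximal scheme of Hess et al., and the 0.5 threshold is an assumption):</p>
      <preformat>
import numpy as np

def relaxed_nmf(D, r, steps=2000, lr=1e-3, seed=0):
    # gradient descent on F(X, Y; D) = 0.5 * ||D - Y X^T||_2^2
    # with the parameters relaxed to the interval [0, 1]
    rng = np.random.default_rng(seed)
    m, n = D.shape
    X = rng.uniform(0, 1, (n, r))
    Y = rng.uniform(0, 1, (m, r))
    for _ in range(steps):
        R = Y @ X.T - D          # residual of the relaxed reconstruction
        gX, gY = R.T @ Y, R @ X  # gradients dF/dX and dF/dY
        X -= lr * gX
        Y -= lr * gY
        np.clip(X, 0, 1, out=X)  # project back onto [0, 1]
        np.clip(Y, 0, 1, out=Y)
    return X, Y

# recover Boolean factors from the relaxed solution by thresholding
X, Y = relaxed_nmf(np.random.rand(64, 32) > 0.7, r=5)
Xb, Yb = X >= 0.5, Y >= 0.5
      </preformat>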
      <p>
        We assume a Gibbs distribution $p(\theta) = \frac{1}{Z} e^{-E(\theta)}$, where $E(\theta)$ is called the energy
function and $Z$ is the normalization constant (as an integral over the whole parameter
space, it is often intractable). To plug an optimization model
into this formula, we reinterpret the objective of the optimization problem as an
energy function over the parameters $X, Y$ given the data $D$. In our case we chose the
PANPAL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] objective for the reconstruction error $F(X, Y; D) = \frac{1}{2} \|D - YX^T\|_2^2$
without the regularizer. Using $F$ as the energy and adding a prior distribution
$p(X, Y) = p(x_{11}, \ldots, x_{nr}, y_{11}, \ldots, y_{mr}) = \mathcal{N}(0, I)$ in place of the regularizer we
have
$$p(X, Y \mid D) = \frac{1}{Z} e^{-F(X, Y; D)}\, p(X, Y).$$
Therefore, low reconstruction errors are associated with high probabilities.
Standard normal priors for each parameter model sparsity assumptions for the factor
matrices. Recent research has focused on gradient-based MCMC sampling
approaches inspired by physical models of motion. The Langevin diffusion
$$dX = \frac{1}{2} \nabla \log L_D(X)\, dt + dB$$
with $dB_t \sim \mathcal{N}(0, I\, dt)$, i.e. $B$ is a Brownian motion, models a particle moving in an external
potential $L_D(X)$ and influenced by the random effects $dB$. In our case, the
potential $L_D(X)$ is given by the log-likelihood function of our probabilistic model.
This continuous stochastic process is guaranteed to converge in its stationary
distribution [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to $P_D(X) \propto L_D(X)$. Starting with the work of Welling and
Teh[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], multiple samplers based on this principle have been constructed. Some use
minibatches $\tilde{D} \subset D$ to simulate the noise term $dB$ for resource-constrained
operation. Recent work by Mandt et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] gives an optimal step size for
log-quadratic likelihoods. They develop the Iterate Averaging Stochastic Gradient
(IASG) sampler with the update equation
      </p>
      <p>
$$X_t = X_{t-1} + \epsilon\, \nabla_X \log L_D(X_{t-1}; \tilde{D}_t) \tag{2}$$
Here $\epsilon = \frac{S}{N} F(X)^{-1}_{ii}$; the optimal step size depends on the number of data cases
$N$ and the batch size $S$ as well as on the diagonal of the empirical Fisher
information [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] $F(X) = \operatorname{cov}[\nabla_X \log L_D(X)]$. Some changes to eq. (2) are necessary
in our case; they are not presented here due to space constraints. The update
equation is iterated starting from a given $X_0$. Decorrelated samples are produced
by Polyak averaging over $N/S$ intermediate results $X_t$.
      </p>
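      <p>The sketch below illustrates the flavor of such a sampler (ours, simplified: it assumes a generic stochastic-gradient function of the log-likelihood that draws its own minibatch, estimates the diagonal Fisher information crudely from a handful of gradients at the start, and omits the problem-specific changes mentioned above):</p>
      <preformat>
import numpy as np

def iasg_sample(stoch_grad, theta0, N, S, n_samples, burn_in=100, seed=0):
    # stoch_grad(theta, rng): minibatch estimate of grad_X log L_D(theta)
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    # crude diagonal empirical Fisher estimate: per-coordinate gradient variance
    grads = np.stack([stoch_grad(theta, rng) for _ in range(32)])
    fisher_diag = grads.var(axis=0) + 1e-8
    eps = (S / N) / fisher_diag        # per-coordinate step size (S/N) F^-1_ii
    window = max(1, N // S)            # Polyak-average over N/S iterates
    for _ in range(burn_in):           # move towards the posterior mode first
        theta += eps * stoch_grad(theta, rng)
    samples = []
    for _ in range(n_samples):
        avg = np.zeros_like(theta)
        for _ in range(window):        # iterate the update of eq. (2) ...
            theta += eps * stoch_grad(theta, rng)
            avg += theta
        samples.append(avg / window)   # ... and average to decorrelate samples
    return np.array(samples)
      </preformat>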
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        We devised two scenarios to test our implementation. In one case the sampling
process begins at a random point in parameter space. In the other case sampling
starts at a known local optimum. For each scenario we generated test datasets
with data matrices of size $512 \times 512$ and factorization ranks $r \in \{20, 30, 40\}$
using the method given in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] leading to a parameter space of dimension 40960 for the
large instances. In the first test runs we drew samples of size 60k, which for the
random-start scenario did not yield proper estimates of the posterior parameter
distribution but still showed tendencies. With samples of size 600k and 2M we
observed improved sample quality. In figures 1a and 1b we have visualized
samples for different configurations as scatterplots. Each point in a diagram corresponds
to the mean and standard deviation of an entry within a factor matrix, with means on
the x-axis and standard deviations on the y-axis. Different colors mark samples of different sizes.
[Figure 1: (a) correct factorization rank (30); (b) reduced factorization rank (20)]
As the factors are sampled on the relaxed (NMF) problem, they take values in
$[0, 1]$. Mean values near 0.5 signal internal conflicts in the model, mostly due
to noisy data. In NMF this effect is known as fuzzy assignment between items
and transactions. In this way the parameter values in NMF themselves express
a type of uncertainty within the model. The standard deviation values on the
y-axis are obtained through the sampling process and express uncertainty in the
sense of our probabilistic model.
      </p>
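      <p>Such a scatterplot can be produced directly from the drawn samples; a minimal sketch (ours, assuming matplotlib and one flattened sample matrix per sample size):</p>
      <preformat>
import numpy as np
import matplotlib.pyplot as plt

def plot_mean_vs_std(samples_by_size):
    # samples_by_size: {label: array of shape (n_samples, n_parameters)}
    for label, chain in samples_by_size.items():
        plt.scatter(chain.mean(axis=0), chain.std(axis=0), s=2, label=str(label))
    plt.xlabel("mean of factor matrix entry")
    plt.ylabel("standard deviation")
    plt.legend(title="sample size")
    plt.show()
      </preformat>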
      <p>In the upper rows of figures 1a and 1b we show samples of different sizes
for which the element values of the initial parameters $X_0, Y_0$ were chosen uniformly at
random from $[0, 1]$. For each sample size (color) we observe a dense cluster
of mean values around zero (sparsity) and also a lower-density cluster of mean
values with high standard deviations. With increasing sample size this cluster
moves towards mean one and towards lower standard deviations. As there is an
abundance of zeros within the data and also a sparsity prior, it becomes clear
that the correct assignment of the crucial nonzero values is only validated after
many observations of the parameter space, i.e. after a long sampling period.</p>
      <p>A totally different situation presents itself in the lower rows of figures 1a
and 1b. In these plots the sampling process has been started at a known local
optimum for the factor matrices. Also, in figure 1a the factorization rank matches
the rank used when generating the data (30), while it is mismatched in figure 1b.
In figure 1a we observe a very tight clustering of mean values around zero and
one. In addition, for each cluster we observe an almost linear relationship between
the mean values and the corresponding standard deviations. With increasing
sample size the standard deviations diminish, expressing very low probabilistic
uncertainty, while the spread in the mean values is still visible. For a matched
factorization rank, both types of uncertainty express related facts about the
model fit. The situation is different with a mismatched factorization rank, as
shown in figure 1b. While we observe clustering at means zero and one, we also
observe values in between with large standard deviations. With increasing sample
size these large standard deviations prevail, meaning that some conflicts between
data and model are not resolvable in terms of fuzzy assignments. In this way
probabilistic uncertainty adds valuable information which cannot be obtained
by only looking at fuzzy assignments.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have "plugged in" the BMF/NMF problem into a probabilistic model and
analyzed different samples. With a mismatched factorization rank, the information
about the rank mismatch is carried in the probabilistic uncertainty but not in the
fuzzy assignment. Developing this insight is a target of further research. Realizing
sampling for BMF/NMF posed nontrivial challenges even with state-of-the-art
methods, requiring further research into sampling methods for high-dimensional
spaces.</p>
      <p>Acknowledgement This research has been funded by the Federal Ministry of
Education and Research of Germany (BMBF) as part of the competence center
for machine learning ML2R (01jS18038A).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Girolami</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Calderhead</surname>, <given-names>B.</given-names></string-name>:
          <article-title>Riemann manifold Langevin and Hamiltonian Monte Carlo methods</article-title>.
          <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source>
          <volume>73</volume>(<issue>2</issue>), <fpage>123</fpage>-<lpage>214</lpage> (Mar <year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Hess</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Morik</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Piatkowski</surname>, <given-names>N.</given-names></string-name>:
          <article-title>The PRIMPING routine – Tiling through proximal alternating linearized minimization</article-title>.
          <source>Data Mining and Knowledge Discovery</source>
          <volume>31</volume>(<issue>4</issue>), <fpage>1090</fpage>-<lpage>1131</lpage> (Jul <year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Mandt</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Hoffman</surname>, <given-names>M.D.</given-names></string-name>,
          <string-name><surname>Blei</surname>, <given-names>D.M.</given-names></string-name>:
          <article-title>Stochastic Gradient Descent As Approximate Bayesian Inference</article-title>.
          <source>J. Mach. Learn. Res.</source>
          <volume>18</volume>(<issue>1</issue>), <fpage>4873</fpage>-<lpage>4907</lpage> (Jan <year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Miettinen</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Mielikäinen</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Gionis</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Das</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Mannila</surname>, <given-names>H.</given-names></string-name>:
          <article-title>The Discrete Basis Problem</article-title>.
          In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD. pp.
          <fpage>335</fpage>-<lpage>346</lpage>. Springer (<year>2006</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Paatero</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Tapper</surname>, <given-names>U.</given-names></string-name>:
          <article-title>Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values</article-title>.
          <source>Environmetrics</source>
          <volume>5</volume>(<issue>2</issue>), <fpage>111</fpage>-<lpage>126</lpage> (Jun <year>1994</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Roberts</surname>, <given-names>G.O.</given-names></string-name>,
          <string-name><surname>Tweedie</surname>, <given-names>R.L.</given-names></string-name>:
          <article-title>Exponential Convergence of Langevin Distributions and Their Discrete Approximations</article-title>.
          <source>Bernoulli</source>
          <volume>2</volume>(<issue>4</issue>), <fpage>341</fpage> (Dec <year>1996</year>)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Welling</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Teh</surname>, <given-names>Y.W.</given-names></string-name>:
          <article-title>Bayesian Learning via Stochastic Gradient Langevin Dynamics</article-title>.
          In: Getoor, L., Scheffer, T. (eds.)
          <source>Proceedings of the 28th International Conference on Machine Learning (ICML)</source>. pp.
          <fpage>681</fpage>-<lpage>688</lpage>. ACM, New York, NY, USA (<year>2011</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>