<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classification Factored Gated Restricted Boltzmann Machine</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ivan Sorokin</string-name>
          <email>i.sorokin@cit.ifmo.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University, Department of Secure Information Technology</institution>
          ,
          <addr-line>9 Lomonosova str., St. Petersburg, 191002</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>The factored gated restricted Boltzmann machine is a generative model capable of extracting the transformation from an image pair. We extend this model by adding a discriminative component, which allows the model to be used directly as a classifier, instead of using the hidden unit responses as features for another learning algorithm. To evaluate the capabilities of this model, we created synthetically transformed image pairs and demonstrate that the model is able to determine the velocity of an object presented in two consecutive images.</p>
      </abstract>
      <kwd-group>
        <kwd>Multiplicative interaction</kwd>
        <kwd>temporal coherence</kwd>
        <kwd>translational motion</kwd>
        <kwd>gated Boltzmann machine</kwd>
        <kwd>supervised learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The gated Boltzmann machine is one of the models that use multiplicative
interactions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to learn representations, which can be useful for extracting the
transformation between pairs of temporally coherent video frames [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
A factorized version of this model is presented in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where the authors train the model on
shifts of random dot images and demonstrate that the model is able to identify
the different directions correctly. We continue this research by studying the
possibility of predicting not only the direction but also the magnitude of a shift. Of all types of
motion, we chose only translational motion, because it offers a great opportunity
to use this model in many vision tasks, such as object tracking or visual
odometry [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Therefore, the main objective of this work is to create a model that is
trained to identify the velocity vector in image coordinates.
      </p>
      <p>
        Instead of using an additional model on top of the mapping units, we add the
discriminative component directly to the model. This technique was first applied
to the restricted Boltzmann machine [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and has since become widely used for
similar models [
        <xref ref-type="bibr" rid="ref10 ref11">11, 10</xref>
        ]. In this paper, we focus on a model that extracts the
transformation from two consecutive images. Without considering the additional
discriminative component, there are several approaches to training three-way structured
models [
        <xref ref-type="bibr" rid="ref13 ref9">9, 13</xref>
        ]. We propose a simple learning algorithm and show that
it is not inferior to the existing ones. Moreover, our learning algorithm takes into
account the additional label variables, and we demonstrate how this affects the training of
discriminative features. We refer to our model variant as the classification factored
gated restricted Boltzmann machine (cfgRBM).
      </p>
      <p>
        Copyright © 2015 for this paper by its authors. Copying permitted for private and academic
purposes.
      </p>
      <p>
        We propose a model (Fig. 1) in which the hidden units h not only capture
the relationship between two images x and y, but also interact with an associated
label z. The model is defined in terms of its energy function, and the function
consists of two basic parts. The first of these is the factored three-way Boltzmann
machine [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and the second is the classification restricted Boltzmann machine [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Combining these two models, we define the energy function as
follows:
      </p>
      <p>E(x, y, z, h) = − Σ_f (Σ_i W^x_{if} x_i)(Σ_j W^y_{jf} y_j)(Σ_k W^h_{kf} h_k) − Σ_{k,l} h_k V_{kl} z_l − Σ_i a_i x_i − Σ_j b_j y_j − Σ_k c_k h_k − Σ_l d_l z_l ,   (1)</p>
      <sec id="sec-1-3">
        <p>
          where the matrices W^x, W^y and W^h have sizes I × F, J × F and K × F respectively; I and
J are the sizes of the two visible layers, F is the number of factors, and K is the number of hidden
units. The discriminative component is the weight matrix V of size K × L together with the
one-hot encoded label vector z with L classes. The bias terms a, b, c and d are associated
with the two visible vectors, the hidden vector and the label vector respectively. We will assume that the
visible vectors are binary, but the model can also be defined with real-valued units
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Every pair of columns W^x_f and W^y_f can be considered as a filter pair (Fig. 1).
        </p>
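<p>To make the notation concrete, the energy in Eq. (1) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation; all sizes and random initializations are invented for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (far smaller than the paper's 200 factors / 100 hidden units).
I, J, F, K, L = 6, 6, 4, 3, 7

# Randomly initialized parameters; names mirror Eq. (1).
Wx = rng.normal(0, 0.1, (I, F))
Wy = rng.normal(0, 0.1, (J, F))
Wh = rng.normal(0, 0.1, (K, F))
V = rng.normal(0, 0.1, (K, L))
a, b, c, d = np.zeros(I), np.zeros(J), np.zeros(K), np.zeros(L)

def energy(x, y, z, h):
    """Energy of Eq. (1): factored three-way term, label term, and bias terms."""
    three_way = np.sum((Wx.T @ x) * (Wy.T @ y) * (Wh.T @ h))  # sum over factors f
    return -(three_way + h @ V @ z + a @ x + b @ y + c @ h + d @ z)

x = rng.integers(0, 2, I).astype(float)   # binary visible vector (first image)
y = rng.integers(0, 2, J).astype(float)   # binary visible vector (second image)
z = np.eye(L)[2]                          # one-hot label
h = rng.integers(0, 2, K).astype(float)   # binary hidden vector
print(energy(x, y, z, h))
```

The factored form is visible in the code: each of the three projections onto the F factors is computed once, and the factors interact only through an elementwise product.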
        <p>To train the model, we also need to define the joint probability distribution
over the three vectors:</p>
        <p>p(x, y, z) = Σ_h exp(−E(x, y, z, h)) / Σ_{x,y,z,h} exp(−E(x, y, z, h)) ,   (2)
where the numerator sums over all possible hidden vectors and the
denominator is the partition function, which cannot be computed efficiently.</p>
        <p>
          Inference
        </p>
        <p>
          The inference task of the proposed model is defined as the problem of classifying
the motion between two related images. In order to choose the most probable
label under this model, we must compute the conditional distribution p(z|x, y). We
have adapted the calculations from the case of single input units [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to the case
of three-way interaction. As a result, for a reasonable number of labels L, this
conditional distribution can also be computed exactly and efficiently, by writing
it as follows:
        </p>
        <p>p(z_l = 1 | x, y) = exp(d_l) Π_k (1 + exp(o_kl(x, y))) / Σ_{l′} exp(d_{l′}) Π_k (1 + exp(o_kl′(x, y))) ,   (3)
where
o_kl(x, y) = c_k + V_kl + Σ_f W^h_{kf} ((W^x_f)ᵀ x)((W^y_f)ᵀ y)   (4)
is the input to hidden unit k received from the images x, y and the estimated label l.</p>
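<p>The exact posterior over labels described above can be sketched as follows. Again this is an illustrative sketch under invented sizes and initializations, not the authors' code; `np.less` is used only to keep the XML free of angle brackets.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
I = J = 6
F, K, L = 4, 3, 7
Wx = rng.normal(0, 0.1, (I, F))
Wy = rng.normal(0, 0.1, (J, F))
Wh = rng.normal(0, 0.1, (K, F))
V = rng.normal(0, 0.1, (K, L))
c, d = np.zeros(K), np.zeros(L)

def label_posterior(x, y):
    """Exact p(z_l = 1 | x, y) from Eqs. (3)-(4): hidden units are summed out
    analytically, so no sampling is needed at inference time."""
    factors = (Wx.T @ x) * (Wy.T @ y)              # ((W^x_f)'x)((W^y_f)'y) per factor f
    o = c[:, None] + V + (Wh @ factors)[:, None]   # o_kl, shape (K, L)
    # log of exp(d_l) * prod_k (1 + exp(o_kl)), computed stably
    log_num = d + np.sum(np.log1p(np.exp(o)), axis=0)
    log_num -= log_num.max()
    p = np.exp(log_num)
    return p / p.sum()

x = rng.integers(0, 2, I).astype(float)
y = rng.integers(0, 2, J).astype(float)
p = label_posterior(x, y)
print(p.argmax(), p.sum())
```

Because the product over k has only K factors and there are only L labels, the whole posterior costs O(KL) on top of the factor projections, which is what makes exact classification practical here.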
        <p>
          Learning
        </p>
        <p>
          In order to train a cfgRBM to solve a classification problem, we need to learn the
model parameters θ = (W^x, W^y, W^h, V, a, b, c, d). Given a training set Dtrain =
{(x^α, y^α, z^α)} and the predefined joint distribution (2) between the three variables,
the model can be trained by minimizing the negative log-likelihood:
Lgen(Dtrain) = − (1/|Dtrain|) Σ_α log p(x^α, y^α, z^α) .   (5)
In order to minimize this function, the gradient with respect to any cfgRBM parameter θ
can be written as follows:
        </p>
        <p>
          ∂(−log p(x^α, y^α, z^α))/∂θ = E_{h|x^α,y^α,z^α}[∂E(x^α, y^α, z^α, h)/∂θ] − E_{x,y,z,h}[∂E(x, y, z, h)/∂θ] ,   (6)
where the subscript of each expectation denotes the distribution over the variables. There
exists a learning rule [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], called "Contrastive Divergence", which can be used
to approximate this gradient. Taking this rule into consideration, we propose
Algorithm 1 for training the cfgRBM model. The main difference from
other approaches to training three-way interactions is that the
vectors x, y are sampled symmetrically in the negative phase. Detailed information about the partial
derivatives with respect to the model parameters can be obtained from [
          <xref ref-type="bibr" rid="ref5 ref9">9, 5</xref>
          ].
        </p>
        <p>
          In the case of factored three-way interactions, the calculation of the gradient (6)
involves numerical instabilities, especially when using large input vectors. To
avoid this, we also use a norm constraint on the columns of the matrices W^x and W^y. This
is a common approach to stabilizing learning. For example, the same
recommendations are given in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] for the method "Adaptive Subspace Self-Organizing Map",
which learns invariant properties of moving input patterns.
        </p>
        <p>
Algorithm 1 Symmetric training update of the cfgRBM model
Require: training triplet (x^α, y^α, z^α) and learning rate λ
# Notation
# a ← b means a is set to the value b
# a ∼ p means a is sampled from p
# Positive phase
x0 ← x^α, y0 ← y^α, z0 ← z^α
h0_k ← sigm(o_kl0(x0, y0))
# Sample
ĥ ∼ p(h | x0, y0, z0)
# Negative phase
x1 ∼ p(x | y0, ĥ), y1 ∼ p(y | x0, ĥ), z1 ∼ p(z | ĥ)
h1_k ← sigm(o_kl1(x1, y1))
# Update
for θ ∈ {W^x, W^y, W^h, V, a, b, c, d} do
  θ ← θ − λ (∂E(x0, y0, z0, h0)/∂θ − ∂E(x1, y1, z1, h1)/∂θ)
end for
        </p>
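<p>One symmetric CD-1 step in the spirit of the algorithm above can be sketched as follows. This is only a sketch under invented sizes: to keep it short, only the filter matrices W^x, W^y, W^h are updated, the negative-phase label is taken by argmax instead of being sampled, and `np.less` replaces the comparison operator purely to keep the XML clean.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
sigm = lambda t: 1.0 / (1.0 + np.exp(-t))

I = J = 6
F, K, L = 4, 3, 7
Wx = rng.normal(0, 0.1, (I, F))
Wy = rng.normal(0, 0.1, (J, F))
Wh = rng.normal(0, 0.1, (K, F))
V = rng.normal(0, 0.1, (K, L))
a, b, c, d = np.zeros(I), np.zeros(J), np.zeros(K), np.zeros(L)

def cd1_update(x0, y0, z0, lr=0.01):
    """One symmetric CD-1 step: positive phase with the true label, a hidden
    sample, then x and y are each re-sampled given the other image."""
    global Wx, Wy, Wh
    # Positive phase: mean hidden activations given (x0, y0, z0)
    fx, fy = Wx.T @ x0, Wy.T @ y0
    h0 = sigm(c + V @ z0 + Wh @ (fx * fy))
    hs = np.less(rng.random(K), h0).astype(float)   # binary hidden sample
    fh = Wh.T @ hs
    # Negative phase: symmetric sampling of x and y given the other image and hs
    x1 = np.less(rng.random(I), sigm(a + Wx @ (fy * fh))).astype(float)
    y1 = np.less(rng.random(J), sigm(b + Wy @ (fx * fh))).astype(float)
    z1 = np.eye(L)[np.argmax(d + V.T @ hs)]         # argmax of p(z|hs), for brevity
    fx1, fy1 = Wx.T @ x1, Wy.T @ y1
    h1 = sigm(c + V @ z1 + Wh @ (fx1 * fy1))
    # Updates: positive-phase statistics minus negative-phase statistics
    Wx += lr * (np.outer(x0, fy * (Wh.T @ h0)) - np.outer(x1, fy1 * (Wh.T @ h1)))
    Wy += lr * (np.outer(y0, fx * (Wh.T @ h0)) - np.outer(y1, fx1 * (Wh.T @ h1)))
    Wh += lr * (np.outer(h0, fx * fy) - np.outer(h1, fx1 * fy1))

x0 = rng.integers(0, 2, I).astype(float)
y0 = rng.integers(0, 2, J).astype(float)
z0 = np.eye(L)[3]
cd1_update(x0, y0, z0)
print(Wx.shape, np.isfinite(Wx).all())
```

The symmetry of the negative phase is visible in the two sampling lines: x1 is drawn given (y0, hs) and y1 given (x0, hs), so neither image is treated as the conditioning input.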
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>
The main goal of this research is to build a model that is capable of extracting
translational motion from two related images. Therefore, we created synthetic
data consisting of image pairs in which the second image is horizontally
shifted relative to the first. We take the MNIST dataset1 and randomly choose
a shift value in the range [-3, 3] for each image. As a result, we get 7 possible
labels for 60,000 training and 10,000 test image pairs of relatively shifted
handwritten digits. All the models in the following experiments have 200 factors and
100 hidden units. For detailed information about the learning parameters, we refer
to our implementation2 of the models.</p>
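<p>The construction of such a labeled pair can be sketched as follows. This is illustrative only, not the authors' exact preprocessing: a circular shift via `np.roll` stands in for whatever padding scheme was actually used.</p>

```python
import numpy as np

rng = np.random.default_rng(3)

def make_shifted_pair(img, max_shift=3):
    """Return (first image, horizontally shifted copy, class label).
    The shift is drawn uniformly from [-max_shift, max_shift], giving
    2*max_shift + 1 = 7 possible labels, indexed 0..6."""
    s = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(img, s, axis=1)   # circular horizontal shift
    return img, shifted, s + max_shift  # label encodes the shift value

img = rng.random((28, 28))              # stand-in for one MNIST digit
x, y, label = make_shifted_pair(img)
print(label in range(7))
```

Training then treats the shift label as the one-hot vector z of the cfgRBM, so the classifier directly predicts the horizontal velocity between the two frames.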
      <p>
        In the first experiment (Fig. 2), we compare different learning strategies for
the cfgRBM model. The first learning method is taken from [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where the authors
describe a conditional model. The second method is proposed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], where the
authors define the joint distribution for an image pair. The results show that
Algorithm 1 has the lowest classification and reconstruction
test error at the end of learning. It is also interesting to note that there are different delays before the filters
become specialized in their frequency and phase-shift characteristics.
      </p>
      <p>In the second experiment (Fig. 3), we compare the hidden unit activities of
models with and without a discriminative component. In the first case, we trained
a model completely unsupervised, without any label information. In the
second case, the cfgRBM model was trained using Algorithm 1. The results show that the
discriminative component has a strong effect on the hidden features. In addition,
we also demonstrate the effect on the hidden units in the case of wrong label
information.
1 http://yann.lecun.com/exdb/mnist/
2 https://cit.ifmo.ru/~sorokin/cfgRBM/</p>
      <p>
        Fig. 3. Hidden unit activations. For every test sample, the activations of 100 hidden units are
projected to 2D coordinates using t-SNE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. a) model trained without the discriminative
component. b) model extended with additional label units. c) exactly the same model
as in case (b), but the labels of classes {-3,-2} and {2,3} are deliberately combined.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper, we incorporate supervised learning into the factored gated restricted
Boltzmann machine model. Our results show that the proposed model is capable of
identifying the velocity of an object presented in two consecutive images. In
future work, we plan to apply this model to videos, which may be represented
as a temporally ordered sequence of images. In particular, the ability to extract
translational motion will be useful for tracking tasks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Igel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Training restricted Boltzmann machines: an introduction</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>47</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>25</fpage>
          -
          <lpage>39</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.:
          <article-title>Training products of experts by minimizing contrastive divergence</article-title>
          .
          <source>Neural computation 14</source>
          , pp.
          <fpage>1771</fpage>
          -
          <lpage>1800</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kohonen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The adaptive-subspace som (assom) and its use for the implementation of invariant feature detection</article-title>
          .
          <source>In: Proc. ICANN95</source>
          ,
            <source>Int. Conf. on Artificial Neural Networks</source>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Konda</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Memisevic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Learning visual odometry with a convolutional network</article-title>
          .
          <source>International Conference on Computer Vision Theory and Applications</source>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning algorithms for the classification restricted Boltzmann machine</article-title>
          .
          <source>Journal of Machine Learning Research 13</source>
          , pp.
          <fpage>643</fpage>
          -
          <lpage>669</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Classification using discriminative restricted Boltzmann machines</article-title>
          .
          <source>In: Proceedings of the 25th international conference on Machine learning</source>
          , pp.
          <fpage>536</fpage>
          -
          <lpage>543</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Van der Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.:
          <article-title>Visualizing data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research 9</source>
          , pp.
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Memisevic</surname>
          </string-name>
          , R.:
          <article-title>Learning to relate images</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>35</volume>
          , pp.
          <fpage>1829</fpage>
          -
          <lpage>1846</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Memisevic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          :
          <article-title>Learning to represent spatial transformations with factored higher-order Boltzmann machines</article-title>
          .
          <source>Neural Computation</source>
          <volume>22</volume>
          , pp.
          <fpage>1473</fpage>
          -
          <lpage>1492</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sohn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Learning to disentangle factors of variation with manifold interaction</article-title>
          .
          <source>In: Proceedings of the 31st International Conference on Machine Learning</source>
          , pp.
          <fpage>1431</fpage>
          -
          <lpage>1439</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sohn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Learning and selecting features jointly with point-wise gated Boltzmann machines</article-title>
          .
          <source>In: Proceedings of The 30th International Conference on Machine Learning</source>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>225</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Srivastava</surname>
          </string-name>
          , N.:
          <article-title>Unsupervised Learning of Visual Representations using Videos</article-title>
          . Department of Computer Science, University of Toronto.
          <source>Technical Report</source>
          . (
          <year>2015</year>
          ) Retrieved from http://www.cs.toronto.edu/~nitish/depth_oral.pdf
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Susskind</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Memisevic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollefeys</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Modeling the joint density of two images under a variety of transformations</article-title>
          .
          <source>In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>2793</fpage>
          -
          <lpage>2800</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>