<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classification Factored Gated Restricted Boltzmann Machine</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ivan Sorokin</string-name>
          <email>i.sorokin@cit.ifmo.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University, Department of Secure Information Technology</institution>
          ,
          <addr-line>9 Lomonosova str., St. Petersburg, 191002</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>The factored gated restricted Boltzmann machine is a generative model capable of extracting the transformation from an image pair. We extend this model by adding a discriminative component, which allows the model to be used directly as a classifier, instead of using the hidden unit responses as features for another learning algorithm. To evaluate the capabilities of this model, we created synthetically transformed image pairs and demonstrate that the model is able to determine the velocity of an object presented in two consecutive images.</p>
      </abstract>
      <kwd-group>
        <kwd>Multiplicative interaction</kwd>
        <kwd>temporal coherence</kwd>
        <kwd>translational motion</kwd>
        <kwd>gated Boltzmann machine</kwd>
        <kwd>supervised learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The gated Boltzmann machine is one of the models that use multiplicative
interactions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to learn representations, which can be useful for extracting the
transformation between pairs of temporally coherent video frames [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
A factorized version of this model is presented in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where the authors train the model on
shifts of random dot images and demonstrate that the model is able to identify
the different directions correctly. We continue this research by studying the
possibility of predicting not only the direction but also the magnitude of a shift. Of all types of
motion, we chose only translational motion, because it offers a great opportunity
to use this model in many vision tasks, such as object tracking or visual
odometry [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Therefore, the main objective of this work is to create a model that is
trained to identify the velocity vector in image coordinates.
      </p>
      <p>
        Instead of using an additional model on top of the mapping units, we add the
discriminative component directly to the model. This technique was first applied
to the restricted Boltzmann machine [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and has since become widely used for
similar models [
        <xref ref-type="bibr" rid="ref10 ref11">11, 10</xref>
        ]. In this paper, we focus on a model that extracts the
transformation from two consecutive images. Without considering the additional
discriminative component, there are several approaches to training three-way structured
models [
        <xref ref-type="bibr" rid="ref13 ref9">9, 13</xref>
        ]. We propose a simple learning algorithm and show that
it is not inferior to the existing ones. Moreover, our learning algorithm takes into
account the additional label variables, and we demonstrate how this affects the training of
discriminative features. We refer to our model variant as the classification factored
gated restricted Boltzmann machine (cfgRBM).
      </p>
      <p>
        Copyright © 2015 for this paper by its authors. Copying permitted for private and academic
purposes.
      </p>
      <p>
        We propose a model (Fig. 1) in which the hidden units h not only capture
the relationship between two images x and y, but also interact with an associated
label z. The model is defined in terms of its energy function, and the function
consists of two basic parts. The first of these is the factored three-way Boltzmann
machine [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and the second is the classification restricted Boltzmann machine [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Combining these two models, we define the energy function as
follows:
      </p>
      <p>E(x, y, z, h) = − Σ_f (Σ_i W^x_{if} x_i)(Σ_j W^y_{jf} y_j)(Σ_k W^h_{kf} h_k) − Σ_{k,l} h_k V_{kl} z_l − Σ_i a_i x_i − Σ_j b_j y_j − Σ_k c_k h_k − Σ_l d_l z_l ,   (1)</p>
      <sec id="sec-1-3">
        <p>
          where the matrices W^x, W^y and W^h have sizes I × F, J × F and K × F respectively; I and
J are the sizes of the two visible layers, F is the number of factors, and K is the number of hidden
units. The discriminative component is the weight matrix V of size K × L together with the
one-hot encoded label vector z with L classes. The bias terms a, b, c and d are associated
with the two visible vectors, the hidden vector and the label vector respectively. We will assume that the
visible vectors are binary, but the model can also be defined with real-valued units
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Every pair of columns W^x_f and W^y_f can be considered as a filter pair (Fig. 1).
        </p>
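<p>To make the notation concrete, the energy in Eq. (1) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation; all sizes and random initializations are invented for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (far smaller than the paper's 200 factors / 100 hidden units).
I, J, F, K, L = 6, 6, 4, 3, 7

# Randomly initialized parameters; names mirror Eq. (1).
Wx = rng.normal(0, 0.1, (I, F))
Wy = rng.normal(0, 0.1, (J, F))
Wh = rng.normal(0, 0.1, (K, F))
V = rng.normal(0, 0.1, (K, L))
a, b, c, d = np.zeros(I), np.zeros(J), np.zeros(K), np.zeros(L)

def energy(x, y, z, h):
    """Energy of Eq. (1): factored three-way term, label term, and bias terms."""
    three_way = np.sum((Wx.T @ x) * (Wy.T @ y) * (Wh.T @ h))  # sum over factors f
    return -(three_way + h @ V @ z + a @ x + b @ y + c @ h + d @ z)

x = rng.integers(0, 2, I).astype(float)   # binary visible vector (first image)
y = rng.integers(0, 2, J).astype(float)   # binary visible vector (second image)
z = np.eye(L)[2]                          # one-hot label
h = rng.integers(0, 2, K).astype(float)   # binary hidden vector
print(energy(x, y, z, h))
```

The factored form is visible in the code: each of the three projections onto the F factors is computed once, and the factors interact only through an elementwise product.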
        <p>To train the model, we also need to define the joint probability distribution
over the three vectors:</p>
        <p>p(x, y, z) = Σ_h exp(−E(x, y, z, h)) / Σ_{x,y,z,h} exp(−E(x, y, z, h)) ,   (2)
where the numerator sums over all possible hidden vectors and the
denominator is the partition function, which cannot be computed efficiently.</p>
        <p>
          Inference
        </p>
        <p>
          The inference task of the proposed model is defined as the problem of classifying
the motion between two related images. In order to choose the most probable
label under this model, we must compute the conditional distribution p(z|x, y). We
have adapted the calculations from the case of single input units [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to the case
of three-way interaction. As a result, for a reasonable number of labels L, this
conditional distribution can also be computed exactly and efficiently, by writing
it as follows:
        </p>
        <p>p(z_l = 1 | x, y) = exp(d_l) Π_k (1 + exp(o_kl(x, y))) / Σ_{l′} exp(d_{l′}) Π_k (1 + exp(o_kl′(x, y))) ,   (3)
where
o_kl(x, y) = c_k + V_kl + Σ_f W^h_{kf} ((W^x_f)ᵀ x)((W^y_f)ᵀ y)   (4)
is the input to hidden unit k received from the images x, y and the estimated label l.</p>
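<p>The exact posterior over labels described above can be sketched as follows. Again this is an illustrative sketch under invented sizes and initializations, not the authors' code; `np.less` is used only to keep the XML free of angle brackets.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
I = J = 6
F, K, L = 4, 3, 7
Wx = rng.normal(0, 0.1, (I, F))
Wy = rng.normal(0, 0.1, (J, F))
Wh = rng.normal(0, 0.1, (K, F))
V = rng.normal(0, 0.1, (K, L))
c, d = np.zeros(K), np.zeros(L)

def label_posterior(x, y):
    """Exact p(z_l = 1 | x, y) from Eqs. (3)-(4): hidden units are summed out
    analytically, so no sampling is needed at inference time."""
    factors = (Wx.T @ x) * (Wy.T @ y)              # ((W^x_f)'x)((W^y_f)'y) per factor f
    o = c[:, None] + V + (Wh @ factors)[:, None]   # o_kl, shape (K, L)
    # log of exp(d_l) * prod_k (1 + exp(o_kl)), computed stably
    log_num = d + np.sum(np.log1p(np.exp(o)), axis=0)
    log_num -= log_num.max()
    p = np.exp(log_num)
    return p / p.sum()

x = rng.integers(0, 2, I).astype(float)
y = rng.integers(0, 2, J).astype(float)
p = label_posterior(x, y)
print(p.argmax(), p.sum())
```

Because the product over k has only K factors and there are only L labels, the whole posterior costs O(KL) on top of the factor projections, which is what makes exact classification practical here.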
        <p>
          Learning
        </p>
        <p>
          In order to train a cfgRBM to solve a classification problem, we need to learn the
model parameters θ = (W^x, W^y, W^h, V, a, b, c, d). Given a training set Dtrain =
{(x^α, y^α, z^α)} and the predefined joint distribution (2) between the three variables,
the model can be trained by minimizing the negative log-likelihood:
Lgen(Dtrain) = − (1/|Dtrain|) Σ_α log p(x^α, y^α, z^α) .   (5)
In order to minimize this function, the gradient with respect to any cfgRBM parameter θ
can be written as follows:
        </p>
        <p>
          ∂(−log p(x^α, y^α, z^α))/∂θ = E_{h|x^α,y^α,z^α}[∂E(x^α, y^α, z^α, h)/∂θ] − E_{x,y,z,h}[∂E(x, y, z, h)/∂θ] ,   (6)
where the subscript of each expectation denotes the distribution over the variables. There
exists a learning rule [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], called "Contrastive Divergence", which can be used
to approximate this gradient. Taking this rule into consideration, we propose
Algorithm 1 for training the cfgRBM model. The main difference from
other approaches to training three-way interactions is that the
vectors x, y are sampled symmetrically in the negative phase. Detailed information about the partial
derivatives with respect to the model parameters can be obtained from [
          <xref ref-type="bibr" rid="ref5 ref9">9, 5</xref>
          ].
        </p>
        <p>
          In the case of factored three-way interactions, the calculation of the gradient (6)
involves numerical instabilities, especially when using large input vectors. To
avoid this, we also use a norm constraint on the columns of the matrices W^x and W^y. This
is a common approach to stabilizing learning. For example, the same
recommendations are given in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] for the method "Adaptive Subspace Self-Organizing Map",
which learns invariant properties of moving input patterns.
        </p>
        <p>
Algorithm 1 Symmetric training update of the cfgRBM model
Require: training triplet (x^α, y^α, z^α) and learning rate λ
# Notation
# a ← b means a is set to the value b
# a ∼ p means a is sampled from p
# Positive phase
x0 ← x^α, y0 ← y^α, z0 ← z^α
h0_k ← sigm(o_kl0(x0, y0))
# Sample
ĥ ∼ p(h | x0, y0, z0)
# Negative phase
x1 ∼ p(x | y0, ĥ), y1 ∼ p(y | x0, ĥ), z1 ∼ p(z | ĥ)
h1_k ← sigm(o_kl1(x1, y1))
# Update
for θ ∈ {W^x, W^y, W^h, V, a, b, c, d} do
  θ ← θ − λ (∂E(x0, y0, z0, h0)/∂θ − ∂E(x1, y1, z1, h1)/∂θ)
end for
        </p>
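<p>One symmetric CD-1 step in the spirit of the algorithm above can be sketched as follows. This is only a sketch under invented sizes: to keep it short, only the filter matrices W^x, W^y, W^h are updated, the negative-phase label is taken by argmax instead of being sampled, and `np.less` replaces the comparison operator purely to keep the XML clean.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
sigm = lambda t: 1.0 / (1.0 + np.exp(-t))

I = J = 6
F, K, L = 4, 3, 7
Wx = rng.normal(0, 0.1, (I, F))
Wy = rng.normal(0, 0.1, (J, F))
Wh = rng.normal(0, 0.1, (K, F))
V = rng.normal(0, 0.1, (K, L))
a, b, c, d = np.zeros(I), np.zeros(J), np.zeros(K), np.zeros(L)

def cd1_update(x0, y0, z0, lr=0.01):
    """One symmetric CD-1 step: positive phase with the true label, a hidden
    sample, then x and y are each re-sampled given the other image."""
    global Wx, Wy, Wh
    # Positive phase: mean hidden activations given (x0, y0, z0)
    fx, fy = Wx.T @ x0, Wy.T @ y0
    h0 = sigm(c + V @ z0 + Wh @ (fx * fy))
    hs = np.less(rng.random(K), h0).astype(float)   # binary hidden sample
    fh = Wh.T @ hs
    # Negative phase: symmetric sampling of x and y given the other image and hs
    x1 = np.less(rng.random(I), sigm(a + Wx @ (fy * fh))).astype(float)
    y1 = np.less(rng.random(J), sigm(b + Wy @ (fx * fh))).astype(float)
    z1 = np.eye(L)[np.argmax(d + V.T @ hs)]         # argmax of p(z|hs), for brevity
    fx1, fy1 = Wx.T @ x1, Wy.T @ y1
    h1 = sigm(c + V @ z1 + Wh @ (fx1 * fy1))
    # Updates: positive-phase statistics minus negative-phase statistics
    Wx += lr * (np.outer(x0, fy * (Wh.T @ h0)) - np.outer(x1, fy1 * (Wh.T @ h1)))
    Wy += lr * (np.outer(y0, fx * (Wh.T @ h0)) - np.outer(y1, fx1 * (Wh.T @ h1)))
    Wh += lr * (np.outer(h0, fx * fy) - np.outer(h1, fx1 * fy1))

x0 = rng.integers(0, 2, I).astype(float)
y0 = rng.integers(0, 2, J).astype(float)
z0 = np.eye(L)[3]
cd1_update(x0, y0, z0)
print(Wx.shape, np.isfinite(Wx).all())
```

The symmetry of the negative phase is visible in the two sampling lines: x1 is drawn given (y0, hs) and y1 given (x0, hs), so neither image is treated as the conditioning input.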
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>
The main goal of this research is to build a model that is capable of extracting
translational motion from two related images. Therefore, we created synthetic
data consisting of image pairs in which the second image is horizontally
shifted relative to the first. We take the MNIST dataset1 and randomly choose
a shift value in the range [-3, 3] for each image. As a result, we get 7 possible
labels for 60,000 training and 10,000 test image pairs of relatively shifted
handwritten digits. All the models in the following experiments have 200 factors and
100 hidden units. For detailed information about the learning parameters, we refer
to our implementation2 of the models.</p>
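<p>The construction of such a labeled pair can be sketched as follows. This is illustrative only, not the authors' exact preprocessing: a circular shift via `np.roll` stands in for whatever padding scheme was actually used.</p>

```python
import numpy as np

rng = np.random.default_rng(3)

def make_shifted_pair(img, max_shift=3):
    """Return (first image, horizontally shifted copy, class label).
    The shift is drawn uniformly from [-max_shift, max_shift], giving
    2*max_shift + 1 = 7 possible labels, indexed 0..6."""
    s = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(img, s, axis=1)   # circular horizontal shift
    return img, shifted, s + max_shift  # label encodes the shift value

img = rng.random((28, 28))              # stand-in for one MNIST digit
x, y, label = make_shifted_pair(img)
print(label in range(7))
```

Training then treats the shift label as the one-hot vector z of the cfgRBM, so the classifier directly predicts the horizontal velocity between the two frames.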
      <p>
        In the first experiment (Fig. 2), we compare different learning strategies for
the cfgRBM model. The first learning method is taken from [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where the authors
describe a conditional model. The second method is proposed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], where the
authors define the joint distribution for an image pair. The results show that
Algorithm 1 has the lowest classification and reconstruction
test error at the end of learning. It is also interesting to note that there are different delays before the filters
become specialized in their frequency and phase-shift characteristics.
      </p>
      <p>In the second experiment (Fig. 3), we compare the hidden unit activities of
models with and without a discriminative component. In the first case, we trained
a model completely unsupervised, without any label information. In the
second case, the cfgRBM model was trained using Algorithm 1. The results show that the
discriminative component has a strong effect on the hidden features. In addition,
we also demonstrate the effect on the hidden units in the case of wrong label
information.
1 http://yann.lecun.com/exdb/mnist/
2 https://cit.ifmo.ru/~sorokin/cfgRBM/</p>
      <p>
        Fig. 3. Hidden unit activations. For every test sample, the activations of 100 hidden units are
projected to 2D coordinates using t-SNE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. a) model trained without the discriminative
component. b) model extended with additional label units. c) exactly the same model
as in case (b), but the labels of classes {-3,-2} and {2,3} are deliberately combined.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper, we incorporate supervised learning into the factored gated restricted
Boltzmann machine model. Our results show that the proposed model is capable of
identifying the velocity of an object presented in two consecutive images. In
future work, we plan to apply this model to videos, which may be represented
as a temporally ordered sequence of images. In particular, the ability to extract
translational motion will be useful for tracking tasks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Igel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Training restricted Boltzmann machines: an introduction</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>47</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>25</fpage>
          -
          <lpage>39</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.:
          <article-title>Training products of experts by minimizing contrastive divergence</article-title>
          .
          <source>Neural computation 14</source>
          , pp.
          <fpage>1771</fpage>
          -
          <lpage>1800</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kohonen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The adaptive-subspace som (assom) and its use for the implementation of invariant feature detection</article-title>
          .
          <source>In: Proc. ICANN95</source>
          ,
            <source>Int. Conf. on Artificial Neural Networks</source>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Konda</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Memisevic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Learning visual odometry with a convolutional network</article-title>
          .
          <source>International Conference on Computer Vision Theory and Applications</source>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning algorithms for the classification restricted Boltzmann machine</article-title>
          .
          <source>Journal of Machine Learning Research 13</source>
          , pp.
          <fpage>643</fpage>
          -
          <lpage>669</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Classification using discriminative restricted Boltzmann machines</article-title>
          .
          <source>In: Proceedings of the 25th international conference on Machine learning</source>
          , pp.
          <fpage>536</fpage>
          -
          <lpage>543</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Van der Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.:
          <article-title>Visualizing data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research 9</source>
          , pp.
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Memisevic</surname>
          </string-name>
          , R.:
          <article-title>Learning to relate images</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>35</volume>
          , pp.
          <fpage>1829</fpage>
          -
          <lpage>1846</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Memisevic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          :
          <article-title>Learning to represent spatial transformations with factored higher-order Boltzmann machines</article-title>
          .
          <source>Neural Computation</source>
          <volume>22</volume>
          , pp.
          <fpage>1473</fpage>
          -
          <lpage>1492</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sohn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Learning to disentangle factors of variation with manifold interaction</article-title>
          .
          <source>In: Proceedings of the 31st International Conference on Machine Learning</source>
          , pp.
          <fpage>1431</fpage>
          -
          <lpage>1439</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sohn</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Learning and selecting features jointly with point-wise gated Boltzmann machines</article-title>
          .
          <source>In: Proceedings of The 30th International Conference on Machine Learning</source>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>225</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Srivastava</surname>
          </string-name>
          , N.:
          <article-title>Unsupervised Learning of Visual Representations using Videos</article-title>
          . Department of Computer Science, University of Toronto.
          <source>Technical Report</source>
          . (
          <year>2015</year>
          ) Retrieved from http://www.cs.toronto.edu/~nitish/depth_oral.pdf
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Susskind</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Memisevic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollefeys</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Modeling the joint density of two images under a variety of transformations</article-title>
          .
          <source>In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>2793</fpage>
          -
          <lpage>2800</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>