<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop, Stavropol and Arkhyz, Russian Federation</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Influence of Dropout and Dynamic Receptive Field Operations on Convolutional Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viktoria Berezina</string-name>
          <email>berezinava@yandex.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oksana Mezentseva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Systems &amp; Technologies, North-Caucasus Federal University</institution>
          ,
          <addr-line>2, Kulakov Prospect, Stavropol, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>09</lpage>
      <abstract>
        <p>This article presents the method and the experiments performed in order to combat co-adaptation and to improve the generalization ability of networks with the help of two techniques: dynamic receptive fields and dropout. It is an effective approach to network training. The use of the method combining the dropout technique and dynamic receptive fields reduces the generalization error and prevents the co-adaptation of neurons.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The main algorithm for training convolutional neural networks (CNN) is backpropagation (BPA).</p>
      <p>As is known, the weights change according to formula (1).</p>
      <p>\Delta w_{ij} = \eta \, \delta_j o_i \quad (1)</p>
      <p>where η is the learning rate (usually a constant), δ_j is the local gradient for neuron j, and o_i is the input signal for neuron j. A weight is changed by the value obtained by multiplying the local gradient by the input value (the output value of the previous layer) for this weight. In this formulation the rule is similar to Hebb's rule (an empirical regularity found in the neural networks of living organisms) [1].</p>
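      <p>As a minimal illustration (our own sketch, not code from this work), rule (1) for a whole layer can be written as an outer product, assuming the local gradients and the previous layer's outputs have already been computed:</p>
      <preformat>
import numpy as np

def update_weights(W, delta, o, lr=0.1):
    """Apply rule (1), Delta w_ij = lr * delta_j * o_i, to one layer's weight matrix.

    W     : (n_inputs, n_outputs) weight matrix
    delta : (n_outputs,) local gradients of the layer's neurons
    o     : (n_inputs,) outputs of the previous layer (inputs to this layer)
    lr    : learning rate, the constant eta in (1)
    """
    # The outer product gives delta_j * o_i for every connection at once.
    W += lr * np.outer(o, delta)
    return W
      </preformat>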
      <p>However, formula (1) is an analytical expression that does not take the network architecture into account at all. Any neural network is a particular form of a graph, and therefore different optimization techniques soon arose to improve training. These techniques are based on (2).</p>
      <p>\Delta w_{ij} = \eta \, \delta_j(\text{network architecture}) \, o_i(\text{network architecture}) \quad (2)</p>
      <p>We consider δ_j(network architecture) first. Today one of the key techniques for backpropagating the local gradient while taking the network architecture into account is the dropout technique [2], i.e. a technique of dropping neurons during training. The technique has come a long way since 2012, and today it is the main way to combat overtraining in deep neural networks. There are different variants of this technique with a wide range of modifications of each (DropConnect [3], DropBlock [4]).</p>
      <p>Neural networks, and especially deep ones, tend to overtrain. The dropout technique helps to obstruct this process. When we delete part of the neurons during training, we obtain another neural network. If we have n neurons, then we can obtain 2^n networks from the original network with shared adjustable weights, with a total of O(n^2) parameters. From the mathematical viewpoint we can consider such training as the training of 2^n sparse (partially connected) networks with common weights [5].</p>
      <p>During forward propagation we can describe the functioning of a usual neural network by (3) and (4):</p>
      <p>z_i^{l+1} = w_i^{l+1} y^l + b_i^{l+1} \quad (3)</p>
      <p>y_i^{l+1} = f(z_i^{l+1}) \quad (4)</p>
      <p>where z_i^{l+1} is the weighted sum for neuron i of layer l+1, w are the adjustable weights, b is the neuron bias, y_i^{l+1} is the neuron output, and f(·) is the activation function of the neuron (usually a sigmoid function, or ReLU in modern models).</p>
      <p>With the dropout technique, forward propagation changes for each input pattern according to formulas (5)-(8):</p>
      <p>r_j^l \sim \mathrm{Bernoulli}(p) \quad (5)</p>
      <p>\tilde{y}^l = r^l \cdot y^l \quad (6)</p>
      <p>z_i^{l+1} = w_i^{l+1} \tilde{y}^l + b_i^{l+1} \quad (7)</p>
      <p>y_i^{l+1} = f(z_i^{l+1}) \quad (8)</p>
      <p>where r_j^l is a Bernoulli random variable that determines, for the neurons of a layer, whether they are included in forward propagation or not.</p>
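      <p>A minimal sketch of the thinned forward pass (5)-(8) for one fully connected layer, assuming NumPy and a ReLU activation (the concrete layer shapes are our own illustration):</p>
      <preformat>
import numpy as np

def dropout_forward(y_prev, W, b, p=0.5, rng=None):
    """Forward propagation through one layer with dropout, formulas (5)-(8).

    y_prev : (n_in,) outputs y^l of layer l
    W      : (n_in, n_out) weights w^{l+1}
    b      : (n_out,) biases b^{l+1}
    p      : probability that a neuron of layer l is kept
    """
    rng = rng or np.random.default_rng()
    r = rng.binomial(1, p, size=y_prev.shape)  # (5) r_j^l ~ Bernoulli(p)
    y_tilde = r * y_prev                       # (6) thinned outputs of layer l
    z = y_tilde @ W + b                        # (7) weighted sum for layer l+1
    return np.maximum(z, 0.0)                  # (8) activation f, here ReLU
      </preformat>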
      <p>In the simplest case the dropout technique means deleting a neuron with a probability of 0.5. During testing, the weight values are multiplied by the probability with which the neurons participated in training (9):</p>
      <p>W_{\mathrm{test}}^l = p W^l \quad (9)</p>
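      <p>At test time the mask is dropped and the trained weights are simply scaled, as in (9); a one-line sketch:</p>
      <preformat>
def test_time_weights(W, p=0.5):
    """Formula (9): W_test^l = p * W^l, so expected activations match training."""
    return p * W
      </preformat>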
      <p>The use of this technique leads to an interesting effect. The derivative obtained for each parameter (the local gradient) tells it how it should change in order to minimize the final loss function, taking into account the activity of the other parameters (weights). Therefore, weights can change so as to correct the errors of other weights. This can lead to excessive joint adaptation (co-adaptation), which in turn leads to overtraining, because these joint adaptations cannot be generalized to data that were not involved in the training.</p>
      <p>Dropout prevents joint adaptation for each hidden parameter by making the presence of the other hidden parameters unreliable. Therefore, a hidden weight cannot rely on other weights to correct its own mistakes.</p>
      <p>The features learned by hidden neurons of autoencoders trained on the MNIST dataset [6] without dropout are shown in figure 1.a (the picture was taken from [2]). The same features obtained with dropout with a probability of 0.5 are shown in figure 1.b.</p>
      <p>As we can see, the features in the right part of figure 1 are clear and not similar to each other. This increases the ability for invariant pattern recognition.</p>
      <p>However, as seen from (2), the architecture change can be applied not only to the local gradient but to the input as well. A receptive field (RF) works with the input. If we use RFs with nonstandard forms, we increase the quantity of information that influences the tuning of the neuron-detector [7].</p>
      <p>Therefore the method proposed in this work consists in combining these two techniques (both dependent on the network architecture) for two purposes: combating overtraining and improving the quality of invariant recognition. In this approach the training rule (1) will depend entirely on the architectural properties of the network, and this new information embedded in rule (2) will implicitly decrease the network entropy (if entropy is used in the form of cross-entropy in the output layer) more rapidly than usual during the training process.</p>
      <p>The idea of using a CNN with dynamic RFs is that if we change the set of RFs for some layers, then the same pattern can be perceived in different ways by the network. With the help of this we can enlarge the training dataset. It is known that the classical form of an RF is a square. We propose to use a template for obtaining RFs with a nonstandard form. The template consists of indexes that identify the neighbors within two discrete steps of the indexed element on the pixel matrix. If we change all RFs of a feature map (a layer consists of feature maps), the additional information will influence the adjustable features and will lead to obtaining better invariants, as we can see in figure 2.</p>
      <p>Here C_{m,n}^i is the output of a neuron of the i-th feature map of a C-layer in position (m, n); φ(·) = A tanh(B·p) with A = 1.7159 and B = 2/3; b is a bias; Q_i is the set of indexes of the feature maps of the previous layer that are linked with the C_i map; K_C is the size of the square RF for the neuron C_{m,n}^i; X_{m+k,n+l} is an input value for the neuron C_{m,n}^i; the vectors W and A are the adjustable weights of the neurons of the C-layer; S_{m,n}^i is the output value of a pooling neuron; F_i(·) and F_j(·) stand for F_i(RF_{m,n}, k, l) and F_j(RF_{m,n}, k, l), i.e. the functions that return the row and column offsets for the RF template belonging to the neuron (m, n) at position (k, l) within this template; index_{k,l} is the element of the template RF_{m,n} at position (k, l), index_{k,l} = 0..24. The functions are determined by the following formulas (11):</p>
      <p>F_i(\cdot) =
\begin{cases}
0, &amp; index_{k,l} \in \{0, 4, 5, 16, 17\} \\
1, &amp; index_{k,l} \in \{6, 7, 8, 18, 19\} \\
2, &amp; index_{k,l} \in \{20, 21, 22, 23, 24\} \\
-1, &amp; index_{k,l} \in \{1, 2, 3, 14, 15\} \\
-2, &amp; index_{k,l} \in \{9, 10, 11, 12, 13\}
\end{cases}</p>
      <p>F_j(\cdot) =
\begin{cases}
0, &amp; index_{k,l} \in \{0, 2, 7, 11, 22\} \\
1, &amp; index_{k,l} \in \{3, 5, 8, 12, 23\} \\
2, &amp; index_{k,l} \in \{13, 15, 17, 19, 24\} \\
-1, &amp; index_{k,l} \in \{1, 4, 6, 10, 21\} \\
-2, &amp; index_{k,l} \in \{9, 14, 16, 18, 20\}
\end{cases} \quad (11)</p>
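      <p>The piecewise definitions (11) are simply lookup tables from a template index (0..24) to row and column offsets within a 5x5 neighborhood. A short sketch (our own illustration, not the authors' implementation) of gathering the inputs addressed by such a nonstandard RF:</p>
      <preformat>
import numpy as np

# Row (F_i) and column (F_j) offsets for each template index 0..24, from (11).
F_I = {**dict.fromkeys((0, 4, 5, 16, 17), 0),
       **dict.fromkeys((6, 7, 8, 18, 19), 1),
       **dict.fromkeys((20, 21, 22, 23, 24), 2),
       **dict.fromkeys((1, 2, 3, 14, 15), -1),
       **dict.fromkeys((9, 10, 11, 12, 13), -2)}
F_J = {**dict.fromkeys((0, 2, 7, 11, 22), 0),
       **dict.fromkeys((3, 5, 8, 12, 23), 1),
       **dict.fromkeys((13, 15, 17, 19, 24), 2),
       **dict.fromkeys((1, 4, 6, 10, 21), -1),
       **dict.fromkeys((9, 14, 16, 18, 20), -2)}

def gather_receptive_field(X, m, n, template):
    """Collect the inputs of the neuron at (m, n) addressed by an RF template.

    X        : 2-D input map (assumed padded so every offset stays in bounds)
    template : iterable of template indexes (a subset of 0..24) giving the RF form
    """
    return np.array([X[m + F_I[idx], n + F_J[idx]] for idx in template])
      </preformat>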
      <p>There are no problems in combining the two techniques. After feeding the next pattern, it is necessary to select the corresponding RFs for the neurons and also to decide which neurons will be skipped, as sketched below. Details of the implementation of dynamic RFs are given in [7, 8, 9]. The details of dropout are given in [2, 5].</p>
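      <p>A hedged sketch of such a combined training step; select_rf_templates, forward and backward are hypothetical helpers standing in for the routines described in [7, 8, 9] and [2, 5]:</p>
      <preformat>
import numpy as np

def training_step(pattern, target, net, p_keep=0.5, rng=None):
    """One combined step per input pattern (hypothetical network interface).

    For every new pattern: choose RF templates for the convolutional layers,
    draw dropout masks for the hidden layers, then run the usual
    forward/backward pass of the resulting thinned, re-shaped network.
    """
    rng = rng or np.random.default_rng()
    templates = net.select_rf_templates(rng)              # dynamic RFs for this pattern
    masks = [rng.binomial(1, p_keep, size=n_neurons)      # Bernoulli masks, formula (5)
             for n_neurons in net.hidden_layer_sizes]
    output = net.forward(pattern, templates, masks)       # forward pass of the thinned network
    net.backward(output - target, templates, masks)       # update only the kept neurons' weights
      </preformat>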
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>The experiments with the proposed method were carried out on MNIST [6]. Each pattern is a 28x28 grayscale image (784 pixels). The test dataset contains 10,000 patterns and the training dataset contains 60,000 patterns.</p>
      <p>We have used the classical LeNet-5 architecture. The type of regularization is L2.</p>
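      <p>As a small illustration of the L2 regularization used here (our own sketch, not the paper's code), the per-layer update from (1) simply gains a weight-decay term:</p>
      <preformat>
import numpy as np

def update_weights_l2(W, delta, o, lr=0.1, lam=1e-4):
    """Rule (1) plus an L2 penalty: the update also shrinks each weight by lr * lam * w_ij."""
    W += lr * (np.outer(o, delta) - lam * W)
    return W
      </preformat>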
      <p>Without any of these techniques we obtained a result of 0.95 on the test dataset.</p>
      <p>The results obtained with dropout alone and with the proposed combination of dropout and dynamic RFs are shown in figure 3. The parameters of the network were taken from the similar work [8].</p>
      <p>Blue is the result of LeNet-5 with dropout and red is the result of the combined method. The horizontal axis is the probability of neuron dropout. It can be seen that as the probability of neuron dropout increases, the generalization error decreases and the generalizing abilities of the network (or, equivalently, of the committee of networks) grow.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>The use of the method combining the dropout technique and dynamic receptive fields reduces the generalization error and prevents the co-adaptation of neurons. In general, the architectural changes occurring in the graph of a convolutional neural network have a positive effect on the quality of invariant recognition and, in fact, correspond to a committee of networks trained with common weights.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Hebb</surname>
            <given-names>D.O.</given-names>
          </string-name>
          <article-title>The Organization of Behavior</article-title>
          . John Wiley &amp; Sons, New York,
          <year>1949</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <article-title>Improving neural networks by preventing co-adaptation of feature detectors</article-title>
          . http://arxiv.org/abs/1207.0580,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Li</given-names>
            <surname>Wan</surname>
          </string-name>
          , Matthew Zeiler, Sixin Zhang, Yann LeCun, Rob Fergus.
          <source>Regularization of Neural Networks using DropConnect, International Conference on Machine Learning</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Golnaz</given-names>
            <surname>Ghiasi</surname>
          </string-name>
          ,
          <string-name>
            <surname>Tsung-Yi</surname>
            <given-names>Lin</given-names>
          </string-name>
          , Quoc V.
          <article-title>Le DropBlock. A regularization method for convolutional networks</article-title>
          ,
          <source>30 Oct</source>
          <year>2018</year>
          ,
          <string-name>
            <surname>NIPS</surname>
          </string-name>
          <year>2018</year>
          , https://arxiv.org/pdf/
          <year>1810</year>
          .12890.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Baldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Sadowski</surname>
          </string-name>
          .
          <source>Understanding Dropout, Advances in neural information processing systems</source>
          , January 2013
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Nemkov</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mezentsev</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mezentseva</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodnikov</surname>
            <given-names>M</given-names>
          </string-name>
          .
          <article-title>Image Recognition by a Second-Order Convolutional Neural Network with Dynamic Receptive Fields</article-title>
          ,
          <source>Young Scientists International Workshop on Trends in Information Processing (YSIP2)</source>
          . Dombai, Russian Federation, May
          <volume>16</volume>
          -20,
          <year>2017</year>
          . 212. http://ceur-ws.org/Vol-1837/paper21.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Nemkov</surname>
            ,
            <given-names>R. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mezentseva</surname>
            <given-names>O. S.</given-names>
          </string-name>
          <article-title>Dynamical change of the perceiving properties of convolutional neural networks and its impact on generalization</article-title>
          .
          <source>Neurocomputers: development and application</source>
          ,
          <year>2015</year>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Nemkov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <article-title>The method of a mathematical model parameters synthesis for a convolution neural network with an expanded training set</article-title>
          .
          <source>Modern problems of science and education</source>
          ,
          <year>2015</year>
          , no. 1. URL: http://www.science-education.ru/125-19867.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>