<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop, Stavropol and Arkhyz, Russian Federation</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Influence of Dropout and Dynamic Receptive Field Operations on Convolutional Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viktoria Berezina</string-name>
          <email>berezinava@yandex.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oksana Mezentseva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Systems &amp; Technologies, North-Caucasus Federal University</institution>
          ,
          <addr-line>2, Kulakov Prospect, Stavropol, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>09</lpage>
      <abstract>
        <p>This article presents the method and the experiments performed in order to combat co-adaptation and to improve the generalization ability of networks with the help of two techniques: dynamic receptive fields and dropout. It is an effective approach to network training. The use of the method combining the dropout technique and dynamic receptive fields reduces the generalization error and prevents the co-adaptation of neurons.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The main algorithm for training convolutional neural networks (CNN) is backpropagation (BPA).</p>
      <p>As is known, the weights change according to formula (1).</p>
      <p>\Delta w_{ij} = \eta \, \delta_j o_i \quad (1)</p>
      <p>where η is the learning rate (usually a constant), δ_j is the local gradient for neuron j, and o_i is the input signal for neuron j. A weight is changed by the value obtained by multiplying the local gradient by the input value (the output value of the previous layer) for this weight. In this formulation the rule is similar to Hebb's rule (an empirical regularity found in the neural networks of living organisms) [1].</p>
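      <p>As a minimal illustration (our own sketch, not code from this work), rule (1) for a whole layer can be written as an outer product, assuming the local gradients and the previous layer's outputs have already been computed:</p>
      <preformat>
import numpy as np

def update_weights(W, delta, o, lr=0.1):
    """Apply rule (1), Delta w_ij = lr * delta_j * o_i, to one layer's weight matrix.

    W     : (n_inputs, n_outputs) weight matrix
    delta : (n_outputs,) local gradients of the layer's neurons
    o     : (n_inputs,) outputs of the previous layer (inputs to this layer)
    lr    : learning rate, the constant eta in (1)
    """
    # The outer product gives delta_j * o_i for every connection at once.
    W += lr * np.outer(o, delta)
    return W
      </preformat>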
      <p>However, formula (1) is an analytical expression that does not take the network architecture into account at all. Any neural network is a particular form of a graph, and therefore different optimization techniques soon arose to improve training. These techniques are based on (2).</p>
      <p>\Delta w_{ij} = \eta \, \delta_j(\text{network architecture}) \, o_i(\text{network architecture}) \quad (2)</p>
      <p>We consider δ_j(network architecture) first. Today one of the key techniques for backpropagating the local gradient while taking the network architecture into account is the dropout technique [2], i.e. a technique of dropping neurons during training. The technique has come a long way since 2012, and today it is the main way to combat overtraining in deep neural networks. There are different variants of this technique with a wide range of modifications of each (DropConnect [3], DropBlock [4]).</p>
      <p>Neural networks, and especially deep ones, tend to overtrain. The dropout technique helps to obstruct this process. When we delete part of the neurons during training, we obtain another neural network. If we have n neurons, then we can obtain 2^n networks from the original network with shared adjustable weights, with a total of O(n^2) parameters. From the mathematical viewpoint we can consider such training as the training of 2^n sparse (partially connected) networks with common weights [5].</p>
      <p>During forward propagation we can describe the functioning of a usual neural network by (3) and (4):</p>
      <p>z_i^{l+1} = w_i^{l+1} y^l + b_i^{l+1} \quad (3)</p>
      <p>y_i^{l+1} = f(z_i^{l+1}) \quad (4)</p>
      <p>where z_i^{l+1} is the weighted sum for neuron i of layer l+1, w are the adjustable weights, b is the neuron bias, y_i^{l+1} is the neuron output, and f(·) is the activation function of the neuron (usually a sigmoid function, or ReLU in modern models).</p>
      <p>With the dropout technique, forward propagation changes for each input pattern according to formulas (5)-(8):</p>
      <p>r_j^l \sim \mathrm{Bernoulli}(p) \quad (5)</p>
      <p>\tilde{y}^l = r^l \cdot y^l \quad (6)</p>
      <p>z_i^{l+1} = w_i^{l+1} \tilde{y}^l + b_i^{l+1} \quad (7)</p>
      <p>y_i^{l+1} = f(z_i^{l+1}) \quad (8)</p>
      <p>where r_j^l is a Bernoulli random variable that determines, for the neurons of a layer, whether they are included in forward propagation or not.</p>
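      <p>A minimal sketch of the thinned forward pass (5)-(8) for one fully connected layer, assuming NumPy and a ReLU activation (the concrete layer shapes are our own illustration):</p>
      <preformat>
import numpy as np

def dropout_forward(y_prev, W, b, p=0.5, rng=None):
    """Forward propagation through one layer with dropout, formulas (5)-(8).

    y_prev : (n_in,) outputs y^l of layer l
    W      : (n_in, n_out) weights w^{l+1}
    b      : (n_out,) biases b^{l+1}
    p      : probability that a neuron of layer l is kept
    """
    rng = rng or np.random.default_rng()
    r = rng.binomial(1, p, size=y_prev.shape)  # (5) r_j^l ~ Bernoulli(p)
    y_tilde = r * y_prev                       # (6) thinned outputs of layer l
    z = y_tilde @ W + b                        # (7) weighted sum for layer l+1
    return np.maximum(z, 0.0)                  # (8) activation f, here ReLU
      </preformat>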
      <p>In the simplest case the dropout technique means deleting a neuron with a probability of 0.5. During testing, the weight values are multiplied by the probability with which the neurons participated in training (9):</p>
      <p>W_{\mathrm{test}}^l = p W^l \quad (9)</p>
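      <p>At test time the mask is dropped and the trained weights are simply scaled, as in (9); a one-line sketch:</p>
      <preformat>
def test_time_weights(W, p=0.5):
    """Formula (9): W_test^l = p * W^l, so expected activations match training."""
    return p * W
      </preformat>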
      <p>The use of this technique leads to an interesting effect. The derivative obtained for each parameter (the local gradient) tells it how it should change in order to minimize the final loss function, taking into account the activity of the other parameters (weights). Therefore, weights can change so as to correct the errors of other weights. This can lead to excessive joint adaptation (co-adaptation), which in turn leads to overtraining, because these joint adaptations cannot be generalized to data that were not involved in the training.</p>
      <p>Dropout prevents joint adaptation for each hidden parameter by making the presence of the other hidden parameters unreliable. Therefore, a hidden weight cannot rely on other weights to correct its own mistakes.</p>
      <p>The features learned by hidden neurons of autoencoders trained on the MNIST dataset [6] without dropout are shown in figure 1.a (the picture was taken from [2]). The same features obtained with dropout with a probability of 0.5 are shown in figure 1.b.</p>
      <p>As we can see, the features in the right part of figure 1 are clear and not similar to each other. This increases the ability for invariant pattern recognition.</p>
      <p>However, as seen from (2), the architecture change can be applied not only to the local gradient but to the input as well. A receptive field (RF) works with the input. If we use RFs with nonstandard forms, we increase the quantity of information that influences the tuning of the neuron-detector [7].</p>
      <p>Therefore the method proposed in this work consists in combining these two techniques (both dependent on the network architecture) for two purposes: combating overtraining and improving the quality of invariant recognition. In this approach the training rule (1) will depend entirely on the architectural properties of the network, and this new information embedded in rule (2) will implicitly decrease the network entropy (if entropy is used in the form of cross-entropy in the output layer) more rapidly than usual during the training process.</p>
      <p>The idea of using a CNN with dynamic RFs is that if we change the set of RFs for some layers, then the same pattern can be perceived in different ways by the network. With the help of this we can enlarge the training dataset. It is known that the classical form of an RF is a square. We propose to use a template for obtaining RFs with a nonstandard form. The template consists of indexes that identify the neighbors within two discrete steps of the indexed element on the pixel matrix. If we change all RFs of a feature map (a layer consists of feature maps), the additional information will influence the adjustable features and will lead to obtaining better invariants, as we can see in figure 2.</p>
      <p>Here C_{m,n}^i is the output of a neuron of the i-th feature map of a C-layer in position (m, n); φ(·) = A tanh(B·p) with A = 1.7159 and B = 2/3; b is a bias; Q_i is the set of indexes of the feature maps of the previous layer that are linked with the C_i map; K_C is the size of the square RF for the neuron C_{m,n}^i; X_{m+k,n+l} is an input value for the neuron C_{m,n}^i; the vectors W and A are the adjustable weights of the neurons of the C-layer; S_{m,n}^i is the output value of a pooling neuron; F_i(·) and F_j(·) stand for F_i(RF_{m,n}, k, l) and F_j(RF_{m,n}, k, l), i.e. the functions that return the row and column offsets for the RF template belonging to the neuron (m, n) at position (k, l) within this template; index_{k,l} is the element of the template RF_{m,n} at position (k, l), index_{k,l} = 0..24. The functions are determined by the following formulas (11):</p>
      <p>F_i(\cdot) =
\begin{cases}
0, &amp; index_{k,l} \in \{0, 4, 5, 16, 17\} \\
1, &amp; index_{k,l} \in \{6, 7, 8, 18, 19\} \\
2, &amp; index_{k,l} \in \{20, 21, 22, 23, 24\} \\
-1, &amp; index_{k,l} \in \{1, 2, 3, 14, 15\} \\
-2, &amp; index_{k,l} \in \{9, 10, 11, 12, 13\}
\end{cases}</p>
      <p>F_j(\cdot) =
\begin{cases}
0, &amp; index_{k,l} \in \{0, 2, 7, 11, 22\} \\
1, &amp; index_{k,l} \in \{3, 5, 8, 12, 23\} \\
2, &amp; index_{k,l} \in \{13, 15, 17, 19, 24\} \\
-1, &amp; index_{k,l} \in \{1, 4, 6, 10, 21\} \\
-2, &amp; index_{k,l} \in \{9, 14, 16, 18, 20\}
\end{cases} \quad (11)</p>
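      <p>The piecewise definitions (11) are simply lookup tables from a template index (0..24) to row and column offsets within a 5x5 neighborhood. A short sketch (our own illustration, not the authors' implementation) of gathering the inputs addressed by such a nonstandard RF:</p>
      <preformat>
import numpy as np

# Row (F_i) and column (F_j) offsets for each template index 0..24, from (11).
F_I = {**dict.fromkeys((0, 4, 5, 16, 17), 0),
       **dict.fromkeys((6, 7, 8, 18, 19), 1),
       **dict.fromkeys((20, 21, 22, 23, 24), 2),
       **dict.fromkeys((1, 2, 3, 14, 15), -1),
       **dict.fromkeys((9, 10, 11, 12, 13), -2)}
F_J = {**dict.fromkeys((0, 2, 7, 11, 22), 0),
       **dict.fromkeys((3, 5, 8, 12, 23), 1),
       **dict.fromkeys((13, 15, 17, 19, 24), 2),
       **dict.fromkeys((1, 4, 6, 10, 21), -1),
       **dict.fromkeys((9, 14, 16, 18, 20), -2)}

def gather_receptive_field(X, m, n, template):
    """Collect the inputs of the neuron at (m, n) addressed by an RF template.

    X        : 2-D input map (assumed padded so every offset stays in bounds)
    template : iterable of template indexes (a subset of 0..24) giving the RF form
    """
    return np.array([X[m + F_I[idx], n + F_J[idx]] for idx in template])
      </preformat>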
      <p>There are no problems in combining the two techniques. After feeding the next pattern, it is necessary to select the corresponding RFs for the neurons and also to decide which neurons will be skipped, as sketched below. Details of the implementation of dynamic RFs are given in [7, 8, 9]. The details of dropout are given in [2, 5].</p>
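      <p>A hedged sketch of such a combined training step; select_rf_templates, forward and backward are hypothetical helpers standing in for the routines described in [7, 8, 9] and [2, 5]:</p>
      <preformat>
import numpy as np

def training_step(pattern, target, net, p_keep=0.5, rng=None):
    """One combined step per input pattern (hypothetical network interface).

    For every new pattern: choose RF templates for the convolutional layers,
    draw dropout masks for the hidden layers, then run the usual
    forward/backward pass of the resulting thinned, re-shaped network.
    """
    rng = rng or np.random.default_rng()
    templates = net.select_rf_templates(rng)              # dynamic RFs for this pattern
    masks = [rng.binomial(1, p_keep, size=n_neurons)      # Bernoulli masks, formula (5)
             for n_neurons in net.hidden_layer_sizes]
    output = net.forward(pattern, templates, masks)       # forward pass of the thinned network
    net.backward(output - target, templates, masks)       # update only the kept neurons' weights
      </preformat>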
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>The experiments with the proposed method were carried out on MNIST [6]. Each pattern is a 28x28 grayscale image (784 pixels). The test dataset contains 10,000 patterns and the training dataset contains 60,000 patterns.</p>
      <p>We have used the classical LeNet-5 architecture. The type of regularization is L2.</p>
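      <p>As a small illustration of the L2 regularization used here (our own sketch, not the paper's code), the per-layer update from (1) simply gains a weight-decay term:</p>
      <preformat>
import numpy as np

def update_weights_l2(W, delta, o, lr=0.1, lam=1e-4):
    """Rule (1) plus an L2 penalty: the update also shrinks each weight by lr * lam * w_ij."""
    W += lr * (np.outer(o, delta) - lam * W)
    return W
      </preformat>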
      <p>Without any of these techniques we obtained a result of 0.95 on the test dataset.</p>
      <p>The results obtained with dropout alone and with the proposed combination of dropout and dynamic RFs are shown in figure 3. The parameters of the network were taken from the similar work [8].</p>
      <p>Blue is the result of LeNet-5 with dropout and red is the result of the combined method. The horizontal axis is the probability of neuron dropout. It can be seen that as the probability of neuron dropout increases, the generalization error decreases and the generalizing abilities of the network (or, equivalently, of the committee of networks) grow.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>The use of the method combining the dropout technique and dynamic receptive fields reduces the generalization error and prevents the co-adaptation of neurons. In general, the architectural changes occurring in the graph of a convolutional neural network have a positive effect on the quality of invariant recognition and, in fact, correspond to a committee of networks trained with common weights.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Hebb</surname>
            <given-names>D.O.</given-names>
          </string-name>
          <article-title>The Organization of Behavior</article-title>
          . John Wiley &amp; Sons, New York,
          <year>1949</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <article-title>Improving neural networks by preventing co-adaptation of feature detectors</article-title>
          . http://arxiv.org/abs/1207.0580,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Li</given-names>
            <surname>Wan</surname>
          </string-name>
          , Matthew Zeiler, Sixin Zhang, Yann LeCun, Rob Fergus.
          <source>Regularization of Neural Networks using DropConnect, International Conference on Machine Learning</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Golnaz</given-names>
            <surname>Ghiasi</surname>
          </string-name>
          ,
          <string-name>
            <surname>Tsung-Yi</surname>
            <given-names>Lin</given-names>
          </string-name>
          , Quoc V.
          <article-title>Le DropBlock. A regularization method for convolutional networks</article-title>
          ,
          <source>30 Oct</source>
          <year>2018</year>
          ,
          <string-name>
            <surname>NIPS</surname>
          </string-name>
          <year>2018</year>
          , https://arxiv.org/pdf/
          <year>1810</year>
          .12890.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Baldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Sadowski</surname>
          </string-name>
          .
          <source>Understanding Dropout, Advances in neural information processing systems</source>
          , January 2013
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Nemkov</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mezentsev</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mezentseva</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodnikov</surname>
            <given-names>M</given-names>
          </string-name>
          .
          <article-title>Image Recognition by a Second-Order Convolutional Neural Network with Dynamic Receptive Fields</article-title>
          ,
          <source>Young Scientists International Workshop on Trends in Information Processing (YSIP2)</source>
          . Dombai, Russian Federation, May
          <volume>16</volume>
          -20,
          <year>2017</year>
          . 212. http://ceur-ws.org/Vol-1837/paper21.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Nemkov</surname>
            ,
            <given-names>R. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mezentseva</surname>
            <given-names>O. S.</given-names>
          </string-name>
          <article-title>Dynamical change of the perceiving properties of convolutional neural networks and its impact on generalization</article-title>
          .
          <source>Neurocomputers: development and application</source>
          ,
          <year>2015</year>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Nemkov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <article-title>The method of a mathematical model parameters synthesis for a convolution neural network with an expanded training set</article-title>
          .
          <source>Modern problems of science and education</source>
          ,
          <year>2015</year>
          , no. 1. URL: http://www.science-education.ru/125-19867.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>