Dynamical Change of the Perceiving Properties of Neural Networks as Training with Noise and Its Impact on Pattern Recognition

Roman Nemkov

Department of Information Systems & Technologies, North-Caucasus Federal University, 2 Kulakov Prospect, Stavropol, Russian Federation
nemkov.roman@yandex.ru

Abstract. The general parameters of a convolutional network (its kernels) are set during training. Besides the training method itself, the quality of this setting is influenced by the quantity of information passed through each kernel, which depends on the size of the training sample and on the concentration of receptive fields. For a fixed training set size, the concentration of receptive fields can be increased by covering arbitrary maps with several layers of fields of different types, which is equivalent to using a noisy training sample. This can improve the network's performance on the test set.

Keywords: convolutional networks, training with noise, different types of receptive fields.

1 Introduction

To date, the problem of invariance remains the main unsolved problem of pattern recognition: the same object may have substantially different external characteristics (shape, colour, texture, etc.) and may be displayed differently on the retina (viewed from different angles), which greatly complicates its classification. Within neural network technologies this global problem can be addressed by creating large training sets [1]. If creating such sets is difficult, they are expanded by adding noise [4-6]. If a neural network is regarded as a pyramidal hierarchical graph, noise can be created by changing the connections between the nodes of the graph [2, 3] or by changing the perceiving properties of its nodes [8]. Convolutional neural networks (CNNs) have three perceiving properties in a node-neuron: a receptive field (RF), an activation function, and a method for producing the weighted sum (a simple weighted sum or a higher-order polynomial). Changing the RFs is the easiest and most promising of the three: the same pattern is perceived differently when the RFs change, which creates noise. This article investigates the influence of such noise, created by changing the shape of the fields, on pattern recognition.

2 The Generation of Noise by Changing the Receptive Fields

Training with noise in the context of gradient descent for a neural network can be written as

\frac{\partial E}{\partial w} + \varepsilon = \nabla,    (1)

where ∂E/∂w is the gradient vector with respect to the network's weights and ε is an additional noisy component that corrects the gradient vector. After such a correction the gradient vector no longer points exactly towards the local minimum, but training with noise has two benefits: generalization ability increases, and local minima are more easily escaped during gradient descent. Changing the perceiving properties in the nodes of the network gives the same kind of training with noise, where ε can be written explicitly as

\varepsilon = \frac{\partial E}{\partial w}(\text{new perception}) - \frac{\partial E}{\partial w}(\text{standard perception}),    (2)

where ∂E/∂w(new perception) is the gradient vector with respect to the weights after the perceiving properties have been changed, and ∂E/∂w(standard perception) is the gradient vector with the standard perceiving properties (square RFs).
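To make relations (1) and (2) concrete, the following minimal sketch (my own illustration, not the paper's code) computes ε as the difference between the gradient of a toy MSE neuron under a perturbed view of a pattern and under the standard view, and then takes a descent step along the corrected gradient; the perturbed input here merely stands in for the change of RF shape described below.

```python
# Illustrative sketch of equations (1)-(2) on a toy linear neuron with MSE loss.
# The "new perception" is modelled simply as a perturbed view of the same input
# pattern; all names and values are assumptions for the example only.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)            # weights of a single neuron
x = np.array([0.2, -1.0, 0.7])    # input pattern, standard perception
t = 1.0                           # target output

def grad_mse(w, x, t):
    """Gradient of E = 0.5 * (w.x - t)^2 with respect to w."""
    return (w @ x - t) * x

x_new = x + rng.normal(scale=0.1, size=x.shape)   # the same pattern, perceived differently

g_std = grad_mse(w, x, t)       # dE/dw (standard perception)
g_new = grad_mse(w, x_new, t)   # dE/dw (new perception)
eps = g_new - g_std             # equation (2)
nabla = g_std + eps             # equation (1): the corrected ("noisy") gradient

eta = 0.005
w = w - eta * nabla             # one descent step along the noisy direction
```

Note that the corrected direction coincides with the gradient computed under the new perception, which is exactly what a changed perception produces during training.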
The perceiving properties in the nodes of the network are changed by changing the shape of the RFs. Each element of an RF has neighbours: elements located one or two discrete steps away from the current element. The value of the current element (within the RF) can therefore be replaced by a neighbouring value. If this operation is performed for all elements of the RF, the weighted sum changes and hence the output of the neuron changes as well. The replacement is shown in Fig. 1.

Fig. 1. The replacement of the neuron-pixel X by Y by changing the shape of the receptive field (a receptive field with a satellite).

The map that serves as input for the current neurons receives a different covering of RFs, while the kernels of the convolutional layer through which this new covering is passed remain the same. The quantity of information affecting the kernels increases, so the kernels can extract better invariant features. This process is shown in Fig. 2.

Fig. 2. Discrete perception of information by a convolutional layer (C-Layer) (left). The same perception, but by a C-Layer with different receptive fields: in the first stroke the pattern is perceived by the first type of receptive field, and in the second stroke the same pattern is perceived by the second type (right).

This technique expands the training set with patterns that differ according to where (how far from the current element) the elements of the RFs take the information for the replacement. If all elements of the RFs are replaced, the additional training sets obtained in this way are those shown in Fig. 3.

Fig. 3. Additional sets obtained by changing the RFs.

Every convolutional layer may have its own covering, so the change of perception can occur on different layers. If a CNN has three convolutional layers, the number of combinations (or "refracting prisms") is 2^3 − 1 = 7 (one "prism" is the standard perception). A unique pattern is obtained within each scheme-"prism". The strategy for marking up a particular layer with RFs is also important. There are two opposing strategies: either an RF is chosen at random and superimposed on the desired location, or the same type of RF with a specific index is superimposed on all desired locations. The second strategy can model primitive affine transformations if the RF simulates a shift for all its elements. Patterns should be created with all combinations of "refracting prisms", with both strategies, and with RFs that are fully updated, for maximum coverage of any of the three additional sets; a sketch of such a markup is given below.
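The markup of a layer with fields from a pool can be pictured with a short sketch (an illustration under my own assumptions, in the spirit of Figs. 1-3, not the author's implementation): an RF is a list of discrete offsets, the pool contains the standard square field plus shifted variants, and the two strategies choose which variant covers each position while the kernel weights stay fixed.

```python
# Sketch (illustrative assumptions only): receptive fields as offset lists,
# a pool of field types, and the two markup strategies from Section 2.
import numpy as np

STANDARD_RF = [(dr, dc) for dr in range(3) for dc in range(3)]   # 3x3 square field

def shifted_rf(shift):
    """Shift every element of the standard RF by one discrete step."""
    dr, dc = shift
    return [(r + dr, c + dc) for r, c in STANDARD_RF]

# Pool of RF types: index 0 is the standard perception, the rest simulate shifts.
RF_POOL = [STANDARD_RF] + [shifted_rf(s) for s in [(0, 1), (1, 0), (0, -1), (-1, 0)]]

def perceive(feature_map, kernel, rf_index=0, random_strategy=False, rng=None):
    """Weighted sums over the map: the kernel stays fixed, only the RF shape varies."""
    rng = rng or np.random.default_rng()
    h, w = feature_map.shape
    out = np.zeros((h - 4, w - 4))                    # margin so shifted fields stay inside
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            idx = rng.integers(len(RF_POOL)) if random_strategy else rf_index
            values = [feature_map[i + 1 + r, j + 1 + c] for r, c in RF_POOL[idx]]
            out[i, j] = float(np.dot(kernel.ravel(), values))
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))
kernel = rng.normal(size=(3, 3))
standard = perceive(image, kernel, rf_index=0)                   # usual square covering
shifted = perceive(image, kernel, rf_index=2)                    # second strategy: one shift everywhere
mixed = perceive(image, kernel, random_strategy=True, rng=rng)   # first strategy: random field per position
```

All three outputs come from the same kernel; the extra patterns that the kernel perceives under the shifted or mixed coverings are what expands the training sample.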
3 Experiment

MNIST was chosen for the experiments with noise because most schemes for creating noise have been tested on this set. The architecture of the CNN is shown in Fig. 4.

Fig. 4. The architecture of the CNN used for the MNIST set.

The simplest gradient descent algorithm (without momentum, weight decay, or other tricks) was used for maximum simplicity and repeatability of the experiment. The initial value of the learning rate η is 0.005; after every 100 epochs the new value is obtained from the old one by multiplying it by 0.3. The error function is the mean squared error (MSE). A pattern is considered recognized if the error on the output layer does not exceed 0.001. The pools of RFs for the convolutional layers are shown in Fig. 5.

Fig. 5. Pools of RFs for the convolutional layers.

The RF with an arbitrary index is placed in the proper position by the markup strategy. The geometrical interpretation of the index for the shift is shown in Fig. 6.

Fig. 6. The geometrical interpretation of the index for the shift of an RF element.

Thus, the noise from the first additional training set (Fig. 3, set (1)) was used. Comparative results are given in Table 1.

Table 1. Comparison between different learning algorithms.

Algorithm             Distortion              Error   Ref.
2-layer MLP (MSE)     affine                  1.6%    [4]
SVM                   affine                  1.4%    [5]
Tangent dist.         affine + thick          1.1%    [4]
LeNet-5 (MSE)         affine                  0.8%    [4]
2-layer MLP (MSE)     elastic                 0.9%    [6]
CNN (MSE)             RF-based (set (1))      1.2%    this paper
Best result           elastic                 0.23%   [7]

This is a good result, achieved without involving the additional noise from sets (2) or (3) (Fig. 3). The research has shown that changing the perceiving properties in the nodes of a CNN can effectively expand the training set and reduce the generalization error. The technique is also easily combined with elastic distortions and with DropConnect or dropout.

References

[1] Dean, J., Corrado, G.S., Monga, R., Chen, K., Devin, M., Le, Q.V., Mao, M.Z., Ranzato, M.A., Senior, A., Tucker, P., Yang, K., Ng, A.Y.: Large Scale Distributed Deep Networks. In: NIPS, 2012.
[2] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. CoRR, 2012.
[3] Wan, L., Zeiler, M.D., Zhang, S., LeCun, Y., Fergus, R.: Regularization of Neural Networks using DropConnect. In: ICML, JMLR Proceedings, vol. 28, pp. 1058-1062, 2013.
[4] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, vol. 86, pp. 2278-2324, 1998.
[5] Decoste, D., Scholkopf, B.: Training Invariant Support Vector Machines. Machine Learning, vol. 46, no. 1-3, 2002.
[6] Simard, P., Steinkraus, D., Platt, J.C.: Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. In: ICDAR, pp. 958-962. IEEE Computer Society, 2003.
[7] Ciresan, D., Meier, U., Schmidhuber, J.: Multi-column Deep Neural Networks for Image Classification. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3642-3649. IEEE Computer Society, Washington, DC, USA, 2012. ISBN 978-1-4673-1226-4.
[8] Nemkov, R., Mezentseva, O.: The Use of Convolutional Neural Networks with Non-specific Receptive Fields. In: The 4th International Scientific Conference "Applied Natural Science 2013", Novy Smokovec, High Tatras, Slovak Republic, October 2-4, 2013, p. 148.