<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep Learning Neural Networks with Controlled Switching of Neural Planes</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Saint Petersburg Electrotechnical University "LETI" Saint Petersburg</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The algorithm of network topology construction and training for two-dimensional fast neural networks with additional switched planes is considered. It is noted that the structure of fast neural networks has a fractal nature. The constructed topology is ideologically close to the topology of convolutional deep learning networks, but it is regular, with the number of layers determined by the factor decomposition of the dimensions of the image and of the output plane of classes. The learning algorithm has an analytical representation; it is stable and converges in a finite number of steps. Additional planes extend the information capacity of the tunable transformation to the maximum possible. Control of the planes in the training and processing modes is realized by the numerical coordinate codes of the output plane. The architecture of a regular neural network with additional planes is presented. Variants of image ordering in the output plane are considered. Examples are given.</p>
      </abstract>
      <kwd-group>
        <kwd>fast tunable transformation</kwd>
        <kwd>neural network</kwd>
        <kwd>learning</kwd>
        <kwd>convolutional neural network</kwd>
        <kwd>bitmap</kwd>
        <kwd>planes of the neural layers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Deep learning technology builds informative features of increasing complexity across a sequence of neural layers. Starting with the neocognitron of K. Fukushima [<xref ref-type="bibr" rid="ref1">1</xref>], several realizations of this idea have been proposed. One of the most successful is the architecture of convolutional neural networks [<xref ref-type="bibr" rid="ref2">2</xref>], which has shown high efficiency in a wide range of problems. A distinctive feature of this architecture is the presence in each convolutional layer of several data-processing channels (called maps or planes). In each plane, the output image of the previous layer is convolved with a fixed kernel of small dimensions. Convolutional layers alternate with pooling layers that reduce the dimension of the feature space by an integer factor. Pooling layers are optional, and variants that eliminate them from the network architecture entirely have been proposed [<xref ref-type="bibr" rid="ref3">3</xref>]. The second distinctive feature of the architecture is the use of semi-linear activation functions that act as switching keys controlled by the values of hidden-layer variables [<xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>].</p>
      <p>A disadvantage of the convolutional network architecture is the lack of theoretically justified methods for choosing the network structure and the convolution kernel parameters. Several well-functioning network configurations exist for specific tasks, but it is not clear how to build a network for a new task; until now, the choice of the structure of a convolutional network has remained more an art than a science. The second significant drawback is the training time of convolutional networks. On a typical processor it can vary from several hours to several days, so high-performance GPUs are often used to train the networks. In [<xref ref-type="bibr" rid="ref8">8</xref>] the authors proposed using fast transformation algorithms to construct the structure and topology of multilayer perceptron neural networks. We show that this approach, with some modification, can also be used to construct the structure and topology of convolutional neural networks.</p>
      <p>Fast algorithms for the linear Fourier, Walsh, Haar, and similar transformations are widely known. With fast algorithms, the savings in computational operations grow rapidly with the dimension of the transformation. Since the end of the 20th century, a direction of fast tunable transformations has developed [<xref ref-type="bibr" rid="ref9">9</xref>]; these are essentially neural networks with limited connections and linear activation functions. Training methods have been developed for such neural networks that converge in a finite number of steps. The number of layers in a fast transformation and their configuration are determined by the dimension of the processed images. For a fast algorithm to exist, the dimension of the transformation must be a composite number, and the more factors in the decomposition of the dimension, the higher the computational efficiency of the fast algorithm. Despite the wide variety of fast algorithms, the configurations of their structures satisfy the system invariant of self-similarity [<xref ref-type="bibr" rid="ref10">10</xref>]. Fractals are known to have the same property; therefore, fast algorithms can be interpreted as quasi-fractals. The property of structural fractality allows two tasks to be solved simultaneously: fast data processing and fast training of the transformation.</p>
      <p>In this paper we show that a small modification of the system invariant of fast algorithms leads to convolutional neural network structures. At the same time, it is possible to preserve the fast learning algorithm and to increase the information capacity of network recognition up to the maximum possible, determined by the number of neurons in the output layer of the network. The proposed architecture cannot be called a convolutional network, because the planes of the neural layers use transformations more general than convolution, and there are no pooling layers in the transformation, although the principle of pooling is used in training the network. All neurons have linear activation functions; nonlinear processing nevertheless exists, but it is implemented not through activation functions but by switching the planes of the neural network. To some extent this is similar to the switching semi-linear activation functions of convolutional neural networks. The switching of planes is controlled by the coordinates of neurons in the output plane of the network. We call this class of networks neural networks with controlled switching of planes (CSPNN).</p>
    </sec>
    <sec id="sec-2">
      <title>The topology of Two-Dimensional Fast Tunable</title>
    </sec>
    <sec id="sec-3">
      <title>Transformations</title>
      <p>Let us denote by $F(U_y, U_x)$ an image matrix of dimensionality $N_y \times N_x$. Acting on the image with a linear transformation $H(U_y, U_x; V_y, V_x)$ produces an array of $M_y \times M_x$ coefficients. The two-dimensional transformation is executed by the rule:</p>
      <p>Ny1 Nx1
S Vy ,Vx     F U y ,U x H U y ,U x ;Vy ,Vx  .</p>
      <p>Uy0 Ux0
A necessary condition of the existence of a fast algorithm is the possibility of
multiplicative decomposition of values of input and output dimensionalities of the
transformation to an equal number of multiplicands:
(1)
(2)
(3)
(4)
N y  p0y p1y
Nx  p0x p1x
pny1,
pnx1,</p>
      <p>M y  g0y g1y
M x  g0x g1x
gny1,
gnx1.</p>
      <p>Here the indices $x, y$ denote the coordinate axes of the source image, and the value $n$ defines the number of layers in the graph of the fast algorithm. Using the factors of the decomposition, the coordinates of image points are written in a positional number system with mixed radices:</p>
      <p>$$U_y = u_{n-1}^y u_{n-2}^y \cdots u_1^y u_0^y, \qquad U_x = u_{n-1}^x u_{n-2}^x \cdots u_1^x u_0^x, \qquad (4)$$</p>
      <p>where the weight of the $m$-th position digit is given by the product $p_{m-1}^{*} p_{m-2}^{*} \cdots p_1^{*} p_0^{*}$, and $u_m^{*}$ is a digit variable taking the values $0, \ldots, p_m^{*} - 1$ (the asterisk stands for the indices $x, y$). The coordinates of the spectral coefficients in the plane $(V_y, V_x)$ can be represented similarly:</p>
      <p>$$V_y = v_{n-1}^y v_{n-2}^y \cdots v_1^y v_0^y, \qquad V_x = v_{n-1}^x v_{n-2}^x \cdots v_1^x v_0^x, \qquad (5)$$</p>
      <p>where the weight of the $m$-th position digit is given by $g_{m-1}^{*} g_{m-2}^{*} \cdots g_1^{*} g_0^{*}$, and $v_m^{*}$ is a digit variable taking the values $0, \ldots, g_m^{*} - 1$.</p>
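      <p>The mixed-radix representations (4) and (5) are straightforward to compute. The following sketch (illustrative code, not taken from the paper) converts a coordinate into its digit vector and back for given radices.</p>
      <preformat>
def to_digits(U, radices):
    """Digits [u0, u1, ..., u_{n-1}] of U in the mixed-radix system with radices
    [p0, p1, ..., p_{n-1}]; the weight of digit u_m is p_{m-1} * ... * p0."""
    digits = []
    for p in radices:
        digits.append(U % p)
        U //= p
    return digits

def from_digits(digits, radices):
    """Inverse of to_digits."""
    U, weight = 0, 1
    for u, p in zip(digits, radices):
        U += u * weight
        weight *= p
    return U

p = [2, 2, 2]                       # Ny = 2 * 2 * 2 = 8, as in the 8x8 example below
assert all(from_digits(to_digits(U, p), p) == U for U in range(8))
print(to_digits(6, p))              # [0, 1, 1]: u0 = 0, u1 = 1, u2 = 1
      </preformat>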
      <p>The algorithm of a fast transformation is usually presented in the form of a graph, and different topologies are possible. The digit-by-digit form is convenient for the analytical description of the topology graph of a fast algorithm. For example, the "Cooley–Tukey topology with decimation in time" can be described by the following linguistic sentence [<xref ref-type="bibr" rid="ref10">10</xref>] (topological model):</p>
      <p>$$\left\langle\; u_{n-1}^{*} u_{n-2}^{*} \cdots u_1^{*} u_0^{*},\;\; u_{n-1}^{*} u_{n-2}^{*} \cdots u_1^{*} v_0^{*},\;\; \ldots,\;\; u_{n-1}^{*} \cdots u_{m+1}^{*} u_m^{*}\, v_{m-1}^{*} v_{m-2}^{*} \cdots v_1^{*} v_0^{*},\;\; \ldots,\;\; v_{n-1}^{*} v_{n-2}^{*} \cdots v_1^{*} v_0^{*} \;\right\rangle,$$</p>
      <p>where the words are digit-by-digit representations of coordinate numbers and the letters are the names of digit variables. The number of words in the sentence is $n + 1$. The first and last words correspond to the coordinates of points of the terminal planes given by expressions (4) and (5). The intermediate words define the input coordinates $U_y^m, U_x^m$ and the output coordinates $V_y^m, V_x^m$ in the planes of the inner layers of the fast algorithm. For an algorithm with substitution of values, the following condition is satisfied:</p>
      <p>Uym1  Vym,</p>
      <p>Uxm1  Vxm
Graph of topology contains basic operations in a layer Wixmm,imy umyumx ;vmyvmx  , representing
four-dimensional matrixes of dimensionality  pmy , pmx; gmy , g mx  . Where digit-by-digit
expressions of indexes of kernels of a layer m for the selected topology have viewed:</p>
      <p>This expression is an analytical representation of the system invariant of fast algorithms [<xref ref-type="bibr" rid="ref10">10</xref>]. In general, the topologies for the directions $x$ and $y$ may differ. Connections between the basic operations are defined by the structural model of the fast transformation, in which each node corresponds to a basic operation (hereinafter also called a neural kernel). For the selected topology, the structural model is described by the following linguistic sentence:</p>
      <p>$$\left\langle\; u_{n-1}^{*} u_{n-2}^{*} \cdots u_1^{*},\;\; u_{n-1}^{*} u_{n-2}^{*} \cdots u_2^{*} v_0^{*},\;\; \ldots,\;\; u_{n-1}^{*} \cdots u_{m+1}^{*}\, v_{m-1}^{*} v_{m-2}^{*} \cdots v_0^{*},\;\; \ldots,\;\; v_{n-2}^{*} \cdots v_1^{*} v_0^{*} \;\right\rangle.$$</p>
      <p>Each word of this sentence defines the number $i_m^{*}$ of a basic operation in the layer $m$. The number of words in the sentence is $n$.</p>
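      <p>As a sanity check of the linguistic-sentence notation, the sketch below (illustrative code using the digit conventions introduced above, not taken from the paper) lists, for one axis, the topological word and the structural (kernel-index) word of every layer of a three-layer radix-2 transformation.</p>
      <preformat>
# Digit lists are least significant first: u = [u0, u1, u2], v = [v0, v1, v2].
u = [1, 0, 1]       # digits of an input coordinate U
v = [1, 1, 0]       # digits of an output coordinate V
n = len(u)

# Topological model: n + 1 words; word m is u_{n-1}..u_m v_{m-1}..v_0 (most significant first).
topological = [u[m:][::-1] + v[:m][::-1] for m in range(n + 1)]
# Structural model: n words; word m is the kernel index u_{n-1}..u_{m+1} v_{m-1}..v_0.
structural = [u[m + 1:][::-1] + v[:m][::-1] for m in range(n)]

print(topological)   # first word = digits of U, last word = digits of V
print(structural)    # kernel-index word for layers m = 0, 1, 2
      </preformat>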
      <p>Fig. 1 shows the structural model of a fast two-dimensional transformation of dimensionality $8 \times 8$. The input image enters at the lowest layer, and the spectral coefficients are obtained in the highest layer. The nodes of the model correspond to basic operations (neural kernels) of dimensionality $[2, 2; 2, 2]$. The kernels of layer $m$ execute a two-dimensional transformation over a spatial unit of size $p_m^y \times p_m^x$:</p>
      <p>$$S^m(V_y^m, V_x^m) = \sum_{u_y^m} \sum_{u_x^m} F^m(U_y^m, U_x^m)\, W^m_{i_x^m, i_y^m}(u_y^m u_x^m;\, v_y^m v_x^m). \qquad (6)$$</p>
      <p>Setting specific values for all digit variables $u_m^{*}, v_m^{*}$ (where $m$ runs through the values $0, 1, \ldots, n-1$) defines a path in the topological graph between a pair of nodes of the initial and final layers. From the uniqueness of the digit-by-digit representation of coordinate numbers it follows that this path is unique for each pair of spatial points of the input and output planes. This circumstance yields a convenient analytical expression connecting the array elements of the fast transformation with the elements of the kernels. From expression (1) it follows:</p>
      <p>H U y ,Ux ;Vy ,Vx  
S Vy ,Vx 
F U y ,Ux 
.</p>
      <p>Differentiating by the chain rule for composite functions, we obtain:</p>
      <p>$$H(U_y, U_x; V_y, V_x) = \frac{\partial S^{n-1}}{\partial F^{n-1}} \cdot \frac{\partial F^{n-1}}{\partial S^{n-2}} \cdot \frac{\partial S^{n-2}}{\partial F^{n-2}} \cdots \frac{\partial F^{1}}{\partial S^{0}} \cdot \frac{\partial S^{0}}{\partial F^{0}}. \qquad (8)$$</p>
      <p>From the substitution condition it follows that $\partial F^{m+1} / \partial S^{m} = 1$ for all $m$, and from (6) that $\partial S^{m} / \partial F^{m} = W^m_{i_x^m, i_y^m}(u_y^m u_x^m;\, v_y^m v_x^m)$. Thus, each element of the four-dimensional transformation matrix $H$ is expressed through elements of the kernels as the following product:</p>
      <p>H U y ,U x ;Vy ,Vx   Wixnn11,iyn1 uny1unx1; vny1vnx1 </p>
      <p>Wixnn2 2,iyn2 uny2unx2; vny2vnx2 </p>
      <p>Wix00,i0y u0yu0x ; v0yv0x  ,
(9)
where digit-by-digit expressions of kernel indexes for a layer m for the selected
topology are defined by expression Multiplicative Decomposition of Two-Dimensional
Images.</p>
    </sec>
    <sec id="sec-3">
      <title>Multiplicative Decomposition of Two-Dimensional Images</title>
      <p>The algorithm of multiplicative decomposition is based on the ideas of fractal filtering [<xref ref-type="bibr" rid="ref10">10</xref>] (in the notation of convolutional neural networks this operation corresponds to pooling). In the two-dimensional case, fractal filtering is a multi-scale processing of the image that sequentially shrinks its size down to a single point. The diagram of fractal filtering can be represented as the pyramid shown in Fig. 2.</p>
      <p>F U y,Ux </p>
      <p>F2 U y,Ux </p>
      <p>F1U y,Ux 
The base of the pyramid is the source image, F U y ,Ux  for which arguments U y and
Ux are presented in a radix notation (see expression. In this positional representation,
we will fix all digits except two the lowest u0y and u0x . If to vary these digits on all
possible values, then we will obtain a two-dimensional selection with the size p0y  p0x
. The fractal filter is understood as any functional  , acting on this selection. Formally,
it can be written in the form of the following expression:
The image F1 will be multiply reduced by the sizes with the source image. For example,
the rule of average calculation of selection or its median line can be such functional.
The source image can be now formally presented in the form of a product:
F  uny1un2</p>
      <p>y
 F1  uny1un2
y
u u x
1y 0y , unx1un2
u1y , unx1un2
x
u1xu0x  
u1x  f j0y jx0 u0y , u0x  ,
(10)
(11)
where f j0y j0x u0y , u0x  - is a set of the two-dimensional function factors depending on
digit variables u0y and u0x , and indexes jy0, jx0 selecting a two-dimensional function
from this set. The value of these indexes is set to equal values of arguments of the image
F1 , so that jy0  uny1uny2 u1y and jx0  unx1unx2 u1x . For obtaining the factor
functions, it is enough to execute scalar division of the image F to the image F1 in case of
variation of all digit variables. In turn, the image F1 can also be represented as the
product of the image F2 on factors from the set f j1y j1x u1y , u1x  . Repeating multiply the
operation of fractal filtering and decomposition, we will reach the peak of the pyramid
of images and we will obtain multiplicative decomposition:</p>
      <p>F  uny1un2
y
u u x
1y 0y , unx1un2</p>
      <p>u1xu0x  
f jyn1 jxn1 uny1, unx1  f jyn2 jxn2 uny2 , unx2 
f j1y j1x u1y , u1x  f j0y jx0 u0y , u0x ,
where indexes of multiplicands are defined by expressions:
jxm  unx1un2
x</p>
      <p>x
um1 ,
jym  uny1un2
y</p>
      <p>y
um1 .</p>
      <p>F  uny1un2</p>
      <p>y
1
  F  uny1uny2
u0y ,u0x  
u1y , unx1un2
x</p>
      <p>u1x  
u u x
1y 0y , unx1un2
u1xu0x  .
4</p>
    </sec>
    <sec id="sec-4">
      <title>Attuning of Adapted Transformations</title>
      <p>We call a transformation adapted to an image if one of its basis functions, with coordinates $(V_y, V_x)$, coincides with this image. The scalar product of the image with this function is then maximal among all coefficients of the spectral domain of the transformation; this is the purpose of attuning. The coordinates $(V_y, V_x)$ of this function in the spectral plane will be called the adaptation point.</p>
      <p>Attuning can also be realized for several images. Comparing the obtained multiplicative decomposition of the image with the decomposition of the fast transformation, it is easy to see that they are similar, and the set of kernel indices in each layer covers the set of indices of the factor functions. From this constructive result it follows that the fast transformation is attuned to the image when the transformation kernels are attuned to the factor functions. Attuning of the transformation kernels is defined by the rule:</p>
      <sec id="sec-4-1">
        <title>Wixmm,iym umyumx ;vmyvmx   f jxm jmy umy ,umx  .</title>
        <p>(12)
Comparing expression for ixm ,iym and for jxm, jym , it is possible to obtain the result
conclusion that quantity of components in the multiplicative expansion of the image and
quantity of kernels of transformation coincide for a layer m  0 (thus it takes place
following equals ix0  jx0 and iy0  jy0 ), and less number of kernels for all remaining
layers. Therefore, in the case of attuning a part of degrees of freedom of a
transformation is not used. Digit variables v0y , v0x are freely variational variables; therefore, the
kernel may be attuned to D  g0y g0x images. The remaining layers have a bigger number
of degrees of freedom and cannot worsen this value. Thus, it is possible to conclude
that a fast transformation cannot adapt more than to D different images. On this, the
opportunities of this algorithm of attuning are exhausted. Value D let us call as the
level of transformation attuning.
5</p>
      </sec>
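      <p>The counting argument above can be illustrated numerically. The sketch below (illustrative code based on the digit-word definitions introduced earlier) compares the number of kernels with the number of factor functions in each layer for one axis of the 8x8 example.</p>
      <preformat>
from math import prod

p = [2, 2, 2]          # input radices for one axis (N = 8)
g = [2, 2, 2]          # output radices for one axis (M = 8)
n = len(p)

for m in range(n):
    n_kernels = prod(p[m + 1:]) * prod(g[:m])   # possible kernel indices i_m
    n_factors = prod(p[m + 1:])                 # possible factor indices j_m
    print(f"layer {m}: kernels = {n_kernels}, factor functions = {n_factors}")
# Only layer 0 has equal counts; the higher layers have spare kernels,
# i.e. unused degrees of freedom after attuning to a single image.

# Attuning level of the two-dimensional transformation: D = g0^y * g0^x images.
D = g[0] * g[0]
print("attuning level D =", D)
      </preformat>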
    </sec>
    <sec id="sec-5">
      <title>Regular Neural Networks with Additional Planes</title>
      <p>Remaining within the considered topology, the degrees of freedom not used in attuning can be assigned in several ways (see [<xref ref-type="bibr" rid="ref10">10</xref>] for details). In this case the level of attuning does not change, but the remaining transformation functions do.</p>
      <p>Let us consider an alternative solution, which consists in extending the topology with additional planes so that the remaining degrees of freedom are used to increase the level of attuning. The number of computational operations in the new topology increases, but the structural regularity of the network is preserved.</p>
      <p>First, let us specify the rule for choosing the adapted kernels in the former topology. Let us write the coordinates of the adaptation point in radix notation, denoting the digit variables by $y$ and $x$:</p>
      <p>$$V_y = y_{n-1} y_{n-2} \cdots y_0, \qquad V_x = x_{n-1} x_{n-2} \cdots x_0. \qquad (13)$$</p>
      <p>The fixed values of the digit variables $y_m, x_m$ correspond to the variables $v_m^y, v_m^x$; therefore, when this adaptation point is chosen, only the kernels with the numbers</p>
      <p>$$i_y^m = u_{n-1}^y u_{n-2}^y \cdots u_{m+1}^y\, y_{m-1} y_{m-2} \cdots y_0, \qquad i_x^m = u_{n-1}^x u_{n-2}^x \cdots u_{m+1}^x\, x_{m-1} x_{m-2} \cdots x_0$$</p>
      <p>need to be adapted according to rule (12). In particular, for $m = 0$ we have $i_y^0 = u_{n-1}^y u_{n-2}^y \cdots u_1^y$ and $i_x^0 = u_{n-1}^x u_{n-2}^x \cdots u_1^x$, i.e. the indices run over all kernels of the zero layer.</p>
      <p>Thus, irrespective of the choice of adaptation points, all kernels of the zero layer must always be adapted. At the same time, the level of transformation attuning remains restricted to the value $D$.</p>
      <p>To increase the level of transformation attuning, we introduce additional planes that copy the structure of the main plane in each layer. The ordinal number of an additional plane within a layer (denoted here by $\pi_m$) is determined by the rule:</p>
      <p>$$\pi_m = \left( x_{n-1} x_{n-2} \cdots x_{m+1},\; y_{n-1} y_{n-2} \cdots y_{m+1} \right).$$</p>
      <p>The maximum number of additional planes occurs in the zero layer; as the layer number grows, the number of additional planes decreases, and in the last layer $\pi_{n-1}$ is empty, i.e. there are no additional planes. Thus, in the new topology the plane of the last layer remains the same, while additional planes appear in the lower layers.</p>
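      <p>The plane-numbering rule above implies a simple count of planes per layer. The sketch below (illustrative code using the notation introduced here) evaluates it for the 8x8 radix-2 example.</p>
      <preformat>
from math import prod

gy = [2, 2, 2]          # output radices along y (My = 8)
gx = [2, 2, 2]          # output radices along x (Mx = 8)
n = len(gy)

for m in range(n):
    # Number of possible plane numbers pi_m = (x_{n-1}..x_{m+1}, y_{n-1}..y_{m+1}).
    planes = prod(gx[m + 1:]) * prod(gy[m + 1:])
    print(f"layer {m}: {planes} plane(s)")
# Prints 16, 4, 1: the zero layer has the most planes, the last layer keeps a single plane.
      </preformat>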
      <p>The architecture of the neural network with additional planes is shown in Fig. 3. The input image is fed simultaneously to all planes of the input layer. The layers are separated by switchboards controlled by the position digits of the coordinate numbers of the output class.</p>
      <p>(Fig. 3: the layers are separated by Switchboard 0 and Switchboard 1, controlled by the digit groups $(x_2 x_1, y_2 y_1)$ and $(x_2, y_2)$ of the output coordinates $V_y = y_2 y_1 y_0$, $V_x = x_2 x_1 x_0$.)</p>
      <p>Since the rule for generating the new planes does not contradict the substitution condition, the former rule can be used for attuning the transformation kernels, with an additional index selecting the plane:</p>
      <sec id="sec-5-1">
        <title>Wixmm,iym  m umyumx;vmy vmx   f jxm jym umy ,umx  .</title>
        <p>k
Image
```
`
x  x2x1x0 
The index k in the right part enumerates an adaptation point. For m  0 we have
ix0  jx0 and iy0  jy0 , here variational variables are the number of plane
 0  xn1xn2 x1, yn1 yn2 y1 and digit variables v0y , v0x . Together, they cover the
full coordinate range of the output plane. Possible index values k correspond to this
range. The remaining layers do not impair the level of transformation attuning. Thus,
the transformation with additional planes can be adapted to D  M y  M x images, i.e.
each point of the output plane will exactly correspond to one image of the learning set.</p>
        <p>If the image corresponds to one of the adaptation points, the value of this spectral plane
coefficient will be maximal. The transformation result for above the image of the digit
"0" is shown in Fig. 5. It is seen that the coefficients corresponding to the subclasses of
the number “0” have maximum values.</p>
      </sec>
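      <p>Returning to the switchboard control described above, the following sketch (illustrative code under the same notation) shows which plane is selected in every layer for a given point of the output plane in the 8x8 radix-2 example.</p>
      <preformat>
n = 3

def digs(V):
    """Digits [v0, v1, v2] of a coordinate, least significant first."""
    return [(V // 2**m) % 2 for m in range(n)]

def selected_planes(Vy, Vx):
    """Plane chosen in each layer m: the digit groups x_{n-1}..x_{m+1}, y_{n-1}..y_{m+1}."""
    y, x = digs(Vy), digs(Vx)
    return [(tuple(x[m + 1:][::-1]), tuple(y[m + 1:][::-1])) for m in range(n)]

# For the output point (Vy, Vx) = (5, 3): layer 0 uses plane ((x2, x1), (y2, y1)),
# layer 1 uses plane ((x2,), (y2,)), and layer 2 uses the single main plane ((), ()).
print(selected_planes(5, 3))
      </preformat>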
    </sec>
    <sec id="sec-6">
      <title>Ordering of the Adaptation Points</title>
      <p>For each $k$-th adaptation point in the range $k \in [0, D-1]$, values must be assigned to the digit variables $y_m, x_m$. A one-to-one correspondence $k \leftrightarrow (V_y, V_x)$ defines a rule for ordering the adaptation points in the output plane. Let us consider some typical variants. Recall that the digit-by-digit representation of the coordinates is defined by expressions (13). The task of ordering is to establish a correspondence between the ordinal number $k$ and the digit variables $y_i, x_i$. We assume that moving along a column of the spectral plane corresponds to changing the coordinate $V_y$, and moving along a row to changing the coordinate $V_x$.</p>
      <sec id="sec-6-1">
        <title>Ordering Along Columns</title>
        <p>The ordering algorithm can be specified by the following representation of the sequence number:</p>
        <p>$$k = x_{n-1} x_{n-2} \cdots x_0\; y_{n-1} y_{n-2} \cdots y_0.$$</p>
        <p>In this expression the digit $y_0$ is the lowest, so as the number $k$ increases, the digits $y_i$ change first, and as a result the adaptation points are placed along the columns. Fig. 4 shows the variant of ordering in which classes are placed along columns and subclasses along rows.</p>
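        <p>A minimal sketch of this ordering (illustrative code): with the $y$-digits in the lowest positions, the sequence number reduces to $k = V_y + M_y V_x$, so consecutive values of $k$ walk down a column.</p>
        <preformat>
My, Mx = 8, 8

def point_of(k):
    """Adaptation point (Vy, Vx) of the k-th image under column ordering."""
    return k % My, k // My            # Vy changes first, so the points fill columns

for k in range(0, 17, 4):
    print(k, point_of(k))             # (0,0), (4,0), (0,1), (4,1), (0,2)
        </preformat>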
      </sec>
      <sec id="sec-6-2">
        <title>Ordering Along Rows</title>
        <p>The ordering algorithm can be specified by the following representation of the sequence number:</p>
        <p>$$k = y_{n-1} y_{n-2} \cdots y_1 y_0\; x_{n-1} x_{n-2} \cdots x_1 x_0.$$</p>
        <p>In this expression the digit $x_0$ is the lowest, so as the number $k$ increases, the digits $x_i$ change first, and as a result the adaptation points are placed along the rows. Fig. 6 shows a variant of implementing the transformation with the adaptation points ordered along rows.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Ordering Along Circular Segments</title>
        <p>The ordering algorithm can also be specified by the interleaved representation:</p>
        <p>$$k = x_{n-1} y_{n-1} x_{n-2} y_{n-2} \cdots x_1 y_1 x_0 y_0.$$</p>
        <p>In this case, the spectral plane is filled with increasing values of $k$ when moving clockwise along "circular" segments. Fig. 7 shows a variant of implementing the transformation with the basis functions ordered along circular segments.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>Regular tunable transformations have the unique property of admitting an analytical representation of the topology of the implementing network, which makes it possible to develop learning algorithms that converge in a finite number of steps. It has been shown that the implementing topology is easily extended with additional planes, so that the number of recognized images increases dramatically and covers all elements of the output plane. Moreover, the topology extension does not violate the principle on which the training algorithm is built. The constructed topology is ideologically close to the topology of convolutional deep learning networks [<xref ref-type="bibr" rid="ref2">2</xref>] but is regular. The presented solution provides a constructive answer to two fundamental questions of deep learning neural networks: how to choose a topology and how to reduce the training time of the network.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Fukushima</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miyake</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ito</surname>
            <given-names>T.</given-names>
          </string-name>
          <article-title>Neocognitron: A neural network model for a mechanism of visual pattern recognition</article-title>
          .
          <source>IEEE Transaction on Systems, Man and Cybernetics</source>
          SMC-
          <volume>13</volume>
          (
          <issue>5</issue>
          ):
          <fpage>826</fpage>
          -
          <lpage>834</lpage>
          .
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Denker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hubbard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Jackel</surname>
          </string-name>
          : Backpropagation Applied to Handwritten Zip Code Recognition,
          <source>Neural Computation</source>
          ,
          <volume>1</volume>
          (
          <issue>4</issue>
          ):
          <fpage>541</fpage>
          -
          <lpage>551</lpage>
          ,
          <year>Winter 1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Springenberg</surname>
          </string-name>
          , Jost Tobias; Dosovitskiy, Alexey; Brox, Thomas &amp; Riedmiller,
          <string-name>
            <surname>Martin</surname>
          </string-name>
          (
          <year>2014</year>
          - 12-21),
          <article-title>"Striving for Simplicity: The All Convolutional Net"</article-title>
          , arXiv:
          <fpage>1412</fpage>
          .6806 https://arxiv.org/pdf/1412.6806.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Romanuke</surname>
          </string-name>
          , Vadim.
          <article-title>Appropriate number and allocation of ReLUs in convolutional neural networks</article-title>
          .
          <source>Research Bulletin of NTUU “Kyiv Polytechnic Institute”</source>
          , 2017, Vol.
          <volume>1</volume>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>X.</given-names>
            <surname>Glorot</surname>
          </string-name>
          , A. Bordes, Y. Bengio.
          <article-title>Deep Sparse Rectifier Neural Networks</article-title>
          .
          <source>Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&amp;CP 15</source>
          , 2011.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>V.</given-names>
            <surname>Nair</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Rectified linear units improve restricted boltzmann machines</article-title>
          .
          <source>In Proc. 27th. International Conference on Machine Learning</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Maas</surname>
          </string-name>
          , Andrew L.;
          <string-name>
            <surname>Hannun</surname>
          </string-name>
          , Awni Y.;
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>Andrew Y.</given-names>
          </string-name>
          (
          <year>June 2013</year>
          ).
          <article-title>"Rectifier nonlinearities improve neural network acoustic models" (PDF)</article-title>
          .
          <source>Proc. ICML</source>
          .
          <volume>30</volume>
          (
          <issue>1</issue>
          ).
          <source>Retrieved 2 January</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dorogov</surname>
            <given-names>A. Yu.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alekseev</surname>
            <given-names>A. A.</given-names>
          </string-name>
          .
          <article-title>Mathematical models of fast neural networks</article-title>
          .
          <source>In: Collection of scientific works of SPbGETU “Information management and processing systems”, Issue</source>
          <volume>490</volume>
          ,
          <year>1996</year>
          , p.
          <fpage>79</fpage>
          -
          <lpage>84</lpage>
          . In Russian.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Solodovnikov</surname>
            <given-names>A. I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spivakovsky</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <article-title>Fundamentals of the theory and methods of spectral information processing</article-title>
          . Textbook.
          <source>Leningrad: Publishing House of Leningrad State University</source>
          ,
          <year>1986</year>
          . 272 p. In Russian.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Dorogov</surname>
            <given-names>A. Yu.</given-names>
          </string-name>
          .
          <article-title>Theory and design of fast tunable transformations and weakly connected neural networks</article-title>
          .
          <source>SPb.: "Polytechnic"</source>
          ,
          <year>2014</year>
          . 328 pp. In Russian. http://dorogov.su/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>