Deep Learning Neural Networks with Controlled Switching of Neural Planes*

Alexander Yu. Dorogov [0000-0002-7596-6761]
Saint Petersburg Electrotechnical University "LETI", Saint Petersburg, Russia
vaksa2006@yandex.ru

* Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. An algorithm for constructing the network topology and training two-dimensional fast neural networks with additional switched planes is considered. It is noted that the structure of fast neural networks has a fractal nature. The constructed topology is ideologically close to the topology of convolutional deep learning neural networks, but it is regular, with the number of layers determined by the factored representation of the dimensions of the image and of the output plane of classes. The learning algorithm has an analytical representation; it is stable and converges in a finite number of steps. The additional planes extend the information capacity of the tunable transformation to the maximum possible value. Control of the planes in the training and processing modes is realized by the numerical coordinate codes of the output plane. The architecture of a regular neural network with additional planes is presented, variants of image ordering in the output plane are considered, and examples are given.

Keywords: fast tunable transformation; neural network; learning; convolutional neural network; bitmap; planes of neural layers

1 Introduction

Deep learning technology involves a process of building up the complexity of informative features along a sequence of neural layers. Starting with the neocognitron of K. Fukushima [1], several variants of realizing this idea have been proposed. One of the successful solutions is the architecture of convolutional neural networks [2], which has shown high efficiency in solving various problems. A distinctive feature of this architecture is the presence in the convolutional layers of several data-processing channels (called maps or planes). In each plane, the output image of the previous layer is convolved with a fixed kernel of small dimensions. Convolutional layers alternate with pooling layers that reduce the dimension of the feature space by an integer factor. The pooling layers are optional, and there are variants that eliminate them from the network architecture entirely [3]. The second distinctive feature of the architecture is the use of semi-linear activation functions that act as switching keys controlled by the values of the hidden-layer variables [4, 5, 6, 7].

A disadvantage of the convolutional network architecture is the lack of theoretically justified methods for choosing the network structure and the convolution kernel parameters. There are several well-functioning network configurations for specific tasks, but it is not clear how to build a network for a new task; until now, the choice of the structure of a convolutional network has remained an art rather than a science. The second significant drawback is related to the training time of convolutional networks. On a typical processor, the time can vary from several hours to several days, so high-performance GPUs are often used to train networks. In [8] the authors proposed the idea of using fast transformation algorithms to construct the structure and topology of multilayer perceptron neural networks.
We show that this approach, with some modification, can also be used to construct the structure and topology of convolutional neural networks.

At present, fast algorithms for the linear Fourier, Walsh, Haar, and similar transformations are widely known. With fast algorithms, the gain in the number of computational operations grows rapidly with increasing transformation dimension. Since the end of the 20th century, the direction of fast tunable transformations has developed [9]; these are essentially neural networks with limited connections and linear activation functions, and training methods have been developed for them that converge in a finite number of steps. The number of layers in a fast transformation and their configuration are determined by the dimensions of the processed images. To construct a fast algorithm, the dimension of the transformation must be a composite number, and the more factors in the decomposition of the dimension, the higher the computational efficiency of the fast algorithm. Despite the wide variety of fast algorithms, the configurations of their structures satisfy a system invariant of self-similarity [10]. Fractals are known to have the same property; therefore, fast algorithms can be interpreted as quasi-fractals. The property of structural fractality allows solving two tasks simultaneously: fast data processing and fast training of the transformation.

In this paper we show that a small modification of the system invariant of fast algorithms leads to convolutional neural network structures. At the same time, it is possible to preserve the fast learning algorithm and to increase the information capacity of the network up to the maximum possible value, determined by the number of neurons in the output layer of the network. The proposed architecture cannot be called a convolutional network, because the planes of the neural layers use transformations more general than convolution, and there are no pooling layers, although the principle of pooling is used in training the network. All neurons have linear activation functions; nevertheless, non-linear processing is present, implemented not by the activation functions but by switching the planes of the neural network. To some extent this is similar to the switching semi-linear activation functions of convolutional neural networks. The switching of the planes is controlled by the coordinates of the neurons in the output plane of the network. We call this class of networks neural networks with controlled switching of planes (CSPNN).

2 The Topology of Two-Dimensional Fast Tunable Transformations

Let us denote by F(U_y, U_x) an image matrix of dimensionality N_y \times N_x. The action of a linear transformation H(U_y, U_x; V_y, V_x) on the image produces an array of M_y \times M_x coefficients. The two-dimensional transformation is executed by the rule:

S(V_y, V_x) = \sum_{U_y=0}^{N_y-1} \sum_{U_x=0}^{N_x-1} F(U_y, U_x) \, H(U_y, U_x; V_y, V_x).    (1)

A necessary condition for the existence of a fast algorithm is the possibility of a multiplicative decomposition of the input and output dimensionalities of the transformation into an equal number of factors:

N_y = p_0^y p_1^y \cdots p_{n-1}^y,  M_y = g_0^y g_1^y \cdots g_{n-1}^y,
N_x = p_0^x p_1^x \cdots p_{n-1}^x,  M_x = g_0^x g_1^x \cdots g_{n-1}^x.    (2)

Here the indexes x, y denote the coordinate axes of the source image, and the value n defines the number of layers in the graph of the fast algorithm.
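As a point of reference (this code is not from the paper; the array names and the use of NumPy are illustrative assumptions), the direct transformation of Eq. (1) can be written in a few lines; the factorizations of Eq. (2) are what the fast algorithm exploits to avoid this full-cost computation:

```python
import numpy as np

def direct_transform(F, H):
    """Direct evaluation of Eq. (1): S[Vy, Vx] = sum over (Uy, Ux) of F[Uy, Ux] * H[Uy, Ux, Vy, Vx].

    F has shape (Ny, Nx) and H has shape (Ny, Nx, My, Mx); the cost is
    O(Ny * Nx * My * Mx), which the factorized fast algorithm avoids.
    """
    return np.einsum('yx,yxvw->vw', F, H)

# Dimensionalities factored as in Eq. (2): Ny = Nx = My = Mx = 2 * 2 * 2 = 8, so n = 3 layers.
rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8))
H = rng.standard_normal((8, 8, 8, 8))
S = direct_transform(F, H)
print(S.shape)  # (8, 8) spectral plane of coefficients
```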
Using the factors of the decomposition, the coordinates of the image points can be written in a positional number system with mixed radices:

U_y = u_{n-1}^y u_{n-2}^y \cdots u_1^y u_0^y,
U_x = u_{n-1}^x u_{n-2}^x \cdots u_1^x u_0^x,    (3)

where the weight of the m-th digit position is defined by the product p_{m-1}^* p_{m-2}^* \cdots p_1^* p_0^*, and u_m^* is a digit variable taking values 0, ..., p_m^* - 1 (the asterisk here stands for the index x or y). The coordinates of the spectral coefficients in the plane (V_y, V_x) can be represented similarly:

V_y = v_{n-1}^y v_{n-2}^y \cdots v_1^y v_0^y,
V_x = v_{n-1}^x v_{n-2}^x \cdots v_1^x v_0^x,    (4)

where the weight of the m-th digit position is defined by the product g_{m-1}^* g_{m-2}^* \cdots g_1^* g_0^*, and v_m^* is a digit variable taking values 0, ..., g_m^* - 1.

A fast transformation algorithm is usually presented in the form of a graph, which may have various topologies. The digit-by-digit form is convenient for the analytical description of the topology graph of a fast algorithm. For example, the "Cooley–Tukey topology with decimation in time" can be described by the following linguistic sentence [10] (topological model):

( u_{n-1}^* u_{n-2}^* \cdots u_1^* u_0^*,\; u_{n-1}^* u_{n-2}^* \cdots u_1^* v_0^*,\; \ldots,\; u_{n-1}^* u_{n-2}^* \cdots u_{m+1}^* u_m^* v_{m-1}^* v_{m-2}^* \cdots v_0^*,\; \ldots,\; v_{n-1}^* v_{n-2}^* \cdots v_1^* v_0^* ),

where the words are digit-by-digit representations of coordinate numbers, and the letters are the names of digit variables. The number of words in the sentence is n + 1. The first and last words correspond to the coordinates of points of the terminal planes given by expressions (3) and (4). The intermediate words define the input coordinates (U_y^m, U_x^m) and the output coordinates (V_y^m, V_x^m) in the planes of the inner layers of the fast algorithm. For an in-place algorithm (with substitution of values), the following condition is satisfied:

U_y^{m+1} = V_y^m,  U_x^{m+1} = V_x^m.    (5)

The topology graph contains basic operations of layer m, W_{i_x^m, i_y^m}^m [u_m^y u_m^x; v_m^y v_m^x], which are four-dimensional matrices of dimensionality [p_m^y, p_m^x; g_m^y, g_m^x]. The digit-by-digit expressions of the kernel indexes of layer m for the selected topology have the form:

i_x^m = u_{n-1}^x u_{n-2}^x \cdots u_{m+1}^x v_{m-1}^x v_{m-2}^x \cdots v_0^x,
i_y^m = u_{n-1}^y u_{n-2}^y \cdots u_{m+1}^y v_{m-1}^y v_{m-2}^y \cdots v_0^y.    (6)

Expression (6) is an analytical representation of the system invariant of fast algorithms [10]. In general, the topologies for the x and y directions may be different. The connections between basic operations are defined by the structural model of the fast transformation, in which each node corresponds to a basic operation (hereinafter also called a neural kernel). For the selected topology, the structural model is described by the following linguistic sentence:

( u_{n-1}^* u_{n-2}^* \cdots u_1^*,\; u_{n-1}^* u_{n-2}^* \cdots u_2^* v_0^*,\; \ldots,\; u_{n-1}^* u_{n-2}^* \cdots u_{m+1}^* v_{m-1}^* v_{m-2}^* \cdots v_0^*,\; \ldots,\; v_{n-2}^* v_{n-3}^* \cdots v_1^* v_0^* ).

Each word of this sentence defines the numbers i_*^m of the basic operations in layer m. The number of words in the sentence is n.

Fig. 1 shows the structural model of a fast two-dimensional transformation for dimensionality 8 \times 8. The input image enters at the lowest layer, and the spectral coefficients are obtained in the highest layer. The nodes of the model correspond to basic operations (neural kernels) of dimensionality [2, 2; 2, 2]. The kernels of layer m execute a two-dimensional transformation over a spatial block of size p_m^y \times p_m^x:

S^m(V_y^m, V_x^m) = \sum_{u_m^y, u_m^x} F^m(U_y^m, U_x^m) \, W_{i_x^m, i_y^m}^m [u_m^y u_m^x; v_m^y v_m^x].    (7)

Fig. 1. Structural model of the fast two-dimensional transformation for dimensionality 8 \times 8
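A minimal sketch (not from the paper; the helper names and the digit-ordering conventions are assumptions made for illustration) of how the mixed-radix digits of Eqs. (3)-(4) and the kernel index words of Eq. (6) can be computed:

```python
def to_mixed_radix(value, radices):
    """Digits (d_{n-1}, ..., d_1, d_0) of `value` in the mixed-radix system of Eqs. (3)-(4).

    `radices` is (p_0, p_1, ..., p_{n-1}); the weight of digit position m is p_{m-1} * ... * p_0.
    """
    digits = []
    for r in radices:              # peel off digits starting from the least significant one
        digits.append(value % r)
        value //= r
    return digits[::-1]            # most significant digit first

def kernel_index(u_digits, v_digits, m):
    """Kernel index word of Eq. (6) for one axis and layer m (Cooley-Tukey, decimation in time).

    u_digits and v_digits are most-significant-first digit lists of equal length n;
    the index word is u_{n-1} ... u_{m+1} v_{m-1} ... v_0, i.e. digit position m is skipped.
    """
    n = len(u_digits)
    high = u_digits[:n - 1 - m]              # u_{n-1}, ..., u_{m+1}
    low = v_digits[n - m:] if m > 0 else []  # v_{m-1}, ..., v_0
    return high + low

# Example for N = 2 * 2 * 2 = 8 (n = 3): the coordinate U = 6 has digits (u_2, u_1, u_0) = (1, 1, 0).
print(to_mixed_radix(6, (2, 2, 2)))             # [1, 1, 0]
print(kernel_index([1, 1, 0], [0, 1, 1], m=1))  # [1, 1], i.e. the word u_2 v_0
```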
3 Multiplicative Decomposition of the Fast Transformation

Setting specific values of all digit variables u_m^*, v_m^* (where m runs over the values 0, 1, ..., n - 1) defines a certain path in the topological graph between a pair of nodes of the initial and the final layer. From the uniqueness of the digit-by-digit representation of coordinate numbers it follows that such a path is unique for each pair of spatial points of the input and output planes. This circumstance allows obtaining a convenient analytical expression connecting the array elements of the fast transformation with the elements of the kernels. From expression (1) it follows that:

H(U_y, U_x; V_y, V_x) = \frac{\partial S(V_y, V_x)}{\partial F(U_y, U_x)}.    (8)

Differentiating by the chain rule for composite functions, we obtain:

H(U_y, U_x; V_y, V_x) = \frac{\partial S^{n-1}}{\partial F^{n-1}} \cdot \frac{\partial F^{n-1}}{\partial S^{n-2}} \cdot \frac{\partial S^{n-2}}{\partial F^{n-2}} \cdots \frac{\partial F^{1}}{\partial S^{0}} \cdot \frac{\partial S^{0}}{\partial F^{0}}.

From condition (5) it follows that for all m the equalities \partial F^{m+1} / \partial S^{m} = 1 hold, and from (7) that \partial S^{m} / \partial F^{m} = W_{i_x^m, i_y^m}^m [u_m^y u_m^x; v_m^y v_m^x]. Thus, each element of the four-dimensional transformation matrix H is expressed through the elements of the kernels as the following product:

H(U_y, U_x; V_y, V_x) = W_{i_x^{n-1}, i_y^{n-1}}^{n-1} [u_{n-1}^y u_{n-1}^x; v_{n-1}^y v_{n-1}^x] \cdot W_{i_x^{n-2}, i_y^{n-2}}^{n-2} [u_{n-2}^y u_{n-2}^x; v_{n-2}^y v_{n-2}^x] \cdots W_{i_x^0, i_y^0}^{0} [u_0^y u_0^x; v_0^y v_0^x],    (9)

where the digit-by-digit expressions of the kernel indexes of layer m for the selected topology are defined by expression (6).

Multiplicative Decomposition of Two-Dimensional Images. The algorithm of multiplicative decomposition is based on the idea of fractal filtering [10] (in the terminology of convolutional neural networks this operation corresponds to pooling). In the two-dimensional case, fractal filtering is a multi-scale processing of the image that sequentially shrinks its size down to a single point. The scheme of fractal filtering can be presented in the form of the pyramid shown in Fig. 2.

Fig. 2. The scheme of fractal image filtering: the pyramid of images F(U_y, U_x), F_1(U_y, U_x), F_2(U_y, U_x), ...

The base of the pyramid is the source image F(U_y, U_x), whose arguments U_y and U_x are written in the mixed-radix notation (3). In this positional representation, we fix all digits except the two lowest ones, u_0^y and u_0^x. If these digits are varied over all possible values, a two-dimensional selection of size p_0^y \times p_0^x is obtained. A fractal filter is any functional \Phi acting on this selection. Formally, it can be written in the form:

F_1(u_{n-1}^y u_{n-2}^y \cdots u_1^y,\; u_{n-1}^x u_{n-2}^x \cdots u_1^x) = \Phi_{u_0^y, u_0^x} [ F(u_{n-1}^y u_{n-2}^y \cdots u_1^y u_0^y,\; u_{n-1}^x u_{n-2}^x \cdots u_1^x u_0^x) ].

The image F_1 is reduced in size by an integer factor relative to the source image. The rule of computing the average of the selection, or its median, can serve as such a functional, for example. The source image can now be formally presented in the form of a product:

F(u_{n-1}^y u_{n-2}^y \cdots u_1^y u_0^y,\; u_{n-1}^x u_{n-2}^x \cdots u_1^x u_0^x) = F_1(u_{n-1}^y u_{n-2}^y \cdots u_1^y,\; u_{n-1}^x u_{n-2}^x \cdots u_1^x) \, f_{j_y^0 j_x^0}(u_0^y, u_0^x),

where f_{j_y^0 j_x^0}(u_0^y, u_0^x) is a set of two-dimensional factor functions depending on the digit variables u_0^y and u_0^x, with the indexes j_y^0, j_x^0 selecting a two-dimensional function from this set. The values of these indexes are set equal to the values of the arguments of the image F_1, so that j_y^0 = u_{n-1}^y u_{n-2}^y \cdots u_1^y and j_x^0 = u_{n-1}^x u_{n-2}^x \cdots u_1^x.
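A minimal sketch of one level of this construction (not the author's code; it assumes block averaging as the functional \Phi and a strictly positive image so that the pointwise division producing the factor functions is well defined; all names are illustrative):

```python
import numpy as np

def fractal_filter_step(F, py, px):
    """One level of fractal filtering with block averaging as the functional Phi.

    F is reshaped into blocks of size (py, px); averaging each block gives the
    reduced image F1 (the next pyramid level), and pointwise division F / F1
    gives the factor functions f_{jy jx}(u0y, u0x) indexed by the coordinates of F1.
    """
    Ny, Nx = F.shape
    blocks = F.reshape(Ny // py, py, Nx // px, px)   # splits off the lowest digits u0y, u0x
    F1 = blocks.mean(axis=(1, 3))                    # reduced image
    factors = blocks / F1[:, None, :, None]          # factors[jy, u0y, jx, u0x]
    return F1, factors

# Toy example: Ny = Nx = 12 with first radices p0y = 4, p0x = 3, cf. Eq. (2).
rng = np.random.default_rng(1)
F = rng.uniform(0.5, 1.5, size=(12, 12))             # positive image, so division is safe
F1, factors = fractal_filter_step(F, 4, 3)
# For this choice of Phi the decomposition F = F1 * f holds exactly:
assert np.allclose(F, (F1[:, None, :, None] * factors).reshape(F.shape))
```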
To obtain the factor functions, it is enough to perform element-wise division of the image F by the image F_1 while varying all digit variables. In turn, the image F_1 can itself be represented as the product of the image F_2 and factors from the set f_{j_y^1 j_x^1}(u_1^y, u_1^x). Repeating the operations of fractal filtering and decomposition, we reach the top of the pyramid of images and obtain the multiplicative decomposition:

F(u_{n-1}^y u_{n-2}^y \cdots u_1^y u_0^y,\; u_{n-1}^x u_{n-2}^x \cdots u_1^x u_0^x) = f_{j_y^{n-1} j_x^{n-1}}(u_{n-1}^y, u_{n-1}^x) \, f_{j_y^{n-2} j_x^{n-2}}(u_{n-2}^y, u_{n-2}^x) \cdots f_{j_y^1 j_x^1}(u_1^y, u_1^x) \, f_{j_y^0 j_x^0}(u_0^y, u_0^x),    (10)

where the indexes of the factors are defined by the expressions:

j_x^m = u_{n-1}^x u_{n-2}^x \cdots u_{m+1}^x,  j_y^m = u_{n-1}^y u_{n-2}^y \cdots u_{m+1}^y.    (11)

4 Attuning of Adapted Transformations

We call a transformation adapted to an image if one of its basis functions, with coordinates (V_y, V_x), coincides with this image. The value of the scalar product of the image with this function will then be maximal among the coefficients of the spectral plane of the transformation; this is the purpose of attuning the transformation. The coordinate value (V_y, V_x) of this function in the spectral plane will be called the adaptation point. Attuning can also be performed for several images.

If the obtained multiplicative decomposition of the image (10) is compared with the decomposition of the fast transformation (9), it is easy to note that they are similar, and the set of kernel indexes in each layer covers the set of indexes of the factor functions. From this constructive result it follows that the fast transformation will be attuned to the image when the transformation kernels are attuned to the factor functions. Attuning of the transformation kernels is defined by the rule:

W_{i_x^m, i_y^m}^m [u_m^y u_m^x; v_m^y v_m^x] = f_{j_y^m j_x^m}(u_m^y, u_m^x).    (12)

Comparing expression (6) for i_x^m, i_y^m with expression (11) for j_x^m, j_y^m, one can conclude that the number of components in the multiplicative expansion of the image and the number of transformation kernels coincide for layer m = 0 (where the equalities i_x^0 = j_x^0 and i_y^0 = j_y^0 hold), while in all remaining layers the number of factor functions is smaller than the number of kernels. Therefore, in the case of attuning, a part of the degrees of freedom of the transformation is not used. The digit variables v_0^y, v_0^x are free variables; therefore, a kernel of the zero layer can be attuned to D = g_0^y g_0^x images. The remaining layers have a larger number of degrees of freedom and cannot worsen this value. Thus, a fast transformation cannot be adapted to more than D different images; at this point the possibilities of this attuning algorithm are exhausted. We call the value D the attuning level of the transformation.
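A minimal sketch of the attuning rule (12) for the zero layer, where the kernel indexes and factor indexes coincide (the array layouts, the helper name, and the choice of the free digits v_0^y, v_0^x are illustrative assumptions, not the author's code):

```python
import numpy as np

def attune_zero_layer(kernels, factors, v0y, v0x):
    """Attune the zero-layer kernels to the factor functions per Eq. (12).

    For m = 0 the kernel indexes and factor indexes coincide (i^0 = j^0), so every
    kernel receives the factor with the same index; the chosen output digits
    (v0y, v0x) select which output column of the kernel stores the adapted values.

    kernels : (Jy, Jx, p0y, p0x, g0y, g0x) array, kernels[iy, ix, uy, ux, vy, vx]
    factors : (Jy, Jx, p0y, p0x) array,           factors[jy, jx, uy, ux]
    """
    kernels[:, :, :, :, v0y, v0x] = factors
    return kernels

# Toy zero layer of an 8x8 transform with p0y = p0x = g0y = g0x = 2 (4 x 4 kernel index values).
rng = np.random.default_rng(2)
factors = rng.uniform(0.5, 1.5, size=(4, 4, 2, 2))   # f_{j0y j0x}(u0y, u0x)
kernels = np.zeros((4, 4, 2, 2, 2, 2))
kernels = attune_zero_layer(kernels, factors, v0y=0, v0x=0)
# Since v0y, v0x are free digits, up to D = g0y * g0x = 4 different images
# can be stored in this layer by choosing different (v0y, v0x) pairs.
```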
5 Regular Neural Networks with Additional Planes

Remaining within the considered topology, the degrees of freedom not used in attuning can be assigned in several ways (see [10] for details). In this case the attuning level does not change, but the remaining transformation functions do. Let us consider an alternative solution, which consists in extending the topology with additional planes so that the remaining degrees of freedom are used to increase the attuning level. In this case the number of computing operations in the new topology increases, but the structural regularity of the network is preserved.

To begin with, let us specify the rule for choosing the adapted kernels in the former topology. Let us express the coordinates of the adaptation point in positional notation, denoting the digit variables by y and x:

V_y = y_{n-1} y_{n-2} \cdots y_0,  V_x = x_{n-1} x_{n-2} \cdots x_0.    (13)

The fixed values of the digit variables y_m, x_m correspond to the variables v_m^y, v_m^x; therefore (as follows from (6)), when this adaptation point is chosen, according to rule (12) only the kernels with the following numbers need to be adapted:

i_x^m = u_{n-1}^x u_{n-2}^x \cdots u_{m+1}^x x_{m-1} x_{m-2} \cdots x_0,
i_y^m = u_{n-1}^y u_{n-2}^y \cdots u_{m+1}^y y_{m-1} y_{m-2} \cdots y_0.    (14)

In particular, for m = 0 we have i_x^0 = u_{n-1}^x u_{n-2}^x \cdots u_1^x and i_y^0 = u_{n-1}^y u_{n-2}^y \cdots u_1^y. Thus, regardless of the choice of adaptation points, all kernels of the zero layer must always be adapted. At the same time, the attuning level of the transformation remains restricted to the value D.

To increase the attuning level of the transformation, we introduce additional planes that copy the structure of the main plane in each layer. We determine the order numbers of the additional planes within a layer by the rule:

\lambda_m = ( x_{n-1} x_{n-2} \cdots x_{m+1},\; y_{n-1} y_{n-2} \cdots y_{m+1} ).

The maximum number of additional planes is in the zero layer; as the layer number increases, the number of additional planes decreases, and in the last layer we obtain \lambda_{n-1} = \varnothing, i.e., additional planes are absent altogether. Thus, in the new topology the plane of the last layer remains the same, while additional planes appear in the earlier layers. The architecture of the neural network with additional planes is shown in Fig. 3. The input image is fed simultaneously to all planes of the input layer. The layers are separated by switchboards controlled by the position digits of the coordinate number of the output class.

Fig. 3. Regular neural network architecture with additional planes: the image is fed to the planes of the zero layer, the layers are separated by Switchboard 0 and Switchboard 1, the output coordinates are V_y = y_2 y_1 y_0 and V_x = x_2 x_1 x_0, and the plane numbers are (x_2 x_1, y_2 y_1) and (x_2, y_2)

Since the rule for generating the new planes does not contradict the conditions above, the former rule can be used for attuning the transformation kernels, with an additional argument selecting the plane:

W_{i_x^m, i_y^m}^m (\lambda_m) [u_m^y u_m^x; v_m^y v_m^x] = f_{j_y^m j_x^m}^k (u_m^y, u_m^x).

The index k on the right-hand side enumerates the adaptation points. For m = 0 we have i_x^0 = j_x^0 and i_y^0 = j_y^0; here the free variables are the plane number \lambda_0 = ( x_{n-1} x_{n-2} \cdots x_1, y_{n-1} y_{n-2} \cdots y_1 ) and the digit variables v_0^y, v_0^x. Together they cover the full coordinate range of the output plane, and the possible values of the index k correspond to this range. The remaining layers do not impair the attuning level of the transformation. Thus, the transformation with additional planes can be adapted to D = M_y M_x images, i.e., each point of the output plane corresponds to exactly one image of the training set.

Fig. 4. Two-dimensional basis functions of the adapted transformation

Fig. 4 shows the two-dimensional basis functions of a transformation adapted to the average images of handwritten digits [2]. For this transformation:

N_y = N_x = p_0 p_1 p_2 = 4 \cdot 3 \cdot 2 = 24,  M_y = M_x = g_0 g_1 g_2 = 2 \cdot 2 \cdot 2 = 8.

If the input image corresponds to one of the adaptation points, the value of the corresponding coefficient of the spectral plane will be maximal. The result of the transformation for the image of the digit "0" is shown in Fig. 5. It can be seen that the coefficients corresponding to the subclasses of the digit "0" have the maximum values.

Fig. 5. The result of the two-dimensional transformation for the image of the digit "0"
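As a small sketch of the plane-switching control described in this section (the helper name and digit conventions are illustrative assumptions, not the author's code): the plane number \lambda_m selected by the switchboards for layer m is simply the pair of high-order digit words of the output coordinates (13):

```python
def plane_number(vy_digits, vx_digits, m):
    """Plane number lambda_m selected by the switchboards for layer m.

    vy_digits, vx_digits : most-significant-first digit lists (y_{n-1}, ..., y_0) and
    (x_{n-1}, ..., x_0) of the output coordinates Vy, Vx, cf. Eq. (13).
    Returns the pair of digit words (x_{n-1}..x_{m+1}, y_{n-1}..y_{m+1});
    for m = n - 1 both words are empty, i.e. the last layer has a single plane.
    """
    n = len(vy_digits)
    return tuple(vx_digits[:n - 1 - m]), tuple(vy_digits[:n - 1 - m])

# For the 8x8 output plane of Fig. 3 (n = 3, binary digits): output class (Vy, Vx) = (6, 3).
vy = [1, 1, 0]   # y2 y1 y0
vx = [0, 1, 1]   # x2 x1 x0
print([plane_number(vy, vx, m) for m in range(3)])
# [((0, 1), (1, 1)), ((0,), (1,)), ((), ())]
```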
6 Ordering of the Adaptation Points

For each k-th adaptation point in the range k \in [0, D - 1], the values of the digit variables y_m, x_m must be set. A one-to-one correspondence k \leftrightarrow (V_y, V_x) defines the rule for ordering the adaptation points in the output plane. Let us consider some typical variants. Recall that the digit-by-digit representation of the coordinates is defined by expressions (13). The task of ordering is to establish a correspondence between the ordinal number k and the digit variables y_i, x_i. We assume that moving along a column of the spectral plane corresponds to changing the coordinate V_y, and moving along a row corresponds to changing the coordinate V_x.

6.1 Ordering Along Columns

The ordering algorithm can be specified by the following representation of the sequence number:

k = x_{n-1} x_{n-2} \cdots x_0 y_{n-1} y_{n-2} \cdots y_0.

In this expression the digit y_0 is the lowest, so as the number k increases, the digits y_i change first, and as a result the adaptation points are placed along the columns. Fig. 4 shows the variant of ordering in which classes are placed along columns and subclasses along rows.

6.2 Ordering Along Rows

The ordering algorithm can be specified by the following representation of the sequence number:

k = y_{n-1} y_{n-2} \cdots y_0 x_{n-1} x_{n-2} \cdots x_0.

In this expression the digit x_0 is the lowest, so as the number k increases, the digits x_i change first, and as a result the adaptation points are placed along the rows. Fig. 6 shows a variant of the transformation with the adaptation points ordered along the rows.

Fig. 6. The transformation with the basis functions ordered along the rows

6.3 Ordering Along Circular Segments

The ordering algorithm can be specified by the following representation of the sequence number:

k = x_{n-1} y_{n-1} x_{n-2} y_{n-2} \cdots x_1 y_1 x_0 y_0.

In this case, the spectral plane is filled with increasing values of k when moving clockwise along "circular" segments. Fig. 7 shows a variant of the transformation with the basis functions ordered along circular segments.

Fig. 7. The transformation with the basis functions ordered by circular segments
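All three orderings amount to decomposing k in a mixed-radix system whose digit order differs per scheme. A compact sketch (illustrative helper, not from the paper) of the mapping from the ordinal number k to the adaptation-point coordinates (V_y, V_x):

```python
from itertools import product
from math import prod

def ordering_map(gy, gx, scheme):
    """Map the ordinal number k to adaptation-point coordinates (Vy, Vx), Section 6.

    gy, gx : output radices (g_0, ..., g_{n-1}) along y and x, cf. Eq. (2).
    scheme : 'columns'  -> k = x_{n-1}..x_0 y_{n-1}..y_0      (Sec. 6.1)
             'rows'     -> k = y_{n-1}..y_0 x_{n-1}..x_0      (Sec. 6.2)
             'circular' -> k = x_{n-1} y_{n-1} ... x_0 y_0    (Sec. 6.3)
    Returns a list whose k-th element is the point (Vy, Vx).
    """
    n = len(gy)
    # digit slots of k, listed from the most significant position down to the least significant
    if scheme == 'columns':
        slots = [('x', m) for m in reversed(range(n))] + [('y', m) for m in reversed(range(n))]
    elif scheme == 'rows':
        slots = [('y', m) for m in reversed(range(n))] + [('x', m) for m in reversed(range(n))]
    else:  # 'circular'
        slots = [s for m in reversed(range(n)) for s in (('x', m), ('y', m))]
    radices = [gx[m] if axis == 'x' else gy[m] for axis, m in slots]

    points = []
    for digits in product(*(range(r) for r in radices)):   # lexicographic order = increasing k
        y_dig = {m: d for (axis, m), d in zip(slots, digits) if axis == 'y'}
        x_dig = {m: d for (axis, m), d in zip(slots, digits) if axis == 'x'}
        Vy = sum(y_dig[m] * prod(gy[:m]) for m in range(n))   # digit weight g_{m-1} * ... * g_0
        Vx = sum(x_dig[m] * prod(gx[:m]) for m in range(n))
        points.append((Vy, Vx))
    return points

# 8 x 8 output plane with binary digits (n = 3): the first four adaptation points of each scheme.
for s in ('columns', 'rows', 'circular'):
    print(s, ordering_map((2, 2, 2), (2, 2, 2), s)[:4])
# columns  [(0, 0), (1, 0), (2, 0), (3, 0)]
# rows     [(0, 0), (0, 1), (0, 2), (0, 3)]
# circular [(0, 0), (1, 0), (0, 1), (1, 1)]
```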
7 Summary

Regular tunable transformations possess the unique possibility of an analytical representation of the topology of the implementing network, which allows developing learning algorithms that converge in a finite number of steps. It is shown that the implementing topology is easily extended by additional planes, while the number of recognized images increases dramatically and covers all elements of the output plane. Moreover, the topology extension does not violate the principle on which the training algorithm is built. The constructed topology is ideologically close to the topology of convolutional deep learning networks [2] but is regular. The presented solution provides a constructive answer to the fundamental questions of deep learning neural networks: how to choose a topology and how to reduce the training time of the network.

References

1. Fukushima K., Miyake S., Ito T.: Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics SMC-13(5), 826-834 (1983).
2. LeCun Y., Boser B., Denker J. S., Henderson D., Howard R. E., Hubbard W., Jackel L. D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4), 541-551 (1989).
3. Springenberg J. T., Dosovitskiy A., Brox T., Riedmiller M.: Striving for simplicity: The all convolutional net. arXiv:1412.6806 (2014). https://arxiv.org/pdf/1412.6806
4. Romanuke V.: Appropriate number and allocation of ReLUs in convolutional neural networks. Research Bulletin of NTUU "Kyiv Polytechnic Institute" 1, 69-78 (2017).
5. Glorot X., Bordes A., Bengio Y.: Deep sparse rectifier neural networks. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 15 (2011).
6. Nair V., Hinton G. E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (2010).
7. Maas A. L., Hannun A. Y., Ng A. Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. ICML, 30(1) (2013).
8. Dorogov A. Yu., Alekseev A. A.: Mathematical models of fast neural networks. In: Collection of Scientific Works of SPbGETU "Information Management and Processing Systems", Issue 490, pp. 79-84 (1996). In Russian.
9. Solodovnikov A. I., Spivakovsky A. M.: Fundamentals of the Theory and Methods of Spectral Information Processing: Textbook. Leningrad: LSU Publishing House (1986). 272 pp. In Russian.
10. Dorogov A. Yu.: Theory and Design of Fast Tunable Transformations and Weakly Connected Neural Networks. St. Petersburg: Politekhnika (2014). 328 pp. In Russian. http://dorogov.su/