<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimization of computational complexity of an artificial neural network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikolay Vershkov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viktor Kuchukov</string-name>
          <email>or@list.ru</email>
          <email>zzvkuchukov@ncfu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natalia Kuchukova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolay Kucherov</string-name>
          <email>ynkucherov@ncfu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Egor Shiriaev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>North-Caucasus Center for Mathematical Research, North-Caucasus Federal University</institution>
          ,
          <addr-line>1, Pushkin Street, 355017, Stavropol</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>The article deals with the modelling of Artificial Neural Networks as an information transmission system in order to optimize their computational complexity. The analysis of existing theoretical approaches to optimizing the structure and training of neural networks is carried out. In the process of constructing the model, the well-known problem of isolating a deterministic signal against a background of noise is considered and adapted to the problem of assigning an input implementation to a certain cluster. A layer of neurons is considered as an information transformer with a kernel for solving a certain class of problems: orthogonal transformation, matched filtering, and nonlinear transformation for recognizing the input implementation with a given accuracy. Based on the analysis of the proposed model, it is concluded that it is possible to reduce the number of neurons in the layers of the neural network and to reduce the number of features used to train the classifier.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The necessity to represent functions of n variables as superpositions of functions of a smaller number of variables arose in connection with the development of neural network theory and practice. The appearance of neural networks is associated with an article by McCulloch et al. [<xref ref-type="bibr" rid="ref1">1</xref>], which describes a mathematical model of a neuron and a neural network. It has been proven that both Boolean functions and finite state machines can be represented by neural networks. Later, a serious mathematical analysis of perceptrons revealed limitations on the area of their applicability; these restrictions were subsequently weakened by replacing the threshold activation functions of neurons with sigmoid ones. The theoretical basis of Artificial Neural Networks (ANNs) was the Kolmogorov-Arnold theorem, proved as a result of a scientific discussion [<xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>], which showed the possibility of representing a continuous real function of n variables $f(x_1, x_2, \ldots, x_n)$ as a superposition of functions of a smaller number of variables. A significant theoretical development of ANNs was the proof of the Hecht-Nielsen theorem [<xref ref-type="bibr" rid="ref4">4</xref>], which showed, in a non-constructive form, the possibility of approximating a function of several variables with a given accuracy by an ANN with one hidden layer. Interest in deep networks stems from the limitations of the perceptron. The use of multilayer networks was initially limited by the complexity of their training; due to the ideas of Hinton's team, training multilayer ANNs became possible [<xref ref-type="bibr" rid="ref5">5</xref>]. Multilayer networks enabled solving problems of classification, extrapolation, feature extraction, etc. under conditions of high uncertainty, i.e. obtaining satisfactory results with a sufficiently small training sample. Thus, the modern theory of ANNs is based on a vector (geometric) approach [<xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7">2, 3, 4, 5, 6, 7</xref>].
      </p>
      <p>
        An interesting approach to the problem of constructing image recognition systems of optimal architecture was proposed by Rao et al. [<xref ref-type="bibr" rid="ref8">8</xref>]. They proposed to present the pattern recognition system functionally in the form of two blocks: feature selection and a trained classifier. The selection of features is carried out using orthogonal transformations of the input signal, and, in order to increase the efficiency of the system, a method for decreasing the dimension of the feature vector used to train the classifier is proposed. The ANN input receives a sequence of values $\{X_i\} = (x_{i1}, x_{i2}, \ldots, x_{in})$, which can be represented as discrete samples of some continuous function $x(t)$. The ANN output is a sequence $\{Y_i\} = (y_{1i}, y_{2i}, \ldots, y_{mi})$, which can also represent discrete samples of a function $y(t)$. Here $i$ is the number of the sample, and $m$ and $n$ are the dimensions of the output and input samples of the sequences, respectively. This approach makes it possible to study the output sequence of the ANN not only as a geometric interpretation of the input samples, but also to consider the information interaction of the layers of complex (deep) ANNs using the mathematical apparatus of information transmission theory.
      </p>
      <p>
        We developed the ideas of Ahmed and Rao, widely applying the methods of decoding and separating signals against a background of noise used in information transmission theory [<xref ref-type="bibr" rid="ref9">9</xref>]. However, the use of the discrete Fourier transform (DFT) did not yield a significant gain in reducing the computational complexity of the ANN [<xref ref-type="bibr" rid="ref9">9</xref>]. Despite the significant improvement in ANN algorithms and the reduction of training time by tens and hundreds of times, a number of problems remain in this area that require theoretical comprehension. First of all, these include a significant computational load and, as a result, a significant training time.
      </p>
      <p>The solution of the above problem using existing theoretical approaches is difficult; the proposed ANN model in the form of an information transmission system will allow studying the interaction of layers and significantly reduce the complexity of training.</p>
      <p>
        Within the framework of this article, the following restrictions apply. We did not seek to consider all the diversity of ANN architectures, but limited ourselves to feed-forward networks. This article did not set the task of a complete study of the ANN or of obtaining practically significant results; rather, it considers the problem of constructing a mathematical model of the ANN based on the parallel digital processing of input information, presented as a random process containing a deterministic signal that must be attributed to a certain class. We suggest that such an approach will significantly reduce the cost of training ANNs and thereby increase the efficiency of their application.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Feature selection as an orthogonal transformation of the input vector</title>
      <p>
        Studying the problem of feature selection, we relied on the provisions formulated by Rao et al. [<xref ref-type="bibr" rid="ref8">8</xref>]. The ANN input vector $\{X_i\} = (x_{i1}, x_{i2}, \ldots, x_{in})$ is considered as a discretization of the continuous signal $x(t)$, subject to the provisions of the Kotelnikov theorem (better known abroad as the Nyquist-Shannon theorem). The output signal $y(t)$ can also be represented by discrete values $\{Y_i\} = (y_{1i}, y_{2i}, \ldots, y_{mi})$. Investigating the ANN, we proceed from the classical scheme of an information transmission system using wideband signals [<xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>]. The transmission of broadband (complex) signals is characterized by the shape of a time-frequency matrix [<xref ref-type="bibr" rid="ref11">11</xref>]. In [<xref ref-type="bibr" rid="ref6">6</xref>], three types of time-frequency matrices are distinguished: parallel, serial, and serial-parallel. A signal with a serial-parallel matrix is fed to the input of the ANN.
      </p>
      <p>
        Let the function $x(t)$ be a complex signal consisting of $i$ variants, each of which is encoded with a sequence of $n$ symbols. At the output, we get the function $y(t)$, consisting of $i$ variants, each of length $m$. Considering the principles of complex signal analysis, it is more convenient to represent them in a generalized spectral form, i.e. each variant of the signal can be represented in the form [<xref ref-type="bibr" rid="ref10 ref11 ref5">5, 10, 11</xref>]:
      </p>
      <p>
        $$\begin{cases} x_r(t) = \sum_{k=k_{r1}}^{k_{r2}} a_{kr}\,\varphi_k(t) \\ y_l(t) = \sum_{k=k_{l1}}^{k_{l2}} a_{kl}\,\varphi_k(t) \end{cases}, \quad t \in [0, T] \qquad (1)$$
      </p>
      <p>
        where $T = n\,\Delta t_x = m\,\Delta t_y$, and the expansion coefficients are
      </p>
      <p>
        $$\begin{cases} a_{kr} = \left[ \int_0^T \varphi_k^2(t)\,dt \right]^{-1} \int_0^T x_r(t)\,\varphi_k(t)\,dt \\ a_{kl} = \left[ \int_0^T \varphi_k^2(t)\,dt \right]^{-1} \int_0^T y_l(t)\,\varphi_k(t)\,dt \end{cases} \qquad (2)$$
      </p>
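      <p>
        As a numerical illustration (ours, not from the original article), the integrals in (2) can be approximated by sums over the sampling grid. The sketch below assumes a cosine basis on $[0, T]$; the signal and all names are placeholders.
      </p>
      <preformat>
import numpy as np

# Sketch of expression (2): expansion coefficients of a sampled signal
# over an orthogonal basis, with the integrals replaced by sums.
T, n = 1.0, 256
t = np.linspace(0.0, T, n, endpoint=False)
dt = T / n

def phi(k, t):
    # Cosine basis, orthogonal on [0, T] (our assumption).
    return np.cos(2 * np.pi * k * t / T)

x = 0.7 * phi(3, t) + 0.2 * phi(5, t)   # test realization x_r(t)

def coeff(x, k):
    # a_k = [ integral of phi_k^2 ]^{-1} * integral of x * phi_k
    return (np.sum(x * phi(k, t)) * dt) / (np.sum(phi(k, t) ** 2) * dt)

print([round(coeff(x, k), 3) for k in range(8)])
# approximately 0.7 at k = 3 and 0.2 at k = 5, near zero elsewhere
      </preformat>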
      <p>
        The coordinate functions $\varphi_k(t)$ satisfy the orthogonality condition [<xref ref-type="bibr" rid="ref11">11</xref>]. Representation (1) makes it possible to understand the procedure for processing complex signals not only in the time domain, but also in the time-frequency domain, as happens with the classical ANN design methods. Using orthogonal transformations, the features (the spectrum) of the input signal are selected and, after processing, can be used to train the classifier. Thus, the ANN in the form of an information transmission system (ITS) performs the transformation $y(t_i) = F(x(t_i))$, where $x(t_i) = \sum_{i=1}^{n} x_i(i\,\Delta t_x)$ and $y(t_i) = \sum_{i=1}^{n} y_i(i\,\Delta t_y)$, and the functional $F$ is the subject of this article.
      </p>
      <p>
        The classical approach (the McCulloch-Pitts model [<xref ref-type="bibr" rid="ref1">1</xref>]) considers a mathematical model of a neuron in the form $y_{k,l} = f\left(\sum_{i=1}^{n} w_i^{k,l} x_i^{k,l}\right)$, where $k, l$ are the number of the layer and the number of the neuron in the layer, $y_{k,l}$ is the output of the neuron, $x_i^{k,l}$ are the inputs of the neuron, $w_i^{k,l}$ are the weights (synapses) of the input signals, and $f$ is the output function of the neuron, which may or may not be linear. The transformation of a signal in a neuron can be considered both in the traditional sense, as an algebraic sum of products of input signals and weights (an adaptive adder), and in the sense of expressions (1), (2). Indeed, if the set of weights of the k-th neuron of the i-th layer $\{w_{ki}\}_{k=0,1,2,\ldots}$ is a discrete representation of the i-th orthogonal function $\varphi_i(t)$, then the output will be
      </p>
      <p>
        $$a_i = \sum_{k=0}^{n} x(k\,\Delta t)\,\varphi_i(k\,\Delta t) \qquad (3)$$
      </p>
      <p>
        Applying to expression (3) a normalizing coefficient of the form $k = \left[ \int_0^T \varphi_i^2(t)\,dt \right]^{-1}$, denoting
      </p>
      <p>
        $$w_i(k\,\Delta t) = \frac{\varphi_i(k\,\Delta t)}{\int_0^T \varphi_i^2(t)\,dt} \qquad (4)$$
      </p>
      <p>
        and summing over the period, we arrive at expression (2).
      </p>
      <p>
        Let us consider the application of the orthogonal transform based on the widely used Fourier transform [<xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>]. In some cases, however, instead of trigonometric functions, other kernels are more appropriate, such as the Laguerre, Legendre, Hermite, Walsh, Chebyshev, or Hadamard functions. Since the input and output signals are presented in discrete form, we use the variant of the Fourier transform for discrete signals, the discrete Fourier transform (DFT).
      </p>
      <p>Thus, choosing the weights $w_i(k\,\Delta t)$ in accordance with (4), at the output of the i-th neuron we obtain the value of one of the spectral components $a_i$. A layer of neurons with weights $w_i(k\,\Delta t)$ selected for the spectral components with serial numbers $i = 0, 1, 2, 3, \ldots$ gives, as a result of the transformation, the spectrum of the input signal with a given accuracy.</p>
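      <p>
        To make the correspondence between the McCulloch-Pitts adder and expressions (3), (4) concrete, the following sketch (our illustration, with an assumed cosine basis and placeholder data) loads the weights of a single linear neuron with the normalized samples $w_i(k\,\Delta t)$ and shows that its output is the spectral component $a_i$.
      </p>
      <preformat>
import numpy as np

# A single linear neuron as a spectral analyzer, per (3)-(4): weights are
# samples of the i-th orthogonal function, normalized by the integral of
# its square (integrals approximated by sums times dt).
T, n = 1.0, 256
t = np.linspace(0.0, T, n, endpoint=False)
dt = T / n

def phi(i, t):
    return np.cos(2 * np.pi * i * t / T)        # assumed basis

i = 4
w = phi(i, t) / (np.sum(phi(i, t) ** 2) * dt)   # weights per expression (4)

x = 1.5 * phi(4, t) + 0.3 * phi(7, t)           # input realization
a_i = np.sum(w * x) * dt                        # adder with linear output f
print(round(a_i, 3))                            # -> 1.5
      </preformat>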
      <p>
        Let us consider in more detail the orthogonal transformation of the DFT based on the McCulloch-Pitts model [<xref ref-type="bibr" rid="ref1">1</xref>]. The DFT is based on the well-known expression for the continuous Fourier transform $X(\omega) = \int_{-\infty}^{\infty} x(t)\,e^{-j\omega t}\,dt$. Passing from the continuous form to the discrete one, we replace the integration with a summation and, introducing a restriction on the width of the signal spectrum, obtain the expression $X(m) = \sum_{i=0}^{n-1} x(i)\,e^{-j 2\pi i m/n}$. A similar expression can be implemented using complex numbers but, more conveniently, it can be reduced to the form
      </p>
      <p>
        $$X(m) = \sum_{i=0}^{n-1} x(i)\left( \cos\frac{2\pi i m}{n} - j\,\sin\frac{2\pi i m}{n} \right) = \sum_{i=0}^{n-1} x(i)\cos\frac{2\pi i m}{n} - j \sum_{i=0}^{n-1} x(i)\sin\frac{2\pi i m}{n} \qquad (5)$$
      </p>
      <p>using Euler's identity $e^{-j\theta} = \cos(\theta) - j\,\sin(\theta)$.</p>
      <p>
        Here $X(m)$ is the m-th harmonic of the DFT, $m$ is the index in the frequency domain, and $x(i)$ is a sequence of input samples of size $n$. In this case, the number of neurons in the layer must be at least $2n$ to produce the real (cosine) and imaginary (sine) parts of the complex number, which is in agreement with the Hecht-Nielsen theorem [<xref ref-type="bibr" rid="ref4">4</xref>], which requires at least $2n + 1$ neurons in the first hidden layer. If a different orthogonal transformation is used as the kernel function, then the substitution of weights is performed based on the transformation used.
      </p>
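      <p>
        The construction above can be checked directly. In the sketch below (our own; NumPy stands in for a linear ANN layer), the first $n$ rows of the weight matrix hold the cosine weights of (5) and the next $n$ rows the sine weights, and the assembled output matches the library DFT.
      </p>
      <preformat>
import numpy as np

# Sketch: a linear layer of 2n neurons implementing the DFT of (5).
# Rows 0..n-1 produce the cosine sums, rows n..2n-1 the sine sums.
n = 64
i = np.arange(n)
m = np.arange(n).reshape(-1, 1)
W_cos = np.cos(2 * np.pi * i * m / n)    # n "cosine" neurons
W_sin = np.sin(2 * np.pi * i * m / n)    # n "sine" neurons
W = np.vstack([W_cos, W_sin])            # layer weight matrix, 2n x n

x = np.random.default_rng(0).standard_normal(n)
out = W @ x                              # one forward pass, linear f
X = out[:n] - 1j * out[n:]               # assemble X(m) per (5)

assert np.allclose(X, np.fft.fft(x))     # agrees with the library DFT
      </preformat>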
      <p>
        In addition to the DFT, the discrete cosine transform (DCT) is widely used in digital signal processing practice [<xref ref-type="bibr" rid="ref8">8</xref>]; it is used to compress images in the MPEG and JPEG formats. The DCT is closely related to the Fourier transform and is a homomorphism of its vector space. Since the DCT operates with real numbers, it does not require $2n$ neurons in a layer: $n$ is enough, which halves the number of neurons in the first hidden layer compared to the DFT. To solve the problem of reducing the computational load on the ANN, we use the DCT.
      </p>
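      <p>
        For comparison, a hedged sketch of a DCT first layer (we assume the orthonormal DCT-II variant, as implemented in SciPy): $n$ real-valued neurons suffice where the DFT needed $2n$.
      </p>
      <preformat>
import numpy as np
from scipy.fft import dct

# Sketch: the first hidden layer as an n x n DCT-II matrix; n real
# neurons replace the 2n required for the complex DFT.
n = 64
k = np.arange(n).reshape(-1, 1)
i = np.arange(n)
W = np.cos(np.pi * k * (2 * i + 1) / (2 * n))   # row k = weights of neuron k
W[0] *= np.sqrt(1.0 / n)                        # orthonormal scaling
W[1:] *= np.sqrt(2.0 / n)

x = np.random.default_rng(1).standard_normal(n)
features = W @ x                                # spectrum used as features

assert np.allclose(features, dct(x, norm='ortho'))
      </preformat>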
    </sec>
    <sec id="sec-3">
      <title>3. Implementation of the classifier based on the McCulloch-Pitts model</title>
      <p>
        For successful classification of the i-th implementation of the input value, it is necessary to determine the degree of similarity of the implementation to the values that define the classification levels. The classical approach to training the classifier is based on adaptive algorithms that minimize the value of a working function [<xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>]. As the working function for supervised training of an ANN, the mean square of the neural layer error is widely used. Newton's method and the method of steepest descent, as well as their variations, are used as algorithms [<xref ref-type="bibr" rid="ref13">13</xref>].
      </p>
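      <p>
        For reference, the classical scheme described above looks roughly as follows in PyTorch (a generic sketch, not the authors' code; the data tensors are stand-ins): a mean-square working function minimized by gradient descent.
      </p>
      <preformat>
import torch
import torch.nn as nn

# Generic sketch of the classical approach: minimize the mean square of
# the layer error by gradient descent (placeholder data, one layer).
n_features, n_classes = 784, 10
X = torch.randn(256, n_features)                 # stand-in training batch
Y = torch.eye(n_classes)[torch.randint(0, n_classes, (256,))]  # one-hot

layer = nn.Linear(n_features, n_classes, bias=False)
opt = torch.optim.SGD(layer.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                           # mean-square working function

for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(layer(X), Y)
    loss.backward()
    opt.step()
      </preformat>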
      <p>
        Since the proposed model makes wide use of digital signal processing methods, the implementation of the classifier is based on the criteria of optimal filtering. A similar approach was considered in [<xref ref-type="bibr" rid="ref9">9</xref>], where the correlation function of the ANN output signal and the training sequence was proposed as the working function. Let each realization of the input value be a deterministic signal, namely the average value of all realizations from the training sample belonging to a certain class, mixed with additive noise, i.e. $X_i = Z_j + N$. Here $X_i$ is the input implementation belonging to the j-th class, $Z_j = \frac{1}{p}\sum_{i=0}^{p-1} X_i$, where $p$ is the number of input implementations belonging to the j-th class, and $N$ is a random process with a Gaussian distribution. In information transmission theory, the ratio of the maximum value of the input implementation to the standard deviation of the noise at the output is used as the characteristic that determines the assignment of the implementation to a certain class [<xref ref-type="bibr" rid="ref10">10</xref>]:
      </p>
      <p>
        $$q = \frac{\max(X_i)}{\sigma_{out}} \qquad (6)$$
      </p>
      <p>
        The optimal filter tends to maximize the value of (6), i.e. $q \to \max$. After passing through the first hidden layer, the input implementation is presented in the form of a spectrum, $X_i^{out}(\omega) = X_i(\omega)\,H(\omega)$, where $H(\omega)$ is the frequency response of the output layer. Omitting the intermediate calculations [<xref ref-type="bibr" rid="ref10">10</xref>], we get:
      </p>
      <p>
        $$q_{opt} = \sqrt{ \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{|X(\omega)|^2}{N(\omega)}\,d\omega } \qquad (7)$$
      </p>
      <p>
        Based on the Bunyakovsky-Schwarz inequality and expression (7), we obtain the complex frequency response of the optimal circuit:
      </p>
      <p>
        $$H(\omega) = k\,\frac{X^{*}(\omega)\,e^{-j\omega t}}{N(\omega)} \qquad (8)$$
      </p>
      <p>
        Based on expression (8), the amplitude-frequency characteristic of the optimal filter will be the amplitude-frequency characteristic of the input implementation up to the coefficient $k$:
      </p>
      <p>
        $$H(\omega) = k\,|X(\omega)| \qquad (9)$$
      </p>
      <p>
        Expression (9) allows us to define the ANN output layer as an $m \times n$ matrix, i.e. the number of matrix rows is determined by the number of classes, the number of columns by the dimension of the input implementation, and the rows of the matrix contain the averaged values of the input implementations belonging to each class. In accordance with the McCulloch-Pitts model, the value at the output of the j-th neuron of the output layer is defined as
      </p>
      <p>
        $$Y_j = \sum_{i=1}^{n} X_i\,H_{ij} \qquad (10)$$
      </p>
      <p>
        In expression (10), $H_{ij}$ is the j-th row of the output layer's weight matrix, which is the mathematical expectation of the implementations $X_i$ of the training sample belonging to the j-th class. Physically, expression (10) is the correlation function of the input implementation with the representation of the class defined in the form of the output layer weights.
      </p>
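      <p>
        A minimal sketch of (10) under our reading: the rows of the output layer's weight matrix are the per-class means of the training implementations, and classification takes the maximal correlation. Data, shapes, and helper names are placeholders.
      </p>
      <preformat>
import numpy as np

# Sketch of expression (10): H has one row per class, the mathematical
# expectation of the training implementations of that class; the output
# Y_j = sum_i X_i H_ij is a correlation with the class template.
def build_output_layer(train_X, train_y, n_classes):
    return np.stack([train_X[train_y == j].mean(axis=0)
                     for j in range(n_classes)])

def classify(H, x):
    return int(np.argmax(H @ x))                 # correlator receiver

# Placeholder data: 3 classes of noisy 8-dimensional templates.
rng = np.random.default_rng(2)
templates = rng.standard_normal((3, 8))
train_y = np.repeat(np.arange(3), 50)
train_X = templates[train_y] + 0.1 * rng.standard_normal((150, 8))

H = build_output_layer(train_X, train_y, 3)
print(classify(H, templates[1]))                 # expected: 1
      </preformat>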
      <p>
        Based on the above reasoning and the obtained expression (10), an ANN with one hidden layer was built using the PyTorch library [<xref ref-type="bibr" rid="ref14">14</xref>]. The experiment was carried out using the MNIST database [<xref ref-type="bibr" rid="ref11">11</xref>]. The ANN layer weights were filled in as follows: the first hidden layer implements the DCT, its dimension equal to the dimension n of the input implementation, and the output layer performs the function of an optimal receiver based on the correlator. Without the use of nonlinear layer functions, the ANN gives a recognition accuracy of 72% on test data. In order to assess the correctness of the choice of the mathematical-expectation criterion for the output layer, we train the classifier, i.e. the ANN output layer, using the gradient method; to do this, we artificially prohibit changing the weights of all layers except the last one. The result of training the classifier is shown in figure 1. Due to the training, the recognition reliability increased to 90%, which indicates that the weights in expression (10), represented by the mathematical expectation of class realizations, are not an ideal criterion for the optimal receiver. The use of a nonlinear layer, the ReLU function, increases the recognition accuracy on test data from the MNIST database up to 95%.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Minimization of the feature vector in training the classifier</title>
      <p>
        The next step, in accordance with [<xref ref-type="bibr" rid="ref8">8</xref>], is to minimize the feature vector (in our case, the spectrum of the input signal) in such a way that its dimension is reduced with insignificant losses in its information content. To do this, Ahmed and Rao propose to conduct an analysis of variance of the feature vector in order to identify features with minimum variance, i.e. uninformative ones, and exclude them from the analysis. To assess the degree of influence of the harmonics with low dispersion, we use an ANN with one hidden layer and the ReLU function. The result of the successive removal of uninformative harmonics is shown in figure 2.
      </p>
      <p>
        It is clearly seen from the presented graph that the removal of more than 300 of the 784 harmonics has no effect at all on the recognition accuracy of the test data, and removing a further 200 decreases the recognition quality by only 1%. That is, keeping approximately 250 of the 784 features, we can recognize the test data with 94% reliability. Thus, the first hidden layer of the ANN can be reduced by approximately 70% with a decrease in reliability of no more than 1%; the weight matrix of the output layer is likewise reduced by 70%. Therefore, the gain from the application of the proposed model in comparison with the standard one [<xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>] is approximately 84%.
      </p>
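      <p>
        A sketch of the variance analysis described above (our illustration; MNIST loading and exact thresholds are omitted, and train_images is a placeholder): rank the DCT harmonics by their variance over the training set and keep only the most variable ones.
      </p>
      <preformat>
import numpy as np
from scipy.fft import dct

# Sketch of the feature-vector minimization: compute DCT spectra of the
# training set, rank harmonics by variance, keep the top k (~250 of 784
# in the experiment above). train_images stands in for flattened MNIST.
rng = np.random.default_rng(3)
train_images = rng.random((1000, 784))           # placeholder for MNIST

spectra = dct(train_images, axis=1, norm='ortho')
variances = spectra.var(axis=0)

k = 250
keep = np.argsort(variances)[-k:]                # informative harmonics
reduced = spectra[:, keep]                       # classifier feature vectors
print(reduced.shape)                             # (1000, 250)
      </preformat>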
    </sec>
    <sec id="sec-2">
      <title>5. Conclusion</title>
      <p>Despite the signi cant gain in ANN performance, the proposed model has a number of
disadvantages. These include the low reliability of test data recognition: 95% versus 98% or
more for ANNs implemented on the basis of the traditional model. This can be explained by
the fact that the simplest ANNs with one hidden layer have been simulated so far. In addition,
the use of the averaged value as a measure of the similarity with the implementation requires
further study, since further training is needed, and therefore computational costs. We hope that
further research in the proposed direction will reduce the recognition error and expand the range
of application of the proposed model.</p>
      <p>Acknowledgements This work was supported by a grant from the Russian Science
Foundation Grant No. 19-71-10033.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>McCulloch W S and Pitts</surname>
            <given-names>W 1943</given-names>
          </string-name>
          <source>The bulletin of mathematical biophysics 5</source>
          <volume>115</volume>
          {
          <fpage>133</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kolmogorov</surname>
            <given-names>A N</given-names>
          </string-name>
          <year>1957</year>
          <article-title>On the representation of continuous functions of several variables in the form of superpositions of continuous functions of one variable and addition Doklady Akademii nauk vol 114 (Rossijskaya akademiya nauk</article-title>
          ) pp
          <volume>953</volume>
          {
          <fpage>956</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Arnol'd V I 1958</surname>
          </string-name>
          <article-title>Mat</article-title>
          . Prosveshchenie 3
          <volume>41</volume>
          {
          <fpage>61</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Hecht-Nielsen</surname>
            <given-names>R</given-names>
          </string-name>
          1988
          <source>IEEE spectrum 25</source>
          <volume>36</volume>
          {
          <fpage>41</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Hinton</surname>
            <given-names>G E</given-names>
          </string-name>
          <year>2007</year>
          Trends in
          <source>cognitive sciences 11</source>
          <volume>428</volume>
          {
          <fpage>434</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Alexandrovich</surname>
            <given-names>S A</given-names>
          </string-name>
          <year>1983</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Klod</surname>
            <given-names>S 1963</given-names>
          </string-name>
          <article-title>Works on information theory and cybernetics</article-title>
          (Foreign Literature Publishing House)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Rao</surname>
            <given-names>K</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ahmed N 1976</surname>
          </string-name>
          <article-title>Orthogonal transforms for digital signal processing ICASSP'76</article-title>
          . IEEE International Conference on Acoustics,
          <source>Speech, and Signal Processing</source>
          vol
          <volume>1</volume>
          (IEEE) pp
          <volume>136</volume>
          {
          <fpage>140</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Vershkov</surname>
            <given-names>N A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuchukov</surname>
            <given-names>V A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuchukova N N and Babenko</surname>
            <given-names>M 2020</given-names>
          </string-name>
          <article-title>The wave model of arti cial neural network 2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus) (IEEE</article-title>
          ) pp
          <volume>542</volume>
          {
          <fpage>547</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Vasilievich</surname>
            <given-names>S A</given-names>
          </string-name>
          <year>1967</year>
          <article-title>Information theory and its application to automatic control problems (Publishing House "Science" Head edition of physical</article-title>
          and mathematical literature)
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Deng</surname>
            <given-names>L</given-names>
          </string-name>
          <source>2012 IEEE Signal Processing Magazine 29</source>
          <volume>141</volume>
          {
          <fpage>142</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Hebb D O 1949 A Wiley</surname>
          </string-name>
          <article-title>Book in Clinical Psychology 62 78</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Widrow</surname>
            <given-names>B</given-names>
          </string-name>
          <source>1959 Part 4</source>
          <volume>74</volume>
          {
          <fpage>85</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Paszke</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gross</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Massa</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lerer</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradbury</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chanan</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Killeen</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gimelshein</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antiga</surname>
            <given-names>L</given-names>
          </string-name>
          et al.
          <year>2019</year>
          arXiv preprint arXiv:
          <year>1912</year>
          .01703
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>