Multilayer Neural Network Training Error when AMSGrad, Adam, AdamMax Methods Used

Bohdan Melnyk, Serhiy Sveleba, Ivan Katerynchuk, Ivan Kuno and Volodymyr Franiv
Ivan Franko National University of Lviv, 1, Universytetska St., Lviv, 79000, Ukraine

Abstract
The training errors of a multilayer neural network are considered when the Adam, AdamMax and AMSGrad optimization methods are used for learning. The multilayer neural network is used for the recognition of printed digits. It was established that, as the learning rate increases, the network passes through an underlearning mode, a satisfactory learning mode and a chaotic learning mode. The retraining of neurons is characterized by the appearance of local minima of the error function, and the chaotic learning mode is described by a doubling of the number of existing local minima. The work defines and describes a mechanism for determining the optimal learning rate, which corresponds to the appearance of the first harmonic of the error function, or, equivalently, to the learning rate at which the first loss of stability of the learning system is observed on the branching diagram. The conducted studies of the learning error of the multilayer neural network with the AMSGrad, Adam and AdamMax optimization methods show that these methods do not affect the value of the optimal learning rate, and that this value does not depend on the optimization parameter β2. For all considered optimization methods it equals 0.45.

Keywords
Multilayer neural network, training error, optimization methods

COLINS-2024: 8th International Conference on Computational Linguistics and Intelligent Systems, April 12–13, 2024, Lviv, Ukraine

1. Introduction
An important aspect of training and testing neural networks is avoiding overtraining (retraining), which is a key task in the development of machine learning models. The best-known ways to avoid overtraining are:
• using a validation set [1]: the data is divided into training, validation and test sets, and the validation set is used to evaluate the performance of the model during training, which allows timely detection of overtraining and taking measures to avoid it;
• using regularization methods [2], such as L1 or L2 regularization, which penalize the magnitude of the model parameters and thus help avoid excessive model complexity;
• reducing the number of parameters, i.e. using fewer layers or neurons [3], which is especially important when computing resources are limited;
• applying dropout [4] during training, where certain neurons are randomly dropped at each iteration, which prevents the model from adapting to specific noise dependencies in the training data;
• cross-validation [5], i.e. instead of a one-time split of the data set into training and validation parts, cross-validation is used, which allows a more accurate assessment of the overall performance of the model;
• using early stopping [6], i.e. interrupting the learning process when the performance of the model on the validation set begins to deteriorate, which may indicate the beginning of overtraining;
• using deep learning architectures that are themselves less prone to overtraining [7]; some architectures, such as neural networks with multiple layers of abstraction, may be less susceptible to it;
• combining these methods to obtain the best results and avoid overtraining the neural network.
All of the above methods are aimed at avoiding retraining, but not at its root cause. According to the results of [8], the retraining of neural networks is associated with the appearance of local minima of the target error function; it was noted in [8] that as the error function approaches the global minimum, the number of local minima increases. Based on an analysis of the objective learning error function with the help of a logistic function describing the doubling process, it was noted in [9] that the appearance of local minima is primarily caused by the retraining of neurons. When approaching the global minimum, due to the inhomogeneity of the learning process, the number of retrained neurons increases, and in a first approximation this process can be described as a frequency-doubling process. This is indicated in particular by the Fourier spectra of the target error function and by the obtained branching diagrams [9]. As was also shown in [9], the retraining process is related to the choice of the input array and its heterogeneity: based on the maps of dynamic modes, the heterogeneity of the input array serves as a catalyst for neuron retraining. It was also stated in [9] that the following learning modes are inherent in the neural network learning process: undertraining, a satisfactory learning mode, and a retraining mode. According to the branching diagrams obtained from the objective error function, the retraining mode of the neural network includes both partial retraining and chaotic training. According to the Fourier spectra of the learning error function, the transition to the chaotic learning regime is accompanied by a doubling of the number of local minima. Along with this, transparency windows are observed on the branching diagrams, which indicate the emergence of a satisfactory learning process of the neural network.
Based on the above, a model was proposed in [9] to describe the appearance of the retraining mode for a multilayer neural network with backpropagation of the error. Since in neural networks the input values on each neuron are processed from all neurons of the previous layer (on the second layer, from all input values), a significant number of periodicities is inherent in the error function. This behavior of the error function is caused by the retraining of individual neurons. That is, the increase in the number of local minima of a multilayer neural network when approaching the global minimum is due to the retraining of a certain number of neurons, and such retraining gives rise to the periodic behavior of the error function. Since the error function of the neural network is a combination of the error functions of each neuron, its behavior is characterized by a spectrum of possible oscillation frequencies.
Depending on the learning rate (alpha), all learning modes are inherent in the neural network, including the chaotic learning mode. In the mode where a small number of neurons are retrained, the error function is periodic and is described by several different oscillations; under this condition the learning error function has several local minima. As a result of the doubling of the number of local minima, the neural network passes into a chaotic learning mode. In this mode, the error function of the neural network is described by a set of existing oscillations, and the average wave vector over such an ensemble of oscillations can take a value incommensurate with the existing oscillations. Therefore, the emerging chaotic learning mode of the neural network is characterized by the retraining of a significant number of neurons, whose number changes dramatically when the learning rate changes.
This work is devoted to an algorithm for avoiding the retraining mode of the neural network, and to a comparative analysis of the learning and testing error functions under the condition of a practically absent retraining mode, when the most effective optimization methods of training, such as Adam, AdamMax and AMSGrad, are used. Since the retraining process is strongly influenced both by the sample and by the array that defines the digits themselves, the analysis considers the influence of both, for homogeneous and heterogeneous input arrays.

2. Methodology
To train digit recognition models, printed-digit representation arrays are commonly taken from libraries such as TensorFlow and Keras, which provide the MNIST dataset of handwritten digits. Since digits there are mainly displayed as 28x28-pixel grayscale arrays, such digit representation arrays can be classified as homogeneous. The conducted preliminary studies of maps of dynamic modes show that these arrays can indeed be considered homogeneous; their analysis will be given in our next work. Here we focus on the analysis of the learning process of the neural network using printed digits. Due to their simple representation of digital data as a set of zeros and ones, they are widely used for machine learning and pattern recognition.
The paper deals with 3x5 and 4x7 digit arrays. Increasing the digit representation array contributes to the uniformity of the array. The sample consists of five variants of presenting each digit: one version without distortion and four distorted versions with an error of 15–20% (for the 3x5 array) or 11–15% (for the 4x7 array). In a distorted representation of a digit, a one is replaced by a zero or vice versa. In a first approximation, such an array is considered homogeneous. A non-homogeneous array is one in which the digit representation error is close to 50%: two variants of the digit representation with an error of ≈50% were added to the sample, i.e. variants that, as a set of zeros and ones, did not correspond to any of the digits. Calculations were performed in the Python programming environment for a neural network with three hidden layers of 15 and 28 neurons in each layer, respectively, for the 3x5 and 4x7 digit representation arrays. A sketch of how such input samples can be constructed is given below.
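As an illustration only, the following Python sketch shows how the printed-digit samples described above could be assembled; the 3x5 bit pattern for the digit "1" and the distort helper are hypothetical stand-ins rather than the authors' actual data.

```python
import random

# Hypothetical 3x5 bitmap of the digit "1" (15 binary pixels, row by row).
DIGIT_ONE_3X5 = [0, 1, 0,
                 0, 1, 0,
                 0, 1, 0,
                 0, 1, 0,
                 0, 1, 0]

def distort(pattern, error_rate):
    """Flip roughly error_rate of the pixels (a one becomes a zero or vice versa)."""
    flipped = list(pattern)
    n_flips = max(1, round(error_rate * len(pattern)))
    for idx in random.sample(range(len(pattern)), n_flips):
        flipped[idx] = 1 - flipped[idx]
    return flipped

# Homogeneous sample: one undistorted variant plus four variants with 15-20% distortion.
homogeneous_sample = [DIGIT_ONE_3X5] + [distort(DIGIT_ONE_3X5, 0.15) for _ in range(4)]

# Non-homogeneous sample: the same variants plus two with ~50% distortion,
# i.e. patterns that no longer correspond to any digit.
non_homogeneous_sample = homogeneous_sample + [distort(DIGIT_ONE_3X5, 0.5) for _ in range(2)]
```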
The learning error function was analyzed using a logistic function of the following form:
$x_{n+1} = \alpha x_n - x_n^2$,
where n is the step and α (alpha) is a parameter that determines the learning rate. The selection of the β1 and β2 parameter values for the considered optimization methods Adam, AdamMax and AMSGrad was carried out according to the cited works. According to [9], a sigmoidal function played the role of the activation function. The architecture of the multilayer neural network was chosen according to [9], where it was noted that this architecture has the lowest learning error. Testing was carried out by presenting a digit with an error of 15–20% (for the 3x5 array) and 11–15% (for the 4x7 array).

3. The Adam method
3.1. Homogeneous array, representation of digits in a 3x5 array
The Adam method is quite popular in deep learning because it works effectively with different types of neural network architectures and different tasks. It is an optimization algorithm used for neural network training, in particular in deep learning. It combines the ideas of gradient descent optimization methods (for example, the method of moments and RMSProp) with additional corrections. The main idea of the Adam method is to use the exponentially weighted mean gradient (as in the method of moments) for each parameter, together with the exponentially weighted mean square of the gradient (as in RMSProp). This allows the algorithm to adapt effectively to different magnitudes and directions of gradients. The method is described by the following formulas.
The exponentially weighted mean gradient:
$m_n = \beta_1 m_{n-1} + (1 - \beta_1) g_n$
The exponentially weighted mean square of the gradient:
$v_n = \beta_2 v_{n-1} + (1 - \beta_2) g_n^2$
Bias correction:
$\hat{m}_n = m_n / (1 - \beta_1^n)$,  $\hat{v}_n = v_n / (1 - \beta_2^n)$
The weights are updated according to the formula:
$w_{n+1} = w_n - \eta \hat{m}_n / (\sqrt{\hat{v}_n} + \epsilon)$
In these formulas: $g_n$ is the gradient of the target error function at point n, where n is the iteration number; $m_n$ is the exponentially weighted average of the gradient; $v_n$ is the exponentially weighted average of the square of the gradient; β1 and β2 are parameters, usually close to 1, which control the degree of decay of the previous values (according to [10], they are chosen as 0.9 and 0.99, 0.999 or 0.9999, respectively); η is the learning rate; $w_n$ are the parameters of the model at step n; ϵ is a small addition for numerical stability (usually a very small number, for example 10⁻⁸).
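As an illustration of the update rule above, the following minimal NumPy sketch implements one Adam step; the function and variable names are illustrative, and the gradient g is assumed to be computed elsewhere by backpropagation.

```python
import numpy as np

def adam_step(w, g, m, v, n, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given the gradient g at iteration n (n >= 1)."""
    m = beta1 * m + (1 - beta1) * g               # exponentially weighted mean gradient
    v = beta2 * v + (1 - beta2) * g**2            # exponentially weighted mean squared gradient
    m_hat = m / (1 - beta1**n)                    # bias-corrected first moment
    v_hat = v / (1 - beta2**n)                    # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # weight update
    return w, m, v
```

In the experiments described here, η corresponds to the learning rate alpha that is scanned, and β2 is varied over 0.99, 0.999 and 0.9999.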
Fig. 1 shows the Fourier spectra at a learning rate that corresponds to retraining of the neural network (Fig. 1a, alpha = 0.9) and the branching diagram (Fig. 1b) at N = 100 iterations. According to Fig. 1, the Fourier spectrum is characterized by a wide range of harmonics, which testifies to the existence of the retraining mode of the neural network. According to the branching diagram in Fig. 1b, already at the initial stage of training the learning error function on each neuron is a complex functional dependence. Starting from a learning rate of alpha > 0.45, the appearance of local minima can be traced. According to the branching diagram, their number (and therefore the number of neurons involved in the retraining process) doubles as the learning rate increases. That is, a further increase in the learning rate is described by a doubling of the number of local minima of the learning error function. In the limit, this leads to chaotic behavior of the learning error function. Considering the Fourier spectra and branching diagrams together suggests the conclusion that the increase in the number of harmonics in the Fourier spectra and of branches in the branching diagram is associated with the appearance of local minima in the target error function, and that the appearance of local minima is due to the retraining of neurons. Therefore, a chaotic learning mode occurs in the multilayer neural network; it is caused by the appearance of local minima, which arise as the learning rate increases and the retraining process sets in. The resulting chaotic learning mode is sensitive to changes in the parameters of the multilayer neural network: a slight change in the parameters causes significant changes in the training mode. The appearance of local solutions (local minima) as the learning rate increases leads to bifurcations in the dependence of the learning error on the number of epochs, and all of this is associated with the retraining of neurons.
Figure 1: (a) Fourier spectra and (b) branching diagram of the learning error function versus the number of iterations, when the Adam optimization method is used, for a homogeneous array of dimensions 3x5, with β1 = 0.9, β2 = 0.999 and N = 100 iterations.
One way to solve this problem (i.e. to avoid the chaotic learning regime of the multilayer neural network) is to determine the parameters at which harmonics of the objective learning error function appear, or, on the branching diagram, at which a bifurcation is present. In other words, one determines the learning rate at which local minima caused by neuron retraining first appear. This mechanism assumes the absence of local minima, and therefore of the retraining regime. The algorithm for solving this problem consists in determining the optimal value of the learning rate, and therefore the optimal value of the learning error, at which the first harmonic occurs in the learning error function. Under these conditions, the obtained optimal value of the learning rate (alpha) is equal to 0.4501 for each digit. This shows that the Adam method effectively selects the learning parameters for different input data. The learning error in this case varied from 2.608e-05 to 2.6295e-05, which is a negligible value. Such results indicate the high accuracy of neural network learning and its ability to implement the learning process effectively.
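The criterion just described (take the largest learning rate at which the Fourier spectrum of the learning error curve still shows at most one dominant harmonic) can be sketched as follows; the relative threshold and the way the error history is obtained are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def harmonic_count(error_history, rel_threshold=0.05):
    """Count spectral components of the learning-error curve whose magnitude
    exceeds rel_threshold of the strongest non-constant component."""
    signal = np.asarray(error_history, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))[1:]  # drop the DC term
    if spectrum.size == 0 or spectrum.max() == 0:
        return 0
    return int(np.sum(spectrum > rel_threshold * spectrum.max()))

# Hypothetical usage: train_error(alpha) would return the error-vs-iteration curve;
# the optimal learning rate is then the largest alpha with a single dominant harmonic.
# optimal_alpha = max(a for a in np.arange(0.05, 1.0, 0.01)
#                     if harmonic_count(train_error(a)) <= 1)
```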
Figure 2a shows the dependence of the quality of training (1) and testing (2) under the conditions β1 = 0.9, β2 = 0.999 and N = 100. The test curve shows a better result than the learning curve. The values of the testing errors for different digits are shown in Fig. 2b. The test error for the digits "1", "7", "8" and "9" is worse than for the other digits.
Figure 2: Dependence of digit recognition on the number of iterations for the training array (1) and the test array (2), and the testing error for each digit, when the Adam optimization method is applied, for a homogeneous array of dimensions 3x5, with β1 = 0.9, β2 = 0.999 and N = 100.
At 1000 iterations, a significant improvement in the accuracy and efficiency of the training model can be seen. The training error values, which range from 8.129e-06 to 8.139e-06 for different digits, are significantly lower compared to the previous data obtained at 100 iterations. According to the Fourier spectra and branching diagrams shown in Fig. 3, under the conditions β1 = 0.9, β2 = 0.999 and 1000 iterations, the optimal learning rate was 0.4501. When studying the influence of the optimization parameter β2, the following results were obtained:
β2 = 0.99, optimal alpha = 0.4501; learning error = 8.129e-06÷8.139e-06;
β2 = 0.999, optimal alpha = 0.4501; learning error = 8.129e-06÷8.139e-06;
β2 = 0.9999, optimal alpha = 0.4501; learning error = 8.129e-06÷8.139e-06.
That is, within the accuracy of the experiment, changing the optimization parameter β2 affects neither the value of the optimal learning rate nor the learning errors.
Figure 3: (a) Fourier spectra and (b) branching diagram of the learning error function versus the number of iterations, when the Adam optimization method is used, for a homogeneous array of dimensions 3x5, at β2 = 0.999 and N = 1000 iterations.

3.2. Heterogeneous array, representation of digits in a 3x5 array
Figure 4 shows the Fourier spectra of the error function and the branching diagram over the learning rate. The resulting Fourier spectra and branching diagrams are practically identical to those for a homogeneous array.
Figure 4: (a) Fourier spectra and (b) branching diagram of the learning error function versus the number of iterations, when the Adam optimization method is used, for a non-homogeneous 3x5 array, with β2 = 0.99 and N = 1000 iterations.
When studying the influence of the optimization parameter β2, the following results were obtained:
β2 = 0.99, optimal alpha = 0.4501; learning error = 6.429e-06÷6.434e-06;
β2 = 0.999, optimal alpha = 0.4501; learning error = 6.429e-06÷6.434e-06;
β2 = 0.9999, optimal alpha = 0.4501; learning error = 6.429e-06÷6.434e-06.
That is, within the accuracy of the experiment, changing the optimization parameter β2 affects neither the value of the optimal learning rate nor the learning accuracy for a heterogeneous array. Thus the change of the β2 parameter, for either a homogeneous or a non-homogeneous input array, does not affect the training result when the Adam optimization method is used.
Consider how the homogeneity or non-homogeneity of the input array affects the testing error. Fig. 5 shows the testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and heterogeneous input arrays. Within the accuracy of the experiment, the change in the value of the optimization parameter β2 has practically no effect on the testing error, either for a homogeneous array or for a non-homogeneous one. Regardless of the value of β2, there is a pattern in the testing process: the testing error for the digits "1", "7", "8" and "9" is worse than for the other digits.
Figure 5: Testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and heterogeneous input arrays, when the Adam optimization method is used, N = 1000 iterations (panels: 3x5 homogeneous array at β2 = 0.99, 0.999, 0.9999; 3x5 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).
Comparing the testing error for the homogeneous and non-homogeneous arrays in Fig. 5, a pattern can be observed: the testing error for the non-homogeneous array is almost two times smaller. Comparing the testing error with the learning error, the latter is almost an order of magnitude smaller, although the dependences of training quality (1) and testing quality (2) shown in Fig. 6 do not reflect this. Comparing the dependencies obtained for a homogeneous input array and a non-homogeneous one, it should be noted that the steepness of the change in the quality of testing, as well as of training, is greater for the homogeneous array and does not depend on the parameter β2.
Figure 6: The quality of training (1) and testing (2) versus the number of iterations (N = 1000) and the optimization parameter β2 for homogeneous and non-homogeneous arrays (panels: 3x5 homogeneous array at β2 = 0.99, 0.999, 0.9999; 3x5 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).

3.3. Homogeneous array, representation of digits in a 4x7 array
Fig. 7 shows the Fourier spectrum and the branching diagram of the learning error function over the learning rate, when the Adam optimization method is applied, for a homogeneous array of dimensions 4x7, with β2 = 0.999. The obtained Fourier spectra for both the 3x5 and the 4x7 digit arrays are characterized by the existence of harmonics, with the first, second and third clearly visible in the spectra. The branching diagram shown in Fig. 7 is, in a first approximation, similar to the diagram for the 3x5 array (Fig. 4), with the only difference that the learning process in the interval alpha < 0.4 is more uniform; that is, the correction of the weights is almost the same for all neurons.
Figure 7: (a) Fourier spectra and (b) branching diagram of the learning error function versus the number of iterations, when the Adam optimization method is applied, for a homogeneous array of dimensions 4x7, with β2 = 0.999 and N = 1000 iterations.
Consider the influence of the optimization parameter β2 on the learning error at the optimal value of the learning rate for a homogeneous array of dimensions 4x7:
β2 = 0.99, optimal alpha = 0.4501; learning error = 5.955e-06÷5.96e-06;
β2 = 0.999, optimal alpha = 0.4501; learning error = 5.955e-06÷5.96e-06;
β2 = 0.9999, optimal alpha = 0.4501; learning error = 5.955e-06÷5.96e-06.
Within the accuracy of the experiment, changing the optimization parameter β2 affects neither the value of the optimal learning rate nor the learning accuracy for a homogeneous array. Increasing the digit display array from 3x5 to 4x7 caused a decrease in the learning error.

3.4. Non-homogeneous array, representation of digits in a 4x7 array
Fig. 8 shows the Fourier spectra and the branching diagram obtained using the logistic function, for the learning error function when the Adam optimization method is used, for a non-homogeneous 4x7 array, with β2 = 0.999 and 1000 iterations. The resulting Fourier spectra and branching diagram are similar to those for a homogeneous array. This shows that increasing the size of the digit representation has a positive effect on the learning process of the neural network.
Figure 8: (a) Fourier spectra and (b) branching diagram of the learning error function versus the number of iterations, when the Adam optimization method is used, for a non-homogeneous array with a 4x7 digit representation, with β2 = 0.999 and N = 1000 iterations.
Consider the influence of the optimization parameter β2 on the learning error at the optimal value of the learning rate for a non-homogeneous array with a 4x7 digit representation:
β2 = 0.99, optimal alpha = 0.4501; learning error = 4.709e-06÷4.714e-06;
β2 = 0.999, optimal alpha = 0.4501; learning error = 4.709e-06÷4.714e-06;
β2 = 0.9999, optimal alpha = 0.4501; learning error = 4.709e-06÷4.714e-06.
Changing the optimization parameter β2 affects neither the value of the optimal learning rate nor the learning error for a non-homogeneous array. Increasing the dimension of the array from 3x5 to 4x7 caused a decrease in the learning error not only for the homogeneous array but also for the non-homogeneous one.
Let us now consider how the homogeneity and non-homogeneity of the input array affect the testing error for the 4x7 digit display array. Fig. 9 shows the testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and heterogeneous input arrays. Within the accuracy of the experiment, the change in the value of the optimization parameter β2 has practically no effect on the testing error, either for a homogeneous array or for a non-homogeneous one. Regardless of the value of β2, there is a pattern in the testing process: the testing error for the digit "0" is worse than for the other digits. Comparing the testing error for the homogeneous and non-homogeneous arrays in Fig. 9, a pattern can be observed: the testing error for the non-homogeneous array is almost a third smaller. Comparing the testing error with the learning error, the latter is almost an order of magnitude smaller.
Figure 9: Testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and non-homogeneous input arrays, N = 1000 iterations (panels: 4x7 homogeneous array at β2 = 0.99, 0.999, 0.9999; 4x7 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).
The dependencies of training quality (1) and testing quality (2) shown in Fig. 10 do not reflect this. Comparing the dependencies obtained for a homogeneous input array, it should be noted that they practically coincide, and their coincidence persists even when the optimization parameter β2 is changed. The interval of their sharp change extends up to 100 iterations. For a non-homogeneous array, as mentioned above, the training error is smaller than the testing error; therefore, the learning quality curve lies above the corresponding test curve. However, in comparison with the homogeneous digit array, with a non-homogeneous array the quality of training and testing changes less with the number of iterations. This leads to the need to carry out the training and testing process with a greater number of iterations. At the same time, an increase in the number of iterations when using a non-homogeneous array is accompanied by smaller values of both training and testing errors.
With a homogeneous input array, learning is faster at the beginning and then slows down sharply starting from about 100 iterations.
Figure 10: The quality of training (1) and testing (2) versus the number of iterations and the optimization parameter β2 for homogeneous and non-homogeneous arrays (panels: 4x7 homogeneous array at β2 = 0.99, 0.999, 0.9999; 4x7 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).

4. The AdamMax method
4.1. Homogeneous array, representation of digits in a 3x5 array
The AdamMax optimization method is an improved version of the Adam optimization algorithm. AdamMax can be useful for training deep neural networks, especially in cases where instability or large gradients may arise. The main idea of AdamMax is to use the maximum norm for regularization [10]: instead of the average of the squares of the gradients (as in Adam), AdamMax uses the maximum of the absolute values of the gradients for each parameter. This allows the algorithm to focus on large gradients, which can improve the stability and learning speed of the model. The general AdamMax optimization algorithm starts by initializing the following values: the learning rate, the exponential averages of the first- and second-order moments, β1 and β2, and the maximum norm (max_norm) used to limit the gradients. The method is given by the following formulas.
The exponentially weighted mean gradient:
$m_n = \beta_1 m_{n-1} + (1 - \beta_1) g_n$
The max norm of the gradients, which replaces the exponentially weighted mean square of the gradient used in Adam:
$v_n = \max(\beta_2 v_{n-1}, |g_n|)$
Bias correction of the first moment:
$\hat{m}_n = m_n / (1 - \beta_1^n)$
The weights are then updated according to the formula:
$w_{n+1} = w_n - \eta \hat{m}_n / (v_n + \epsilon)$
Since this method is based on the maximum of the absolute values of the gradients, we consider the training and testing errors at 100 iterations.
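A minimal NumPy sketch of the max-norm update described above, written in the usual AdaMax form; the names are illustrative and the gradient g is assumed to come from backpropagation.

```python
import numpy as np

def adammax_step(w, g, m, u, n, eta=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamMax (AdaMax-style) update: Adam's mean of squared gradients is
    replaced by a running maximum of the absolute gradients."""
    m = beta1 * m + (1 - beta1) * g       # exponentially weighted mean gradient
    u = np.maximum(beta2 * u, np.abs(g))  # max norm of the past gradients
    m_hat = m / (1 - beta1**n)            # bias-corrected first moment
    w = w - eta * m_hat / (u + eps)       # weight update using the max norm
    return w, m, u
```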
Figure 11 shows the Fourier spectra and the branching diagram of the learning error function versus the number of iterations, when the AdamMax optimization method is applied, for a homogeneous array of dimensions 3x5, at β2 = 0.999 and 1000 iterations. Harmonics can be traced in the Fourier spectra, which show that when the learning rate exceeds the optimal value, this method is also characterized by overlearning, which can be associated with the appearance of local minima in the learning error function. The resulting branching diagram for the AdamMax method with a 3x5 array, in comparison with the Adam method under the same conditions, shows that the learning process in the interval alpha < 0.4 is more homogeneous.
Figure 11: (a) Fourier spectra and (b) branching diagram of the learning error function versus the number of iterations, when the AdamMax optimization method is applied, for a homogeneous array of dimensions 3x5, at β2 = 0.999 and 1000 iterations.
Consider the influence of the optimization parameter β2 on the learning error at the optimal value of the learning rate for a homogeneous array with a 3x5 digit representation:
β2 = 0.99, optimal alpha = 0.4501; learning error = 2.6065e-05÷2.6291e-05;
β2 = 0.999, optimal alpha = 0.4501; learning error = 2.6026e-05÷2.6285e-05;
β2 = 0.9999, optimal alpha = 0.4501; learning error = 2.6029e-05÷2.6292e-05.
Changing the optimization parameter β2, within the accuracy of the experiment, does not affect the value of the optimal learning rate. As for the learning error, as noted in [12], with an increase in β2 a decrease in its value is observed at first (at β2 = 0.999), and then an increase (at β2 = 0.9999). As noted in [11], the smallest learning error is observed at the value of the optimization parameter β2 = 0.999. So, unlike the Adam method, in the AdamMax method a dependence of the learning error on the value of the parameter β2 is observed.

4.2. Non-homogeneous array, representation of digits in a 3x5 array
When studying the influence of the optimization parameter β2, the following results were obtained for the learning error at the optimal value of the learning rate for a non-homogeneous array with a 3x5 digit representation:
β2 = 0.99, optimal alpha = 0.4501; learning error = 2.0633e-05÷2.0784e-05;
β2 = 0.999, optimal alpha = 0.4501; learning error = 2.0614e-05÷2.0785e-05;
β2 = 0.9999, optimal alpha = 0.4501; learning error = 2.0635e-05÷2.0786e-05.
Within the accuracy of the experiment, changing the optimization parameter β2, both for a homogeneous input array and for a non-homogeneous one, does not affect the value of the optimal learning rate. As for the learning error, as noted in [12], with an increase in β2 first its decrease and then its increase can be traced. According to [12], the best value of the optimization parameter is β2 = 0.999. So, in the AdamMax method, the dependence of the learning error on the value of the parameter β2 can be traced both for a homogeneous and for a non-homogeneous input array. To understand the mechanism of this dependence, consider the testing error and the dependence of the quality of learning and testing on the number of iterations.
Let us consider how the homogeneity and non-homogeneity of the input array affect the testing error for the 3x5 digit display array. Fig. 12 shows the testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and non-homogeneous input arrays. Changing the value of the optimization parameter β2 affects the testing error for both a homogeneous array and a non-homogeneous one. Increasing the value of β2 causes a decrease in the testing error by almost an order of magnitude at 100 iterations. Regardless of the value of β2, there is a pattern in the testing process: the testing error for the digit "0" is worse than for the other digits. Comparing the testing error for the homogeneous and non-homogeneous arrays in Fig. 12, a pattern can be observed: the testing error for the non-homogeneous array is almost a third smaller. Comparing the testing error with the learning error, the latter is approximately an order of magnitude smaller. This is reflected in the dependence of the quality of training (1) and quality of testing (2) on the number of iterations shown in Fig. 13. The interval of a sharp change in these dependencies depends both on the optimization parameter β2 and on the homogeneity of the input array. For a non-homogeneous array, as mentioned above, the training error is smaller than the testing error; therefore, the learning quality curve lies above the corresponding test curve. With an increase in the parameter β2, the interval of iterations in which a sharp change in the steepness of this dependence takes place narrows.
That is, the larger the value of the β2 parameter, the smaller the number of iterations required to achieve a certain testing accuracy. Compared with the homogeneous digit array, with a non-homogeneous array the quality of training and testing changes more steeply with the number of iterations. This behavior of both the learning and the testing error as functions of the number of iterations demonstrates the important role of large gradients, which improve the stability and learning speed of the neural network when the AdamMax optimization method is applied, compared to the Adam optimization method. Comparing the test error for different digits for the Adam and AdamMax optimization methods, it can be noted that for the digits from "2" to "9" the testing error is approximately the same; the largest testing error is observed for the digit "0".
Figure 12: Testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and non-homogeneous input arrays, when the AdamMax method is used, N = 100 iterations (panels: 3x5 homogeneous array at β2 = 0.99, 0.999, 0.9999; 3x5 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).
Figure 13: The quality of training (1) and testing (2) versus the number of iterations and the optimization parameter β2 for homogeneous and non-homogeneous arrays, when the AdamMax method is used, N = 100 iterations (panels: 3x5 homogeneous array at β2 = 0.99, 0.999, 0.9999; 3x5 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).

4.3. Homogeneous array, representation of digits in a 4x7 array
Under the influence of the optimization parameter β2, the following values of the learning error were obtained at the optimal value of the learning rate for a homogeneous array with a 4x7 digit representation:
β2 = 0.99, optimal alpha = 0.451; learning error = 1.9111e-05÷1.9266e-05;
β2 = 0.999, optimal alpha = 0.451; learning error = 1.9112e-05÷1.9261e-05;
β2 = 0.9999, optimal alpha = 0.451; learning error = 1.9109e-05÷1.9268e-05.
Within the accuracy of the experiment, changing the optimization parameter β2 for a homogeneous input array does not affect the value of the optimal learning rate. In comparison with the 3x5 digit presentation array, for the 4x7 array there is a slight increase in the optimal learning rate from 0.4501 to 0.451. As for the learning error, as noted in [11], within the accuracy of the experiment its decrease and then its increase can be traced as β2 grows. The best value of the optimization parameter is β2 = 0.999, at which the minimum learning error is observed for the given number of iterations. So, for the AdamMax method, the dependence of the learning error on the value of the parameter β2 can be traced for both the 3x5 and the 4x7 digit representation arrays, as described in [12].

4.4. Non-homogeneous array, representation of digits in a 4x7 array
Consider the dependence of the learning error at the optimal value of the learning rate for a non-homogeneous array of size 4x7 when the optimization parameter β2 is changed:
β2 = 0.99, optimal alpha = 0.4501; learning error = 1.5144e-05÷1.5282e-05;
β2 = 0.999, optimal alpha = 0.4501; learning error = 1.5144e-05÷1.5284e-05;
β2 = 0.9999, optimal alpha = 0.4501; learning error = 1.5154e-05÷1.5283e-05.
Within the accuracy of the experiment, changing the optimization parameter β2 for a non-homogeneous input array does not affect the value of the optimal learning rate. As for the learning error, within the accuracy of the experiment there is a slight increase in the learning error with an increase in the value of β2. So, in the AdamMax method, the dependence of the learning error on the value of the parameter β2 can be traced for both homogeneous and non-homogeneous arrays.
Fig. 14 shows the testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and non-homogeneous input arrays, with the digits represented by a 4x7 array. Changing the value of the optimization parameter β2 affects the testing error for both a homogeneous array and a non-homogeneous one. Increasing the value of β2 causes a decrease in the testing error by almost two times for the homogeneous array, and by three times for the non-homogeneous array, at 100 iterations. Regardless of the value of β2, there is a pattern in the testing process: the testing error for the digits "1", "2", "3", "4", "5" is worse than for the other digits. Comparing the testing error for the homogeneous and non-homogeneous arrays in Fig. 14, we observe the pattern that the testing error for the non-homogeneous array is almost a third larger and depends on the parameter β2. If we compare the testing error with the training error, the latter is about an order of magnitude smaller. The dependencies of training quality (1) and testing quality (2) shown in Fig. 15 reflect this. The interval of a sharp change in these dependencies depends both on the optimization parameter β2 and on the homogeneity of the input array. For a non-homogeneous array, as mentioned above, the training error is smaller than the testing error.
Figure 14: Testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and non-homogeneous input arrays, when the AdamMax method is used, N = 100 (panels: 4x7 homogeneous array at β2 = 0.99, 0.999, 0.9999; 4x7 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).
Therefore, the learning quality curve lies above the corresponding test curve, although at small numbers of iterations (in the interval of sharp changes in the quality of testing) the opposite trend is observed. With an increase in the parameter β2 for a homogeneous array, an expansion of the iteration interval in which a sharp change in the steepness of this dependence occurs can be traced; that is, the larger the value of the β2 parameter, the greater the number of iterations required to achieve a certain testing accuracy. For a non-homogeneous array, the quality of training and testing improves when the parameter β2 increases: the interval of sharp change in iterations decreases at first, but then increases again. This behavior of both the learning and the testing error as functions of the number of iterations, as noted above, attests to the important role of large gradients, which improve the stability and learning speed of the neural network when the AdamMax optimization method is used, compared to the Adam optimization method.
Figure 15: The quality of training (1) and testing (2) versus the number of iterations and the optimization parameter β2 for homogeneous and non-homogeneous arrays, when the AdamMax method is used, N = 100 (panels: 4x7 homogeneous array at β2 = 0.99, 0.999, 0.9999; 4x7 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).

5. The AMSGrad method
5.1. Homogeneous array, representation of digits in a 3x5 array
AMSGrad (Adaptive Moment Estimation with Squared Gradient) is a variant of the optimization algorithm that is a modification of Adam (Adaptive Moment Estimation) [12]. The main idea of AMSGrad is to fix a problem with the second-order moment estimates $v_n$ in Adam: in regular Adam, $v_n$ can decrease between iterations, which can lead to an increase in the effective learning rate and, as a result, to large changes in the model parameters, which is not always desirable. AMSGrad solves this problem by keeping the maximum value of $v_n$ over all past steps. In this way, AMSGrad ensures more stable training and a more accurate adaptation of the learning rate for each parameter. The AMSGrad optimization algorithm initializes such parameters as the learning rate, the exponential averages of the first- and second-order moments, β1 and β2, and the initial values of the first- and second-order moments m = 0, v = 0, vmax = 0 [13]. The method is given by the following formulas.
The exponentially weighted mean gradient:
$m_n = \beta_1 m_{n-1} + (1 - \beta_1) g_n$
The exponentially weighted mean square of the gradient:
$v_n = \beta_2 v_{n-1} + (1 - \beta_2) g_n^2$
Bias correction:
$\hat{m}_n = m_n / (1 - \beta_1^n)$,  $\hat{v}_n = v_n / (1 - \beta_2^n)$
The running maximum of the second-moment estimate:
$\hat{v}_{max,n} = \max(\hat{v}_{max,n-1}, \hat{v}_n)$
The weights are then updated according to the formula:
$w_{n+1} = w_n - \eta \hat{m}_n / (\sqrt{\hat{v}_{max,n}} + \epsilon)$,
where ϵ is a small number for numerical stability.
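A minimal NumPy sketch of the AMSGrad step given above; the bias corrections follow the paper's formulas, and the names are illustrative.

```python
import numpy as np

def amsgrad_step(w, g, m, v, v_max, n, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update: like Adam, but the denominator uses the running maximum
    of the (bias-corrected) second-moment estimate, so it never decreases."""
    m = beta1 * m + (1 - beta1) * g               # exponentially weighted mean gradient
    v = beta2 * v + (1 - beta2) * g**2            # exponentially weighted mean squared gradient
    m_hat = m / (1 - beta1**n)                    # bias-corrected first moment
    v_hat = v / (1 - beta2**n)                    # bias-corrected second moment
    v_max = np.maximum(v_max, v_hat)              # keep the largest second moment seen so far
    w = w - eta * m_hat / (np.sqrt(v_max) + eps)  # weight update with the non-decreasing v_max
    return w, m, v, v_max
```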
The obtained Fourier spectra (Fig. 16) are similar to the spectra obtained for the Adam and AdamMax optimization methods. These Fourier spectra show the existence of harmonics, which indicate the presence of a neural network retraining process at a learning rate greater than the optimal one. The resulting branching diagram for the AMSGrad method with a 3x5 array, in comparison with the Adam and AdamMax methods under the same conditions, shows that the learning process in the interval alpha < 0.4 is more homogeneous. Based on the Fourier spectra and branching diagrams, the AMSGrad optimization method also has an inherent chaotic learning mode, which is caused by the appearance of local minima of the learning error function, which in turn are associated with the retraining of neurons when approaching the global minimum.
Figure 16: (a) Fourier spectra and (b) branching diagram of the training and testing error depending on the number of iterations, when the AMSGrad optimization method is applied, for a homogeneous array of dimensions 3x5, with β2 = 0.999 and 100 iterations.
As for the Adam and AdamMax optimization methods, for the AMSGrad optimization method we consider the influence of the optimization parameter β2 on the learning error at the optimal value of the learning rate for a homogeneous array with a 3x5 digit representation:
β2 = 0.99, optimal alpha = 0.4501; learning error = 1.908e-05÷1.9233e-05;
β2 = 0.999, optimal alpha = 0.4501; learning error = 1.9068e-05÷1.9242e-05;
β2 = 0.9999, optimal alpha = 0.4501; learning error = 1.909e-05÷1.9244e-05.
Within the accuracy of the experiment, changing the optimization parameter β2 for a homogeneous input array does not affect the value of the optimal learning rate. As for the learning error, as in the AdamMax method, within the accuracy of the experiment its increase can be traced as β2 increases. The best value of the optimization parameter, at which the minimum learning error is observed for the given number of iterations, is β2 = 0.999. So, in neural networks using the AMSGrad optimization method to display digits in a 3x5 array, as in the AdamMax method, a dependence of the learning error on the value of the parameter β2 is traced.

5.2. Non-homogeneous array, representation of digits in a 3x5 array
When considering the influence of the optimization parameter β2, the following values of the learning error were obtained at the optimal value of the learning rate for a non-homogeneous 3x5 array:
β2 = 0.99, optimal alpha = 0.4501; learning error = 1.5006e-05÷1.519e-05;
β2 = 0.999, optimal alpha = 0.4501; learning error = 1.496e-05÷1.5192e-05;
β2 = 0.9999, optimal alpha = 0.4501; learning error = 1.5075e-05÷1.5196e-05.
Changing the optimization parameter β2 for a non-homogeneous input array does not affect the value of the optimal learning rate. As for the learning error, as noted in [12], within the accuracy of the experiment its decrease and then its increase can be traced as β2 increases. The best value of the optimization parameter, at which the minimum learning error is observed for the given number of iterations, is β2 = 0.999, as for the homogeneous array. So, in neural networks using the AMSGrad optimization method, a dependence of the learning error on the value of the parameter β2 is traced.
Consider how the homogeneity and non-homogeneity of the input array affect the testing error for the 3x5 array when the AMSGrad optimization method is used. Fig. 17 shows the testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and heterogeneous input arrays. Changing the value of the optimization parameter β2 affects the testing error for both a homogeneous and a non-homogeneous array. Increasing the value of β2 causes a decrease in the testing error by almost an order of magnitude for the homogeneous array, and likewise by almost an order of magnitude for the non-homogeneous array, at 100 iterations. Regardless of the value of β2, there is a pattern in the testing process: the testing error for the digit "0" is worse than for the other digits. Comparing the testing error for the homogeneous and non-homogeneous arrays in Fig. 17, a pattern can be observed: the testing error for the non-homogeneous array is smaller.
Figure 17: Testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and non-homogeneous input arrays, when the AMSGrad method is used, N = 100 (panels: 3x5 homogeneous array at β2 = 0.99, 0.999, 0.9999; 3x5 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).
Comparing the testing error with the learning error, the latter is about an order of magnitude smaller. The dependencies of training quality (1) and testing quality (2) shown in Fig. 18 reflect this. The interval of a sharp change in these dependencies on the number of iterations depends both on the optimization parameter β2 and on the homogeneity of the input array. For a non-homogeneous array, as mentioned above, the training error is smaller than the testing error.
Therefore, the learning quality curve lies above the corresponding testing curve outside the interval of its sharp changes; in the interval of iterations where a sharp change in the quality of learning is observed, the relationship is reversed. With an increase in the parameter β2, a narrowing of the iteration interval in which a sharp change in the steepness of this dependence occurs can be traced; that is, the larger the value of the β2 parameter, the smaller the number of iterations required to achieve a certain testing accuracy. Compared with the homogeneous digit array, with a non-homogeneous array the quality of training and testing changes more steeply with the number of iterations. This behavior of both the learning and the testing error as functions of the number of iterations testifies to the important role of large gradients, which improve the stability and learning speed of the neural network when the AMSGrad optimization method is used, compared to the Adam optimization method.
Figure 18: The quality of training (1) and testing (2) versus the number of iterations and the optimization parameter β2 for homogeneous and non-homogeneous arrays, when the AMSGrad method is used, N = 100 (panels: 3x5 homogeneous array at β2 = 0.99, 0.999, 0.9999; 3x5 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).

5.3. Homogeneous array, representation of digits in a 4x7 array
Changing the optimization parameter β2 at the optimal value of the learning rate for a homogeneous array with a 4x7 digit representation causes the following change in the learning error:
β2 = 0.99, optimal alpha = 0.4501; learning error = 1.9154e-05÷1.9308e-05;
β2 = 0.999, optimal alpha = 0.4501; learning error = 1.915e-05÷1.9292e-05;
β2 = 0.9999, optimal alpha = 0.4501; learning error = 1.915e-05÷1.9292e-05.
As for the Adam and AdamMax optimization methods, so also for AMSGrad: changing the optimization parameter β2, within the accuracy of the experiment, does not affect the value of the optimal learning rate. The learning error decreases as β2 increases. The best value of the optimization parameter β2, related to the minimum learning error for the given number of iterations, is β2 = 0.999. So, in the AMSGrad method, for both the 3x5 and the 4x7 digit display arrays, the dependence of the learning error on the value of the β2 parameter can be traced.

5.4. Non-homogeneous array, representation of digits in a 4x7 array
When changing the optimization parameter β2 at the optimal value of the learning rate for a non-homogeneous array with a 4x7 digit representation, the following values of the learning error at 100 iterations were obtained:
β2 = 0.99, optimal alpha = 0.4501; learning error = 1.5144e-05÷1.526e-05;
β2 = 0.999, optimal alpha = 0.4501; learning error = 1.5144e-05÷1.5284e-05;
β2 = 0.9999, optimal alpha = 0.4501; learning error = 1.5154e-05÷1.5283e-05.
Within the accuracy of the experiment, changing the optimization parameter β2 for a non-homogeneous input array does not affect the value of the optimal learning rate. As for the learning error, its increase can be traced with an increase in the value of β2. Fig. 19 shows the testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and non-homogeneous input arrays, with the digits displayed in a 4x7 array.
Changing the value of the optimization parameter β2 affects the testing error for both a homogeneous array and a non-homogeneous one. Increasing the value of β2 produces a similar dependence of the testing error for the non-homogeneous array as for the homogeneous one. Regardless of the value of β2, there is a pattern in the testing process: the testing error for the digits "1", "2", "3", "4" and "5" is worse than for the other digits. Comparing the testing error for the homogeneous and non-homogeneous 4x7 digit arrays in Fig. 19, there is a pattern that the testing error is smaller for the non-homogeneous array.
Figure 19: Testing error for different digits when the optimization parameter β2 is changed at a constant value β1 = 0.9, for homogeneous and non-homogeneous input arrays, when the AMSGrad method is used, N = 100 (panels: 4x7 homogeneous array at β2 = 0.99, 0.999, 0.9999; 4x7 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).
Comparing the testing error with the learning error, the latter is approximately an order of magnitude smaller. The dependencies of training quality (1) and testing quality (2) shown in Fig. 20 reflect this. The interval of a sharp change in these dependencies on the number of iterations depends on both the optimization parameter β2 and the homogeneity of the input array. For a non-homogeneous array, as mentioned above, the training error is smaller than the testing error; therefore, the learning quality curve lies above the corresponding testing curve outside the interval of its sharp changes, while in the interval of iterations where a sharp change in the quality of learning is observed the relationship is reversed. With an increase in the parameter β2, a narrowing of the iteration interval in which a sharp change in the steepness of this dependence occurs can be traced; that is, the larger the value of the β2 parameter, the smaller the number of iterations required to achieve a certain testing accuracy. Compared with the homogeneous digit array, with a non-homogeneous array the quality of training and testing changes more steeply with the number of iterations. This behavior of both the learning and the testing error as functions of the number of iterations demonstrates the important role of large gradients, which improve the stability and learning speed of the neural network when the AMSGrad optimization method is used, compared to the Adam optimization method.
Figure 20: The quality of training (1) and testing (2) versus the number of iterations and the optimization parameter β2 for homogeneous and non-homogeneous arrays, when the AMSGrad method is used, N = 100 (panels: 4x7 homogeneous array at β2 = 0.99, 0.999, 0.9999; 4x7 non-homogeneous array at β2 = 0.99, 0.999, 0.9999).

6. Conclusions
The conducted studies of the learning and testing errors show that the learning process, depending on the learning rate (learning step), exhibits a number of learning modes. When the learning rate is less than the optimal one, a non-learning mode is observed, which can be characterized as a learning mode with an error > 10%.
When training takes place in the vicinity of the optimal learning rate, a satisfactory learning process with minimal learning error is observed. Retraining takes place when the learning rate is greater than its optimal value; it is accompanied by the appearance of local minima and, therefore, by an increase in the learning error. A further increase in the learning rate leads to an increase in the number of neurons involved in the retraining process, and therefore to an increase in the number of local minima. A sharp increase in the number of local minima (which in a first approximation can be described by a doubling of their number) leads to the emergence of a chaotic state.
In this paper, the optimal learning rate was determined from the Fourier spectra of the learning error function and corresponds to the learning rate at which the Fourier spectra are characterized by the appearance of the first harmonic. The conducted studies of the learning error of neural networks using the AMSGrad, Adam and AdamMax optimization methods show that these methods do not affect the value of the optimal learning rate, and that it does not depend on the change of the optimization parameter β2. For all considered optimization methods it is 0.450, which is approximately equal to the default value used in the well-known machine learning libraries for neural networks. Our experimental studies of the influence of the sample and of the size of the digit array on the value of the optimal learning rate demonstrate that it does not change. However, similar studies for an array of handwritten digits, given with a display size of 28x28 pixels, showed an increase of its value to 0.5 when the Adam optimization method was used.
As for the learning error, its value depends on the sample, on the size of the digit display array, on the homogeneity of the input array, on the applied optimization methods, and on the values of the optimization parameters. The change in the slope of the dependence of the learning error on the number of iterations for the considered optimization methods is especially noteworthy. Moving from the Adam optimization method to AdamMax and AMSGrad, an increase in the steepness of the slope of this dependence can be traced. This is because in the AdamMax and AMSGrad methods, instead of the average value of the squares of the gradients used in the Adam method, the maximum of the absolute values of the gradients (or of the second-moment estimate) is used for each parameter. This allows focusing on large gradients, which improves stability and learning speed.
Along with this, it should be noted that the retraining process is associated with the appearance of local minima, which, as the learning rate changes, ultimately lead to a chaotic learning mode of the neural network. The mechanism of transition to the chaotic learning mode of the neural network is related to the doubling of the number of local minima of the error function. The obtained Fourier spectra of the learning error function, the branching diagrams, and the particular behavior of the error function when the optimization learning methods are used prove that the main reason for the appearance of local minima of the error function is the retraining process.
As for the smaller value of the learning error for a non-homogeneous input array compared to a homogeneous one, in the authors' view this confirms the important role of large gradients in improving the stability and speed of learning.

References
[1] J. Schneider, Cross Validation, 1997. URL: https://www.cs.cmu.edu/~schneide/tut5/node42.html
[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the Inception Architecture for Computer Vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818-2826. URL: https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf
[3] S. O. Subbotin, Neural networks: theory and practice: textbook (Ed. O. O. Evenok), Zhytomyr, 2020, 184 p.
[4] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Journal of Machine Learning Research, Volume 15, 2014, pp. 1929-1958. URL: https://jmlr.org/papers/v15/srivastava14a.html
[5] D. Berrar, Cross-validation, in: Encyclopedia of Bioinformatics and Computational Biology, Elsevier, Volume 1, 2018, pp. 542-545. URL: https://doi.org/10.1016/B978-0-12-809633-8.20349-X
[6] Y. Yao, L. Rosasco, A. Caponnetto, On Early Stopping in Gradient Descent Learning, Constructive Approximation, Springer, Volume 26, 2007, pp. 289-315. URL: https://doi.org/10.1007/s00365-006-0663-2
[7] F. Girosi, M. Jones, T. Poggio, Regularization Theory and Neural Networks Architectures, Neural Computation, MIT Press, Volume 7, Issue 2, 1995, pp. 219-269. URL: https://doi.org/10.1162/neco.1995.7.2.219
[8] K. Kawaguchi, Effect of Depth and Width on Local Minima in Deep Learning, Neural Computation, MIT Press, Volume 31, Issue 7, 2019, pp. 1462-1498. URL: https://doi.org/10.1162/neco_a_01195
[9] S. Sveleba, I. Katerynchuk, I. Kuno, N. Sveleba, O. Semotyjuk, Investigation of the Transition Mechanism to Chaos in Multilayer Neural Networks, in: 2021 IEEE 4th International Conference on Advanced Information and Communication Technologies (AICT), 2021, pp. 118-121. URL: https://ieeexplore.ieee.org/document/9628919. doi: 10.1109/AICT52120.2021.9628919
[10] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, in: 3rd International Conference on Learning Representations, San Diego, 2015. URL: https://doi.org/10.48550/arXiv.1412.6980
[11] X. Zeng, Z. Zhang, D. Wang, AdaMax Online Training for Speech Recognition, CSLT Technical Report 20150032, 2016. URL: http://www.cslt.org/mediawiki/images/d/df/Adamax_Online_Training_for_Speech_Recognition.pdf
[12] T. T. Phuong, L. T. Phong, On the Convergence Proof of AMSGrad and a New Version, IEEE Access, Volume 7, 2019, pp. 61706-61716. URL: https://doi.org/10.48550/arXiv.1904.03590. doi: 10.1109/ACCESS.2019.2916341
[13] J.-K. Wang, X. Li, B. Karimi, P. Li, An Optimistic Acceleration of AMSGrad for Nonconvex Optimization, in: Proceedings of Machine Learning Research, Volume 157, 2021, pp. 422-437. URL: https://proceedings.mlr.press/v157/wang21c/wang21c.pdf