Training neural network method modification for forward error propagation based on adaptive components

Victoria Vysotska1,†, Vasyl Lytvyn1,†, Mariia Nazarkevych1,†, Serhii Vladov2,∗,†, Ruslan Yakovliev2,† and Alexey Yurko3,†

1 Lviv Polytechnic National University, Stepan Bandera Street 12, 79013 Lviv, Ukraine
2 Kremenchuk Flight College of Kharkiv National University of Internal Affairs, Peremohy Street 17/6, 39605 Kremenchuk, Ukraine
3 Kremenchuk Mykhailo Ostrohradskyi National University, University Street 20, 39600 Kremenchuk, Ukraine

Abstract
The work is devoted to the development of a training algorithm for forward propagation neural networks, based on the backpropagation algorithm and a set of adaptive elements: an adaptive training rate, adaptive initialization of the neural network weights, adaptive regularization, an adaptive neuron activation function, adaptive changes to the neural network architecture, and adaptive mini-batch resizing. Using the example of helicopter turboshaft engine parameters debugging, it is shown that the developed algorithm achieves almost 100 % training accuracy on both the training and validation data sets with a minimum number of iterations. The work experimentally substantiates the optimal value of the training rate coefficient, the number of neurons in the hidden layer of the neural network, and the optimal number of training iterations by determining the smallest value of the final total standard deviation per epoch. It is established that the use of L2-regularization in the developed method of training a feed-forward neural network with adaptive elements raises the adjustment curve (or a similar dependence) by the regularization value, bringing it closer to unity. This improved the accuracy of setting the gas-generator rotor r.p.m. in the helicopter turboshaft engine parameters debugging task by a factor of two compared to the well-known Delta-Bar-Delta neural network training algorithm. Using the developed training algorithm for forward propagation neural networks with adaptive elements reduces the error coefficient by 1.89 times and slightly increases the accuracy of determining the gas-generator rotor r.p.m. boundary values (by 1.01 times) compared to the Delta-Bar-Delta algorithm in helicopter turboshaft engine parameters debugging.

Keywords
Neural network, helicopter turboshaft engines, training algorithm, parameters debugging, adaptive elements, adaptive training rate, gas-generator rotor r.p.m., L2-regularization

MoMLeT-2024: 6th International Workshop on Modern Machine Learning Technologies, May 31 – June 1, 2024, Lviv-Shatsk, Ukraine
∗ Corresponding author.
† These authors contributed equally.
victoria.a.vysotska@lpnu.ua (V. Vysotska); vasyl.v.lytvyn@lpnu.ua (V. Lytvyn); mariia.a.nazarkevych@lpnu.ua (M. Nazarkevych); serhii.vladov@univd.edu.ua (S. Vladov); director.klk.hnuvs@gmail.com (R. Yakovliev); yurkoalexe@gmail.com (A. Yurko)
ORCID: 0000-0001-6417-3689 (V. Vysotska); 0000-0002-9676-0180 (V. Lytvyn); 0000-0002-6528-9867 (M. Nazarkevych); 0000-0001-8009-5254 (S. Vladov); 0000-0002-3788-2583 (R. Yakovliev); 0000-0002-8244-2376 (A. Yurko)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1. Introduction

Feedforward neural networks are one of the most widely used classes of artificial neural networks. They comprise neurons organized into layers, with each neuron connected to neurons in the next layer. Forward propagation means that signals are transmitted in only one direction, from the input nodes to the output units [1, 2]. In feedforward neural networks, adaptive elements play a key role: they allow the network to learn from the data provided and to adapt its weights and parameters to achieve the desired output. One of the most common methods for adapting elements in neural networks is the backpropagation algorithm, which uses gradient descent to adjust the weights [3, 4].

Development of a neural network begins with defining its architecture, which includes the number of layers, the number of neurons in each layer, and the choice of activation functions. The neuron weights are then initialized with random values. The training process involves passing data forward through the network (forward propagation), estimating the error between the predicted and expected output, and then backpropagating the error to adjust the weights using gradient descent. Once training is completed, the network is tested on a separate dataset to evaluate its performance. This process is repeated until a satisfactory level of neural network performance is achieved [5, 6]. Important aspects of neural network development are the correct choice of network architecture, optimization of training parameters, and accurate data processing. Feedforward neural networks with adaptive elements provide a powerful tool for modeling complex relations in data and solving a variety of tasks in the fields of machine learning and artificial intelligence [7, 8].

A critical drawback of the element adaptation method in feedforward neural networks, namely the backpropagation algorithm, is its tendency to get stuck in local minima and saddle points of the loss function, especially in the case of complex and non-smooth functions. This can limit the network's ability to reach an optimal solution and slow down the training process, requiring careful selection of hyperparameters and the use of additional methods to avoid getting stuck [9, 10].

The aim of this work is to research and develop new methods for optimizing the backpropagation algorithm in feedforward neural networks to improve its resistance to getting stuck in local minima and saddle points of the loss function. This includes analyzing problem situations, developing new gradient optimization methods and algorithms, and experimentally testing and comparing their effectiveness on different datasets and network architectures. The result should be innovative approaches that increase the speed and accuracy of neural network training, reduce the likelihood of getting stuck in local minima, and provide more stable convergence to the optimal solution.

2. Related works

It is known that a feed-forward neural network consists of interacting adaptive elements called neurons, each of which carries out a certain functional transformation of input signals [11, 12]. The authors of [13] were the first to propose representing the error backpropagation process using a functional diagram known as a system backpropagation diagram. This diagram serves as a visual tool to explain the operation of the backpropagation algorithm. The authors use it as an aid to simplify the derivation of the necessary expressions when analyzing dynamic neural networks designed to process time-dependent signals.
This method has also been used by other authors, for example in [14, 15], as a visual way to represent backpropagation rules when studying neural networks. In [16], the approach proposed in [13] was expanded and streamlined by constructing a neural network based on adaptive components, which must remain independent of each other during the construction of a mathematical model of the network. Bidirectional connections are established between the components, forming two combined graphs that describe the transmission of signals in both directions. Each component performs signal processing in both the forward and backward directions and also adjusts its adaptive parameters during training using the Delta-Bar-Delta method [17]. Unlike gradient descent and momentum-based methods, the main difference of this method is that each adaptive parameter is assigned its own training rate coefficient. At the end of each training epoch, both the adaptable parameters and the training rate coefficients are corrected.

A critical disadvantage of [16, 17] is the increased complexity of model control and tuning due to the need to track and adjust individual training rate coefficients for each adaptive parameter. This requires additional computational resources and time during training, since each parameter must be configured separately according to the training dynamics, which can slow down the process and complicate network configuration. In addition, there is an increased likelihood of incorrectly selecting the training rate coefficients, which can lead to instability and poor model performance.

Thus, the relevance of the research is emphasized by the need to overcome the difficulties associated with managing and tuning neural networks caused by the increased complexity of adaptive parameters that require individual adjustment of training rate coefficients. This limits the training efficiency and stability of models, increasing the likelihood of instability and slower training. In the context of the desire to improve the performance and accuracy of neural networks, the development of new optimization methods becomes an urgent task aimed at improving the stability of training, reducing setup time, and increasing the stability of models when converging to the optimal solution.

3. Methods and materials

One possible optimal adaptive element for improving the backpropagation algorithm is the "Adaptive Training Rate" (ATR). This element dynamically changes the training rate depending on the gradients obtained at each training step (Table 1). The paper proposes an algorithm for training a forward propagation neural network using an adaptive element in the form of an "Adaptive Training Rate" by combining the backpropagation algorithm with ATR.

Table 1
"Adaptive Training Rate" description (author's research)

Automatic regulation of training speed: ATR allows the training rate to be adapted at each step based on gradient information. If the gradients are small, which could indicate that the network is near a local minimum or saddle point, ATR automatically reduces the training rate to prevent the weights from changing too much and possibly getting stuck at local minima or saddle points.
Quick adaptation to changing conditions: ATR makes it possible to quickly adapt to changes in data structure or task complexity. For example, if some model parameters require more intensive training, ATR can increase the training rate for those parameters, providing more efficient training.
Preventing divergence and increasing training stability: An adaptive training rate can help prevent the backpropagation algorithm from diverging by controlling the rate at which the weights change. This provides more stable training and improves the overall convergence of the neural network.
Improving training efficiency: ATR allows for more efficient use of training resources because it allows the training rate to be tailored to the specific conditions of each training step, reducing the likelihood of overfitting and accelerating convergence to the optimal solution.
Conclusion: The introduction of an adaptive element in the form of an "Adaptive Training Rate" can significantly improve the training process of neural networks, making it more stable, efficient, and resistant to the various conditions and problems associated with the backpropagation algorithm.

At the initial stage, adaptive initialization of the neural network weights is carried out by calculating the average value and the dispersion of the input data according to the expressions:

\mu = \frac{1}{N} \cdot \sum_{i=1}^{N} x_i,   (1)

\sigma^2 = \frac{1}{N} \cdot \sum_{i=1}^{N} (x_i - \mu)^2,   (2)

where N is the number of training examples and x_i is the input data. Using weight initialization methods (for example, He's method [18] or the Xavier method [19]), the initial values of the weights are set, taking into account the obtained statistical characteristics of the input data (Table 2).

Table 2
Initial weights initialization methods description (author's research)

He's method: W \sim N\left(0, \frac{2}{n_{in}}\right), where N(\mu, \sigma^2) is a normal distribution with mean \mu and variance \sigma^2, and n_{in} is the number of input neurons.
Xavier method: W \sim U\left(-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right), where U(a, b) is a uniform distribution on the interval [a, b] and n_{out} is the number of output neurons.

Let W_{ij}^{(l)} be the weight connecting the i-th neuron in the l-th layer with the j-th neuron in the next, (l + 1)-th, layer. For each training example x, the output \hat{y} of the neural network is calculated according to the expressions:

z^{(l)} = W^{(l)} \cdot a^{(l-1)} + b^{(l)},   (3)

a^{(l)} = \sigma(z^{(l)}),   (4)

where z^{(l)} is the weighted sum of inputs for the l-th layer, a^{(l)} is the activation of the l-th layer, and \sigma is the activation function of the l-th layer. Next, the error of the neural network is estimated using the loss function L and the expected value y according to the expression:

E = \frac{1}{2} \cdot \sum_{i=1}^{N} (y_i - \hat{y}_i)^2.   (5)

Next, the gradient of the loss function with respect to the neural network weights is calculated according to the expressions:

\delta^{(L)} = \frac{\partial E}{\partial z^{(L)}} = (y - \hat{y}) \cdot \sigma'(z^{(L)}),   (6)

\delta^{(l)} = \left( (W^{(l+1)})^T \cdot \delta^{(l+1)} \right) \odot \sigma'(z^{(l)}),   (7)

where \delta^{(l)} is the error on the l-th layer and \odot denotes element-wise multiplication. After calculating the gradient of the loss function with respect to the neural network weights, the weights are updated taking into account the gradient and the adaptive training rate according to the expressions:

W^{(l)} = W^{(l)} - \alpha^{(l)} \cdot \frac{\partial E}{\partial W^{(l)}},   (8)

b^{(l)} = b^{(l)} - \alpha^{(l)} \cdot \frac{\partial E}{\partial b^{(l)}},   (9)

where \alpha^{(l)} is the adaptive training rate for the l-th layer. In this case, the training rate at each step is updated according to the expression:

\alpha^{(l)} = \frac{\alpha_0}{1 + \beta \cdot \|\nabla L(\theta)\|^2},   (10)

where \alpha_0 is the initial training rate, \beta is the adaptation coefficient, \|\nabla L(\theta)\|^2 is the squared norm of the gradient, L(\theta) is the loss function, and \theta is the model parameters vector.
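To make the steps above easier to follow, the following minimal NumPy sketch combines He initialization from Table 2, the forward pass (3)–(4), backpropagation (6)–(7), and the weight update with the adaptive training rate (8)–(10). It is an illustration rather than the authors' implementation: the names he_init, train_step, alpha0 and beta are introduced here, tanh stands in for the Mish activation selected later, and a single training rate is shared across all layers for brevity.

```python
import numpy as np

def he_init(n_in, n_out, rng):
    # He initialization (Table 2): W ~ N(0, 2 / n_in)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

def sigma(z):
    # Placeholder smooth activation; the paper later selects Mish (16)
    return np.tanh(z)

def sigma_prime(z):
    return 1.0 - np.tanh(z) ** 2

def train_step(W, b, x, y, alpha0=1e-4, beta=0.01):
    """One forward/backward pass with the adaptive training rate of Eq. (10)."""
    # Forward pass, Eqs. (3)-(4)
    a, zs = [x], []
    for W_l, b_l in zip(W, b):
        z = W_l @ a[-1] + b_l
        zs.append(z)
        a.append(sigma(z))
    y_hat = a[-1]

    # Output-layer error signal (gradient of the loss (5) with respect to z)
    delta = (y_hat - y) * sigma_prime(zs[-1])
    grads_W, grads_b = [], []
    for l in reversed(range(len(W))):
        grads_W.insert(0, np.outer(delta, a[l]))
        grads_b.insert(0, delta)
        if l > 0:
            # Backpropagated error, Eq. (7)
            delta = (W[l].T @ delta) * sigma_prime(zs[l - 1])

    # Adaptive training rate, Eq. (10), here shared by all layers
    grad_norm_sq = sum(float(np.sum(g ** 2)) for g in grads_W + grads_b)
    alpha = alpha0 / (1.0 + beta * grad_norm_sq)

    # Weight update, Eqs. (8)-(9)
    W = [W_l - alpha * gW for W_l, gW in zip(W, grads_W)]
    b = [b_l - alpha * gb for b_l, gb in zip(b, grads_b)]
    loss = 0.5 * float(np.sum((y - y_hat) ** 2))   # per-example loss as in Eq. (5)
    return W, b, loss
```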
To control overfitting of the neural network, adaptive regularization is introduced into the proposed training algorithm. Overfitting occurs when a model fits the training data too closely and begins to lose its ability to generalize to new, previously unseen data. Adaptive regularization makes it possible to dynamically adjust the level of regularization during training depending on the current state of the network, which can improve its generalization ability and prevent overfitting [20, 21]. For a training algorithm that already includes an adaptive training rate and other gradient control techniques, L2-regularization may be preferable to Dropout, as it effectively controls overfitting by penalizing large weights while keeping all neurons active during training. L2-regularization for a loss function L(\theta) with weights W and regularization coefficient \lambda is defined as:

L_{L2} = L + \frac{\lambda}{2 \cdot N} \cdot \sum_{l=1}^{L} \left\| W^{(l)} \right\|^2,   (11)

where L is the number of layers in the neural network, \lambda is the regularization coefficient, and N is the number of training examples. The regularization coefficient is determined according to the expression:

\lambda = const \cdot Training\ rate,   (12)

where "const" is a coefficient that is set manually, usually chosen based on experience or by brute force, and determines the importance of regularization compared to training (the training rate).

The choice of the optimal value for the regularization coefficient depends on the specific task and data, as well as on the optimization method used. It should be chosen to provide adequate control of overfitting without restricting model training too much. Typically, one starts with small values and gradually increases them while observing changes in model performance on the validation dataset. The value can range from 10^{-6} to 10^{-2}, depending on the size of the data set and the complexity of the model. Thus, the initial value for the constant const can be chosen, for example, equal to 10^{-4}, and then adjusted during the training process depending on the effectiveness of regularization in preventing overfitting.

To improve the resistance of the training algorithm to getting stuck in local minima and saddle points of the loss function, it is advisable to use a loss function that contributes to smoother and more predictable optimization. One option is to use a smooth loss function such as cross-entropy [22, 23] for classification tasks and mean squared error for regression tasks [24, 25]. In addition, one can consider using a loss function that takes into account the distribution of the data and penalizes large deviations of the predicted values from the actual values, for example, the Huber loss function [26] or the K-quantile loss function [27]. A smooth loss function allows for smoother gradient changes and helps avoid sharp jumps, which can lead to better convergence to a global minimum and prevent getting stuck at local minima and saddle points. Choosing a smooth loss function allows the training algorithm to adapt to different types of problems and data, allowing the neural network to train more efficiently while minimizing the risk of getting stuck in local minima or saddle points. Loss functions such as Huber or K-quantile take into account the data distribution and impose a more balanced error penalty without allowing large variations in the value of the loss function, resulting in more stable optimization. However, a key disadvantage of the Huber and K-quantile functions compared to a smooth loss function is their less smooth nature, which can lead to more complex optimization and slower neural network training.
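Before the specific smooth loss used in this work is introduced, the following hedged sketch shows how the L2 penalty of (11)–(12) could be attached to any base loss during training; the helper names and the numeric values of const and the training rate are placeholders, not values prescribed by the paper.

```python
import numpy as np

def l2_regularized_loss(base_loss, weights, lam, n_examples):
    """Eq. (11): add an L2 penalty over all layer weight matrices to a base loss."""
    penalty = lam / (2.0 * n_examples) * sum(float(np.sum(W_l ** 2)) for W_l in weights)
    return base_loss + penalty

def l2_gradient_term(W_l, lam, n_examples):
    # Contribution of the penalty to dL/dW for one layer, added to the backpropagated gradient
    return lam / n_examples * W_l

# Eq. (12): the regularization coefficient is tied to the current training rate
const = 1e-4                 # starting value suggested in the text, tuned on validation data
training_rate = 1e-4         # example value
lam = const * training_rate
```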
One smooth loss function that is used here is a smooth version of the mean squared error, known as the Smooth Mean Squared Error (SMSE) [28], which uses a smooth function instead of the squared difference between the predicted and actual output. The SMSE analytical expression is:

L(y, \hat{y}) = \frac{1}{2 \cdot N} \cdot \sum_{i=1}^{N} smooth(y_i - \hat{y}_i),   (13)

where smooth(y_i - \hat{y}_i) is a smooth function that replaces the squared difference in the error term. Applying (13) improves the resistance of the training algorithm to getting stuck in local minima and saddle points of the loss function, since the smooth function smooth(y_i - \hat{y}_i) ensures a smooth change in the gradient even in the vicinity of points where the loss function changes sharply. This avoids sudden jumps and allows gradient descent to find paths to the global minimum of the loss function more efficiently, improving the overall convergence of the training algorithm and preventing it from getting stuck at local minima or saddle points. Thus, the squared norm of the gradient is defined as:

\|\nabla L(\theta)\|^2 = \sum_{i=1}^{N} \left( \frac{\partial L(y, \hat{y})}{\partial \theta_i} \right)^2,   (14)

where \frac{\partial L(y, \hat{y})}{\partial \theta_i} is the partial derivative of the loss function L with respect to the i-th parameter \theta_i. The adaptation coefficient for the proposed training algorithm is defined as:

\beta = \frac{\beta_0}{1 + \gamma \cdot \|\nabla L(\theta)\|^2},   (15)

where \beta_0 is the initial value of the adaptation coefficient and \gamma is the adaptation coefficient for the adaptation coefficient. The initial value \beta_0 and the coefficient \gamma are usually set at the initialization stage of the training algorithm. They are hyperparameters that are selected experimentally or using optimization techniques such as cross-validation. A small positive number, for example 0.1 or 0.01, is usually selected as the initial value of the adaptation coefficient \beta_0; this initial value determines how quickly training rate adaptation begins (the lower the value, the faster adaptation begins). The adaptation coefficient for the adaptation coefficient \gamma is also chosen experimentally and depends on the specific task and network architecture. Typically, it is selected in the range from 0.9 to 0.999. This coefficient controls the adaptation speed of the adaptation coefficient itself: the closer it is to 1, the slower the adaptation occurs.
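A hedged sketch of how (13)–(15) could be computed is given below; the Huber-style choice of smooth(·) is an assumption made here for illustration, since the paper does not fix the exact form of the smoothing function, and beta0 and gamma are placeholder values in the ranges mentioned above.

```python
import numpy as np

def smooth_fn(r, delta=1.0):
    # Assumed Huber-style smoothing: quadratic near zero, linear for large residuals
    return np.where(np.abs(r) <= delta, 0.5 * r ** 2, delta * (np.abs(r) - 0.5 * delta))

def smse_loss(y, y_hat):
    """Smooth Mean Squared Error, Eq. (13)."""
    return float(np.sum(smooth_fn(np.asarray(y) - np.asarray(y_hat)))) / (2.0 * len(y))

def adapt_beta(grad, beta0=0.01, gamma=0.99):
    """Eq. (15): shrink the adaptation coefficient as the squared gradient norm (14) grows."""
    grad_norm_sq = float(np.sum(np.asarray(grad) ** 2))
    return beta0 / (1.0 + gamma * grad_norm_sq)
```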
In the proposed training algorithm, it is important to select an adaptive activation function for the l-th layer that ensures stable and efficient transfer of gradients during backpropagation. Given this goal, it is advisable to choose an activation function that has a smooth gradient and reduces the likelihood of gradients decaying or exploding in deep networks. Activation functions such as Mish, Swish or ELiSH [29, 30] may be preferable, as they not only provide a smooth gradient but also show high efficiency in optimizing and generalizing neural network models. This choice of activation function is important for ensuring the stability and speed of convergence of the training algorithm, which in turn helps to achieve better results in practice. Of these activation functions (Mish, Swish and ELiSH), it is advisable to select the Mish function for the proposed training algorithm. The Mish function is smooth and continuously differentiable, has a good ability to adapt to different data, and reduces the likelihood of gradients decaying during backpropagation. Due to its shape and unique properties, Mish demonstrates high efficiency in both the optimization and generalization of neural network models. Its use in this algorithm promotes more stable and efficient training, which can ultimately lead to better results in practice. The adaptive activation function Mish is described by the expression:

Mish(x) = x \cdot \tanh(softplus(x)),   (16)

where x is the input signal, \tanh is the hyperbolic tangent, and softplus is the softplus activation function, defined as softplus(x) = \ln(1 + e^x). Thus, the adaptive Mish function is a combination of the linear function x and the hyperbolic tangent, which provides smoothness and continuous differentiability while maintaining useful activation properties.

Adding adaptive variation of the training rate over time helps improve the stability and speed of model training, which in turn can lead to higher quality and more efficient training. To add an adaptive change in the training rate over time to this algorithm, methods such as Learning Rate Schedulers or Learning Rate Decay can be used (Table 3) [31, 32]. Learning Rate Schedulers allow the training rate to be changed dynamically during training according to a specific schedule. For example, one can start with a higher training rate and gradually decrease it as training progresses or after a certain number of epochs. This approach allows the training rate to be better adjusted in accordance with the training progress and the dynamics of the gradient changes. Learning Rate Decay involves reducing the training rate after each epoch or after a certain number of training steps. This can be implemented by multiplying the current training rate by a factor that decreases over time or with each epoch. For example, after each epoch the training rate can be reduced by a fixed percentage or multiplied by a coefficient that depends on the quality indicator of the model on the validation data set.

Table 3
Description of adaptive change in learning rate over time (author's research based on [31, 32])

Learning Rate Schedulers:
Step Decay: \alpha_{new} = \alpha_{old} \cdot factor^{\lfloor epoch / step\,size \rfloor}, where \alpha_{new} is the new value of the parameter \alpha, \alpha_{old} is its current value, "factor" is the constant multiplier by which the parameter is adjusted, "epoch" refers to the current iteration or epoch in the process, and "step size" is the number of epochs after which the parameter is updated.
Exponential Decay: \alpha_{new} = \alpha_{old} \cdot e^{-decay\,rate \cdot epoch}, where "decay rate" is the decay coefficient that determines the rate at which the training rate decreases with each epoch.

Learning Rate Decay:
Cosine Annealing Decay: \alpha_{new} = \alpha_0 \cdot \frac{1 + \cos\left(\pi \cdot \frac{epoch}{\max(epoch)}\right)}{2}, where \max(epoch) is the total number of training epochs.
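The following sketch illustrates the Mish activation (16) and the decay schedules from Table 3; the numeric defaults (factor, step size, decay rate, total number of epochs) are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

def mish(x):
    """Mish activation, Eq. (16): x * tanh(softplus(x))."""
    softplus = np.log1p(np.exp(x))
    return x * np.tanh(softplus)

def step_decay(alpha_old, epoch, factor=0.5, step_size=20):
    # Step Decay from Table 3, applied to the current rate
    return alpha_old * factor ** (epoch // step_size)

def exponential_decay(alpha_old, epoch, decay_rate=0.05):
    # Exponential Decay from Table 3
    return alpha_old * np.exp(-decay_rate * epoch)

def cosine_annealing(alpha_0, epoch, max_epoch=100):
    # Cosine Annealing Decay from Table 3
    return alpha_0 * (1.0 + np.cos(np.pi * epoch / max_epoch)) / 2.0
```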
The use of adaptive modification of the neural network architecture in the proposed training algorithm can help improve the efficiency of the model by optimizing its structure during the training process. This allows the model to adapt more quickly and accurately to changing task conditions and requirements, which can ultimately lead to higher performance and generalization ability. To adaptively change the architecture of a neural network, automatic architecture differentiation (AutoML) is proposed, which allows the structure of the neural network to be optimized during the training process using optimization algorithms such as gradient descent. A neural network can automatically change its architecture by adding or removing layers, adjusting their parameters, and so on, to improve performance based on training data [33, 34]. To optimize the neural network architecture, an optimization algorithm such as gradient descent is used, according to which the task of optimizing the neural network architecture is represented as:

\theta^* = \arg\min_{\theta} L(\theta),   (17)

where \theta^* are the optimal parameters of the model. To calculate gradients with respect to the model parameters, the backpropagation algorithm is used, which computes the gradients of the loss function with respect to the network parameters, \nabla_{\theta} L(\theta). In the case of AutoML, gradients can also be calculated with respect to model hyperparameters such as the number of layers, the number of neurons, and so on. This allows the network architecture to be optimized during the training process. Hyperparameter gradients can be computed using hyperparameter differentiation methods or approximate methods such as REINFORCE or truncated backpropagation through time (TBPTT). After computing the gradients with respect to the model's parameters and hyperparameters, an optimization algorithm such as stochastic gradient descent (SGD) can be used to update the parameters and hyperparameters according to the resulting gradients. These steps form the basis of the automatic architecture differentiation algorithm (AutoML), which allows a neural network to change its structure during training to optimize its performance and generalization ability.

The use of adaptive mini-batch resizing makes it possible to manage the training process more flexibly and improve its efficiency. For example, if a model faces the problem of rapidly changing gradients or computational inefficiency, increasing the mini-batch size can help smooth out the gradients and speed up training. Conversely, reducing the mini-batch size can be useful for improving the generalization ability of the model or improving convergence in the case of overfitting [35]. Mathematically, the adaptive change in the mini-batch size is implemented according to the expression:

N_{new} = \lfloor N_{old} \cdot \eta \rfloor,   (18)

where N_{old} is the current mini-batch size, N_{new} is the new mini-batch size, \eta is the adaptation coefficient, and \lfloor \cdot \rfloor is the rounding-down function. The adaptation coefficient \eta is selected based on certain criteria or conditions. For example, \eta can be chosen such that the new mini-batch size increases or decreases depending on the rate of model convergence or the dynamics of the gradients. Once the new mini-batch size is calculated, it is applied at the next iteration of model training, and a new mini-batch is formed from the training examples taking into account the new size.
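A minimal sketch of the mini-batch adaptation rule (18) is shown below; the bounds n_min and n_max and the example value of eta are assumptions introduced here for illustration (the bounded-size requirement itself reappears as condition 9 of the theorem that follows).

```python
import math

def resize_minibatch(n_old, eta, n_min=8, n_max=256):
    """Eq. (18): adaptive mini-batch resizing, clipped to assumed bounds N_min and N_max."""
    n_new = math.floor(n_old * eta)
    return max(n_min, min(n_max, n_new))

# Example: grow the batch when gradients fluctuate strongly, shrink it when the model
# starts to overfit; the policy for choosing eta is left open in the text.
batch = 64
batch = resize_minibatch(batch, eta=1.25)   # -> 80
```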
The proposed algorithm for training feedforward neural networks allows the following theorem to be formulated: the training algorithm for a feedforward neural network with adaptive initialization of weights, an adaptive training rate, adaptive regularization, a smooth loss function, an adaptive activation function, an adaptive change in the training rate over time, an adaptive change in the neural network architecture, and an adaptively changing mini-batch size converges to an optimal solution of the training task with probability 1 if the following conditions are met:

1. Limited training set: the training data set X consists of N independent and identically distributed examples, where N → ∞.
2. Boundedness of the parameter space: the parameter space \Theta of the model is limited by the compact set K \subset \mathbb{R}^d, where d is the dimension of the parameter space.
3. Smoothness of the loss function: the loss function L(\theta) is twice continuously differentiable on K.
4. Convexity of the loss function: the loss function L(\theta) is convex on K.
5. Strong convexity of the loss function: the loss function L(\theta) is strongly convex on K with a strong convexity constant m > 0.
6. Training rate adaptability: the training rate \alpha(t) adapts over time in such a way that it satisfies the following conditions: \alpha(t) > 0 \; \forall t > 0, \sum_{t=1}^{\infty} \alpha(t) = \infty, \sum_{t=1}^{\infty} (\alpha(t))^2 < \infty.
7. Adaptability of regularization: the regularization coefficient \lambda adapts over time in such a way that it satisfies the following condition: 0 < \lambda(t) < \lambda_{max} \; \forall t > 0.
8. Adaptability of the activation function: the activation function \sigma(x) is continuously differentiable and monotonically increasing.
9. Adaptability of the mini-batch size: the mini-batch size N(t) adapts over time in such a way that it satisfies the following condition: N_{min} < N(t) < N_{max} \; \forall t > 0.

Proof of the theorem. To prove this theorem, the stochastic gradient descent (SGD) method is used in combination with parameters that adaptively change over time according to the specified conditions. Let the loss function L(\theta) be given, where \theta are the parameters of the neural network model. The aim is to minimize the loss function L(\theta). For this, SGD is used, which updates the parameters as \theta_{t+1} = \theta_t - \alpha(t) \cdot \nabla L(\theta_t), where \alpha(t) is the training rate at step t and \nabla L(\theta_t) is the gradient of the loss function with respect to the parameters \theta at step t. This approach is generalized by taking into account the adaptive parameters: adaptive initialization of the weights means that the neural network weights are initialized randomly, but taking into account the size of the input layer and the number of neurons in the next layer; for the adaptive training rate \alpha(t), a sequence \alpha(t) is used that satisfies the adaptability conditions; for adaptive regularization \lambda(t), a sequence \lambda(t) is used that satisfies the adaptivity conditions; for the adaptive activation function, a continuously differentiable and monotonically increasing activation function is used; for the adaptive change in the size of the mini-batch N(t), a sequence N(t) is used that satisfies the adaptivity conditions.

When N → ∞, the training set covers the entire data space, which allows the algorithm to train on a variety of examples; this corresponds to the boundedness condition on the training set. The compact parameter space ensures that changes in the model parameters are limited, which is important for the convergence of the algorithm. A twice continuously differentiable loss function ensures a smooth loss surface, which simplifies optimization, while a convex loss function ensures that the global minimum is unique and achievable, and strong convexity ensures that the algorithm converges quickly to the global minimum. The convergence of the algorithm to the optimal solution is ensured by the convergence of gradient descent with adaptive parameters. Provided that \alpha(t) > 0 for all t > 0, \sum_{t=1}^{\infty} \alpha(t) = \infty, and \sum_{t=1}^{\infty} (\alpha(t))^2 < \infty, gradient descent converges to a local minimum of the loss function L(\theta) with probability 1 under the conditions of smoothness and convexity of L(\theta). By adaptively changing the training rate \alpha(t) and the regularization coefficient \lambda(t) according to the conditions of the algorithm, these parameters can adapt to the characteristics of the loss function and ensure stable convergence of the algorithm.
Thus, by applying the stochastic gradient descent method to the loss function L(\theta) with adaptive training and regularization parameters, and taking into account the constraints on the data and model parameters, the algorithm converges to the optimal solution with probability 1.

4. Experiment

The proposed algorithm for training a feedforward neural network with many adaptive components finds wide practical application in various fields of machine learning and artificial intelligence. For example, in image processing it can be used to train a neural network to recognize objects in images with high accuracy, thanks to a smooth loss function and an adaptive activation function, allowing it to efficiently process different types of data and situations. Adaptive initialization of the weights and the training rate ensures fast model convergence, and adaptive regularization helps avoid overfitting. In addition, adaptive changes in the architecture and in the size of the mini-batch allow the training process to be optimized according to the requirements of a specific task and the available computing resources. This approach can be successfully applied in the fields of computer vision, natural language processing, medical data analysis, and others, where precise adaptation of the model to a variety of conditions and data is required [36–40].

In [41], the use of forward propagation neural networks in the task of debugging the parameters of helicopter turboshaft engines (TE) is shown, which is based on the use of a universal mathematical model for debugging helicopter TE parameters and the operating algorithm of the control device (Fig. 1); this leads to the elimination of the inconsistencies calculated for each engine control element.

Figure 1: Helicopter turboshaft engines fuel dispenser debugging diagram (author's research, published in [41]).

Using a universal approach based on the use of Lyapunov functions, universal tuning equations were obtained in [41]:

\dot{A}^M = \varepsilon_1 \cdot |K \cdot \Psi(\alpha)|^T \quad \text{or} \quad A^M = \int \varepsilon_1 \cdot |K \cdot \Psi(\alpha)|^T \, dt,   (19)

\dot{B}^M = \varepsilon_1 \cdot |L \cdot \Phi(I)|^T \quad \text{or} \quad B^M = \int \varepsilon_1 \cdot |L \cdot \Phi(I)|^T \, dt,   (20)

\dot{C}^M = \varepsilon_2 \cdot |M \cdot Q(G_T)|^T \quad \text{or} \quad C^M = \int \varepsilon_2 \cdot |M \cdot Q(G_T)|^T \, dt,   (21)

\dot{D}^M = \varepsilon_2 \cdot |N \cdot U(\alpha)|^T \quad \text{or} \quad D^M = \int \varepsilon_2 \cdot |N \cdot U(\alpha)|^T \, dt,   (22)

where A^M, B^M, C^M, D^M are the tunable coefficients, which after the end of the identification process are equal to the coefficients of the equations describing the fuel dispenser; \Psi(\alpha), \Phi(I), Q(G_T), U(\alpha) are the nonlinear functions; \varepsilon_1, \varepsilon_2 are the residual signals; K, L, M, N are positive definite diagonal matrices of given constant coefficients [41]. The identified values of the coefficients A^M, B^M, C^M, D^M, which describe a real fuel dispenser, are compared with the values A^E, B^E, C^E, D^E of the reference model of the dispenser. The difference signals between the identified and reference coefficients, \delta A = A^M - A^E, \delta B = B^M - B^E, \delta C = C^M - C^E, \delta D = D^M - D^E, are used to debug the fuel dispenser. The amount of movement of the actuators is determined by the sensitivity of the fuel dispenser to the movement of the engine control element.
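As a loose illustration of how the tuning integrals (19)–(22) could be evaluated numerically, the sketch below performs a simple Euler integration for one coefficient vector; the signal shapes, the matrix K, the sampling step and the comparison with a reference value are all assumptions made here, not details taken from [41].

```python
import numpy as np

def identify_coefficient(eps, psi, K, dt=0.01):
    """Euler integration of a tuning equation of the form (19):
    A_M = integral of eps1 * |K * Psi(alpha)| dt, with eps sampled as a scalar residual
    and psi as a sampled vector nonlinearity (assumed shapes)."""
    A_M = np.zeros(K.shape[0])
    for e_t, psi_t in zip(eps, psi):
        A_M += e_t * np.abs(K @ psi_t) * dt
    return A_M

# Hypothetical usage: compare the identified coefficients with reference-model values
K = np.diag([1.0, 0.5])
eps = np.random.default_rng(0).normal(0.0, 0.1, size=200)
psi = np.random.default_rng(1).normal(0.0, 1.0, size=(200, 2))
A_M = identify_coefficient(eps, psi, K)
A_E = np.array([0.8, 0.4])          # placeholder reference coefficients
delta_A = A_M - A_E                 # difference signal used for debugging
```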
To demonstrate the use of a feedforward neural network with adaptive elements for solving the task of helicopter TE parameters debugging in flight modes, a two-dimensional classification scenario was researched in [41], in which one of two random narrow-band processes is observed using a quadrature demodulator. In this case, the probability density function of each of these processes is described by the following expression:

p(I, Q) = \frac{1}{\sqrt{2 \cdot \pi}} \cdot \exp\left( -\left( \frac{(I - m_I)^2}{2 \cdot \sigma_I^2} + \frac{(Q - m_Q)^2}{2 \cdot \sigma_Q^2} \right) \right),   (23)

where \sigma_I, \sigma_Q are the dispersions and m_I, m_Q are the mathematical expectations of the components I and Q; I corresponds to the values of the gas-generator rotor r.p.m. n_TC, and Q corresponds to the values of the specific fuel consumption C_e. As a solution to this task, in [41] the data distribution area of two classes (I and Q) and boundary lines at levels 0.1, 0.5 and 0.9 were obtained, which show the permissible and unacceptable values of the gas-generator rotor r.p.m. n_TC according to the specific fuel consumption C_e.

In this work, by conducting a corresponding computational experiment, it is proposed to solve the same problem with a feed-forward neural network while applying the proposed training algorithm. For the computational experiment, a personal computer was used with an AMD Ryzen 5 5600 processor (32 MB third-level cache, Zen 3 architecture, 6 cores, 12 threads, 3.5 GHz) and 32 GB of DDR4 RAM. To solve the task of helicopter TE parameters debugging (using the TV3-117 turboshaft engine as an example), the values of the gas-generator rotor r.p.m. n_TC at takeoff mode, reduced to absolute values [41, 42] and given in Table 4, are used as the training sample; the parameters of the average engine fleet are \bar{n}_{TC} = 0.994 and \bar{C}_e = 0.977. In the input signal approximation task, according to [41], the dependence of the specific fuel consumption C_e on the gas-generator rotor r.p.m. n_TC for the TV3-117 turboshaft engine (which represents an element of the engine throttle characteristic) is used. Fig. 2 shows the input data, indicated by points, which are approximated by broken lines for clarity.

Table 4
Training set fragment (author's research, published in [41])

Number   Gas generator rotor r.p.m. n_TC   Specific fuel consumption C_e
1        0.998                             0.972
2        0.998                             0.978
3        0.992                             0.964
4        0.992                             0.984
5        0.991                             0.998
6        0.995                             0.979
7        0.991                             0.970
8        0.996                             0.990
9        0.998                             0.965
10       0.989                             0.990
…        …                                 …
256      0.993                             0.964

Figure 2: Diagram of dependence C_e = f(n_TC) and the result of approximation (author's research, published in [41]).

At the stage of training sample pre-processing, its homogeneity is checked, it is divided into control and test samples, and the representativeness of these samples is assessed using cluster analysis. To assess the homogeneity of the training set, the Fisher-Pearson criterion [43] is calculated based on the observed frequencies and compared with the critical values of \chi^2 with the number of degrees of freedom r - k - 1 = 13 and the significance level \alpha = 0.01. This means that statistical significance is accepted only if the probability of obtaining these or more extreme results under the null hypothesis is less than 1 %. The resulting value \chi^2 = 18.388 does not exceed the critical value of 30.577, which confirms the consistency of the samples and the hypothesis of a normal distribution.
To confirm homogeneity, the Fisher-Snedecor criterion [44] is used, which is the ratio of the larger to the smaller dispersion with degrees of freedom r - k - 1 = 13 and a significance level of \alpha = 0.01. The resulting value F = 3.393 does not exceed the critical value of 3.61, which confirms the consistency of the samples and the hypothesis of a normal distribution. The representativeness of the training and test samples was assessed using cluster analysis, the aim of which is to divide the set of input data X (Table 4) into k disjoint clusters, where k is a predetermined number of clusters. Each cluster is a group of objects that are considered more similar to each other than to objects from other clusters. The work uses the k-means cluster analysis method, which is based on minimizing the sum of squared distances between cluster objects and their centroids. Each object x_i of the set X is assigned to the nearest centroid according to C_i = \arg\min_j \|x_i - \mu_j\|^2, where \mu_j are the initial centroids and \|x_i - \mu_j\|^2 is the squared Euclidean distance between the object x_i and the centroid \mu_j. After this, the centroids are recalculated as the average value of the objects within each cluster according to \mu_j = \frac{1}{|C_j|} \cdot \sum_{x_i \in C_j} x_i, where |C_j| is the number of objects in the j-th cluster. The calculations of C_i and \mu_j are repeated until changes in the cluster distribution become minimal. The algorithm terminates when none of the centroids changes significantly or when the specified number of iterations is completed [45].

The results of the cluster analysis of the training sample data (Table 4) identified 8 classes (classes I…VIII). After random selection, training and test samples were compiled in a 2:1 ratio (67 and 33 %, respectively). The cluster analysis of both samples revealed the presence of eight groups in each of them, which indicates the similarity of the composition of both the training and test samples. The distances between groups are almost the same in both samples, which confirms the similarity of their composition (Fig. 3). Thus, the optimal sample size was obtained: training – 256 elements (100 %), control – 172 elements (67 % of the training sample), test – 84 elements (33 % of the training sample).

Figure 3 (a, b): Diagram of dependence C_e = f(n_TC) and the result of approximation (author's research, published in [41]).

As part of the computational experiment, a forward propagation neural network was used (Fig. 4), whose inputs are the gas-generator rotor r.p.m. n_TC and the specific fuel consumption C_e, and whose outputs are their optimal values n_TC opt and C_e opt. During its training with the proposed algorithm, the dependences of the accuracy (Fig. 5) and losses (Fig. 6) of the neural network on the number of iterations (100 iterations were used in the work) were obtained, in which the "blue curve" corresponds to training on the training sample and the "orange curve" to validation on the control sample. Fig. 5 shows that the limiting value of the accuracy reaches 1, and Fig. 6 shows that the maximum loss value does not exceed 0.025. This indicates a high degree of efficiency in training the model on the provided data and the ability of the model to generalize to new data with high accuracy, which makes it potentially suitable for solving the task of helicopter TE parameter debugging.

Figure 4: The proposed feedforward neural network for solving the task of helicopter TE parameters debugging (author's research, published in [41]).
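For readers who want to reproduce the setup, the hedged sketch below assembles a small two-input, two-output network of the kind shown in Fig. 4 (with a 10-neuron hidden layer, as substantiated later in the experiments) and trains it for 100 iterations using the hypothetical he_init and train_step helpers from the earlier Methods sketch; the data here are synthetic placeholders, not the actual Table 4 values.

```python
import numpy as np

# Assumes he_init and train_step from the earlier sketch are in scope.
rng = np.random.default_rng(42)

# 2-10-2 architecture: inputs (n_TC, C_e), outputs (n_TC_opt, C_e_opt)
layer_sizes = [2, 10, 2]
W = [he_init(n_in, n_out, rng) for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
b = [np.zeros(n_out) for n_out in layer_sizes[1:]]

# Placeholder data standing in for the 256 Table 4 examples
X = rng.uniform(0.96, 1.0, size=(256, 2))
Y = X.copy()                                   # illustrative targets only

for epoch in range(100):                       # 100 iterations, as in the experiment
    epoch_loss = 0.0
    for x, y in zip(X, Y):
        W, b, loss = train_step(W, b, x, y)    # sketch from the Methods section
        epoch_loss += loss
    # epoch_loss / len(X) can be logged to obtain curves like those in Fig. 5 and Fig. 6
```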
Figure 5: Diagram of changes in the neural network accuracy function over 100 iterations (author's research).

Figure 6: Diagram of changes in the neural network loss function over 100 iterations (author's research).

In this case, the loss function was determined according to (13), and the accuracy function according to the expression:

Accuracy = \frac{1}{N} \cdot \sum_{i=1}^{N} I(y_i, \hat{y}_i),   (24)

where N is the total number of examples, y_i is the true value of the target variable for the i-th example, \hat{y}_i is the predicted value of the target variable for the i-th example, and I(y_i, \hat{y}_i) is an indicator function that returns 1 if the predicted value matches the true one (\hat{y}_i = y_i) and 0 otherwise.

5. Results

The results of the computational experiment are both partial researches of the proposed neural network training algorithm and the boundary lines at levels 0.1, 0.5 and 0.9, which show the permissible and unacceptable values of the gas-generator rotor r.p.m. n_TC according to the specific fuel consumption C_e; these must be compared with the corresponding results obtained in [41]. The adequacy of the resulting diagram of the area of distribution of the data of the two classes (I and Q), reconstructed by the neural network, directly depends on the training process. According to [46], a number of parameters are identified that affect the quality of training: the training rate coefficient (assumed 10^{-4}); the number of neurons in the hidden layer (assumed 10); the number of training epochs completed (assuming 100 training epochs). As a criterion for assessing the quality of training, the final total standard deviation per epoch was used, which is determined according to the expression:

E_{epoch} = \frac{1}{N} \cdot \sum_{i=1}^{N} \left( \frac{1}{2} \cdot \sum_{k=1}^{n} (y_k - \hat{y}_k)^2 \right).   (25)

The results of the researches are given in Tables 5–7 and in Fig. 7–9, where: Fig. 7 is a diagram determining the influence of the training rate on the final standard deviation; Fig. 8 is a diagram determining the influence of the number of hidden neurons on the final standard deviation; Fig. 9 is a diagram determining the influence of the number of epochs passed on the final standard deviation.

Table 5
Influence of the training rate coefficient on the resulting error (author's research)

Number   Training rate coefficient   Final standard deviation
1        0.0001                      3.642
2        0.0005                      4.018
3        0.001                       6.024
4        0.002                       6.547
5        0.003                       7.112
6        0.004                       7.937
7        0.005                       8.645
8        0.006                       9.202
9        0.008                       10.383
10       0.01                        12.002

Table 6
Influence of the number of neurons in the hidden layer on the resulting error (author's research)

Number   Number of neurons in the hidden layer   Final standard deviation
1        2                                       8.307
2        5                                       8.865
3        10                                      4.317
4        15                                      6.997
5        20                                      9.005
6        25                                      10.513
7        30                                      11.817
8        35                                      9.545
9        40                                      8.997
10       45                                      10.816

Table 7
Influence of the number of epochs passed on the resulting error (author's research)

Number   Epochs of training passed   Final standard deviation
1        0                           25.346
2        20                          22.717
3        40                          19.657
4        60                          14.008
5        80                          7.856
6        100                         3.358
7        150                         3.358
8        200                         3.358
9        300                         3.358
10       500                         3.358

Figure 7: Diagram determining the influence of the training rate on the final standard deviation (author's research).

Figure 8: Diagram determining the influence of the number of hidden neurons on the final standard deviation (author's research).

Figure 9: Diagram determining the influence of the number of epochs passed on the final standard deviation (author's research).
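The two criteria used above could be computed as in the short sketch below; the tolerance used to turn the continuous engine parameters into the indicator of (24) is an assumption made here, since the paper does not state how exact a match must be.

```python
import numpy as np

def accuracy(y_true, y_pred, tol=1e-3):
    """Eq. (24): share of examples whose prediction matches the target (within an assumed tolerance)."""
    hits = np.all(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= tol, axis=-1)
    return float(np.mean(hits))

def epoch_std_deviation(y_true, y_pred):
    """Eq. (25): final total standard deviation per epoch used as the training-quality criterion."""
    per_example = 0.5 * np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2, axis=-1)
    return float(np.mean(per_example))
```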
From the results obtained, it follows that the minimum final total standard deviations per epoch were obtained with an optimal training rate coefficient of 10^{-4} and 10 neurons in the hidden layer. It is worth noting that in [41] the optimal number of neurons in the hidden layer is 3. Increasing the number of neurons in the hidden layer from 3 to 10 leads to a noticeable improvement in the generalization ability of the model and a reduction in the risk of overfitting. Increasing the number of neurons to 10 allows the model to adapt more flexibly to complex relations in the data, which helps improve the accuracy of predictions on new, previously unseen data. This is because more neurons allow the model to learn more complex features and data structures, which is especially important in the case of high-dimensional and complex data. Thus, increasing the number of neurons in the hidden layer to 10 is a promising step towards improving the quality of the neural network. It is also worth noting that, starting from 100 training epochs, the final total standard deviation is minimal and constant at 3.358, which indicates that the model has achieved optimal accuracy on this data set and that further training does not lead to a significant improvement in the results. This may indicate that the model has learned to predict the target variable with high accuracy and that additional training epochs do not bring a significant increase in the quality of the predictions. Thus, the constant value of the minimum total standard deviation after 100 epochs indicates the convergence of the model and its readiness for use in solving practical tasks. Thus, the proposed forward propagation neural network for solving the task of helicopter TE parameters debugging (Fig. 4) is transformed into the form presented in Fig. 10.

Figure 10: Refined proposed feedforward neural network for solving the task of helicopter TE parameters debugging (author's research, published in [41]).

At the next stage of the computational experiment, the control curve C_e = f(\bar{n}_{TC}) is researched, which, according to [41], is presented in the form:

C_e(\bar{n}_{TC}) = 0.0016 \cdot n_{TC}^4 - 0.0195 \cdot n_{TC}^3 + 0.0864 \cdot n_{TC}^2 - 0.1774 \cdot n_{TC} + 0.4083,   (26)

where \bar{n}_{TC} = \frac{n_{TC}}{n_{TC\,max}} is the relative value of the gas-generator rotor r.p.m. n_TC. Fig. 11 shows a diagram of the dependence of the objective function C_e(\bar{n}_{TC}) → min on the gas-generator rotor r.p.m. n_TC value, where the "blue curve" shows the original dependence obtained in [41] and the "orange curve" shows the dependence obtained in this work using L2-regularization (11). In this case, the objective function takes the updated form:

C_e(\bar{n}_{TC})_{L2} = C_e(\bar{n}_{TC}) + \left( L + \frac{\lambda}{2 \cdot N} \cdot \sum_{l=1}^{L} \left\| W^{(l)} \right\|^2 \right),   (27)

or

C_e(\bar{n}_{TC})_{L2} = 0.0016 \cdot n_{TC}^4 - 0.0195 \cdot n_{TC}^3 + 0.0864 \cdot n_{TC}^2 - 0.1774 \cdot n_{TC} + 0.4083 + \left( L + \frac{\lambda}{2 \cdot N} \cdot \left( W^{(1)} + W^{(2)} + W^{(3)} + W^{(4)} + W^{(5)} \right) \right),   (28)

where W^{(1)}, W^{(2)}, W^{(3)}, W^{(4)}, W^{(5)} are the model weights corresponding to each of the five terms in the original function C_e(\bar{n}_{TC}) (26).

Figure 11: Diagram of the dependence of the objective function on the gas-generator rotor r.p.m. value (author's research).
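A small sketch of how the two curves in Fig. 11 could be generated from (26)–(28) is given below; the placeholder weight values, the value of lambda and the omission of the standalone layer-count term L from (27) are assumptions made here purely for illustration.

```python
import numpy as np

def C_e(n_tc):
    """Throttle-characteristic approximation, Eq. (26)."""
    return (0.0016 * n_tc ** 4 - 0.0195 * n_tc ** 3 + 0.0864 * n_tc ** 2
            - 0.1774 * n_tc + 0.4083)

def l2_shift(weights, lam=1e-4, n_examples=256):
    # Regularization term from Eqs. (27)-(28) with placeholder weights and lambda
    return lam / (2.0 * n_examples) * float(np.sum(np.square(weights)))

# Values for plotting the two curves of Fig. 11 (original vs. L2-adjusted)
n_grid = np.linspace(0.9, 1.0, 101)
curve_original = C_e(n_grid)
curve_l2 = curve_original + l2_shift(np.array([0.1, 0.2, 0.3, 0.2, 0.1]))
```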
As can be seen from Fig. 11, adding L2-regularization to the objective function made it possible to raise the adjustment curve C_e = f(\bar{n}_{TC}) by the regularization value, bringing it closer to 1, by adding a term to the original function that increases its values. This allows the model to take the complexity of the data into account more effectively and reduce the risk of overfitting, due to the penalty for large values of the weighting coefficients. A raised curve provides a more stable and robust optimization of the model, which can lead to improved generalization ability and predictive accuracy on new data. In this case, the objective function minimum of 0.40 is reached at an r.p.m. value of 0.992. Thus, the correction of the mean value of n_TC is n_{TC}^{correct} = 0.994 - 0.991 = 0.003, whereas the value n_{TC}^{correct} = 0.006 was obtained in [41]. Thus, the addition of L2-regularization made it possible to adjust the gas-generator rotor r.p.m. n_TC value more accurately (2 times more accurately than in [41]) and bring it closer to the average value for the engine fleet, \bar{n}_{TC} = 0.994. The results obtained made it possible to obtain a refined area of distribution of the data of the two classes (I and Q) with boundary values of n_TC, respectively, on the lines at levels 0.1, 0.5 and 0.9 (Fig. 12).

Figure 12: Data of the two classes (blue area – allowed values of n_TC and C_e; red area – invalid values of n_TC and C_e) and boundary lines at levels 0.1, 0.5 and 0.9 (author's research).

Fig. 12 makes it possible to determine the areas in which each of the classes is most likely to be found. The refined limit values of n_TC on the lines at levels 0.1, 0.5 and 0.9 make it possible to determine more accurately the optimal gas-generator rotor r.p.m. n_TC values required to achieve the desired levels of specific fuel consumption C_e. As can be seen from Fig. 12, the region of unacceptable values of n_TC and C_e (red region) includes values located at the boundaries of this region. This indicates that it is inadmissible to regulate the n_TC parameter in order to obtain the maximum permissible value of C_e. "Level 0.1" means the lower level of permissible C_e values, "Level 0.5" the optimal C_e values, and "Level 0.9" the maximum permissible C_e values. The inadmissibility of adjusting the n_TC parameter to obtain the maximum permissible value of C_e in helicopter flight mode is explained by the fact that, in this context, there is a certain connection between the gas-generator rotor r.p.m. n_TC and the specific fuel consumption C_e, which is determined by the optimal operating conditions of the engine. When the gas-generator rotor r.p.m. n_TC is adjusted to achieve the maximum permissible value of the specific fuel consumption C_e located on the border of the red area in the figure, the system may go beyond the permissible parameters of engine operation. This can lead to undesirable consequences such as engine overheating, loss of flight stability, or even a crash. To ensure the safety and normal operation of the helicopter in flight mode, it is important to maintain optimal engine operating parameters, including those related to the gas-generator rotor r.p.m., in order to avoid going beyond the permissible range of specific fuel consumption values. Thus, Fig. 12 provides important information for regulating the system operation parameters, as it allows the optimal n_TC values to be determined to achieve the desired specific fuel consumption indicators.

6. Discussion

The work carried out a comparative analysis of the solution to the task of helicopter TE parameters debugging based on a feed-forward neural network with adaptive elements, using both the training algorithm proposed in this work and the Delta-Bar-Delta algorithm used in [41]. The type I and type II errors were calculated when determining the gas-generator rotor r.p.m.
n_TC boundary values needed to achieve the required levels of specific fuel consumption C_e (Table 8). A type I error occurs when the null hypothesis H_0 is rejected although it is in fact true, and is defined as:

Type\ I\ error\ rate = P(reject\ H_0 \mid H_0\ true).   (29)

A type II error occurs when the null hypothesis H_0 is accepted although it is in fact false, and is defined as:

Type\ II\ error\ rate = P(accept\ H_0 \mid H_0\ false).   (30)

The null hypothesis H_0 in the problem under consideration is that the use of a feedforward neural network to determine the gas-generator rotor r.p.m. n_TC boundary values does not lead to statistically significant changes in achieving the required levels of specific fuel consumption C_e. The significance level adopted in this work is 0.01, which means that when conducting a statistical test at this level of significance, the probability of a type I error is 0.01. That is, if the test results reject the null hypothesis at this significance level, then the probability of making a type I error is 1 %, which is a low enough probability level to detect statistically significant differences between groups or conditions.

Table 8
The results of determining the type I and type II errors in the gas-generator rotor r.p.m. n_TC boundary values (author's research)

Feed-forward neural network with adaptive elements using the training algorithm proposed in the work: type I error 0.57, type II error 0.38.
Feed-forward neural network with adaptive elements using the Delta-Bar-Delta algorithm used in [41]: type I error 0.94, type II error 0.65.

Table 8 shows that the use of a feedforward neural network with adaptive elements, trained with the algorithm proposed in this work, made it possible to reduce the type I and type II errors by 1.65–1.71 times compared with the use of the Delta-Bar-Delta algorithm for its training [41], at a significance level of 0.01.

At the final stage of the comparative analysis, the efficiency and quality coefficients of the feedforward neural network with adaptive elements were calculated for both the proposed training algorithm and the Delta-Bar-Delta algorithm [41], according to the expressions [47–50] (Table 9):

K_{error} = \frac{T_{error}}{T_0} \cdot 100\,\%,   (31)

K_{quality} = \left( 1 - \frac{T_{error}}{T_0} \right) \cdot 100\,\%,   (32)

where K_{error} and K_{quality} represent the erroneous and quality coefficients [51–53] for determining the gas-generator rotor r.p.m. n_TC boundary values by a feedforward neural network with adaptive elements; T_{error} indicates the total time of the segments associated with misclassification [54], while T_0 denotes the duration of the test sample [55] (in this work, T_0 = 5 s is assumed) [56–58].

Table 9
Results of calculating the quality and efficiency coefficients for the gas-generator rotor r.p.m. n_TC boundary values (author's research)

Feed-forward neural network with adaptive elements using the training algorithm proposed in the work: K_error = 0.317, K_quality = 99.965.
Feed-forward neural network with adaptive elements using the Delta-Bar-Delta algorithm used in [41]: K_error = 0.598, K_quality = 99.201.

Table 9 shows that the use of a forward propagation neural network with adaptive elements, trained with the algorithm proposed in this work, made it possible to reduce the erroneous coefficient by 1.89 times and to slightly (1.01 times) increase the quality coefficient for determining the gas-generator rotor r.p.m. n_TC boundary values compared with the use of the Delta-Bar-Delta algorithm [41].
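For completeness, the coefficients (31)–(32) could be computed as in the following sketch; the misclassified-time value passed in the example call is hypothetical and is not taken from Table 9.

```python
def error_and_quality_coefficients(t_error, t_0=5.0):
    """Eqs. (31)-(32): coefficients based on the share of the test-sample duration
    (T_0 = 5 s in this work) spent on misclassification."""
    k_error = t_error / t_0 * 100.0
    k_quality = (1.0 - t_error / t_0) * 100.0
    return k_error, k_quality

# Illustrative call with a hypothetical misclassified time of 0.02 s
k_err, k_qual = error_and_quality_coefficients(t_error=0.02)   # -> (0.4, 99.6)
```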
7. Conclusions

1. For the first time, a training algorithm for forward propagation neural networks has been developed. It is based on the backpropagation algorithm and, through the use of adaptive elements such as an adaptive training rate, adaptive initialization of the neural network weights, adaptive regularization, an adaptive neuron activation function, adaptive changes to the neural network architecture, and adaptive changes to the size of the mini-batch, it made it possible to achieve almost 100 % training accuracy on both the training and validation data sets with a minimum number of iterations.
2. The optimal value of the training rate coefficient, the number of neurons in the hidden layer of the neural network, and the optimal number of iterations for training the neural network were experimentally substantiated by determining the smallest value of the final total standard deviation per epoch. By conducting a computational experiment on the task of helicopter turboshaft engine parameters debugging, with 2 input neurons, 2 output neurons and 256 elements in the training set, the optimal training rate coefficient value of 0.0001, the optimal number of neurons in the hidden layer of 10, and the optimal number of iterations of 100 were obtained, since they correspond to the minimum values of the final total standard deviation per epoch, which amounted to 3.642, 4.317 and 3.358, respectively.
3. It has been experimentally proven that the use of L2-regularization in the developed feed-forward neural network training algorithm with adaptive elements raises the adjustment curve (or a similar researched dependence) by the regularization value, bringing it closer to 1, by adding a term to the original function that increases its values. In the task of helicopter turboshaft engine parameters debugging, this made it possible to adjust the gas-generator rotor r.p.m. value 2 times more accurately compared with the use of the well-known Delta-Bar-Delta neural network training algorithm.
4. An updated area of data distribution of two classes (gas-generator rotor r.p.m. and specific fuel consumption) was obtained with gas-generator rotor r.p.m. boundary values on the lines at levels 0.1, 0.5 and 0.9, which reduced the type I and type II errors by 1.65–1.71 times compared with the use of the Delta-Bar-Delta neural network training algorithm.
5. It has been shown that the use of the developed training algorithm for forward propagation neural networks with adaptive elements reduces the erroneous coefficient by 1.89 times and slightly (1.01 times) increases the quality coefficient for determining the gas-generator rotor r.p.m. boundary values in the task of helicopter turboshaft engine parameters debugging, compared with the use of the Delta-Bar-Delta neural network training algorithm.

Acknowledgements

This research was supported by the Ministry of Internal Affairs of Ukraine "Theoretical and applied aspects of the development of the aviation sphere" under Project No. 0123U104884.

References

[1] M. Heidari, M. H. Moattar, H. Ghaffari, Forward propagation dropout in deep neural networks using Jensen–Shannon and random forest feature importance ranking, Neural Networks 165 (2023) 238–247. doi: 10.1016/j.neunet.2023.05.044.
[2] M. El-Sharkawy, M. Wael, M. Mashaly, E. Azab, Re-configurable parallel Feed-Forward Neural Network implementation using FPGA, Integration 97 (2024) 102176. doi: 10.1016/j.vlsi.2024.102176.
[3] W.-K. Hong, 4 - Forward and backpropagation for artificial neural networks, in: W.-K. Hong (Ed.), Artificial Intelligence-Based Design of Reinforced Concrete Structures, Woodhead Publishing, Sawston, England, 2023, pp. 67–116. doi: 10.1016/B978-0-443-15252-8.00006-6.
[4] X. Zhu, M. Li, X. Liu, Y. Zhang, A backpropagation neural network-based hybrid energy recognition and management system, Energy 297 (2024) 131264. doi: 10.1016/j.energy.2024.131264
[5] A. Sachenko, V. Kochan, V. Turchenko, V. Tymchyshyn, N. Vasylkiv, Intelligent nodes for distributed sensor network, in: Proceedings of the 16th IEEE Instrumentation and Measurement Technology Conference (IMTC/99), Venice, Italy, 1999, pp. 1479–1484. doi: 10.1109/IMTC.1999.776072
[6] A. Sachenko, V. Kochan, V. Turchenko, Intelligent distributed sensor network, in: IMTC/98 Conference Proceedings. IEEE Instrumentation and Measurement Technology Conference. Where Instrumentation is Going, St. Paul, MN, USA, 1998, pp. 60–66. doi: 10.1109/IMTC.1998.679663
[7] S. Babichev, M. Korobchynskyi, O. Lahodynskyi, O. Korchomnyi, V. Basanets, V. Borynskyi, Development of a technique for the reconstruction and validation of gene network models based on gene expression, Eastern-European Journal of Enterprise Technologies 1(4 (91)) (2018) 19–32. doi: 10.15587/1729-4061.2018.123634
[8] S. Babichev, V. Lytvynenko, J. Skvor, J. Fiser, Model of the objective clustering inductive technology of gene expression profiles based on SOTA and DBSCAN clustering algorithms, Advances in Intelligent Systems and Computing 689 (2018) 21–39. doi: 10.1007/978-3-319-70581-1_2
[9] O. Ivanov, L. Koretska, V. Lytvynenko, Intelligent modeling of unified communications systems using artificial neural networks, CEUR Workshop Proceedings 2623 (2020) 77–84.
[10] S. Vladov, R. Yakovliev, O. Hubachov, J. Rud, Neuro-Fuzzy System for Detection Fuel Consumption of Helicopters Turboshaft Engines, CEUR Workshop Proceedings 3628 (2024) 55–72.
[11] L. Wang, W. Ye, Y. Zhu, F. Yang, Y. Zhou, Optimal parameters selection of back propagation algorithm in the feedforward neural network, Engineering Analysis with Boundary Elements 151 (2023) 575–596. doi: 10.1016/j.enganabound.2023.03.033
[12] H. Calvo-Pardo, T. Mancini, J. Olmo, Granger causality detection in high-dimensional systems using feedforward neural networks, International Journal of Forecasting 37:2 (2021) 920–940. doi: 10.1016/j.ijforecast.2020.10.004
[13] K. S. Narendra, K. Parthasarathy, Identification and Control of Dynamical Systems Using Neural Networks, IEEE Transactions on Neural Networks 1:1 (1990) 4–27. doi: 10.1109/72.80202
[14] J. M. Maroli, Generating discrete dynamical system equations from input–output data using neural network identification models, Reliability Engineering & System Safety 235 (2023) 109198. doi: 10.1016/j.ress.2023.109198
[15] R. G. Ramirez-Chavarria, M. Schoukens, Nonlinear Finite Impulse Response Estimation using Regularized Neural Networks, IFAC-PapersOnLine 54:7 (2021) 174–179. doi: 10.1016/j.ifacol.2021.08.354
[16] E. Efimov, T. Shevgunov, Development of feedforward neural networks using adaptive elements, Journal of Radio Electronics, 8 (2012). URL: http://jre.cplire.ru/win/aug12/4/text.html
[17] G. Dudek, A constructive approach to data-driven randomized learning for feedforward neural networks, Applied Soft Computing 112 (2021) 107797. doi: 10.1016/j.asoc.2021.107797
[18] P. Dumka, P. S. Pawar, A. Sauda, G. Shukla, D. R. Mishra, Application of He's homotopy and perturbation method to solve heat transfer equations: A python approach, Advances in Engineering Software 170 (2022) 103160. doi: 10.1016/j.advengsoft.2022.103160
[19] J. Li, Y. Song, X. Song, D. Wipf, On the Initialization of Graph Neural Networks, in: Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA, 2023, pp. 19911–19931. doi: 10.48550/arXiv.2312.02622
[20] M. Li, S. Bi, G. Cai, An adaptive fractional-order regularization primal-dual image denoising algorithm based on non-convex function, Applied Mathematical Modelling 131 (2024) 67–83. doi: 10.1016/j.apm.2024.04.001
[21] C. Liu, R. Li, S. Chen, L. Zheng, D. Jiang, Adaptive dual graph regularization for clustered multi-task learning, Neurocomputing 574 (2024) 127259. doi: 10.1016/j.neucom.2024.127259
[22] G. Sun, B. Ji, L. Liang, M. Chen, CeCR: Cross-entropy contrastive replay for online class-incremental continual learning, Neural Networks 173 (2024) 106163. doi: 10.1016/j.neunet.2024.106163
[23] J. Chan, I. Papaioannou, D. Straub, Bayesian improved cross entropy method for network reliability assessment, Structural Safety 103 (2023) 102344. doi: 10.1016/j.strusafe.2023.102344
[24] C. Wang, J. Zhou, An adaptive index smoothing loss for face anti-spoofing, Pattern Recognition Letters 153 (2022) 168–175. doi: 10.1016/j.patrec.2021.12.006
[25] A. Bosman, A. Engelbrecht, M. Helbig, Visualising basins of attraction for the cross-entropy and the squared error neural network loss functions, Neurocomputing 400 (2020) 113–136. doi: 10.1016/j.neucom.2020.02.113
[26] Y. Wang, Y. Zhu, Q. Sun, L. Qin, Adaptively robust high-dimensional matrix factor analysis under Huber loss function, Journal of Statistical Planning and Inference 231 (2024) 106137. doi: 10.1016/j.jspi.2023.106137
[27] J. Zhang, H. Yang, Bounded quantile loss for robust support vector machines-based classification and regression, Expert Systems with Applications 242 (2024) 122759. doi: 10.1016/j.eswa.2023.122759
[28] A. Araveeporn, An estimating parameter of nonparametric regression model based on smoothing techniques, Statistical Journal of the IAOS 35:2 (2019) 269–276. doi: 10.3233/SJI-180477
[29] S. R. Dubey, S. K. Singh, B. B. Chaudhuri, Activation functions in deep learning: A comprehensive survey and benchmark, Neurocomputing 503 (2022) 92–108. doi: 10.1016/j.neucom.2022.06.111
[30] G. Bingham, R. Miikkulainen, Discovering Parametric Activation Functions, Neural Networks 148 (2022) 48–65. doi: 10.1016/j.neunet.2022.01.001
[31] J. Wei, X. Zhang, Z. Zhuo, Z. Ji, Z. Wei, J. Li, Q. Li, Leader population learning rate schedule, Information Sciences 623 (2023) 455–468. doi: 10.1016/j.ins.2022.12.039
[32] J. Wei, X. Zhang, Z. Ji, Z. Wei, J. Li, DPLRS: Distributed Population Learning Rate Schedule, Future Generation Computer Systems 132 (2022) 40–50. doi: 10.1016/j.future.2022.02.001
[33] I. Salehin, Md. Shamiul Islam, P. Saha, S. M. Noman, A. Tuni, Md. Mehedi Hasan, Md. Abu Baten, AutoML: A systematic review on automated machine learning with neural architecture search, Journal of Information and Intelligence 2:1 (2024) 52–81. doi: 10.1016/j.jiixd.2023.10.002
[34] X. He, K. Zhao, X. Chu, AutoML: A survey of the state-of-the-art, Knowledge-Based Systems 212 (2021) 106622. doi: 10.1016/j.knosys.2020.106622
[35] S. Vladov, Y. Shmelov, R. Yakovliev, Optimization of Helicopters Aircraft Engine Working Process Using Neural Networks Technologies, CEUR Workshop Proceedings 3171 (2022) 1639–1656.
[36] S. Vladov, Y. Shmelov, R. Yakovliev, Modified Searchless Method for Identification of Helicopters Turboshaft Engines at Flight Modes Using Neural Networks, in: Proceedings of the 2022 IEEE 3rd KhPI Week on Advanced Technology, Kharkiv, Ukraine, October 03–07, 2022, pp. 257–262. doi: 10.1109/KhPIWeek57572.2022.9916422
[37] D. Konar, A. D. Sarma, S. Bhandary, S. Bhattacharyya, A. Cangi, V. Aggarwal, A shallow hybrid classical–quantum spiking feedforward neural network for noise-robust image classification, Applied Soft Computing 136 (2023) 110099. doi: 10.1016/j.asoc.2023.110099
[38] B. Yang, B. Liang, Y. Qian, R. Zheng, S. Su, Z. Guo, L. Jiang, Parameter identification of PEMFC via feedforward neural network-pelican optimization algorithm, Applied Energy 361 (2024) 122857. doi: 10.1016/j.apenergy.2024.122857
[39] X. Wang, P. Dai, X. Cheng, Y. Liu, J. Cui, L. Zhang, D. Feng, Aerospace Science and Technology 128 (2022) 107739. doi: 10.1016/j.ast.2022.107739
[40] J. Pousin, Least squares formulations for some elliptic second order problems, feedforward neural network solutions and convergence results, Journal of Computational Mathematics and Data Science 2 (2022) 100023. doi: 10.1016/j.jcmds.2022.100023
[41] S. Vladov, Y. Shmelov, R. Yakovliev, Parameter Debugging (Regulation) Method of Helicopters Aircraft Engines in Flight Modes Using Neural Networks, CEUR Workshop Proceedings 3179 (2022) 1–14.
[42] S. Vladov, R. Yakovliev, O. Hubachov, J. Rud, Y. Stushchanskyi, Neural Network Modeling of Helicopters Turboshaft Engines at Flight Modes Using an Approach Based on “Black Box” Models, CEUR Workshop Proceedings 3624 (2024) 116–135.
[43] F. S. Corotto, Appendix C - The method attributed to Neyman and Pearson, Wise Use of Null Hypothesis Tests (2023) 179–188. doi: 10.1016/B978-0-323-95284-2.00012-4
[44] F. V. Motsnyi, Analysis of Nonparametric and Parametric Criteria for Statistical Hypotheses Testing. Chapter 1. Agreement Criteria of Pearson and Kolmogorov, Statistics of Ukraine 4’2018 (83) (2018) 14–24. doi: 10.31767/su.4(83)2018.04.02
[45] D. Parnes, A. Gormus, Prescreening bank failures with K-means clustering: Pros and cons, International Review of Financial Analysis 93 (2024) 103222. doi: 10.1016/j.irfa.2024.103222
[46] S. Vladov, Y. Shmelov, R. Yakovliev, Y. Stushchankyi, Y. Havryliuk, Neural Network Method for Controlling the Helicopters Turboshaft Engines Free Turbine Speed at Flight Modes, CEUR Workshop Proceedings 3426 (2023) 89–108.
[47] M. Duhan, P. K. Bhatia, Hybrid Maintainability Prediction using Soft Computing Techniques, International Journal of Computing 20(3) (2021) 350–356. doi: 10.47839/ijc.20.3.2280
[48] M. Duhan, P. K. Bhatia, Software Reusability Estimation based on Dynamic Metrics using Soft Computing Techniques, International Journal of Computing 21(2) (2022) 188–194. doi: 10.47839/ijc.21.2.2587
[49] S. Vladov, Y. Shmelov, R. Yakovliev, M. Petchenko, S. Drozdova, Helicopters Turboshaft Engines Parameters Identification at Flight Modes Using Neural Networks, in: Proceedings of the IEEE 17th International Conference on Computer Science and Information Technologies (CSIT), Lviv, Ukraine, 2022, pp. 5–8. doi: 10.1109/CSIT56902.2022.10000444
[50] S. Vladov, Y. Shmelov, R. Yakovliev, M. Petchenko, S. Drozdova, Neural Network Method for Helicopters Turboshaft Engines Working Process Parameters Identification at Flight Modes, in: Proceedings of the 2022 IEEE 4th International Conference on Modern Electrical and Energy System (MEES), Kremenchuk, Ukraine, 2022, pp. 604–609. doi: 10.1109/MEES58014.2022.10005670
[51] V. V. Morozov, O. V. Kalnichenko, O. O. Mezentseva, The method of interaction modeling on basis of deep learning the neural networks in complex it-projects, International Journal of Computing 19(1) (2020) 88–96. doi: 10.47839/ijc.19.1.1697
[52] S. Bezobrazov, V. Golovko, A. Sachenko, M. Komar, R. Dolny, V. Kasyanik, P. Bykovyy, E. Mikhno, O. Osolinskyi, Deep multilayer neural network for predicting the winner of football matches, International Journal of Computing 19(1) (2020) 70–77. doi: 10.47839/ijc.19.1.1695
[53] E. M. Cherrat, R. Alaoui, H. Bouzahir, Score fusion of finger vein and face for human recognition based on convolutional neural network model, International Journal of Computing 19(1) (2020) 11–19. doi: 10.47839/ijc.19.1.1688
[54] K. Andriushchenko, V. Rudyk, O. Riabchenko, M. Kachynska, N. Marynenko, L. Shergina, V. Kovtun, M. Tepliuk, A. Zhemba, O. Kuchai, Processes of managing information infrastructure of a digital enterprise in the framework of the «Industry 4.0» concept, Eastern-European Journal of Enterprise Technologies 1(3–97) (2019) 60–72. doi: 10.15587/1729-4061.2019.157765
[55] T. E. Romanova, P. I. Stetsyuk, A. M. Chugay, S. B. Shekhovtsov, Parallel Computing Technologies for Solving Optimization Problems of Geometric Design, Cybernetics and System Analysis 55(6) (2019) 894–904. doi: 10.1007/s10559-019-00199-4
[56] S. Vladov, Y. Shmelov, R. Yakovliev, Modified Neural Network Method for Classifying the Helicopters Turboshaft Engines Ratings at Flight Modes, in: Proceedings of the 2022 IEEE 41st International Conference on Electronics and Nanotechnology (ELNANO), Kyiv, Ukraine, 2022, pp. 535–540. doi: 10.1109/ELNANO54667.2022.9927108
[57] F. Munoz, J. M. Valdovinos, J. S. Cervantes-Rojas, S. S. Cruz, A. M. Santana, Leader–follower consensus control for a class of nonlinear multi-agent systems using dynamical neural networks, Neurocomputing 561 (2023) 126888. doi: 10.1016/j.neucom.2023.126888
[58] V. Makarov, The neural network to identify an object by a sequential training mode, Procedia Computer Science 190 (2021) 532–539. doi: 10.1016/j.procs.2021.06.062