The Methods for Training Technological Multilayered Neural Network Structures

Andrey Kupin, Yuriy Osadchuk, Rodion Ivchenko and Oleg Gradovoy
Kryvyi Rih National University, 11 Vitaly Matusevich St., Kryvyi Rih, 50027, Ukraine

Abstract
This paper analyses the existing methods for training multilayer neural network structures. The most effective training methods are investigated by means of computer modelling. Recommendations for applying the selected methods are given on the example of multilayer approximation problems for beneficiation technology.

Keywords
Multilayer neural networks, training methods, conjugate gradient, approximation, classification.

ICTERI-2021, Vol I: Main Conference, PhD Symposium, Posters and Demonstrations, September 28 – October 2, 2021, Kherson, Ukraine
EMAIL: kupin.andrew@gmail.com (A. Kupin); u.osadchuk@knu.edu.ua (Y. Osadchuk); ivchenko.ra@gmail.com (R. Ivchenko); queke888@gmail.com (O. Gradovoy)
ORCID: 0000-0001-7569-1721 (A. Kupin); 0000-0001-6110-9534 (Y. Osadchuk); 0000-0003-4252-4825 (R. Ivchenko); 0000-0001-6984-1690 (O. Gradovoy)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Nowadays, intelligent control technologies are increasingly applied to solve applied problems of informatization and automation in the conditions of complex production [1]. One of the basic approaches to constructing mathematical models for approximation, identification and classification is the application of multilayer neural networks (NN) of different architectures. The complex technological processes of the mining and metallurgical industry are a good basis for demonstrating the benefits of intelligent technologies: they are characterised by multifactoriality, incomplete information, nonlinear characteristics, nonstationarity, etc. All this provides quite good prerequisites for the successful application of computational intelligence technologies in the automation of basic processes.

To date, the theory of artificial neural networks gives no definite answer to the question of an unambiguous choice of a particular architecture and of the most effective training (parameterization) method. Therefore, most researchers proceed empirically, selecting the best variant from a set of potentially possible alternatives according to certain criteria and under the conditions of a specific technology.

2. Analysis of recent research, publications and problem statement

Numerous works [2-4] have demonstrated the outstanding capabilities of neural networks and fuzzy logic for solving problems of automated control of similar technological processes in the mining, metallurgical, coal, energy and other industries of Ukraine. However, many unsolved problems associated with various applied aspects of intellectual control remain: first of all, the choice of neural network architecture, parameter settings, the choice of effective learning algorithms, etc. The methodology of scientific approaches under such conditions can differ significantly. This work is intended to solve one of these problems: the training of neural structures for technological purposes.
To train (parameterize) multilayer neural network structures intended for the subsequent identification and control of complex technological processes (TP) in real time, it is necessary to apply methods that meet certain requirements. According to [2], these requirements concern, first of all, speed of convergence, computational robustness, demands on operating memory, etc. Among the existing methods, the so-called second-order methods currently meet these requirements best, namely [2-6]:
- Levenberg-Marquardt (LM);
- Gauss-Newton (GN);
- Conjugate Gradient (CG);
- modifications of these methods.

Therefore, the further analysis, research and selection of the potentially most effective methods for training neural structures of technological purpose, proposed in [1], will be limited to this set of methods. Very important from the point of view of automating the further calculations and modelling is that these methods are implemented in the most powerful application packages for emulating neural network structures (MATLAB Neural Tools, NeuroSolutions, Statistica Neural Networks, etc.) [5, 6].

3. Statement of the material and results

All the aforementioned methods are based on the Taylor expansion of the criterion function up to second order inclusive. Near a point $\Theta^*$ (the theoretical optimum of the NN parameters) such an expansion has the form [4]:

$$V_M(\Theta, S, \Omega) \approx V_M(\Theta^*, S, \Omega) + (\Theta - \Theta^*)^T G(\Theta^*) + \frac{1}{2}(\Theta - \Theta^*)^T H(\Theta^*)(\Theta - \Theta^*), \qquad (1)$$

where $V_M$ denotes the criterion function; $\Theta$ is the vector of parameters subject to adjustment (NN architecture, weight factors, regression depth); $S$ is the version of the regression model applied; $\Omega$ is the statistical data sample for training; $G(\Theta^*)$ and $H(\Theta^*)$ are, respectively, the gradient and the Hessian at the optimum point. The gradient is defined as

$$G(\Theta^*) = V_M'(\Theta^*, S, \Omega) = \left. \frac{d V_M(\Theta, S, \Omega)}{d\Theta} \right|_{\Theta = \Theta^*} \qquad (2)$$

and the matrix of second derivatives (the Hessian) as

$$H(\Theta^*) = V_M''(\Theta^*, S, \Omega) = \left. \frac{d^2 V_M(\Theta, S, \Omega)}{d\Theta^2} \right|_{\Theta = \Theta^*}. \qquad (3)$$

A zero value of the gradient together with positive definiteness of the Hessian constitutes a sufficient condition for a minimum of the function, that is

$$G(\Theta^*) = 0, \quad H(\Theta^*) > 0.$$

In most cases the search for a minimum can be reduced to an iterative procedure of the type

$$\Theta^{(i+1)} = \Theta^{(i)} + \mu^{(i)} f^{(i)},$$

where $\Theta^{(i)}$ is the value of the parameters at the current iteration $(i)$; $f^{(i)}$ is the search direction; $\mu^{(i)}$ is the step of the algorithm at the current iteration. A linear approximation of the prediction error $\varepsilon(t, \Theta)$ with respect to the original signal $\hat{y}(t \mid \Theta)$ at the NN output is applied in the form

$$\tilde{\varepsilon}(t, \Theta) = \varepsilon(t, \Theta^{(i)}) - \psi^T(t, \Theta^{(i)})(\Theta - \Theta^{(i)}),$$

where $\psi(t, \Theta) = \dfrac{d\hat{y}(t \mid \Theta)}{d\Theta}$ and $t$ is the value of discrete time. The modified criterion (1) is then

$$V_M(\Theta, S, \Omega) \approx L^{(i)}(\Theta) = \frac{1}{2M} \sum_{t=1}^{M} [\tilde{\varepsilon}(t, \Theta)]^2,$$

where $L^{(i)}(\Theta)$ is the approximate value of the modified criterion and $M$ is the number of templates in the training sample. In the Gauss-Newton method the search direction is based on this approximation of the criterion $L^{(i)}(\Theta)$ around the current iteration [2-5]. In turn, the conjugate gradient method is based on changing (restarting) the search direction towards the gradient (anti-gradient) under conditions of a sharp slowdown of convergence.
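For illustration, the following minimal Python/NumPy sketch computes the gradient (2), the Gauss-Newton approximation of the Hessian (3) and the corresponding search direction for the criterion (1). The linear toy model, the data and the unit step size are assumptions introduced purely for demonstration, not part of the original study:

```python
import numpy as np

def gn_step(eps, psi):
    """One Gauss-Newton step for V(theta) = 1/(2M) * sum(eps_t^2).

    eps : (M,)   prediction errors eps(t, theta) = y - y_hat
    psi : (M, n) rows psi^T(t, theta) = d y_hat / d theta
    """
    M = eps.shape[0]
    G = -psi.T @ eps / M        # gradient of V, cf. eq. (2)
    R = psi.T @ psi / M         # GN approximation of the Hessian, cf. eq. (3)
    f = np.linalg.solve(R, -G)  # search direction from R f = -G
    return f, G, R

# Hypothetical toy model y_hat = a*x + b, so psi = [x, 1].
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 50)
y = 2.0 * x + 0.5 + 0.01 * rng.standard_normal(50)

theta = np.zeros(2)                        # initial parameters (a, b)
for _ in range(5):                         # theta^(i+1) = theta^(i) + mu^(i) f^(i)
    eps = y - (theta[0] * x + theta[1])    # prediction error eps(t, theta)
    psi = np.column_stack([x, np.ones_like(x)])
    f, G, _ = gn_step(eps, psi)
    theta += 1.0 * f                       # full step, mu = 1
    if np.linalg.norm(G) < 1e-10:          # stop once G(theta*) ~ 0
        break

print(theta)  # converges towards (2.0, 0.5)
```

For a linear model the Gauss-Newton step coincides with the least-squares solution, so the iteration converges in one accepted step; for the nonlinear NN models considered here, R only approximates H and several iterations are required.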
There are different lines of reasoning and implementation algorithms for these procedures for both methods (a set of versions is given in [7]). At the same time, neither algorithm takes into account that the global minimum of $L^{(i)}(\Theta)$ may lie outside the zone of the current iteration, in which case the search will be incorrect. It is therefore more rational to first estimate the expediency of searching for a minimum of $L^{(i)}(\Theta)$ in the region of the current iteration. For this purpose, in the algorithm of the Levenberg-Marquardt method (known in the literature under the synonyms Levenberg-Marquardt methods, the Levenberg scheme) a sphere of radius $\delta^{(i)}$ is selected. The optimisation problem can then be formulated in the form of the system

$$\hat{\Theta} = \arg\min_{\Theta} L^{(i)}(\Theta), \quad |\Theta - \Theta^{(i)}| \le \delta^{(i)}. \qquad (4)$$

The iterative procedure for searching for a minimum in the presence of constraints contains the following stages:

$$\Theta^{(i+1)} = \Theta^{(i)} + f^{(i)}, \quad [R(\Theta^{(i)}) + \lambda^{(i)} I] f^{(i)} = -G(\Theta^{(i)}), \qquad (5)$$

where $\lambda^{(i)}$ is the parameter which defines the region $\delta^{(i)}$. The hypersphere of radius $\delta^{(i)}$ is interpreted as the region within which $L^{(i)}(\Theta)$ can be considered an adequate approximation of the criterion $V_M(\Theta, S, \Omega)$. A feature of the method is the procedure for determining the interrelation between the parameters $\lambda^{(i)}$ and $\delta^{(i)}$. Since no unambiguous dependence between them exists, in practice some heuristic procedures are applied [2]. For example, $\lambda^{(i)}$ is gradually increased until a reduction of the criterion $L^{(i)}(\Theta)$ takes place, after which the iteration ends; the value of the parameter $\lambda^{(i+1)}$ for the following operation is then decreased. An alternative approach is also applied, based on comparing the real reduction of the criterion with the reduction predicted on the basis of the approximation $L^{(i)}(\Theta)$. As a measure of the accuracy of the approximation the factor

$$r^{(i)} = \frac{V_M(\Theta^{(i)}, S, \Omega) - V_M(\Theta^{(i)} + f^{(i)}, S, \Omega)}{V_M(\Theta^{(i)}, S, \Omega) - L^{(i)}(\Theta^{(i)} + f^{(i)})} \qquad (6)$$

is considered. When the value of the factor $r^{(i)}$ approaches 1, $L^{(i)}(\Theta)$ is an adequate approximation of $V_M(\Theta, S, \Omega)$ and the value of $\lambda$ is decreased, which corresponds to an increase of $\delta^{(i)}$. On the other hand, small or negative values of the factor lead to the necessity of increasing $\lambda$. On this basis, the general scheme of the algorithm is as follows:
1. Select initial values of the vector of parameters subject to adjustment, $\Theta^{(0)}$, and of the factor $\lambda^{(0)}$.
2. Determine the direction of search from the system of equations (5).
3. If $r^{(i)} > 0.75$, then $\lambda^{(i)} = \lambda^{(i)} / 2$.
4. If $r^{(i)} < 0.25$, then $\lambda^{(i)} = 2\lambda^{(i)}$.
5. If the criterion decreased, accept $\Theta^{(i+1)} = \Theta^{(i)} + f^{(i)}$ for the new iteration and establish $\lambda^{(i+1)} = \lambda^{(i)}$.
6. If the stopping criterion has not been reached, pass to stage 2.

The value of the criterion being minimised can be presented in the form

$$L^{(i)}(\Theta^{(i)} + f) = V_M(\Theta^{(i)}, S, \Omega) + f^T G(\Theta^{(i)}) + \frac{1}{2} f^T R(\Theta^{(i)}) f. \qquad (7)$$

Substituting into (7) the expression for the search direction obtained from the relation $R(\Theta^{(i)}) f^{(i)} = -G(\Theta^{(i)}) - \lambda^{(i)} f^{(i)}$, we obtain

$$V_M(\Theta^{(i)}, S, \Omega) - L^{(i)}(\Theta^{(i)} + f^{(i)}) = \frac{1}{2}\left( -(f^{(i)})^T G(\Theta^{(i)}) + \lambda^{(i)} \| f^{(i)} \|^2 \right). \qquad (8)$$

The relation (8) allows the factor $r^{(i)}$ at stages 3 and 4 of the algorithm to be determined by expression (6).
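Stages 1-6 above can be summarised by the following hedged Python/NumPy sketch of the Levenberg-Marquardt scheme. The error and Jacobian callbacks, the tanh toy model and the initial values are illustrative assumptions; the thresholds 0.75/0.25 and the halving/doubling of $\lambda$ follow stages 3-4:

```python
import numpy as np

def levenberg_marquardt(eps_fn, psi_fn, theta0, lam0=1e-2,
                        max_iter=100, tol=1e-8):
    """Sketch of the LM scheme, stages 1-6 above.

    eps_fn(theta) -> (M,)   prediction errors eps(t, theta)
    psi_fn(theta) -> (M, n) Jacobian rows psi^T(t, theta)
    """
    theta, lam = theta0.copy(), lam0              # stage 1: initial Theta, lambda
    for _ in range(max_iter):
        eps, psi = eps_fn(theta), psi_fn(theta)
        M = eps.shape[0]
        V = 0.5 * eps @ eps / M                   # criterion V_M
        G = -psi.T @ eps / M                      # gradient, eq. (2)
        R = psi.T @ psi / M                       # GN Hessian approximation
        # stage 2: direction from (R + lam*I) f = -G, eq. (5)
        f = np.linalg.solve(R + lam * np.eye(len(theta)), -G)
        eps_new = eps_fn(theta + f)
        V_new = 0.5 * eps_new @ eps_new / M
        pred = 0.5 * (-f @ G + lam * f @ f)       # predicted reduction, eq. (8)
        r = (V - V_new) / pred if pred > 0 else -1.0  # accuracy factor, eq. (6)
        if r > 0.75:                              # stage 3: enlarge trust region
            lam /= 2.0
        elif r < 0.25:                            # stage 4: shrink trust region
            lam *= 2.0
        if V_new < V:                             # stage 5: accept the step
            theta = theta + f
        if np.linalg.norm(G) < tol:               # stage 6: stopping criterion
            break
    return theta

# Hypothetical usage: fit y_hat = a * tanh(b * x)
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, 100)
y = 1.5 * np.tanh(0.8 * x) + 0.01 * rng.standard_normal(100)
eps_fn = lambda th: y - th[0] * np.tanh(th[1] * x)
psi_fn = lambda th: np.column_stack([np.tanh(th[1] * x),
                                     th[0] * x / np.cosh(th[1] * x) ** 2])
print(levenberg_marquardt(eps_fn, psi_fn, np.array([1.0, 1.0])))
```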
On the basis of the general technique of intelligent neural multilayer identification [8], research of modelling structures based on neural network autoregressive predictors was conducted, with the application of computer modelling methods, for the conditions of the TP of magnetite quartzite beneficiation. The research included the following stages:
- choice of a training method and estimation of the regression depth (the number of delayed signals at the input and output) of the models;
- application of the training methods (speed of convergence, accuracy);
- direct and inverse forecasting;
- testing of the obtained systems for nonlinearity.

The analysis and choice of a basic set of training methods for the identification models was carried out on the basis of the technique stated in [2]. The basic phases of the investigation are as follows (a code sketch of phases 1-3 is given after this subsection):
1. For the simulation experiments the elementary model of type NNARX (Neural Network based AutoRegressive model with eXogenous signal) was selected. For simplification of the analysis an identical regression depth $l_1 = l_2 = 2$ was accepted on the basis of the previous results [1, 8].
2. Templates of NN modelling structures were prepared in the bases of feed-forward NN (NNDD), radial basis functions (RBF) and fully connected (FCNN, recurrent) networks. For all models an NN with one hidden layer was applied according to the formula 16-8-8 (the corresponding numbers of neurons at the structure input, in the hidden layer and at the output).
3. Tenfold training and testing of all the specified NN structures was carried out with the application of four training methods: back propagation of error (BP method, the de facto standard of NN training [2-6]), Gauss-Newton (GN method), Levenberg-Marquardt (LM) and Conjugate Gradient (CG). A statistical sample of indicators of the Northern Mining Complex (Kryviy Rih, Ukraine) was applied for training according to the formula 350-280-70 (total number of templates, number of templates for training, number of templates for verification). Basic indicators of the first and last stages of the TP were thereby analysed.
4. Average indicators of convergence (the number of epochs, or iterations, of training), robustness (root-mean-square error, MSE, and normalized root-mean-square error, NMSE [6]) and the computing resources applied (operating memory) were brought together in Table 1.
5. On the basis of the results received in the course of the research, a comparative analysis was carried out.

The authors tested other neural network architectures using a similar methodology. All the research data showed quite encouraging results, which proves that the approach is quite promising. Further research will consider more complex neural network architectures based on deep learning. In our opinion, deep neural networks (DNN [7]) of the following types may be of the greatest interest here:
- convolutional (CNN);
- recurrent (RNN);
- long short-term memory (LSTM);
- neural networks with an attention mechanism (NNAM).

Convolutional neural networks use convolutional, pooling, fully connected and loss layers. The convolutional layer essentially computes integrals over many small overlapping regions of the input. In a fully connected layer, neurons have connections to all activations in the previous layer. The loss layer calculates how network learning penalises the deviation between the predicted and true labels, using the softmax or cross-entropy loss function for classification, or the Euclidean loss function for regression. A network with long short-term memory is capable of forgetting or remembering previous information; LSTM networks can handle sequences of hundreds of past inputs. Attention modules are generalised elements that apply weights to a vector of inputs, and a hierarchical neural attention encoder uses multiple layers of attention modules to work with tens of thousands of past inputs.
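As a hedged illustration of phases 1-3 above, the sketch below builds NNARX regressor templates of depth $l_1 = l_2 = 2$, passes them through a one-hidden-layer tanh perceptron and applies the 350-280-70 template split. The split of the 16 regressors into four exogenous inputs and four outputs is an assumption for demonstration only (the exact signal composition is not stated above), so the output layer here is sized to the four hypothetical target signals rather than the eight of the 16-8-8 template:

```python
import numpy as np

def nnarx_regressors(u, y, l1=2, l2=2):
    """Build NNARX templates phi(t) = [y(t-1..t-l1), u(t-1..t-l2)].

    u : (T, nu) exogenous process inputs;  y : (T, ny) process outputs.
    Returns inputs (T-l, l1*ny + l2*nu) and targets (T-l, ny).
    """
    l = max(l1, l2)
    rows = []
    for t in range(l, len(y)):
        past_y = y[t - l1:t][::-1].ravel()   # delayed outputs, most recent first
        past_u = u[t - l2:t][::-1].ravel()   # delayed exogenous inputs
        rows.append(np.concatenate([past_y, past_u]))
    return np.array(rows), y[l:]

class MLP:
    """One-hidden-layer tanh perceptron (forward pass only)."""
    def __init__(self, n_in=16, n_hid=8, n_out=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((n_in, n_hid)) * 0.1
        self.b1 = np.zeros(n_hid)
        self.W2 = rng.standard_normal((n_hid, n_out)) * 0.1
        self.b2 = np.zeros(n_out)

    def __call__(self, phi):
        h = np.tanh(phi @ self.W1 + self.b1)  # hidden layer
        return h @ self.W2 + self.b2          # linear output layer

# Hypothetical data: 4 input and 4 output TP signals, T = 350 templates.
rng = np.random.default_rng(2)
u, y = rng.standard_normal((350, 4)), rng.standard_normal((350, 4))
phi, targets = nnarx_regressors(u, y)         # phi has 2*4 + 2*4 = 16 columns
net = MLP(n_in=16, n_hid=8, n_out=4)          # paper's template is 16-8-8
pred = net(phi)                               # predictions y_hat(t | theta)
# ~350-280-70 split (2 templates are consumed by the regression depth):
train_phi, ver_phi = phi[:280], phi[280:]
```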
Table 1
Comparative estimation of accuracy, resource consumption and speed of convergence of the potential training algorithms for the investigated neural structures

Algorithm of training   Convergence, epochs (iterations)   MSE          NMSE          Computer resources, Mb
1. Basis NN (multilayer perceptron)
1.1. BP                 568                                1.198596     1.76165223    30
1.2. GN                 303                                1.161828     1.96306745    24
1.3. LM                 177                                0.778172     1.45139743    35
1.4. CG                 425                                0.888760     1.45448391    21
2. Basis RBF (radial basis functions)
2.1. BP                 196                                1.85732511   2.111487478   30
2.2. GN                 65                                 1.19651332   2.131730124   25
2.3. LM                 31                                 0.79076953   1.906790835   35
2.4. CG                 87                                 0.89815021   1.912728683   21
3. Basis FCNN (fully connected neural networks)
3.1. BP                 837                                1.0915434    1.60226771    33
3.2. GN                 451                                1.0807423    1.77265223    27
3.3. LM                 265                                0.7223413    1.21234453    37
3.4. CG                 637                                0.8684867    1.26644234    22

Three independent application packages (neural simulators) were applied as program environments for the computer modelling: NeuroSolutions, Statistica Neural Networks and MATLAB Neural Networks Tools (NNT). The corresponding results of modelling in these different packages approximately coincide; moreover, all the received results agree well enough with those given in [1, 2]. The following hardware and software platform was applied in the course of the computer modelling:
- personal computer with working parameters CPU Pentium IV 2.66 GHz / RAM 8 Gb;
- operating system MS Windows 10.

Figure 1 shows the curves of the change of the root-mean-square error criterion (MSE) and its normalized variant (NMSE) in the course of training the model of type NNARX for the different bases of neural network structures. [The two panels of Figure 1, "MSE versus Epoch" and "NMSE versus Epoch", plot MSE(min) and NMSE(min) against epochs 1-451 for the three curves listed in the caption.]

Figure 1: Change of the MSE and NMSE criteria with the number of iterations (epochs) when training the neural identification model NNARX: 1 - a two-layer perceptron trained by the CG method; 2 - a radial basis function (RBF) network trained by the GN method; 3 - a fully connected, partially recurrent network trained by the LM method.

Similar results were received by the authors for other, extended autoregressive predictor models: NNARMAX (NNARX + Moving Average, eXogenous signal) and NNOE (Neural Network Output Error). In addition to these standard criteria, the authors have tested other indicators in related works (for example, generalisation ability, conditioning, statistical hypotheses, etc.); those results also give quite good indicators [8].

4. Conclusions

The analysis of the results of the computer modelling allows certain generalisations in the form of the following conclusions. The results of training the intelligent neural models of type NNARX are qualitatively almost identical if they are accordingly grouped (clustered) by identical training methods (GN, CG, LM). From the point of view of speed of convergence and robustness the Levenberg-Marquardt (LM) method looks the most promising, but its utilization of computing resources is the greatest. The standard method of NN training, based on back propagation of error (BP), has shown good enough robustness, but its speed of convergence is rather slow and its resource requirements are too big.
The Gauss-Newton (GN) and Conjugate Gradient (CG) methods have shown approximately identical and sufficiently balanced results. In view of the above tests it is possible to recommend applying recurrent dynamic neural structures to the approximation of complex TP, under the condition that their hardware implementation (for example, on neuro-graphic processors) or the application of parallel and distributed computing is possible [9-11]. The latter constitutes the immediate prospects for the continuation of further research in this direction.

5. References

[1] Kupin, A., Senko, A. Principles of intellectual control and classification optimization in conditions of technological processes of beneficiation complexes. CEUR Workshop Proceedings, 2015, 1356, pp. 153-160. URL: http://ceur-ws.org/Vol-1356/paper_34.pdf
[2] Bublikov, A.V., Tkachov, V.V. Automation of the control process of the mining machines based on fuzzy logic. Naukovyi Visnyk Natsionalnoho Hirnychoho Universytetu, 2019, 2019(3), pp. 112-118. DOI: 10.29202/nvngu/2019-3/19
[3] Aggarwal, C.C. Neural Networks and Deep Learning. Springer International Publishing, Cham (2018). DOI: https://doi.org/10.1007/978-3-319-94463-0
[4] Morkun, V.S., Morkun, N.V., Tron, V.V., Dotsenko, I.A. Adaptive control system for the magnetic separation process. Sustainable Development of Mountain Territories, 2018, 10(4), pp. 545-557. URL: http://naukagor.ru/Portals/4/%233%202018/№4,%202018.pdf?ver=2019-02-21-091240-697
[5] Livshin, I. Artificial Neural Networks with Java. Apress, Berkeley, CA (2019). DOI: https://doi.org/10.1007/978-1-4842-4421-0
[6] Rudenko, O.G., Bezsonov, A.A. Neural network approximation of nonlinear noisy functions based on coevolutionary cooperative-competitive approach. Journal of Automation and Information Sciences, 2018, 50(5), pp. 11-21. DOI: 10.1615/JAutomatInfScien.v50.i5.20
[7] Trunov, A., Malcheniuk, A. Recurrent network as a tool for calibration in automated systems and interactive simulators. Eastern-European Journal of Enterprise Technologies, 2018, 2(9-92), pp. 54-60. DOI: https://doi.org/10.15587/1729-4061.2018.126498
[8] Kupin, A. Research of properties of conditionality of task to optimization of processes of concentrating technology is on the basis of application of neural networks. Metallurgical and Mining Industry, 2014, 6(4), pp. 51-55. URL: https://www.metaljournal.com.ua/assets/Journal/11.2014.pdf
[9] Hu, Z., Bodyanskiy, Y., Tyshchenko, O.K. Self-learning procedures for a kernel fuzzy clustering system. Advances in Intelligent Systems and Computing, 2019, 754, pp. 487-497. URL: https://link.springer.com/chapter/10.1007%2F978-3-319-91008-6_49
[10] Derbentsev, V., Semerikov, S., Serdyuk, O., Solovieva, V., Soloviev, V. Recurrence based entropies for sustainability indices. E3S Web of Conferences, 2020, 166 (The International Conference on Sustainable Futures: Environmental, Technological, Social and Economic Matters, ICSF 2020), pp. 1-7. DOI: https://doi.org/10.1051/e3sconf/202016613031
[11] Drozd, O., Kharchenko, V., Rucinski, A., Kochanski, T., Garbos, R., Maevsky, D. Development of models in resilient computing. Proceedings of the 10th IEEE International Conference on Dependable Systems, Services and Technologies (DESSERT 2019), Leeds, UK, June 5-7, 2019, pp. 2-7. DOI: https://doi.org/10.1109/DESSERT.2019.8770035