Forecast Method for Natural Language Constructions
        Based on a Modified Gated Recursive Block

         Eugene Fedorov1[0000-0003-3841-7373], Olga Nechyporenko1[0000-0002-3954-3796],
                           Tetyana Utkina1[0000-0002-6614-4133]
1
    Cherkasy State Technological University, Cherkasy, Shevchenko blvd., 460, 18006, Ukraine

                          {fedorovee75, olne}@ukr.net,
                                 t.utkina@chdtu.edu.ua


         Abstract. The paper proposes a method for predicting natural language con-
         structions based on a modified gated recursive block. For this, an artificial neu-
         ral network model was created, a criteria for evaluating the efficiency of the
         proposed model was selected, two methods for parametric identification of the
         artificial neural network model were developed is based on the backpropagation
         through time algorithm and based on simulated annealing particle swarm opti-
         mization algorithms. The proposed model and methods for its parametric identi-
         fication make it possible to more accurately control the share of information
         coming from the input layer and the hidden layer of the model, increase the par-
         ametric identification speed and the prediction probability. The proposed meth-
         od for predicting natural language constructions can be used in various intelli-
         gent natural language processing systems.

         Keywords: modified gated recursive block, prediction of natural language con-
         structions, particle swarm optimization, simulated annealing, parametric identi-
         fication.


1        Introduction

Currently, one of the most important problems in the field of natural language pro-
cessing is the insufficiently high accuracy of the analysis of alphabetic and/or pho-
neme sequences [1, 2]. This leads to the fact that natural language processing may be
ineffective. Therefore, the development of methods for predicting natural language
constructions is an important task.
   As a prediction method, a neural network forecast [3] was chosen, which, when
forecasting natural language constructions, has the following advantages:

─ correlations between factors are studied on existing models;
─ no assumptions regarding the distribution of factors are required ;
─ prior information about factors may be absent;
─ source data can be highly correlated, incomplete or noisy;
─ analysis of systems with a high degree of nonlinearity is possible;
    Copyright © 2020 for this paper by its authors.
    Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
─ fast model development;
─ high adaptability;
─ analysis of systems with a large number of factors is possible;
─ full enumeration of all possible models is not required;
─ analysis of systems with heterogeneous factors is possible.

The following recurrent networks are most often used as forecast neural
networks [4-6]:

─ Jordon neural network (JNN) [7, 8];
─ Elman neural network (ENN) or simple recurrent network (SRN) [9, 10];
─ bidirectional recurrent neural network (BRNN) [11, 12];
─ long short-term memory(LSTM) [13, 14];
─ gated recurrent block (GRU) [15, 16];
─ echo state network (ESN) [17, 18];
─ liquid state machine (LSM) [19, 20].

Table 1 shows the comparative characteristics of neural networks for predicting natu-
ral language constructions.

                   Table 1. Comparative characteristics of neural networks
                      for the predicting natural language constructions

                          Network           ENN
                                    JNN          BRNN LSTM GRU               ESN   LSM
 Criterion                                 (SRN)
 Low probability of getting into
                                     -        -       -       -        -      +     +
 a local extremum
 High learning speed                 +       +        +       -       +       +     +
 Possibility of batch training       -       -        -       -       -       -     +
 Dynamic control of the share of
 information from the input and      -        -       -       +       +       -     -
 hidden layers
   According to table 1, none of the networks meets all the criteria. In this regard, the
creation of training methods that will eliminate these drawbacks is relevant.
   To increase the probability of falling into a global extremum and replacing batch
training with multi-agent training, metaheuristic search is often used instead of local
search [21-25]. Metaheuristics expands the capabilities of heuristics by combining
heuristic methods based on a high-level strategy [26-30].
   Modern metaheuristics may have one or more of the following disadvantages:
─ there is only a generalized method structure or the method structure is focused on
  solving only a specific problem [21];
─ iteration numbers are not present when searching for a solution [22];
─ the method may not converge [31];
─ material potential solutions are unacceptable [32];
─ there is no formalized parameter values search strategy [33];
─ the method is not intended for conditional optimization [34];
─ the method does not possess high accuracy [35].

Thereby, arises the problem of constructing an effective metaheuristic optimization
method.
   Thus, the task to create an effective forecast model for alphabetic and/or phoneme
sequences, which is trained based on effective metaheuristics, is relevant today.
   The purpose of the work is to develop a forecast method for natural language con-
structions based on a modified gated recurrent block. To achieve the goal, the follow-
ing tasks were set and solved:
1. Create a model of a modified gated recursive block.
2. Select a criteria for evaluating the efficiency of the proposed model.
3. Develop a method for the parametric identification of a model based on local
   search.
4. Develop a method for parametric identification of a model based on a multi-agent
   metaheuristic search.
5. Conduct a numerical study.


2      Creating a model of a modified gated recursive block

The paper proposes a modification of GRU by introducing 1  rj  n   factor for the
weighted sum of the neurons outputs in the input layer, which allows to more accu-
rately control the share of information coming from the input layer and the hidden
layer.
    The proposed modified gated recurrent unit (MGRU) is a recurrent two-layered ar-
tificial neural network (ANN) with an input layer in , a hidden layer h , an output
layer out . Just as for a regular GRU, each neuron in hidden layer is associated with
reset and update gates is (FIR filters). The structural representation of the MGRU
model is shown in Fig. 1.
                                               r


                                               …


                                  in       h           out


                                  …            …       …


                                               …


                                               z

        Fig. 1. Structural representation of a modified gated recurrent block (MGRU)
Gates determine how much information to pass. Thereby, the following special cases
are possible. If the share of information passed by the reset gate is close to 0.5, and
the share of information passed by the update gate is close to 0, then we get SRN. If
the share of information passed by the reset gate is close to 0 and the share of infor-
mation passed by the update gate is close to 0, then the ANN information is updated
only due to the input (short-term) information. If the share of information passed by
the update gate is close to 1, then the ANN information is not updated. If the share of
information passed by the reset gate is close to 0 and the share of information passed
by the update gate is close to 1, then the ANN information is updated only due to
internal (long-term) information.

   The modified gated recurrent block (MGRU) model is presented in the following
form:
─ calculating the share of information passed by the reset gate

                             N                       N                     
                                 0                      1


               rj  n   f    ijin  r yiin  n    ijh  r yih  n  1  , j 1, N   ,
                                                                                             1

                             i 1                      i 1                    

─ calculating the share of information passed by the update gate

                              N                      N                     
                                 0                      1


               z j  n   f   uijin  z yiin  n    uijh  z yih  n  1  , j 1, N   ,
                                                                                             1

                              i 1                     i 1                    

─ calculating the output signal of the candidate layer

                                                                                        
                                  (0)                                (1)
                               N                                  N
    y hj (n)  g  (1  rj (n))  wijin  h y in (n  i)  rj (n)  wijh  h yih (n  1)  , j 1, N   ,
                                                                                                      1

                               i 1                              i 1                   

─ calculating the output signal of the hidden layer

                   y hj (n)  z j (n) y hj (n  1)  (1  z j (n)) y hj (n), j 1, N   ,
                                                                                         1


─ calculating the output signal of the output layer

                                       N   h out h 
                                               1
                                                                   2
                            j  
                          y out n  f   wij yi  n   , j 1, N ,
                                       i 1           

                                                          1
                                             f s             ,
                                                       1  e s

                                            g  s   tanh  s  ,

where N   is the number of neurons in the input layer;
           0
     N   is the number of neurons in the output layer;
         2


   N   is the number of neurons in the hidden layer;
         1


   ijin  r , uijin  z are connection weight from the i th input neuron to the reset gates and
update of the j th hidden neuron;
     ijh  r , uijh  z are connection weight from the i th hidden neuron to the reset gates and
update of the j th hidden neuron;
     wijin  h is connection weight from the i th input neuron to the j th hidden neuron;
     wijh  h is connection weight from the i th hidden neuron to the j th hidden neuron;
     wijh  out is connection weight from the i th hidden neuron to the output neuron;
     rj  n  is share of information passed by the reset gate of the j th hidden neuron at a
time n , rj  n    0, 1 ;
     z j  n  is share of information passed by the update gate j th of the hidden neuron
at a time n , z j  n    0, 1 ;
     yiin  n  is output of the i th input neuron at time n ;
     yiout  n  is output of the i th output neuron at time n ;
     y hj  n  is output of the j th hidden neuron at time n ;
     f   , g   are activation function.
   To evaluate the effectiveness of the proposed model, it is necessary to select crite-
rion.


3            Selection of criterion for evaluating the effectiveness of the
             proposed model

In the paper, to evaluate the parametric identification of the MGRU model, a model
adequacy criterion is chosen, which means the choice of such parameter values
                                                                                                                    
as W   ijin r  n  , uijin z  n  , wijin h  n  ,  ijh r  n  , uijh z  n  , wijh h  n  , wijhout  n  , which
deliver a minimum of the mean squared error (the difference between the model out-
put and the desired output):
                                                         P N 
                                                               2


                                                           y out
                                                                pi  d pi   min .
                                             1                               2
                                  F              2                          W
                                         PN             p 1 i 1


In the paper, to evaluate the functioning of the MGRU model in test mode, a forecast
probability criterion is selected, which means the choice of such parameter values
                                                                                                                     
as W   ijin r  n  , uijin  z  n  , wijin h  n  ,  ijh r  n  , uijh  z  n  , wijh h  n  , wijh out  n  , which
deliver the maximum probability:
                                                                     1, a  b
               F
                     1 P
                       
                     P p 1
                                      
                             round y out
                                      p           W
                                                               
                                          , d p  min ,   a, b   
                                                                     0, a  b
                                                                               ,


where round  is function that rounds a number to the nearest integer.
 According to the first criterion, the methods of parametric identification of the
MGRU model are proposed in this paper.


4        Creation of method for parametric identification of the
         MGRU model based on local search

In this paper, we first propose a method for parametric identification of the MGRU
model based on the traditional for GRU backpropagation through time (BPTT).
   The proposed method allows you to find a quasi-optimal vector of parameters’ val-
ues of the MGRU model and consists of the following blocks.
   Block 1 – Initialization:
─ set the current iteration number n to one;
─ initialization by uniform distribution over the interval  0, 1 or  0.5, 0.5 weights

     ijin  r  n  , uijin  z  n  , wijin  h  n  , i 1, N  0 , j 1, N 1 ,  ijh  r  n  , uijh  z  n  ,

    wijh  h  n  , i 1, N   , j 1, N   , wijh out  n  , i 1, N   , i 1, N   , where N   is
                                  1                  1                                             1              2       0


    the number of neurons in the input layer, N   is the number of neurons in the out-
                                                                                2


    put layer, N   is number of neurons in the hidden layer.
                      1


Block 2 – Setting the training set                            x , d  x  R , d  R  ,  1, P , where
                                                                              
                                                                                        N 
                                                                                          0

                                                                                                       
                                                                                                           N 
                                                                                                             2


x is  th training input vector, d  is  th training output vector, P is the power of
the training set. Number of the current pair from the training set   1 .
    Block 3 is Initial calculation of the output signal for the hidden layer hi  n  1  0 ,

i 1, N   .
           1


   Block 4 is Calculation of the output signal for each layer (forward propagation)

                                                                                    
                                          yiin  n   xi , rj  n   f s rj  n  ,         
                           N                                 N 
                             0                                      1


               s rj  n     ijin  r  n  yiin  n    ijh  r  n  yih  n  1 , j 1, N   ,
                                                                                                                      1

                           i 1                                 i 1


                                                                        
                                                   z j  n   f s zj  n  ,       
                      N                                N 
                        0                                  1


          s zj  n    uijin  z  n  yiin  n    uijh  z  n  yih  n  1 , j 1, N   ,
                                                                                                     1

                      i 1                               i 1


                                                                                              
                                    (0)                                        (1)
                                 N                                  N
        s hj (n)   (1  rj (n))  wijin  h (n) yiin (n)  rj (n)  wijh  h (n) yih (n  1)  ,
                                 i 1                              i 1                       

                                                      j 1, N   ,
                                                                       1


                                                                                        
            y hj  n   z j  n  y hj  n  1  1  z j  n   g s hj  n  , j 1, N   ,
                                                                                                 1


                                                                N 
                                                                  1


         y out         out
                            
           j  n  f s j  n , s  
                                   out
                                        n    wijhout  n  yih  n  , j 1, N   ,
                                                                i 1
                                                                                                         2


                                                    1
                                f s                    , g  s   tanh  s  .
                                                 1  e s

Block 5 – Calculation of ANN error energy
                                           2
                                   1N 2
                       E n            e j  n  , e j  n   yjL   n   d j .
                                   2 j 1

Block 6 – Setting up synaptic weights based on a generalized delta rule (back propa-
gation)

        wijh out  n  1  wijh out  n   wijh out  n  , i 1, N   , j 1, N   ,
                                                                                             1           2


          wijin  h  n  1  wijin h  n   wijin h  n  , i 1, N   , j 1, N   ,
                                                                                         0           1


           wijh  h  n  1  wijh h  n   wijh h  n  , i 1, N   , j 1, N   ,
                                                                                     1           1


           uijin  z  n  1  uijin  z  n   uijin  z  n  , i 1, N   , j 1, N   ,
                                                                                     0           1


            uijh  z  n  1  uijh  z  n   uijh  z  n  , i 1, N   , j 1, N   ,
                                                                                     1           1


           ijin r  n  1   ijin r  n    ijin r  n  , i 1, N  0 , j 1, N 1 ,

            ijh h  n  1   ijh h  n    ijh h  n  , i 1, N 1 , j 1, N 1 ,

where  is a parameter that determines the learning speed (for large  , learning is
faster, but the risk of getting the wrong decision increases), 0    1 .
                                    wijh out  n   yih  n   out
                                                                   j n ,


                             wijin  h  n   1  rj  n   yiin  n   jh  n  ,

                               wijh  h  n   rj  n  yih  n  1  hj  n  ,

                                     uijin  z  n   yiin  n   jz  n  ,

                                   uijh  z  n   yih  n  1  jz  n  ,

                                      ijin r  n   yiin  n   rj  n  ,

                                    ijh r  n   yih  n  1  rj  n  ,

                              j  n  f  s j  n   y j  n   d j  ,
                             out        out            out


                                                               N                             
                                                                   2


                 hj  n   g   s hj  n   1  z j  n     whjlout  n   lout  n   ,
                                                               l 1                            

                                                               N                           
                                                                     
                                                                              2


                                                         
          n   f  s zj  n  y hj  n  1  g s hj  n    whjlout  n   lout  n   ,
          z
          j
                                                               l 1                          

                                  N                                                            
                                       (1)
                                                                        M
         rj (n)  f ( s rj (n))   wijh  h (n) yih (n  1)   wijin  h (n) yiin (n)   hj (n) .
                                   i 1                               i 1                     

Block 7 – Verification of the termination condition
   If n mod P  0 , then increase the number of the training pair  by one, increase
the iteration number n by one, and go to block 4.
                          1 P
   If n mod P  0 and  E  n  P  s    , then increase the iteration number n
                          P s 1
by one and go to block 2.
                          1 P
   If n mod P  0 and  E  n  P  s    , then end.
                          P s 1
   To increase the probability of falling into a global extremum and the possibility of
parallel training of a neural network, the second method of parametric identification is
further proposed.
5        Creation of a method for parametric identification of the
         MGRU model based on multi-agent metaheuristic search

In this paper, we propose a method for parametric identification of the MGRU model
based on simulated annealing particle swarm optimization (SAPSO).
   The SAPSO method allows you to find a quasi-optimal vector of parameter values
for the MGRU model and consists of the following blocks.
   Block 1 – Initialization:
─ set the current iteration number n to one;
─ set the maximum number of iterations N ;
─ set swarm size K ;
─ set the dimension of the particle position M (corresponds to the number of the
  MGRU model parameters);
─ position initialization xk (corresponds to the parameters vector of the MGRU
  model)

             xk   xk1 ,   , xkM  , xij   xmax
                                               j    x min
                                                       j   U  0, 1  xmin
                                                                         j   , k 1, K ,

where x min
        j   , x max
                j   are minimum and maximum values, U  0, 1 is function that pro-
vides the calculation of a uniformly distributed random variable on a segment  0, 1 ;

─ initialization of a personal (local) best position xkbest

                                        xkbest  xk , k 1, K ;

─ speed initialization  k

                             k   k1 ,   ,  kM  ,  ij  0 , k 1, K ;

─ create an initial swarm of particles

                                       Q    x , x ,   ;
                                                 k
                                                     best
                                                     k      k


─ determine the particle from the current population with the best position (corre-
  sponds to the best parameters vector of the MGRU model in target function)

                                 k *  arg min F  xk  , x*  xk * .
                                            k1, K


Block 2 – Modification of the velocity of each particle using simulated annealing

    r1k   r1k1 ,   , r1kM  , r1kj  U  0, 1 , C  0, 1 , N  0, 1 , k 1, K , j 1, M ,
 r 2k   r 2k1 ,   , r 2kM  , r 2kj  U  0, 1 , C  0, 1 , N  0, 1 , k 1, K , j 1, M ,

       k  w  n  k  1  n   xkbest  xk   r1    2  n   x*  xk   r2  , k 1, K ,
                                                    T                             T


           1  n    2  n     0 exp  1 T  n   ,   0  0  0.5  ln 2 ,


                    w  n   w  0  exp  1 T  n   , w  0   w0 
                                                                               1
                                                                                   ,
                                                                            2 ln 2

                                                                      1               N
                                                                  
                T  n    T  n  1 , T  0   T0 ,   N N 1 , T0  N N 1 ,

where N  0, 1 is function that provides the calculation of a random variable from
standard normal distribution,
   C  0, 1 is function that provides the calculation of a random variable from stand-
ard Cauchy distribution,
   1  n  is parameter controlling the contribution of the component  xkbest  xk   r1 
                                                                                                      T


to the particle’s velocity at the iteration n ,
    2  n  is parameter controlling the contribution of the component  x*  xk   r2 
                                                                                                      T


to the particle’s velocity at the iteration n ,
    w  n  is parameter controlling the contribution of the particle’s velocity at the iter-
ation n  1 to the particle’s velocity at the iteration n ,
     0 is initial value of 1  n  and  2  n  parameters,
   w0 is initial value of w  n  parameter,
   T  n  is annealing temperature at the iteration n ,
   T0 is initial annealing temperature,
    is parameter that controls the annealing temperature.
   The simulated annealing introduced in this work allow us to establish the inverse
correlation between the parameters 1  n  ,  2  n  , w  n  and the iteration number,
i.e. in the first iterations, the search is global, and in the last iterations, the search be-
comes local. In addition, in this work, a direct correlation between the parameters T0 ,
  and the iteration number is established, which allows for automated selection of
these parameters.
                                                                     1
    The choice of initial values of 0  0.5  ln 2 and w0               is standard and sat-
                                                                  2 ln 2
                                                               1
isfies the conditions of a swarm convergence w  1 and w0       1   2   1 .
                                                               2
   Block 3 – Modification of each particle’s position, considering the limitations
                               xk  xk  k , k 1, K ,

                       x min
                            j , xkj  x min
                                        j
                      
                                     
                xkj   xkj , xkj  x j , x max
                                       min
                                            j         
                                                , k 1, K , j 1, M ,
                        max
                        x j , xkj  x j
                                        max


                                    
                       kj ,  kj  x min  max
                                       j , xj         
                kj                           , k 1, K , j 1, M .
                                
                      0,  kj  x j , x j
                                     min   max
                                                 
Block 4 – Determining the personal (local) best position of each particle
                      
  If F  xk   F xkbest , then xkbest  xk , k 1, K .
   Block 5 – Determining the particle from the current population with the best posi-
tion

                                k *  arg min F  xk  .
                                             k1, K


Block 6 – Determining the global best position
                  
  If F  xk *   F x* , then x*  xk .  *


   Block 7 – Stop Condition
   If n  N , then increase the iteration number n by 1 and go to block 2.
   The proposed method is intended for implementation through a multi-agent sys-
tem.


6      Numerical study

Table 2 presents the used lexemes and their categories, moreover the case when there
is no lexeme is taken into account too.

                           Table 2. Lexemes and their categories

  Lexeme category             Lexeme                   Lexeme category          Lexeme
E                      empty lexeme                   V-INTR             think, sleep
N-H                    man, woman                     V-TR               see, chose
N-AN                   cat, mouse                     V-AGP              move, break
N-IN                   book, rack                     V-P                smell, see
N-AGR                  dragon, monster                V-D                break, smash
N-FR                   plate, glass                   V-EAT              eat
N-F                    break, cookie
   Before starting the parametric identification of the MGRU model, each letter of
each lexeme is encoded. Table 3 presents the codes for each letter of the English al-
phabet, as well as a space.
                                      Table 3. Letter codes

  Letter     Code       Letter          Code         Letter     Code     Letter     Code
Space      00000      g               00111      n            01110    u          10101
a          00001      h               01000      o            01111    v          10110
b          00010      i               01001      p            10000    w          10111
c          00011      j               01010      q            10001    x          11000
d          00100      k               01011      r            10010    y          11001
e          00101      l               01100      s            10011    z          11010
f          00110      m               01101      t            10100
  In this work, a neural network forecast of the third lexeme by the previous two lex-
emes was carried out. Lexemes combination patterns are presented in table 4.

                           Table 4. Lexemes combination templates

   First lexeme category              Second lexeme category        Third lexeme category
N-H                                 V-EAT                         N-F
N-H                                 V-P                           N-IN
N-H                                 V-D                           N-FR
N-H                                 V-INTR                        E
N-H                                 V-TR                          N-H
N-H                                 V-AGP                         N-IN
N-H                                 V-AGP                         E
N-AN                                V-EAT                         N-F
N-AN                                V-TR                          N-AN
N-AN                                V-AGP                         N-IN
N-AN                                V-AGP                         E
N-IN                                V-AGP                         E
N-AGR                               V-D                           N-FR
N-AGR                               V-EAT                         N-H
N-AGR                               V-EAT                         N-AN
N-AGR                               V-EAT                         N-F
   The number of neurons in the input layer is calculated as

                       N    2  MaxLenLexem  LenCode ,
                            0


where LenCode is code length of one letter (according to table 3, LenCode  5 ),
    MaxLenLexem is maximum lexeme length (according to table 2
 MaxLenLexem  7 ).
   If the lexeme is shorter than MaxLenLexem , it is completed with spaces on the
right. An empty lexeme consists of only spaces.
   The number of neurons in the output layer is calculated as

                           N    MaxLenLexem  LenCode .
                                2
The parametric identification of the MGRU model was carried out for 10,000 training
implementations based on the proposed multi-agent metaheuristic search.
    Table 5 presents the forecast probabilities obtained for 1000 test implementations
based on the proposed MGRU model and artificial neural networks’ traditional
models.
    Table 6 presents the number of parameters (link weights) for the proposed MGRU
model and artificial neural networks’ traditional models, which is directly proportion-
al to the computational complexity of parametric identification.

                                      Table 5. Forecast probability

              Network                       ENN                                           complete
                                JNN                     BRNN                                                              GRU        MGRU
Criterion                                  (SRN)                                           LSTM
Forecast probability            0.8         0.85                 0.9                        0.95                          0.93       0.95

                                  Table 6. Number of parameters

                Criterion
                                                    Number of parameters
Network
JNN
                                                             1
                                                                     
                                                       N   N   2 N 
                                                                                  0                    2
                                                                                                            
ENN (SRN)                                          N   1
                                                              N         0
                                                                                  N       1
                                                                                                 N         2
                                                                                                                  
BRNN                                                                 
                                                 2 N   N   N   N 
                                                             1                0              1                  2
                                                                                                                      
complete LSTM
                                                                 1
                                                                         
                                  N   N   M  N   N   N   N   N   M
                                       0     1                                        0            1                  2          1
                                                                                                                                     
GRU                                              3 N    1
                                                             N N 
                                                                  N          0            1              2


                                                 3 N     N    N    N   
                                                         1                    0              1                  2
MGRU
  For a complete LSTM, GRU, MGRU, the number of neurons in the hidden layer is
calculated as N    N   .
                   1        0

   For JNN, ENN (SRN), BRNN, the number of neurons in the hidden layer is calcu-
lated as N    2  N   .
            1          0

   For LSTM, the number of memory cells is set as M  1 .
   According to tables 5-6, MGRU and complete LSTM give the best forecast proba-
bility results, but MGRU, unlike LSTM, has fewer parameters, i.e. less computational
complexity.
   The increase in accuracy of MGRU prediction was made possible by introducing a
multiplier (1  rj (n)) for the weighted sum of the neurons outputs in the input layer.
This allows to more accurately control the share of information coming from the input
layer and the hidden layer, as well as by using the metaheuristic determination of the
MGRU models parameters.
   The limitations of this work include the MGRUs full connectivity, the requirement
for more parameters than in ENN (SRN), MGRU testing only on trigrams.
   Like the BERT system, the proposed MGRU can work with context-free and con-
text-sensitive grammars, but unlike BERT, it can be used not only for English.
   The practical contribution of this work consists in the fact that it allows to predict
alphabetic and / or phoneme sequences through an artificial neural network, the train-
ing of which is based on the proposed metaheuristics, which allows to increase the
accuracy of the forecast and can be used as an intermediate stage in the speech under-
standing system.


7      Conclusions

1. To solve the problem of insufficient quality of the natural language sequences
   analysis, the corresponding neural network forecast methods were studied. To in-
   crease the efficiency of training neural networks, metaheuristic methods were stud-
   ied.
2. The created model of the modified gated recurrent block allows for more precise
   control of the share of information coming from the input layer and the hidden lay-
   er, which increases the forecast accuracy.
3. The created method of parametric identification of the MGRU model based on
   simulated annealing particle swarm optimization reduces the probability of getting
   into local extremum and replaces batch training with multi-agent training, which
   increases the forecast probability and the training speed.
4. The proposed method for predicting natural language constructions based on a
   modified gated recurrent block can be used in various intelligent natural language
   processing systems.

References
 1. Dominey, P.F., Hoen, M., Inui, T.: A Neurolinguistic Model of Grammatical Construction
    Processing. Journal of Cognitive Neuroscience. 18 (12), 2088–2107 (2006). doi:
    10.1162/jocn.2006.18.12.2088
 2. Khairova, N., Sharonova, N.: Modeling a Logical Network of Relations of Semantic Items
    in Superphrasal Unities. In: Proc. of the EWDTS. pp. 360–365. Sevastopol (2011).
    doi:10.1109/EWDTS.2011.6116585
 3. Lyubchyk, L., Bodyansky, E., Rivtis, A.: Adaptive Harmonic Components Detection and
    Forecasting in Wave Non-Periodic Time Series using Neural Networks. In: Proc. of the
    ISCDMCI’2002. pp. 433–435. Evpatoria (2002).
 4. Du, K.-L., Swamy, K.M.S.: Neural Networks and Statistical Learning. Springer-Verlag,
    London (2014). doi: 10.1007/978-1-4471-5571-3
 5. Haykin, S.: Neural Networks. Pearson Education, NY (1999).
 6. Sivanandam, S.N., Sumathi, S., Deepa, S.N.: Introduction to Neural Networks using
    Matlab 6.0. The McGraw-Hill Comp., Inc., New Delhi (2006).
 7. Jordan, M.I.: Attractor Dynamics and Parallelism in a Connectionist Sequential Machine.
    In: Proc. of the Ninth Annual Conference of the Cognitive Science Society. pp. 531–546.
    Hillsdale, NJ (1986).
 8. Jordan, M., Rumelhart, D.: Forward Models: Supervised Learning with a Distal. Cognitive
    Science. 16, 307–354 (1992). doi: 10.1016/0364-0213(92)90036-T
 9. Zhang, Z., Tang, Z., Vairappan, C.: A Novel Learning Method for Elman Neural Network
    using Local Search. Neural Information Processing – Letters and Reviews. 11 (8), 181–
    188 (2007).
10. Wiles, J., Elman, J.: Learning to Count without a Counter: a Case Study of Dynamics and
    Activation Landscapes in Recurrent Networks. In: Proc. of the Seventeenth Annual Con-
    ference of the Cognitive Science Society. pp. 1200–1205. Cambridge, MA (1995).
11. Schuster, M., Paliwal, K.K.: Bidirectional Recurrent Neural Networks. IEEE Transactions
    on Signal Processing. 45 (11), 2673–2681 (1997). doi:10.1109/78.650093
12. Baldi, P., Brunak, S., Frasconi, P. Soda, G., Pollastri, G.: Exploiting the Past and the Fu-
    ture in Protein Secondary Structure Prediction. Bioinformatics. 15 (11), 937–946 (1999).
    doi: 10.1093/bioinformatics/15.11.937
13. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Technical Report FKI-207-95,
    Fakultat      fur      Informatik,     Technische        Universitat      Munchen.       doi:
    10.1162/neco.1997.9.8.1735
14. Gers, F.: Long Short-Term Memory in Recurrent Neural Networks. PhD thesis, Ecole Pol-
    ytechnique Federale de Lausanne.
15. Cho, K., Merrienboer, van B., Gulcehre, C., Bougares, F., Schwenk, H., Bengio, Y.:
    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine
    Translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural
    Language Processing (EMNLP). pp. 1724–1734. Qatar, Doha (2014). doi:
    10.3115/v1/D14-1179
16. Dey, R., Salem, F.M.: Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks.
    (2017). arXiv: 1701.05923
17. Jaeger, H.: A Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL,
    EKF and the “Echo State Network” Approach. GMD Report 169, Fraunhofer Institute AIS
    (2002).
18. Jaeger, H., Lukosevicius, M., Popovici, D., Siewert, U.: Optimization and Applications of
    Echo State Networks with Leakyintegrator Neurons. Neural Networks. 20 (3), 335–352
    (2007). doi: 10.1016/j.neunet.2007.04.016
19. Maass, W., Natschläger, T., Markram, H.: Real-Time Computing without Stable States: a
    New Framework for Neural Computation Based on Perturbations. Neural Computation.
    14 (11), 2531–2560 (2002). doi: 10.1162/089976602760407955
20. Jaeger, H., Maass, W., Prıncipe, J.C.: Special Issue on Echo State Networks and Liquid
    State Machines. Editorial. Neural Networks. 20 (3), 287–289 (2007). doi:
    10.4249/scholarpedia.2330
21. Talbi, El-G.: Metaheuristics: from Design to Implementation. Wiley & Sons, Hoboken,
    New Jersey (2009).
22. Engelbrecht, A.P.: Computational Intelligence: an Introduction. Wiley & Sons, Chichester,
    West Sussex (2007).
23. Zbigniew, J.C.: Three Parallel Algorithms for Simulated Annealing. In: Proceedings of the
    4th International Conference on Parallel Processing and Applied Mathematics-Revised Pa-
    pers (PPAM’01). pp. 210–217. Springer-Verlag, Berlin, Heidelberg (2001).
24. Loshchilov, I.: CMA-ES with Restarts for Solving CEC 2013 Benchmark Problems. In:
    Proceedings of the IEEE Congress on Evolutionary Computation (CEC’2013). pp. 369–
    376. Cancun, Mexico (2013).
25. Byrne, J., Hemberg, E., Brabazon, A., O’Neill, M.: A Local Search Interface for Interac-
    tive Evolutionary Architectural Design. In: Proceedings of the International Conference on
    Evolutionary and Biologically Inspired Music and Art (Evo-MUSART’2012). pp. 23–34.
    Springer, Berlin, Heidelberg (2012).
26. Yuen, S.Y., Chow C. K.: A Genetic Algorithm that Adaptively Mutates and Never Revis-
    its. IEEE Transactions on Evolutionary Computation. 13 (2), 454-472 (2009). doi:
    10.1109/TEVC.2008.2003008
27. Ventresca, M., Tizhoosh, H.R.: Simulated Annealing with Opposite Neighbors. In: Pro-
    ceedings of the 2007 IEEE Symposium on Foundations of Computational Intelligence.
    pp. 186–192. Honolulu, HI, 2007. doi: 10.1109/FOCI.2007.372167
28. Wang, H., Li, H., Liu, Y., Li, Ch., Zeng, S.: Opposition-based Particle Swarm Algorithm
    with Cauchy Mutation. In: Proceedings of the 2007 IEEE Congress on Evolutionary Com-
    putation. pp. 4750-4756. Singapore (2007). doi: 10.1109/CEC.2007.4425095
29. Grygor, O.O., Fedorov, E.E., Utkina, T.Yu., Lukashenko, A.G., Rudakov, K.S., Hard-
    er, D.A., Lukashenko, V.M.: Optimization Method Based on the Synthesis of Clonal Se-
    lection and Annealing Simulation Algorithms. Radio Electronics, Computer Science, Con-
    trol. 2, 90–99 (2019). doi: 10.15588/1607-3274-2019-2-10
30. Rasdi, M.R.H.M., Musirin, I., Hamid, Z.A., Haris H.C.M.: Gravitational Search Algorithm
    Application in Optimal Allocation and Sizing of Multi Distributed Generation. In Proceed-
    ings of the 2014 IEEE 8th International Power Engineering and Optimization Conference
    (PEOCO’2014).          pp. 364–368.        Langkawi,      Malaysia       (2014).      doi:
    10.1109/PEOCO.2014.6814455
31. Radosavljević, J., Jevtić, M., Klimenta, D.: Energy and Operation Management of a Mi-
    crogrid using Particle Swarm Optimization. Engineering Optimization, 48 (5), 811–830
    (2016). doi: 10.1080/0305215X.2015.1057135
32. Alinejad-Beromi, Y., Sedighizadeh, M., Sadighi, M.: A Particle Swarm Optimization for
    Sitting and Sizing of Distributed Generation in Distribution Network to Improve Voltage
    Profile and Reduce THD and Losses. In: Proceedings of the 2008 43rd International Uni-
    versities Power Engineering Conference. pp. 1-5. Padova, Italy (2008). doi:
    10.1109/UPEC.2008.4651544
33. Petrović, M., Petronijević, J., Mitić, M., Vuković, N., Miljković, Z., Babić, B.: The Ant
    Lion Optimization Algorithm for Integrated Process Planning and Scheduling. Applied
    Mechanics         and         Materials.      834,      187–192         (2016).       doi:
    10.4028/www.scientific.net/amm.834.187
34. Gandomi, A.H., Alavi, A.H.: Krill Herd: a New Bio-inspired Optimization Algorithm.
    Communications in Nonlinear Science and Numerical Simulation. 17 (12), 4831–4845
    (2012). doi: 10.1016/j.cnsns.2012.05.010
35. Balochian, S., Ebrahimi, E.: Parameter Optimization via Cuckoo Optimization Algorithm
    of Fuzzy Controller for Liquid Level Control. Journal of Engineering. Hindawi Publishing
    Corporation (2013). doi: 10.1155/2013/982354