Development of the heuristic method of evolutionary strategies for reinforcement learning problems solving

Maksim Naumov, Aleksandr Blagov
Samara National Research University, Samara, Russia
blagov@ssau.ru

Abstract—There are many methods for solving machine learning problems, and each of them is chosen depending on the problem at hand. This article describes several classical algorithms of evolutionary strategies for reinforcement learning problems. The authors propose a heuristic method that has certain advantages over the existing ones. The article also provides a comparative analysis of the solutions obtained, including solutions to the problem of error-correcting coding.

Keywords—reinforcement learning, machine learning, data mining, heuristic method

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

I. INTRODUCTION

The article describes the research and development of methods of heuristic evolutionary strategies for solving the problem of training recurrent neural networks. Artificial neural networks are one of the most popular methods of machine learning. They represent a mathematical model, as well as its software or hardware implementation, built on the principle of the organization and functioning of biological neural networks [1]. The formation and training of neural networks can be reduced to an optimization problem. The authors propose an improvement of the simplest implementation of evolutionary strategies, using heuristic methods, for training recurrent neural networks in the problem of error-correcting coding.

The practical value of improving the existing algorithms lies in expanding the range of initial approximations from which the method converges to a solution. It can also affect the rate of convergence.

An evolutionary strategy is an optimization method based on the ideas of evolution [2]. The relevance of such methods is due to their high efficiency in solving optimization problems [3, 4]. Possible solutions are encoded as vectors of real numbers. This method has been successfully used to solve reinforcement learning problems [4-6].

Evolutionary strategies have the following features:

1. They are able to optimize any model that can be expressed through vectors of real numbers, regardless of the internals of the optimized model.
2. No calculation of the gradient of the parameters of the optimized model is required.
3. Back propagation of gradients is not required, which avoids many of the problems that arise when gradients are multiplied.
4. They are invariant with respect to the delay before the response of the system.
5. They are invariant to transformations of the space of solutions that preserve angles.
6. They are invariant to transformations of the objective function.
7. They allow a very high degree of concurrency, due to the very small amount of data sent between workers.

There are also approaches that use evolutionary strategies not for optimization itself, but for the numerical estimation of the gradient of the optimized function. This makes it possible to use higher-order and more accurate methods without having to compute the gradient analytically.

II. TRADITIONAL METHODS OF EVOLUTIONARY STRATEGIES

The main cycle of an evolutionary strategy consists of two stages: mutation and selection. Let us consider these stages on the example of maximizing some function f(x) of a vector argument:

1. Choose the initial solution and make it the current one.
2. Mutation step: from the current solution, create μ new ones by adding to the current solution random vectors drawn from a multidimensional normal distribution with expectation equal to the current solution and covariance matrix σI.
3. Calculate the value of the function f at each of the possible solutions.
4. Selection step: take the best possible solution as the new current one (or keep the current one if it is better), or take a weighted average of the possible solutions, in which solutions with larger values of f receive larger weight.
5. Repeat steps 2-4 until an acceptable solution is obtained.

In step 4, the following formula is used:

x̂ = (1/(μσ)) · Σ_{k=1..μ} [(f(x_k) − f̂)/f̃] · x_k,   (1)

where x̂ is the averaged solution, x_k is the possible solution with number k, f̂ is the sample mean of f(x_k), and f̃ is the sample variance of f(x_k).
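To make the procedure concrete, below is a minimal Python sketch of this simple strategy with the "take the best candidate" selection branch (the weighted-average branch of formula (1) is omitted); the objective f, the population size mu, the step sigma, and the iteration budget are illustrative assumptions rather than values from the paper:

```python
import numpy as np

def simple_es(f, x0, mu=50, sigma=0.1, iterations=300):
    """Maximize f with the simple evolutionary strategy (steps 1-5 above)."""
    x = np.asarray(x0, dtype=float)   # step 1: the initial solution is current
    fx = f(x)
    for _ in range(iterations):
        # Step 2 (mutation): mu candidates from a normal distribution around x.
        candidates = x + sigma * np.random.randn(mu, x.size)
        # Step 3: evaluate the objective at every candidate.
        values = np.array([f(c) for c in candidates])
        # Step 4 (selection): keep the best candidate if it beats the current one.
        best = int(np.argmax(values))
        if values[best] > fx:
            x, fx = candidates[best], values[best]
    return x

# Example: maximize a concave quadratic; the optimum is the all-ones vector.
solution = simple_es(lambda v: -np.sum((v - 1.0) ** 2), x0=np.zeros(5))
```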
This algorithm does not cover all possible evolutionary strategies. However, it shows their distinguishing features:

• the solution is represented as a vector of real numbers;
• at each step, one (or several) key solution is selected, which is used in the next step as the basis for new solutions;
• selection takes place in a deterministic way, which means that it can be performed on a cluster without sending data between nodes;
• the algorithm is relatively simple, which is reflected in its speed.

The high speed and large degree of parallelism of the algorithm allow for a greater number of iterations compared to other methods that solve reinforcement learning problems.

The above algorithm is also called a simple evolutionary strategy, because it does not change the parameter σ depending on the data already collected. This impairs convergence, or even makes finding a solution impossible. This algorithm was implemented in the present paper; the computational complexity of one of its iterations is O(N) with a small constant, where N is the task dimension.

There are many algorithms that address the problem of the constant parameter σ. This article discusses the method called CMA-ES (covariance matrix adaptation evolution strategy) [7], as the most popular of them. As an implementation, the Python library Pycma is used. Consider a simplified version of this algorithm (reflecting its main ideas) for a problem of dimension N:

1. Choose the number of possible solutions at each step, μ; typically a value greater than four is selected. Choose an initial solution and make it the current one (m). Choose the initial parameter vector σ of length N, which is responsible for the step length in each direction. Set the covariance matrix C equal to the identity matrix of dimension N × N.
2. Generate μ random vectors x_k, corresponding to possible solutions. The generation comes from a multidimensional normal distribution with covariance matrix C and mathematical expectation m.
3. Calculate the value of the function f at each possible solution.
4. Sort the possible solutions by the value of the function f.
5. Update the covariance matrix C, taking as a basis the distribution of the best possible solutions:

C_ij = (1/N_best) · Σ_{k=1..N_best} (x_{k,i} − x̂_i)(x_{k,j} − x̂_j),   (2)

where x_k is the k-th random vector, N_best is the number of best solutions taken into account, and x̂ is the average of the N_best vectors corresponding to the best possible solutions.

6. Update the value of the current solution m, setting it to the average of the best possible solutions:

m = x̂ = (1/N_best) · Σ_{k=1..N_best} x_k.   (3)

7. Repeat steps 2-6 until an acceptable solution is obtained.

This algorithm performs steps 2-6 in O(N²) operations (due to the fact that N_best is a constant), where N is the task dimension. There are approaches that reduce the complexity to O(N), at the cost of a large constant. This article uses the most complete and correct implementation of the algorithm, provided by the Pycma library for Python.
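For reference, Pycma exposes an ask-and-tell interface; a minimal sketch of its use might look as follows (the toy sphere objective, the starting point, and the initial step size are assumptions; note that Pycma minimizes, so a function to be maximized must be negated):

```python
import cma  # the Pycma library: pip install cma

def objective(x):  # toy sphere function; replace with the real objective
    return sum(xi ** 2 for xi in x)

es = cma.CMAEvolutionStrategy(8 * [0.5], 0.3)  # initial solution and step size
while not es.stop():
    candidates = es.ask()                      # steps 2-3: sample candidates
    es.tell(candidates, [objective(c) for c in candidates])  # steps 4-6: update
print(es.result.xbest)                         # best solution found so far
```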
III. DEVELOPMENT AND IMPLEMENTATION OF THE METHOD OF HEURISTIC EVOLUTIONARY STRATEGIES

The above algorithms were either inaccurate (the simple evolutionary strategy) or slow (adaptive evolutionary strategies). Is it possible to build an algorithm with computational complexity O(N) and a small constant that is, at the same time, more accurate than the simple implementation of evolutionary strategies?

Consider the following modification of the simple evolutionary strategy (a code sketch is given at the end of this section):

1. Choose the initial solution and make it the current one.
2. Mutation step: from the current solution, create μ new ones by adding to the current solution random vectors drawn from a multidimensional normal distribution with mathematical expectation equal to the current solution and covariance matrix σI.
3. Calculate the value of the function f at each of the possible solutions.
4. Selection step: take the best possible solution as the new current one (or keep the current one if it is better), or take a weighted average of the possible solutions.
5. Calculate the new value of σ using one of the heuristic functions, which are considered below.
6. Repeat steps 2-5 until an acceptable solution is obtained.

At step 5, instead of recalculating the single parameter σ, one could recalculate the entire covariance matrix. However, this would increase the computational complexity of the algorithm from O(N) to O(N²).

By a heuristic we understand the totality of techniques and methods that facilitate and simplify the solution of cognitive, constructive and practical problems. In this article, heuristic functions are understood as functions that improve the accuracy of the calculations without increasing their asymptotic complexity.

Consider some examples of such functions:

h1(fx, i, N) = 1/(1 + e^(−fx));   (4)
h2(fx, i, N) = arctan(fx) + π/2;   (5)
h3(fx, i, N) = (N − i)/N;   (6)
h4(fx, i, N) = 1 − sin²(10π · i/N),   (7)

where (4)-(7) are heuristic functions; fx is the value of the optimized function at the point of the current solution; i is the number of the current iteration; N is the limit on the number of iterations.

Function (6) implements an idea similar to temperature in the simulated annealing method [8]. At the beginning of the optimization, the algorithm considers more distant points, and then it gradually "cools down" and proceeds to a more accurate search for a solution. This allows the optimization process to move toward more promising solutions at the very beginning and then refine them.

The heuristic function (7) is a rather interesting case: it goes through several full periods during the optimization. This allows the optimization process to escape local optima, speeding up the search for a solution.

Not all heuristic functions are similar to those listed; however, this list gives a starting point for finding a suitable kind of function. From the above functions, one can also form a linear combination with positive coefficients and obtain a new heuristic function.
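The sketch below illustrates the heuristic strategy; exactly how σ enters the update is not fully specified in the text, so rescaling a base step σ₀ by the heuristic, as well as all numeric defaults, are assumptions for illustration. Each iteration costs O(μ·N), i.e. O(N) for fixed μ, matching the complexity claim above:

```python
import numpy as np

# Heuristic step-size functions (4)-(7): fx is the current objective value,
# i the current iteration number, n the limit on the number of iterations.
def h1(fx, i, n): return 1.0 / (1.0 + np.exp(-fx))
def h2(fx, i, n): return np.arctan(fx) + np.pi / 2
def h3(fx, i, n): return (n - i) / n
def h4(fx, i, n): return 1.0 - np.sin(10 * np.pi * i / n) ** 2

def heuristic_es(f, x0, h=h3, mu=50, sigma0=0.1, iterations=300):
    """Simple ES whose step size is rescaled by a heuristic function each step."""
    x = np.asarray(x0, dtype=float)
    fx, sigma = f(x), sigma0
    for i in range(iterations):
        candidates = x + sigma * np.random.randn(mu, x.size)  # step 2: mutation
        values = np.array([f(c) for c in candidates])         # step 3: evaluate
        best = int(np.argmax(values))                         # step 4: selection
        if values[best] > fx:
            x, fx = candidates[best], values[best]
        sigma = sigma0 * h(fx, i + 1, iterations)             # step 5: heuristic
    return x
```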
IV. EXPERIMENTAL COMPARISON OF IMPLEMENTED METHODS

As the problem on which the implementations of the above algorithms are compared, the problem of error-correcting coding was taken.

One of the most famous cyclic codes is the (7, 4, 3) code. This code converts messages of k = 4 bits into code words of length n = 7, using for this a generating polynomial of degree r = 3. The code makes it possible to correct a single error, or to detect double errors.

If we assume that the cyclic code (7, 4, 3) is used to correct single errors, then the probability of correctly decoding a message is

(1 − p)^7 + 7p(1 − p)^6.   (8)

A combination of two LSTM networks was taken as the optimized model. The first network encodes 4-bit messages into 7-bit messages; the second performs the inverse transformation. This choice of architecture has several features:

1. The authors gave the example of the working cyclic code (7, 4, 3) above, and the possibility of solving this problem with this architecture was thereby confirmed: theoretically, a system of two neural networks can, after the optimization process, reproduce a cyclic code.
2. Due to the use of LSTM nodes at the core of the architecture, the system has some internal memory. This can positively affect the quality of encoding and decoding (if there is some relationship between messages).

The neural networks were implemented in PyTorch, which made it possible to significantly speed up the calculations, as well as to simplify the source code. As the optimized function, we take the mean squared error between the original and the decoded message.
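The paper does not give the exact network configuration, so the following PyTorch sketch should be read only as one plausible shape of such a model; the hidden size, the linear read-out layer, and the absence of a noisy channel between the two networks are all assumptions:

```python
import torch
import torch.nn as nn

class Coder(nn.Module):
    """An LSTM that maps a bit sequence of one length to bits of another length."""
    def __init__(self, in_len, out_len, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_len)

    def forward(self, bits):                       # bits: (batch, in_len)
        _, (h, _) = self.lstm(bits.unsqueeze(-1))  # read bits one per time step
        return torch.sigmoid(self.head(h[-1]))     # (batch, out_len) in [0, 1]

encoder, decoder = Coder(4, 7), Coder(7, 4)
message = torch.randint(0, 2, (16, 4)).float()    # a batch of 4-bit messages
restored = decoder(encoder(message))              # encode to 7 bits, decode back
loss = nn.functional.mse_loss(restored, message)  # the optimized MSE objective
```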
For further comparison, we calculate an upper estimate of the mean squared error for the cyclic code (7, 4, 3). To do this, suppose that when an uncorrectable error occurs, all 4 bits of the message are decoded incorrectly. Using (8), we obtain the following value for the expected number of erroneous bits among n transmitted bits:

n · (1 − (1 − p)^7 − 7p(1 − p)^6).   (9)

For the experimental evaluation, p = 0.01 and n = 1000. Substituting these values into (9), we obtain an expected number of erroneous bits of approximately 2, which leads to an estimate of the mean squared error for the cyclic code (7, 4, 3) of 0.002.

The plan of the experiment is as follows:

1. vary the maximum number of iterations of the optimization process, N, from 100 to 300 in increments of 100;
2. carry out a significant number of runs (more than 30) of each implementation on the problem;
3. write out the average values of the simulation results and of the optimization time.

V. COMPARATIVE ANALYSIS OF THE RESULTS

The data obtained as a result of the experiment are shown in Table 1. The table reports, as results, the average values of the mean squared error obtained after the optimization; in other words, the smaller the result, the better the algorithm works.

TABLE I. OPTIMIZATION VALUES FOR DIFFERENT NUMBERS OF ITERATIONS

Number of     | Simple ES         | Heuristic ES      | CMA-ES
iterations, N | result | time (s) | result | time (s) | result | time (s)
100           | 0.2582 | 101.5    | 0.2477 | 101.8    | 0.2334 | 373.6
200           | 0.2414 | 204.1    | 0.2303 | 210.7    | 0.2171 | 739.7
300           | 0.2485 | 293.7    | 0.2273 | 290.4    | 0.2103 | 1128.3

As can be seen from Table 1, the evolutionary strategies solved the problem, but with a convergence rate not applicable to practical problems. The problem posed belongs to the class of supervised learning, and it is known that the speed of evolutionary strategies on such problems is not high [1].

There is an assumption that such slow convergence could indicate that:

1. the network architecture and/or the number of nodes were not suitable for solving this problem;
2. the task was quite complex, and most of the possible solutions were bad, which did not allow the algorithms to find a relatively good solution in an acceptable time;
3. the task was unstable, in the sense that a small change in a good solution dramatically worsens the result.

We considered each hypothesis separately.

The results of solving the problem using the Adam method [9] are shown in Table 2. As before, the average value of the mean squared error is used as the result.

TABLE II. OPTIMIZATION VALUES FOR DIFFERENT N OBTAINED USING ADAM

N   | Result | Time (s)
100 | 0.2415 | 174.7
200 | 0.1383 | 334.5
300 | 0.1027 | 515.1
600 | 0.0094 | 1036.8

Table 2 shows that this model can solve the problem, though not as well as the cyclic code. It is worth noting that the error obtained using Adam is almost two orders of magnitude smaller than the error of the evolutionary strategies.

Consider the second hypothesis. We randomly selected 500,000 solutions; each network weight was drawn from a normal distribution with zero mean and standard deviation 1.5. This covered most of the space of reachable solutions, since the optimization started from the zero point and takes no more than 300 steps of length of the order of 10^(−3). The obtained values gave a mean of 0.3086 and a standard deviation of 0.0207. This suggests that most of the possible solutions are indeed unsatisfactory.

To consider the third hypothesis, we took the solution obtained at step 600 of the optimization by the Adam method and added to it Gaussian noise with zero mathematical expectation and a standard deviation of 10^(−3). This allowed us to check how stable the solution was. After generating 500,000 solutions from this neighborhood, we obtained an expected value of 0.1622 and a standard deviation of 0.01526, which allowed us to conclude that the obtained solution is unstable.
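The perturbation test for the third hypothesis can be sketched as follows; here `evaluate` and `reference` are hypothetical stand-ins for the trained model's loss function and its flattened weight vector:

```python
import numpy as np

def stability_check(evaluate, reference, samples=500_000, noise_std=1e-3, seed=0):
    """Perturb a reference solution with Gaussian noise and measure the spread
    of the objective, as in the test of the third hypothesis."""
    rng = np.random.default_rng(seed)
    values = np.empty(samples)
    for s in range(samples):
        noise = rng.normal(0.0, noise_std, size=reference.shape)
        values[s] = evaluate(reference + noise)
    # A mean far above the unperturbed loss signals an unstable solution.
    return values.mean(), values.std()
```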
Thus, based on the data obtained, the following conclusion was made. Evolutionary strategies have an impressive list of advantages; however, if it is possible to calculate the gradient of the optimized function and the task is unstable, then it is reasonable to consider first- or second-order methods. On the other hand, evolutionary strategies make it possible to use the computing resources of an entire cluster without significant overhead; this can even out the difference in speed, and sometimes even let them surpass the classical methods.

VI. CONCLUSION

In the course of this work, methods of evolutionary strategies and their application to the solution of the problem of error-correcting coding were considered. Several different methods, as well as an environment for their experimental testing, were implemented. Conclusions are drawn on the applicability of this approach to solving supervised machine learning problems.

Many experiments were conducted. Based on the data obtained, an analysis of the effectiveness of the methods was carried out, and three hypotheses describing the results were constructed and tested.

An important result of this work is the test of the applicability of evolutionary strategies to various problems. Moreover, the study confirmed the limitations of evolutionary strategies on this class of problems.

It is concluded that evolutionary strategies solve reinforcement learning problems well and solve more traditional problems worse. However, their disadvantages can be compensated for by connecting more computing resources.

REFERENCES

[1] S. Khaikin, "Neural networks: full course," M.: Williams, 2008, 1104 p.
[2] H.G. Beyer and H.P. Schwefel, "Evolution strategies. A comprehensive introduction," Natural Computing, vol. 1, no. 1, pp. 3-52, 2002.
[3] T. Salimans, J. Ho, X. Chen, S. Sidor and I. Sutskever, "Evolution Strategies as a Scalable Alternative to Reinforcement Learning," arXiv preprint arXiv:1703.03864, 2017.
[4] J. Lehman, J. Chen, J. Clune and K.O. Stanley, "ES Is More Than Just a Traditional Finite-Difference Approximator," Proceedings of the Genetic and Evolutionary Computation Conference, pp. 450-457, 2018.
[5] E. Conti, V. Madhavan, F. Petroski Such, J. Lehman, K.O. Stanley and J. Clune, "Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents," Advances in Neural Information Processing Systems, pp. 5027-5038, 2018.
[6] A.N. Kovartsev, "A deterministic evolutionary algorithm for the global optimization of Morse cluster," Computer Optics, vol. 39, no. 2, pp. 234-240, 2015. DOI: 10.18287/0134-2452-2015-39-2-234-240.
[7] N. Hansen, "The CMA Evolution Strategy: A Tutorial," 2009.
[8] S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671-680, 1983.
[9] D.P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.