<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>evolutionary strategies for reinforce m ent learning proble m s solving</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Russia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Russia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Samara National Research University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>19</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>-There are many methods for machine learning problems solving. Each of them is used depending on the solved problems. This article describes several classical algorithms of evolutionary strategies in the reinforcement learning problems. The authors propose the Heuristic method that has certain advantages over existing methods. The article also provides a comparative analysis of the solutions to problems, including the problems of error-correcting coding.</p>
      </abstract>
      <kwd-group>
        <kwd>reinforcement learning</kwd>
        <kwd>machine learning</kwd>
        <kwd>data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>I. INTRODUCTION
authors
propose
the
improvement
of
the
simplest
implementation of evolutionary strategies using heuristic
methods for training recurrent neural networks in the
problem of error-correcting coding</p>
      <p>The practical value of improving existing algorithms lies
in expanding the range of initial approximations from which
the method will converge to a solution. It can also affect the
change in the rate of convergence.</p>
      <p>Evolutionary strategy is an optimization method based
on the ideas of evolution [2]. The relevance of their use is
due to the high efficiency of solving optimization problems
[3, 4]. Possible solutions are encoded as vectors of real
numbers. This method has been successfully used to solve
reinforcement learning problems [4-6].</p>
      <p>Evolutionary strategies have the following features.</p>
      <p>They are able to optimize any models that can be
expressed through the real numbers vectors (not dependent
on optimized model).</p>
      <p>No calculation of the gradient of parameters of the
optimized model required.</p>
      <p>It is not required to perform the back propagation
of gradients, which avoids many problems when multiplying
them.</p>
      <p>They are invariant with respect to the delay before
the response of the system.</p>
      <p>Copyright © 2020 for this paper by its authors.
5.
6.
7.
1.
2.
3.
4.
5.</p>
      <p>TRADITIONAL METHODS OF EVOLUTIONARY</p>
      <p>STRATEGIES</p>
      <p>The main cycle of evolutionary strategy consists of two
stages: mutation and selection. Let us consider these stages
as an example of maximizing some vector function.  ( ):</p>
      <p>Choose the initial solution. After it make the initial</p>
      <p>Mutation step: from the current solution we create μ
new ones by adding random
vectors distributed over the
multidimensional normal distribution to the current solution
μ with the expectation equal to the current solution and the
correlation matrix σI.</p>
      <p>The selection step: as the current solution, we take
the best solution possible (or leave the current one if it is
better), or calculate the weighted average of the possible
solutions, given that more remote solutions have more
weight.</p>
      <p>Repeat steps 2-4 until we get an acceptable solution.
In step 4, the following formula is used
^ =
 =1
 (  ) −  ^
~</p>
      <p>where ^ is the average solution,   is the possible
solution numbered  ,  ^ is the sample mean  (  ), ~ is the</p>
      <p>
        This algorithm does not cover all possible evolutionary
strategies. However, it shows their distinguishing features:
 the solution is represented as a vector of real
numbers;
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )


      </p>
      <p>at each step, one (or several) key decision is selected,
which is used in the next step as the basis for new solutions;
selection takes place in a deterministic way, which
means that it can be performed on the cluster without
sending data;</p>
      <p> the relative simplicity of the algorithm, which is
reflected in the speed of the algorithm.</p>
      <p>High speed and large parallelism of the algorithm allows
for a greater number of iterations compared to other methods
that solve reinforcement learning problems.</p>
      <p>The above algorithm is also called a simple evolutionary
strategy. Due to the fact that does not change the parameter 
depending
on
the
already
known
data.</p>
      <p>This impairs
convergence, or makes finding a solution impossible.</p>
    </sec>
    <sec id="sec-2">
      <title>This algorithm</title>
      <p>was implemented in this paper. The
computational complexity of one iteration is  ( ) with a
small constant, where  – task dimension.</p>
      <p>There are many algorithms to solve the problem of
constant parameter  . This article discusses a method called
CMA-ES (covariance matrix adaptation evolution strategy)
[7] as the most popular. As an implementation, a library for
Python called Pycma is used. Consider a simplified version
of this algorithm (reflecting the main ideas of the algorithm),
with the dimension of the problem equal to N:
1.</p>
      <p>Choose the number of possible solutions at each
step μ. Typically, a value greater than four is selected.
Choose an initial solution and make it current ( ). Choose
the initial parameter
vector 
of length</p>
      <p>N,
which is
responsible for the step length in each direction. Set the
correlation matrix C equal to the identity matrix, with 
× 
Generate  random vectors   , corresponding to
solutions.</p>
      <p>The
generation
comes
from
a
multidimensional normal distribution
with a correlation
matrix C and a mathematical expectation of m.</p>
      <p>Calculate the value of the function f in each possible
Sort possible solutions by the value of the function
Update the value of the correlation matrix C, taking
as a basis the data on the distribution of possible solutions.
The calculations are made according to the formula:
dimension.
possible
solution.</p>
      <p>∑ (   −  ^ ) (</p>
      <p>−  ^ ),
Repeat steps 2-6 until you get the right solution.</p>
      <p>^ =

1</p>
      <p>,  ,  ) =
ℎ2(
,  ,  ) = 
ℎ3(</p>
      <p>
        ,  ,  ) =
ℎ4(
,  ,  ) = 1 − 
2 (
10 ),
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
1.
2.
3.
4.
5.
6.
      </p>
      <p>This algorithm performs steps 2-6 in  ( 2) (due to the
fact that  
is a constant), where

– task dimension.</p>
      <p>There are approaches that reduce complexity to  ( )with a
large constant. This article uses the
most complete and
correct implementation of the algorithm from the Pycma
library for Python.</p>
      <p>III.</p>
      <p>DEVELOPMENT AND IMPLEMENTATION OF THE
METHOD OF HEURISTIC EVOLUTIONARY STRATEGIES
The above algorithms were either inaccurate (simple
computational complexity  ( ) with a small constant and at
the same time more accuracy than a simple implementation
of evolutionary strategies?</p>
      <p>Consider the
following
modification
of a simple
evolutionary strategy:
new ones by adding random vectors to the current solution 
distributed over a multidimensional normal distribution with
a mathematical expectation equal to the current solution and
the correlation matrix  .</p>
      <p>Calculate the value of the function f in each of the
best solution possible (or leave the current one if it is better),
or calculate the weighted average of the possible solutions,
given that more remote solutions have more weight.</p>
    </sec>
    <sec id="sec-3">
      <title>Calculate the new value of  using one of the heuristic functions, which will be considered later. Repeat steps 2-5 until we get an acceptable solution.</title>
      <p>( ) to  ( 2).</p>
      <p>At step 5, instead of calculating the parameter  , we can
calculate the entire correlation matrix. However, this will
increase the computational complexity of the algorithm from</p>
      <p>Under the heuristic understand the totality of techniques
and</p>
      <p>methods that facilitate and simplify the solution of
cognitive, constructive, practical problems.</p>
      <p>In this article, heuristic functions are understood to mean
functions that improve the accuracy of calculations without
increasing the asymptotic complexity.</p>
      <p>
        Consider at some examples of such functions:
where (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ) - (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ) are heuristic functions; fx is the value of the
optimized function at the point of the current solution; i is the
number of the current iteration; N is the limit number of
iterations.
      </p>
      <p>
        Function (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ) implements an idea similar to temperature in
the simulated annealing method [8]. At the beginning of
optimization, the algorithm considers more distant points,
and then gradually “cools down” and proceeds to a more
accurate search for a solution. This approach allows the
optimization process to go to more promising solutions at the
very beginning, and then refine them.
      </p>
      <p>
        The heuristic function (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ) is a rather interesting case. It
goes through several full periods during the optimization
function. This allows the optimization process to go out of
local optima, speeding up the solution.
      </p>
      <p>Not all heuristic functions are similar to those listed,
however, this list gives some starting point for finding a
suitable kind of function.</p>
      <p>From the above functions, you can make a linear
combination using positive coefficients and get a new
heuristic function.</p>
      <p>IV.</p>
      <p>EXPERIMENTAL COMPARISON OF IMPLEMENTED</p>
      <p>METHODS</p>
      <p>As a problem, on the solution of which the
implementations of the above algorithms will be compared,
the problem of error-correcting coding was taken.</p>
      <p>One of the most famous cyclic codes is the code (7; 4; 3).
This code converts messages from k = 4 bits into code words
of length n = 7, using for this a generating polynomial of
degree r = 3. This code allows us to correct a single error, or
detect double errors.</p>
      <p>
        If we assume that the cyclic code (7; 4; 3) is used to
correct single errors, then the probability of correctly
decoding the message is
(1 −  )7 + 7 (1 −  )6.
(
        <xref ref-type="bibr" rid="ref8">8</xref>
        )
      </p>
      <p>A combination of two LSTM networks was taken as an
optimized model. The first network encrypts 4-bit messages
into 7-bit messages. The second produces the inverse
transformation.</p>
      <p>
        This choice of architecture has several features:
1. The authors gave an example of a working cyclic
code (
        <xref ref-type="bibr" rid="ref3 ref4 ref7">7, 4, 3</xref>
        ). The possibility of solving this problem with
this architecture was confirmed. Theoretically, a system of
two neural networks can, after the optimization process, use
a cyclic code.
      </p>
      <p>2. Due to the use of LSTM nodes at the core of the
architecture, the system has some internal memory. This can
positively affect the quality of encoding and decoding (if
there is some relationship between messages).</p>
      <p>Neural networks were implemented on Pytorch, which
allowed to significantly speed up the calculations, as well as
simplify the source code.</p>
      <p>As an optimized function, we take the standard error
between the original and decoded message.</p>
      <p>
        For further comparison, we calculate the upper estimate
of the mean square error for the cyclic code (
        <xref ref-type="bibr" rid="ref3 ref4 ref7">7, 4, 3</xref>
        ).
      </p>
      <p>
        To do this, suppose that when a fatal error occurs, all 4
bits are decoded incorrectly. Using (
        <xref ref-type="bibr" rid="ref8">8</xref>
        ), we obtain the
following value for the expected number of errors among n
bits:
 (1 + (1 −  )7 − 7 (1 −  )6),
(
        <xref ref-type="bibr" rid="ref9">9</xref>
        )
For experimental evaluation  = 0,01,  = 1000.
      </p>
      <p>
        Substituting the values in (
        <xref ref-type="bibr" rid="ref9">9</xref>
        ), we obtain the expected
number of error bits: 2. Which leads to an estimate of the
mean square error for the cyclic code (
        <xref ref-type="bibr" rid="ref3 ref4 ref7">7, 4, 3</xref>
        ): 0.002.
      </p>
      <p>The plan of the experiment:
1. we will change the maximum number of iterations
for the optimization process N from 100 to 300, in
increments of 100;</p>
      <p>2. we will carry out a significant number of iterations
(more than 30), using each implementation to solve the
problem;</p>
      <p>3. we write out the average values of the simulation
results and the optimization time.</p>
      <p>V. COMPARATIVE ANALYSIS OF THE RESULTS</p>
      <p>The data obtained as a result of the experiment are shown
in table 1. In the table, the average values of the mean square
errors obtained after the optimization are used as the results.
In other words, the smaller the result, the better the algorithm
works.</p>
      <p>As can be seen from table 1, evolutionary strategies have
solved the problem, but with a convergence rate not
applicable to practical problems. The problem posed
belonged to the class of instruction with a teacher. It is
known that the speed of evolutionary strategies for such
problems is not high [1].</p>
      <p>There is an assumption that such a slow convergence
could indicate that:</p>
      <p>1. network architecture and / or number of nodes were
not suitable for solving this problem;</p>
      <p>2. the task was quite complex and most of the
solutions were be bad, which did not allow the algorithm to
find a relatively good solution in an acceptable time;
3. the task was unstable in the sense that a small
change in a good solution dramatically worsens the result.</p>
      <p>We considered each hypothesis separately.</p>
      <p>The results of solving the problem using the Adam [9]
method are shown in table 2.</p>
      <p>As a result, the mean value of the mean square error was
used. Table 2 shows that this model can solve the problem,
though not as good as the cyclic code. It is worth paying
attention to the fact that the error obtained using Adam is
almost two orders of magnitude smaller than the error of
evolutionary strategies.</p>
      <p>OPTIMIZATION VALUES FOR DIFFERENT N OBTAINED</p>
      <p>USING ADAM</p>
      <p>Result Time (s)
0.2415 174.7</p>
      <p>Consider the second hypothesis. We randomly selected
500 000 solutions. Each net weight was taken from the
normal distribution with zero mean and standard deviation
equal to 1.5. This covered most of the possible solutions,
since optimization started from the zero point and takes no
more than 300 steps of length of the order of 10−3. The
obtained values allowed us to calculate the mean and
standard deviation of 0.3086 and 0.0207, respectively. This
suggested that most of the possible solutions are
unsatisfactory.</p>
      <p>To consider the third hypothesis, we took the solution
obtained at step 600 of optimization by the Adam method.
Added to this solution a Gaussian noise with zero
mathematical expectation and a standard deviation of 10−3.
This allowed us to check how sustainable the solution was.
After generating 500 000 solutions from a given area, we got
the following results: the expected value is 0.1622 and the
standard deviation is 0.01526. This allowed us to talk about
the instability of the obtained solution.</p>
      <p>Thus, based on the data obtained, the following
conclusion was made. Evolutionary strategies have an
impressive list of advantages, however, if it is possible to
calculate the gradient of the optimized function and the task
is unstable, then it is reasonable to think about using first or
second order methods.</p>
      <p>On the other hand, evolutionary strategies make it
possible to use the computing resources of an entire cluster
without significant overhead, this can even out the difference
in speed, and sometimes even surpass classical methods.</p>
      <p>VI. CONCLUSION</p>
      <p>In the course of this work, methods of evolutionary
strategies and their application to the solution of the problem
of error-correcting coding were considered. Several different
methods and the environment for their experimental testing
were implemented. Conclusions are drawn on the
applicability of this approach to solving machine learning
problems with a teacher.</p>
      <p>Many experiments have been conducted. Based on the
data obtained, an analysis of the effectiveness of the methods
was carried out, and three hypotheses describing the results
were constructed and tested.</p>
      <p>An important result of this work is testing the
applicability of evolutionary strategies for solving various
problems. Moreover, the study confirmed the limitations of
evolutionary strategies for the class of problems.</p>
      <p>It is concluded that evolutionary strategies solve well
reinforcement learning problems and worse solve more
traditional problems. However, their disadvantages can be
eliminated by connecting more computing resources.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khaikin</surname>
          </string-name>
          , “
          <article-title>Neural networks: full course,” M.:</article-title>
          <string-name>
            <surname>Williams</surname>
          </string-name>
          ,
          <year>2008</year>
          , 1104 p.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.G.</given-names>
            <surname>Beyer</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.P.</given-names>
            <surname>Schwefel</surname>
          </string-name>
          , “
          <article-title>Evolution strategies. A comprehensive introduction,” Natural computing</article-title>
          , vol.
          <volume>1</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>52</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sidor</surname>
          </string-name>
          and
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          , “
          <article-title>Evolution Strategies as a Scalable Alternative to Reinforcement Learning,”</article-title>
          <source>arXiv preprint arXiv:1703.03864</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. O.</given-names>
            <surname>Stanley</surname>
          </string-name>
          , “
          <article-title>ES Is More Than Just a Traditional Finite-Difference,”</article-title>
          <source>Proceedings of the Genetic and Evolutionary Computation Conference</source>
          , pp.
          <fpage>450</fpage>
          -
          <lpage>457</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Conti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroski Such</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. O.</given-names>
            <surname>Stanley</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          , “
          <article-title>Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents</article-title>
          ,
          <source>” Advances in neural information processing systems</source>
          , pp.
          <fpage>5027</fpage>
          -
          <lpage>5038</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.N.</given-names>
            <surname>Kovartsev</surname>
          </string-name>
          , “
          <article-title>A deterministic evolutionary algorithm for the global optimization of morse cluster,” Computer Optics</article-title>
          , vol.
          <volume>39</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>240</lpage>
          ,
          <year>2015</year>
          . DOI:
          <volume>10</volume>
          .18287/
          <fpage>0134</fpage>
          -2452-2015-39-2-
          <fpage>234</fpage>
          -240.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          , “
          <source>The CMA Evolution Strategy:</source>
          a Tutorial,”
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirkpatrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Gelatt</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Vecchi</surname>
          </string-name>
          , “
          <article-title>Optimization by simulated annealing,” Science</article-title>
          , vol.
          <volume>220</volume>
          , no.
          <issue>4598</issue>
          , pp.
          <fpage>671</fpage>
          -
          <lpage>680</lpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          . Adam, “
          <article-title>A method for stochastic optimization</article-title>
          ,
          <source>” arXiv preprint arXiv:1412.6980</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>