Development of the heuristic method of evolutionary strategies for reinforcement learning problems solving

Maksim Naumov, Aleksandr Blagov
Samara National Research University, Samara, Russia
blagov@ssau.ru

Abstract—There are many methods for solving machine learning problems, and each of them is chosen depending on the problem at hand. This article describes several classical algorithms of evolutionary strategies for reinforcement learning problems. The authors propose a heuristic method that has certain advantages over the existing ones. The article also provides a comparative analysis of the solutions obtained, including solutions to the problem of error-correcting coding.

Keywords—reinforcement learning, machine learning, data mining, heuristic method

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

I. INTRODUCTION

The article describes the research and development of methods of heuristic evolutionary strategies for solving the problem of training recurrent neural networks. Artificial neural networks are one of the most popular methods of machine learning. They represent a mathematical model, as well as its software or hardware implementation, built on the principle of the organization and functioning of biological neural networks [1]. The formation and training of neural networks can be reduced to an optimization problem. The authors propose an improvement of the simplest implementation of evolutionary strategies, using heuristic methods, for training recurrent neural networks in the problem of error-correcting coding.

The practical value of improving the existing algorithms lies in expanding the range of initial approximations from which the method converges to a solution. It can also affect the rate of convergence.

An evolutionary strategy is an optimization method based on the ideas of evolution [2]. The relevance of such methods is due to their high efficiency in solving optimization problems [3, 4]. Possible solutions are encoded as vectors of real numbers. This method has been successfully used to solve reinforcement learning problems [4-6].

Evolutionary strategies have the following features:

1. They are able to optimize any model that can be expressed through vectors of real numbers, regardless of the internals of the optimized model.
2. No calculation of the gradient of the parameters of the optimized model is required.
3. Back propagation of gradients is not required, which avoids many of the problems that arise when gradients are multiplied.
4. They are invariant with respect to the delay before the response of the system.
5. They are invariant to transformations of the space of solutions that preserve angles.
6. They are invariant to transformations of the objective function.
7. They allow a very high degree of concurrency, due to the very small amount of data sent between workers.

There are also approaches that use evolutionary strategies not for optimization itself, but for the numerical estimation of the gradient of the optimized function. This makes it possible to use higher-order and more accurate methods without having to compute the gradient analytically.

II. TRADITIONAL METHODS OF EVOLUTIONARY STRATEGIES

The main cycle of an evolutionary strategy consists of two stages: mutation and selection. Let us consider these stages on the example of maximizing some function f(x) of a vector argument:

1. Choose the initial solution and make it the current one.
2. Mutation step: from the current solution, create μ new ones by adding to the current solution random vectors drawn from a multidimensional normal distribution with expectation equal to the current solution and covariance matrix σI.
3. Calculate the value of the function f at each of the possible solutions.
4. Selection step: take the best possible solution as the new current one (or keep the current one if it is better), or take a weighted average of the possible solutions, in which solutions with larger values of f receive larger weight.
5. Repeat steps 2-4 until an acceptable solution is obtained.

In step 4, the following formula is used:

x̂ = (1/(μσ)) · Σ_{k=1..μ} [(f(x_k) − f̂)/f̃] · x_k,   (1)

where x̂ is the averaged solution, x_k is the possible solution with number k, f̂ is the sample mean of f(x_k), and f̃ is the sample variance of f(x_k).
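To make the procedure concrete, below is a minimal Python sketch of this simple strategy with the "take the best candidate" selection branch (the weighted-average branch of formula (1) is omitted); the objective f, the population size mu, the step sigma, and the iteration budget are illustrative assumptions rather than values from the paper:

```python
import numpy as np

def simple_es(f, x0, mu=50, sigma=0.1, iterations=300):
    """Maximize f with the simple evolutionary strategy (steps 1-5 above)."""
    x = np.asarray(x0, dtype=float)   # step 1: the initial solution is current
    fx = f(x)
    for _ in range(iterations):
        # Step 2 (mutation): mu candidates from a normal distribution around x.
        candidates = x + sigma * np.random.randn(mu, x.size)
        # Step 3: evaluate the objective at every candidate.
        values = np.array([f(c) for c in candidates])
        # Step 4 (selection): keep the best candidate if it beats the current one.
        best = int(np.argmax(values))
        if values[best] > fx:
            x, fx = candidates[best], values[best]
    return x

# Example: maximize a concave quadratic; the optimum is the all-ones vector.
solution = simple_es(lambda v: -np.sum((v - 1.0) ** 2), x0=np.zeros(5))
```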
This algorithm does not cover all possible evolutionary strategies. However, it shows their distinguishing features:

• the solution is represented as a vector of real numbers;
• at each step, one (or several) key solution is selected, which is used in the next step as the basis for new solutions;
• selection takes place in a deterministic way, which means that it can be performed on a cluster without sending data between nodes;
• the algorithm is relatively simple, which is reflected in its speed.

The high speed and large degree of parallelism of the algorithm allow for a greater number of iterations compared to other methods that solve reinforcement learning problems.

The above algorithm is also called a simple evolutionary strategy, because it does not change the parameter σ depending on the data already collected. This impairs convergence, or even makes finding a solution impossible. This algorithm was implemented in the present paper; the computational complexity of one of its iterations is O(N) with a small constant, where N is the task dimension.

There are many algorithms that address the problem of the constant parameter σ. This article discusses the method called CMA-ES (covariance matrix adaptation evolution strategy) [7], as the most popular of them. As an implementation, the Python library Pycma is used. Consider a simplified version of this algorithm (reflecting its main ideas) for a problem of dimension N:

1. Choose the number of possible solutions at each step, μ; typically a value greater than four is selected. Choose an initial solution and make it the current one (m). Choose the initial parameter vector σ of length N, which is responsible for the step length in each direction. Set the covariance matrix C equal to the identity matrix of dimension N × N.
2. Generate μ random vectors x_k, corresponding to possible solutions. The generation comes from a multidimensional normal distribution with covariance matrix C and mathematical expectation m.
3. Calculate the value of the function f at each possible solution.
4. Sort the possible solutions by the value of the function f.
5. Update the covariance matrix C, taking as a basis the distribution of the best possible solutions:

C_ij = (1/N_best) · Σ_{k=1..N_best} (x_{k,i} − x̂_i)(x_{k,j} − x̂_j),   (2)

where x_k is the k-th random vector, N_best is the number of best solutions taken into account, and x̂ is the average of the N_best vectors corresponding to the best possible solutions.

6. Update the value of the current solution m, setting it to the average of the best possible solutions:

m = x̂ = (1/N_best) · Σ_{k=1..N_best} x_k.   (3)

7. Repeat steps 2-6 until an acceptable solution is obtained.

This algorithm performs steps 2-6 in O(N²) operations (due to the fact that N_best is a constant), where N is the task dimension. There are approaches that reduce the complexity to O(N), at the cost of a large constant. This article uses the most complete and correct implementation of the algorithm, provided by the Pycma library for Python.
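For reference, Pycma exposes an ask-and-tell interface; a minimal sketch of its use might look as follows (the toy sphere objective, the starting point, and the initial step size are assumptions; note that Pycma minimizes, so a function to be maximized must be negated):

```python
import cma  # the Pycma library: pip install cma

def objective(x):  # toy sphere function; replace with the real objective
    return sum(xi ** 2 for xi in x)

es = cma.CMAEvolutionStrategy(8 * [0.5], 0.3)  # initial solution and step size
while not es.stop():
    candidates = es.ask()                      # steps 2-3: sample candidates
    es.tell(candidates, [objective(c) for c in candidates])  # steps 4-6: update
print(es.result.xbest)                         # best solution found so far
```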
III. DEVELOPMENT AND IMPLEMENTATION OF THE METHOD OF HEURISTIC EVOLUTIONARY STRATEGIES

The above algorithms were either inaccurate (the simple evolutionary strategy) or slow (adaptive evolutionary strategies). Is it possible to build an algorithm with computational complexity O(N) and a small constant that is, at the same time, more accurate than the simple implementation of evolutionary strategies?

Consider the following modification of the simple evolutionary strategy (a code sketch is given at the end of this section):

1. Choose the initial solution and make it the current one.
2. Mutation step: from the current solution, create μ new ones by adding to the current solution random vectors drawn from a multidimensional normal distribution with mathematical expectation equal to the current solution and covariance matrix σI.
3. Calculate the value of the function f at each of the possible solutions.
4. Selection step: take the best possible solution as the new current one (or keep the current one if it is better), or take a weighted average of the possible solutions.
5. Calculate the new value of σ using one of the heuristic functions, which are considered below.
6. Repeat steps 2-5 until an acceptable solution is obtained.

At step 5, instead of recalculating the single parameter σ, one could recalculate the entire covariance matrix. However, this would increase the computational complexity of the algorithm from O(N) to O(N²).

By a heuristic we understand the totality of techniques and methods that facilitate and simplify the solution of cognitive, constructive and practical problems. In this article, heuristic functions are understood as functions that improve the accuracy of the calculations without increasing their asymptotic complexity.

Consider some examples of such functions:

h1(fx, i, N) = 1/(1 + e^(−fx));   (4)
h2(fx, i, N) = arctan(fx) + π/2;   (5)
h3(fx, i, N) = (N − i)/N;   (6)
h4(fx, i, N) = 1 − sin²(10π · i/N),   (7)

where (4)-(7) are heuristic functions; fx is the value of the optimized function at the point of the current solution; i is the number of the current iteration; N is the limit on the number of iterations.

Function (6) implements an idea similar to temperature in the simulated annealing method [8]. At the beginning of the optimization, the algorithm considers more distant points, and then it gradually "cools down" and proceeds to a more accurate search for a solution. This allows the optimization process to move toward more promising solutions at the very beginning and then refine them.

The heuristic function (7) is a rather interesting case: it goes through several full periods during the optimization. This allows the optimization process to escape local optima, speeding up the search for a solution.

Not all heuristic functions are similar to those listed; however, this list gives a starting point for finding a suitable kind of function. From the above functions, one can also form a linear combination with positive coefficients and obtain a new heuristic function.
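The sketch below illustrates the heuristic strategy; exactly how σ enters the update is not fully specified in the text, so rescaling a base step σ₀ by the heuristic, as well as all numeric defaults, are assumptions for illustration. Each iteration costs O(μ·N), i.e. O(N) for fixed μ, matching the complexity claim above:

```python
import numpy as np

# Heuristic step-size functions (4)-(7): fx is the current objective value,
# i the current iteration number, n the limit on the number of iterations.
def h1(fx, i, n): return 1.0 / (1.0 + np.exp(-fx))
def h2(fx, i, n): return np.arctan(fx) + np.pi / 2
def h3(fx, i, n): return (n - i) / n
def h4(fx, i, n): return 1.0 - np.sin(10 * np.pi * i / n) ** 2

def heuristic_es(f, x0, h=h3, mu=50, sigma0=0.1, iterations=300):
    """Simple ES whose step size is rescaled by a heuristic function each step."""
    x = np.asarray(x0, dtype=float)
    fx, sigma = f(x), sigma0
    for i in range(iterations):
        candidates = x + sigma * np.random.randn(mu, x.size)  # step 2: mutation
        values = np.array([f(c) for c in candidates])         # step 3: evaluate
        best = int(np.argmax(values))                         # step 4: selection
        if values[best] > fx:
            x, fx = candidates[best], values[best]
        sigma = sigma0 * h(fx, i + 1, iterations)             # step 5: heuristic
    return x
```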
IV. EXPERIMENTAL COMPARISON OF IMPLEMENTED METHODS

As the problem on which the implementations of the above algorithms are compared, the problem of error-correcting coding was taken.

One of the most famous cyclic codes is the (7, 4, 3) code. This code converts messages of k = 4 bits into code words of length n = 7, using for this a generating polynomial of degree r = 3. The code makes it possible to correct a single error, or to detect double errors.

If we assume that the cyclic code (7, 4, 3) is used to correct single errors, then the probability of correctly decoding a message is

(1 − p)^7 + 7p(1 − p)^6.   (8)

A combination of two LSTM networks was taken as the optimized model. The first network encodes 4-bit messages into 7-bit messages; the second performs the inverse transformation. This choice of architecture has several features:

1. The authors gave the example of the working cyclic code (7, 4, 3) above, and the possibility of solving this problem with this architecture was thereby confirmed: theoretically, a system of two neural networks can, after the optimization process, reproduce a cyclic code.
2. Due to the use of LSTM nodes at the core of the architecture, the system has some internal memory. This can positively affect the quality of encoding and decoding (if there is some relationship between messages).

The neural networks were implemented in PyTorch, which made it possible to significantly speed up the calculations, as well as to simplify the source code. As the optimized function, we take the mean squared error between the original and the decoded message.
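The paper does not give the exact network configuration, so the following PyTorch sketch should be read only as one plausible shape of such a model; the hidden size, the linear read-out layer, and the absence of a noisy channel between the two networks are all assumptions:

```python
import torch
import torch.nn as nn

class Coder(nn.Module):
    """An LSTM that maps a bit sequence of one length to bits of another length."""
    def __init__(self, in_len, out_len, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_len)

    def forward(self, bits):                       # bits: (batch, in_len)
        _, (h, _) = self.lstm(bits.unsqueeze(-1))  # read bits one per time step
        return torch.sigmoid(self.head(h[-1]))     # (batch, out_len) in [0, 1]

encoder, decoder = Coder(4, 7), Coder(7, 4)
message = torch.randint(0, 2, (16, 4)).float()    # a batch of 4-bit messages
restored = decoder(encoder(message))              # encode to 7 bits, decode back
loss = nn.functional.mse_loss(restored, message)  # the optimized MSE objective
```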
For further comparison, we calculate an upper estimate of the mean squared error for the cyclic code (7, 4, 3). To do this, suppose that when an uncorrectable error occurs, all 4 bits of the message are decoded incorrectly. Using (8), we obtain the following value for the expected number of erroneous bits among n transmitted bits:

n · (1 − (1 − p)^7 − 7p(1 − p)^6).   (9)

For the experimental evaluation, p = 0.01 and n = 1000. Substituting these values into (9), we obtain an expected number of erroneous bits of approximately 2, which leads to an estimate of the mean squared error for the cyclic code (7, 4, 3) of 0.002.

The plan of the experiment is as follows:

1. vary the maximum number of iterations of the optimization process, N, from 100 to 300 in increments of 100;
2. carry out a significant number of runs (more than 30) of each implementation on the problem;
3. write out the average values of the simulation results and of the optimization time.

V. COMPARATIVE ANALYSIS OF THE RESULTS

The data obtained as a result of the experiment are shown in Table 1. The table reports, as results, the average values of the mean squared error obtained after the optimization; in other words, the smaller the result, the better the algorithm works.

TABLE I. OPTIMIZATION VALUES FOR DIFFERENT NUMBERS OF ITERATIONS

Number of     | Simple ES         | Heuristic ES      | CMA-ES
iterations, N | result | time (s) | result | time (s) | result | time (s)
100           | 0.2582 | 101.5    | 0.2477 | 101.8    | 0.2334 | 373.6
200           | 0.2414 | 204.1    | 0.2303 | 210.7    | 0.2171 | 739.7
300           | 0.2485 | 293.7    | 0.2273 | 290.4    | 0.2103 | 1128.3

As can be seen from Table 1, the evolutionary strategies solved the problem, but with a convergence rate not applicable to practical problems. The problem posed belongs to the class of supervised learning, and it is known that the speed of evolutionary strategies on such problems is not high [1].

There is an assumption that such slow convergence could indicate that:

1. the network architecture and/or the number of nodes were not suitable for solving this problem;
2. the task was quite complex, and most of the possible solutions were bad, which did not allow the algorithms to find a relatively good solution in an acceptable time;
3. the task was unstable, in the sense that a small change in a good solution dramatically worsens the result.

We considered each hypothesis separately.

The results of solving the problem using the Adam method [9] are shown in Table 2. As before, the average value of the mean squared error is used as the result.

TABLE II. OPTIMIZATION VALUES FOR DIFFERENT N OBTAINED USING ADAM

N   | Result | Time (s)
100 | 0.2415 | 174.7
200 | 0.1383 | 334.5
300 | 0.1027 | 515.1
600 | 0.0094 | 1036.8

Table 2 shows that this model can solve the problem, though not as well as the cyclic code. It is worth noting that the error obtained using Adam is almost two orders of magnitude smaller than the error of the evolutionary strategies.

Consider the second hypothesis. We randomly selected 500,000 solutions; each network weight was drawn from a normal distribution with zero mean and standard deviation 1.5. This covered most of the space of reachable solutions, since the optimization started from the zero point and takes no more than 300 steps of length of the order of 10^(−3). The obtained values gave a mean of 0.3086 and a standard deviation of 0.0207. This suggests that most of the possible solutions are indeed unsatisfactory.

To consider the third hypothesis, we took the solution obtained at step 600 of the optimization by the Adam method and added to it Gaussian noise with zero mathematical expectation and a standard deviation of 10^(−3). This allowed us to check how stable the solution was. After generating 500,000 solutions from this neighborhood, we obtained an expected value of 0.1622 and a standard deviation of 0.01526, which allowed us to conclude that the obtained solution is unstable.
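The perturbation test for the third hypothesis can be sketched as follows; here `evaluate` and `reference` are hypothetical stand-ins for the trained model's loss function and its flattened weight vector:

```python
import numpy as np

def stability_check(evaluate, reference, samples=500_000, noise_std=1e-3, seed=0):
    """Perturb a reference solution with Gaussian noise and measure the spread
    of the objective, as in the test of the third hypothesis."""
    rng = np.random.default_rng(seed)
    values = np.empty(samples)
    for s in range(samples):
        noise = rng.normal(0.0, noise_std, size=reference.shape)
        values[s] = evaluate(reference + noise)
    # A mean far above the unperturbed loss signals an unstable solution.
    return values.mean(), values.std()
```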
Thus, based on the data obtained, the following conclusion was made. Evolutionary strategies have an impressive list of advantages; however, if it is possible to calculate the gradient of the optimized function and the task is unstable, then it is reasonable to consider first- or second-order methods. On the other hand, evolutionary strategies make it possible to use the computing resources of an entire cluster without significant overhead; this can even out the difference in speed, and sometimes even let them surpass the classical methods.

VI. CONCLUSION

In the course of this work, methods of evolutionary strategies and their application to the solution of the problem of error-correcting coding were considered. Several different methods, as well as an environment for their experimental testing, were implemented. Conclusions are drawn on the applicability of this approach to solving supervised machine learning problems.

Many experiments were conducted. Based on the data obtained, an analysis of the effectiveness of the methods was carried out, and three hypotheses describing the results were constructed and tested.

An important result of this work is the test of the applicability of evolutionary strategies to various problems. Moreover, the study confirmed the limitations of evolutionary strategies on this class of problems.

It is concluded that evolutionary strategies solve reinforcement learning problems well and solve more traditional problems worse. However, their disadvantages can be compensated for by connecting more computing resources.

REFERENCES

[1] S. Khaikin, "Neural networks: full course," M.: Williams, 2008, 1104 p.
[2] H.G. Beyer and H.P. Schwefel, "Evolution strategies. A comprehensive introduction," Natural Computing, vol. 1, no. 1, pp. 3-52, 2002.
[3] T. Salimans, J. Ho, X. Chen, S. Sidor and I. Sutskever, "Evolution Strategies as a Scalable Alternative to Reinforcement Learning," arXiv preprint arXiv:1703.03864, 2017.
[4] J. Lehman, J. Chen, J. Clune and K.O. Stanley, "ES Is More Than Just a Traditional Finite-Difference Approximator," Proceedings of the Genetic and Evolutionary Computation Conference, pp. 450-457, 2018.
[5] E. Conti, V. Madhavan, F. Petroski Such, J. Lehman, K.O. Stanley and J. Clune, "Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents," Advances in Neural Information Processing Systems, pp. 5027-5038, 2018.
[6] A.N. Kovartsev, "A deterministic evolutionary algorithm for the global optimization of Morse cluster," Computer Optics, vol. 39, no. 2, pp. 234-240, 2015. DOI: 10.18287/0134-2452-2015-39-2-234-240.
[7] N. Hansen, "The CMA Evolution Strategy: A Tutorial," 2009.
[8] S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671-680, 1983.
[9] D.P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.