Methods for Solving the Traveling Salesman Problem Based on
Reinforcement Learning and Metaheuristics
Eugene Fedorov and Olga Nechyporenko
Cherkasy State Technological University, Shevchenko blvd., 460, Cherkasy, 18006, Ukraine


                Abstract
                To date, the search for a solution to the traveling salesman problem is relevant for general
                and special-purpose intelligent computer systems. Existing methods for finding a solution to
                the traveling salesman problem are not sufficiently efficient. The aim of the work is to
                increase the efficiency of finding a solution to the traveling salesman problem through
                reinforcement learning based on Q-learning and SARSA, and a metaheuristic method based on the
                ant colony algorithm. To achieve this goal, a Q-learning based method, a SARSA-based method,
                and an Ant-Q based method for the traveling salesman problem were created in this work.
                The advantages of the proposed methods include the following. Firstly, the modification of
                the Q-learning and SARSA methods through dynamic parameters makes it possible to
                increase the learning rate while maintaining the mean squared error of the method. Secondly,
                in the Ant-Q method with dynamic parameters, the ε-greedy Q-learning approach is used to
                select a new vertex, which is close to random search at initial iterations, and close to directed
                search at final iterations. This is ensured by the use of dynamic Q-learning parameters and
                makes it possible to increase the learning rate while maintaining the mean squared error of
                the method. Thirdly, in the Ant-Q method with dynamic parameters for calculating the
                change in the global pheromone level at initial iterations, the pheromone increment (current
                reward table for Q-learning parameters) plays the main role, which ensures the breadth of the
                search. In the final iterations, the previous pheromone level (global reward table for Q-
                learning parameters) plays the main role, which ensures the convergence of the method. This
                is ensured by the use of dynamic Q-learning parameters and makes it possible to increase the
                learning rate while maintaining the mean squared error of the method.
                The numerical study made it possible to evaluate the proposed methods (for the first and
                second methods, the number of iterations was 300; for the third method, the number of
                iterations was 10; for all three methods, the mean squared error was 0.05). The proposed
                methods make it possible to expand the scope of reinforcement learning and metaheuristics,
                which is confirmed by their adaptation for the specified optimization problem, and
                contributes to an increase in the efficiency of general and special-purpose intelligent
                computer systems. The prospect of further research is the study of the proposed methods for a
                wide class of artificial intelligence problems.

                Keywords
                reinforcement learning, metaheuristics, ant colony algorithm, dynamic parameters, traveling
                salesman problem, pheromone level




ITTAP’2022: 2nd International Workshop on Information Technologies: Theoretical and Applied Problems, November 22–24, 2022,
Ternopil, Ukraine
EMAIL: fedorovee75@ukr.net (E. Fedorov); olne@ukr.net (O. Nechyporenko)
ORCID: 0000-0003-3841-7373 (E. Fedorov); 0000-0002-3954-3796 (O. Nechyporenko)
           © 2022 Copyright for this paper by its authors.
           Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
           CEUR Workshop Proceedings (CEUR-WS.org)
1. Introduction
   To date, the development of methods aimed at solving the traveling salesman problem, which are
used in general and special-purpose intelligent computer systems, is an urgent task.
   Optimization methods that find the exact solution have a high computational complexity.
Optimization methods that find an approximate solution through directed search have a high
probability of hitting a local extremum. Random search methods do not guarantee convergence. Due
to this, there is a problem of insufficient efficiency of optimization methods, which needs to be
addressed.
   Metaheuristics (or modern heuristics) are used to speed up finding a solution to the traveling
salesman problem and reduce the probability of hitting a local extremum [1-3]. Metaheuristics
expand the possibilities of heuristics by combining heuristic methods based on a high-level strategy
[4-6]. The most promising metaheuristics are agent-based metaheuristics, which show the best results
when searching for a solution to the traveling salesman problem [7-8].
   Another popular approach is reinforcement learning [9].
   Currently, there is a trend towards the joint use of reinforcement learning and metaheuristics [10].
   The aim of the work is to increase the efficiency of finding a solution to the traveling salesman
problem through reinforcement learning and the metaheuristic method.
   To achieve this goal, it is necessary to solve the following tasks:
   1. Create a Q-learning method for the traveling salesman problem.
   2. Create a SARSA-based method for the traveling salesman problem.
   3. Create an Ant-Q based method for the traveling salesman problem.
   4. Conduct a numerical study of the proposed optimization methods.

2. Formulation of the problem
   The problem of increasing the efficiency of solving the traveling salesman problem based on
reinforcement learning methods (Q-learning and SARSA) and the ant colony algorithm is presented as
the problem of finding such an ordered set of operators {A_1, A_2}, the iterative application of which
ensures finding a solution x* such that F(x*) → min, where x is the vertex vector (route), F(·) is the
target function (route length), A_1 is an operator that selects a vertex based on the ε-greedy approach,
and A_2 is an operator that updates the reward table.

3. Literature review
   Existing metaheuristics and reinforcement learning methods have one or more of the following
advantages:
   • computational complexity is lower than in traditional methods of exhaustive search (branch
   and bound, dynamic programming, etc.) [10];
   • the probability of hitting a local extremum is lower than in gradient methods [10].
   Existing metaheuristics and reinforcement learning methods have one or more of the following
disadvantages:
   • there is only an abstract description of the method or the description of the method is focused
   on solving only a specific problem [11];
   • method convergence is not guaranteed [12];
   • the influence of the iteration number on the process of finding a solution is not taken into
   account [13];
   • there is no possibility to solve problems of conditional optimization [14];
   • insufficient method accuracy [15];
   • the procedure for determining parameter values is not automated [16].
   This raises the problem of constructing efficient optimization methods.
   One of the most common and accurate metaheuristics is the ant colony algorithm, which was
proposed by Dorigo [18-20], further developed in [21-23] and implemented in software in [24].
   One of the most common and accurate reinforcement learning methods are Q-learning and SARSA
[9], which are based on the Bellman equation.
   Currently, hybrid metaheuristics are often used to control search [25-26].

4. Q-learning method with dynamic parameters for the traveling salesman
   problem
   The cost function (target function) is defined as
                                F(x) = d_{x_M, x_1} + Σ_{i=1}^{M−1} d_{x_i, x_{i+1}} → min_x ,                    (1)

where d_{x_i, x_{i+1}} is the weight of the edge (x_i, x_{i+1}), x_i, x_{i+1} ∈ V,
x is the vertices vector (route).
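   As an illustration, the route length in Eq. (1) can be computed with the following minimal Python
sketch (assuming the edge weights are given as an M×M matrix d indexed from 0 and the route x is a list
of vertex indices; the names are illustrative, not taken from the authors' implementation):

    def route_length(x, d):
        """Target function F(x): total length of the closed route x,
        including the return edge from the last vertex back to the first."""
        total = d[x[-1]][x[0]]                 # d_{x_M, x_1}
        for i in range(len(x) - 1):
            total += d[x[i]][x[i + 1]]         # d_{x_i, x_{i+1}}
        return total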

   The method consists of the following steps:
   1. Initialization.
   1.1. The maximum number of iterations N, the length of the vertices vector M, the discrete set of
   states (vertices) S = {1,..., M}, the discrete set of actions (vertices) A = {1,..., M}, the matrix of
   edge weights [d_ij], i, j ∈ 1, M, parameters ρ_min, ρ_max (control the learning rate),
   0 < ρ_min < ρ_max < 1, parameters ε_min, ε_max for the ε-greedy approach, 0 < ε_min < ε_max < 1,
   parameters θ_min, θ_max (determine the importance of the future reward), 0 < θ_min < θ_max < 1, are
   specified.
   1.2. The optimal vertices vector x* is defined by randomly ordering the set A.
   1.3. The reward table is initialized as
                                    Q = [Q(i, j)], Q(i, j) = 0, i, j ∈ 1, M.
   2. Iteration number is n=1.
   3. Parameters are calculated:
                                     ρ(n) = ρ_max − (ρ_max − ρ_min)·(n − 1)/(N − 1),
                                     ε(n) = ε_max − (ε_max − ε_min)·(n − 1)/(N − 1),
                                     θ(n) = θ_min + (θ_max − θ_min)·(n − 1)/(N − 1).
   4. The initial state (vertex) s = 1 is observed, which becomes a new vertex of the vertices vector,
   i.e. x_1 = 1.
   5. The set of prohibited actions (vertices) A_tabu = {1} is initialized.
   6. An action (vertex) a is selected, to which it is necessary to move from vertex s, using the ε-greedy
   approach (if U(0,1) < ε(n), then choose an action (vertex) a randomly from the set of allowed actions
   (vertices) A \ A_tabu, otherwise choose an action (vertex) a as the nearest neighbor of vertex s from
   the set of allowed actions (vertices) A \ A_tabu, i.e. a = argmax_b Q(s, b), b ∈ A \ A_tabu). The
   selected action (vertex) a is included in the set of forbidden actions (vertices) A_tabu, i.e.
   A_tabu = A_tabu ∪ {a}, and becomes the new vertex of the vertices vector, i.e. x_{|A_tabu|} = a.
   7. If there are no allowed actions (vertices) left, i.e. A \ A_tabu = ∅, then go to step 12.
   8. The current reward table element R(s, a) is calculated as the negative weight of the edge (s, a),
   i.e. R(s, a) = −d_{sa}.
   9. A new state (vertex) e = a is observed.
   10. The element of the reward table Q(s, a) is calculated as
                 Q(s, a) = (1 − ρ(n)) Q(s, a) + ρ(n) (R(s, a) + θ(n) max_b Q(e, b)), b ∈ A \ A_tabu.
   11. Set the current state (vertex) as s = a . Go to step 6.
   12. If the best value of the target function at the current iteration is less than the best value of the
   target function for all previous iterations, i.e. F ( x) < F ( x* ) , then replace the best vertices vector,
   i.e. x* = x .
   13. If the current iteration is not the last one, i.e. n < N, then increase the iteration number, i.e.
   n = n + 1, and go to step 3, otherwise stop.
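
   The method above can be rendered as the following minimal Python sketch (a sketch of steps 1-13 with
0-based vertex indices; it reuses the route_length helper shown after Eq. (1), and all other names are
illustrative rather than taken from the authors' implementation):

    import random

    def q_learning_tsp(d, N=300, rho_min=0.1, rho_max=0.9,
                       eps_min=0.1, eps_max=0.9, theta_min=0.1, theta_max=0.9):
        """Q-learning with dynamic parameters for the TSP (steps 1-13);
        d is an M x M edge-weight matrix, vertices are numbered from 0."""
        M = len(d)
        Q = [[0.0] * M for _ in range(M)]            # step 1.3: reward table
        best = random.sample(range(M), M)            # step 1.2: random initial best route
        for n in range(1, N + 1):
            frac = (n - 1) / (N - 1)                 # step 3: dynamic parameters
            rho = rho_max - (rho_max - rho_min) * frac
            eps = eps_max - (eps_max - eps_min) * frac
            theta = theta_min + (theta_max - theta_min) * frac
            s, route, tabu = 0, [0], {0}             # steps 4-5: start at the first vertex
            while True:
                allowed = [b for b in range(M) if b not in tabu]
                if random.random() < eps:            # step 6: epsilon-greedy vertex selection
                    a = random.choice(allowed)
                else:
                    a = max(allowed, key=lambda b: Q[s][b])
                tabu.add(a)
                route.append(a)
                remaining = [b for b in range(M) if b not in tabu]
                if not remaining:                    # step 7: tour completed
                    break
                r = -d[s][a]                         # step 8: current reward
                # step 10: Q-update with the maximum over still-allowed vertices
                Q[s][a] = (1 - rho) * Q[s][a] + rho * (r + theta * max(Q[a][b] for b in remaining))
                s = a                                # step 11
            if route_length(route, d) < route_length(best, d):   # step 12: keep the best route
                best = route[:]
        return best, route_length(best, d)

   With N = 300 this sketch mirrors the iteration count reported in the numerical study; the SARSA-based
method in the next section differs only in how the reward table is updated.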

5. SARSA-based method with dynamic parameters for the traveling salesman
   problem
   The cost function (target function) is defined as
                                F(x) = d_{x_M, x_1} + Σ_{i=1}^{M−1} d_{x_i, x_{i+1}} → min_x ,                    (2)

where d_{x_i, x_{i+1}} is the weight of the edge (x_i, x_{i+1}), x_i, x_{i+1} ∈ V,
x is the vertices vector (route).

   The method consists of the following steps:
   1. Initialization.
   1.1. The maximum number of iterations N , the length of the vertices vector M , the discrete set of
   states (vertices) S = {1,..., M } , the discrete set of actions (vertices) A = {1,..., M } , the matrix of
   edge weights [d_ij], i, j ∈ 1, M, parameters ρ_min, ρ_max (control the learning rate),
   0 < ρ_min < ρ_max < 1, parameters ε_min, ε_max for the ε-greedy approach, 0 < ε_min < ε_max < 1,
   parameters θ_min, θ_max (determine the importance of the future reward), 0 < θ_min < θ_max < 1, are
   specified.
   1.2. The optimal vertices vector x* is defined by randomly ordering the set A.
   1.3. The reward table is initialized as
                                    Q = [Q(i, j)], Q(i, j) = 0, i, j ∈ 1, M.
   2. Iteration number is n=1.
   3. Parameters are calculated:
                                     ρ(n) = ρ_max − (ρ_max − ρ_min)·(n − 1)/(N − 1),
                                     ε(n) = ε_max − (ε_max − ε_min)·(n − 1)/(N − 1),
                                     θ(n) = θ_min + (θ_max − θ_min)·(n − 1)/(N − 1).
   4. The initial state (vertex) s = 1 is observed, which becomes a new vertex of the vertices vector,
   i.e. x_1 = 1.
   5. The set of prohibited actions (vertices) A_tabu = {1} is initialized.
   6. An action (vertex) a is selected, to which it is necessary to move from vertex s, using the ε-greedy
   approach (if U(0,1) < ε(n), then choose an action (vertex) a randomly from the set of allowed actions
   (vertices) A \ A_tabu, otherwise choose an action (vertex) a as the nearest neighbor of vertex s from
   the set of allowed actions (vertices) A \ A_tabu, i.e. a = argmax_b Q(s, b), b ∈ A \ A_tabu). The
   selected action (vertex) a is included in the set of forbidden actions (vertices) A_tabu, i.e.
   A_tabu = A_tabu ∪ {a}, and becomes the new vertex of the vertices vector, i.e. x_{|A_tabu|} = a.
   7. If there are no allowed actions (vertices) left, i.e. A \ A_tabu = ∅, then go to step 13.
   8. The current reward table element R(s, a) is calculated as the negative weight of the edge (s, a),
   i.e. R(s, a) = −d_{sa}.
   9. A new state (vertex) e = a is observed.
   10. An action (vertex) c is selected, to which it is necessary to move from vertex e, using the ε-greedy
   approach (if U(0,1) < ε(n), then choose an action (vertex) c randomly from the set of allowed actions
   (vertices) A \ A_tabu, otherwise choose an action (vertex) c as the nearest neighbor of vertex e from
   the set of allowed actions (vertices) A \ A_tabu, i.e. c = argmax_b Q(e, b), b ∈ A \ A_tabu). The
   selected action (vertex) c is included in the set of forbidden actions (vertices) A_tabu, i.e.
   A_tabu = A_tabu ∪ {c}, and becomes the new vertex of the vertices vector, i.e. x_{|A_tabu|} = c.
   11. The element of the reward table Q(s, a) is calculated as
                          Q(s, a) = (1 − ρ(n)) Q(s, a) + ρ(n) (R(s, a) + θ(n) Q(e, c)).
   12. Set the current state (vertex) as s = a. Set the current action (vertex) as a = c. Go to step 7.
   13. If the best value of the target function at the current iteration is less than the best value of the
   target function for all previous iterations, i.e. F(x) < F(x*), then replace the best vertices vector,
   i.e. x* = x.
   14. If the current iteration is not the last one, i.e. n < N, then increase the iteration number, i.e.
   n = n + 1, and go to step 3, otherwise stop.
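
   A sketch of the corresponding tour-construction pass is given below (same assumptions and illustrative
names as in the Q-learning sketch; the only substantive difference is that the update in step 11 uses the
actually selected next action c rather than a maximum over actions):

    import random

    def sarsa_tour(d, Q, rho, eps, theta):
        """One SARSA tour construction with reward-table updates (steps 4-12);
        Q is modified in place, and the constructed route is returned."""
        M = len(d)

        def choose(v, allowed):
            # epsilon-greedy choice of the next vertex from the allowed set
            if random.random() < eps:
                return random.choice(allowed)
            return max(allowed, key=lambda b: Q[v][b])

        s, route, tabu = 0, [0], {0}                            # steps 4-5
        a = choose(s, [b for b in range(M) if b not in tabu])   # step 6
        tabu.add(a)
        route.append(a)
        while True:
            remaining = [b for b in range(M) if b not in tabu]
            if not remaining:                                   # step 7
                break
            r = -d[s][a]                                        # step 8
            c = choose(a, remaining)                            # step 10: next action from state e = a
            tabu.add(c)
            route.append(c)
            # step 11: SARSA update uses Q(e, c) for the selected action c
            Q[s][a] = (1 - rho) * Q[s][a] + rho * (r + theta * Q[a][c])
            s, a = a, c                                         # step 12
        return route

   This pass would be called once per iteration n with the dynamic ρ(n), ε(n), θ(n), after which the best
route is kept as in steps 13-14.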

6. Ant-Q method with dynamic parameters for the traveling salesman
   problem
   The cost function (target function) is defined as
                                F(x) = d_{x_M, x_1} + Σ_{i=1}^{M−1} d_{x_i, x_{i+1}} → min_x ,                    (3)

where d_{x_i, x_{i+1}} is the weight of the edge (x_i, x_{i+1}), x_i, x_{i+1} ∈ V,
x is the vertices vector (route).

   The method consists of the following steps:
   1. Initialization.
   1.1. The maximum number of iterations N , the length of the vertices vector M , population size
   K , the discrete set of states (vertices) S = {1,..., M } , the discrete set of actions (vertices)
   A = {1,..., M}, the matrix of edge weights [d_ij], i, j ∈ 1, M, parameters ρ_min, ρ_max (control the
   learning rate), 0 < ρ_min < ρ_max < 1, parameters ε_min, ε_max for the modified ε-greedy approach,
   0 < ε_min < ε_max < 1, parameters θ_min, θ_max (determine the importance of the future reward),
   0 < θ_min < θ_max < 1, are specified.
   1.2. The optimal vertices vector x* is defined by randomly ordering the set A.
   1.3. The reward table (pheromone level) is initialized as
                                      Q = [Q(i, j)], Q(i, j) = 0, i, j ∈ 1, M.
   2. Iteration number is n=1.
   3. Parameters are calculated:
                                     ρ(n) = ρ_max − (ρ_max − ρ_min)·(n − 1)/(N − 1),
                                     ε(n) = ε_max − (ε_max − ε_min)·(n − 1)/(N − 1),
                                     θ(n) = θ_min + (θ_max − θ_min)·(n − 1)/(N − 1).
   4. The initial state (vertex) s_k = 1 is observed, which becomes a new vertex of the vertices vector,
   i.e. x_{k,1} = 1, k ∈ 1, K.
   5. The set of prohibited actions (vertices) A_k^tabu = {1}, k ∈ 1, K, is initialized.
   6. Ant number is k = 1 .
   7. Calculate the transition probabilities of the k-th ant from vertex s_k to the other vertices, b ∈ 1, M:
         p_{s_k,b} = ((1 − ρ(n)) Q(s_k, b) + ρ(n)·(1/d_{s_k,b})) / Σ_{l ∈ A \ A_k^tabu} ((1 − ρ(n)) Q(s_k, l) + ρ(n)·(1/d_{s_k,l})), if b ∈ A \ A_k^tabu,
         p_{s_k,b} = 0, if b ∉ A \ A_k^tabu.
   8. An action (vertex) a_k is selected, to which it is necessary to move from vertex s_k, using the
   modified ε-greedy approach (if U(0,1) < ε(n), then choose an action a_k satisfying the inequality
   Σ_{b=1}^{a_k − 1} p_{s_k,b} < U(0,1) ≤ Σ_{b=1}^{a_k} p_{s_k,b}, from the set of allowed actions (vertices)
   A \ A_k^tabu, otherwise choose an action (vertex) a_k as the nearest neighbor of vertex s_k from the set
   of allowed actions (vertices) A \ A_k^tabu, i.e. a_k = argmax_b {(1 − ρ(n)) Q(s_k, b) + ρ(n)·(1/d_{s_k,b})},
   b ∈ A \ A_k^tabu). The selected action (vertex) is included in the set of forbidden actions (vertices),
   i.e. A_k^tabu = A_k^tabu ∪ {a_k}, and becomes the new vertex of the vertices vector, i.e. x_{k,|A_k^tabu|} = a_k.

   9. A new state (vertex) e_k = a_k is observed.
   10. The element of the reward table (pheromone level) Q(s_k, a_k) is calculated as
                 Q(s_k, a_k) = (1 − ρ(n)) Q(s_k, a_k) + ρ(n) θ(n) max_b Q(e_k, b), b ∈ A \ A_k^tabu.
   11. Set the current state (vertex) as s_k = a_k.
   12. If the current ant is not the last one, i.e. k < K, then increase the ant number, i.e. k = k + 1, and
   go to step 7.
   13. If there are still allowed actions (vertices) left, i.e. A \ A_K^tabu ≠ ∅, then go to step 6.
   14. If the best value of the target function at the current iteration is less than the best value of the
   target function for all previous iterations, i.e. min_{k ∈ 1,K} F(x_k) < F(x*), then replace the best
   vertices vector, i.e. x* = arg min_{k ∈ 1,K} F(x_k).
   15. The table of current rewards R(s, a) is calculated as
                 R(s, a) = Σ_{k=1}^{K} χ_{Z_sa}(x_k) / F(x_k), s ∈ 1, M − 1, a ∈ s + 1, M,
                 χ_{Z_sa}(x_k) = 1 if x_k ∈ Z_sa, and χ_{Z_sa}(x_k) = 0 if x_k ∉ Z_sa,
where Z_sa is the set of vertices vectors that contain the edge (s, a) or (a, s).
   16. The reward table (pheromone level) is calculated as
                 Q(s, a) = (1 − ρ(n)) Q(s, a) + ρ(n) R(s, a), s ∈ 1, M − 1, a ∈ s + 1, M.
  17. If the current iteration is not the last one, i.e. n < N , then increase the iteration number, i.e.
   n = n + 1 , go to step 3, otherwise stop.
     Comment: U(0,1) is a function that returns a uniformly distributed random number in the range [0, 1].
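
   A minimal Python sketch of the Ant-Q method above is given below (0-based vertices; it reuses the
route_length helper from Eq. (1), follows steps 1-17 under the assumptions already stated, and uses
illustrative names rather than the authors' implementation):

    import random

    def has_edge(route, s, a):
        """True if the closed route contains edge (s, a) or (a, s), cf. the set Z_sa."""
        M = len(route)
        for i in range(M):
            u, v = route[i], route[(i + 1) % M]
            if (u, v) in ((s, a), (a, s)):
                return True
        return False

    def ant_q_tsp(d, N=10, K=20, rho_min=0.1, rho_max=0.9,
                  eps_min=0.1, eps_max=0.9, theta_min=0.1, theta_max=0.9):
        """Ant-Q with dynamic parameters for the TSP (steps 1-17)."""
        M = len(d)
        Q = [[0.0] * M for _ in range(M)]                # step 1.3: reward table (pheromone level)
        best = random.sample(range(M), M)                # step 1.2: random initial best route
        for n in range(1, N + 1):
            frac = (n - 1) / (N - 1)                     # step 3: dynamic parameters
            rho = rho_max - (rho_max - rho_min) * frac
            eps = eps_max - (eps_max - eps_min) * frac
            theta = theta_min + (theta_max - theta_min) * frac
            routes = [[0] for _ in range(K)]             # steps 4-5: every ant starts at the first vertex
            tabus = [{0} for _ in range(K)]
            for _ in range(M - 1):                       # steps 6-13: all ants build their tours
                for k in range(K):
                    s = routes[k][-1]
                    allowed = [b for b in range(M) if b not in tabus[k]]

                    def weight(b):                       # step 7: attractiveness of vertex b
                        return (1 - rho) * Q[s][b] + rho / d[s][b]

                    if random.random() < eps:            # step 8: roulette-wheel selection
                        total = sum(weight(b) for b in allowed)
                        u, acc, a = random.random() * total, 0.0, allowed[-1]
                        for b in allowed:
                            acc += weight(b)
                            if u <= acc:
                                a = b
                                break
                    else:                                # step 8: greedy selection
                        a = max(allowed, key=weight)
                    tabus[k].add(a)
                    routes[k].append(a)
                    remaining = [b for b in range(M) if b not in tabus[k]]
                    if remaining:                        # step 10: local update of the pheromone level
                        Q[s][a] = (1 - rho) * Q[s][a] + rho * theta * max(Q[a][b] for b in remaining)
            lengths = [route_length(r, d) for r in routes]
            if min(lengths) < route_length(best, d):     # step 14: keep the best route
                best = routes[lengths.index(min(lengths))][:]
            for s in range(M - 1):                       # steps 15-16: global pheromone update
                for a in range(s + 1, M):
                    inc = sum(1.0 / lengths[k] for k in range(K) if has_edge(routes[k], s, a))
                    Q[s][a] = (1 - rho) * Q[s][a] + rho * inc
        return best, route_length(best, d)

   With N = 10 and K = 20 this matches the setting reported in the numerical study below.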

 7. Experiments and results
    The numerical study of the proposed optimization methods was carried out in Python.
    For the Q-learning, SARSA, and Ant-Q methods with dynamic parameters, the parameter values were
 ρ_min = 0.1, ρ_max = 0.9 (control the learning rate and the evaporation rate), ε_min = 0.1, ε_max = 0.9
 for the ε-greedy approach, and θ_min = 0.1, θ_max = 0.9 (determine the importance of the future reward).
     For the Ant-Q method with dynamic parameters, the ant population size is K = 20 .
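    As an illustrative usage of the sketches above with these settings (the 4-city distance matrix is
 made-up example data, not from the study):

    # A tiny 4-city distance matrix (example data only).
    d = [[0, 2, 9, 10],
         [2, 0, 6, 4],
         [9, 6, 0, 3],
         [10, 4, 3, 0]]

    # 300 iterations for Q-learning (and SARSA), 10 iterations and K = 20 ants
    # for Ant-Q; the 0.1/0.9 parameter bounds are the sketches' default values.
    best_route, best_len = q_learning_tsp(d, N=300)
    ant_route, ant_len = ant_q_tsp(d, N=10, K=20)
    print(best_route, best_len)
    print(ant_route, ant_len)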
    For the problem of the shortest path in the world of tiles, the search for a solution was carried out
 on the standard database https://digitalcommons.du.edu/gridmaps2D, which is described in [25] and is
 traditionally used to test methods for solving route-finding problems in the world of tiles.
    The dependence of parameter θ(n) on the iteration number is defined as
 θ(n) = θ_min + (θ_max − θ_min)·(n − 1)/(N − 1) and is shown in Figure 1.



 Figure 1: Dependence of parameter θ (n) on iteration number n

     The dependence (Figure 1) of parameter θ(n) on the iteration number n shows that its value increases
 with the iteration number.
     The dependence of parameters ρ(n) and ε(n) on the iteration number is determined as
 ρ(n) = ρ_max − (ρ_max − ρ_min)·(n − 1)/(N − 1) and ε(n) = ε_max − (ε_max − ε_min)·(n − 1)/(N − 1) and is
 shown in Figure 2.
     The dependence (Figure 2) of parameters ρ(n) and ε(n) on the iteration number n shows that their
 values decrease with an increase in the iteration number.


  Figure 2: Dependence of parameters ρ (n) and ε (n) on iteration number n

      The results of the comparison of the proposed methods with the methods based on the ant colony
   algorithm without simulated annealing and a random pheromone level, described in [16–22], are
   presented in Tables 1 and 2.
      Table 1 compares the proposed Q-learning method with dynamic parameters with the traditional
   Q-learning method based on the mean squared error criterion and the number of iterations for solving
   the traveling salesman problem. Similar results were obtained for the proposed SARSA method with
   dynamic parameters compared with the traditional SARSA method.

  Table 1
  Comparison of the proposed optimization method with the traditional Q-learning method based on
  the mean squared error criterion and the number of iterations for solving the «traveling salesman»
  problem
                  Mean squared error                                 Number of iterations
           proposed              existing                     proposed                existing
             0.05                  0.05                          300                    2000

     The results of comparing the proposed Ant-Q method with dynamic parameters and with the
  traditional Ant-Q method based on the mean squared error criterion and the number of iterations for
  solving the traveling salesman problem are presented in Table 2.

  Table 2
  Comparison of the proposed optimization method with the traditional Ant-Q method based on the
  mean squared error criterion and the number of iterations for solving the «traveling salesman»
  problem
                  Mean squared error                              Number of iterations
           proposed              existing                  proposed                existing
             0.05                  0.05                       10                      200
8. Discussion
   Advantages of the proposed methods:
   1. The modification of the Q-learning and SARSA methods using dynamic parameters makes it possible to
   increase the learning rate while maintaining the mean squared error of the method. The learning rate
   of the modified Q-learning and SARSA methods with dynamic parameters increased by about 7 times
   compared with the known methods (Table 1).
   2. In the Ant-Q method with dynamic parameters, the ε-greedy Q-learning approach is used to
   select a new vertex, which is close to random search at initial iterations, and close to directed
   search at final iterations. This is achieved by using dynamic Q-learning parameters and makes it
   possible to increase the learning rate while maintaining the mean squared error of the method
   (Table 2). The learning rate of the modified Ant-Q method with dynamic parameters increased
   by 10 times compared with the known method.
   3. In the Ant-Q method with dynamic parameters, when calculating the change in the global pheromone
   level, the pheromone increment (the current reward table in Q-learning terms) plays the main role at
   the initial iterations, which ensures the breadth of the search, while at the final iterations the
   previous pheromone level (the global reward table in Q-learning terms) plays the main role, which
   ensures the convergence of the method. This is achieved by using dynamic Q-learning parameters and
   makes it possible to increase the learning rate while maintaining the mean squared error of the
   method (Table 2).
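   For illustration, with the parameter bounds used in the numerical study (ρ_min = 0.1, ρ_max = 0.9), the
   global update Q(s, a) = (1 − ρ(n)) Q(s, a) + ρ(n) R(s, a) at the first iteration becomes approximately
   Q ← 0.1·Q + 0.9·R, so the pheromone increment dominates, whereas at the last iteration it becomes
   Q ← 0.9·Q + 0.1·R, so the previously accumulated pheromone level dominates.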

9. Conclusions
   1. The paper proposes a modification of the Q-learning and SARSA methods by using dynamic
   parameters in the reward table update rule, which makes it possible to increase the learning rate.
   2. The paper proposes a modification of the Ant-Q method by using dynamic parameters in the
   reward table update rule and the ε-greedy vertex selection approach, which makes it possible to
   increase the learning rate.
   3. The proposed optimization methods, by exploring the entire search space at the initial iterations
   and directing the search at the final iterations, make it possible to ensure high accuracy in solving
   the traveling salesman problem.
   Practical value. The proposed methods allow expanding the scope of reinforcement learning and
   metaheuristics, which is confirmed by their adaptation for the specified optimization problem, and
   improves the efficiency of general and special-purpose intelligent computer systems.
   The prospect of further research is to investigate the proposed methods for a wide class of artificial
   intelligence problems.

10. References
[1] X. Yu, M. Gen, Introduction to evolutionary algorithms, London: Springer-Verlag, 2010. doi:
    10.1007/978-1-84996-129-5.
[2] A. P. Engelbrecht, Computational Intelligence: an introduction, Chichester, West Sussex, Wiley
    & Sons, 2007. doi: 10.1002/9780470512517.
[3] El-G. Talbi, Metaheuristics: from design to implementation, Hoboken, New Jersey: Wiley &
    Sons, 2009. doi: 10.1002/9780470496916.
[4] X.-S. Yang, Nature-inspired Algorithms and Applied Optimization, Cham: Springer, 2018. doi:
    10.1007/978-3-642-29694-9.
[5] A. Nakib, El-G. Talbi, Metaheuristics for Medicine and Biology, Berlin: Springer-Verlag, 2017.
    doi: 10.1007/978-3-662-54428-0.
[6] F. Glover, G. A. Kochenberger, Handbook of metaheuristics, Dordrecht: Kluwer Academic
    Publishers, 2003. doi: 10.1007/B101874.
[7] S. Subbotin, A. Oliinyk, V. Levashenko, E. Zaitseva, Diagnostic rule mining based on artificial
     immune system for a case of uneven distribution of classes in sample, Communications, volume
     3 (2016) 3-11.
[8] X.-S. Yang, Optimization Techniques and Applications with Examples, Hoboken, New Jersey:
     Wiley & Sons, 2018. doi: 10.1002/9781119490616.
[9] A. L. C. Ottoni, E. G. Nepomuceno, M. S. de Oliveira, D. C. R. de Oliveira, Reinforcement
     learning for the traveling salesman problem with refueling, in: Complex & Intelligent Systems,
     vol. 8, 2021, pp. 1-15. doi: 10.1007/s40747-021-00444-4.
[10] M. Dorigo, L. Gambardella, Ant-Q: A reinforcement learning approach to the traveling salesman
     problem, in: Proceedings of ML-95, Twelfth International Conference on Machine Learning, 1995,
     pp. 252-260.
[11] C. Blum, G. R. Raidl, Hybrid Metaheuristics. Powerful Tools for Optimization, Cham: Springer,
     2016. doi: 10.1007/978-3-319-30883-8.
[12] R. Martí, P. M. Pardalos, M. G. C. Resende, Handbook of Heuristics, Cham: Springer, 2018.
     doi: 10.1007/978-3-319-07124-4.
[13] O. Bozorg Haddad, M. Solgi, H. Loaiciga, Meta-heuristic and Evolutionary Algorithms for
     Engineering Optimization, Hoboken, New Jersey: Wiley & Sons, 2017. doi:
     10.1002/9781119387053.
[14] M. Gendreau, J.-Y. Potvin, Handbook of Metaheuristics, New York: Springer,
     2010. doi: 10.1007/978-1-4419-1665-5.
[15] B. Chopard, M. Tomassini, An Introduction to Metaheuristics for Optimization, New York:
     Springer, 2018. doi: 10.1007/978-3-319-93073-2.
[16] J. Radosavljević, Metaheuristic Optimization in Power Engineering, New York: Institution of
     Engineering and Technology, 2018. doi: 10.1049/PBPO131E.
[17] K. F. Doerner, M. Gendreau, P. Greistorfer, W. Gutjahr, R. F. Hartl, M. Reimann,
     Metaheuristics. Progress in Complex Systems Optimization, New York: Springer, 2007. doi:
     10.1007/978-0-387-71921-4.
[18] M. Dorigo, V. Maniezzo, A. Colorni, Ant system: optimization by a colony of cooperating
     agents, in: IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 26,
     issue 1, 1996, pp. 29–41. doi: 10.1109/3477.484436
[19] A. Colorni, M. Dorigo, V. Maniezzo, M. Trubian, Ant system for job-shop scheduling, in:
     Belgian Journal of Operations Research, Statistics and Computer Science, vol. 34, issue 1, 1994,
     pp. 39–53.
[20] L. M. Gambardella, É. D. Taillard, M. Dorigo, Ant colonies for the quadratic assignment
     problem, in: Journal of the Operational Research Society, vol. 50, issue 2, 1999, pp. 167–176.
     doi: 10.1057/palgrave.jors.2600676
[21] H. Min, P. Dazhi, Y. Song, An improved hybrid ant colony algorithm and its application in
     solving TSP, in: Proceedings of the 7th IEEE Joint International Information Technology and
     Artificial Intelligence Conference (ITAIC '14), 2014, pp. 423–427.
[22] K.-L. Du, M. N. S. Swamy, Search and Optimization by Metaheuristics. Techniques and
     Algorithms Inspired by Nature, Cham: Springer, 2016. doi: 10.1007/978-3-319-41192-7.
[23] E. Alba, A. Nakib, P. Siarry, Metaheuristics for Dynamic Optimization, Berlin: Springer-Verlag,
     2013. doi: 10.1007/978-3-642-30665-5
[24] J. Brownlee, Clever algorithms: nature-inspired programming recipes, Melbourne: Brownlee,
     2011.
[25] O. O. Grygor, E. E. Fedorov, T. Yu. Utkina, A. G. Lukashenko, K. S. Rudakov, D. A. Harder, V.
     M. Lukashenko, Optimization method based on the synthesis of clonal selection and annealing
     simulation algorithms, Radio Electronics, Computer Science, Control (2019) 90-99. doi:
     10.15588/1607-3274-2019-2-10.
[26] E. Fedorov, V. Lukashenko, T. Utkina, A. Lukashenko, K. Rudakov, Method for parametric
     identification of Gaussian mixture model based on clonal selection algorithm, in: CEUR
     Workshop Proceedings, vol. 2353, 2019. pp. 41-55.
[27] N. R. Sturtevant, Benchmarks for Grid-Based Pathfinding, in: IEEE Transactions on
     Computational Intelligence and AI in Games, vol. 4, issue 2, 2012, pp. 144–148.
     doi:10.1109/TCIAIG.2012.2197681.