<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Selection of Deep Reinforcement Learning Using a Genetic Algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yurii Kryvenchuk</string-name>
          <email>yurkokryvenchuk@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro Petrenko</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dariusz Cichoń</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuriy Malynovskyy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tetiana Helzhynska</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AGH University of Science and Technology</institution>
          ,
          <addr-line>al. Mickiewicza 30, Krakow, 30059</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepana Bandery Street 12, Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Neural network models contain many parameters. Selecting these parameters and choosing the right model take a long time, and model developers typically rely on expert judgement to select models and their hyperparameters. This paper examines the use of a genetic algorithm as an alternative to the current design process. The genetic algorithm automatically selects both the network hyperparameters and the network model itself, which improves the model and reduces development time. The general model presents an algorithm whose input parameters are those required to represent the possible states and whose output parameters are sufficient to describe the possible actions. The algorithm automatically selects different models for different parameters. It is shown that the algorithm can start from a low-performing model template, reach good model performance, and tune the number of layers, the policy, the entropy coefficient, and other settings. This shows the potential for further application of these algorithms to drone design.</p>
      </abstract>
      <kwd-group>
        <kwd>Artificial Intelligence (AI)</kwd>
        <kwd>reinforcement learning (RL)</kwd>
        <kwd>deep reinforcement learning (DRL)</kwd>
        <kwd>genetic algorithm (GA)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The application of reinforcement learning has grown in popularity over the last few years due to its
success in solving complex sequential decision-making problems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While the most impressive results
have been achieved in classical single-task reinforcement learning with a static environment and a fixed
reward function, contextual reinforcement learning promises to stimulate the next wave of breakthroughs
by exploiting similarities between environments and tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Reinforcement learning (RL) allows agents to learn complex behaviours by interacting with the
environment. Combinations of RL paradigms with powerful function approximators, commonly
referred to as deep RL (DRL), have led to superhuman performance in various simulated domains
[
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Despite their outstanding results, DRL algorithms suffer from high sample complexity. Thus,
many studies aim to reduce sample complexity by improving the exploration behaviour of RL
agents on a single task.
      </p>
      <p>Recently, a growing number of curriculum-generation algorithms have been presented, which
empirically demonstrate that curriculum learning (CL) is an appropriate tool for improving the
sample efficiency of DRL algorithms. However, these algorithms are based on heuristics and concepts
that are not yet well understood theoretically, which limits further improvement.
The basic idea of DRL is that an artificial agent can learn by interacting with the environment, much
like a biological agent. Using the experience gained, the artificial agent should be able to optimize
given objectives expressed as cumulative rewards.</p>
      <p>
        This approach applies, in principle, to any sequential decision-making problem that relies on
experience. The environment can be stochastic; the agent may observe only partial information about
the current state; observations can be high-dimensional (e.g., frames and time series); and the agent may
either gain experience freely in the environment or, conversely, be limited in the data available (e.g., no access to
an accurate simulator, or limited data) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>The basic idea of reinforcement learning is to teach the agent to interact with the environment in a
way that brings it as close as possible to its goals. Such training resembles
the natural way in which people and other living beings learn. The environment itself, together with the outcomes of the
decisions the agent makes based on observing it, acts as the teacher.</p>
      <p>
        At the beginning of its existence, the brain does not know how to behave in the world.
However, it sends signals to the organs of movement and receives data from the senses. Based on
the data obtained, it determines whether the previous action brought it closer to the desired result. For
example, while learning, a child does not understand the purpose of objects and tries to
taste them, because taste is one of the most developed senses in early childhood. When the object
found brings the desired result (pleasure), the pattern of behavior that led to this result is fixed in the
brain in the form of a more stable neural connection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Reinforcement learning algorithms were developed
from the observation of similar learning processes in nature.
      </p>
      <p>
        Reinforcement learning algorithms have two main goals. The first is to minimize the
number of errors, and thereby to reduce the number of steps and reach the goals faster.
The agent learns to analyze the state of the environment before each action and to predict the
expected outcome of the likely next steps. The second goal
is to maximize the benefit of the actions taken. Here, the definition of
benefit is programmed in advance: for example, minimizing execution time or finding as
much space for actions as possible [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Returning to the example from nature, for humans the benefit may be
the release of the "happiness hormone" on achieving a goal.
      </p>
      <p>The primary purpose of this study is to simulate a
combination of two algorithms that exist in nature: the reinforcement learning algorithm,
by which each agent pursues its goal, and the genetic algorithm, which
selects the agents that have shown the best results in the environment.</p>
      <p>The task of the genetic algorithm is to optimize the basic parameters of a set of agents through
natural selection. By analogy with living nature, the genetic algorithm in the presented experiment
selects the best representatives of each generation and combines the values of their basic model parameters
to create the next, better-adapted generation. Likewise, living beings are born with certain parameters
inherited from their ancestors and, depending on how successfully they adapt to the environment, have the
opportunity to produce offspring. It is therefore expected that the best individuals from each
generation will be selected and that their descendants will be even better adapted to the
environment.</p>
    </sec>
    <sec id="sec-3">
      <title>3. State of the art</title>
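      <p>The general RL problem is formalized as a discrete-time stochastic control process in which the
agent interacts with its environment as follows. The agent starts in a given state of its environment,
<inline-formula><tex-math>s_0 \in \mathcal{S}</tex-math></inline-formula>, and collects an initial observation
<inline-formula><tex-math>\omega_0 \in \Omega</tex-math></inline-formula>. At each time step
<inline-formula><tex-math>t</tex-math></inline-formula>, the agent must perform an action
<inline-formula><tex-math>a_t \in \mathcal{A}</tex-math></inline-formula>. As shown in Figure 1, this has three consequences:</p>
      <p>The agent receives a reward <inline-formula><tex-math>r_t \in \mathcal{R}</tex-math></inline-formula>;</p>
      <p>The state transitions to <inline-formula><tex-math>s_{t+1} \in \mathcal{S}</tex-math></inline-formula>;</p>
      <p>The agent receives the observation <inline-formula><tex-math>\omega_{t+1} \in \Omega</tex-math></inline-formula>.</p>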
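      <p>The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration only: the toy environment <monospace>BitMatchEnv</monospace>, its reward rule, and the function <monospace>run_episode</monospace> are hypothetical names introduced here and are not part of the original study.</p>
      <preformat><![CDATA[
```python
import random


class BitMatchEnv:
    """Hypothetical toy environment: the reward is 1 when the action equals the current bit."""

    def __init__(self, horizon=10, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0
        self.state = None

    def reset(self):
        # The agent starts in an initial state s_0 and collects the observation w_0.
        self.t = 0
        self.state = self.rng.randint(0, 1)
        return self.state  # fully observed: the observation equals the state

    def step(self, action):
        # Consequence 1: the agent receives a reward r_t.
        reward = 1.0 if action == self.state else 0.0
        # Consequence 2: the state transitions to s_{t+1}.
        self.state = self.rng.randint(0, 1)
        self.t += 1
        done = self.t >= self.horizon
        # Consequence 3: the agent receives the observation w_{t+1}.
        return self.state, reward, done


def run_episode(env, policy):
    """Run one episode and return the sum of rewards collected by `policy`."""
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total_reward += reward
    return total_reward
```
]]></preformat>
      <p>Because this toy environment is fully observed, a policy that simply copies the observation collects the maximum reward at every step, whereas a random policy collects less on average.</p>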
      <p>
        This control setting was first proposed by Bellman [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and later extended to learning by Barto et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
A comprehensive treatment of the fundamentals of RL is provided by Sutton and Barto [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
When an agent can observe the complete state of the
environment, the environment is said to be fully observed. When an agent can see only a partial
observation, the environment is said to be partially observed. Different environments allow
different kinds of actions. The set of all valid actions in a given environment is often called the action
space. Some environments, such as Atari and Go, have discrete action spaces, where only a finite
number of moves are available to the agent. Other environments, such as those in which an agent controls a
robot in the physical world, have continuous action spaces. In continuous spaces, actions are
real-valued vectors.
      </p>
      <p>A policy is a decision-making rule that an agent uses to choose which action to take. It can be
deterministic, in which case it is usually denoted by <inline-formula><tex-math>\mu</tex-math></inline-formula>:
<inline-formula><tex-math>a_t = \mu(s_t)</tex-math></inline-formula>. Because the policy is essentially the
agent's brain, the word "policy" is often used in place of "agent", for example in the statement "The policy
tries to maximize reward." DRL deals with parameterized policies: policies whose outputs are
computable functions that depend on a set of parameters (such as neural network weights and biases),
which can be adjusted to change the behavior using an optimization algorithm [20].</p>
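      <p>The reward function <inline-formula><tex-math>R</tex-math></inline-formula> is critical in reinforcement learning. It depends on the current state
of the world, the action just taken, and the next state of the world:
<inline-formula><tex-math>r_t = R(s_t, a_t, s_{t+1})</tex-math></inline-formula>. The agent's goal
is to maximize some notion of cumulative reward over a trajectory <inline-formula><tex-math>\tau</tex-math></inline-formula>, but this can mean several things. All
these cases are denoted by <inline-formula><tex-math>R(\tau)</tex-math></inline-formula>; either it will be clear from the context which case is meant, or it will not matter
(because the same equations apply to all cases).</p>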
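      <p>For illustration, two common choices of the cumulative reward are the finite-horizon undiscounted return and the infinite-horizon discounted return. These are the standard textbook definitions rather than something specified in this paper; the horizon T and the discount factor γ are symbols introduced here for the example.</p>
      <preformat><![CDATA[
```latex
R(\tau) = \sum_{t=0}^{T} r_t
\qquad \text{or} \qquad
R(\tau) = \sum_{t=0}^{\infty} \gamma^{t} r_t, \quad \gamma \in (0, 1).
```
]]></preformat>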
      <p>
        Knowing the value of a state or state-action pair is frequently useful. The value is the expected
return obtained when starting in that state or state-action pair and then acting according to a particular
policy. Value functions are used, one way or another, in almost every RL algorithm [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
The on-policy value function <inline-formula><tex-math>V^{\pi}(s)</tex-math></inline-formula> gives the expected return if the agent starts in state
<inline-formula><tex-math>s</tex-math></inline-formula> and always acts according to the policy <inline-formula><tex-math>\pi</tex-math></inline-formula> [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
        <disp-formula><tex-math>V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]</tex-math></disp-formula>
      </p>
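      <p>The on-policy action-value function <inline-formula><tex-math>Q^{\pi}(s, a)</tex-math></inline-formula> gives the expected return if the agent starts in state
<inline-formula><tex-math>s</tex-math></inline-formula>, takes an arbitrary action <inline-formula><tex-math>a</tex-math></inline-formula> (which may not come from the policy), and then acts according to the
policy <inline-formula><tex-math>\pi</tex-math></inline-formula> forever after [8]:
        <disp-formula><tex-math>Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s,\ a_0 = a \right]</tex-math></disp-formula>
      </p>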
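      <p>The optimal value function <inline-formula><tex-math>V^{*}(s)</tex-math></inline-formula> gives the expected return if the agent starts in state
<inline-formula><tex-math>s</tex-math></inline-formula> and always acts according to the optimal policy in the environment [8]:
        <disp-formula><tex-math>V^{*}(s) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]</tex-math></disp-formula>
      </p>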
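      <p>The optimal action-value function <inline-formula><tex-math>Q^{*}(s, a)</tex-math></inline-formula> gives the expected return if the agent starts in state
<inline-formula><tex-math>s</tex-math></inline-formula>, takes an arbitrary action <inline-formula><tex-math>a</tex-math></inline-formula>, and then acts according to the optimal policy in the environment forever after:
        <disp-formula><tex-math>Q^{*}(s, a) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s,\ a_0 = a \right]</tex-math></disp-formula>
      </p>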
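      <p>There is an essential connection between the optimal action-value function <inline-formula><tex-math>Q^{*}(s, a)</tex-math></inline-formula> and the action
chosen by the optimal policy. By definition, <inline-formula><tex-math>Q^{*}(s, a)</tex-math></inline-formula> gives the expected return for starting in state
<inline-formula><tex-math>s</tex-math></inline-formula>, taking an (arbitrary) action <inline-formula><tex-math>a</tex-math></inline-formula>, and then acting according to the optimal policy forever after. The optimal policy in
<inline-formula><tex-math>s</tex-math></inline-formula> will select whichever action maximizes the expected return from starting in <inline-formula><tex-math>s</tex-math></inline-formula>. As a result, given
<inline-formula><tex-math>Q^{*}</tex-math></inline-formula>, the optimal action <inline-formula><tex-math>a^{*}(s)</tex-math></inline-formula> can be obtained directly [8]:
        <disp-formula><tex-math>a^{*}(s) = \arg\max_{a} Q^{*}(s, a)</tex-math></disp-formula>
      </p>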
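      <p>For a small discrete problem, reading the optimal action off the optimal action-value function amounts to a table lookup. The sketch below assumes a tabular representation stored as a nested dictionary; the states, actions, and numerical values are hypothetical and serve only to illustrate the argmax relation.</p>
      <preformat><![CDATA[
```python
def greedy_action(q_values, state):
    """Return a*(s) = argmax_a Q*(s, a) for a tabular Q stored as {state: {action: value}}."""
    actions = q_values[state]
    return max(actions, key=actions.get)


def state_value(q_values, state):
    """Return max_a Q(s, a); this equals V*(s) when Q is the optimal action-value function."""
    return max(q_values[state].values())


# Hypothetical optimal action values for a two-state, two-action problem.
q_star = {
    "s0": {"left": 0.2, "right": 1.0},
    "s1": {"left": 0.7, "right": 0.1},
}
```
]]></preformat>
      <p>With these illustrative values, the greedy action is "right" in state "s0" and "left" in state "s1".</p>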
    </sec>
    <sec id="sec-4">
      <title>3.3. Types of RL algorithms</title>
      <p>It is difficult to draw up an accurate, comprehensive taxonomy of algorithms in the
modern RL space [8], because the modularity of the algorithms is poorly represented by a tree-like
structure (Figure 2).</p>
      <p>One of the most important branching points among RL algorithms is whether the agent has access to
(or learns) a model of the environment. A model of the environment is a function that predicts
state transitions and rewards. The main disadvantage is that a ground-truth model of the environment
is usually not available to the agent. If the agent wants to use a model in this case, it must
learn the model purely from experience, which creates several problems. The most serious is that
bias in the learned model can be exploited by the agent: the agent performs well with respect to the learned model but
behaves sub-optimally in the real environment. Learning a model is fundamentally difficult, so even
intense effort, that is, a willingness to spend a lot of time and computation on it, may not pay off. Algorithms that use a
model are called model-based methods, and those that do not are called model-free methods. While
model-free methods forgo the potential gains in sample efficiency that come from using a model, they are generally
easier to implement and tune [19].</p>
      <p>Trade-offs between policy optimization and Q-learning. The main strength of policy optimization
methods is that they are principled, in the sense that they directly optimize the quantity of interest. This makes them stable and reliable. By contrast, Q-learning
methods only indirectly optimize the agent's performance, by training <inline-formula><tex-math>Q_{\theta}</tex-math></inline-formula> to satisfy a self-consistency equation.
There are many failure modes for this kind of training, so it is usually less stable [9]. However, when Q-learning methods
do work, they are substantially more sample-efficient, because they can reuse data
more effectively than policy optimization methods.</p>
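      <p>The self-consistency idea can be illustrated with the classical tabular Q-learning update, used here as a simple stand-in for the neural-network case. The function name, the learning rate <monospace>alpha</monospace>, and the discount factor <monospace>gamma</monospace> are assumptions made for this sketch; it is not the training procedure used in the experiments.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9, done=False):
    """Move Q(s, a) toward the one-step target r + gamma * max_a' Q(s', a')."""
    # At the end of an episode there is no next state to bootstrap from.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# All action values start at zero; keys are (state, action) pairs.
q = defaultdict(float)
```
]]></preformat>
      <p>Because the update needs only a stored transition (state, action, reward, next state), the same transition can be replayed many times, which is the source of the data reuse mentioned above.</p>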
      <p>Interpolation between policy optimization and Q-learning. Policy optimization and
Q-learning are not incompatible (and in some circumstances turn out to be equivalent), and a number of
algorithms lie between the two extremes. Algorithms on this spectrum can trade off
the strengths and weaknesses of either side. Examples include DDPG and SAC
[10, 11, 17].</p>
    </sec>
    <sec id="sec-5">
      <title>3.4. Metaheuristic algorithms</title>
      <p>In recent years, metaheuristic algorithms have been used to address complex real-world
problems in a variety of fields, such as economics, engineering, politics, and management. Intensification
and diversification are the key elements of a metaheuristic algorithm, and a proper balance between them
is necessary to solve a real problem effectively. Most metaheuristic algorithms are inspired by
biological evolution, swarm behavior, or the laws of physics [12]. These algorithms are
broadly classified into two categories: single-solution-based and population-based metaheuristic
algorithms (Figure 3). Single-solution-based metaheuristics start from a single candidate solution
and improve it through local search. However, the solution obtained in this way
may become stuck in a local optimum [13].</p>
      <p>Evolutionary algorithms work on a set or population of solutions and use two mechanisms to find
good solutions: selecting mostly high-quality solutions from the set and combining the qualities of two
or more solutions with specialized operators to create new solutions. New solutions are reintroduced
into the population after recombination, which may require them to meet conditions such as feasibility
or minimum quality requirements to replace other (usually low-quality) solutions. Operators used in
evolutionary algorithms (selection, recombination, and re-insertion) almost without exception make
extensive use of randomness. A mutation operator is also often used, which randomly changes the
solution after its recombination. Most evolutionary algorithms repeat the selection, recombination,
mutation, and reintroduction phases several times and report the best solution in a population [16].</p>
      <p>Among the metaheuristic algorithms, one of the best known is the genetic algorithm (GA), which
is inspired by the process of biological evolution. GA mimics the Darwinian principle of survival of the
fittest. GA was proposed by J. H. Holland; the frequently cited date of 1992 refers to the reissue of his 1975 monograph. The main elements of GA are chromosome
representation, selection, and biologically inspired operators [15].</p>
      <p>GA dynamically changes the search process through the probabilities of crossover and mutation in order to
reach the optimal solution. GA can modify encoded genes, evaluate many
individuals at once, and produce multiple optimal solutions; it therefore has good global search capabilities.
Offspring derived from the crossover of parental chromosomes are likely to destroy the excellent genetic
patterns of the parents, so the crossover rate is defined as
        <disp-formula><tex-math>R = \frac{G + 2\sqrt{g}}{3G}</tex-math></disp-formula>
where <inline-formula><tex-math>g</tex-math></inline-formula> denotes
the number of generations elapsed and <inline-formula><tex-math>G</tex-math></inline-formula> denotes the total number of evolutionary generations set for the population. The
equation shows that <inline-formula><tex-math>R</tex-math></inline-formula> changes dynamically and increases with the number of evolutionary
generations. In the early stages of GA, the similarity between individuals is quite low, and <inline-formula><tex-math>R</tex-math></inline-formula> should be low so that the new
population does not disrupt the excellent genetic patterns of individuals.
At the end of evolution, the similarity between individuals is relatively high, so the value of <inline-formula><tex-math>R</tex-math></inline-formula> should be high
as well [13].</p>
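      <p>The dynamic crossover rate can be computed directly. The sketch below assumes that the formula reads R = (G + 2√g) / (3G), with g the number of generations elapsed and G the total number of generations; the function name <monospace>crossover_rate</monospace> is introduced here for illustration.</p>
      <preformat><![CDATA[
```python
import math


def crossover_rate(g, G):
    """Dynamic crossover rate R = (G + 2 * sqrt(g)) / (3 * G) for generation g out of G."""
    if G <= 0 or not 0 <= g <= G:
        raise ValueError("expected G > 0 and 0 <= g <= G")
    return (G + 2.0 * math.sqrt(g)) / (3.0 * G)
```
]]></preformat>
      <p>Under this assumption and with G = 100, the rate starts at one third for g = 0 and rises to 0.4 for g = 100, which matches the statement that R grows as evolution proceeds.</p>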
      <p>The classical genetic algorithm has the following form [13]:</p>
      <preformat>Input:
    Population size, n;
    Maximum number of iterations, MAX.
Output:
    The global best solution, Yb.
Begin
    Create an initial population of n chromosomes Yi (i = 1, 2, ..., n);
    Set the iteration counter t = 0;
    Compute the fitness value of each chromosome;
    While (t &lt; MAX)
        Select a pair of chromosomes from the population based on fitness;
        Apply the crossover operation to the selected pair with the crossover probability;
        Apply mutation to the offspring with the mutation probability;
        Replace the old population with the newly created population;
        Increment the iteration counter t by 1.
    End while
    Return the best solution, Yb.
End</preformat>
      <p>According to the schema theorem, the original schema must be replaced by a modified schema. To
preserve population diversity, the new schema keeps the original population at an early stage of
evolution. At the end of evolution, an appropriate schema is produced to prevent any distortion of
the excellent genetic schema [14, 18].</p>
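      <p>The steps of the classical genetic algorithm can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that are not taken from the original study: chromosomes are bit strings, selection is fitness-proportional, crossover is one-point, and mutation flips individual bits. The function name and default values are hypothetical.</p>
      <preformat><![CDATA[
```python
import random


def genetic_algorithm(fitness, n=20, length=16, max_iter=50,
                      p_cross=0.8, p_mut=0.02, seed=0):
    """Classical GA over bit-string chromosomes.

    `fitness` must return a non-negative number. Returns the best chromosome seen.
    """
    rng = random.Random(seed)
    # Create an initial population of n chromosomes.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]
    best = max(population, key=fitness)

    for _ in range(max_iter):
        # The small constant keeps the weights valid if every fitness is zero.
        weights = [fitness(c) + 1e-9 for c in population]
        offspring = []
        while len(offspring) < n:
            # Select a pair of chromosomes based on fitness.
            p1, p2 = rng.choices(population, weights=weights, k=2)
            c1, c2 = p1[:], p2[:]
            # Apply one-point crossover with probability p_cross.
            if rng.random() < p_cross:
                cut = rng.randrange(1, length)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # Apply bit-flip mutation to each gene with probability p_mut.
            for child in (c1, c2):
                for i in range(length):
                    if rng.random() < p_mut:
                        child[i] = 1 - child[i]
                offspring.append(child)
        # Replace the old population with the newly created population.
        population = offspring[:n]
        best = max(population + [best], key=fitness)
    return best
```
]]></preformat>
      <p>For example, calling <monospace>genetic_algorithm(sum)</monospace> searches for the bit string with the largest number of ones. Because the best chromosome is tracked across generations, the returned solution is never worse than the best member of the initial population.</p>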
    </sec>
    <sec id="sec-6">
      <title>4. The results of the study</title>
      <p>As mentioned above, the main idea of the study is to combine analogues of two types
of algorithms that occur in nature: the reinforcement learning algorithm and the genetic
selection algorithm. However, machine learning offers more than one reinforcement learning algorithm,
and it is sometimes difficult to determine which one solves a given problem best.
Besides choosing the algorithm itself, its parameters must also be selected. In nature,
this function is performed by evolution, which changes the parameters from generation to generation
and adapts organisms to survive in their environment. In this experiment, the role of evolution is
played by a genetic algorithm, and the role of the creatures that have to adapt to the environment is
played by the agents of each reinforcement learning algorithm.</p>
      <p>Because crossing different types of reinforcement learning algorithms can lead
to undefined results, each type of algorithm creates its own initial population of agents (Figure
4). Each of these populations is then handled almost separately by the genetic algorithm. The populations
interact only during selection, which is based on the results of the test in the environment: if all agents using a particular
reinforcement learning algorithm show significantly worse results, it is possible that none
of them will have offspring, in which case that algorithm will not be represented in the final
sample of results.</p>
      <p>The results were obtained using Python and libraries such as TensorFlow, Stable Baselines, and
PyGAD. The flowchart of the developed algorithm is presented in Figure 5.</p>
      <p>First, the initial population is created with different DRL algorithms. The chromosome of each
individual in the population encodes the hyperparameters of its RL algorithm, which determines the
diversity of individuals in the population. During the cycles of the genetic algorithm, crossover and mutation
occur between individuals with the same DRL algorithm, which gradually leads to
the selection of the best individuals with the most appropriate DRL algorithm. The genetic algorithm
terminates when the maximum number of generations is reached. It can also terminate
when the agent's reward reaches a specified target value.</p>
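      <p>The scheme described above, in which crossover is restricted to individuals sharing the same DRL algorithm while selection is global, can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the algorithm names other than DQN, the hyperparameter names and ranges, and in particular <monospace>toy_fitness</monospace>, which stands in for the reward obtained by actually training and testing an agent. It does not reproduce the implementation used in the study.</p>
      <preformat><![CDATA[
```python
import random

# Hypothetical search spaces; hyperparameter names and ranges are illustrative only.
SEARCH_SPACE = {
    "DQN": {"learning_rate": (1e-4, 1e-2), "gamma": (0.90, 0.999)},
    "PPO": {"learning_rate": (1e-4, 1e-2), "ent_coef": (0.0, 0.05)},
    "A2C": {"learning_rate": (1e-4, 1e-2), "ent_coef": (0.0, 0.05)},
}


def random_individual(algo, rng):
    genes = {name: rng.uniform(lo, hi) for name, (lo, hi) in SEARCH_SPACE[algo].items()}
    return {"algo": algo, "genes": genes}


def crossover(a, b, rng):
    # Parents must share the same algorithm, so their chromosomes have the same genes.
    assert a["algo"] == b["algo"]
    genes = {name: rng.choice((a["genes"][name], b["genes"][name])) for name in a["genes"]}
    return {"algo": a["algo"], "genes": genes}


def mutate(ind, rng, p_mut=0.1):
    # Each gene is resampled from its range with probability p_mut.
    for name, (lo, hi) in SEARCH_SPACE[ind["algo"]].items():
        if rng.random() < p_mut:
            ind["genes"][name] = rng.uniform(lo, hi)
    return ind


def evolve(fitness, per_algo=10, n_parents=5, generations=5, seed=0):
    rng = random.Random(seed)
    population = [random_individual(a, rng) for a in SEARCH_SPACE for _ in range(per_algo)]
    size = len(population)
    for _ in range(generations):
        # Selection is global, so an algorithm whose agents all score poorly can die out.
        parents = sorted(population, key=fitness, reverse=True)[:n_parents]
        # Keep copies of the parents, then fill up with same-algorithm offspring.
        population = [dict(p, genes=dict(p["genes"])) for p in parents]
        while len(population) < size:
            a = rng.choice(parents)
            mates = [p for p in parents if p["algo"] == a["algo"]]
            population.append(mutate(crossover(a, rng.choice(mates), rng), rng))
    return sorted(population, key=fitness, reverse=True)[:n_parents]


def toy_fitness(ind):
    # Stand-in for the test reward of a trained agent; it favours DQN by construction.
    bonus = 1.0 if ind["algo"] == "DQN" else 0.0
    return bonus - abs(ind["genes"]["learning_rate"] - 1e-3)
```
]]></preformat>
      <p>With the default arguments the initial population contains 30 individuals, 10 per algorithm, and the 5 best are returned. Since the stand-in fitness favours one algorithm by construction, the other algorithms disappear after the first selection step, which mirrors the effect described in the text without saying anything about real training results.</p>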
      <p>The results were tested in a virtual Gym environment. In the CartPole-v1 test, the 5 most
successful individuals were selected from an initial population of 30 individuals after 500 iterations. All
of them turned out to be individuals with the DQN algorithm (Figure 6). They differ
only in the internal structure of their models and in their hyperparameters.</p>
      <p>Figure 7 shows how the time during which the agent receives a positive reward depends on the
generation.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Discussion</title>
      <p>The method for selecting individuals from the population for
gene inheritance could be improved, because individuals of some RL algorithms with poorly chosen initial parameters
quickly lose the competition and, with it, the ability to produce offspring (Figure 8). A larger
population would probably also mitigate this problem, but at the cost of more time and
hardware.</p>
      <p>Another possible improvement concerns the mutation method. Giving
some genes the possibility of mutation would significantly improve the final characteristics.</p>
    </sec>
    <sec id="sec-8">
      <title>6. Conclusions</title>
      <p>The article considers reinforcement learning algorithms and metaheuristic algorithms.
The study shows that different DRL algorithms can be combined with a genetic algorithm
to automatically select the best DRL models for a given problem.</p>
      <p>During the experiment, a population of 30 CartPole agents was analyzed in a virtual Gym
environment. The experiment resulted in the selection of a single DRL algorithm from the sample,
with the selected individuals differing in some hyperparameters of the model.</p>
    </sec>
    <sec id="sec-9">
      <title>7. References</title>
      <p>[8] Joshua Achiam. Spinning Up Documentation. Release. 2020.</p>
      <p>[9] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. 2016.</p>
      <p>[10] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra. Continuous control with deep reinforcement learning. 2015.</p>
      <p>[11] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. 2018.</p>
      <p>[12] Bonabeau, E., Dorigo, M., and Theraulaz, G. Swarm intelligence: from natural to artificial systems. Oxford University Press, Inc., 1999.</p>
      <p>[13] Sourabh Katoch, Sumit Singh Chauhan &amp; Vijay Kumar. A review on genetic algorithm: past, present, and future. 2020.</p>
      <p>[14] Goldberg, D. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA, 1989.</p>
      <p>[15] Genetic Algorithm Implementation in Python. https://towardsdatascience.com/genetic-algorithmimplementation-in-python-5ab67bb124a6</p>
      <p>[16] Kenneth Sörensen (University of Antwerp, Belgium) and Fred Glover (University of Colorado and OptTek Systems, Inc., USA). Metaheuristics.</p>
      <p>[17] Held, D., Geng, X., Florensa, C., and Abbeel, P. (2017). Automatic goal generation for reinforcement learning agents. arXiv preprint arXiv:1705.06366.</p>
      <p>[18] Kryvenchuk Y., Shvorob I., et al. Research by statistical methods of models of the function of transformation of optical circuits of the means of measuring the temperature based on the effect of Raman. CEUR Workshop Proceedings Vol. 2654. 2020.</p>
      <p>[19] Xie, Q., Chen, Y., Wang, Z., and Yang, Z. Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium. arXiv preprint arXiv:2002.07066, 2020.</p>
      <p>[20] Yuan, J. and Lamperski, A. Online convex optimization for cumulative constraints. In Advances in Neural Information Processing Systems, pp. 6137–6146, 2018.</p>
      <p>[21] Miryoosefi, S., Brantley, K., Daume III, H., Dudık, M., and Schapire, R. Reinforcement learning with convex constraints. arXiv preprint arXiv:1906.09323, 2019.</p>
      <p>[22] Liu, Q., Yu, T., Bai, Y., and Jin, C. A sharp analysis of model-based reinforcement learning with self-play. arXiv preprint arXiv:2010.01604, 2020.</p>
      <p>[23] Brantley, K., Dudik, M., Lykouris, T., Miryoosefi, S., Simchowitz, M., Slivkins, A., and Sun, W. Constrained episodic reinforcement learning in concave-convex and knapsack settings. arXiv preprint arXiv:2006.05051, 2020.</p>
      <p>[24] Chen, X., Hu, J., Li, L., and Wang, L. Efficient reinforcement learning in factored mdps with application to constrained rl. arXiv preprint arXiv:2008.13319, 2020.</p>
      <p>[25] Shevcov, A. H., and O. V. Il’yina. "Nejropsyxolohichnyj pidxid u korekciyi rozvytku ditej z psyxofizychnymy porushennyamy." Naukovyj chasopys Nacional"noho pedahohichnoho universytetu imeni MP Drahomanova. Seriya 5: 347-360.</p>
      <p>[26] François-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., &amp; Pineau, J. (2018). An introduction to deep reinforcement learning. arXiv preprint arXiv:1811.12560.</p>
      <p>[27] Okwu, Modestus O., and Lagouge K. Tartibu. "Genetic Algorithm." Metaheuristic Optimization: Nature-Inspired Algorithms Swarm and Computational Intelligence, Theory and Applications. Springer, Cham, 2021. 125-132.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Vincent</given-names>
            <surname>François-Lavet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Henderson</surname>
          </string-name>
          , Riashat Islam, Marc G. Bellemare, and Joelle Pineau.
          <article-title>An Introduction to Deep Reinforcement Learning</article-title>
          .
          <source>Foundations and Trends in Machine Learning</source>
          , Vol.
          <volume>11</volume>
          , No.
          <issue>3-4</issue>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Pascal</given-names>
            <surname>Klink</surname>
          </string-name>
          , Hany Abdulsamad, Boris Belousov, and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Peters</surname>
          </string-name>
          .
          <article-title>Self-paced contextual reinforcement learning</article-title>
          .
          <source>In CoRL</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Marlos C.</given-names>
            <surname>Machado</surname>
          </string-name>
          , Marc G. Bellemare, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Bowling</surname>
          </string-name>
          .
          <article-title>Count-based exploration with the successor representation</article-title>
          .
          <source>In AAAI</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Bellman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>1957b</year>
          . “Dynamic Programming”,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Anderson</surname>
          </string-name>
          . “
          <article-title>Neuronlike adaptive elements that can solve difficult learning control problems”</article-title>
          .
          <source>IEEE transactions on systems, man, and cybernetics</source>
          . (5):
          <fpage>834</fpage>
          -
          <lpage>846</lpage>
          .
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          .
          <article-title>Reinforcement Learning: An Introduction (2nd Edition, in progress)</article-title>
          . MIT Press.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>[7] OpenAI documentation page</article-title>
          . https://spinningup.openai.com/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>