<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Asad Jeewa</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Directed curiosity-driven exploration in hard exploration, sparse reward environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Reward</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of KwaZulu-Natal</institution>
          ,
          <addr-line>Westville 4000</addr-line>
          ,
          <country country="ZA">South Africa</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>0000</year>
      </pub-date>
      <volume>0003</volume>
      <abstract>
        <p>Training agents in hard exploration, sparse reward environments is a difficult task since the reward feedback is insufficient for meaningful learning. In this work, we propose a new technique, called Directed Curiosity, that is a hybrid of Curiosity-Driven Exploration and distance-based reward shaping. The technique is evaluated in a custom navigation task where an agent tries to learn the shortest path to a distant target, in environments of varying difficulty. The technique is compared to agents trained with only a shaped reward signal, only a curiosity signal, and only a sparse reward signal. It is shown that Directed Curiosity is the most successful in hard exploration environments, with the benefits of the approach being highlighted in environments with numerous obstacles and decision points. The limitations of the shaped reward function are also discussed.</p>
      </abstract>
      <kwd-group>
        <kwd>Sparse Rewards Shaping Navigation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A reinforcement learning agent learns how to behave based on rewards and
punishments it receives through interactions with an environment [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The reward
signal is the only learning signal that the agent receives [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Many
environments have extrinsic rewards that are sparsely distributed, meaning that most
timesteps do not return any positive or negative feedback. These environments,
known as sparse reward environments [
        <xref ref-type="bibr" rid="ref12 ref21">12,21</xref>
        ], do not provide sufficient feedback
for meaningful learning to take place [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The most difficult sparse reward
environments are those where an agent only receives a reward for completing a task
or reaching a goal, meaning that all intermediate steps do not receive rewards.
These are referred to as terminal reward environments [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Closely related to the sparse rewards problem is the issue of exploration.
Exploration algorithms aim to reduce the uncertainty of an agent's understanding
of its environment [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It is not possible for an agent to act optimally until it has
sufficiently explored the environment and identified all of the opportunities for
reward [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. An agent may never obtain positive rewards without an intuitive
exploration strategy when rewards are sparse. Hard exploration environments
are environments where local exploration strategies such as ε-greedy are
insufficient [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In these environments, the probability of reaching a goal state through
local exploration is negligible.
      </p>
      <p>
        These types of environments are prevalent in the real world [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and training
reinforcement learning (RL) agents in them forms one of the biggest challenges
in the field [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This research focuses on learning in hard exploration, terminal
reward environments.
      </p>
      <p>
        A popular approach to learning in these environments is reward shaping,
which guides the learning process by augmenting the reward signal with
supplemental rewards for intermediate actions that lead to success [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This ensures
that the agent receives sufficient feedback for learning.
      </p>
      <p>
        The use of intrinsic rewards to replace or augment extrinsic rewards is another area
of research that has exhibited promising results [
        <xref ref-type="bibr" rid="ref17 ref4 ref7">4,7,17</xref>
        ]. Instead of relying on
feedback from the environment, an agent engineers its own rewards. Curiosity is
a type of intrinsic reward that encourages an agent to find "novel" states [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        In this research, we present Directed Curiosity: a new technique that
hybridises reward shaping and Curiosity-Driven Exploration [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] to allow agents
to explore intelligently. The algorithm is defined in Section 3 and the custom
navigation environments used for evaluation are described in Section 4. The
performance of the algorithm is evaluated by comparing it to its constituent
algorithms, i.e. agents trained with only the shaped reward and only the curiosity
reward. Directed Curiosity is shown to be the most robust technique in Section 5.
The environment characteristics that are suited to this technique are highlighted
and the limitations of the shaped reward function are also discussed.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Learning in hard exploration, sparse reward environments is a well-studied area
in reinforcement learning. Reward shaping is a popular approach that augments
the reward signal with additional rewards to enable learning in sparse reward
environments. It is a means of introducing prior knowledge to reduce the number
of suboptimal actions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and guide the learning process [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. A concern is that
when reward shaping is used incorrectly, it can have a detrimental effect and
change the optimal policy or the definition of the task [
        <xref ref-type="bibr" rid="ref15 ref9">9,15</xref>
        ].
      </p>
      <p>
        Potential-Based Reward Shaping has been proven to preserve the optimal
policy of a task [
        <xref ref-type="bibr" rid="ref15 ref9">9,15</xref>
        ]. It defines Φ, a reward function over states that introduces
"artificial" shaped reward feedback [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The potential function F is defined as
the difference between Φ of the next state s′ and Φ of the current state s, with γ as a
discount factor on Φ(s′), i.e. F(s, s′) = γΦ(s′) − Φ(s).
      </p>
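      <p>As an illustrative sketch of the shaping term F(s, s′) = γΦ(s′) − Φ(s), the Python snippet below uses a hypothetical potential (the negative Euclidean distance to a goal). The names potential, shaping_reward and GOAL, and the example trajectory, are assumptions made for illustration and are not taken from the cited works.</p>
      <preformat><![CDATA[
```python
import math

GOAL = (10.0, 0.0)  # hypothetical goal position, for illustration only


def potential(state):
    """Hypothetical potential: negative Euclidean distance to the goal."""
    return -math.dist(state, GOAL)


def shaping_reward(state, next_state, gamma=0.99):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * potential(next_state) - potential(state)


# With gamma = 1 the shaping terms telescope along a trajectory, so their sum
# depends only on the potentials of the first and last states.
trajectory = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0), (10.0, 0.0)]
total = sum(
    shaping_reward(s, s_next, gamma=1.0)
    for s, s_next in zip(trajectory, trajectory[1:])
)
print(total)
```
]]></preformat>
      <p>The telescoping sum gives an intuition for why this form of shaping does not favour detours: with γ = 1, any path between the same two states collects the same total shaping reward.</p>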
      <p>
        The restriction on the form of the reward shaping signal limits its
expressiveness [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Potential-Based Advice is a similar framework that introduces actions
in the potential function [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. A novel Bayesian approach that augments the
reward distribution with prior beliefs is presented in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        It is difficult to manually engineer reward functions for each new
environment [
        <xref ref-type="bibr" rid="ref11 ref9">9,11</xref>
        ]. Implicit reward shaping is an alternative approach that learns from
demonstrations of target behaviour. A potential-based reward function is
recovered from demonstrations using state similarity in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and through inverse
reinforcement learning methods in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The shaped reward function is learnt
directly from raw pixel data in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        An alternative to "shaping" an extrinsic reward is to supplement it with
intrinsic rewards [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] such as curiosity. Curiosity-Driven Exploration by
Self-Supervised Prediction [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is a fundamental paper that defined a framework for
training curious agents. Curiosity empowers the agent by giving it the capability
of exploration, enabling it to reach faraway states that contain extrinsic rewards.
Much research has built upon the findings of this paper. Large-scale analysis of
the approach is performed in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] where agents learned to play various Atari
Games using intrinsic rewards alone. A limitation of the approach is that it
struggles to learn in stochastic environments [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Classic work in [
        <xref ref-type="bibr" rid="ref13 ref6">6,13</xref>
        ] investigated balancing exploration and exploitation
in polynomial time and has inspired much research in the area of intelligent
exploration. Count-based exploration methods generate an exploration-bonus
from state visitation counts [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. These methods have been shown to achieve good results on
the notoriously difficult "Montezuma's Revenge" Atari game in [
        <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
        ]. Exploration
bonuses encourage an agent to explore, even when the environment's reward is
sparse [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], by optimising a reward function that is the sum of the extrinsic reward
and exploration bonus.
      </p>
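      <p>A count-based exploration bonus can be sketched as follows. The bonus form β/√N(s) used here is one common choice and is an assumption of this sketch; the text above does not specify the exact bonus used in the cited works. The class name CountBonus and the value of β are likewise hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


class CountBonus:
    """Adds an exploration bonus that shrinks as a state is revisited."""

    def __init__(self, beta=0.1):
        self.beta = beta          # bonus scale (hypothetical value)
        self.counts = Counter()   # visitation count N(s) per state

    def reward(self, state, extrinsic_reward):
        """Return extrinsic reward plus the bonus beta / sqrt(N(s))."""
        self.counts[state] += 1
        bonus = self.beta / math.sqrt(self.counts[state])
        return extrinsic_reward + bonus


explorer = CountBonus(beta=0.1)
first_visit = explorer.reward((0, 0), extrinsic_reward=0.0)
second_visit = explorer.reward((0, 0), extrinsic_reward=0.0)
print(first_visit, second_visit)
```
]]></preformat>
      <p>States must be hashable for exact counting, which is why large or continuous state spaces require the approximations discussed next.</p>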
      <p>
        Approximating these counts in large state spaces is a difficult task [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Hash
functions were used in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] to extend the method to high-dimensional,
continuous state spaces. Random Network Distillation (RND) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a novel technique
that consists of a fixed, randomly initialised target network and a prediction
network. The target network outputs a random function of the environment states
which the prediction network learns to predict. An intrinsic reward is defined as
the loss of the prediction network. It achieved state-of-the-art performance on
"Montezuma's Revenge" [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] in 2018.
      </p>
      <p>
        Other methods of exploration include maximising empowerment [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], wherein
the agent's long-term goal is to maximise its control over the environment,
using the prediction error in the feature space of an auto-encoder as a measure
of interesting states to explore, and using demonstration data to learn an
exploration policy [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Directed Curiosity</title>
      <p>We propose a new reward function that is made up of two constituents: a
distance-based shaped extrinsic reward and a curiosity-based intrinsic reward.</p>
      <sec id="sec-3-1">
        <title>Distance-Based Reward Shaping</title>
        <p>
          Shaping rewards is a fragile process since small changes in the reward function
result in significant changes to the learned policy [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
        </p>
        <p>
          Various functions were engineered and compared. It is essential that the
positive and negative rewards are balanced. In an episode, the agent should not
receive more positive rewards for moving closer to the target, or more negative
rewards for moving further away, so as not to introduce loopholes for the agent
to exploit. If the weighting of positive rewards is too high, the agent learns to
game the system by delaying reaching the target to gain more positive rewards
in an episode. If the weighting of the negative rewards is too high, the agent
does not receive sufficient positive reinforcement to find the target. In either case,
the shaped rewards alter the optimal policy of the original task [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>The shaped reward should encourage the agent to keep advancing towards
the target by favouring consecutive positive moves and punishing consecutive
negative ones. It must not dominate the terminal reward such that the agent is
no longer incentivised to find the target and its motivations become polluted.
To overcome these issues, a shaped reward function based on relative distance
between target and agent is used.</p>
        <sec id="sec-3-1-1">
          <title>Algorithm 1 Distance-based shaped reward function</title>
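          <p>Input: agent position P<sub>agent</sub>, target position P<sub>target</sub>, maximum distance D<sub>max</sub>,
previous distance D<sub>prev</sub>, reward coefficient C</p>
          <p>Calculate distance: D<sub>current</sub> ← distance(P<sub>agent</sub>, P<sub>target</sub>)</p>
          <p>Calculate reward signal: R ← D<sub>current</sub> / D<sub>max</sub></p>
          <p>if D<sub>current</sub> &lt; D<sub>prev</sub> then</p>
          <p>return C · (1 − R)</p>
          <p>else</p>
          <p>return C · (−R)</p>
          <p>end if</p>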
          <p>There are various benefits to Algorithm 1. The agent is penalised if it stays
still and the shaped reward signal can be controlled using the reward coefficient
C. This ensures that the episodic shaped rewards cannot exceed the terminal positive
reward. There is a balance between positive and negative rewards since they are
both relative to the change in distance. The agent receives the highest reward
when it moves closest to the target and the highest penalty when it moves
furthest away. This means that the shaped reward function is policy invariant
i.e. it does not alter the goal of the agent to learn the optimal path to the target.</p>
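          <p>Algorithm 1 can be sketched in Python as below. This is a minimal illustration rather than the implementation used in the experiments: the Euclidean distance metric, the function name shaped_reward and the example positions are assumptions, and returning the current distance alongside the reward is a convenience added so that the caller can pass it as the previous distance on the next step.</p>
          <preformat><![CDATA[
```python
import math


def shaped_reward(agent_pos, target_pos, d_max, d_prev, c):
    """Distance-based shaped reward following Algorithm 1.

    Returns (reward, d_current). Euclidean distance is assumed here.
    """
    d_current = math.dist(agent_pos, target_pos)
    r = d_current / d_max  # relative distance to the target
    if d_current < d_prev:
        return c * (1.0 - r), d_current  # moved closer: positive reward
    return c * (-r), d_current           # moved away or stayed still: penalty


# Hypothetical usage: one step towards the target, then one step standing still.
target = (10.0, 0.0)
d_prev = math.dist((0.0, 0.0), target)
reward_closer, d_prev = shaped_reward((2.0, 0.0), target, d_max=10.0, d_prev=d_prev, c=0.1)
reward_still, d_prev = shaped_reward((2.0, 0.0), target, d_max=10.0, d_prev=d_prev, c=0.1)
print(reward_closer, reward_still)
```
]]></preformat>
          <p>Because the else branch also covers the case where the distance is unchanged, an agent that stays still is penalised, as noted above.</p>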
          <p>Since the rewards are shaped exclusively based on distance metrics that do
not take into account the specific dynamics of the environment, the same function
can be used across different environments, and in general, for navigation tasks.
A limitation of this approach is that the target location needs to be known.
We have investigated using ray casts to find the location of the target if it is
unknown; however, the scope of this research is to teach an agent to navigate
past obstacles and find an optimal path, given a starting point and a destination.
The definition of the task changes drastically, from a navigation-based one to
a goal-finding or search task, when the location is unknown. This is a possible
area for future work.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Curiosity-Driven Exploration</title>
        <p>
          Pathak et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] formally defined a framework for training curious agents that
involves training two separate neural networks: a forward and an inverse model
that form an Intrinsic Curiosity Module (ICM). The inverse model encodes the
current and next observation into a feature space and learns to predict the
action â<sub>t</sub> that was taken between the occurrence of the two encoded observations.
The forward model is trained to take the current encoded observation and action
and predict the next encoded observation.
        </p>
        <p>In order to generate a curiosity reward signal, the inverse and forward
dynamics models' loss functions are jointly optimised, i.e. curiosity is defined as the
difference between the predicted feature vector of the next state and the real
feature vector of the next state. η is a scaling factor.</p>
        <p>
          <disp-formula id="eq-1">
            <label>(1)</label>
            <tex-math><![CDATA[r^i_t = \frac{\eta}{2} \left\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\rVert_2^2]]></tex-math>
          </disp-formula>
        </p>
        <p>As an agent explores, it learns more about its environment and becomes less
curious. A major benefit of this approach is that it is robust: by combining the
two models, the reward only captures surprising states that have come about
directly as a result of the agent's actions.</p>
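        <p>The curiosity reward in Equation 1 reduces to a scaled squared error between two feature vectors. The sketch below assumes the feature vectors are already available as plain Python sequences; the function name curiosity_reward and the example vectors are hypothetical, and the feature encoder and forward model themselves are not shown.</p>
        <preformat><![CDATA[
```python
def curiosity_reward(phi_pred, phi_next, eta=1.0):
    """Intrinsic reward (eta / 2) * ||phi_pred - phi_next||_2^2."""
    squared_error = sum((p - q) ** 2 for p, q in zip(phi_pred, phi_next))
    return 0.5 * eta * squared_error


# Hypothetical feature vectors: a poor prediction yields a larger reward
# than an accurate one, so poorly predicted transitions attract the agent.
phi_next = [1.0, 0.0, 2.0]
r_poor = curiosity_reward([0.0, 0.0, 0.0], phi_next)
r_good = curiosity_reward([1.0, 0.0, 1.5], phi_next)
print(r_poor, r_good)
```
]]></preformat>
        <p>As the forward model improves on frequently visited transitions, this reward shrinks, which matches the observation that the agent becomes less curious as it explores.</p>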
      </sec>
      <sec id="sec-3-3">
        <title>Intelligent Exploration</title>
        <p>
          We propose hybridising curiosity [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and distance-based reward shaping. Using
reward shaping alone is flawed since the agent cannot navigate past obstacles
to find a target. Using curiosity alone may cause the agent to spend too much
time exploring after the target has been found, and get trapped in a suboptimal
state. By combining the two approaches, the agent is able to explore and learn
about the dynamics of the environment, while always keeping in mind its goal
of finding an optimal path to the target. The agent learns in a more directed
and intuitive manner. Curiosity enables the agent to find the target, while the
shaped rewards provide feedback to the agent that enables it to learn a path to
the goal.
        </p>
        <p>Directed Curiosity simultaneously maximises two reward signals. The reward
function components are somewhat conflicting, so it is essential to find a balance
between them. The agent needs sufficient time to explore the environment, but
must not converge to a suboptimal policy too quickly. This
is similar to the exploration vs exploitation problem in RL. We balance the
reward by manually tuning weights attached to both the constituent reward
signals. In future work, we wish to find a means of dynamically weighting the
reward signals during training. We also wish to investigate alternative means of
combining them.</p>
        <sec id="sec-3-3-1">
          <title>Algorithm 2 Directed Curiosity-Driven Exploration</title>
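          <p>Input: initial policy π<sub>0</sub>, extrinsic reward weighting w<sub>e</sub>, intrinsic reward weighting w<sub>i</sub>,
max steps T, decision frequency D</p>
          <p>for i ← 0 to T do</p>
          <p>Run policy π<sub>i</sub> for D timesteps</p>
          <p>Calculate distance-based shaped reward r<sup>e</sup><sub>t</sub> (Algorithm 1)</p>
          <p>Calculate intrinsic reward r<sup>i</sup><sub>t</sub> (Equation 1)</p>
          <p>Compute total rewards r<sub>t</sub> = w<sub>i</sub> · r<sup>i</sup><sub>t</sub> + w<sub>e</sub> · r<sup>e</sup><sub>t</sub></p>
          <p>
            Take policy step from π<sub>i</sub> to π<sub>i+1</sub>, using PPO [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] with reward function r<sub>t</sub>
          </p>
          <p>end for</p>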
          <p>
            PPO [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] is a popular policy gradient method that is robust and simpler than
alternative approaches. Our algorithm is trained using PPO, though an arbitrary
policy gradient method can be used.
          </p>
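          <p>The reward combination step of Algorithm 2 can be sketched as a weighted sum. The function names and example reward values below are hypothetical; the default weights follow the values reported in Section 4.2, and the PPO policy update itself is deliberately omitted.</p>
          <preformat><![CDATA[
```python
def total_reward(r_intrinsic, r_extrinsic, w_i=0.1, w_e=1.0):
    """Weighted sum r_t = w_i * r_intrinsic + w_e * r_extrinsic."""
    return w_i * r_intrinsic + w_e * r_extrinsic


def combine_batch(intrinsic, extrinsic, w_i=0.1, w_e=1.0):
    """Combine per-timestep intrinsic and shaped extrinsic rewards."""
    return [
        total_reward(r_i, r_e, w_i, w_e)
        for r_i, r_e in zip(intrinsic, extrinsic)
    ]


# Hypothetical rewards for two timesteps collected by the current policy.
rewards = combine_batch(intrinsic=[2.5, 0.125], extrinsic=[0.02, -0.08])
print(rewards)
```
]]></preformat>
          <p>Keeping the weights as explicit parameters makes the manual tuning described above, and any future dynamic weighting scheme, a local change.</p>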
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <sec id="sec-4-1">
        <title>Learning Environment</title>
        <p>A custom testing environment was created to analyse the performance of our
technique, based on the principle of pathfinding. It consists of a ball and a
target. The ball is an agent that must learn to navigate to the target in the
shortest possible time (see Fig. 1). The platform has no walls along its
boundaries: the agent is penalised every time it falls off the platform and receives a
positive reward upon reaching the target. An episode terminates upon falling off
the platform, reaching the target, or after a maximum number of steps.</p>
        <p>The benefit of this environment is that it defines a simple base task of finding
an optimal path to a target. This allows us to perform thorough analysis of the
algorithm by continuously increasing the difficulty of the task. In this way, we
are able to identify its limitations and strengths. The environment represents
a generalisation for navigation tasks wherein an agent only receives positive
feedback upon reaching its destination.</p>
        <p>The agent is equipped with a set of discrete actions. Action 1 defines
forward and backward movement while action 2 defines left and right movement.
Choosing both actions simultaneously allows the agent to move diagonally. The
agent's observations are vectors representing its current position and the target
position. It is not given any information about the dynamics of the environment.
The agent must learn an optimal policy that finds the shortest path to the target.</p>
        <p>The baseline reward function was carefully tuned: a +100 reward is received
for finding the target, a -100 penalty for falling off the platform and a -0.01 penalty
every timestep. The reasoning behind the selected values is to remove bias from
the experiments. An agent cannot fall into a local optimum by favouring a single
suboptimal policy. This is because a policy that immediately falls off the platform
and a policy that learns to remain on the platform for the entire episode, without
finding the goal, will both return roughly the same episodic reward. This function
was used as a baseline that was tuned for each new environment.</p>
        <p>Fig. 1 panels: (a) BasicNav, (b) HardNav, (c) ObstacleNav, (d) MazeNav1.</p>
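        <p>The balance between the two suboptimal policies can be checked with a short calculation. The sketch below assumes that the time penalty is also applied on the terminal step and that a timeout carries no extra terminal reward; the maximum episode length is not stated here, so the break-even step count is derived purely from the reward values. The function name episodic_return is hypothetical.</p>
        <preformat><![CDATA[
```python
TARGET_REWARD = 100.0   # reward for reaching the target
FALL_PENALTY = -100.0   # penalty for falling off the platform
STEP_PENALTY = -0.01    # penalty applied on every timestep


def episodic_return(steps, outcome):
    """Total sparse reward for an episode of `steps` timesteps.

    outcome is "target", "fell" or "timeout" (assumed to add no terminal reward).
    """
    terminal = {"target": TARGET_REWARD, "fell": FALL_PENALTY, "timeout": 0.0}[outcome]
    return steps * STEP_PENALTY + terminal


# Falling off on the first step versus surviving without finding the target:
# the accumulated time penalty matches the fall penalty after this many steps.
fall_immediately = episodic_return(1, "fell")
break_even_steps = round(FALL_PENALTY / STEP_PENALTY)
print(fall_immediately, break_even_steps)
```
]]></preformat>
        <p>Under these assumptions, an episode that times out after roughly 10,000 steps returns about the same reward as one that falls off immediately, so neither suboptimal behaviour is preferred over the other.</p>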
        <p>For the simplest version of the task, the agent and target are placed at fixed
locations, on opposite sides of the platform, without any obstacles between
them. We term this an easy exploration task since it is possible for an agent
trained with only the sparse reward function to find the target. This is achieved
by tuning the floor-to-agent ratio and agent speed. The shaped reward coefficient
C in Algorithm 1 was amplified to 0.1 due to the simplicity of the environment.
This is referred to as BasicNav (see Fig. 1a).</p>
        <p>
          The next environment, termed HardNav (see Fig. 1b), is significantly larger.
It is a hard exploration environment [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], since an agent trained with a sparse
reward function is never able to find the target. Due to the increased number of
episode steps, the shaped reward coefficient C in Algorithm 1 was dampened to
0.001.
        </p>
        <p>We also perform testing in environments with walls that block the direct
path to the goal and make finding the target more difficult. ObstacleNav (see
Fig. 1c) has a single obstacle that is deliberately placed perpendicular to the
optimal path to the target, forcing the agent to learn to move around the
obstacle. The agent is never explicitly given any information about the obstacle.
This environment was designed to test the limitations of Directed Curiosity since
shaping the reward to minimise the distance to the goal is counter-intuitive because
it leads the agent directly into the obstacle. The coefficient C in Algorithm 1
was dampened to 0.001.</p>
        <p>The remaining set of environments contain multiple walls and obstacles in
a maze-like structure. These environments were designed to investigate if the
agent can learn to move further away from the target at the current timestep,
in order to pass obstacles and reach the target at a later timestep, i.e. it needs
foresight to succeed. We term the first maze MazeNav1 (see Fig. 1d).</p>
        <p>The last environment is the most difficult version of the task since it has
dead-ends and multiple possible paths to the goal. This allows us to investigate the
robustness of Directed Curiosity. Even after finding the target, it is difficult to
generalise a path from the starting point to the destination since it is easy for the
agent to get stuck in dead-ends or behind obstacles. We term this environment
MazeNav2 (see Fig. 1e). Due to the increased complexity, the terminal reward
was increased to +1000 and the shaped reward coefficient C in Algorithm 1 was
dampened to 0.000001.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Hyperparameter Optimisation</title>
        <p>
          It is important to carefully tune the hyperparameters for each environment. The
success of the algorithms hinges on these values. Although literature guided this
process, the hyperparameters were manually optimised, since the experiments
were performed in custom environments. The base hyperparameters were found
in BasicNav and then fine-tuned for all other environments, in order to cater
for the increased complexities. PPO is a robust learning algorithm that did not
require significant tuning [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] once the base hyperparameters were identified, and
this is a major reason for its selection.
        </p>
        <p>Hyperparameter tuning was essential in ensuring that the algorithms were
able to perform meaningful learning. By attempting to tune the parameters to
the best possible values, we were able to perform a fair comparison. The notable
parameters are a batch size of 32, an experience buffer size of 256 and a learning
rate of 1.0e-5. The strength of the entropy regularization is 5.0e-3 and
the discount factor for both the curiosity and extrinsic reward is 0.99. The
extrinsic reward weighting is 1.0 and the curiosity weighting is 0.1. The network
has 2 hidden layers with 128 units. The baseline parameters were adjusted for
each environment: the maximum training steps is 50000 in BasicNav, 250000 in
HardNav, 750000 in ObstacleNav and 1000000 in MazeNav1 and MazeNav2.</p>
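        <p>For reference, the reported values can be collected in a single configuration object. The sketch below is only one possible way to organise them: the class name TrainingConfig, its field names and the dictionary MAX_TRAINING_STEPS are hypothetical and do not correspond to the configuration format of any particular training library.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Baseline hyperparameters as reported in the text (field names assumed)."""
    batch_size: int = 32
    buffer_size: int = 256
    learning_rate: float = 1.0e-5
    entropy_strength: float = 5.0e-3
    discount: float = 0.99          # used for both curiosity and extrinsic reward
    extrinsic_weight: float = 1.0
    curiosity_weight: float = 0.1
    hidden_layers: int = 2
    hidden_units: int = 128


# Maximum training steps per environment, as reported in the text.
MAX_TRAINING_STEPS = {
    "BasicNav": 50_000,
    "HardNav": 250_000,
    "ObstacleNav": 750_000,
    "MazeNav1": 1_000_000,
    "MazeNav2": 1_000_000,
}

config = TrainingConfig()
print(config, MAX_TRAINING_STEPS["HardNav"])
```
]]></preformat>
        <p>Freezing the dataclass guards against accidental changes to the baseline while per-environment variants are derived from it.</p>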
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>Each algorithm was run five times in every environment. Thirty parallel instances of
the same environment were used for data collection during training.</p>
      <p>Fig. 2 panels: (b) HardNav, (c) ObstacleNav, (d) MazeNav1.</p>
      <p>The sparse rewards agent does not perform consistently in BasicNav. The
agent is able to find the target and learn an optimal policy on some runs only.
This is the reason for the high variance in Fig. 2a. In the hard exploration
environments, the agent learns to avoid falling off the platform but is unable to
find the target on any run and therefore receives no positive rewards in training.
This highlights the need for an exploration strategy.</p>
      <p>The reward shaping agent performs well in BasicNav. This is because the
shaped rewards act as a definition of the task since there are no obstacles blocking
the direct path to the goal. Continuously moving closer to the target on every
timestep leads the agent to the goal in the shortest time. Even in HardNav, the
agent is able to learn an optimal policy very quickly, for the same reasons.</p>
      <p>The deficiencies of using the shaped reward only are exposed when obstacles
are introduced (see Fig. 2c, Fig. 2d). The agent fails to find the target on any run
in ObstacleNav and MazeNav1 and gets stuck behind obstacles. This is because
the shaped reward function is a greedy approach and the agent is not equipped
with the foresight to learn to move around the obstacles. It cannot learn to move
further away from the target at the current time, in order to reach the target at
a later stage.</p>
      <p>In MazeNav2 (see Fig. 2e), the agent was able to find the target on two
runs. Even though there are multiple obstacles, the optimal path to the goal in
MazeNav2 is similar to that in HardNav. The agent "ignores" the obstacles and
avoids dead-ends by acting simplistically. By the end of training, however, the
agent was unable to converge to an optimal policy on any of the runs.</p>
      <p>These results show that distance-based reward shaping provides the agent
with some valuable feedback, but without an intuitive exploration strategy, the
agent lacks the foresight needed to move past obstacles that block its path to
the target.</p>
      <p>The curiosity agent was able to consistently learn an optimal policy in the
environments without obstacles. However, Fig. 2a shows that the curiosity agent
takes longer to converge to an optimal policy in BasicNav. This highlights that
curiosity is not necessary in environments that are not hard exploration. In
HardNav (see Fig. 2b), the curiosity agent is still able to find an optimal policy
on all runs, but it is significantly slower than the shaped reward agent.</p>
      <p>The necessity of the curiosity signal is highlighted when obstacles are
introduced. Not only does it enable the agent to find the distant target, it also
implicitly learns about the dynamics of the environment, allowing the agent to
learn how to move past multiple obstacles.</p>
      <p>In ObstacleNav (see Fig. 2c), the agent is still able to learn an optimal policy
on most runs. The agent is less successful in MazeNav1
(see Fig. 2d) and MazeNav2 (see Fig. 2e). The agent successfully learns an
optimal policy on two of the runs. In these environments, it is difficult to converge
to an optimal policy once the target has been found. One reason for this is that
the agent keeps exploring after initially finding the target and gets stuck behind
obstacles and in dead-ends, eventually converging to an unsuccessful policy,
without being able to reach the target again. The curiosity signal is insufficient to
direct the agent back to the target and learn a path to the destination. This is
the reason for the increase of the average reward in the early stages of training
and the subsequent drop thereafter in Fig. 2e.</p>
      <p>These results indicate that curiosity equips an agent with the ability to find
a target in hard exploration environments with obstacles, but the agent requires
additional feedback to consistently learn a path from the start point to the
destination.</p>
      <p>The Directed Curiosity agent is shown to be the most robust technique.
Fig. 2a and Fig. 2b show that Directed Curiosity always finds an optimal policy
in BasicNav and HardNav. It converges to a solution faster than the curiosity
agent in HardNav, due to the additional shaped reward feedback.</p>
      <p>The hard exploration environments highlight the benefits of the technique.
It is the only technique that converges to an optimal solution on all runs in
ObstacleNav (see Fig. 2c). Curiosity enables the agent to find the target and
move past the obstacle, while the shaped rewards provide additional feedback
that allows the agent to learn an optimal path to the target, once it has been
found.</p>
      <p>MazeNav1 (see Fig. 2d) and MazeNav2 (see Fig. 2e) exhibit promising results
since the Directed Curiosity agent learns an optimal policy on more runs than
any other technique, i.e. on three of the five runs. Training is more stable than
for the curiosity agent. The agent always finds the target during training; however,
it is unable to consistently find an optimal policy on all runs. A major reason
is the limitations we have highlighted with the shaped reward function.
In future work, we wish to investigate a more intuitive reward function that
has foresight. Another reason is the complexities we have introduced in
these environments. The reward feedback is not sufficient to guide the agent out
of dead-ends back to the target. However, these results indicate that the two
components of Directed Curiosity, when balanced correctly, allow the agent to
learn in a more directed and intuitive manner.</p>
      <p>For all algorithms, the variance of the results increases with the difficulty of
the task, since the agents do not always converge to an optimal policy: when
the agent does not learn a path to the target, it does not receive the terminal
reward and hence its episodic rewards are significantly lower. PPO learns a
stochastic policy; hence, even on the successful runs, the algorithms converge
at different times. Due to the inherent randomness in the algorithm, the agent
explores differently on every run and thus visits states in a different order.</p>
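      <p>The effect of failed runs on the spread of the results can be made concrete with a small arithmetic sketch. The reward values below are invented for illustration and are not measurements from the experiments; the helper name <monospace>summarise_runs</monospace> is likewise hypothetical.</p>
      <preformat>
```python
from statistics import mean, pstdev


def summarise_runs(final_rewards):
    """Return (mean, population standard deviation) over independent runs."""
    return mean(final_rewards), pstdev(final_rewards)


# Hypothetical final episodic rewards over five runs.
# Every run reaches the target and collects the terminal reward.
all_converged = [1.0, 1.0, 1.0, 1.0, 1.0]
# Two runs never reach the target, so they miss the terminal reward.
partly_converged = [1.0, 1.0, 1.0, -0.5, -0.5]

easy_mean, easy_spread = summarise_runs(all_converged)
hard_mean, hard_spread = summarise_runs(partly_converged)
```
      </preformat>
      <p>Under these assumed values the spread is zero when every run converges and becomes large as soon as some runs miss the terminal reward, which mirrors the widening variance seen in the harder environments.</p>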
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>
        A new approach to learning in hard exploration, sparse reward environments,
that maximises a reward signal made up of a hybrid of Curiosity-Driven
Exploration [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and distance-based reward shaping, is presented. This algorithm is
compared to baseline algorithms in a custom pathfinding environment and it is
shown that the technique enables agents to learn in a more directed and intuitive
manner.
      </p>
      <p>The Directed Curiosity agent was the most robust technique. It was able
to consistently learn an optimal policy in hard exploration environments with a
single obstacle, and learned optimal policies more often than the other techniques
in hard exploration environments with multiple obstacles and dead-ends.</p>
      <p>In future work, we wish to investigate alternative reward functions that are
more flexible than the current greedy approach. We would like to perform
further testing in existing benchmarked environments and in domains other than
navigation. This requires further research into "intelligent exploration", through
hybridising different shaped reward signals and exploration strategies. Another
interesting direction is to create environments with multiple targets and agents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Andrychowicz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolski</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ray</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fong</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welinder</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGrew</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tobin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>O.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Hindsight experience replay</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <volume>5048</volume>
          –
          <issue>5058</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Arulkumaran</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deisenroth</surname>
            ,
            <given-names>M.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brundage</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bharath</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Deep Reinforcement Learning: A Brief Survey</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          <volume>34</volume>
          (
          <issue>6</issue>
          ),
          <volume>26</volume>
          –38 (Nov
          <year>2017</year>
          ). https://doi.org/10.1109/MSP.2017.2743240, http://ieeexplore.ieee.org/document/8103164/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Badnava</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mozayani</surname>
            ,
            <given-names>N.:</given-names>
          </string-name>
          <article-title>A new Potential-Based Reward Shaping for Reinforcement Learning Agent</article-title>
          . arXiv:1902.06239 [cs] (May
          <year>2019</year>
          ), http://arxiv.org/abs/1902.06239
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bellemare</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srinivasan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostrovski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaul</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saxton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Munos</surname>
          </string-name>
          , R.:
          <article-title>Unifying count-based exploration and intrinsic motivation</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <volume>1471</volume>
          –
          <issue>1479</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bellemare</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naddaf</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowling</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The arcade learning environment: An evaluation platform for general agents</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>47</volume>
          ,
          <volume>253</volume>
          –
          <fpage>279</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Brafman</surname>
            ,
            <given-names>R.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tennenholtz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          (Oct),
          <volume>213</volume>
          –
          <fpage>231</fpage>
          (
          <year>2002</year>
          ), http://www.jmlr.org/papers/v3/brafman02a.html
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Burda</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Edwards</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Storkey</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efros</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Large-Scale Study of Curiosity-Driven Learning</article-title>
          .
          <source>In: International Conference on Learning Representations</source>
          (
          <year>2019</year>
          ), https://openreview.net/forum?id=rJNwDjAqYX
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Burda</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Edwards</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Storkey</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klimov</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Exploration by random network distillation</article-title>
          . arXiv preprint arXiv:1810.12894
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kudenko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Dynamic Potential-Based Reward Shaping</article-title>
          (
          <year>Jun 2012</year>
          ), http://eprints.whiterose.ac.uk/75121/
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gregor</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rezende</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wierstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Variational intrinsic control</article-title>
          .
          <source>arXiv preprint arXiv:1611.07507</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hussein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elyan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaber</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jayne</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Deep reward shaping from demonstrations</article-title>
          .
          <source>In: 2017 International Joint Conference on Neural Networks (IJCNN)</source>
          . pp.
          <volume>510</volume>
          –
          <fpage>517</fpage>
          . IEEE
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jie</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
          </string-name>
          , J.: Policy Optimization with Demonstrations p.
          <volume>10</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kearns</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Near-optimal reinforcement learning in polynomial time</article-title>
          .
          <source>Machine learning 49(2-3)</source>
          ,
          <volume>209</volume>
          –
          <fpage>232</fpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Marom</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Belief reward shaping in reinforcement learning</article-title>
          .
          <source>In: Thirty-Second AAAI Conference on Artificial Intelligence</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harada</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Policy invariance under reward transformations: Theory and application to reward shaping</article-title>
          .
          <source>In: ICML</source>
          . vol.
          <volume>99</volume>
          , pp.
          <volume>278</volume>
          –
          <issue>287</issue>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Oudeyer</surname>
          </string-name>
          , P.Y.,
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>What is intrinsic motivation? A typology of computational approaches</article-title>
          .
          <source>Frontiers in neurorobotics 1</source>
          ,
          <issue>6</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efros</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Curiosity-Driven Exploration by Self-Supervised Prediction</article-title>
          .
          <source>In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</source>
          . pp.
          <volume>488</volume>
          –
          <fpage>489</fpage>
          . IEEE, Honolulu, HI, USA (Jul
          <year>2017</year>
          ). https://doi.org/10.1109/CVPRW.2017.70, http://ieeexplore.ieee.org/document/8014804/
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Ravishankar</surname>
            ,
            <given-names>N.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vijayakumar</surname>
            ,
            <given-names>M.V.</given-names>
          </string-name>
          :
          <article-title>Reinforcement Learning Algorithms: Survey and Classification</article-title>
          .
          <source>Indian Journal of Science and Technology</source>
          <volume>10</volume>
          (
          <issue>1</issue>
          ) (
          <year>Jan 2017</year>
          ). https://doi.org/10.17485/ijst/2017/v10i1/109385, http://www.indjst.org/index.php/indjst/article/view/109385
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Schaul</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antonoglou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Prioritized experience replay</article-title>
          .
          <source>arXiv preprint arXiv:1511.05952</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolski</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klimov</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Proximal Policy Optimization Algorithms</article-title>
          . arXiv:1707.06347 [cs] (Jul
          <year>2017</year>
          ), http://arxiv.org/abs/1707.06347
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Suay</surname>
            ,
            <given-names>H.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brys</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Learning from Demonstration for Shaping through Inverse Reinforcement Learning</article-title>
          p.
          <volume>9</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Suay</surname>
            ,
            <given-names>H.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brys</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , M.E.,
          <string-name>
            <surname>Chernova</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Reward Shaping by Demonstration</article-title>
          .
          <source>In: Proceedings of the Multi-Disciplinary Conference on Reinforcement Learning and Decision Making (RLDM)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isbell Jr</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomaz</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          :
          <article-title>Exploration from demonstration for interactive reinforcement learning</article-title>
          .
          <source>In: Proceedings of the 2016 International Conference on Autonomous Agents &amp; Multiagent Systems</source>
          . pp.
          <volume>447</volume>
          –
          <fpage>456</fpage>
          . International Foundation for Autonomous Agents and Multiagent Systems
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Houthooft</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foote</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stooke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xi Chen</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DeTurck</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abbeel</surname>
          </string-name>
          , P.:
          <article-title>#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning</article-title>
          . In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          , pp.
          <volume>2753</volume>
          –
          <fpage>2762</fpage>
          . Curran Associates, Inc. (
          <year>2017</year>
          ), http://papers.nips.cc/paper/6868-exploration-a-study-of-count-based-exploration-for-deep-reinforcement-learning.pdf
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Vecerik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hester</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scholz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietquin</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piot</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heess</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , Rothorl, T.,
          <string-name>
            <surname>Lampe</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards</article-title>
          .
          arXiv:1707.08817 [cs] (Jul
          <year>2017</year>
          ), http://arxiv.org/abs/1707.08817
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Wiewiora</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cottrell</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elkan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Principled methods for advising reinforcement learning agents</article-title>
          .
          <source>In: Proceedings of the 20th International Conference on Machine Learning (ICML-03)</source>
          . pp.
          <volume>792</volume>
          –
          <issue>799</issue>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>