<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Auto-Perceptive Reinforcement Learning (APRiL)</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Rebecca Allday, Simon Hadfield, and Richard Bowden Center for Vision, Speech and Signal Processing (CVSSP) University of Surrey</institution>
          ,
          <addr-line>Guildford</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <fpage>103</fpage>
      <lpage>112</lpage>
      <abstract>
        <p>The relationship between the feedback given in Reinforcement Learning (RL) and visual data input is often extremely complex. Given this, expecting a single system trained end-to-end to learn both how to perceive and how to interact with its environment is unrealistic for complex domains. In this paper we propose Auto-Perceptive Reinforcement Learning (APRiL), separating the perception and the control elements of the task. This method uses an auto-perceptive network to encode a feature space. The feature space may explicitly encode available knowledge from the semantically understood state space, but the network is also free to encode unanticipated auxiliary data. By decoupling visual perception from the RL process, APRiL can make use of techniques shown to improve performance and efficiency of RL training, which are often difficult to apply directly with a visual input. We present results showing that APRiL is effective in tasks where the semantically understood state space is known. We also demonstrate that allowing the feature space to learn auxiliary information lets the visual perception system improve performance by approximately 30%. We also show that maintaining some level of semantics in the encoded state, which can then make use of state-of-the-art RL techniques, saves around 75% of the time that would be used to collect simulation examples.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © by the paper’s authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Deep RL has seen advances recently with work like Deep Q-Networks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] which uses a deep convolutional neural
network (CNN) to approximate the action-value function in a Q-learning method to learn to play Atari games.
There have since been many variations on DQNs such as using recurrent neural networks in place of a standard
feed-forward CNN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and adaptations for use with continuous action spaces [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Whilst these value based methods
for RL have proved popular, policy based and actor-critic methods have also been successfully adapted for deep
learning. In this work we use a synchronous version of Mnih et al.’s Asynchronous Advantage Actor-Critic (A3C)
method [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        RL algorithms are often tested using simple software simulators such as video games or simple physics problems
(e.g. cart-pole). This makes it easy to accumulate the number of episodes required to train the networks, which is
not practical for more realistic robotics applications. Many techniques for approaching the issue of data collection
have been suggested. For example, Hindsight Experience Replay (HER) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] allows RL to learn from unsuccessful
episodes by changing the goal and hence the reward feedback. However, in order to apply this to the image domain,
a method for synthesising images is required to change the goal. There have also been model based techniques
aimed at reducing the number of experiences needed for training. Black-DROPS (Black-box Data-efficient RObot
Policy Search) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], for example, uses Gaussian Processes (GPs) to learn the dynamics of a system with a small
number of experiences and then produces experiences for training the RL directly from the GPs. This accelerates
the RL process but is focused on systems where the state is fully observed and has a small number of dimensions.
Observations that are available only as images have a large dimensionality and are not suitable for GP dynamics modeling.
      </p>
      <p>
        Advances in deep learning have meant that feature spaces can be created which represent the important aspects
of a visual observation. Deep auto-encoders [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have been used extensively to reduce dimensionality of data
and have been used with CNNs [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to help retain the spatial relationships in images. As well as providing a
low-dimensional feature space they are also used to create generative models, for example in image restoration
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Considering the problems high dimensional spaces cause in RL, it is not surprising that attempts to use
auto-encoder networks with RL have been made. Table 1 compares the uses of auto-encoders in RL systems. Finn
et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] use an auto-encoder to create a set of feature points representing positions in the image that describe
the environment, for example where objects are. Stadie et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] encode a state for training a dynamics model in
order to improve exploration by increasing curiosity, but still use the raw observation as the input to the learning
system. Lange and Riedmiller [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] use a deep auto-encoder to compress a visual input to a low dimensional
feature space, which is not semantically understood. This improves the reinforcement learning data-efficiency.
Kimura [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] uses auto-encoders as pre-training for a DQN system. However, none of these approaches can exploit
valuable RL techniques, such as HER. Lange and Riedmiller’s feature space lacks the semantic understanding
required to adapt an episode to a new goal. Kimura’s approach requires images for fine-tuning
the network, and these cannot be adapted for a new goal.
      </p>
      <p>
        Nair et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] propose a solution to goal-conditioned RL, using an encoder-decoder system to learn a latent
space which can be used to sample goals, provide a lower dimensional, structured input for RL, and to compute
a reward signal. Although this allows HER to be used for visual problems, it introduces its own limitations. In
using an image as an explicit goal, the agent’s flexibility is limited. For example, in a pick and place problem it
constrains the final position of the robot when the final position of the object is more important. They also assume
that only the image is available to the RL system at train time and do not consider cases where we may want
to make use of the state that is available, meaning information is wasted.
      </p>
      <p>In contrast, APRiL makes use of whatever semantically understood state information is available at train time,
whilst still allowing additional auxiliary information to be encoded from the visual input. This gives a system
which makes full use of the information and RL techniques available at train time but can still be deployed using
vision as the input.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
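      <p>
        A formalisation of episodic reinforcement learning is used where an agent interacts with an environment at discrete
time steps <inline-formula><tex-math>t</tex-math></inline-formula>, with a maximum number of steps <inline-formula><tex-math>T</tex-math></inline-formula>. There is a set of states <inline-formula><tex-math>s_t \in S</tex-math></inline-formula> and a set of actions the agent
can perform, <inline-formula><tex-math>a_t \in A</tex-math></inline-formula>. The goal is to maximise the discounted sum of the reward signal <inline-formula><tex-math>r_t</tex-math></inline-formula> over time,
      </p>
      <disp-formula><label>(1)</label><tex-math>R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k},</tex-math></disp-formula>
      <p>
        where <inline-formula><tex-math>\gamma \in [0, 1]</tex-math></inline-formula> is the discount factor for future rewards. In order to maximise <inline-formula><tex-math>R_t</tex-math></inline-formula> we learn a policy <inline-formula><tex-math>\pi(a \mid s_t)</tex-math></inline-formula>, which
estimates a distribution over the possible actions, <inline-formula><tex-math>a \in A</tex-math></inline-formula>, conditioned on the current state <inline-formula><tex-math>s_t</tex-math></inline-formula>. We sample <inline-formula><tex-math>a_t</tex-math></inline-formula> from
this distribution. The value is defined as <inline-formula><tex-math>V^{\pi}(s_t) = \mathbb{E}(R_t \mid s_t, \pi)</tex-math></inline-formula>, the expected return <inline-formula><tex-math>R_t</tex-math></inline-formula> given a particular
policy starting in a particular state <inline-formula><tex-math>s_t</tex-math></inline-formula>. Finally, the action-value function is defined as <inline-formula><tex-math>Q^{\pi}(s_t, a_t) = \mathbb{E}(R_t \mid s_t, a_t, \pi)</tex-math></inline-formula>.
      </p>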
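      <p>As an illustration of the discounted return defined above, the following minimal sketch computes the return for every step of a finite episode. The function name <monospace>discounted_returns</monospace> and the example rewards are assumptions made for this illustration; because an episode ends after at most T steps, the infinite sum is truncated at the end of the episode.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def discounted_returns(rewards, gamma):
    """Return R_t = sum_k gamma**k * r_{t+k} for every step t of one episode."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so that R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical sparse-reward episode: reward only on the final step.
example = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
print(example)
```
]]></preformat>
      <p>The backward recursion avoids recomputing the sum for each step, so the cost is linear in the episode length.</p>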
      <sec id="sec-3-1">
        <title>Goal conditioned</title>
        <p>[Table 1: comparison of the uses of auto-encoders in RL systems. Entries under "Goal conditioned": implicit in image encoded into latent space; implicit in image; implicit in image; conditioned with a point in the state space; implicit in input (state or latent space).]</p>
      </sec>
      <sec id="sec-3-2">
        <title>Semantics in RL input</title>
        <p>[Table 1, continued. Entries under "Semantics in RL input": none; partially; none; variable.]</p>
        <sec id="sec-3-2-4">
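          <p>[Figure 2: diagram of the APRiL system. Labels: Perception (Encoder, Decoder); Observation (<inline-formula><tex-math>o \in O</tex-math></inline-formula>); Available semantic state (<inline-formula><tex-math>\bar{s} \in \bar{S} \subset \mathbb{R}^{m},\ m \le n</tex-math></inline-formula>); Encoded state (<inline-formula><tex-math>s \in S \subset \mathbb{R}^{n}</tex-math></inline-formula>); Agent and Environment; Action (<inline-formula><tex-math>a \in A</tex-math></inline-formula>); Reward (<inline-formula><tex-math>r \in \mathbb{R}</tex-math></inline-formula>); Reinforcement Learning.]</p>
          <p>The action-value <inline-formula><tex-math>Q^{\pi}(s_t, a_t)</tex-math></inline-formula> is the expected return <inline-formula><tex-math>R_t</tex-math></inline-formula> given a particular policy, starting with a particular action <inline-formula><tex-math>a_t</tex-math></inline-formula> from a specified state <inline-formula><tex-math>s_t</tex-math></inline-formula>.
For the visual aspect we define <inline-formula><tex-math>o_t \in O</tex-math></inline-formula> as an image of the system.</p>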
          <p>
            In this work we use Advantage-Actor-Critic style reinforcement learning [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. This system produces two outputs:
a stochastic policy (the actor) and an estimate of the value function (the critic). The ground truth value <inline-formula><tex-math>R_t</tex-math></inline-formula> is
used to calculate the value loss
          </p>
          <disp-formula><label>(2)</label><tex-math>L_v = R_t - V(s_t).</tex-math></disp-formula>
        </sec>
        <sec id="sec-3-2-5">
          <title>The policy loss is calculated using the advantage, given by</title>
          <disp-formula><label>(3)</label><tex-math>A(s_t, a_t) = Q(s_t, a_t) - V(s_t).</tex-math></disp-formula>
          <p>The advantage gives the difference between the expected return given the action taken and the expected return
of the state itself given the current policy, showing how much better or worse the action performed than
expected. This can be approximated as the discounted rewards minus the predicted value for the current policy,
taking the form <inline-formula><tex-math>A(s_t, a_t) \approx R_t - V(s_t)</tex-math></inline-formula>. The policy takes the form of a Gaussian distribution, such that
<inline-formula><tex-math>\pi(a \mid s_t) = \mathcal{N}(\mu_a, \sigma_a^{2})</tex-math></inline-formula>. Given that an action <inline-formula><tex-math>a_t</tex-math></inline-formula> is then sampled and executed, the policy loss is calculated as</p>
          <disp-formula><label>(4)</label><tex-math>L_p = \log \pi(a = a_t \mid s_t)\, A(s_t, a_t).</tex-math></disp-formula>
          <p>This means that an action which is better than expected will be made more likely, with a weighting of how likely
it was in the first place. In contrast, an action which performed worse will be made less likely for that state. The
full loss for the RL network then takes the form</p>
          <disp-formula><label>(5)</label><tex-math>L_{rl} = \alpha L_v + \beta L_p + \eta H(\pi(s_t)),</tex-math></disp-formula>
          <p>where <inline-formula><tex-math>H</tex-math></inline-formula> is the entropy, which is included to encourage exploration, and <inline-formula><tex-math>\alpha, \beta, \eta</tex-math></inline-formula> are hyper-parameters which
control the strength of each loss term. To ensure that the initial random value estimate is sensible and does not
skew the policy loss, we train with <inline-formula><tex-math>\alpha = 1, \beta = 0, \eta = 0</tex-math></inline-formula> for a small number of iterations.</p>
          <p>We use a batch-style off-policy approach by storing up experience in a replay buffer and sampling from this
to train the RL algorithm. We limit our experience replay buffer to some size <inline-formula><tex-math>M</tex-math></inline-formula> so that, as learning
progresses, the oldest experiences are forgotten and replaced with more recent ones. The replay buffer is of the
form <inline-formula><tex-math>\Omega = \{ e : |\Omega| &lt; M \}</tex-math></inline-formula>, where each episode of experiences is of the form <inline-formula><tex-math>e = \{ (s_t, a_t, R_t) : t = 1, \ldots, j \text{ and } j \le T \}</tex-math></inline-formula>
and <inline-formula><tex-math>j</tex-math></inline-formula> is the terminating step for that episode. The probability of <inline-formula><tex-math>a_t</tex-math></inline-formula> being selected under the current policy and
the value of <inline-formula><tex-math>s_t</tex-math></inline-formula> under the current policy are computed at training time.</p>
        </sec>
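        <p>The loss terms above can be evaluated for a single sample as in the following minimal sketch, which assumes a scalar action and a one-dimensional Gaussian policy. The function names and the argument name <monospace>eta</monospace> for the entropy weight are assumptions made for this illustration. The signs follow the equations as written in the text; a practical implementation would typically square the value error and negate the policy and entropy terms before minimising, which is not done here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of a scalar action under N(mean, std**2)."""
    var = std * std
    return -0.5 * math.log(2.0 * math.pi * var) - (action - mean) ** 2 / (2.0 * var)


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian with standard deviation std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std * std)


def rl_loss_terms(ret, value, action, mean, std, alpha, beta, eta):
    """Evaluate the loss terms for one (s_t, a_t, R_t) sample."""
    value_loss = ret - value                      # L_v = R_t - V(s_t)
    advantage = ret - value                       # A(s_t, a_t) ~ R_t - V(s_t)
    policy_loss = gaussian_log_prob(action, mean, std) * advantage  # L_p
    entropy = gaussian_entropy(std)               # H(pi(s_t))
    total = alpha * value_loss + beta * policy_loss + eta * entropy
    return value_loss, advantage, policy_loss, total


# Warm-up setting from the text: alpha = 1, beta = 0, entropy weight = 0.
v_loss, adv, p_loss, total = rl_loss_terms(
    ret=1.0, value=0.4, action=0.0, mean=0.0, std=1.0,
    alpha=1.0, beta=0.0, eta=0.0,
)
```
]]></preformat>
        <p>With the warm-up weights only the value term contributes, so the total equals the value loss for this sample.</p>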
      </sec>
      <sec id="sec-3-3">
        <title>Hindsight Experience Replay</title>
        <p>
          Hindsight Experience Replay (HER) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a powerful technique which allows us to learn from unsuccessful episodes
in learning, especially where rewards are sparse and success from random exploration may be limited. Using
HER we can adjust the goal for our system to a state it achieved in the current episode, meaning we artificially
create successful episodes. For a given episode <inline-formula><tex-math>s_1, \ldots, s_T</tex-math></inline-formula> where the goal <inline-formula><tex-math>g \neq s_1, \ldots, s_T</tex-math></inline-formula>, we may “replay” this episode
with <inline-formula><tex-math>g = s_i</tex-math></inline-formula> for some <inline-formula><tex-math>1 &lt; i &lt; T</tex-math></inline-formula>, knowing that it will achieve the goal. Adding these adapted episodes to the
experience replay, <inline-formula><tex-math>\Omega</tex-math></inline-formula>, means the episode buffer then has more episodes to learn from and has a more balanced
ratio of successful episodes without needing excessive exploration.
        </p>
        <p>[Figure 2, remaining labels: Observation (<inline-formula><tex-math>o \in O</tex-math></inline-formula>); Reconstructed Observation; Control. Legend: forward data flow; optional loss; loss.]</p>
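        <p>The relabelling step can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the sparse reward and the one-dimensional example episode are all assumptions, and the goal is stored as a separate tuple field for readability.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def relabel_with_achieved_goal(states, actions, goal_index, reward_fn):
    """Replay an episode as if the state reached at goal_index had been the goal."""
    new_goal = states[goal_index]
    relabelled = []
    # Keep only the steps up to the moment the substituted goal is reached.
    for t in range(goal_index):
        next_state = states[t + 1]
        reward = reward_fn(next_state, new_goal)
        relabelled.append((states[t], actions[t], new_goal, reward))
    return relabelled


def sparse_reward(state, goal, tolerance=1e-6):
    """Hypothetical sparse reward: 1 when the state matches the goal, else 0."""
    distance = sum((s - g) ** 2 for s, g in zip(state, goal)) ** 0.5
    return 1.0 if distance <= tolerance else 0.0


# A failed 1-D episode: the original goal (5.0,) is never reached.
states = [(0.0,), (1.0,), (2.0,), (3.0,)]
actions = [(1.0,), (1.0,), (1.0,)]
episode = relabel_with_achieved_goal(states, actions, goal_index=2,
                                     reward_fn=sparse_reward)
```
]]></preformat>
        <p>The relabelled episode ends with a rewarded step, so it can be added to the replay buffer as a successful example even though the original goal was missed.</p>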
      </sec>
      <sec id="sec-3-4">
        <title>Gaussian Process Model</title>
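        <p>In order to reduce the number of costly agent-environment interactions we use Gaussian Processes (GPs) to
approximate the dynamics of our system and give uncertainty information. We use a small number of interactions
with the agent and environment to train the GP, which takes in the current state, <inline-formula><tex-math>s_t</tex-math></inline-formula>, and the action to be taken,
<inline-formula><tex-math>a_t</tex-math></inline-formula>. It is then optimized to output a Gaussian distribution which estimates the next state <inline-formula><tex-math>s_{t+1}</tex-math></inline-formula> with uncertainty.</p>
        <p>We represent the dynamics of our system as</p>
        <disp-formula><label>(6)</label><tex-math>s_{t+1} = s_t + D(s_t, a_t) + w,</tex-math></disp-formula>
        <p>with <inline-formula><tex-math>w</tex-math></inline-formula> (Gaussian system noise) and <inline-formula><tex-math>D</tex-math></inline-formula> (unknown transition dynamics). Given that <inline-formula><tex-math>x_t = (s_t, a_t)</tex-math></inline-formula>, the GP is
computed as</p>
        <disp-formula><label>(7)</label><tex-math>\hat{D}(x_t) \sim \mathcal{GP}\left(\mu_{\hat{d}}(x_t),\ k_{\hat{d}}(x_t, x_t')\right),</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>\mu_{\hat{d}}</tex-math></inline-formula> is the mean function and <inline-formula><tex-math>k_{\hat{d}}</tex-math></inline-formula> is the kernel function. With a set of observed transitions <inline-formula><tex-math>Y_{1:t} = D(x_1), \ldots, D(x_t)</tex-math></inline-formula>,
we can query our GP at a new data point <inline-formula><tex-math>x_*</tex-math></inline-formula> to obtain a distribution over expected state
updates:</p>
        <disp-formula><label>(8)</label><tex-math>p(\hat{D}(x_*) \mid Y_{1:t}, x_*) = \mathcal{N}\left(\mu_{\hat{d}}(x_*),\ \sigma^{2}_{\hat{d}}(x_*)\right).</tex-math></disp-formula>
        <p>Sampling from this Gaussian allows the rapid creation of more episodes to train the RL system. The same reward
calculations as the normal environment are used, so these episodes can be added directly to <inline-formula><tex-math>\Omega</tex-math></inline-formula> as before.</p>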
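        <p>To make the GP transition model concrete, the sketch below implements GP regression on state updates using only the Python standard library. It rests on several assumptions that the text does not specify: a zero mean function, a squared-exponential kernel with fixed hyper-parameters (no optimisation step), a scalar state and action, and hypothetical function names and training data.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def rbf(x1, x2, length=1.0, variance=1.0):
    """Squared-exponential kernel between two input vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return variance * math.exp(-0.5 * sq / (length * length))


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_lower(low, b):
    """Solve low @ y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        s = sum(low[i][k] * y[k] for k in range(i))
        y.append((b[i] - s) / low[i][i])
    return y


def solve_upper_from_lower(low, y):
    """Solve low.T @ x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / low[i][i]
    return x


def gp_posterior(train_x, train_y, query, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at one query input."""
    n = len(train_x)
    gram = [[rbf(train_x[i], train_x[j]) + (noise if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    low = cholesky(gram)
    weights = solve_upper_from_lower(low, solve_lower(low, train_y))
    k_star = [rbf(x, query) for x in train_x]
    mean = sum(k * w for k, w in zip(k_star, weights))
    v = solve_lower(low, k_star)
    var = rbf(query, query) - sum(vi * vi for vi in v)
    return mean, max(var, 0.0)


def sample_next_state(state, action, train_x, train_y, rng):
    """Draw s_{t+1} = s_t + D_hat(x_t) with x_t = (s_t, a_t), for a scalar state."""
    mean, var = gp_posterior(train_x, train_y, (state, action))
    return state + rng.gauss(mean, math.sqrt(var))


# Toy 1-D system whose true update is D(s, a) = a.
train_x = [(0.0, 0.5), (1.0, -0.5), (2.0, 1.0), (0.5, 0.0)]
train_y = [0.5, -0.5, 1.0, 0.0]
mean_seen, var_seen = gp_posterior(train_x, train_y, (1.0, -0.5))
mean_far, var_far = gp_posterior(train_x, train_y, (10.0, 10.0))
next_state = sample_next_state(1.0, -0.5, train_x, train_y, random.Random(0))
```
]]></preformat>
        <p>Near an observed transition the predictive variance is small, so sampled steps stay close to the observed update; far from the data the variance returns to the prior, which is the uncertainty information the text refers to.</p>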
      </sec>
      <sec id="sec-3-5">
        <title>Auto-Perceptive Network</title>
        <p>The perception part of our system is an auto-encoder. This allows us to encode a feature space to use as the state
space, <inline-formula><tex-math>S</tex-math></inline-formula>, which is the input to the RL system. The encoder uses the observations of the agent and environment
in the form of an image, transforming it to the feature space as the function <inline-formula><tex-math>\phi_{enc} : O \to S</tex-math></inline-formula>, whilst the decoder
arm transforms from the feature space to a reconstructed image, <inline-formula><tex-math>\phi_{dec} : S \to O</tex-math></inline-formula>.</p>
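        <p>The auto-encoder takes the image observation of the system as an input and compresses it down to the feature
space <inline-formula><tex-math>s_t = \phi_{enc}(o_t)</tex-math></inline-formula>, and the output is a reconstruction of that image, <inline-formula><tex-math>\hat{o}_t = \phi_{dec} \circ \phi_{enc}(o_t)</tex-math></inline-formula>. The reconstruction loss
is a pixel-wise loss against the input</p>
        <disp-formula><label>(9)</label><tex-math>L_r = |o_t - \hat{o}_t|.</tex-math></disp-formula>
        <p>We denote the space of available information from the environment, which has a predefined semantic meaning, as
<inline-formula><tex-math>\bar{s}_t \in \bar{S}</tex-math></inline-formula>. The optional conditioning loss is the absolute difference between a section of the encoded state space and
the semantically understood state. In the case where <inline-formula><tex-math>S \subset \mathbb{R}^{n}</tex-math></inline-formula> and <inline-formula><tex-math>\bar{S} \subset \mathbb{R}^{m}</tex-math></inline-formula>, with <inline-formula><tex-math>m \le n</tex-math></inline-formula>, the conditioning
loss is</p>
        <disp-formula><label>(10)</label><tex-math>L_c = |s_t^{1:m} - \bar{s}_t|.</tex-math></disp-formula>
        <sec id="sec-3-5-1">
          <title>The full loss for the visual perception network is</title>
          <disp-formula><label>(11)</label><tex-math>L_{vp} = L_r + \omega L_c,</tex-math></disp-formula>
          <p>where <inline-formula><tex-math>\omega</tex-math></inline-formula> is a weighting which determines how strong the conditioning is. The learnt feature space can be:</p>
          <list list-type="order">
            <list-item><p>entirely conditioned to be semantically understood as the observable state (<inline-formula><tex-math>m = n,\ \omega \neq 0</tex-math></inline-formula>),</p></list-item>
            <list-item><p>partially conditioned, with some learnt features relating to the observable state and some auxiliary features
with no predetermined semantic meaning (<inline-formula><tex-math>m &lt; n,\ \omega \neq 0</tex-math></inline-formula>),</p></list-item>
            <list-item><p>or not conditioned, with learnt features having no predetermined semantic meaning (<inline-formula><tex-math>\omega = 0</tex-math></inline-formula>).</p></list-item>
          </list>
          <p>This network can be trained using data from initial random exploration and fine-tuned during reinforcement
learning.</p>
        </sec>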
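        <p>The perception loss can be illustrated with plain Python lists standing in for a flattened image and the state vectors. This is a minimal sketch under stated assumptions: the function name and example values are hypothetical, and the absolute differences are summed over elements, a reduction the text does not specify.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def perception_loss(observation, reconstruction, encoded, semantic, omega):
    """L_vp = L_r + omega * L_c for a flattened image and state vectors."""
    if len(observation) != len(reconstruction):
        raise ValueError("observation and reconstruction must have the same size")
    m = len(semantic)
    if m > len(encoded):
        raise ValueError("semantic state cannot be longer than the encoded state")
    # L_r: pixel-wise absolute difference, summed over pixels (assumed reduction).
    recon = sum(abs(o - r) for o, r in zip(observation, reconstruction))
    # L_c: absolute difference on the first m encoded features only.
    cond = sum(abs(s - g) for s, g in zip(encoded[:m], semantic))
    return recon + omega * cond, recon, cond


# Hypothetical 4-pixel image, n = 3 encoded features, m = 2 semantic features.
total, recon, cond = perception_loss(
    observation=[0.0, 0.5, 1.0, 0.25],
    reconstruction=[0.0, 0.25, 1.0, 0.5],
    encoded=[1.0, 2.0, 9.0],
    semantic=[1.5, 2.0],
    omega=2.0,
)
```
]]></preformat>
        <p>The third encoded feature does not enter the conditioning term, so it is shaped by the reconstruction loss alone; setting <monospace>omega</monospace> to zero gives the unconditioned case.</p>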
      </sec>
      <sec id="sec-3-6">
        <title>Auto-Perceptive Reinforcement Learning (APRiL)</title>
        <p>The RL system and the auto-perception network are independent networks, which can be trained concurrently
with much of the same data but do not need to be trained end-to-end as they exploit different types of supervision.</p>
        <p>In the case of the encoded feature space being entirely semantically understood the auto-encoder is trained
with data collected for the initial random exploration - the same data can be used to train the GP to learn the
dynamics. These may be co-trained in parallel and tested individually before being integrated. The perception
network infers an approximated state from an observation and then passes this approximate state to the RL
network without the RL needing to see any images during training. This still provides a system which does not
need access to the robot state at run time and can predict actions with only visual input, but does not require it
to be trained in an end-to-end manner, allowing RL to benefit from HER and GP modelled transition dynamics.</p>
        <p>When the encoded feature space is partially semantically understood then the auto-encoder will still be
pretrained on random data but the encoder arm will be used to get the encoded state for input to the RL system.
Therefore the RL system only has to interpret the low dimensional feature space coming from the auto-encoder
and does not need to process the images. This means that the training is more focused on solving the control
problem. Techniques such as HER are still feasible since we have a predetermined understanding of some of the
feature space being used by the RL.</p>
        <p>
          The final case is where there is no semantically understood state available. This is similar to Lange and
Riedmiller’s work [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] where the encoder feature space had no predetermined semantic meaning. This case still
allows a lower dimensional state space to be learnt from the visual input even when there is no semantic state
available during training.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and Results</title>
      <p>
        To evaluate APRiL we use the OpenAI [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] framework with the Mujoco physics simulator [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. We use a variation
of the Fetch robot reach environment because it has a continuous action space and has a visually interesting
environment to test the auto-perceptive system. The aim is to direct the end-effector of the Fetch arm to a goal
gx, gy, gz - represented visually by a red sphere. The action space is defined with actions (Δx, Δy, Δz) where
(x, y, z) is the position of the end-effector and the maximum episode length is set as T = 50. We train the
networks using Adam optimizers [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and a Tensorflow [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] implementation of our system will be available at
https://github.com/rebecca-allday/APRiL.
      </p>
      <sec id="sec-4-1">
        <title>Fully Semantic Features</title>
        <p>
          The first experiment uses a fully observed, semantically understood state <inline-formula><tex-math>\bar{s}_t = (x_t, y_t, z_t, g_x, g_y, g_z)</tex-math></inline-formula>. Firstly, we
use a random policy to collect an initial experience replay buffer. This data can be used to train multiple
aspects of the system. Initially we train a GP on the transitions, taking in <inline-formula><tex-math>(x_t, y_t, z_t, \Delta x, \Delta y, \Delta z)</tex-math></inline-formula> and outputting
<inline-formula><tex-math>(x_{t+1}, y_{t+1}, z_{t+1})</tex-math></inline-formula>. This allows us to create extra episodes to train our RL system as described in Section 3.1.2. The
advantage actor-critic RL system is trained with data created from both the GP and from the agent, including
the HER additions to the replay buffer. The data from the random policy and any episodes collected using the
simulator are used to train the perception network. In this case the perception network is co-trained such that
<inline-formula><tex-math>\bar{s}_t = s_t = \phi_{enc}(o_t)</tex-math></inline-formula>, which is the first case from Section 3.2, when <inline-formula><tex-math>m = n</tex-math></inline-formula> and <inline-formula><tex-math>\omega \neq 0</tex-math></inline-formula>. Finally, at test time the
networks can be used together to go directly from vision to actions, following the data flow shown by the black
arrows in Fig. 2. We compare this to a latent space with no conditioning loss, where <inline-formula><tex-math>\omega = 0</tex-math></inline-formula>, which is similar to
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>The training of the RL system, using the semantically understood state space directly, converges with only 15
episodes of random policy interactions with the simulator, the rest of the data used is collected from our trained
GP. It takes approximately 0.01 seconds per rendered simulation step, but only 0.0025 seconds to sample a single
step from the GP. This equates to saving 75% of the time that would have been spent on collecting simulation
examples. This is a saving that would not be possible using a traditional end-to-end visual RL algorithm.</p>
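        <p>The 75% figure follows directly from the two per-step timings quoted above, as this short worked calculation shows. The constant and function names are assumptions made for the illustration; the timings are the ones reported in the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
SIM_STEP_SECONDS = 0.01    # rendered simulator step, as reported in the text
GP_STEP_SECONDS = 0.0025   # one step sampled from the GP, as reported in the text


def fraction_saved(sim_seconds, gp_seconds):
    """Fraction of per-step collection time saved by sampling from the GP."""
    return 1.0 - gp_seconds / sim_seconds


saving = fraction_saved(SIM_STEP_SECONDS, GP_STEP_SECONDS)
print(f"{saving:.0%} of the per-step collection time is saved")
```
]]></preformat>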
        <p>Table 2 shows the policy achieves an average episode length of 3.1 actions when using the ground truth
state space as input. The perception network is trained alongside this. Examples of the reconstructions from
the auto-encoder can be seen in Fig. 3, along with reconstructions from the auto-encoder without the semantic
conditioning (ω = 0). Even though we fully constrain the encoded feature space, and do not enable the system
to encode many visual properties, the decoder arm is still able to learn how to produce realistic images of the
scene from a non-visual intermediate state, including how to correctly place a fully textured robotic arm. They
are certainly comparable to the reconstructions without the conditioning loss. However, reconstruction accuracy
itself is unimportant; the key is that the reconstruction loss aids in encoding meaningful information into the latent space for
the RL.</p>
        <p>
          At test time we can see the performance of the system using the visual encoder network to produce the feature
space, which is an approximation of the semantically understood state space, given to the RL network. The policy
achieves an average episode length of 12.4 actions. This is largely due to the goal or end point being occluded
or out of the field of view, in which case the arm must move to attempt to gather more information about its
current state. In these situations, the ground truth algorithm is an unrealistic comparison for a vision based
system which will never have full access to the state. However, this is still much more effective than the case
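when the perceived state, <inline-formula><tex-math>S</tex-math></inline-formula>, is not conditioned on the semantically understood state, <inline-formula><tex-math>\bar{S}</tex-math></inline-formula>, which is similar to [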
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Partially Semantic Features</title>
        <p>
          The next set of experiments introduces an element such that the state is not fully observed via a semantically
understood state space. A randomly placed obstacle (box) is added which can affect exploration and potential
solutions for getting to the goal (red sphere), see Fig. 4. Again we compare the results in this section to a network
trained with no access to the available semantic state which is similar to [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. We also train a system which takes
the ground truth semantic state and a separate latent space (in a similar way to [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]) to show that if both are
available the system has all the information it needs.
        </p>
        <p>We first train APRiL on the same state that was available in the previous set-up. This means that the RL
system is not receiving any information about the obstacle. As expected, we see a reduction in performance
compared to the environment with no obstacle: the average rises from 3.10 actions per episode with no obstacle to
8.45 with obstacles, which equates to approximately a 2.7 times increase in the number of actions. Examples
of the reconstructions from the perception network are seen in Fig. 5b. These reconstructions are comparable
to those in Fig. 3b, with some slight degradation because the scene is more complex yet we have not given it
any additional degrees of freedom in the latent space. It is interesting to note that the decoder arm attempts
to reconstruct the obstacle even though it is theoretically not present in the intermediate state.</p>
        <p>When testing with the perception-to-action system we see that this gives much worse performance, with an
average episode length of 28.55 actions (see Table 3). This compares to 12.42 actions
with no obstacles, equating to approximately a 2.3 times increase in the number of actions, which is similar in
scale to the decrease in performance seen without perception. This is likely because it has no way of knowing
about the obstacle in the encoded state and often mistakes it for the goal, especially if the goal is occluded by
the arm.</p>
        <p>Next we allowed the encoded feature space to be only partially semantically understood. We used a feature space
of size n = 8, with the semantically understood state s¯t conditioning only the first 6 elements (i.e. m = 6). The
rest were driven purely by the reconstruction loss, allowing it to learn whatever was relevant to the understanding
of the environment. Example reconstructions from the perception network can be seen in Fig. 5d. This trained
perception network does a better job of modelling the obstacle and goal as independent objects, however the
robot arm has lost a significant amount of visual fidelity. This may be because all systems have been trained
for the same number of iterations, despite this one having more network parameters. Regardless, a high fidelity
image of the robotic arm is not important for RL, as long as the position is known.</p>
        <p>The proposed RL system using our partially semantically understood feature space as input performs better
than the system using just the semantic state, with an average of 20.83 actions (See Table 3). In comparison
to the 12.42 actions in the environment with no obstacles, this is only a 1.68 times increase for what is a more
difficult problem. This is approximately a 30% improvement compared to 28.55 average actions taken when using
the semantic feature space. This shows that when we do not have access to the full semantically understood state
our feature space can encode the additional auxiliary information necessary to solve the task better than just
with the semantic state based perception.</p>
        <p>
          Finally we give the perception network complete freedom to encode a state space based purely on the
reconstruction loss in a similar manner to [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Fig. 5f shows that this improves the reconstruction as expected
since that is the only feedback given to the encoder-decoder network. However, as we can see from Table 3, the
system that makes no use of the semantically understood data available to it at train time performs
much worse than those which do.
        </p>
        <p>[Figure 5 panels: (a) Original; (b) Reconstructed (n = m = 6); (c) Original; (d) Reconstructed (n = 8, m = 6); (e) Original; (f) Reconstructed (n = 16, m = 0).]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper we have shown that the bio-inspired separation of percepts and control at training time allows
reinforcement learning to be trained effectively and still gives a system that can predict actions purely from visual
data. We showed that allowing the perception system to encode additional properties into the feature space
improved the performance over a system using only the approximate state.</p>
      <p>This demonstrates the value in allowing the visual system to encode additional features into the input of our
RL algorithms. In addition, the splitting of perception and control allows other techniques to be used, which are
typically challenging to implement in the high dimensional image domain, such as HER and modelling transition
dynamics with GPs. Whilst we still have a system which allows us to go from visual observation to action, the
training does not need to be end-to-end.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Land</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Tatler</surname>
          </string-name>
          ,
          <article-title>Looking and acting: vision and eye movements in natural behaviour</article-title>
          . Oxford University Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          , G. Ostrovski,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Beattie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sadik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          , “
          <article-title>Human-level control through deep reinforcement learning</article-title>
          ,”
          <source>Nature</source>
          , vol.
          <volume>518</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hausknecht</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          , “
          <article-title>Deep Recurrent Q-Learning for Partially Observable MDPs</article-title>
          ,”
          <source>AAAI</source>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>37</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Hunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pritzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tassa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          , “
          <article-title>Continuous control with deep reinforcement learning</article-title>
          ,”
          <source>arXiv preprint arXiv:1509.02971</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Badia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          , “
          <article-title>Asynchronous methods for deep reinforcement learning</article-title>
          ,” in
          <source>ICML</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1928</fpage>
          -
          <lpage>1937</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Andrychowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McGrew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tobin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
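          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          , “
          <article-title>Hindsight experience replay</article-title>
          ,”
          <source>CoRR</source>
          , vol.
          <volume>abs/1707.01495</volume>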
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K. I.</given-names>
            <surname>Chatzilygeroudis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Goepp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vassiliades</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mouret</surname>
          </string-name>
          , “
          <article-title>Black-box data-efficient policy search for robotics</article-title>
          ,”
          <source>CoRR</source>
          , vol.
          <volume>abs/1703.07261</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , “
          <article-title>Reducing the Dimensionality of Data with Neural Networks</article-title>
          ,”
          <source>Science</source>
          , vol.
          <volume>313</volume>
          , pp.
          <fpage>504</fpage>
          -
          <lpage>507</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Masci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Meier</surname>
          </string-name>
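          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cireşan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>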
          , “
          <article-title>Stacked convolutional auto-encoders for hierarchical feature extraction</article-title>
          ,” in
          <source>International Conference on Artificial Neural Networks</source>
          . Springer,
          <year>2011</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.-B.</given-names>
            <surname>Yang</surname>
          </string-name>
          , “
          <article-title>Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections</article-title>
          ,” in
          <source>Advances in Neural Information Processing Systems</source>
          <volume>29</volume>
          . Curran Associates, Inc.,
          <year>2016</year>
          , pp.
          <fpage>2802</fpage>
          -
          <lpage>2810</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lange</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          , “
          <article-title>Deep auto-encoder neural networks in reinforcement learning</article-title>
          ,” in
          <source>The 2010 International Joint Conference on Neural Networks (IJCNN)</source>
          ,
          <month>July</month>
          <year>2010</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ernst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Geurts</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Wehenkel</surname>
          </string-name>
          , “
          <article-title>Tree-based batch mode reinforcement learning</article-title>
          ,”
          <source>Journal of Machine Learning Research</source>
          , vol.
          <volume>6</volume>
          , no.
          <issue>Apr</issue>
          , pp.
          <fpage>503</fpage>
          -
          <lpage>556</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          , “
          <article-title>Deep spatial autoencoders for visuomotor learning</article-title>
          ,” in
          <source>ICRA</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          , “
          <article-title>Learning neural network policies with guided policy search under unknown dynamics</article-title>
          ,” in
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1071</fpage>
          -
          <lpage>1079</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Stadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          , “
          <article-title>Incentivizing exploration in reinforcement learning with deep predictive models</article-title>
          ,”
          <source>CoRR</source>
          , vol.
          <volume>abs/1507.00814</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kimura</surname>
          </string-name>
          , “
          <article-title>DAQN: Deep Auto-encoder and Q-Network</article-title>
          ,”
          <source>arXiv</source>
          , vol.
          <volume>abs/1710.06542</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Nair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          , “
          <article-title>Visual reinforcement learning with imagined goals</article-title>
          ,” in
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>9209</fpage>
          -
          <lpage>9220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fujimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>van Hoof</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Meger</surname>
          </string-name>
          , “
          <article-title>Addressing function approximation error in actor-critic methods</article-title>
          ,”
          <source>arXiv:1802.09477</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pettersson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
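          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          , “
          <article-title>OpenAI Gym</article-title>
          ,”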
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tassa</surname>
          </string-name>
          , “
          <article-title>MuJoCo: A physics engine for model-based control</article-title>
          ,” in
          <source>Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on</source>
          . IEEE,
          <year>2012</year>
          , pp.
          <fpage>5026</fpage>
          -
          <lpage>5033</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          , “
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,” in
          <source>International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brevdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Citro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Devin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Irving</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Isard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jozefowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kudlur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Levenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mané</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Monga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viégas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Warden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wicke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , “
          <article-title>TensorFlow: Large-scale machine learning on heterogeneous systems</article-title>
          ,”
          <year>2015</year>
          , software available from tensorflow.org.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>