<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Advantage Functions for Policy Transfer to Noisy Environments with Safety Constraints</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierre Haritz</string-name>
          <email>pierre.haritz@tu-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Liebig</string-name>
          <email>thomas.liebig@cs.tu-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair of Artificial Intelligence, Faculty of Computer Science, TU Dortmund University</institution>
          ,
          <addr-line>Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LWDA'23: Lernen</institution>
          ,
          <addr-line>Wissen, Daten, Analysen</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Lamarr Institute for Machine Learning and Artificial Intelligence</institution>
          ,
          <addr-line>Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Training agents to control complex live systems on the live system itself is often infeasible, either due to the high cost or the potential dangers that might arise. In this paper, we take a step towards identifying ways to evaluate the transferability of models for the class of constrained Reinforcement Learning problems. Furthermore, we present an approach based on free-energy advantage functions that improves adaptability, and in turn transferability, for constrained Reinforcement Learning problems, and we subsequently increase the performance of a baseline algorithm, CPO, with regard to safety constraints in noisy environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Constraints</kwd>
        <kwd>reinforcement learning</kwd>
        <kwd>transfer learning</kwd>
        <kwd>safety</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        AI systems can have significant real-world impact, and if not designed and deployed with safety
in mind, they can cause harm to individuals, organizations, or society as a whole. Ensuring
safety is crucial to prevent accidents, unintended consequences, or malicious uses of AI. When
deploying trained models to large-scale industrial applications, unstable live systems can cause
damage of economic or other nature. Because of the high complexity, cost, and potential danger
of training live systems from scratch, usually, these models are trained on historical or simulation
data, which may or may not accurately reflect the actual use case environment. Specifically,
in some instances, knowledge of the actual environment dynamics is only partially available,
and algorithms need to be able to handle situations where there is a degree of uncertainty.
Classically, in control environments, robustness can be achieved with Model Predictive Control
approaches ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) when plant dynamics are known.
      </p>
      <p>Reinforcement Learning (RL) is a machine learning paradigm that includes a variety of
algorithmic approaches, foremost in sequential decision-making environments. Recently, RL
has become a promising way to solve sequential decision-making tasks in marketing and
gaming, as well as control tasks such as robotics and autonomous cars, where safety
and trustworthiness of the agent are an important factor.</p>
      <p>We argue that in real-world applications that require safety guarantees, RL methods that
transfer well could improve upon satisfying certain thresholds.</p>
      <p>
        Transfer learning is an established concept in areas such as image classification and natural
language processing ([
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), with the goal of reducing training time for Machine Learning models
and improving their performance. In this paper, we first give an overview of how Transfer
Learning is interpreted in Reinforcement Learning and discuss the benefit of transferability in
constrained Reinforcement Learning. Our contributions in this paper can be stated as follows:
• We propose criteria to evaluate policy transfer in constrained RL.
• We present a method for improving performance regarding safety after transferring
pre-trained policies to a noisy environment through the use of free-energy advantage
functions.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Background and Related Work</title>
      <p>In this section, we will introduce the mathematical framework for the problem setting.</p>
      <sec id="sec-3-1">
        <title>2.1. Reinforcement Learning</title>
        <p>Reinforcement Learning problems can typically be modeled with the help of a Markov Decision
Process (MDP) ℳ = (𝒮, 𝒜, 𝒫, γ, ℛ) with a state space 𝒮, an action space 𝒜, a transition probability
function 𝒫 : 𝒮 × 𝒜 × 𝒮 → [0, 1], a discount factor γ ∈ [0, 1] and a reward function ℛ : 𝒮 × 𝒜 → ℝ.</p>
        <p>To extend this to safety-critical problems, one possibility is to introduce a constraint cost
function 𝒞 : 𝒮 × 𝒜 → ℝ analogous to the reward function and a safety threshold d ∈ ℝ. We
define a Constrained Markov Decision Process (from now on referred to as CMDP) ℳ_C =
(𝒮, 𝒜, 𝒫, γ, ℛ, 𝒞, d). We can calculate the discounted return J_ℛ(π) = 𝔼_{τ∼π}[∑_{t=0}^∞ γ^t ℛ(s_t, a_t)]
of a policy π : 𝒮 → 𝒜, with π ∈ Π for the set of all policies Π and a
trajectory τ = (s_0, a_0, s_1, a_1, …).</p>
        <p>Let Π_C = {π ∈ Π : J_𝒞(π) ≤ d} be the set of policies that satisfy the constraint threshold d. Then we can
calculate the optimal policy π* = argmax_{π ∈ Π_C} J_ℛ(π).</p>
        <p>In real-life applications of Reinforcement Learning, environment dynamics, especially state
transitions, can be unknown. Therefore, we introduce a generalization of the MDP model by
assuming transition probabilities 𝒫_{s,a} ∈ Δ for finite states and actions and probability simplex
Δ ⊂ ℝ_+^{|𝒮|}. A common way to learn the objective under the assumption of unknown transition
probabilities is to maximize a lower bound on the return.</p>
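        <p>As a brief illustration (a sketch of the definitions above, not the paper's implementation), the
discounted return J(π) along a trajectory and the membership test for the constraint-satisfying set Π_C
can be written as:</p>

```python
import numpy as np

def discounted_return(signal, gamma):
    # J(pi) along one trajectory: sum over t of gamma^t * signal_t,
    # where signal_t is the reward (or constraint cost) at step t.
    t = np.arange(len(signal))
    return float(np.sum((gamma ** t) * np.asarray(signal, dtype=float)))

def in_constrained_policy_set(costs, gamma, d):
    # A policy belongs to Pi_C when its discounted constraint return
    # J_C(pi) does not exceed the safety threshold d.
    excess = discounted_return(costs, gamma) - d
    return max(excess, 0.0) == 0.0
```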
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Transfer Learning in the Reinforcement Learning Context</title>
        <p>In a mathematical sense, given a source domain ℳ_S and a target domain ℳ_T, Transfer Learning
(TL) is used to learn an optimal policy π* for ℳ_T by incorporating both external information ℐ_S
from the source and internal information ℐ_T gathered from ℳ_T. The optimal policy can be written
as π* = argmax_π 𝔼_{s∼𝒮_0, a∼π}[Q_π(s, a)] for an initial set of states 𝒮_0.</p>
        <p>
          Taylor and Stone [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
highlight the benefits of using transfer methods in RL tasks and categorize measurements as
such:
• Performance improvement of the initial policy by transferring an agent from a source
task to a target task.
• Performance improvement of the final learned policy of an agent on a target task by
transferring.
• The gained total cumulative reward from a transfer strategy compared to a non-transfer
strategy.
• The ratio of the total reward accumulated by the transfer learner and the total reward
accumulated by the non-transfer learner.
• The reduction of learning time needed by the agent to achieve a pre-specified performance
level via knowledge transfer.
        </p>
        <p>
          In the literature ([
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]), a variety of TL approaches that fall under this category are mentioned:
In Imitation Learning, the agent is trained to mimic a source policy, provided by a so-called
expert. This is a way of training without having access to feedback from the environment. A
framework for Imitation Learning in partially-observable settings based on the Free-Energy
Principle has been proposed in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In cases where the reward signal is available, Learning from
Demonstrations (LfD) is a possible way of training an agent. The way agents combine their
knowledge (inter-agent or intra-agent) in Cooperative Multi-Agent RL can also be described as
a form of TL.
        </p>
        <p>In TL, domains can be described by MDPs, and any of their parts can differ between
the source and target domain. Consider state spaces 𝒮_S and 𝒮_T. Any of these relations might be
true, depending on the problem: 𝒮_S ⊂ 𝒮_T, 𝒮_S ≡ 𝒮_T or 𝒮_S ⊃ 𝒮_T. Differences for the action spaces
𝒜_S and 𝒜_T are analogous. Since both state and action spaces can differ, reward functions can also
be defined differently for both domains. Ultimately, trajectories can differ for problems where
reaching a goal can be achieved differently (e.g., path-finding tasks).</p>
        <p>This can be further extended to safety-critical applications. Differing state spaces can be the
result of failed sensors; differing action spaces are the result of hard constraints implemented by
the system. Additionally, reward functions might yield different values in cases where sensors
supply noisy data. In the case of CMDPs, for similar reasons, differences can be found in both the
constraint cost function and the safety threshold.</p>
        <p>
          On the topic of which kind of knowledge is transferable, we can define multiple forms. The
transfer of trajectories is the main subject of LfD. Furthermore, the transfer of model dynamics
is possible when it is feasible to approximate them with offline learning algorithms trained on
historical data before transferring to an online system. Offline RL algorithms usually
mitigate the impact of the gap between real and estimated values by adding a pessimism factor
to these learned values ([
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]) or learned dynamic models ([
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]).
        </p>
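        <p>As a hedged sketch of the pessimism idea just described (the function name and the linear
penalty form are our illustration, not taken from the cited works), a lower bound on offline value
estimates might look like:</p>

```python
import numpy as np

def pessimistic_values(q_estimates, uncertainty, alpha):
    # Lower-bound the learned values by subtracting an alpha-weighted
    # uncertainty penalty, mitigating the gap between real and
    # estimated values before deployment to an online system.
    q = np.asarray(q_estimates, dtype=float)
    u = np.asarray(uncertainty, dtype=float)
    return q - alpha * u
```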
        <p>
          The transfer of policies has been discussed by [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. They propose to extend the
exploration-exploitation choice with the option to reuse an older policy and consequently test
the transfer performance. Reward Shaping (as presented in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]) speeds up the RL training process by guiding the exploration process: the reward function is
transformed into a potential-based reward function.
        </p>
        <p>
          Transfer by starting from prior distributions has been explored by [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Instead of finding
trajectories that maximize expected rewards, inference formulations start from a prior
distribution over trajectories, condition on the desired outcome, such as achieving a goal state, and
then estimate the posterior distribution over trajectories consistent with this outcome. Since
imitation learning provides a teacher policy to learn from, this approach interprets the teacher
policy as a prior policy distribution.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Using Free-Energy Priors to improve Robustness after Policy Transfer</title>
      <p>In real-world applications, such as robotics, it can be hard to separate signals from noise,
especially at the early stages after deploying a learned strategy. We consider a scenario where
there is a cost to receiving state data from an actor, e.g., sensor data from a robot’s joints. Since
we are considering the case of a sim-to-real transfer, we assume the existence of priors learned
from simulation interactions. In this section, we propose the use of an advantage function over
the simulation priors based on the free-energy principle to improve the agent’s robustness.</p>
      <sec id="sec-5-1">
        <title>3.1. Free-Energy Functions</title>
        <p>Free-energy functions are fundamental concepts in thermodynamics and statistical mechanics
that describe the energy available to do work in a system while accounting for both its internal
energy and its entropy.</p>
      </sec>
      <sec id="sec-5-2">
        <title>3.2. Quantifying the Cost of Control</title>
        <p>
          Rubin et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] borrow the term to define free-energy functions in the RL context to derive
optimal policies and explore the tradeoff between value and control information. The idea
is that optimal policies reflect a balance between maximizing expected rewards (value) and
minimizing the information cost that comes with control.
        </p>
        <p>With the help of information theory, we can quantify the expected cost of executing a policy
π in state s ∈ 𝒮 as ΔI(s) = ∑_a π_T(a|s) log(π_T(a|s) / π_S(a|s)), with ΔI(s_f) = 0 for a terminal state s_f. With this,
we are able to measure the relative entropy between the source policy π_S and target policy
π_T. The source policy is used by the agent in the absence of information from its new noisy
environment. For any state s, ΔI(s) describes the minimal number of bits required to describe
the outcome, or action sampled, of the random variable a ∼ π_T. In our case, it serves as a
measure for the cost of control. Similar to the value function V_π(s_0), we can define the total
control information involved in executing policy π starting from the initial state s_0:</p>
        <p>I_π(s_0) = lim_{n→∞} 𝔼[∑_{t=0}^{n−1} ΔI(s_t)]
= lim_{n→∞} 𝔼[∑_t log(π_T(a_t|s_t) / π_S(a_t|s_t))]
= lim_{n→∞} 𝔼[log(P(a_0, a_1, …, a_{n−1} | s_0, π_T) / P(a_0, a_1, …, a_{n−1} | s_0, π_S))]</p>
        <p>Here, the optimal target policy π*_T should minimize the control information cost and at the
same time maximize the reward while respecting environmental constraints.</p>
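        <p>The per-state cost of control ΔI(s) defined above can be sketched as follows (a minimal
illustration over discrete action distributions, assuming full support of the source policy):</p>

```python
import numpy as np

def control_information_bits(pi_target, pi_source):
    # Delta I(s): relative entropy (in bits) between the target and source
    # action distributions at one state -- the per-state cost of control.
    pt = np.asarray(pi_target, dtype=float)
    ps = np.asarray(pi_source, dtype=float)
    return float(np.sum(pt * np.log2(pt / ps)))
```

When the target policy equals the source policy, the cost of control is zero; it grows as the
target policy deviates from the prior.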
      </sec>
      <sec id="sec-5-3">
        <title>3.3. Optimization with constrained Policies</title>
        <p>
          Trust region algorithms for reinforcement learning ([
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ]), such as CPO, make sure that each new policy is within a so-called trust region of the
previous one and have policy updates of the form
π_{k+1} = argmax_{π ∈ Π_θ} 𝔼_{s, a∼π_k}[A_{π_k}(s, a)] subject to D̄_KL(π ‖ π_k) ≤ δ.
Here, Π_θ ⊂ Π denotes a θ-parameterized policy subset that filters for relevant parameters,
and δ &gt; 0 is the step size.
        </p>
        <p>The advantage function calculates the expected reward gain along a trajectory and is given
by:
A_π(s, a) = Q_π(s, a) − V_π(s)
= 𝔼_{τ∼π}[ℛ(τ) | s_0 = s, a_0 = a] − 𝔼_{τ∼π}[ℛ(τ) | s_0 = s].
The trust region is then defined by the set {π ∈ Π_θ : D̄_KL(π ‖ π_k) ≤ δ}.</p>
        <p>CPO solves the CMDP problem approximately by calculating this update while additionally
enforcing the constraint-cost condition J_𝒞(π) ≤ d.</p>
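        <p>The advantage A_π(s, a) = Q_π(s, a) − V_π(s) can be sketched for a discrete action space
(a minimal illustration; V(s) is taken as the policy-weighted expectation of Q):</p>

```python
import numpy as np

def advantage(q_values, policy):
    # A(s, a) = Q(s, a) - V(s), where V(s) is the policy-weighted
    # expectation of Q over actions at state s.
    q = np.asarray(q_values, dtype=float)
    pi = np.asarray(policy, dtype=float)
    v = float(np.dot(pi, q))
    return q - v
```

By construction, the policy-weighted mean of the advantages is zero: actions better than the
policy's average have positive advantage.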
      </sec>
      <sec id="sec-5-4">
        <title>3.4. Using Free-Energy Functions to improve Transferability</title>
        <p>We aim to use a free-energy function to derive optimal policies while balancing the tradeoff
between value and information during exploration.</p>
        <p>
          Early works ([
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]) propose using advantage functions in noisy environments to mitigate
undesired approximation effects by reducing the action gap ([
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]). We assume a stochastic prior
policy π_S(a|s) from the source task. Fox et al. ([16]) propose that we can measure the information
cost of a policy π_T(a|s) with I_{π_T}(s, a) = log(π_T(a|s) / π_S(a|s)). The expected information cost of the
target policy π_T can be written as 𝔼_{π_T}[I_{π_T}(s_t, a_t)] = D_KL(π_T ‖ π_S). Considering the dynamics induced
by the transition probabilities P(s_{t+1} | s_t, a_t) of the underlying MDP, we can now consider the
total discounted expected information cost for the target policy:
I_{π_T}(s) = ∑_{t=0}^∞ γ^t 𝔼[D_KL(π_T(·|s_t) ‖ π_S(·|s_t))].
        </p>
        <p>We define
F_{π_T}(s) = V_{π_T}(s) + (1/β) I_{π_T}(s)
as a β-weighted free-energy function, with β controlling the tradeoff between value and
information. From this we get a state-action free-energy function
F_{π_T}(s, a) = 𝔼[ℛ | s, a] + γ 𝔼[F_{π_T}(s′) | s, a].</p>
        <p>Now, we define the free-energy advantage function as:
A^F_{π_T}(s, a) = F_{π_T}(s, a) − F_{π_T}(s)
= 𝔼_{τ∼π_T}[ℛ(τ) + 𝒞(τ) | s_0 = s, a_0 = a] − 𝔼_{τ∼π_T}[ℛ(τ) | s_0 = s].
Here, 𝒞(τ) represents the cumulative sum of constraint costs along the trajectory τ.</p>
        <p>Finally, we can calculate the free-energy advantage transfer policy update by replacing the
advantage function in the CPO update with A^F_{π_T}(s, a).</p>
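        <p>The quantities of this section can be sketched as follows (a minimal illustration under the
sign convention used above; the per-step KL values are assumed to be supplied externally, e.g.
from Monte Carlo estimates):</p>

```python
import numpy as np

def discounted_info_cost(kl_per_step, gamma):
    # I_{pi_T}(s): total discounted expected information cost, where
    # kl_per_step[t] approximates E[ D_KL(pi_T(.|s_t) || pi_S(.|s_t)) ].
    t = np.arange(len(kl_per_step))
    return float(np.sum((gamma ** t) * np.asarray(kl_per_step, dtype=float)))

def free_energy(value, info_cost, beta):
    # Beta-weighted free energy combining value and control information,
    # with beta controlling the value-information tradeoff.
    return value + info_cost / beta

def free_energy_advantage(f_state_action, f_state):
    # A^F(s, a) = F(s, a) - F(s).
    return f_state_action - f_state
```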
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Results</title>
      <sec id="sec-6-1">
        <title>4.1. Experiments</title>
        <p>
          In this section, we will present the evaluation framework, metrics and results. We will
first evaluate the performance of the Constrained Policy Optimization
algorithm [17] for constrained RL problems. CPO yields better performance on constrained
tasks than methods such as Trust Region Policy Optimization or Primal-Dual Optimization
([
          <xref ref-type="bibr" rid="ref12">12, 18</xref>
          ]). We conduct the experiments on an exemplary robotics learning task, specifically the
HalfCheetah environment within the MuJoCo1 physics engine embedded in OpenAI Gym2. The
HalfCheetah is a two-dimensional simulated robot with six controllable joints, as depicted in
figure 1.
        </p>
        <p>We use a continuous action space with 𝒜 = [−1, 1]^6, where each entry of the action
vector represents the torque [Nm] applied to the respective motorized joint. The constraint is
placed on an angle at which the HalfCheetah is considered to have fallen over and would not be
able to recover to a standing position without external help.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>4.2. Evaluating Transferability for Safety-Critical Applications</title>
        <p>For safety-critical applications at any scale, the best direct improvement of TL would generally
be starting from accurate prior distributions, because we can expect a reduced exploratory
period. While this is expected to reduce training time, prevention of constraint violations is
not necessarily guaranteed. Having reliable algorithms should also make it possible to train
an agent in a simulation and then transfer the model to safety-critical applications in the real
world without violating constraints imposed by the task. We, therefore, extend the list by the
following measurements:
• The ratio of total constraint cost accumulated by the transfer learner and total constraint
cost accumulated by the non-transfer learner, or between different transfer learners.
• The sum of constraint violations committed by the transfer learner compared to the
non-transfer learner (or between multiple transfer learners) above a specified threshold.
Note that we hypothesize that measuring the robustness gained by simultaneously learning
system dynamics ([19]) could be a valid metric, which we intend to examine in the future.</p>
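        <p>The first of the proposed measurements, the constraint-cost ratio, can be computed as in
this minimal sketch (function name is ours, for illustration):</p>

```python
def constraint_cost_ratio(transfer_costs, baseline_costs):
    # Ratio of the total constraint cost accumulated by the transfer
    # learner to that accumulated by the non-transfer learner;
    # values below 1 favor the transfer strategy.
    return sum(transfer_costs) / sum(baseline_costs)
```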
      </sec>
      <sec id="sec-6-3">
        <title>4.3. Evaluation</title>
        <p>We hence compare the CPO algorithm with and without free-energy advantage policy transfer
(FEAT) in noisy environments with a noise factor ε_i ∼ 𝒩(1, σ) for every state variable index
i ∈ {1, …, |𝒮|}, evaluating the post-transfer performance according to the formerly proposed
criteria. In all experiments, we first pre-train an agent with an implementation of the CPO
algorithm in a simulated environment without noise for 2500 iterations. After the final iteration,
the agent is able to control the HalfCheetah at a satisfactory level.</p>
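        <p>The noise model described above, a multiplicative factor drawn from 𝒩(1, σ) per state
variable, can be sketched as (a minimal illustration, not the experiment code):</p>

```python
import numpy as np

def noisy_state(state, sigma, rng=None):
    # Perturb every state variable with an independent multiplicative
    # noise factor eps_i drawn from a normal distribution N(1, sigma).
    rng = np.random.default_rng(0) if rng is None else rng
    eps = rng.normal(loc=1.0, scale=sigma, size=len(state))
    return np.asarray(state, dtype=float) * eps
```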
        <sec id="sec-6-3-1">
          <title>4.3.1. Comparison of ratios of total constraint costs</title>
        </sec>
        <sec id="sec-6-3-2">
          <title>4.3.2. Comparison of the sum of constraint violations</title>
          <p>For the criterion of constraint violations, we define a constraint threshold d. Like above, we
train the agents for a total of n = 1000 iterations. In a noisy environment with σ = 0.1, we
evaluate both agents with a strict safety threshold of d = 0.02. Here, the value for d means that
the HalfCheetah is not allowed to show signs of falling over. While CPO without FEAT violates
the threshold 7.2% of the time, CPO with added FEAT evaluates at only 3.5%.
For σ = 0.2, we chose a higher threshold of d = 0.15 (the agent is allowed to appear unstable,
but is not allowed to fall over). CPO without FEAT violates the threshold in 86.7% of iterations,
while CPO with FEAT is significantly lower, with only 32.3% violations. Unfortunately, both
algorithms still lack the necessary robustness to guarantee safety for environments with higher
noise levels.</p>
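          <p>The violation rates reported above correspond to the fraction of iterations in which the
constraint signal exceeds the threshold d, which can be sketched as:</p>

```python
import numpy as np

def violation_rate(constraint_signal, d):
    # Fraction of iterations in which the constraint signal exceeds the
    # safety threshold d (e.g. the HalfCheetah's tilt angle per iteration).
    s = np.asarray(constraint_signal, dtype=float)
    excess = np.maximum(s - d, 0.0)
    return float(np.count_nonzero(excess) / len(s))
```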
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we highlighted how Transfer Learning can be interpreted in the context of
constrained Reinforcement Learning and proposed a way that transferability can be evaluated. The
experiments indicate that our approach improves the transferability of policies for constrained
problems in the specific case of the Constrained Policy Optimization algorithm.</p>
      <p>In the future, we aim to further research how this approach is applicable to similar policy-based
RL algorithms and extend this to a more general case. Furthermore, to reflect real-world
problems more accurately, we plan to add further restrictions to the actor’s perception of the
environment, such as partial observability.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This research has been funded by the Federal Ministry of Education and Research of Germany
and the state of North Rhine-Westphalia as part of the Lamarr Institute for Machine Learning
and Artificial Intelligence.</p>
      <p>New operators for reinforcement learning, in: Proceedings of the AAAI Conference on
Artificial Intelligence, volume 30, 2016.
[16] R. Fox, A. Pakman, N. Tishby, Taming the noise in reinforcement learning via soft updates,
arXiv preprint arXiv:1512.08562 (2015).
[17] J. Achiam, D. Held, A. Tamar, P. Abbeel, Constrained policy optimization, in: International
Conference on Machine Learning, PMLR, 2017, pp. 22-31.
[18] Y. Chow, M. Ghavamzadeh, L. Janson, M. Pavone, Risk-constrained reinforcement learning
with percentile risk criteria, The Journal of Machine Learning Research 18 (2017) 6070-6120.
[19] P. G. Sessa, I. Bogunovic, M. Kamgarpour, A. Krause, Mixed strategies for robust
optimization of unknown objectives, in: International Conference on Artificial Intelligence and
Statistics, PMLR, 2020, pp. 2970-2980.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Kothare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Balakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morari</surname>
          </string-name>
          ,
          <article-title>Robust constrained model predictive control using linear matrix inequalities</article-title>
          ,
          <source>Automatica</source>
          <volume>32</volume>
          (
          <year>1996</year>
          )
          <fpage>1361</fpage>
          -
          <lpage>1379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on transfer learning</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>109</volume>
          (
          <year>2020</year>
          )
          <fpage>43</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Taylor</surname>
          </string-name>
          , P. Stone,
          <article-title>Transfer learning for reinforcement learning domains: A survey.</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>10</volume>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Transfer learning in deep reinforcement learning: A survey</article-title>
          , arXiv preprint arXiv:
          <year>2009</year>
          .
          <volume>07888</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ogishima</surname>
          </string-name>
          , I. Karino,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kuniyoshi</surname>
          </string-name>
          ,
          <article-title>Reinforced imitation learning by free energy principle</article-title>
          ,
          <source>arXiv preprint arXiv:2107.11811</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fujimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Precup</surname>
          </string-name>
          ,
          <article-title>Off-policy deep reinforcement learning without exploration</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2052</fpage>
          -
          <lpage>2062</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kidambi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajeswaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Netrapalli</surname>
          </string-name>
          , T. Joachims, Morel:
          <article-title>Model-based offline reinforcement learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>21810</fpage>
          -
          <lpage>21823</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Veloso</surname>
          </string-name>
          ,
          <article-title>Probabilistic policy reuse for inter-task transfer learning</article-title>
          ,
          <source>Robotics and Autonomous Systems</source>
          <volume>58</volume>
          (
          <year>2010</year>
          )
          <fpage>866</fpage>
          -
          <lpage>871</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harutyunyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Taylor</surname>
          </string-name>
          , A. Nowé,
          <article-title>Policy transfer using reward shaping</article-title>
          .,
          <source>in: AAMAS</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdolmaleki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Springenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tassa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Munos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <article-title>Maximum a posteriori policy optimisation</article-title>
          ,
          <source>arXiv preprint arXiv:1806.06920</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Shamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tishby</surname>
          </string-name>
          ,
          <article-title>Trading value and information in MDPs</article-title>
          ,
          <source>in: Decision Making with Imperfect Decision Makers</source>
          , Springer,
          <year>2012</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moritz</surname>
          </string-name>
          ,
          <article-title>Trust region policy optimization</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1889</fpage>
          -
          <lpage>1897</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>High-dimensional continuous control using generalized advantage estimation</article-title>
          ,
          <source>arXiv preprint arXiv:1506.02438</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Baird</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning in continuous time: Advantage updating</article-title>
          ,
          <source>in: Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94)</source>
          , volume
          <volume>4</volume>
          , IEEE,
          <year>1994</year>
          , pp.
          <fpage>2448</fpage>
          -
          <lpage>2453</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Munos</surname>
          </string-name>
          ,
          <article-title>Increasing the action gap: New operators for reinforcement learning</article-title>
          ,
          <source>in: AAAI</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>