<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Free-Energy Advantage Functions for Policy Transfer to Noisy Environments with Safety Constraints</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Pierre</forename><surname>Haritz</surname></persName>
							<email>pierre.haritz@tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Chair of Artificial Intelligence</orgName>
								<orgName type="department" key="dep2">Faculty of Computer Science</orgName>
								<orgName type="institution">TU Dortmund University</orgName>
								<address>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Lamarr Institute for Machine Learning and Artificial Intelligence</orgName>
								<address>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Thomas</forename><surname>Liebig</surname></persName>
							<email>thomas.liebig@cs.tu-dortmund.de</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Chair of Artificial Intelligence</orgName>
								<orgName type="department" key="dep2">Faculty of Computer Science</orgName>
								<orgName type="institution">TU Dortmund University</orgName>
								<address>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Lamarr Institute for Machine Learning and Artificial Intelligence</orgName>
								<address>
									<settlement>Dortmund</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Free-Energy Advantage Functions for Policy Transfer to Noisy Environments with Safety Constraints</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">7556ADA5300CD260BF6C02F4A885E4D5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>reinforcement learning</term>
					<term>transfer learning</term>
					<term>safety</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Training acting agents to control complex live systems on the system itself is often infeasible, either due to high cost or the potential dangers involved. In this paper, we take a step towards identifying ways to evaluate the transferability of models for the class of constrained Reinforcement Learning problems. Furthermore, we present an approach based on free-energy advantage functions that improves adaptability, and in turn transferability, for constrained Reinforcement Learning problems, and we show that it increases the performance of a baseline algorithm, CPO, with regard to safety constraints in noisy environments.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>AI systems can have significant real-world impact, and if not designed and deployed with safety in mind, they can cause harm to individuals, organizations, or society as a whole. Ensuring safety is crucial to prevent accidents, unintended consequences, and malicious uses of AI. When trained models are deployed to large-scale industrial applications, unstable live systems can cause economic or other damage. Because of the high complexity, cost, and potential danger of training live systems from scratch, these models are usually trained on historical or simulation data, which may or may not accurately reflect the actual deployment environment. In some instances, knowledge of the environment dynamics is only partially available, and algorithms need to handle situations with a degree of uncertainty. Classically, in control environments, robustness can be achieved with Model Predictive Control approaches ( <ref type="bibr" target="#b0">[1]</ref>) when the plant dynamics are known.</p><p>Reinforcement Learning (RL) is a machine learning paradigm that comprises a variety of algorithmic approaches for sequential decision-making. Recently, RL has become a promising way to solve sequential decision-making tasks in marketing, gaming, and control domains such as robotics and autonomous driving, where the safety and trustworthiness of the agent are important factors.</p><p>We argue that in real-world applications requiring safety guarantees, RL methods that transfer well can better satisfy given safety thresholds.</p><p>Transfer learning is an established concept in areas such as image classification and natural language processing ([2]), with the goal of reducing the training time of Machine Learning models and improving their performance. 
In this paper, we first give an overview of how Transfer Learning is interpreted in Reinforcement Learning and discuss the benefits of transferability in constrained Reinforcement Learning. Our contributions can be stated as follows:</p><p>• We propose criteria to evaluate policy transfer in constrained RL.</p><p>• We present a method, based on free-energy advantage functions, for improving safety performance after transferring pre-trained policies to a noisy environment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background and Related Work</head><p>In this section, we will introduce the mathematical framework for the problem setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Reinforcement Learning</head><p>Reinforcement Learning problems can typically be modeled as a Markov Decision Process (MDP) 𝑀 = (𝑆, 𝐴, 𝑇 , 𝛾 , 𝑅) with a state space 𝑆, an action space 𝐴, a transition probability function 𝑇 ∶ 𝑆 × 𝐴 × 𝑆 → [0, 1], a discount factor 𝛾 ∈ [0, 1] and a reward function 𝑅 ∶ 𝑆 × 𝐴 → ℝ.</p><p>To extend this to safety-critical problems, one possibility is to introduce a constraint cost function 𝐶 ∶ 𝑆 × 𝐴 → ℝ, analogous to the reward function, and a safety threshold 𝑐 ∈ ℝ. We define a Constrained Markov Decision Process (from now on referred to as CMDP) 𝑀 𝐶 = (𝑆, 𝐴, 𝑇 , 𝛾 , 𝑅, 𝐶, 𝑐). The expected discounted constraint cost of a policy 𝜋 ∶ 𝑆 → 𝐴 with 𝜋 ∈ Π, for the set of all policies Π and a trajectory 𝜏 = (𝑠 0 , 𝑎 0 , 𝑠 1 , 𝑎 1 , … ), is</p><formula xml:id="formula_0">𝐽 𝐶 (𝜋) = 𝔼 𝜏 ∼𝜋 [∑ ∞ 𝑡=0 𝛾 𝑡 𝐶(𝑠 𝑡 , 𝑎 𝑡 )].</formula><p>Let Π 𝐶 = {𝜋 ∈ Π ∶ 𝐽 𝐶 (𝜋) ≤ 𝑐} be the set of policies that satisfy the constraint 𝑐. The optimal policy is then 𝜋 * = arg max 𝜋∈Π 𝐶 𝐽 (𝜋).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In real-life applications of Reinforcement Learning, environment dynamics, especially state transitions, can be unknown. We therefore introduce a generalization of the MDP model by assuming transition probabilities 𝑇 ⋆ 𝑠,𝑎 ∈ Δ 𝑆 for finite states and actions and probability simplex Δ 𝑆 ⊂ ℝ 𝑆 + . A common way to learn the objective under unknown transition probabilities is to maximize a lower bound on the return.</p></div>
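The discounted constraint cost 𝐽_C and the feasibility check defining Π_C can be estimated from sampled trajectories; the following Monte-Carlo sketch for a single trajectory is our own illustration (the function names are not from the paper):

```python
def discounted_constraint_cost(costs, gamma):
    """Single-trajectory estimate of J_C(pi): the discounted sum of
    constraint costs C(s_t, a_t) along a sampled trajectory tau."""
    return sum((gamma ** t) * c for t, c in enumerate(costs))


def is_feasible(costs, gamma, c_threshold):
    """A policy belongs to Pi_C if its (estimated) discounted constraint
    cost stays at or below the safety threshold c."""
    return discounted_constraint_cost(costs, gamma) <= c_threshold
```

In practice 𝐽_C would be averaged over many rollouts; a single trajectory only gives an unbiased sample of the expectation.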
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Transfer Learning in the Reinforcement Learning Context</head><p>In a mathematical sense, given a source domain 𝑀 S and a target domain 𝑀 T , Transfer Learning (TL) learns an optimal policy 𝜋 * for 𝑀 T by incorporating both external information ℐ S from the source and internal information ℐ T gathered from 𝑀 T . The optimal policy can be written as</p><formula xml:id="formula_1">𝜋 * = arg max 𝜋 𝔼 𝑥∼𝜇,𝑎∼𝜋 [𝑄 𝜋 𝑀 (𝑥, 𝑎)]</formula><p>for an initial state distribution 𝜇. Taylor and Stone <ref type="bibr" target="#b2">[3]</ref> highlight the benefits of using transfer methods in RL tasks and categorize the measurements as follows:</p><p>• Performance improvement of the initial policy when transferring an agent from a source task to a target task. • Performance improvement of the final learned policy of an agent on a target task through transfer. • The total cumulative reward gained by a transfer strategy compared to a non-transfer strategy. • The ratio of the total reward accumulated by the transfer learner to the total reward accumulated by the non-transfer learner. • The reduction in learning time needed by the agent to achieve a pre-specified performance level via knowledge transfer.</p><p>The literature ( <ref type="bibr" target="#b3">[4]</ref>) mentions a variety of TL approaches that fall under this category: In Imitation Learning, the agent is trained to mimic the policy of a source agent, called the expert. This allows training without access to feedback from the environment. A framework for Imitation Learning in partially observable settings based on the Free-Energy Principle has been proposed in <ref type="bibr" target="#b4">[5]</ref>. In cases where the reward signal is available, Learning from Demonstrations (LfD) is a possible way of training an agent. 
The way agents combine their knowledge (inter-agent or intra-agent) in Cooperative Multi-Agent RL can also be described as a form of TL.</p><p>In TL, domains can be described by MDPs, and any of their components can differ between the source and target domain. Consider state spaces 𝑆 S and 𝑆 T . Depending on the problem, any of the relations 𝑆 S ⊂ 𝑆 T , 𝑆 S ≡ 𝑆 T or 𝑆 S ⊃ 𝑆 T might hold. The same relations apply analogously to the action spaces 𝐴 S and 𝐴 T . Since both state and action spaces can differ, reward functions can also be defined differently for the two domains. Ultimately, trajectories can differ for problems where a goal can be reached in different ways (e.g., path-finding tasks).</p><p>This extends to safety-critical applications. Differing state spaces can result from failed sensors, and differing action spaces from hard constraints imposed by the system. Additionally, reward functions might yield different values where sensors supply noisy data. In the case of CMDPs, for similar reasons, differences can appear in both the constraint cost function and the safety threshold.</p><p>Regarding which kinds of knowledge are transferable, we can distinguish multiple forms. The transfer of trajectories is the main subject of LfD. Furthermore, the transfer of model dynamics is possible when an approximation can be learned by offline algorithms trained on historical data before being transferred to an online system. Offline RL algorithms usually mitigate the gap between real and estimated values by adding a pessimism factor to the learned values ( <ref type="bibr" target="#b5">[6]</ref>) or learned dynamics models ( <ref type="bibr" target="#b6">[7]</ref>).</p><p>The transfer of policies has been discussed by <ref type="bibr" target="#b7">[8]</ref>. They propose extending the exploration-exploitation choice with the option to reuse an older policy and consequently test the transfer performance. 
Reward Shaping (as presented in <ref type="bibr" target="#b8">[9]</ref>) speeds up the RL training process by guiding exploration through a potential-based transformation of the reward function.</p><p>Transfer by starting from prior distributions has been explored by <ref type="bibr" target="#b9">[10]</ref>. Instead of finding trajectories that maximize expected rewards, inference formulations start from a prior distribution over trajectories, condition on the desired outcome, such as reaching a goal state, and then estimate the posterior distribution over trajectories consistent with this outcome. Since imitation learning provides a teacher policy to learn from, the teacher policy can be interpreted as a prior policy distribution.</p></div>
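The measurement categories above can be turned into simple numeric metrics over per-iteration reward curves. A minimal sketch, with our own illustrative helper and the assumption that both learners log equal-length reward curves:

```python
import numpy as np


def transfer_metrics(r_transfer, r_scratch, perf_threshold):
    """Compare reward curves of a transfer learner and a non-transfer
    learner along the measurement categories of Taylor and Stone."""
    r_t = np.asarray(r_transfer, float)
    r_s = np.asarray(r_scratch, float)

    def time_to_threshold(r):
        # First iteration at which the pre-specified performance level is met.
        hits = np.nonzero(r >= perf_threshold)[0]
        return int(hits[0]) if hits.size else None

    t_t, t_s = time_to_threshold(r_t), time_to_threshold(r_s)
    return {
        "jumpstart": r_t[0] - r_s[0],          # initial-policy improvement
        "asymptotic_gain": r_t[-1] - r_s[-1],  # final-policy improvement
        "total_reward_gain": r_t.sum() - r_s.sum(),
        "total_reward_ratio": r_t.sum() / r_s.sum(),
        "learning_time_reduction": None if None in (t_t, t_s) else t_s - t_t,
    }
```

Each dictionary entry corresponds to one bullet in the list above; `learning_time_reduction` is `None` when either learner never reaches the threshold.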
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Using Free-Energy Priors to improve Robustness after Policy Transfer</head><p>In real-world applications, such as robotics, it can be hard to separate signal from noise, especially in the early stages after deploying a learned strategy. We consider a scenario where there is a cost to receiving state data from an actor, e.g., sensor data from a robot's joints. Since we are considering the case of a Sim-to-Real transfer, we assume the existence of priors learned from simulation interactions. In this section, we propose the use of an advantage function over the simulation priors, based on the free-energy principle, to improve the agent's robustness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Free-Energy Functions</head><p>Free-energy functions are fundamental concepts in thermodynamics and statistical mechanics that describe the energy available to do work in a system while accounting for both its internal energy and its entropy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Quantifying the Cost of Control</head><p>Rubin et al. <ref type="bibr" target="#b10">[11]</ref> borrow the term to define free-energy functions in the RL context, deriving optimal policies and exploring the tradeoff between value and control information. The idea is that optimal policies reflect a balance between maximizing expected rewards (value) and minimizing the information cost that comes with control.</p><p>With the help of information theory, we can quantify the expected cost of executing a policy 𝜋 in state 𝑠 ∈ 𝑆 as Δ𝐼 (𝑠) = ∑ 𝑎 𝜋 T 𝑠 (𝑎) log(𝜋 T 𝑠 (𝑎)/𝜋 S 𝑠 (𝑎)) with Δ𝐼 (𝑠 𝑇 ) = 0 for a terminal state 𝑠 𝑇 . With this, we measure the relative entropy between the source policy 𝜋 S and the target policy 𝜋 T . The source policy is used by the agent in the absence of information from its new, noisy environment. For any state 𝑠, Δ𝐼 (𝑠) describes the minimal expected number of extra bits required to describe an action sampled from 𝑎 ∼ 𝜋 T when coding with respect to 𝜋 S . In our case, it serves as a measure of the cost of control. Similar to the value function 𝑉 𝜋 (𝑠 0 ), we can define the total control information involved in executing policy 𝜋 starting from the initial state 𝑠 0 :</p><formula xml:id="formula_2">𝐼 𝜋 (𝑠 0 ) = lim 𝑇 →∞ 𝔼[ 𝑇 −1 ∑ 𝑡=0 Δ𝐼 (𝑠 𝑡 )] = lim 𝑇 →∞ 𝔼[ 𝑇 −1 ∑ 𝑡=0 log 𝜋 T 𝑠 𝑡 (𝑎 𝑡 ) 𝜋 S 𝑠 𝑡 (𝑎 𝑡 ) ] = lim 𝑇 →∞ 𝔼[ log 𝑃𝑟(𝑎 0 , 𝑎 1 , … , 𝑎 𝑇 −1 |𝑠 0 , 𝜋 T ) 𝑃𝑟(𝑎 0 , 𝑎 1 , … , 𝑎 𝑇 −1 |𝑠 0 , 𝜋 S ) ]<label>(1)</label></formula><p>Here, the optimal target policy 𝜋 * ,T should minimize the control information cost while maximizing the reward and respecting environmental constraints.</p></div>
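The per-state cost of control Δ𝐼(s) is simply the KL divergence between the target and source action distributions at s. A sketch for discrete action spaces (the helper name is ours):

```python
import numpy as np


def control_information(pi_target_s, pi_source_s):
    """Delta-I(s): relative entropy (in nats) between the target policy
    pi^T_s and the source/prior policy pi^S_s at a single state s."""
    pt = np.asarray(pi_target_s, float)
    ps = np.asarray(pi_source_s, float)
    mask = pt > 0.0                 # convention: 0 * log(0/q) = 0
    return float(np.sum(pt[mask] * np.log(pt[mask] / ps[mask])))
```

When the transferred agent simply replays the source policy (𝜋 T = 𝜋 S), the cost is zero; it grows as the target policy deviates from the prior.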
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Optimization with constrained Policies</head><p>In safety-critical domains, RL optimization problems are typically subject to constraints. For a distance measure 𝑑 ∶ Π × Π → ℝ and step size 𝛿, trust-region policy optimization algorithms ensure that the new policy lies within a so-called trust region of the previous one: 𝜋 𝑡+1 = arg max </p><p>where</p><formula xml:id="formula_4">𝐷 𝐾 𝐿 (𝜋||𝜋 𝑘 ) = 𝔼 𝑠∼𝑑 𝜋 𝑘 [𝐷 𝐾 𝐿 (𝜋‖𝜋 𝑘 )[𝑠]]</formula><p>and 𝛿 &gt; 0 is the step size.</p><p>The advantage function measures the expected gain in return from taking action 𝑎 in state 𝑠 and is given by:</p><formula xml:id="formula_5">𝐴 𝜋 (𝑠, 𝑎) = 𝑄 𝜋 (𝑠, 𝑎) − 𝑉 𝜋 (𝑠) = 𝔼 𝜏 ∼𝜋 [𝑅(𝜏 )|𝑠 0 = 𝑠, 𝑎 0 = 𝑎] − 𝔼 𝜏 ∼𝜋 [𝑅(𝜏 )|𝑠 0 = 𝑠]<label>(3)</label></formula><p>The trust region is then defined by the set {𝜋 𝜃 ∈ Π 𝜃 ∶ 𝐷 𝐾 𝐿 (𝜋‖𝜋 𝑘 ) ≤ 𝛿}.</p></div>
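The advantage in Eq. (3) can be estimated directly from sampled returns; a minimal Monte-Carlo sketch (illustrative only, not the estimator used by CPO implementations, which typically rely on learned value functions):

```python
import numpy as np


def mc_advantage(returns_from_sa, returns_from_s):
    """A(s, a) = Q(s, a) - V(s), estimated as the mean return of
    trajectories starting with (s, a) minus the mean return of
    trajectories starting from s under the same policy."""
    return float(np.mean(returns_from_sa) - np.mean(returns_from_s))
```

A positive value indicates that committing to action 𝑎 in state 𝑠 is better than acting according to the policy's average behavior.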
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CPO solves the CMDP problem approximately by calculating the update</head><formula xml:id="formula_6">𝜋 𝑘+1 = arg max 𝜋∈Π 𝜃 𝔼 𝑠∼𝑑 𝜋 𝑘 ,𝑎∼𝜋 [𝐴 𝜋 𝑘 (𝑠, 𝑎)] s.t. 𝐽 𝐶 𝑖 (𝜋 𝑘 ) + 𝔼 𝑠∼𝑑 𝜋 𝑘 ,𝑎∼𝜋 [ 𝐴 𝜋 𝑘 𝐶 𝑖 (𝑠, 𝑎) 1 − 𝛾 ] ≤ 𝑐 𝑖 𝐷 𝐾 𝐿 (𝜋‖𝜋 𝑘 ) ≤ 𝛿<label>(4)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Using Free-Energy Functions to improve Transferability</head><p>We aim to use a free-energy function to derive optimal policies while balancing the tradeoff between value and information during exploration. Early works ( <ref type="bibr" target="#b13">[14]</ref>) propose using advantage functions in noisy environments to mitigate undesired approximation effects by reducing the action gap ( <ref type="bibr" target="#b14">[15]</ref>). We assume a stochastic prior policy 𝜋 S (𝑎|𝑠) from the source task. Fox et al. ( <ref type="bibr" target="#b15">[16]</ref>) propose measuring the information cost of a policy 𝜋 T (𝑎|𝑠) with</p><formula xml:id="formula_7">𝑔 𝜋 T (𝑠, 𝑎) = log 𝜋 T (𝑎|𝑠) 𝜋 S (𝑎|𝑠) .</formula><p>The expected information cost of the target policy 𝜋 T can be written as 𝔼[𝑔 𝜋 T (𝑠 𝑡 , 𝑎 𝑡 )] = 𝐷 𝐾 𝐿 (𝜋 T 𝑠 ‖𝜋 S 𝑠 ). Considering the dynamics induced by the transition probabilities 𝑇 (𝑠 𝑡+1 |𝑠 𝑡 , 𝑎 𝑡 ) of the underlying MDP, we can now consider the total discounted expected information cost of the target policy:</p><formula xml:id="formula_8">𝐼 𝜋 T (𝑠) = ∞ ∑ 𝑡=0 𝛾 𝑡 𝐷 𝐾 𝐿 (𝜋 T 𝑠 𝑡 ‖𝜋 S 𝑠 𝑡 ).<label>(5)</label></formula><p>We define</p><formula xml:id="formula_9">𝐹 𝜋 T (𝑠) = 𝑉 𝜋 T (𝑠) + 1 𝛽 𝐼 𝜋 T (𝑠)<label>(6)</label></formula><p>as a 𝛽-weighted free-energy function, with 𝛽 controlling the tradeoff between value and information. 
From this we get a state-action free-energy function</p><formula xml:id="formula_10">𝐺 𝜋 T (𝑠, 𝑎) = 𝔼 𝜃 [𝑅|𝑠, 𝑎] + 𝛾 𝔼 𝑇 [𝐹 𝜋 T (𝑠 ′ )|𝑠, 𝑎].<label>(7)</label></formula><p>Now, we define the free-energy advantage function as:</p><formula xml:id="formula_11">𝐵 𝜋 T (𝑠, 𝑎) = 𝐺 𝜋 T (𝑠, 𝑎) − 𝑉 𝜋 T (𝑠) = 𝔼 𝜏 ∼𝜋 T [𝐶(𝜏 ) + (𝛾 /𝛽) 𝑔 𝜋 T (𝑠 𝑡+1 , 𝑎 𝑡+1 )|𝑠 0 = 𝑠, 𝑎 0 = 𝑎] − 𝔼 𝜏 ∼𝜋 T [𝐶(𝜏 )|𝑠 0 = 𝑠]<label>(8)</label></formula><p>Here, 𝐶(𝜏 ) denotes the cumulative sum of constraint costs along the trajectory 𝜏.</p><p>Finally, we can calculate the free-energy advantage transfer policy update:</p><formula xml:id="formula_12">𝜋 𝑘+1 = arg max 𝜋∈Π 𝜃 𝔼 𝑠∼𝑑 𝜋 𝑘 ,𝑎∼𝜋 [𝐵 𝜋 𝑘 (𝑠, 𝑎)] s.t. 𝐽 𝐶 𝑖 (𝜋 𝑘 ) + 𝔼 𝑠∼𝑑 𝜋 𝑘 ,𝑎∼𝜋 [ 𝐵 𝜋 𝑘 𝐶 𝑖 (𝑠, 𝑎) 1 − 𝛾 ] ≤ 𝑐 𝑖 𝐷 𝐾 𝐿 (𝜋‖𝜋 𝑘 ) ≤ 𝛿<label>(9)</label></formula></div>
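To make the construction concrete, the free-energy advantage of Eq. (8) augments a sampled constraint advantage with the β-weighted information cost g of the next action. A sketch under the assumption that both quantities have already been estimated from Monte-Carlo rollouts (the function names are our own):

```python
import math


def information_cost(pi_t_prob, pi_s_prob):
    """g(s, a) = log( pi^T(a|s) / pi^S(a|s) ) for a single sampled action,
    given its probability under the target and source policies."""
    return math.log(pi_t_prob / pi_s_prob)


def free_energy_advantage(constraint_adv, g_next, gamma, beta):
    """B(s, a): the constraint advantage plus the discounted, beta-weighted
    information cost of the next sampled action (cf. Eq. 8)."""
    return constraint_adv + (gamma / beta) * g_next
```

Larger β discounts the information term and recovers the ordinary constraint advantage; smaller β penalizes deviation from the simulation prior more strongly.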
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>In this section, we will present the evaluation framework, metrics and results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Experiments</head><p>In this section, we first evaluate the performance of the Constrained Policy Optimization (CPO) algorithm <ref type="bibr" target="#b16">[17]</ref> for constrained RL problems. CPO yields better performance on constrained tasks than methods such as Trust Region Policy Optimization or Primal-Dual Optimization ( <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b17">18]</ref>). We conduct the experiments on an exemplary robot learning task, the HalfCheetah environment within the MuJoCo<ref type="foot" target="#foot_0">1</ref> physics engine embedded in OpenAI Gym<ref type="foot" target="#foot_1">2</ref> . The HalfCheetah is a two-dimensional simulated robot with six controllable joints, as depicted in figure <ref type="figure" target="#fig_0">1</ref>. We use a continuous action space 𝐴 = [−1, 1] 6 , where each entry of the action vector represents the torque [Nm] applied to the respective motorized joint. The constraint is placed on an angle beyond which the HalfCheetah is considered to have fallen over and unable to recover to a standing position without external help.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Evaluating Transferability for Safety-Critical Applications</head><p>For safety-critical applications at any scale, the most direct improvement from TL would generally be starting from accurate prior distributions, because we can expect a reduced exploratory period. While this is expected to reduce training time, prevention of constraint violations is not necessarily guaranteed. Reliable algorithms should also make it possible to train an agent in simulation and then transfer the model to a safety-critical application in the real world without violating the constraints imposed by the task. We therefore extend the list with the following measurements:</p><p>• The ratio of the total constraint cost accumulated by the transfer learner to that accumulated by the non-transfer learner, or between different transfer learners. • The number of constraint violations above a specified threshold committed by the transfer learner compared to the non-transfer learner (or between multiple transfer learners).</p><p>Note that we hypothesize that measuring the robustness gained by simultaneously learning system dynamics ( <ref type="bibr" target="#b18">[19]</ref>) could be a valid metric, which we intend to examine in the future.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Evaluation</head><p>We hence compare the CPO algorithm with and without free-energy advantage policy transfer (FEAT) in noisy environments with a noise factor 𝑈 𝑗 ∼ 𝒩 (1, 𝜎 ) for every state variable index 𝑗 ∈ {1, … , |𝑠|}, evaluating the post-transfer performance according to the previously proposed criteria. In all experiments, we first pre-train an agent with an implementation of the CPO algorithm in a simulated environment without noise for 2500 iterations. After the final iteration, the agent is able to control the HalfCheetah at a satisfactory level.</p></div>
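The noisy target environment can be sketched as a thin observation transform that multiplies each state variable by 𝑈_j ∼ 𝒩(1, σ); the helper below is our own illustration of the setup, not the code used in the experiments:

```python
import numpy as np


def noisy_observation(state, sigma, rng):
    """Multiply each state variable s_j by an independent noise factor
    U_j ~ N(1, sigma), mimicking the noisy target environment."""
    state = np.asarray(state, float)
    return state * rng.normal(loc=1.0, scale=sigma, size=state.shape)
```

Wrapping the environment's `step` output with this transform turns the clean simulation into the noisy target domain used for evaluation.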
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1.">Comparison of ratios of total constraint costs</head><p>Figure <ref type="figure" target="#fig_1">2</ref> shows the mean constraint costs over a post-transfer training process of 𝑇 = 1000 iterations. Our approach, CPO+FEAT (orange), manages to stay below the curve of the baseline approach, CPO (green). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2.">Comparison of the sum of constraint violations</head><p>For the criterion of constraint violations, we define a constraint threshold 𝑐. As above, we train the agents for a total of 𝑇 = 1000 iterations. In a noisy environment with 𝜎 = 0.1, we evaluate both agents with a strict safety threshold of 𝑐 = 0.02; this value of 𝑐 means that the HalfCheetah is not allowed to show any signs of falling over. While CPO without FEAT violates the threshold 7.2% of the time, CPO with FEAT does so only 3.5% of the time.</p><p>For 𝜎 = 0.2, we chose a higher threshold of 𝑐 = 0.15 (the agent is allowed to appear unstable, but not to fall over). CPO without FEAT violates the threshold in 86.7% of iterations, while CPO with FEAT is significantly lower at 32.3%. Unfortunately, both algorithms still lack the robustness needed to guarantee safety in environments with higher noise levels.</p></div>
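The violation percentages reported above correspond to a simple rate over training iterations; for clarity, an illustrative helper:

```python
def violation_rate(iteration_costs, c):
    """Fraction of training iterations whose constraint cost exceeds
    the safety threshold c."""
    return sum(cost > c for cost in iteration_costs) / len(iteration_costs)
```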
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion and Future Work</head><p>In this paper, we highlighted how Transfer Learning can be interpreted in the context of constrained Reinforcement Learning and proposed a way to evaluate transferability. The experiments indicate that our approach improves the transferability of policies for constrained problems in the specific case of the Constrained Policy Optimization algorithm.</p><p>In the future, we aim to investigate how this approach applies to similar policy-based RL algorithms and to extend it to a more general setting. Furthermore, to reflect real-world problems more accurately, we plan to add further restrictions to the actor's perception of the environment, such as partial observability.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: A rendering of the MuJoCo HalfCheetah environment in its initial state. Its controllable joints are highlighted in red.</figDesc><graphic coords="7,121.46,84.19,349.88,292.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: A comparison of mean constraint costs over 𝑇 = 1000 iterations between CPO (green) and CPO with FEAT (orange) in a noisy environment with 𝜎 = 0.1.</figDesc><graphic coords="8,117.00,199.67,358.80,152.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>𝐽 𝐶 (𝜋) ≤ 𝑐 and 𝑑(𝜋, 𝜋 𝑡 ) ≤ 𝛿. Here, Π 𝜃 ⊂ Π denotes a 𝜃-parameterized policy subset that filters for relevant parameters. Trust region algorithms for reinforcement learning (<ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>, such as CPO, have policy updates of the form𝜋 𝑘+1 = arg max 𝜋∈Π 𝜃 𝔼 𝑠∼𝑑 𝜋 𝑘 ,𝑎∼𝜋 [𝐴 𝜋 𝑘 (𝑠, 𝑎)],s.t. 𝐷 𝐾 𝐿 (𝜋‖𝜋 𝑘 ) ≤ 𝛿</figDesc><table><row><cell>𝜋∈Π 𝜃</cell><cell>𝐽 (𝜋) s.t.</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/openai/mujoco-py</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/openai/gym</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research has been funded by the Federal Ministry of Education and Research of Germany and the state of North-Rhine Westphalia as part of the Lamarr-Institute for Machine Learning and Artificial Intelligence.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Robust constrained model predictive control using linear matrix inequalities</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">V</forename><surname>Kothare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Balakrishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Morari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Automatica</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="1361" to="1379" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A comprehensive survey on transfer learning</title>
		<author>
			<persName><forename type="first">F</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Xi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the IEEE</title>
		<imprint>
			<biblScope unit="volume">109</biblScope>
			<biblScope unit="page" from="43" to="76" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Transfer learning for reinforcement learning domains: A survey</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stone</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2009.07888</idno>
		<title level="m">Transfer learning in deep reinforcement learning: A survey</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Ogishima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Karino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kuniyoshi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.11811</idno>
		<title level="m">Reinforced imitation learning by free energy principle</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Off-policy deep reinforcement learning without exploration</title>
		<author>
			<persName><forename type="first">S</forename><surname>Fujimoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Meger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Precup</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="2052" to="2062" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Morel: Model-based offline reinforcement learning</title>
		<author>
			<persName><forename type="first">R</forename><surname>Kidambi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rajeswaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Netrapalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Joachims</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="21810" to="21823" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Probabilistic policy reuse for inter-task transfer learning</title>
		<author>
			<persName><forename type="first">F</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>García</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Veloso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Robotics and Autonomous Systems</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="866" to="871" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Policy transfer using reward shaping</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brys</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Harutyunyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nowé</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAMAS</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="181" to="188" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Abdolmaleki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Springenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tassa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Munos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Heess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riedmiller</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1806.06920</idno>
		<title level="m">Maximum a posteriori policy optimisation</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Trading value and information in MDPs</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rubin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Shamir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tishby</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Decision Making with Imperfect Decision Makers</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="57" to="74" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Trust region policy optimization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jordan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Moritz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1889" to="1897" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Moritz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jordan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1506.02438</idno>
		<title level="m">High-dimensional continuous control using generalized advantage estimation</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Reinforcement learning in continuous time: Advantage updating</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Baird</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN&apos;94)</title>
				<meeting>1994 IEEE International Conference on Neural Networks (ICNN&apos;94)</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="1994">1994</date>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="2448" to="2453" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Increasing the action gap: New operators for reinforcement learning</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Bellemare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ostrovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Munos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">30</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pakman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tishby</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1512.08562</idno>
		<title level="m">Taming the noise in reinforcement learning via soft updates</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Constrained policy optimization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Achiam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Held</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tamar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="22" to="31" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Risk-constrained reinforcement learning with percentile risk criteria</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghavamzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Janson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pavone</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="6070" to="6120" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Mixed strategies for robust optimization of unknown objectives</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Sessa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bogunovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kamgarpour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Artificial Intelligence and Statistics</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2970" to="2980" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
