<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Online, Worldwide,
September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Recommending safe actions by learning from sub-optimal demonstrations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lars Boecking</string-name>
          <email>boecking@fzi.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Philipp</string-name>
          <email>philipp@fzi.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FZI Research Center for Information Technology</institution>
          ,
          <addr-line>Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>26</volume>
      <issue>2020</issue>
      <abstract>
        <p>Clinical pathways describe the treatment procedure for a patient from a medical point of view. Based on the patient's condition, a decision is made about the next actions to be carried out. Such recurring sequential process decisions could well be outsourced to a reinforcement learning agent, but the patient's safety should always be the main consideration when suggesting activities. The development of individual pathways is also cost- and time-intensive; a smart agent could therefore support and relieve physicians. In addition, not every patient reacts in the same way to a clinical intervention, so the personalization of a clinical pathway deserves attention. In this paper we address the fundamental problem that reinforcement learning agents used in the specification of clinical pathways should provide an individually optimal proposal within the limits of safety constraints. Imitating the decisions of physicians can guarantee safety but not optimality. Therefore, we present an approach that ensures compliance with health-critical rules without limiting the exploration of the optimum. We evaluate our approach on an open-source gym environment, where we are able to show that our adaptation of behavior cloning not only adheres better to safety regulations, but also manages to better explore the space of the optimum in terms of collected rewards.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Applied computing → Health care information systems.</p>
      <p>HealthRecSys’20, September 26th, 2020, Online, Worldwide
© 2020 Copyright for the individual papers remains with the authors. Use permitted
under Creative Commons License Attribution 4.0 International (CC BY 4.0). This
volume is published and copyrighted by its editors.</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        Our work focuses on the use of reinforcement learning to optimize
and personalize clinical pathways, as illustrated in Figure 1. A
rehabilitation procedure, called a "clinical pathway", describes in detail
which activities are to be carried out for a patient within a course
of treatment [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>The process of creating a clinical pathway tailored to an
individual patient spans several stages. To adapt a clinical pathway to
a patient’s needs, one starts from a disease-specific blueprint and
later incorporates the patient’s clinical picture as well as his or her
individual preferences.</p>
      <p>On an abstract level, the adaptation of a pathway can be modelled
as a decision process. A number of activities must be decided upon,
which in turn have interdependent effects on one another.
Feedback on the effectiveness of the decisions made is often only given
with a delay or in aggregated form, for example during a control
visit to the doctor after a certain time. Reinforcement learning (RL) is
about optimizing processes that can be described as a feedback
control loop. RL is therefore particularly well suited to the individualization of clinical
pathways.</p>
      <p>The personalization of a clinical pathway is about
identifying the optimal combination of activities and treatments in
rehabilitation for an individual patient. In this context, optimality can
be considered from different viewpoints. On the one hand, we see
the fundamental objective of proposing rehabilitation measures
that are safe from a medical perspective. On the other hand, we
aim to support the recovery process in the best possible way by
exploring alternative rehabilitation activities. While there are generic
templates for different medical diagnoses that are safe, there is
a need to go beyond them and adapt the clinical pathway in order to provide
tailored care plans for individual patients.</p>
      <p>In order to address the objectives described above for clinical
path recommender systems, we present a safety-aware
reinforcement learning approach. On a conceptual level this means that
we have a state $s_t$ of our patient and our agent proposes an action
$a_t$, a rehabilitation measure, for our patient at time $t$ (Figure 1).</p>
      <p>
        The agent receives a reward $r_{t+1}$ based on the change in the
condition of our patient, $s_{t+1}$. While classical RL is based on trial and error,
the healthcare application must guarantee the safety of the
patient during the proposed activity. Imitation learning is one of
the ways in which this is pursued. Here the agent is trained
to "imitate" an expert’s actions, i.e., to suggest a treatment
activity similar to the one a doctor would choose when faced with the same
patient profile. Current work in imitation learning [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] focuses on
efficiently learning from demonstrations while paying no special
attention to safety or exploration. Research that did identify
safety as an objective in imitation learning [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] based its concept on staying
as close as possible to the examples shown. However, it is by no
means guaranteed that the doctor’s suggestion is optimal for the
rehabilitation of the individual patient.
      </p>
      <sec id="sec-2-1">
        <title>Challenges:</title>
        <p>• How can we emphasize the importance of safety to a reinforcement learning
agent that suggests rehabilitation treatments?</p>
        <p>• How can an agent explore the individual optimum and still
remain within a safe and medically acceptable action space?</p>
        <p>In answering these questions, this study makes the following contributions:</p>
        <p>• a conceptual approach to extract safety-relevant behavior
from expert demonstrations</p>
        <p>• an adapted conceptual method for imitation learning that
emphasizes safety-critical thinking</p>
        <p>• implementation, application and preliminary evaluation of
the concepts</p>
        <p>Paper outline: After positioning our work within the related
work in section 2, we introduce the conceptual background of our
approach in section 3.1. In section 3.2 we present the novel
concepts of our approach, and in section 3.3 we focus on their explicit
application to optimizing clinical pathways. In section 4 we outline
our evaluation method, and we discuss the results achieved in section
5. Our work is completed by a conclusion (section 6) and an
outlook on future work in section 7.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>Our work covers various areas of health care and machine learning,
which we would like to examine in greater detail.</p>
      <p>
        Research has shown an increasing interest in applying machine
learning techniques to health care related tasks. From
modelling disease progression [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to automated clinical prognostics
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], methods of artificial intelligence have been shown to be promising
approaches. In further applications, algorithms are used to annotate
medical images and support doctors' decision-making in a
human-ML collaborative way [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Overall, decisive questions are emerging
for the use of machine learning in the health sector. The decision
of a system must be validated and made comprehensible. Only if
the physician can be sure that the outcome of a machine learning
algorithm is understandable and, above all, guarantees the safety
of the patient, can systems prevail in the long term[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Pathway treatment: Bica et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] introduced Counterfactual Recurrent
Networks to estimate treatment effects by modelling the
time-dependent impact of treatments on covariates based on the patient's clinical
history. Besides these topic-related areas, various conceptual
fields from machine learning are relevant to our approach.
Imitation learning is about training an agent to mimic the
behaviour of an expert. Approaches such as inverse RL, e.g. GAIL
(Generative Adversarial Imitation Learning), have recently achieved
remarkable success [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Beyond this, we have seen approaches
that attempt to reconstruct the expert’s objective by evaluating the
hypothetical behaviour of an agent [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Further adaptations of imitation
learning approaches are concerned with incorporating examples
of an expert during the active learning process [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, these
approaches have neither been adapted to learn from sub-optimal
examples nor do they emphasise safety-relevant aspects.
Constrained RL: First considerations about setting boundaries to
the exploration of a reinforcement learning agent go back to the
year 2000 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Recent work applied constraints in the form of predefined
threshold values in continuous action spaces by adding a safety
layer that corrects the suggestion of the
policy network in case of a constraint violation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Unlike our approach, these concepts are
based on predefined limits that are not deduced from examples
and do not learn from experts.
Safety RL: We have seen approaches that measure the similarity
between the novice's and the expert's choice of action to prevent the
agent from suggesting unsafe actions by considering the state
distribution [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] or disagreement between multiple agents [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Follow-up
research considered the quantification of policy uncertainty
to model the risk of exploration [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Lee et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] proposed
end-to-end imitation learning, where safety is addressed by evaluating
the uncertainty of a Bayesian convolutional network. Yet again, no
approach has been adapted to differentiate existing demonstrations
and adopt safety-relevant behaviour in a targeted manner.
Multi-criteria RL: Laroche et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced Multi-Advisor RL,
where $n$ advisors are each specialized on a sub-task of the problem and
an aggregator is used to derive a global policy based on the
individual recommendations. While the safety of an RL problem can
be described as a multi-criteria problem, the question remains
how the approach described here can guarantee
compliance with safety constraints and foster exploration within these
limits.
      </p>
    </sec>
    <sec id="sec-4">
      <title>OUR APPROACH</title>
      <p>
        Contrary to previous imitation learning techniques, our approach
focuses on avoiding unsafe states while still exploring safe states
to find the optimum. We teach the agent to handle safety-critical
states by imitating expert actions in similar situations. In safe states,
however, the agent does not need to stick exactly to the behavior
observed in the expert demonstrations. In fact, we encourage it to search
for the best possible personalized clinical pathway by exploration. While
current safety RL algorithms [
        <xref ref-type="bibr" rid="ref14 ref19">14, 19</xref>
        ] focus on choosing actions
that converge to the median of the expert demonstrations, which often is
not the optimum, our approach aims at encouraging the agent to
explore the state space while staying inside safe boundaries.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Formal Description</title>
      <p>At each step $t$ the agent selects an action $a_t \in A(s_t)$ based on the
received representation of the environment state $s_t \in S$. In a
health recommender system, actions and states correspond to
recommendations for therapy activities and the patient's
clinical state, respectively. As a consequence of its action, the agent receives a reward $r_{t+1}$, which
quantifies the development of the clinical condition and personal
well-being of the patient, and a new state $s_{t+1}$ of the patient.
$\pi(a|s)$ is the agent's policy, which
assigns a probability to each action in a given state and chooses the
most promising one. The policy is trained during exploration
or, in the case of imitation learning, during the expert observation.
Since the new state serves as the input for the next iteration, the
agent keeps interacting with the environment and creates a
trajectory $\tau = \{(s_t, a_t) \mid t \in [0, h]\}$, where $s_t$ is the state at a given time,
$a_t$ the action, and $t$ covers all time steps from the start time
0 to the time of termination $h$. The trajectory for an individual
patient directly relates to the configured pathway (actions $a_t$ map
to the parameterised treatment activities foreseen in the clinical
pathway) and the observed reaction of the patient (states $s_t$). The
objective function, denoted as $J(\pi)$, is
        <disp-formula><label>(1)</label><tex-math>J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)], \qquad R(\tau) = \sum_{t=0}^{\infty} \gamma^{t} r_{t+1}</tex-math></disp-formula>
where $\gamma \in [0, 1]$ is a discount factor. As we are dealing with the
complex task of adapting clinical pathways, the modelling of several
objectives and constraints gains in importance.</p>
      <p>
        Constraints: Constrained Markov Decision Processes (CMDPs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] limit the set
of policies to a subset $\Pi_C \subset \Pi$ that fulfills a set of constraints $C$ such
that:
        <disp-formula><label>(2)</label><tex-math>\Pi_C = \{\pi : J_{C_i}(\pi) \le d_i \;\; \forall\, i = 1, \dots, k\}</tex-math></disp-formula>
        <disp-formula><label>(3)</label><tex-math>J_{C_i}(\pi) = \mathbb{E}_{\tau \sim \pi}[C_i(\tau)]</tex-math></disp-formula>
$J_{C_i}$ is the estimate of the expected value of a cost function $C_i$
over the space of the trajectories achieved by the policy $\pi$. The
resulting space of allowed policies only includes policies that
do not exceed a defined limit $d_i \in \mathbb{R}$
for any of the defined cost functions.
      </p>
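      <p>As a rough illustration of the objective and the constraint set described above, the following Python sketch computes the discounted return of one finite trajectory and checks whether a list of estimated costs stays within its limits. The function names and any example numbers are assumptions made for illustration only; they are not taken from the implementation evaluated in this paper.</p>
      <preformatted><![CDATA[
```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_(t+1) over one finite trajectory.

    rewards[t] is the reward received after the action taken at step t.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def satisfies_constraints(expected_costs: Sequence[float],
                          limits: Sequence[float]) -> bool:
    """True if every estimated cost stays at or below its limit."""
    if len(expected_costs) != len(limits):
        raise ValueError("one limit is required per cost function")
    return all(cost <= limit for cost, limit in zip(expected_costs, limits))
```
]]></preformatted>
      <p>A policy would belong to the allowed subset only if <monospace>satisfies_constraints</monospace> holds for its estimated costs; how these estimates are obtained is left open in this sketch.</p>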
    </sec>
    <sec id="sec-6">
      <title>Safety Imitation</title>
      <p>Focusing on modelling the safety of a reinforcement learning agent,
we define a safety cost function, for brevity $C_s$, to approximate the safety of a given
state $s_t$. The flexibility of the approach makes it possible
to differentiate safety in several dimensions or to describe it as a
holistic unit. In the case of imitation learning from sub-optimal
but safe demonstrations, we calculate the threshold value $d_s$ over
the distribution of expert trajectories, such that $d_s = \max_{s_t \in \tau_E} C_s(s_t)$
is taken from the values observed in the expert demonstrations $\tau_E$. Evaluating
the received expert trajectories, we can now quantify how critical
the different states were in terms of safety by defining:
        <disp-formula><label>(4)</label><tex-math>\tau_C = \{(s_t, a_t) : s_t \in \tau_E \wedge C_s(s_t) \ge d_s - \epsilon\}</tex-math></disp-formula>
      </p>
      <p>
        By focusing on the subset $\tau_C$ to train our agent, we can ensure
that it knows how to handle critical situations while preserving the
freedom to explore safe states. The collected demonstration data
set is then weighted in such a way that the training data set for
imitation learning consists, to a defined extent, of safety-relevant trajectories ($\tau_C$)
mixed with randomly sampled trajectories from
$\tau_E$. Throughout this paper we refer to this weighting as the safety
focus $\lambda \in [0, 1]$.
        <disp-formula><label>(5)</label><tex-math>\tau_{\mathrm{train}} = \{\lambda * (s_t, a_t) \subset \tau_C \;\cup\; (1 - \lambda) * (s_t, a_t) \subset \tau_E \setminus \tau_C\}</tex-math></disp-formula>
It is essential to highlight that the data set used for the training
of the agent is not extended by additional information such as a
safety factor; rather, a subset of the demonstrations is
deliberately chosen for the training. During imitation learning, the agent
is not told at any time whether the state-action pair currently
presented to it in the context of supervised learning is a safety-relevant
example. The approach solely changes the composition of
the training data set.
      </p>
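      <p>The selection and weighting described in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the evaluation: the names <monospace>split_by_safety</monospace> and <monospace>build_training_set</monospace>, the <monospace>cost_fn</monospace> callback, the fixed training-set size and the sampling with replacement are all hypothetical choices made for illustration.</p>
      <preformatted><![CDATA[
```python
import random
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[object, object]  # (state, action); no safety label is attached


def split_by_safety(demos: Sequence[Pair],
                    cost_fn: Callable[[object], float],
                    epsilon: float) -> Tuple[List[Pair], List[Pair]]:
    """Split demonstrations into safety-critical pairs and the rest.

    The threshold is the largest cost observed in the demonstrations; a pair
    is critical if its state cost lies within epsilon of that threshold.
    """
    if not demos:
        raise ValueError("at least one demonstration is required")
    costs = [cost_fn(state) for state, _ in demos]
    threshold = max(costs)
    critical = [p for p, c in zip(demos, costs) if c >= threshold - epsilon]
    remaining = [p for p, c in zip(demos, costs) if c < threshold - epsilon]
    return critical, remaining


def build_training_set(demos: Sequence[Pair],
                       cost_fn: Callable[[object], float],
                       epsilon: float,
                       safety_focus: float,
                       size: int,
                       seed: int = 0) -> List[Pair]:
    """Mix critical and remaining pairs according to the safety focus."""
    if not 0.0 <= safety_focus <= 1.0:
        raise ValueError("safety_focus must lie in [0, 1]")
    rng = random.Random(seed)
    critical, remaining = split_by_safety(demos, cost_fn, epsilon)
    n_critical = int(round(safety_focus * size))
    n_other = size - n_critical
    if n_other > 0 and not remaining:
        raise ValueError("no non-critical demonstrations to sample from")
    # Assumption: sampling with replacement, so a small pool can still
    # supply the requested share.
    chosen = rng.choices(critical, k=n_critical) if n_critical else []
    chosen += rng.choices(remaining, k=n_other) if n_other else []
    rng.shuffle(chosen)
    return chosen
```
]]></preformatted>
      <p>Note that the returned pairs carry only the state and the action, so a supervised learner trained on them has no way to see which examples were selected as safety-relevant.</p>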
    </sec>
    <sec id="sec-7">
      <title>Implication Health Recommender System</title>
      <p>Our approach makes it possible to learn effectively from demonstrations that
guarantee a safe state of the environment, or of the patient,
but whose actions do not always show the optimal reaction
to the state. Furthermore, we train a reinforcement learning
agent with weighted expert demonstrations, thereby putting
safety or other evaluation criteria in the foreground. Applying the
approach to an agent that suggests
and parameterizes treatment activities in a clinical pathway, we train
the agent to explore the optimal recommendation while imitating
expert recommendations when facing critical states, as described by the
constraint cost functions.</p>
      <p>If the formal description is applied to the healthcare application,
the cost function evaluates the clinical condition of the patient. In
concrete terms, one could, for example, evaluate the deviation of
the measured pulse from the resting or optimal pulse. Looking at
the expert’s demonstrations, i.e. any number of pairs of the patient’s
condition and the proposed therapy measure, one can evaluate
the cost function for each demonstration, i.e. the safety
assessment of the patient’s clinical condition. It is crucial that the
costs are not per se included in the objective function but are used
as restrictions. As a result, an increased heart rate is not interpreted
as negative by our recommender, but we take care in the
decision-making process that this attribute stays within certain
safe limits.</p>
      <p>We are therefore aware that the heart rate deviates during a
therapeutic measure, and that this is one of the undesirable effects.
But we want to make sure that the proposals of our intelligent
system are within the limits of the experts’ opinions. So if we see
in the trajectories that the safety costs stay below a certain level, we
want to make sure that our proposals do not exceed this limit. To
learn this, the demonstrations in which the patient’s condition was
particularly close to the observed limit are particularly relevant.
In our approach we define a subset of trajectories $\tau_C$ that lie within
a defined distance of the critical limit $d_s$. From this subset we
know that it is particularly relevant for learning how to avert critical
states. During the training process, our intelligent system should
accordingly pay special attention to adopting the expert suggestions
close to the critical states.</p>
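      <p>To make the pulse example more tangible, the following Python sketch evaluates a pulse-based cost for each demonstration, derives the limit observed among the experts, and marks the demonstrations that lie close to that limit. It is only a sketch under assumptions: the function names, the use of the absolute deviation from the resting pulse and the <monospace>margin</monospace> parameter are hypothetical and not part of the evaluated system.</p>
      <preformatted><![CDATA[
```python
from typing import List, Sequence


def pulse_cost(measured_bpm: float, resting_bpm: float) -> float:
    """Safety cost: absolute deviation of the measured pulse from rest."""
    return abs(measured_bpm - resting_bpm)


def observed_limit(demonstrated_bpm: Sequence[float],
                   resting_bpm: float) -> float:
    """Largest cost seen in the expert demonstrations, used as the limit."""
    if not demonstrated_bpm:
        raise ValueError("at least one demonstration is required")
    return max(pulse_cost(bpm, resting_bpm) for bpm in demonstrated_bpm)


def near_limit(demonstrated_bpm: Sequence[float],
               resting_bpm: float,
               margin: float) -> List[int]:
    """Indices of demonstrations whose cost lies within margin of the limit."""
    limit = observed_limit(demonstrated_bpm, resting_bpm)
    return [i for i, bpm in enumerate(demonstrated_bpm)
            if pulse_cost(bpm, resting_bpm) >= limit - margin]
```
]]></preformatted>
      <p>The cost is used here only to select demonstrations and to bound proposals; it is not added to the reward, which mirrors the distinction between objective and restriction made above.</p>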
    </sec>
    <sec id="sec-8">
      <title>EVALUATION</title>
      <p>Although the concept presented here was developed out of the
motivation to individualize clinical pathways for patients, it can
be applied to various applications of reinforcement learning. For
this reason, and because clinical data was not available to the
extent necessary for an analysis, the evaluation is based on a common
and comparable safety problem from the field. We use the gym
environment provided by OpenAI.</p>
    </sec>
    <sec id="sec-9">
      <title>Gym Environment</title>
      <p>The gym environment offers the possibility to run different task
scenarios for reinforcement learning agents and to extend the provided
framework. Atari games and two-dimensional games such as car
racing are especially popular and provide an excellent baseline for
comparing results. Because of the parallels between the car racing environment
and a recommender in the health care sector, the car racing
environment is particularly suitable for demonstrating the functionality of
our approach. The car on the race track represents the condition of
the patient, which changes depending on the action: steering and
accelerating, or the parameterization of the next treatment measure.
The more critically the condition of the patient (the position of the
vehicle on the track) is evaluated, the more relevant it is to behave
similarly to the expert demonstrations. While we have described the
relevance of heart rate in the clinical environment above, safety
in this environment can be quantified with a cost function based
on the distance to the edge of the track. So while in the medical
case we can observe how a doctor behaves when the heart rate is
particularly high or exceptionally low, in this environment we can
quantify how far the vehicle is from the edge of the track.</p>
      <p>To learn how to deal with critical conditions, we then look at
the demonstrations where the assessment of the condition was
particularly critical, as described in the formal description. The
subset used for imitation learning is then selected from those according to
Equation 5.</p>
    </sec>
    <sec id="sec-10">
      <title>Experiment Set Up</title>
      <p>In the following we describe the dimensions and parameters
used for our evaluation in more detail.</p>
      <p>Demonstrations: All
experiments were carried out on the same demonstration data set of size
$|\tau_E| = 4692$ pairs $(s_t, a_t)$; for further detail see Appendix A.
Imitation learning was performed as supervised learning of a
TensorFlow model with the same architecture for every experiment. The
agent was trained for 2000 ℎ, pairs of $(s_t, a_t)$ respectively.</p>
      <p>Cost function: In our evaluation we consider three cost functions
that quantify the car's position in the environment. Since the car's
state represents a patient's clinical state, these can be seen as three
different clinical parameters that are monitored during expert
training. The cost functions quantify the car's position by
evaluating the game frame received as a state representation. In the
three directions left, front and right we calculate the distance to
the unsafe state, i.e. the green area beside the road. Evaluating these three
cost functions for each state observed during the demonstration,
we develop a representation of the state's safety. The parameter $\epsilon$,
which indicates how early a state should be classified as
safety-relevant, is set to $\epsilon = 5$. To calculate the safety-critical subset, we move from the edge of the
distribution of expert examples in the dimension of a constraint, in
this case the safety, towards the centre of the distribution. Visually, this
parameter defines how wide the edge of the distribution is that
is classified as safety-critical, as shown in Figure 3.</p>
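      <p>A distance-based cost in three directions could be sketched as follows. The sketch makes simplifying assumptions that are not taken from the experiments: the game frame is replaced by a hypothetical character grid in which <monospace>R</monospace> marks road and <monospace>G</monospace> marks the green area, the car position is given as a (row, column) index, and "front" is taken to be the upward direction in the grid.</p>
      <preformatted><![CDATA[
```python
from typing import Dict, Sequence, Tuple

# Row/column steps for the three directions named in the text.
DIRECTIONS: Dict[str, Tuple[int, int]] = {
    "left": (0, -1),
    "front": (-1, 0),
    "right": (0, 1),
}


def distance_to_grass(grid: Sequence[str],
                      car: Tuple[int, int],
                      step: Tuple[int, int]) -> int:
    """Count road cells between the car and the first grass cell.

    The frame border is treated like grass, so the walk always terminates.
    """
    row, col = car
    d_row, d_col = step
    distance = 0
    while True:
        row += d_row
        col += d_col
        outside = (row < 0 or row >= len(grid)
                   or col < 0 or col >= len(grid[row]))
        if outside or grid[row][col] == "G":
            return distance
        distance += 1


def safety_costs(grid: Sequence[str], car: Tuple[int, int]) -> Dict[str, int]:
    """Distance to the unsafe area for each of the three directions."""
    return {name: distance_to_grass(grid, car, step)
            for name, step in DIRECTIONS.items()}
```
]]></preformatted>
      <p>Evaluating such a function for every demonstrated state would yield one value per direction, which is the kind of per-state safety representation the selection of critical demonstrations relies on.</p>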
      <p>Agent testing: After the weights of the reinforcement learning
agent have been trained via imitation learning, the agent is evaluated in a
newly generated gym environment. Here we observe the agent for
two whole episodes to collect information about its performance
and its safety. Depending on the individual performance of the
agent, this corresponds to ≈ 2000 state-action pairs.</p>
    </sec>
    <sec id="sec-11">
      <title>PRELIMINARY RESULTS</title>
      <p>In the following we present the preliminary results of
applying our approach to the safety-critical decision process described
in Section 4.2. Different values for the safety focus $\lambda$ in Equation 5 have shown a significant
influence on the performance of the agent with respect to safety
as well as reward, as shown in Table 1.</p>
      <p>The results show that the safety focus has a significant impact
on the agent’s performance. The agent trained with the unweighted
expert demonstrations achieves an average safety rating of 13.05
for its proposals, with a notable variation in safety of 16.41 over the ≈ 2000
state-action pairs. The approach of pre-selecting
and weighting the demonstrations based on the distribution of the
cost function shows a positive impact. The safety evaluation of
the conditions caused by the agent can be raised to a level of 17.3 by
a weighting of $\lambda = 0.1$, and a weighting of $\lambda = 0.5$ achieves
a value of 21.37. In addition, the weighting of the expert trajectories
in these cases also leads to a more robust reinforcement learning
agent, which is reflected in the standard deviation of safety.</p>
      <p>To make the results presented more comprehensible, Figure 4
provides a visualization. Training the agent with different safety
focuses $\lambda$ results in the safety and reward shown on the y-axis, with
the standard deviation represented by the dot size.</p>
      <p>A closer comparison of the cost function values for
two agents, one trained without safety focus (Figure 5) and one trained
with a safety focus $\lambda = 0.8$ (Figure 6), shows that emphasising the safety-critical
trajectories $\tau_C$ in the expert demonstrations can significantly
raise the safety of the actions recommended by the agent. While
the performance of the agents is already reflected in the values
listed in Table 1, the reasons for this can be identified in Figures 5
and 6.</p>
      <p>The runs without safety focus were not able to keep a sufficient
distance from the critical states, while the runs with safety focus successfully
learned to avert critical states in the manner of the expert's reactions.</p>
      <p>While the agent without safety focus was not able to learn the
correct handling of safety-critical conditions during imitation
learning, our approach was successful in adopting the expert’s handling
of critical states. By pre-selecting the expert examples without
providing any further information during the training process, the
agent with safety focus was able to avert safety-critical conditions
in a way similar to the expert’s behaviour.</p>
      <p>The motivation for this work is derived from the medical context,
in which the objective is to adapt clinical pathways to a patient’s
needs in the best possible way. While this scenario can be aptly
described as a reinforcement learning problem, as discussed in the
introduction, it is important to limit the exploration, and thus the
parameterisation of therapies and activities, to a range of actions that is safe
from a medical point of view. The imitation learning approach offers
a suitable way to imitate the behaviour of experts. However,
two central questions arise. Firstly,
how can an agent imitating an expert
concentrate on learning safety-relevant actions? Secondly,
can an agent be given the opportunity to explore
the optimum within the action space while still maintaining a focus
on safety?</p>
      <p>To answer these questions, we have developed an approach that
learns from expert demonstrations and concentrates on adopting
the safety-relevant behaviour of the expert by appropriately
weighting the examples provided. Our approach defines two parameters
that determine how to deal with the state-action pairs observed
among experts. On the one hand, we have the parameter $\epsilon$, which
indicates how early a state should be classified as safety-relevant.
On the other hand, we have the safety focus $\lambda$, which forces the agent to
train on a subset of expert trajectories in which a share $\lambda$ of the examples
is classified as safety-relevant under a given value of $\epsilon$.</p>
      <p>Our approach to imitation learning was able to outperform
equivalent agents trained on balanced demonstrations with regard to
safety as well as reward. The generic conceptual approach
underlying the work can be applied to a wide range of RL tasks.
It is especially relevant for domains where expert knowledge is
available that defines how one should behave to be safe, but
where it is not certain what exactly the optimal behaviour looks
like. This is the case in the personalization of clinical pathways.
While physicians can precisely advise which activities to suggest
as rehabilitation under certain clinical conditions of the patient,
it is not certain whether these suggestions are the optimal choice.
With our approach we provide an important basis for exploring
the optimum when proposing individually parameterized activities
without violating the limits of the safety-relevant parameters.</p>
    </sec>
    <sec id="sec-12">
      <title>FUTURE WORK</title>
      <p>
        Besides the further exploration of the parameter combinations of $\epsilon$
and $\lambda$, the transfer to additional RL problems is pending. Evaluating
the approach on further 2D games in the gym environment is a
logical next step. Additionally, teaching robots to safely interact
with their environment is a relevant application [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Moreover, the
approach is to be evaluated in more complex RL tasks that focus
on the safety aspect, for which the recently published safety gym
is available [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        Future research should also consider how to completely avoid
the safety-critical situations that experts currently have to deal with. One possible
approach to this could be the simulation of responsibilities and the
evaluation of possible reactions by an expert, using human-in-the-loop
approaches as feedback for the system, see [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-13">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partially supported by the project vCare: Virtual
Coaching Activities for Rehabilitation in Elderly (funded by Horizon
2020 research and innovation programme under Grant Agreement
Number: 769807). Special acknowledgements are directed to the
partners of the project, who have contributed valuable feedback in
the specification of the research problem and by providing their
expertise to this study.</p>
    </sec>
    <sec id="sec-14">
      <title>INSIGHT ON EXPERT DEMONSTRATIONS</title>
      <p>In the following we show the cost functions calculated for the expert
demonstrations. Figure 7 shows the two cost functions calculating the
safety for left and right.
In addition, we evaluated the safety cost function in the
front (ahead) dimension, as shown in Figure 8.</p>
    </sec>
    <sec id="sec-15">
      <title>ABLATION STUDY</title>
      <p>In the following we provide further insights into the agent's
performance when trained with different levels of safety focus.</p>
      <p>Safety Focus 0.0: To complete the report on the reinforcement learning agent's
performance with no safety focus, besides Figure 5 we provide the cost
function for the safety evaluation to the front. Training the agent
with $\lambda = 0.0$ results in the cost function to the front shown in Figure
9.</p>
      <sec id="sec-15-1">
        <title>Safety Focus 0.1</title>
        <p>Training the agent with a safety focus of 0.1 results in the cost
functions shown below. The safety estimation of the cost functions to the sides is
shown in Figure 10 and that of the cost function to the front in Figure 11.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ahmed M.</given-names>
            <surname>Alaa</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mihaela</given-names>
            <surname>van der Schaar</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning</article-title>
          . arXiv e-prints, Article arXiv:1802.07207 (Feb. 2018). arXiv:1802.07207 [cs.LG]
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Ahmed M.</given-names> <surname>Alaa</surname></string-name> and
          <string-name><given-names>Mihaela</given-names> <surname>van der Schaar</surname></string-name>.
          <year>2019</year>.
          <article-title>Attentive State-Space Modeling of Disease Progression</article-title>.
          In <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>,
          H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc.,
          <fpage>11338</fpage>-<lpage>11348</lpage>.
          http://papers.nips.cc/paper/9311-attentive-state-space-modeling-of-disease-progression.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>E.</given-names> <surname>Altman</surname></string-name>.
          <year>1999</year>.
          <source>Constrained Markov Decision Processes</source>.
          Chapman and Hall. https://doi.org/10.1016/0167-6377(96)00003-X
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Marcin</given-names> <surname>Andrychowicz</surname></string-name>,
          <string-name><given-names>Filip</given-names> <surname>Wolski</surname></string-name>,
          <string-name><given-names>Alex</given-names> <surname>Ray</surname></string-name>,
          <string-name><given-names>Jonas</given-names> <surname>Schneider</surname></string-name>,
          <string-name><given-names>Rachel</given-names> <surname>Fong</surname></string-name>,
          <string-name><given-names>Peter</given-names> <surname>Welinder</surname></string-name>,
          <string-name><given-names>Bob</given-names> <surname>McGrew</surname></string-name>,
          <string-name><given-names>Josh</given-names> <surname>Tobin</surname></string-name>,
          <string-name><given-names>Pieter</given-names> <surname>Abbeel</surname></string-name>, and
          <string-name><given-names>Wojciech</given-names> <surname>Zaremba</surname></string-name>.
          <year>2017</year>.
          <article-title>Hindsight Experience Replay</article-title>.
          <source>arXiv e-prints</source>, Article arXiv:1707.01495 (July 2017). arXiv:1707.01495 [cs.LG]
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Dilip</given-names> <surname>Arumugam</surname></string-name>,
          <string-name><given-names>Jun Ki</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>Sophie</given-names> <surname>Saskin</surname></string-name>, and
          <string-name><given-names>Michael L.</given-names> <surname>Littman</surname></string-name>.
          <year>2019</year>.
          <article-title>Deep Reinforcement Learning from Policy-Dependent Human Feedback</article-title>.
          <source>arXiv e-prints</source>, Article arXiv:1902.04257 (Feb. 2019). arXiv:1902.04257 [cs.LG]
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Ioana</given-names> <surname>Bica</surname></string-name>,
          <string-name><given-names>Ahmed M.</given-names> <surname>Alaa</surname></string-name>,
          <string-name><given-names>J. Brian</given-names> <surname>Jordon</surname></string-name>, and
          <string-name><given-names>Mihaela</given-names> <surname>van der Schaar</surname></string-name>.
          <year>2020</year>.
          <article-title>Estimating Counterfactual Treatment Outcomes over Time Through Adversarially Balanced Representations</article-title>.
          In <source>Proc. 8th International Conference on Learning Representations (ICLR 2020)</source>. abs/2002.04083 (2020).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Kiante</given-names> <surname>Brantley</surname></string-name>,
          <string-name><given-names>Wen</given-names> <surname>Sun</surname></string-name>, and
          <string-name><given-names>Mikael</given-names> <surname>Henaff</surname></string-name>.
          <year>2020</year>.
          <article-title>Disagreement-Regularized Imitation Learning</article-title>.
          In <source>International Conference on Learning Representations</source>. https://openreview.net/forum?id=rkgbYyHtwB
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
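          <string-name><given-names>Greg</given-names> <surname>Brockman</surname></string-name>,
          <string-name><given-names>Vicki</given-names> <surname>Cheung</surname></string-name>,
          <string-name><given-names>Ludwig</given-names> <surname>Pettersson</surname></string-name>,
          <string-name><given-names>Jonas</given-names> <surname>Schneider</surname></string-name>,
          <string-name><given-names>John</given-names> <surname>Schulman</surname></string-name>,
          <string-name><given-names>Jie</given-names> <surname>Tang</surname></string-name>, and
          <string-name><given-names>Wojciech</given-names> <surname>Zaremba</surname></string-name>.
          <year>2016</year>.
          <article-title>OpenAI Gym</article-title>.
          arXiv:1606.01540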
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
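          <string-name><given-names>Carrie J.</given-names> <surname>Cai</surname></string-name>,
          <string-name><given-names>Emily</given-names> <surname>Reif</surname></string-name>,
          <string-name><given-names>Narayan</given-names> <surname>Hegde</surname></string-name>,
          <string-name><given-names>Jason</given-names> <surname>Hipp</surname></string-name>,
          <string-name><given-names>Been</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>Daniel</given-names> <surname>Smilkov</surname></string-name>,
          <string-name><given-names>Martin</given-names> <surname>Wattenberg</surname></string-name>,
          <string-name><given-names>Fernanda</given-names> <surname>Viegas</surname></string-name>,
          <string-name><given-names>Greg S.</given-names> <surname>Corrado</surname></string-name>,
          <string-name><given-names>Martin C.</given-names> <surname>Stumpe</surname></string-name>, and
          <string-name><given-names>Michael</given-names> <surname>Terry</surname></string-name>.
          <year>2019</year>.
          <article-title>Human-Centered Tools for Coping with Imperfect Algorithms during Medical Decision-Making</article-title>.
          <source>arXiv e-prints</source>, Article arXiv:1902.02960 (Feb. 2019). arXiv:1902.02960 [cs.HC]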
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>Gal</given-names> <surname>Dalal</surname></string-name>,
          <string-name><given-names>Krishnamurthy</given-names> <surname>Dvijotham</surname></string-name>,
          <string-name><given-names>Matej</given-names> <surname>Vecerík</surname></string-name>,
          <string-name><given-names>Todd</given-names> <surname>Hester</surname></string-name>,
          <string-name><given-names>Cosmin</given-names> <surname>Paduraru</surname></string-name>, and
          <string-name><given-names>Yuval</given-names> <surname>Tassa</surname></string-name>.
          <year>2018</year>.
          <article-title>Safe Exploration in Continuous Action Spaces</article-title>.
          <source>CoRR</source> abs/1801.08757 (2018). arXiv:1801.08757 http://arxiv.org/abs/1801.08757
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>Chelsea</given-names> <surname>Finn</surname></string-name>,
          <string-name><given-names>Tianhe</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>Tianhao</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Pieter</given-names> <surname>Abbeel</surname></string-name>, and
          <string-name><given-names>Sergey</given-names> <surname>Levine</surname></string-name>.
          <year>2017</year>.
          <article-title>One-Shot Visual Imitation Learning via Meta-Learning</article-title>.
          <source>CoRR</source> abs/1709.04905 (2017). arXiv:1709.04905 http://arxiv.org/abs/1709.04905
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>Jonathan</given-names> <surname>Ho</surname></string-name> and
          <string-name><given-names>Stefano</given-names> <surname>Ermon</surname></string-name>.
          <year>2016</year>.
          <article-title>Generative Adversarial Imitation Learning</article-title>.
          In <source>Advances in Neural Information Processing Systems</source>
          <volume>29</volume>,
          D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc.,
          <fpage>4565</fpage>-<lpage>4573</lpage>.
          http://papers.nips.cc/paper/6391-generative-adversarial-imitation-learning.pdf
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
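          <string-name><given-names>Leigh</given-names> <surname>Kinsman</surname></string-name>,
          <string-name><given-names>Thomas</given-names> <surname>Rotter</surname></string-name>,
          <string-name><given-names>Erica</given-names> <surname>James</surname></string-name>,
          <string-name><given-names>Pamela</given-names> <surname>Snow</surname></string-name>, and
          <string-name><given-names>Jon</given-names> <surname>Willis</surname></string-name>.
          <year>2010</year>.
          <article-title>What is a clinical pathway? Development of a definition to inform the debate</article-title>.
          <source>BMC Medicine</source> (2010).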
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
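          <string-name><given-names>Romain</given-names> <surname>Laroche</surname></string-name>,
          <string-name><given-names>Mehdi</given-names> <surname>Fatemi</surname></string-name>,
          <string-name><given-names>Joshua</given-names> <surname>Romoff</surname></string-name>, and
          <string-name><given-names>Harm</given-names> <surname>van Seijen</surname></string-name>.
          <year>2017</year>.
          <article-title>Multi-Advisor Reinforcement Learning</article-title>.
          <source>CoRR</source> abs/1704.00756 (2017). arXiv:1704.00756 http://arxiv.org/abs/1704.00756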
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>Keuntaek</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>Kamil</given-names> <surname>Saigol</surname></string-name>, and
          <string-name><given-names>Evangelos A.</given-names> <surname>Theodorou</surname></string-name>.
          <year>2018</year>.
          <article-title>Safe end-to-end imitation learning for model predictive control</article-title>.
          <source>CoRR</source> abs/1803.10231 (2018). arXiv:1803.10231 http://arxiv.org/abs/1803.10231
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>Zachary C.</given-names> <surname>Lipton</surname></string-name>.
          <year>2017</year>.
          <article-title>The Doctor Just Won't Accept That!</article-title>
          <source>arXiv e-prints</source>, Article arXiv:1711.08037 (Nov. 2017). arXiv:1711.08037 [stat.ML]
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>James</given-names> <surname>MacGlashan</surname></string-name>,
          <string-name><given-names>Mark K.</given-names> <surname>Ho</surname></string-name>,
          <string-name><given-names>Robert</given-names> <surname>Loftin</surname></string-name>,
          <string-name><given-names>Bei</given-names> <surname>Peng</surname></string-name>,
          <string-name><given-names>Guan</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>David L.</given-names> <surname>Roberts</surname></string-name>,
          <string-name><given-names>Matthew E.</given-names> <surname>Taylor</surname></string-name>, and
          <string-name><given-names>Michael L.</given-names> <surname>Littman</surname></string-name>.
          <year>2017</year>.
          <article-title>Interactive Learning from Policy-Dependent Human Feedback</article-title>.
          In <source>Proceedings of the 34th International Conference on Machine Learning</source> (Proceedings of Machine Learning Research, Vol.
          <volume>70</volume>),
          Doina Precup and Yee Whye Teh (Eds.). PMLR, International Convention Centre, Sydney, Australia,
          <fpage>2285</fpage>-<lpage>2294</lpage>.
          http://proceedings.mlr.press/v70/macglashan17a.html
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>Kunal</given-names> <surname>Menda</surname></string-name>,
          <string-name><given-names>Katherine Rose</given-names> <surname>Driggs-Campbell</surname></string-name>, and
          <string-name><given-names>Mykel J.</given-names> <surname>Kochenderfer</surname></string-name>.
          <year>2017</year>.
          <article-title>DropoutDAgger: A Bayesian Approach to Safe Imitation Learning</article-title>.
          <source>ArXiv</source> abs/1709.06166 (2017).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><given-names>Kunal</given-names> <surname>Menda</surname></string-name>,
          <string-name><given-names>Katherine Rose</given-names> <surname>Driggs-Campbell</surname></string-name>, and
          <string-name><given-names>Mykel J.</given-names> <surname>Kochenderfer</surname></string-name>.
          <year>2017</year>.
          <article-title>DropoutDAgger: A Bayesian Approach to Safe Imitation Learning</article-title>.
          <source>CoRR</source> abs/1709.06166 (2017). arXiv:1709.06166 http://arxiv.org/abs/1709.06166
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><given-names>Kunal</given-names> <surname>Menda</surname></string-name>,
          <string-name><given-names>Katherine Rose</given-names> <surname>Driggs-Campbell</surname></string-name>, and
          <string-name><given-names>Mykel J.</given-names> <surname>Kochenderfer</surname></string-name>.
          <year>2018</year>.
          <article-title>EnsembleDAgger: A Bayesian Approach to Safe Imitation Learning</article-title>.
          <source>CoRR</source> abs/1807.08364 (2018). arXiv:1807.08364 http://arxiv.org/abs/1807.08364
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name><given-names>Alex</given-names> <surname>Ray</surname></string-name>,
          <string-name><given-names>Joshua</given-names> <surname>Achiam</surname></string-name>, and
          <string-name><given-names>Dario</given-names> <surname>Amodei</surname></string-name>.
          <year>2019</year>.
          <article-title>Benchmarking Safe Exploration in Deep Reinforcement Learning</article-title>. (2019).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name><given-names>Siddharth</given-names> <surname>Reddy</surname></string-name>,
          <string-name><given-names>Anca D.</given-names> <surname>Dragan</surname></string-name>,
          <string-name><given-names>Sergey</given-names> <surname>Levine</surname></string-name>,
          <string-name><given-names>Shane</given-names> <surname>Legg</surname></string-name>, and
          <string-name><given-names>Jan</given-names> <surname>Leike</surname></string-name>.
          <year>2019</year>.
          <article-title>Learning Human Objectives by Evaluating Hypothetical Behavior</article-title>.
          <source>arXiv e-prints</source>, Article arXiv:1912.05652 (Dec. 2019). arXiv:1912.05652 [cs.CY]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>