Recommending safe actions by learning from sub-optimal demonstrations

Lars Boecking (boecking@fzi.de) and Patrick Philipp (philipp@fzi.de)
FZI Research Center for Information Technology, Karlsruhe, Germany

ABSTRACT
Clinical pathways describe the treatment procedure for a patient from a medical point of view. Based on the patient's condition, a decision is made about the next actions to be carried out. Such recurring sequential process decisions could well be outsourced to a reinforcement learning agent, but the patient's safety should always be the main consideration when suggesting activities. The development of individual pathways is also cost- and time-intensive, so a smart agent could support and relieve physicians. In addition, not every patient reacts in the same way to a clinical intervention, so the personalization of a clinical pathway should be given attention. In this paper we address the fundamental problem that reinforcement learning agents used in the specification of clinical pathways should provide an individually optimal proposal within the limits of safety constraints. Imitating the decisions of physicians can guarantee safety but not optimality. We therefore present an approach that ensures compliance with health-critical rules without limiting the exploration of the optimum. We evaluate our approach in an open-source Gym environment, where we show that our adaptation of behavior cloning not only adheres better to safety regulations but also explores the space around the optimum more effectively, as reflected in the collected rewards.

CCS CONCEPTS
• Applied computing → Health care information systems.

KEYWORDS
health care; clinical pathway; reinforcement learning; personalization; imitation learning; safety; constraints

ACM Reference Format:
Lars Boecking and Patrick Philipp. 2020. Recommending safe actions by learning from sub-optimal demonstrations. In Proceedings of the 5th International Workshop on Health Recommender Systems co-located with 14th ACM Conference on Recommender Systems (HealthRecSys'20), Online, Worldwide, September 26, 2020, 8 pages.

1 INTRODUCTION
Our work focuses on the use of reinforcement learning to optimize and personalize clinical pathways, as illustrated in Figure 1. A rehabilitation procedure, called a "clinical pathway", describes in detail which activities are to be carried out for a patient within a course of treatment [13].

[Figure 1: Clinical pathway recommender]

The process of creating a clinical pathway tailored to an individual patient spans several stages. To adapt a clinical pathway to a patient's needs, one starts from a disease-specific blueprint and later incorporates the patient's clinical picture as well as his or her individual preferences.

On an abstract level, the adaptation of a pathway can be modelled as a decision process. A number of activities must be decided upon, which in turn have interdependent effects among one another. Feedback on the effectiveness of the decisions made is often only given with a delay or in aggregated form, for example during a control visit to the doctor after a certain time. Reinforcement learning (RL) is about optimizing processes that can be described as a feedback control loop. The application of RL to the individualization of clinical pathways is therefore particularly well suited and promising.

The personalization of a clinical pathway is about identifying the optimal combination of activities and treatments in rehabilitation for an individual patient.
In this context optimality can be considered from different viewpoints. On the one hand, we see the fundamental objective of proposing rehabilitation measures that are safe from a medical perspective. On the other hand, we aim to support the recovery process in the best possible way by exploring alternative rehabilitation activities. While there are generic templates for different medical diagnoses that are safe, there is the need to go beyond them and adapt the clinical pathway in order to provide tailored care plans for individual patients.

In order to address the objectives described above for clinical pathway recommender systems, we present a safety-aware reinforcement learning approach. On a conceptual level this means that we have a state s_t of our patient and our agent proposes an action a_t, a rehabilitation measure, for our patient at time t (Figure 1). The agent receives a reward r_{t+1} based on the change in the condition of our patient s_{t+1}. While classical RL is based on trial and error, the healthcare application must guarantee the safety of the patient during the proposed activity. Imitation learning is one of the ways in which this is pursued. Here the agent is trained to "imitate" an expert's actions, i.e., to suggest a treatment activity similar to the one a doctor would choose when faced with the same patient profile. Current work in imitation learning [11, 12] focuses on efficiently learning from demonstrations without paying special attention to safety or exploration. Research that identified safety as an objective in imitation learning [18] based its concept on staying as close as possible to the demonstrated examples. However, it is by no means guaranteed that the doctor's suggestion is optimal for the rehabilitation of the individual patient.

Challenges:
• How can we emphasize the importance of safety when suggesting rehabilitation treatments to a reinforcement learning agent?
• How can an agent explore the individual optimum and still remain within a safe and medically acceptable action space?

In answering these questions within this study, we make the following contributions:
• a conceptual approach to extract safety-relevant behavior from expert demonstrations
• an adapted conceptual method for imitation learning that emphasizes safety-critical thinking
• an implementation, application and preliminary evaluation of the concepts

Paper outline: After we position our work within the related scientific work in Section 2, we introduce the conceptual background of our approach in Section 3.1. While in Section 3.2 we present the novel concepts of our approach, in Section 3.3 we focus on their explicit application to optimize clinical pathways. In Section 4 we outline our evaluation method, and we discuss the results achieved in Section 5. Our work is then completed by a conclusion (Section 6) and an outlook on future work in Section 7.

2 RELATED WORK
Our work covers various areas of health care and machine learning, which we examine in greater detail below.

Research has shown an increasing interest in applying machine learning techniques to health-care-related tasks. From modelling disease progression [2] to automated clinical prognostics [1], methods of artificial intelligence have proven to be promising. In further applications, algorithms are used to annotate medical images and support doctors' decision-making in a human-ML collaborative way [9]. Overall, decisive questions are emerging for the use of machine learning in the health sector. The decision of a system must be validated and made comprehensible. Only if the physician can be sure that the outcome of a machine learning algorithm is understandable and, above all, guarantees the safety of the patient, can such systems prevail in the long term [16].

Pathway and treatment: Bica et al. [6] introduced Counterfactual Recurrent Networks to estimate treatment effects by modelling the time-dependent impact of treatments on covariates based on the patient's clinical history. Besides these topic-related areas, various conceptual fields of machine learning are relevant to our approach.

Imitation learning is about training an agent to mimic the behaviour of an expert. Approaches such as inverse RL, e.g. GAIL (Generative Adversarial Imitation Learning), have recently achieved remarkable success [12]. Beyond this, we have seen approaches that attempt to reconstruct the expert's objective by evaluating hypothetical behaviour of an agent [22]. Further adaptations of imitation learning are concerned with incorporating examples of an expert during the active learning process [4]. However, these approaches have neither been adapted to learn from sub-optimal examples nor do they emphasise safety-relevant aspects.

Constrained RL: First considerations about setting boundaries on the exploration of a reinforcement learning agent go back to Constrained Markov Decision Processes [3]. Recent work applied constraints in the form of predefined threshold values in continuous action spaces by adding a safety layer that corrects the suggestion of the policy network in case of a constraint violation [10]. Unlike our approach, these concepts are based on pre-defined limits that are not deduced from examples and do not learn from experts.

Safety RL: We have seen approaches that measure the similarity between the novice's and the expert's choice of action to prevent the agent from suggesting unsafe actions by considering the state distribution [19] or the disagreement between multiple agents [7]. Follow-up research considered quantifying policy uncertainty to model the risk of exploration [20]. Lee et al. [15] proposed end-to-end imitation learning where safety is addressed by evaluating the uncertainty of a Bayesian convolutional network. Yet again, no approach has been adapted to differentiate existing demonstrations and adopt safety-relevant behaviour in a targeted manner.

Multi-criteria: Laroche et al. [14] introduced Multi-Advisor RL, where n advisors are specialized on sub-tasks of the problem and an aggregator is used to derive a global policy based on the individual recommendations. While the safety of an RL problem can be described as a multi-criteria problem, the question remains how such an approach can guarantee compliance with safety constraints and foster exploration within these limits.

3 OUR APPROACH
Contrary to previous imitation learning techniques, our approach focuses on avoiding unsafe states while still exploring safe states to find the optimum. We teach the agent to handle safety-critical states by imitating expert actions in similar situations. In safe states, however, the agent does not need to stick exactly to the behavior observed in the expert demonstrations. In fact, we encourage it to search for the best personalized clinical path possible by exploration. While current safety RL algorithms [14, 19] focus on choosing actions that converge to the median of the expert demonstrations, which is often not the optimum, our approach aims at encouraging the agent to explore the state space while staying inside safe boundaries.

3.1 Formal Description
At each step t the agent selects an action a_t ∈ A(s) based on the received representation of the environment state s_t ∈ S. In a health recommender system, actions and states correspond to recommended therapy activities and the patient's clinical state, respectively. The agent receives a reward r_{t+1}, which quantifies the development of the clinical condition and personal well-being of the patient, and a new state s_{t+1} of the patient as a consequence of its action. π_t(a|s) is the agent's policy, which assigns a probability to each action in a given state and chooses the most promising one. This part is to be trained during exploration or, in the case of imitation learning, during the expert observation. Since the new state serves as the input for the next iteration, the agent keeps interacting with the environment and creates a trajectory τ = (s_t, a_t), t ∈ [t_0, t_h], where s_t is the state and a_t the action at a given time, and t runs from the start time t_0 to the time of termination t_h. The trajectory for an individual patient directly relates to the configured pathway (actions a_t map to the parameterised treatment activities foreseen in the clinical pathway) and the observed reaction of the patient (states s_t). The objective function, denoted as J(π), is

    J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)], \qquad R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t    (1)

where γ ∈ [0, 1] is a discount factor.
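To make this notation concrete, the following minimal Python sketch (our own illustration, not taken from the implementation evaluated later) collects a trajectory τ = (s_t, a_t) through the classic OpenAI Gym interface [8] and evaluates the discounted return of equation (1); the policy callable and the environment are placeholders.

    import gym

    def rollout(env, policy, max_steps=1000):
        """Collect one trajectory tau = [(s_t, a_t), ...] together with the rewards r_t.
        Assumes the classic Gym API (env.reset() -> obs, env.step() -> 4-tuple)."""
        trajectory, rewards = [], []
        state = env.reset()
        for _ in range(max_steps):
            action = policy(state)                      # pi(a|s), here a plain callable
            next_state, reward, done, _ = env.step(action)
            trajectory.append((state, action))
            rewards.append(reward)
            state = next_state
            if done:
                break
        return trajectory, rewards

    def discounted_return(rewards, gamma=0.99):
        """R(tau) = sum_t gamma^t * r_t, cf. equation (1)."""
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    # Hypothetical usage: env = gym.make("CarRacing-v0"); tau, rs = rollout(env, expert_policy)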
As we are dealing with the complex task of adapting clinical pathways, the modelling of several objectives and constraints gains in importance.

Constraints: Constrained Markov Decision Processes (CMDPs) [3] limit the set of policies to a subset Π_C ⊂ Π that fulfils a set of constraints C such that

    \Pi_C = \{ \pi : J_{c_i}(\pi) \le d_i \;\; \forall i = 1, \dots, k \}    (2)

    J_{c_i}(\pi) = \mathbb{E}_{\tau \sim \pi}[ c_i(\tau) ]    (3)

J_{c_i} is the estimate of the expected value of a cost function c_i over the space of trajectories generated by the policy π. The resulting space of allowed policies only includes policies that do not exceed a defined limit d_i ∈ R for all of the defined cost functions.

3.2 Safety Imitation
Focusing on modelling the safety of a reinforcement learning agent, we define c_safety, for brevity c_s, to approximate the safety of a given state s_t. The flexibility of the approach provides the possibility to differentiate safety along several dimensions or to describe it as a holistic unit. In the case of imitation learning from sub-optimal but safe demonstrations, we calculate the threshold value d_s over the distribution of expert trajectories, such that d_s = max J_{c_s}(π_exp) as observed in the expert demonstrations T_exp. Evaluating the received expert trajectories, we can now quantify how critical the different states were in terms of safety by defining

    T_{exp}^{\epsilon} = \{ (s_t, a_t) : s_t \in T_{exp} \wedge J_{c_s}(s_t) \ge d_s - \epsilon \}    (4)

By focusing on the subset T_exp^ε to train our agent, we can assure that it knows how to handle critical situations while preserving the freedom of exploring safe states.

The collected demonstration data set is then weighted in such a way that the training data set for imitation learning consists of safety-relevant trajectories (T_exp^ε) to a defined extent, mixed with randomly sampled trajectories from T_exp. Throughout this paper we will refer to this weighting as the safety focus α ∈ [0, 1].

    T_{exp}^{train} = \{ \alpha \cdot (s_t, a_t) \subset T_{exp}^{\epsilon} \} \cup \{ (1-\alpha) \cdot (s_t, a_t) \subset T_{exp} \setminus T_{exp}^{\epsilon} \}    (5)

It is essential to highlight that the data set used for training the agent is not extended by additional information such as a safety factor; rather, a subset of the demonstrations is deliberately chosen for the training. During imitation learning the agent is not told at any time whether the state-action pair currently presented to it in the context of supervised learning is a safety-relevant example. The approach changes solely the composition of the training data set.
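A minimal Python sketch of equations (4) and (5), under our own assumptions (a per-state safety cost callable named safety_cost and sampling with replacement), illustrates how the safety-critical subset T_exp^ε is extracted and how the safety focus α determines the composition of the training set; none of these helper names come from the paper's implementation.

    import random

    def split_by_safety(demos, safety_cost, epsilon):
        """Equation (4): a pair (s_t, a_t) is safety-critical if its cost lies within
        epsilon of the worst value d_s observed in the expert demonstrations."""
        d_s = max(safety_cost(s) for s, _ in demos)      # d_s = max J_cs over T_exp
        critical = [(s, a) for s, a in demos if safety_cost(s) >= d_s - epsilon]
        uncritical = [(s, a) for s, a in demos if safety_cost(s) < d_s - epsilon]
        return critical, uncritical

    def compose_training_set(demos, safety_cost, epsilon, alpha, n_pairs):
        """Equation (5): a share alpha of the training pairs is drawn from the
        safety-critical subset, the remainder from the other demonstrations."""
        critical, uncritical = split_by_safety(demos, safety_cost, epsilon)
        n_critical = int(alpha * n_pairs)
        return (random.choices(critical, k=n_critical) +
                random.choices(uncritical, k=n_pairs - n_critical))

    # Hypothetical usage: T_train = compose_training_set(T_exp, c_s, epsilon=5, alpha=0.5, n_pairs=4692)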
3.3 Implications for Health Recommender Systems
Our approach allows us to learn effectively from demonstrations that guarantee a safe state of the environment, or respectively of the patient, even if the demonstrated actions do not always show the optimal reaction to the state. Furthermore, we aim to train a reinforcement learning agent with weighted expert demonstrations and thereby put safety, or other evaluation criteria, in the foreground. Applying the approach to training a reinforcement learning agent that suggests and parameterizes treatment activities in a clinical pathway, we train the agent to explore the optimal recommendation while imitating expert recommendations when facing critical states, as described by the constraint cost functions.

If the formal description is applied to the healthcare application, the cost function evaluates the clinical condition of the patient. In concrete terms, one could, for example, evaluate the deviation of the measured pulse from the resting or optimal pulse. Looking at the expert's demonstrations, i.e. any number of pairs of the patient's condition and the proposed therapy measure, one can evaluate the cost function for each demonstration, i.e. the safety assessment of the patient's clinical condition. It is crucial that the costs are not per se included in the objective function but are used as restrictions. As a result, an increased heart rate is not interpreted as negative by our recommender; instead, we take care in the decision-making process that the safety of this attribute stays within certain limits.

We are therefore aware that the heart rate deviates during a therapeutic measure and that this is one of the undesirable effects. But we want to make sure that the proposals of our intelligent system remain within the limits of the experts' opinions. So if we see in the trajectories that the safety costs stay below a certain level, we want to make sure that our proposals do not exceed this limit. To learn this, the demonstrations where the patient's condition was particularly close to the observed limit are particularly relevant. In our approach we define a subset of trajectories T_exp^ε that lie within a defined distance ε of the critical limit. From this subset we know that it is particularly relevant for learning how to avert critical states. During the training process, our intelligent system should accordingly pay special attention to adapting the expert suggestions close to the critical states.

4 EVALUATION
Although the concept presented here was developed out of the motivation to individualize clinical pathways for patients, it can be applied to various applications of reinforcement learning. For this reason, and because clinical data was not available to the extent necessary for an analysis, the evaluation is based on common and comparable safety benchmark problems. We use the Gym environment provided by OpenAI.

4.1 Gym Environment
The Gym environment offers the possibility to run different task scenarios for reinforcement learning agents and to extend the provided framework. Especially Atari games and two-dimensional games such as car racing are very popular and provide an excellent baseline to compare results. Due to the parallels between the car-racing environment and a recommender in the health care sector, this environment is particularly suitable to demonstrate the functionality of our approach. The car on the race track represents the condition of the patient, which changes depending on the action: steering and accelerating, or the parameterization of the next treatment measure. The more critically the condition of the patient, i.e. the position of the vehicle on the track, is evaluated, the more relevant it is to behave similarly to the expert demonstrations. While we have described the relevance of the heart rate in the clinical setting above, safety in this environment can be quantified with a cost function based on the distance to the edge of the track. So while in the medical case we can observe how a doctor behaves when the heart rate is particularly high or exceptionally low, in this environment we can quantify how far the vehicle is from the edge of the track.
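As an illustration of such a distance-based safety quantification, the following sketch estimates the distance to the unsafe area directly from a raw frame. It rests entirely on our own assumptions (a 96x96x3 RGB observation, the car at a roughly fixed pixel position, and grass recognised by a dominant green channel); the exact heuristics used in the evaluation are not specified in the paper.

    import numpy as np

    def is_unsafe(pixel, green_margin=40):
        """Assumed heuristic: a pixel counts as grass if its green channel clearly dominates."""
        r, g, b = int(pixel[0]), int(pixel[1]), int(pixel[2])
        return g - max(r, b) > green_margin

    def distance_to_unsafe(frame, start, direction, max_dist=30):
        """Walk from the assumed car position in one direction until grass is reached."""
        y, x = start
        dy, dx = direction
        for d in range(1, max_dist):
            py, px = y + d * dy, x + d * dx
            if not (0 <= py < frame.shape[0] and 0 <= px < frame.shape[1]):
                return d
            if is_unsafe(frame[py, px]):
                return d
        return max_dist

    def safety_values(frame, car_pos=(70, 48)):
        """Distances in the directions left, front and right; a larger distance means a
        safer state (a cost convention can be obtained by negating the values)."""
        return {"left": distance_to_unsafe(frame, car_pos, (0, -1)),
                "front": distance_to_unsafe(frame, car_pos, (-1, 0)),
                "right": distance_to_unsafe(frame, car_pos, (0, 1))}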
[Figure 2: Safety-critical and uncritical states in the evaluation environment]

To learn how to deal with critical conditions, we then look at the demonstrations where the assessment of the condition was particularly critical, as described in the formal description. The subset used for imitation learning is selected from those demonstrations based on equation 5.

4.2 Experiment Set Up
In the following we describe the dimensions and parameters used for our evaluation in more detail.

Demonstrations: All experiments were carried out on the same demonstration data set of size |T_exp| = 4692 state-action pairs (s_t, a_t); for further detail see Appendix A. Imitation learning was performed as supervised learning of a TensorFlow model with the same architecture for every experiment. The agent was trained for 2000 batches of (s_t, a_t) pairs.

Cost function: In our evaluation we consider three cost functions that quantify the car's position in the environment. Since the car's state represents a patient's clinical state, these can be seen as three different clinical parameters that are monitored during expert training. The cost functions quantify the car's position by evaluating the game frame received as the state representation. In the three directions left, front and right we calculate the distance to the unsafe state, the green area beside the road. Evaluating these three cost functions for each state observed during the demonstrations, we develop a representation of the states' safety. The parameter ε, which indicates how early a state should be classified as safety-relevant, is set to ε = 5. To calculate it, we move from the edge of the distribution of expert examples in the dimension of a constraint, in this case safety, towards the centre of the distribution. Visually, this parameter defines how wide the edge of the distribution is that is classified as safety-critical, as shown in Figure 3.

[Figure 3: Distribution of safety in the demonstrations]

Agent testing: After the weights of the reinforcement learning agent have been trained via imitation learning, the agent is evaluated in a newly generated Gym environment. Here we observe the agent for two whole episodes to collect information about its performance and its safety. Depending on the individual performance of the agent, this relates to ≈ 2000 state-action pairs.
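The imitation-learning step itself is plain behaviour cloning. The following tf.keras sketch is only a schematic stand-in for the TensorFlow model mentioned above: the architecture, the tanh output for a three-dimensional action (steering, gas, brake) and the batch size are our own assumptions, as the paper does not report them.

    import numpy as np
    import tensorflow as tf

    def build_policy():
        """Small convolutional policy network mapping a 96x96x3 frame to an action."""
        return tf.keras.Sequential([
            tf.keras.layers.Conv2D(16, 5, strides=2, activation="relu", input_shape=(96, 96, 3)),
            tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(3, activation="tanh"),   # steering, gas, brake (simplified)
        ])

    def behaviour_cloning(train_pairs, batches=2000, batch_size=32):
        """Supervised learning on the alpha-weighted state-action pairs of equation (5)."""
        model = build_policy()
        model.compile(optimizer="adam", loss="mse")
        states = np.array([s for s, _ in train_pairs], dtype=np.float32) / 255.0
        actions = np.array([a for _, a in train_pairs], dtype=np.float32)
        for _ in range(batches):
            idx = np.random.randint(0, len(states), size=batch_size)
            model.train_on_batch(states[idx], actions[idx])
        return model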
5 PRELIMINARY RESULTS
In the following we present the preliminary results of applying our approach to the safety-critical decision process described in Section 4.2. Different values for α in equation 5 have shown a significant influence on the performance of the agent with respect to safety as well as reward, as shown in Table 1.

Table 1: Preliminary results for different safety focus values

    safety focus α | safety mean | safety std | reward mean
    0.0            | 13.05       | 16.41      | 139.50
    0.1            | 17.30       | 12.87      | 228.88
    0.5            | 21.37       | 11.12      | 697.41
    0.8            | 20.94       | 14.08      | 549.51

The results show that the safety focus has a significant impact on the agent's performance. The agent trained with the unweighted expert demonstrations achieves an average safety rating of 13.05 for its proposals, and the variation in safety over the ≈ 2000 state-action pairs of 16.41 should be noted. The approach of pre-selecting and weighting the demonstrations based on the distribution of the cost function shows a positive impact. The safety evaluation of the conditions caused by the agent can be raised to a level of 17.30 by a weighting of α = 0.1, and with a weighting of α = 0.5 it reaches a value of 21.37. In addition, the weighting of the expert trajectories in these cases also leads to a more robust reinforcement learning agent, which is reflected in the standard deviation of the safety. To make the presented results more comprehensible, Figure 4 provides a visualization: training the agent with different safety focus values α results in the safety and reward shown on the y-axis, with the standard deviation represented by the dot size.

[Figure 4: Impact of the safety focus on episode reward and safety]

Taking a closer look at the cost function values of two agents, one trained without safety focus (Figure 5) and one trained with a safety focus of α = 0.8 (Figure 6), shows that emphasising the safety-critical trajectories T_exp^ε in the expert demonstrations can significantly raise the safety of the actions recommended by the agent. While the performance of the agents is already reflected in the values listed in Table 1, the reasons for it can be identified in Figures 5 and 6.

[Figure 5: No safety focus]

[Figure 6: Safety function with 0.8 safety focus]

The runs without safety focus were not able to keep a sufficient distance from the critical states, while the safety-focus runs successfully learned to avert critical states in the manner of the expert's reactions. While the agent without safety focus was not able to learn the correct handling of safety-critical conditions during imitation learning, our approach was successful in adopting the expert's handling of critical states. By pre-selecting the expert examples, without providing any further information during the training process, the agent with safety focus was able to avert safety-critical conditions similarly to the expert's behaviour.

6 CONCLUSION
The motivation for this work is derived from the medical context, in which the objective is to adapt clinical pathways to a patient's needs in the best possible way. While this scenario can be aptly described as a reinforcement learning problem, as discussed in the introduction, it is important to limit the exploration, and thus the parameterisation of therapies and activities, to a range of action that is safe from a medical point of view. Imitation learning offers a suitable way to imitate the behaviour of experts. However, two central questions arose. Firstly, how can an agent imitating an expert concentrate on learning safety-relevant actions? Secondly, can an agent be given the opportunity to explore the optimum within the action space while still maintaining a focus on safety?

To answer these questions, we have developed an approach that learns from expert demonstrations and concentrates on adopting the safety-relevant behaviour of the expert by appropriately weighting the examples provided. Our approach defines two parameters that determine how to deal with the state-action pairs observed among experts. On the one hand, we have the parameter ε, which indicates how early a state should be classified as safety-relevant. On the other hand, we have the safety focus α, which forces the agent to train on a subset of expert trajectories in which a share α of the examples is classified as safety-relevant under a given value of ε.

Our approach to imitation learning was able to outperform equivalent agents trained on balanced demonstrations with regard to safety as well as reward. The generic conceptual approach underlying this work can be applied to a wide range of RL tasks. It is especially relevant for domains where expert knowledge is available that defines how one should behave to be safe, but where it is not certain what exactly the optimal behaviour looks like. This is the case in the personalization of clinical pathways. While physicians can precisely advise which activities to suggest as rehabilitation under certain clinical conditions of the patient, it is not certain whether these suggestions are the optimal choice. With our approach we provide an important basis for exploring the optimum when proposing individually parameterized activities without violating the limits of the safety-relevant parameters.

7 FUTURE WORK
Besides the further exploration of the parameter combinations of ε and α, the transfer to additional RL problems is pending.
Evaluating the approach on further 2D games in the Gym environment is a logical next step. Additionally, teaching robots to safely interact with their environment is a relevant application [8]. Moreover, the approach is to be evaluated in more complex RL tasks that focus on the safety aspect, for which the recently published Safety Gym is available [21].

Future research should also consider how to completely avoid the safety-critical examples that are dealt with by experts. One possible approach to this could be the simulation of responsibilities and the evaluation of possible reactions by an expert, using human-in-the-loop approaches as feedback for the system, see [17] and [5].

ACKNOWLEDGMENTS
This work was partially supported by the project vCare: Virtual Coaching Activities for Rehabilitation in Elderly (funded by the Horizon 2020 research and innovation programme under Grant Agreement Number 769807). Special acknowledgements are directed to the partners of the project, who have contributed valuable feedback in the specification of the research problem and by providing their expertise to this study.

REFERENCES
[1] Ahmed M. Alaa and Mihaela van der Schaar. 2018. AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning. arXiv:1802.07207 [cs.LG]
[2] Ahmed M. Alaa and Mihaela van der Schaar. 2019. Attentive State-Space Modeling of Disease Progression. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 11338-11348.
[3] E. Altman. 1999. Constrained Markov Decision Processes. Chapman and Hall. https://doi.org/10.1016/0167-6377(96)00003-X
[4] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. 2017. Hindsight Experience Replay. arXiv:1707.01495 [cs.LG]
[5] Dilip Arumugam, Jun Ki Lee, Sophie Saskin, and Michael L. Littman. 2019. Deep Reinforcement Learning from Policy-Dependent Human Feedback. arXiv:1902.04257 [cs.LG]
[6] Ioana Bica, Ahmed M. Alaa, J. Brian Jordon, and Mihaela van der Schaar. 2020. Estimating Counterfactual Treatment Outcomes over Time Through Adversarially Balanced Representations. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020).
[7] Kiante Brantley, Wen Sun, and Mikael Henaff. 2020. Disagreement-Regularized Imitation Learning. In International Conference on Learning Representations. https://openreview.net/forum?id=rkgbYyHtwB
[8] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv:1606.01540
[9] Carrie J. Cai, Emily Reif, Narayan Hegde, Jason Hipp, Been Kim, Daniel Smilkov, Martin Wattenberg, Fernanda Viegas, Greg S. Corrado, Martin C. Stumpe, and Michael Terry. 2019. Human-Centered Tools for Coping with Imperfect Algorithms during Medical Decision-Making. arXiv:1902.02960 [cs.HC]
[10] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerík, Todd Hester, Cosmin Paduraru, and Yuval Tassa. 2018. Safe Exploration in Continuous Action Spaces. arXiv:1801.08757
[11] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. 2017. One-Shot Visual Imitation Learning via Meta-Learning. arXiv:1709.04905
[12] Jonathan Ho and Stefano Ermon. 2016. Generative Adversarial Imitation Learning. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 4565-4573.
[13] Leigh Kinsman, Thomas Rotter, Erica James, Pamela Snow, and Jon Willis. 2010. What is a clinical pathway? Development of a definition to inform the debate. BMC Medicine (2010).
[14] Romain Laroche, Mehdi Fatemi, Joshua Romoff, and Harm van Seijen. 2017. Multi-Advisor Reinforcement Learning. arXiv:1704.00756
[15] Keuntaek Lee, Kamil Saigol, and Evangelos A. Theodorou. 2018. Safe End-to-End Imitation Learning for Model Predictive Control. arXiv:1803.10231
[16] Zachary C. Lipton. 2017. The Doctor Just Won't Accept That! arXiv:1711.08037 [stat.ML]
[17] James MacGlashan, Mark K. Ho, Robert Loftin, Bei Peng, Guan Wang, David L. Roberts, Matthew E. Taylor, and Michael L. Littman. 2017. Interactive Learning from Policy-Dependent Human Feedback. In Proceedings of the 34th International Conference on Machine Learning (PMLR Vol. 70). 2285-2294.
[18] Kunal Menda, Katherine Rose Driggs-Campbell, and Mykel J. Kochenderfer. 2017. DropoutDAgger: A Bayesian Approach to Safe Imitation Learning. arXiv:1709.06166
[19] Kunal Menda, Katherine Rose Driggs-Campbell, and Mykel J. Kochenderfer. 2017. DropoutDAgger: A Bayesian Approach to Safe Imitation Learning. arXiv:1709.06166
[20] Kunal Menda, Katherine Rose Driggs-Campbell, and Mykel J. Kochenderfer. 2018. EnsembleDAgger: A Bayesian Approach to Safe Imitation Learning. arXiv:1807.08364
[21] Alex Ray, Joshua Achiam, and Dario Amodei. 2019. Benchmarking Safe Exploration in Deep Reinforcement Learning. (2019).
[22] Siddharth Reddy, Anca D. Dragan, Sergey Levine, Shane Legg, and Jan Leike. 2019. Learning Human Objectives by Evaluating Hypothetical Behavior. arXiv:1912.05652 [cs.CY]

A INSIGHT ON EXPERT DEMONSTRATIONS
In the following we show the cost functions calculated for the expert demonstrations. In Figure 7 we see the two cost functions calculating the safety for left and right.

[Figure 7: Demonstration safety function, left and right]

In addition, we evaluated the safety cost function in the dimension straight, as shown in Figure 8.
[Figure 8: Demonstration safety function, straight]

B ABLATION STUDY
In the following we provide further insights into the agent's performance when trained with different levels of the safety focus α.

Safety Focus 0.0: To complete the report on the performance of the agent trained with no safety focus, in addition to Figure 5 we provide the cost function referring to the safety evaluation front. Training the agent with α = 0.0 results in the cost function to the front shown in Figure 9.

[Figure 9: Safety function front, no safety focus]

Safety Focus 0.1: Training the agent with a safety focus of 0.1 results in the cost functions shown below. The safety estimation for the side cost functions is shown in Figure 10 and for the front in Figure 11, respectively.

[Figure 10: Safety function side, 0.1 safety focus]

[Figure 11: Safety function front, 0.1 safety focus]

Safety Focus 0.5: Training the agent with a safety focus of 0.5 results in the safety functions shown in Figure 12 for the side safety estimation and in Figure 13 for the c_front safety. A safety focus of 0.5 not only emphasises behavior that returns from safety-critical states with respect to the left and right safety constraints but also with respect to the front safety.

[Figure 12: Safety function side, 0.5 safety focus]

[Figure 13: Safety function front, 0.5 safety focus]

Safety Focus 0.8: In addition to the safety function values for left and right shown in Figure 6, we provide the cost function for front. Training the agent with a safety focus of α = 0.8 results in the cost function to the front shown in Figure 14.

[Figure 14: Cost function front, 0.8 safety focus]