Recommending safe actions by learning from sub-optimal demonstrations

Lars Boecking (boecking@fzi.de) and Patrick Philipp (philipp@fzi.de)
FZI Research Center for Information Technology, Karlsruhe, Germany

ABSTRACT
Clinical pathways describe the treatment procedure for a patient from a medical point of view. Based on the patient's condition, a decision is made about the next actions to be carried out. Such recurring sequential process decisions could well be outsourced to a reinforcement learning agent, but the patient's safety should always be the main consideration when suggesting activities. The development of individual pathways is also cost- and time-intensive, so a smart agent could support and relieve physicians. In addition, not every patient reacts in the same way to a clinical intervention, so the personalization of a clinical pathway should be given attention. In this paper we address the fundamental problem that reinforcement learning agents used in the specification of clinical pathways should provide an individually optimal proposal within the limits of safety constraints. Imitating the decisions of physicians can guarantee safety but not optimality. We therefore present an approach that ensures compliance with health-critical rules without limiting the exploration of the optimum. We evaluate our approach in an open-source Gym environment, where we show that our adaptation of behavior cloning not only adheres better to safety regulations but also explores the space around the optimum more effectively, as reflected in the collected rewards.

CCS CONCEPTS
• Applied computing → Health care information systems.

KEYWORDS
health care; clinical pathway; reinforcement learning; personalization; imitation learning; safety; constraints

ACM Reference Format:
Lars Boecking and Patrick Philipp. 2020. Recommending safe actions by learning from sub-optimal demonstrations. In Proceedings of the 5th International Workshop on Health Recommender Systems co-located with 14th ACM Conference on Recommender Systems (HealthRecSys'20), Online, Worldwide, September 26, 2020, 8 pages.

1 INTRODUCTION
Our work focuses on the use of reinforcement learning to optimize and personalize clinical pathways, as illustrated in Figure 1. A rehabilitation procedure, called a "clinical pathway", describes in detail which activities are to be carried out for a patient within a course of treatment [13].

[Figure 1: Clinical pathway recommender]

The process of creating a clinical pathway tailored to an individual patient spans several stages. To adapt a clinical pathway to a patient's needs, one starts from a disease-specific blueprint and later incorporates the patient's clinical picture as well as his or her individual preferences.

On an abstract level, the adaptation of a pathway can be modelled as a decision process. A number of activities must be decided upon, which in turn have interdependent effects among one another. Feedback on the effectiveness of the decisions made is often only given with a delay or in aggregated form, for example during a control visit to the doctor after a certain time. Reinforcement learning (RL) is about optimizing processes that can be described as a feedback control loop. The application of RL to the individualization of clinical pathways is therefore particularly well suited and promising.

The personalization of a clinical pathway is about identifying the optimal combination of activities and treatments in rehabilitation for an individual patient.
In this context optimality can be considered from different viewpoints. On the one hand, we see the fundamental objective of proposing rehabilitation measures that are safe from a medical perspective. On the other hand, we aim to support the recovery process in the best possible way by exploring alternative rehabilitation activities. While there are generic templates for different medical diagnoses that are safe, there is the need to go beyond them and adapt the clinical pathway in order to provide tailored care plans for individual patients.

In order to address the objectives described above for clinical pathway recommender systems, we present a safety-aware reinforcement learning approach. On a conceptual level this means that we have a state s_t of our patient and our agent proposes an action a_t, a rehabilitation measure, for our patient at time t (Figure 1). The agent receives a reward r_{t+1} based on the change in the condition of our patient s_{t+1}. While classical RL is based on trial and error, the healthcare application must guarantee the safety of the patient during the proposed activity. Imitation learning is one of the ways in which this is pursued. Here the agent is trained to "imitate" an expert's actions, i.e., to suggest a treatment activity similar to the one a doctor would choose when faced with the same patient profile. Current work in imitation learning [11, 12] focuses on efficiently learning from demonstrations without paying special attention to safety or exploration. Research that identified safety as an objective in imitation learning [18] based its concept on staying as close as possible to the demonstrated examples. However, it is by no means guaranteed that the doctor's suggestion is optimal for the rehabilitation of the individual patient.

Challenges:
• How can we emphasize the importance of safety when suggesting rehabilitation treatments to a reinforcement learning agent?
• How can an agent explore the individual optimum and still remain within a safe and medically acceptable action space?

In answering these questions within this study, we make the following contributions:
• a conceptual approach to extract safety-relevant behavior from expert demonstrations
• an adapted conceptual method for imitation learning that emphasizes safety-critical thinking
• an implementation, application and preliminary evaluation of the concepts

Paper outline: After we position our work within the related scientific work in Section 2, we introduce the conceptual background of our approach in Section 3.1. While in Section 3.2 we present the novel concepts of our approach, in Section 3.3 we focus on their explicit application to optimize clinical pathways. In Section 4 we outline our evaluation method, and we discuss the results achieved in Section 5. Our work is then completed by a conclusion (Section 6) and an outlook on future work in Section 7.

2 RELATED WORK
Our work covers various areas of health care and machine learning, which we examine in greater detail below.

Research has shown an increasing interest in applying machine learning techniques to health-care-related tasks. From modelling disease progression [2] to automated clinical prognostics [1], methods of artificial intelligence have proven to be promising. In further applications, algorithms are used to annotate medical images and support doctors' decision-making in a human-ML collaborative way [9]. Overall, decisive questions are emerging for the use of machine learning in the health sector. The decision of a system must be validated and made comprehensible. Only if the physician can be sure that the outcome of a machine learning algorithm is understandable and, above all, guarantees the safety of the patient, can such systems prevail in the long term [16].

Pathway and treatment: Bica et al. [6] introduced Counterfactual Recurrent Networks to estimate treatment effects by modelling the time-dependent impact of treatments on covariates based on the patient's clinical history. Besides these topic-related areas, various conceptual fields of machine learning are relevant to our approach.

Imitation learning is about training an agent to mimic the behaviour of an expert. Approaches such as inverse RL, e.g. GAIL (Generative Adversarial Imitation Learning), have recently achieved remarkable success [12]. Beyond this, we have seen approaches that attempt to reconstruct the expert's objective by evaluating hypothetical behaviour of an agent [22]. Further adaptations of imitation learning are concerned with incorporating examples of an expert during the active learning process [4]. However, these approaches have neither been adapted to learn from sub-optimal examples nor do they emphasise safety-relevant aspects.

Constrained RL: First considerations about setting boundaries on the exploration of a reinforcement learning agent go back to Constrained Markov Decision Processes [3]. Recent work applied constraints in the form of predefined threshold values in continuous action spaces by adding a safety layer that corrects the suggestion of the policy network in case of a constraint violation [10]. Unlike our approach, these concepts are based on pre-defined limits that are not deduced from examples and do not learn from experts.

Safety RL: We have seen approaches that measure the similarity between the novice's and the expert's choice of action to prevent the agent from suggesting unsafe actions by considering the state distribution [19] or the disagreement between multiple agents [7]. Follow-up research considered quantifying policy uncertainty to model the risk of exploration [20]. Lee et al. [15] proposed end-to-end imitation learning where safety is addressed by evaluating the uncertainty of a Bayesian convolutional network. Yet again, no approach has been adapted to differentiate existing demonstrations and adopt safety-relevant behaviour in a targeted manner.

Multi-criteria: Laroche et al. [14] introduced Multi-Advisor RL, where n advisors are specialized on sub-tasks of the problem and an aggregator is used to derive a global policy based on the individual recommendations. While the safety of an RL problem can be described as a multi-criteria problem, the question remains how such an approach can guarantee compliance with safety constraints and foster exploration within these limits.

3 OUR APPROACH
Contrary to previous imitation learning techniques, our approach focuses on avoiding unsafe states while still exploring safe states to find the optimum. We teach the agent to handle safety-critical states by imitating expert actions in similar situations. In safe states, however, the agent does not need to stick exactly to the behavior observed in the expert demonstrations. In fact, we encourage it to search for the best personalized clinical path possible by exploration. While current safety RL algorithms [14, 19] focus on choosing actions that converge to the median of the expert demonstrations, which is often not the optimum, our approach aims at encouraging the agent to explore the state space while staying inside safe boundaries.

3.1 Formal Description
At each step t the agent selects an action a_t ∈ A(s) based on the received representation of the environment state s_t ∈ S. In a health recommender system, actions and states correspond to recommended therapy activities and the patient's clinical state, respectively. The agent receives a reward r_{t+1}, which quantifies the development of the clinical condition and personal well-being of the patient, and a new state s_{t+1} of the patient as a consequence of its action. π_t(a|s) is the agent's policy, which assigns a probability to each action in a given state and chooses the most promising one. This part is to be trained during exploration or, in the case of imitation learning, during the expert observation. Since the new state serves as the input for the next iteration, the agent keeps interacting with the environment and creates a trajectory τ = (s_t, a_t), t ∈ [t_0, t_h], where s_t is the state and a_t the action at a given time, and t runs from the start time t_0 to the time of termination t_h. The trajectory for an individual patient directly relates to the configured pathway (actions a_t map to the parameterised treatment activities foreseen in the clinical pathway) and the observed reaction of the patient (states s_t). The objective function, denoted as J(π), is

    J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)], \qquad R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t    (1)

where γ ∈ [0, 1] is a discount factor.
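To make this notation concrete, the following minimal Python sketch (our own illustration, not taken from the implementation evaluated later) collects a trajectory τ = (s_t, a_t) through the classic OpenAI Gym interface [8] and evaluates the discounted return of equation (1); the policy callable and the environment are placeholders.

    import gym

    def rollout(env, policy, max_steps=1000):
        """Collect one trajectory tau = [(s_t, a_t), ...] together with the rewards r_t.
        Assumes the classic Gym API (env.reset() -> obs, env.step() -> 4-tuple)."""
        trajectory, rewards = [], []
        state = env.reset()
        for _ in range(max_steps):
            action = policy(state)                      # pi(a|s), here a plain callable
            next_state, reward, done, _ = env.step(action)
            trajectory.append((state, action))
            rewards.append(reward)
            state = next_state
            if done:
                break
        return trajectory, rewards

    def discounted_return(rewards, gamma=0.99):
        """R(tau) = sum_t gamma^t * r_t, cf. equation (1)."""
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    # Hypothetical usage: env = gym.make("CarRacing-v0"); tau, rs = rollout(env, expert_policy)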
As we are dealing with the complex task of adapting clinical pathways, the modelling of several objectives and constraints gains in importance.

Constraints: Constrained Markov Decision Processes (CMDPs) [3] limit the set of policies to a subset Π_C ⊂ Π that fulfils a set of constraints C such that

    \Pi_C = \{ \pi : J_{c_i}(\pi) \le d_i \;\; \forall i = 1, \dots, k \}    (2)

    J_{c_i}(\pi) = \mathbb{E}_{\tau \sim \pi}[ c_i(\tau) ]    (3)

J_{c_i} is the estimate of the expected value of a cost function c_i over the space of trajectories generated by the policy π. The resulting space of allowed policies only includes policies that do not exceed a defined limit d_i ∈ R for all of the defined cost functions.

3.2 Safety Imitation
Focusing on modelling the safety of a reinforcement learning agent, we define c_safety, for brevity c_s, to approximate the safety of a given state s_t. The flexibility of the approach provides the possibility to differentiate safety along several dimensions or to describe it as a holistic unit. In the case of imitation learning from sub-optimal but safe demonstrations, we calculate the threshold value d_s over the distribution of expert trajectories, such that d_s = max J_{c_s}(π_exp) as observed in the expert demonstrations T_exp. Evaluating the received expert trajectories, we can now quantify how critical the different states were in terms of safety by defining

    T_{exp}^{\epsilon} = \{ (s_t, a_t) : s_t \in T_{exp} \wedge J_{c_s}(s_t) \ge d_s - \epsilon \}    (4)

By focusing on the subset T_exp^ε to train our agent, we can assure that it knows how to handle critical situations while preserving the freedom of exploring safe states.

The collected demonstration data set is then weighted in such a way that the training data set for imitation learning consists of safety-relevant trajectories (T_exp^ε) to a defined extent, mixed with randomly sampled trajectories from T_exp. Throughout this paper we will refer to this weighting as the safety focus α ∈ [0, 1].

    T_{exp}^{train} = \{ \alpha \cdot (s_t, a_t) \subset T_{exp}^{\epsilon} \} \cup \{ (1-\alpha) \cdot (s_t, a_t) \subset T_{exp} \setminus T_{exp}^{\epsilon} \}    (5)

It is essential to highlight that the data set used for training the agent is not extended by additional information such as a safety factor; rather, a subset of the demonstrations is deliberately chosen for the training. During imitation learning the agent is not told at any time whether the state-action pair currently presented to it in the context of supervised learning is a safety-relevant example. The approach changes solely the composition of the training data set.
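A minimal Python sketch of equations (4) and (5), under our own assumptions (a per-state safety cost callable named safety_cost and sampling with replacement), illustrates how the safety-critical subset T_exp^ε is extracted and how the safety focus α determines the composition of the training set; none of these helper names come from the paper's implementation.

    import random

    def split_by_safety(demos, safety_cost, epsilon):
        """Equation (4): a pair (s_t, a_t) is safety-critical if its cost lies within
        epsilon of the worst value d_s observed in the expert demonstrations."""
        d_s = max(safety_cost(s) for s, _ in demos)      # d_s = max J_cs over T_exp
        critical = [(s, a) for s, a in demos if safety_cost(s) >= d_s - epsilon]
        uncritical = [(s, a) for s, a in demos if safety_cost(s) < d_s - epsilon]
        return critical, uncritical

    def compose_training_set(demos, safety_cost, epsilon, alpha, n_pairs):
        """Equation (5): a share alpha of the training pairs is drawn from the
        safety-critical subset, the remainder from the other demonstrations."""
        critical, uncritical = split_by_safety(demos, safety_cost, epsilon)
        n_critical = int(alpha * n_pairs)
        return (random.choices(critical, k=n_critical) +
                random.choices(uncritical, k=n_pairs - n_critical))

    # Hypothetical usage: T_train = compose_training_set(T_exp, c_s, epsilon=5, alpha=0.5, n_pairs=4692)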
3.3 Implications for Health Recommender Systems
Our approach allows us to learn effectively from demonstrations that guarantee a safe state of the environment, or respectively of the patient, even if the demonstrated actions do not always show the optimal reaction to the state. Furthermore, we aim to train a reinforcement learning agent with weighted expert demonstrations and thereby put safety, or other evaluation criteria, in the foreground. Applying the approach to training a reinforcement learning agent that suggests and parameterizes treatment activities in a clinical pathway, we train the agent to explore the optimal recommendation while imitating expert recommendations when facing critical states, as described by the constraint cost functions.

If the formal description is applied to the healthcare application, the cost function evaluates the clinical condition of the patient. In concrete terms, one could, for example, evaluate the deviation of the measured pulse from the resting or optimal pulse. Looking at the expert's demonstrations, i.e. any number of pairs of the patient's condition and the proposed therapy measure, one can evaluate the cost function for each demonstration, i.e. the safety assessment of the patient's clinical condition. It is crucial that the costs are not per se included in the objective function but are used as restrictions. As a result, an increased heart rate is not interpreted as negative by our recommender; instead, we take care in the decision-making process that the safety of this attribute stays within certain limits.

We are therefore aware that the heart rate deviates during a therapeutic measure and that this is one of the undesirable effects. But we want to make sure that the proposals of our intelligent system remain within the limits of the experts' opinions. So if we see in the trajectories that the safety costs stay below a certain level, we want to make sure that our proposals do not exceed this limit. To learn this, the demonstrations where the patient's condition was particularly close to the observed limit are particularly relevant. In our approach we define a subset of trajectories T_exp^ε that lie within a defined distance ε of the critical limit. From this subset we know that it is particularly relevant for learning how to avert critical states. During the training process, our intelligent system should accordingly pay special attention to adapting the expert suggestions close to the critical states.

4 EVALUATION
Although the concept presented here was developed out of the motivation to individualize clinical pathways for patients, it can be applied to various applications of reinforcement learning. For this reason, and because clinical data was not available to the extent necessary for an analysis, the evaluation is based on common and comparable safety benchmark problems. We use the Gym environment provided by OpenAI.

4.1 Gym Environment
The Gym environment offers the possibility to run different task scenarios for reinforcement learning agents and to extend the provided framework. Especially Atari games and two-dimensional games such as car racing are very popular and provide an excellent baseline to compare results. Due to the parallels between the car-racing environment and a recommender in the health care sector, this environment is particularly suitable to demonstrate the functionality of our approach. The car on the race track represents the condition of the patient, which changes depending on the action: steering and accelerating, or the parameterization of the next treatment measure. The more critically the condition of the patient, i.e. the position of the vehicle on the track, is evaluated, the more relevant it is to behave similarly to the expert demonstrations. While we have described the relevance of the heart rate in the clinical setting above, safety in this environment can be quantified with a cost function based on the distance to the edge of the track. So while in the medical case we can observe how a doctor behaves when the heart rate is particularly high or exceptionally low, in this environment we can quantify how far the vehicle is from the edge of the track.
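As an illustration of such a distance-based safety quantification, the following sketch estimates the distance to the unsafe area directly from a raw frame. It rests entirely on our own assumptions (a 96x96x3 RGB observation, the car at a roughly fixed pixel position, and grass recognised by a dominant green channel); the exact heuristics used in the evaluation are not specified in the paper.

    import numpy as np

    def is_unsafe(pixel, green_margin=40):
        """Assumed heuristic: a pixel counts as grass if its green channel clearly dominates."""
        r, g, b = int(pixel[0]), int(pixel[1]), int(pixel[2])
        return g - max(r, b) > green_margin

    def distance_to_unsafe(frame, start, direction, max_dist=30):
        """Walk from the assumed car position in one direction until grass is reached."""
        y, x = start
        dy, dx = direction
        for d in range(1, max_dist):
            py, px = y + d * dy, x + d * dx
            if not (0 <= py < frame.shape[0] and 0 <= px < frame.shape[1]):
                return d
            if is_unsafe(frame[py, px]):
                return d
        return max_dist

    def safety_values(frame, car_pos=(70, 48)):
        """Distances in the directions left, front and right; a larger distance means a
        safer state (a cost convention can be obtained by negating the values)."""
        return {"left": distance_to_unsafe(frame, car_pos, (0, -1)),
                "front": distance_to_unsafe(frame, car_pos, (-1, 0)),
                "right": distance_to_unsafe(frame, car_pos, (0, 1))}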
[Figure 2: Safety-critical and uncritical states in the evaluation environment]

To learn how to deal with critical conditions, we then look at the demonstrations where the assessment of the condition was particularly critical, as described in the formal description. The subset used for imitation learning is selected from those demonstrations based on equation 5.

4.2 Experiment Set Up
In the following we describe the dimensions and parameters used for our evaluation in more detail.

Demonstrations: All experiments were carried out on the same demonstration data set of size |T_exp| = 4692 state-action pairs (s_t, a_t); for further detail see Appendix A. Imitation learning was performed as supervised learning of a TensorFlow model with the same architecture for every experiment. The agent was trained for 2000 batches of (s_t, a_t) pairs.

Cost function: In our evaluation we consider three cost functions that quantify the car's position in the environment. Since the car's state represents a patient's clinical state, these can be seen as three different clinical parameters that are monitored during expert training. The cost functions quantify the car's position by evaluating the game frame received as the state representation. In the three directions left, front and right we calculate the distance to the unsafe state, the green area beside the road. Evaluating these three cost functions for each state observed during the demonstrations, we develop a representation of the states' safety. The parameter ε, which indicates how early a state should be classified as safety-relevant, is set to ε = 5. To calculate it, we move from the edge of the distribution of expert examples in the dimension of a constraint, in this case safety, towards the centre of the distribution. Visually, this parameter defines how wide the edge of the distribution is that is classified as safety-critical, as shown in Figure 3.

[Figure 3: Distribution of safety in the demonstrations]

Agent testing: After the weights of the reinforcement learning agent have been trained via imitation learning, the agent is evaluated in a newly generated Gym environment. Here we observe the agent for two whole episodes to collect information about its performance and its safety. Depending on the individual performance of the agent, this relates to ≈ 2000 state-action pairs.
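The imitation-learning step itself is plain behaviour cloning. The following tf.keras sketch is only a schematic stand-in for the TensorFlow model mentioned above: the architecture, the tanh output for a three-dimensional action (steering, gas, brake) and the batch size are our own assumptions, as the paper does not report them.

    import numpy as np
    import tensorflow as tf

    def build_policy():
        """Small convolutional policy network mapping a 96x96x3 frame to an action."""
        return tf.keras.Sequential([
            tf.keras.layers.Conv2D(16, 5, strides=2, activation="relu", input_shape=(96, 96, 3)),
            tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(3, activation="tanh"),   # steering, gas, brake (simplified)
        ])

    def behaviour_cloning(train_pairs, batches=2000, batch_size=32):
        """Supervised learning on the alpha-weighted state-action pairs of equation (5)."""
        model = build_policy()
        model.compile(optimizer="adam", loss="mse")
        states = np.array([s for s, _ in train_pairs], dtype=np.float32) / 255.0
        actions = np.array([a for _, a in train_pairs], dtype=np.float32)
        for _ in range(batches):
            idx = np.random.randint(0, len(states), size=batch_size)
            model.train_on_batch(states[idx], actions[idx])
        return model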
5 PRELIMINARY RESULTS
In the following we present the preliminary results of applying our approach to the safety-critical decision process described in Section 4.2. Different values for α in equation 5 have shown a significant influence on the performance of the agent with respect to safety as well as reward, as shown in Table 1.

Table 1: Preliminary results for different safety focus values

    safety focus α | safety mean | safety std | reward mean
    0.0            | 13.05       | 16.41      | 139.50
    0.1            | 17.30       | 12.87      | 228.88
    0.5            | 21.37       | 11.12      | 697.41
    0.8            | 20.94       | 14.08      | 549.51

The results show that the safety focus has a significant impact on the agent's performance. The agent trained with the unweighted expert demonstrations achieves an average safety rating of 13.05 for its proposals, and the variation in safety over the ≈ 2000 state-action pairs of 16.41 should be noted. The approach of pre-selecting and weighting the demonstrations based on the distribution of the cost function shows a positive impact. The safety evaluation of the conditions caused by the agent can be raised to a level of 17.30 by a weighting of α = 0.1, and with a weighting of α = 0.5 it reaches a value of 21.37. In addition, the weighting of the expert trajectories in these cases also leads to a more robust reinforcement learning agent, which is reflected in the standard deviation of the safety. To make the presented results more comprehensible, Figure 4 provides a visualization: training the agent with different safety focus values α results in the safety and reward shown on the y-axis, with the standard deviation represented by the dot size.

[Figure 4: Impact of the safety focus on episode reward and safety]

Taking a closer look at the cost function values of two agents, one trained without safety focus (Figure 5) and one trained with a safety focus of α = 0.8 (Figure 6), shows that emphasising the safety-critical trajectories T_exp^ε in the expert demonstrations can significantly raise the safety of the actions recommended by the agent. While the performance of the agents is already reflected in the values listed in Table 1, the reasons for it can be identified in Figures 5 and 6.

[Figure 5: No safety focus]

[Figure 6: Safety function with 0.8 safety focus]

The runs without safety focus were not able to keep a sufficient distance from the critical states, while the safety-focus runs successfully learned to avert critical states in the manner of the expert's reactions. While the agent without safety focus was not able to learn the correct handling of safety-critical conditions during imitation learning, our approach was successful in adopting the expert's handling of critical states. By pre-selecting the expert examples, without providing any further information during the training process, the agent with safety focus was able to avert safety-critical conditions similarly to the expert's behaviour.

6 CONCLUSION
The motivation for this work is derived from the medical context, in which the objective is to adapt clinical pathways to a patient's needs in the best possible way. While this scenario can be aptly described as a reinforcement learning problem, as discussed in the introduction, it is important to limit the exploration, and thus the parameterisation of therapies and activities, to a range of action that is safe from a medical point of view. Imitation learning offers a suitable way to imitate the behaviour of experts. However, two central questions arose. Firstly, how can an agent imitating an expert concentrate on learning safety-relevant actions? Secondly, can an agent be given the opportunity to explore the optimum within the action space while still maintaining a focus on safety?

To answer these questions, we have developed an approach that learns from expert demonstrations and concentrates on adopting the safety-relevant behaviour of the expert by appropriately weighting the examples provided. Our approach defines two parameters that determine how to deal with the state-action pairs observed among experts. On the one hand, we have the parameter ε, which indicates how early a state should be classified as safety-relevant. On the other hand, we have the safety focus α, which forces the agent to train on a subset of expert trajectories in which a share α of the examples is classified as safety-relevant under a given value of ε.

Our approach to imitation learning was able to outperform equivalent agents trained on balanced demonstrations with regard to safety as well as reward. The generic conceptual approach underlying this work can be applied to a wide range of RL tasks. It is especially relevant for domains where expert knowledge is available that defines how one should behave to be safe, but where it is not certain what exactly the optimal behaviour looks like. This is the case in the personalization of clinical pathways. While physicians can precisely advise which activities to suggest as rehabilitation under certain clinical conditions of the patient, it is not certain whether these suggestions are the optimal choice. With our approach we provide an important basis for exploring the optimum when proposing individually parameterized activities without violating the limits of the safety-relevant parameters.

7 FUTURE WORK
Besides the further exploration of the parameter combinations of ε and α, the transfer to additional RL problems is pending.
Evaluating the approach on further 2D games in the Gym environment is a logical next step. Additionally, teaching robots to safely interact with their environment is a relevant application [8]. Moreover, the approach is to be evaluated in more complex RL tasks that focus on the safety aspect, for which the recently published Safety Gym is available [21].

Future research should also consider how to completely avoid the safety-critical examples that are dealt with by experts. One possible approach to this could be the simulation of responsibilities and the evaluation of possible reactions by an expert, using human-in-the-loop approaches as feedback for the system, see [17] and [5].

ACKNOWLEDGMENTS
This work was partially supported by the project vCare: Virtual Coaching Activities for Rehabilitation in Elderly (funded by the Horizon 2020 research and innovation programme under Grant Agreement Number 769807). Special acknowledgements are directed to the partners of the project, who have contributed valuable feedback in the specification of the research problem and by providing their expertise to this study.

REFERENCES
[1] Ahmed M. Alaa and Mihaela van der Schaar. 2018. AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning. arXiv:1802.07207 [cs.LG]
[2] Ahmed M. Alaa and Mihaela van der Schaar. 2019. Attentive State-Space Modeling of Disease Progression. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 11338-11348.
[3] E. Altman. 1999. Constrained Markov Decision Processes. Chapman and Hall. https://doi.org/10.1016/0167-6377(96)00003-X
[4] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. 2017. Hindsight Experience Replay. arXiv:1707.01495 [cs.LG]
[5] Dilip Arumugam, Jun Ki Lee, Sophie Saskin, and Michael L. Littman. 2019. Deep Reinforcement Learning from Policy-Dependent Human Feedback. arXiv:1902.04257 [cs.LG]
[6] Ioana Bica, Ahmed M. Alaa, J. Brian Jordon, and Mihaela van der Schaar. 2020. Estimating Counterfactual Treatment Outcomes over Time Through Adversarially Balanced Representations. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020).
[7] Kiante Brantley, Wen Sun, and Mikael Henaff. 2020. Disagreement-Regularized Imitation Learning. In International Conference on Learning Representations. https://openreview.net/forum?id=rkgbYyHtwB
[8] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv:1606.01540
[9] Carrie J. Cai, Emily Reif, Narayan Hegde, Jason Hipp, Been Kim, Daniel Smilkov, Martin Wattenberg, Fernanda Viegas, Greg S. Corrado, Martin C. Stumpe, and Michael Terry. 2019. Human-Centered Tools for Coping with Imperfect Algorithms during Medical Decision-Making. arXiv:1902.02960 [cs.HC]
[10] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerík, Todd Hester, Cosmin Paduraru, and Yuval Tassa. 2018. Safe Exploration in Continuous Action Spaces. arXiv:1801.08757
[11] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. 2017. One-Shot Visual Imitation Learning via Meta-Learning. arXiv:1709.04905
[12] Jonathan Ho and Stefano Ermon. 2016. Generative Adversarial Imitation Learning. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 4565-4573.
[13] Leigh Kinsman, Thomas Rotter, Erica James, Pamela Snow, and Jon Willis. 2010. What is a clinical pathway? Development of a definition to inform the debate. BMC Medicine (2010).
[14] Romain Laroche, Mehdi Fatemi, Joshua Romoff, and Harm van Seijen. 2017. Multi-Advisor Reinforcement Learning. arXiv:1704.00756
[15] Keuntaek Lee, Kamil Saigol, and Evangelos A. Theodorou. 2018. Safe End-to-End Imitation Learning for Model Predictive Control. arXiv:1803.10231
[16] Zachary C. Lipton. 2017. The Doctor Just Won't Accept That! arXiv:1711.08037 [stat.ML]
[17] James MacGlashan, Mark K. Ho, Robert Loftin, Bei Peng, Guan Wang, David L. Roberts, Matthew E. Taylor, and Michael L. Littman. 2017. Interactive Learning from Policy-Dependent Human Feedback. In Proceedings of the 34th International Conference on Machine Learning (PMLR Vol. 70). 2285-2294.
[18] Kunal Menda, Katherine Rose Driggs-Campbell, and Mykel J. Kochenderfer. 2017. DropoutDAgger: A Bayesian Approach to Safe Imitation Learning. arXiv:1709.06166
[19] Kunal Menda, Katherine Rose Driggs-Campbell, and Mykel J. Kochenderfer. 2017. DropoutDAgger: A Bayesian Approach to Safe Imitation Learning. arXiv:1709.06166
[20] Kunal Menda, Katherine Rose Driggs-Campbell, and Mykel J. Kochenderfer. 2018. EnsembleDAgger: A Bayesian Approach to Safe Imitation Learning. arXiv:1807.08364
[21] Alex Ray, Joshua Achiam, and Dario Amodei. 2019. Benchmarking Safe Exploration in Deep Reinforcement Learning. (2019).
[22] Siddharth Reddy, Anca D. Dragan, Sergey Levine, Shane Legg, and Jan Leike. 2019. Learning Human Objectives by Evaluating Hypothetical Behavior. arXiv:1912.05652 [cs.CY]

A INSIGHT ON EXPERT DEMONSTRATIONS
In the following we show the cost functions calculated for the expert demonstrations. In Figure 7 we see the two cost functions calculating the safety for left and right.

[Figure 7: Demonstration safety function, left and right]

In addition, we evaluated the safety cost function in the dimension straight, as shown in Figure 8.
[Figure 8: Demonstration safety function, straight]

B ABLATION STUDY
In the following we provide further insights into the agent's performance when trained with different levels of the safety focus α.

Safety Focus 0.0: To complete the report on the performance of the agent trained with no safety focus, in addition to Figure 5 we provide the cost function referring to the safety evaluation front. Training the agent with α = 0.0 results in the cost function to the front shown in Figure 9.

[Figure 9: Safety function front, no safety focus]

Safety Focus 0.1: Training the agent with a safety focus of 0.1 results in the cost functions shown below. The safety estimation for the side cost functions is shown in Figure 10 and for the front in Figure 11, respectively.

[Figure 10: Safety function side, 0.1 safety focus]

[Figure 11: Safety function front, 0.1 safety focus]

Safety Focus 0.5: Training the agent with a safety focus of 0.5 results in the safety functions shown in Figure 12 for the side safety estimation and in Figure 13 for the c_front safety. A safety focus of 0.5 not only emphasises behavior that returns from safety-critical states with respect to the left and right safety constraints but also with respect to the front safety.

[Figure 12: Safety function side, 0.5 safety focus]

[Figure 13: Safety function front, 0.5 safety focus]

Safety Focus 0.8: In addition to the safety function values for left and right shown in Figure 6, we provide the cost function for front. Training the agent with a safety focus of α = 0.8 results in the cost function to the front shown in Figure 14.

[Figure 14: Cost function front, 0.8 safety focus]