=Paper=
{{Paper
|id=Vol-3087/paper_14
|storemode=property
|title=Safety Aware Reinforcement Learning by Identifying Comprehensible Constraints in Expert Demonstrations
|pdfUrl=https://ceur-ws.org/Vol-3087/paper_14.pdf
|volume=Vol-3087
|authors=Leopold Müller,Lars Böcking,Michael Färber
|dblpUrl=https://dblp.org/rec/conf/aaai/MullerBF22
}}
==Safety Aware Reinforcement Learning by Identifying Comprehensible Constraints in Expert Demonstrations==
Safety Aware Reinforcement Learning by Identifying Comprehensible Constraints in Expert Demonstrations

Leopold Müller (Karlsruhe Institute of Technology, leopold.mueller@student.kit.edu), Lars Böcking (Karlsruhe Institute of Technology, lars.boecking@student.kit.edu), Michael Färber (Karlsruhe Institute of Technology, michael.faerber@kit.edu)

Copyright © 2022, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===

When used in real-world environments, agents must meet high safety requirements, as errors have direct consequences. Besides the safety aspect, the explainability of the systems is of particular importance. Therefore, not only should errors be avoided during the learning process, but the decision process should also be made transparent. Existing approaches are limited to solving a single one of these problems; for real-world use, however, several criteria must be fulfilled at the same time. In this paper we derive comprehensible rules from expert demonstrations which can be used to monitor the agent. The developed approach uses state-of-the-art classification and regression trees for deriving safety rules, combined with concepts from the field of association rule mining. The result is a compact and comprehensible rule set that explains the expert's behavior and ensures safety. We evaluate our framework in common OpenAI environments. Results show that the elaborated approach is able to identify safety-relevant rules and imitate expert behavior, especially in edge cases. Evaluations on higher-dimensional observation spaces and continuous action spaces highlight the transferability of the approach to new tasks while maintaining the compactness and comprehensibility of the rule set. The code is available at https://github.com/leopoldmueller/safety-aware-reinforcement-learning.

===Introduction===

In real applications, such as autonomous vehicles, the explainability of the systems' decisions is highly relevant. When the algorithm is in charge, we want to know what it is doing and why it is doing it. The explainability of the systems is relevant not only for the legal framework but also for social acceptance. Reinforcement learning approaches based on deep learning achieve excellent results in terms of their target function, such as reward, but do not offer explainability and traceability. Association rule mining is concerned with identifying patterns in data and formalizing them for reproducibility and explainability. Existing approaches are mainly concerned with modeling the complete context in the data, which leads to comprehensibility problems. The approach presented here combines these research areas in a new way, with a special focus on application.

In the area of reinforcement learning, there are various approaches to increase the safety of the agent. However, these are limited either to modifying the optimization criterion or to restricting the exploration process. First approaches that draw on existing knowledge in the form of expert demonstrations focus mainly on accelerated convergence via so-called warm starting. So far, there is a lack of approaches that specifically extract existing knowledge from previously unused data and integrate this knowledge into the learning process.

This work attempts to make use of existing expert demonstrations in two ways. Firstly, it tries to make the decision-making process of the systems more transparent by identifying human-understandable rules. Secondly, the rules found are used to monitor the system: the system is prohibited from violating the given safety rules.

The combination of state-of-the-art classification and regression trees with association rule mining represents a novelty in the field of safe reinforcement learning and opens up new possibilities to transfer the algorithms to new application areas. In critical domains such as medicine or transportation, the presented framework fulfills key requirements in terms of comprehensibility and transparency. The key contributions of this work are:
* applying association rule mining to RL demonstrations,
* combining classification and regression trees with association rule mining,
* a framework for deriving a comprehensible set of rules, and
* integrating a safety layer into the learning process.
===Related Work===

Modification of the optimization criterion. In the simplest case, the optimization criterion is the risk-neutral criterion. Trust Region Policy Optimization (TRPO) is a further development of policy gradient methods and is used to optimize strategies in the form of neural networks (Schulman et al. 2015). The change of parameters must not exceed a step size called the trust region, which restricts the set of possible strategies. A further development of TRPO is the Proximal Policy Optimization (PPO) algorithm presented in (Schulman et al. 2017). While the process of strategy discovery by these two algorithms is significantly more effective and goal-oriented than a stochastic learning algorithm, they rely on the assumption of ergodicity (Moldovan and Abbeel 2012). However, this does not hold in reality: fast and effective strategy discovery does not protect against errors or irreversible states. The approach of this work does not try to replace learning algorithms like TRPO or PPO; rather, it tries to complement them. Thus, we use the PPO algorithm in combination with the approach of this paper.

Modification of the exploration process. To avoid undesirable states, these must be marked as undesirable without visiting them. This is impossible without external knowledge, since otherwise they would have to be visited during the random exploration process (García and Fernández 2015). In general, external knowledge can be used in two different ways: (1) derive a strategy from a set of trajectories, or (2) guide the exploration process by recommendations from a teacher (García and Fernández 2015). The first way subjects the agent to a kind of initialisation procedure based on prior knowledge about the task: the RL agent learns a strategy from expert demonstrations and thereby imitates the behaviour that the expert demonstrated. In the literature, this approach is referred to as imitation learning (Hussein et al. 2017). Another approach to imitation learning is inverse RL (Finn, Levine, and Abbeel 2016): instead of the strategy, a reward signal is learned from the state-action pairs. Similarly, an approach is possible where the agent is guided by an expert when the chosen action falls below a safety value. In (Menda, Driggs-Campbell, and Kochenderfer 2019) an approach is presented in which a trained agent takes over the role of the expert. However, in many application areas, no agent exists that can take on this role.

Explainability of reinforcement learning systems. Explainability is often considered a trade-off against performance (Puiutta and Veith 2020; Longo et al. 2020). In (Bastani, Pu, and Solar-Lezama 2018) an agent is deployed in an environment and its behaviour is recorded; based on the demonstrations, a decision tree is then trained and used to verify the agent's strategy. The most common group of algorithms in the field of association rule mining are Apriori algorithms, which are designed to identify rules that have a minimum level of confidence and support. Decision tree algorithms like ID3 were continuously improved, resulting in C4.5, which removed the restriction that features must be categorical, and in memory-optimized versions like C5.0. The most recent advancements were achieved by CART (Classification and Regression Trees), which constructs binary trees. There are existing approaches that combine association rule mining with other fields of application, such as integrating classification and association rule mining (Liu, Hsu, and Ma 1998). Further research deals with online learning, where association rules are derived from a continuous stream of information (Hidber 1999). Combined with reinforcement learning, there are first applications of association rule mining to solve, for example, multi-agent environments (Kaya and Alhajj 2005).
===Approach===

This paper attempts to make reinforcement learning safer. For this purpose, safety rules are derived from existing data sets. On the one hand, these serve to increase the transparency of the agent's decision-making; on the other hand, they can be used to monitor the agent.

The starting point for this approach are expert demonstrations. It is expected that for the task to be solved there is already an instance of some kind that can interact with the environment. This instance is called an expert. In the case of autonomous driving, a human driver could control the vehicle. Figure 1 summarises the complete systematics of the approach.

(Figure 1: Systematics of the approach of this paper.)

The derivation of the safety rules and their integration, in the form of a safety layer, form the basis of the approach and are explained below.

====Derive Safety Rules====

The starting point for deriving safety rules are the expert demonstrations (see Figure 1). They consist of a finite set of trajectories containing a sequence of states and actions. This behaviour is assumed to be safe.

In the first step, the CART algorithm is applied to the expert demonstrations. This algorithm creates a decision tree based on the states and the actions that follow. The structure of the decision tree provides the basis for the decision rules, which are interpreted as safety rules. The decision tree classifies states according to their characteristics; by querying the edges, the states are grouped according to the same actions. It is first assumed that a finite set of actions, such as "braking" and "accelerating", is available for selection, i.e. the actions are discrete. In the next step, the paths of the tree are converted into decision rules and summarised as a rule set.

The result of the framework is a rule set consisting of all decision rules, i.e. all paths of the decision tree. The length and number of rules are determined by the depth of the tree; therefore, the rule set can become arbitrarily large depending on the complexity and amount of labelled data. To make the decision process comprehensible, the rule set should be kept as compact as possible. This means that the size of the tree must be limited with the help of a termination criterion. One possibility is to set a maximum tree depth. In this case, the splitting of a node stops as soon as its depth corresponds to the maximum tree depth.
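To make this first step concrete, the following is a minimal sketch of fitting a depth-limited CART tree on demonstration data with scikit-learn. It is not the authors' released code; the file names and the max_depth value are illustrative assumptions.

<pre>
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical demonstration data: one row per time step.
# states: (n_steps, n_features), actions: (n_steps,) with discrete action ids.
states = np.load("expert_states.npy")    # assumed file layout, for illustration
actions = np.load("expert_actions.npy")

# CART tree over (state -> expert action); max_depth acts as the simple
# termination criterion that keeps the rule set compact.
tree = DecisionTreeClassifier(criterion="gini", max_depth=5)
tree.fit(states, actions)

# Every root-to-leaf path of this tree is a candidate safety rule.
print(tree.get_n_leaves(), "candidate rules")
</pre>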
If, for example, the algorithm stops at a leaf node that specifies "braking" for one half of the states and "accelerating" for the other, the rule found is not a suitable candidate for a safety rule. A safety rule should specify an action that is as unambiguous as possible, so other factors must be taken into account. An alternative is to use the Gini coefficient, which quantifies the equality of all selected actions. Applying this criterion in larger action spaces, however, means that rules in which only a small subset of the possible actions is used can also be included in the rule set due to their low Gini coefficient. This distribution-based criterion therefore does not ensure unambiguous actions.

To achieve this, metrics from the area of association rules are considered. Decision rules are understood as association rules of the form X ⇒ e_k. The queries of the characteristics of states s_t (referred to as x_t in the decision tree) along the edges of a path represent X. The action a_t (represented in the decision tree by y) to which most states in the leaf node of the path are assigned forms the consequence e_k of X. Now the rule set can be filtered for relevant rules. For this purpose, minimum values for support and confidence are set, and rules that do not meet the minimum values are removed from the rule set. The support describes the statistical relevance of a rule: a rule with a support of 0.1 applies to 10% of all states. The confidence of a rule indicates its uniqueness: a rule with a confidence of 0.8 means that in a state where the rule applies, the action specified by the rule is performed in 80% of the cases. The use of support and confidence as termination criteria leads to a rule set in which all rules have both statistical relevance and a safe action instruction determined by the confidence.

Algorithm 1 describes the framework for discrete action spaces. After filtering the paths of the decision tree using the hyperparameters supp_min and conf_min, redundant rules are removed in line 20 and the remaining paths are converted into safety rules in line 22. For this purpose, all queries of the decision nodes are linked via logical ANDs, starting from the root of the tree down to the currently considered leaf. The most frequently selected action in the leaf is assigned to the rule.

<pre>
Algorithm 1: Constraints Identifier
Input: set of trajectories Ω
Parameter: supp_min, conf_min
Output: set of rules C
 1: X = list of states in Ω
 2: Y = list of actions in Ω
 3: C = create empty list of rules
 4: filtered_paths = create empty list of filtered paths
 5: tree = DecisionTreeClassifier(X, Y)
 6: for path in tree do
 7:   for node in path do
 8:     if supp(node) > supp_min and conf(node) > conf_min then
 9:       cut off subsequent nodes
10:       append shortened path to filtered_paths
11:     else
12:       if node is leaf then
13:         go to next path in tree
14:       else
15:         go to next node in path
16:       end if
17:     end if
18:   end for
19: end for
20: filtered_paths = make filtered_paths unique
21: for path in filtered_paths do
22:   c = ConvertPathIntoRule(path)
23:   append c to C
24: end for
25: return set of rules C
</pre>

Adaptations for continuous action spaces. Two cases are differentiated: (1) there are two discrete actions, "braking" and "accelerating", and (2) there are two continuous actions, "braking" and "accelerating", each of which can take a value from [0, 1]. In case (1), Algorithm 1 can be applied without adaptation: a decision tree is created for the classification of states. In case (2), we extend our framework to categorise continuous values into discrete ranges; Algorithm 1 is then applied to each action. In order to specify the granularity of the subdivision of the action space, we add the hyperparameter divider, which specifies the number of intervals to divide into.
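One possible reading of Algorithm 1 on top of scikit-learn's fitted tree structure is sketched below. The Rule container and the function name extract_rules are assumptions for illustration; support and confidence of a node are computed from its sample counts as described above. For continuous actions, the action values would first be binned into divider intervals (for example with numpy.digitize) and one tree fitted per action dimension.

<pre>
from dataclasses import dataclass
from sklearn.tree import DecisionTreeClassifier

@dataclass
class Rule:
    conditions: list   # e.g. [(feature_idx, "<=", threshold), ...]
    action: int        # majority action in the node
    support: float
    confidence: float

def extract_rules(tree: DecisionTreeClassifier, supp_min: float, conf_min: float):
    """Walk root-to-leaf paths; keep the shortest prefix whose node already
    satisfies the support/confidence thresholds (cf. Algorithm 1)."""
    t = tree.tree_
    n_total = t.n_node_samples[0]
    rules, seen = [], set()

    def recurse(node, conditions):
        counts = t.value[node][0]                    # per-action sample counts
        support = t.n_node_samples[node] / n_total
        confidence = counts.max() / counts.sum()
        if support > supp_min and confidence > conf_min:
            key = tuple(conditions)
            if key not in seen:                      # drop redundant rules (line 20)
                seen.add(key)
                rules.append(Rule(list(conditions), int(counts.argmax()),
                                  support, confidence))
            return                                   # cut off subsequent nodes (line 9)
        if t.children_left[node] == -1:              # leaf that never met the thresholds
            return
        f, thr = t.feature[node], t.threshold[node]
        recurse(t.children_left[node],  conditions + [(f, "<=", thr)])
        recurse(t.children_right[node], conditions + [(f, ">",  thr)])

    recurse(0, [])
    return rules
</pre>

Called as extract_rules(tree, supp_min=0.005, conf_min=0.95), this would correspond to the filtered setting later listed in Table 1.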
====Integration of Safety Rules====

The final component uses the set of rules to monitor an agent during the learning and deployment process. The approach can be used both during the learning process and during the deployment process because the agent itself is not directly modified. If a chosen action violates a rule, it is adjusted accordingly. In Figure 1, this safety layer is shown in green; the edges show input and output as well as access to the rule set. Algorithm 2 depicts how the safety layer works. In the case of continuous actions, these must first be converted into discrete values in order to be checked against the safety rules in the next step.

<pre>
Algorithm 2: Safety Layer
Input: set of rules C, state s, action a
Output: action a
 1: for rule c in C do
 2:   if features of s fulfil all conditions of c then
 3:     if a == action of c then
 4:       break
 5:     else
 6:       a = action of c
 7:       break
 8:     end if
 9:   else
10:     continue {rule does not apply}
11:   end if
12: end for
13: return a
</pre>
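Algorithm 2 amounts to a thin filter between the policy and the environment. The sketch below reuses the illustrative Rule objects from the previous snippet; the helper names are assumptions, not the published implementation.

<pre>
def satisfies(rule, state) -> bool:
    """Check whether a state fulfils every condition of a rule."""
    for feature_idx, op, threshold in rule.conditions:
        value = state[feature_idx]
        if op == "<=" and not value <= threshold:
            return False
        if op == ">" and not value > threshold:
            return False
    return True

def safety_layer(rules, state, action):
    """Algorithm 2: if a rule applies and the chosen action deviates,
    replace it by the rule's action; otherwise pass the action through."""
    for rule in rules:
        if satisfies(rule, state):
            return action if action == rule.action else rule.action
    return action

# Usage inside a rollout loop (illustrative):
# action, _ = model.predict(obs)
# action = safety_layer(rules, obs, int(action))
# obs, reward, done, info = env.step(action)
</pre>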
===Evaluation===

In this section, the approach is used in different scenarios to evaluate its function and versatility. The evaluation is performed in three steps: (1) setting up a suitable test environment, (2) evaluating the framework for deriving safety rules, and (3) evaluating the safety layer.

====Setup of Test Environment====

For the evaluation, environments from OpenAI Gym (Brockman et al. 2016) are used. Expert demonstrations serve as input for the algorithm, and trained agents are used as experts. For the following evaluations, the PPO2 algorithm was used together with an MlpPolicy (Hill et al. 2018). The number of time steps t determines the number of trajectories |Ω| included in the expert demonstrations and is kept unchanged at 15,000 for all following experiments. The agent is deployed in the environment until the specified number of time steps has been recorded.
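Collecting the expert demonstrations described above could look roughly as follows with the legacy Stable Baselines and Gym APIs; the training budget and file names are assumptions, and only the 15,000 recorded time steps follow the paper's setup.

<pre>
import gym
import numpy as np
from stable_baselines import PPO2

env = gym.make("CartPole-v1")

# Train (or load) the expert with PPO2 and an MLP policy.
expert = PPO2("MlpPolicy", env, verbose=0)
expert.learn(total_timesteps=100_000)   # assumed training budget

# Record 15,000 time steps of expert behaviour as demonstrations.
states, actions = [], []
obs = env.reset()
for _ in range(15_000):
    action, _ = expert.predict(obs, deterministic=True)
    states.append(obs)
    actions.append(action)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()

np.save("expert_states.npy", np.array(states))
np.save("expert_actions.npy", np.array(actions))
</pre>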
====Evaluation of the Derivation of Safety Rules====

In this section, the framework for deriving safety rules described in Algorithm 1 is evaluated. Three different investigations are carried out.

Influence of the hyperparameters on the number of rules in a rule set. First, the influence of the minimum values of support and confidence, used as termination criteria, on the number of safety rules is investigated. For this, the framework from Algorithm 1 is applied to the created expert demonstrations of the different environments. Depending on the chosen parameters, rule sets are generated as a result. Figure 2 shows the results for the CartPole environment.

(Figure 2: Comparison of rule sets by the number of rules contained. The data is based on the CartPole-v1 environment.)

At the point (supp_min = 0, conf_min = 1), the rule set is unfiltered and thus has the highest number of rules. If the algorithm terminates before the subset of a node is unambiguously assigned, the confidence of the rule is less than one. This becomes clear when the course of the graph for supp_min = 0 is considered: the number of rules decreases continuously as the minimum value for confidence is reduced. This clearly shows that if a higher uniqueness of the rule is demanded (high confidence conf_min), the number of identified rules is lower. The same applies when a more frequent occurrence of a rule is requested (higher support supp_min).

Influence of the hyperparameters on the average length of the rules of a rule set. Comparable to the previous procedure, the influence of the hyperparameters on the average length of the safety rules is now examined. Figure 3 illustrates the result of the investigation using the CartPole environment.

(Figure 3: Comparison of rule sets according to the average length of the rules they contain, using the CartPole-v1 environment.)

At first glance, parallels to Figure 2 can be seen: here, too, the maximum lies at the point (supp_min = 0, conf_min = 1). For supp_min > 0 the maxima move towards the "middle" of the confidence scale. From these investigations it can be concluded that the two hyperparameters influence both the average length and the absolute number of rules of a rule set in the same way.

Relevance of a rule set for the behaviour of agents during the learning process. In order to draw conclusions about the relevance of the rules of a rule set, the learning process of an agent is considered. The agent alternates between a learning phase and a test phase. It is first deployed in the environment without prior knowledge and tested for 20,000 time steps. The number of times it enters a state for which a safety rule exists is recorded; in addition, it is documented how often it fulfils or violates these rules. This is followed by a learning phase in which the agent is trained for a certain number of time steps (set to 500). After completion of the learning phase, another test phase follows. Two rule sets are examined using the CartPole environment as an example: an unfiltered rule set and a filtered rule set. Table 1 summarises the parameters.

Table 1: Hyperparameters of the study on the relevance of a rule set for the behaviour of agents during the learning process.
 Minimum value for support (supp_min): 0.0050 (filtered), 0 (unfiltered)
 Minimum value for confidence (conf_min): 0.95 (filtered), 1 (unfiltered)

The data were collected during test phases at different stages of training progress. Figure 4 shows how often a rule is adhered to (green line) or not adhered to (red line) in relative frequencies; the solid lines show the progression for the filtered rule set, the dashed lines for the unfiltered rule set (cf. Table 1).

(Figure 4: Relevance of a rule set for the behaviour of agents during the learning process.)

For both cases (filtered and unfiltered rule set), the number of rule violations decreases with increasing training progress. A striking feature is the difference between the relative curves of the filtered and unfiltered rule sets. While the green curve of the filtered rule set converges towards conf_min = 0.95, this cannot be observed for the unfiltered rule set. One possible reason for this is the high number of rules: the rule set specifies an action for each state. If the agent chooses this action in every state, its behaviour corresponds in theory to that of the expert. However, since the agent and the expert learn a strategy to solve the task independently of each other, it is unlikely that their behaviour will match exactly; as a result, the choice of actions differs in some states. The effect is also present with the filtered rule set, but it is clearly smaller than with the unfiltered rule set. This suggests that the effect increases with the number of rules in a rule set.
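The adherence statistics of this study can be gathered by replaying test phases and checking each visited state against the rule set. The sketch below is an assumed bookkeeping routine reusing the satisfies helper from the safety-layer snippet; it is not the authors' evaluation code.

<pre>
def rule_adherence(env, model, rules, n_steps=20_000):
    """Count, over one test phase, how often a visited state is covered by a
    rule and how often the agent's action fulfils or violates that rule."""
    covered = fulfilled = violated = 0
    obs = env.reset()
    for _ in range(n_steps):
        action, _ = model.predict(obs)
        rule = next((r for r in rules if satisfies(r, obs)), None)
        if rule is not None:
            covered += 1
            if int(action) == rule.action:
                fulfilled += 1
            else:
                violated += 1
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()
    return covered, fulfilled, violated
</pre>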
====Evaluation of the Safety Layer====

This section evaluates the safety layer. The algorithms above formulate a procedure to monitor the agent using a set of safety rules. The framework is evaluated in common OpenAI environments.

Evaluation of the mode of operation for discrete action spaces. In order to draw conclusions about the functioning of the safety layer, different combinations of agent and rule set are compared. The object of the investigation is the aggregated reward that the agent receives at the end of an episode. For wrong decisions that lead to critical states (e.g. crashing the LunarLander), the agent receives a high negative reward; this means that the amount and variance of the reward are linked to safe behaviour of the agent. For the evaluation, an untrained agent (hereafter referred to as a novice) is used in three different ways. The novice, by its random strategy, represents the most uncertain state during the learning process. The following settings are considered:
* novice, unsupervised;
* novice, monitored by the safety layer with access to a filtered rule set;
* novice, monitored by the safety layer with access to an unfiltered rule set.
The hyperparameters for the rule sets (filtered/unfiltered) correspond to those in Table 1. The reference value for the final reward is the expert from whom the expert demonstrations originate. Figure 5 shows the results of the different runs using the CartPole environment as an example.

(Figure 5: Evaluation for discrete action spaces based on the CartPole environment.)

It can be clearly seen that the expert (green curve) has the best performance. In contrast, the reward of the unsupervised novice (blue curve) varies between 8 and 95. The comparatively poor performance can be explained by the fact that the novice chooses actions randomly; it has not learned a strategy for processing the given states. If the novice is monitored using the safety layer from Algorithm 2, the course changes depending on the rule set used. The red curve shows the performance when an unfiltered rule set is used. It corresponds in parts to the performance of the expert, but shows sharp dips (reward of 167) for some episodes. Nevertheless, the performance in these cases is clearly better than that of the unsupervised novice (blue curve). It can be concluded that the safety layer with access to the unfiltered rule set exerts a consistently positive influence on the novice's performance. When using a filtered rule set, the result is similar. Figure 6 presents the results using the LunarLander environment as an example.

(Figure 6: Evaluation for discrete action spaces using the LunarLander environment.)

Again, the average reward of the expert (green) is the highest. In second place is the performance of the novice with the unfiltered rule set (red), closely followed by the novice with the filtered rule set (orange). In direct comparison, the performance of the unsupervised novice (blue) is the worst. A closer look reveals significant differences between Figures 6 and 5. The supervised novice with the filtered rule set (orange) reaches the level of the expert (green) in places, but the downward fluctuations are more pronounced than in the CartPole environment. This suggests that a wrong decision in the LunarLander environment is difficult to compensate for: in concrete terms, if the flying object gets into an unfavourable position, a controlled landing is hardly possible without a strategy. Analogous to the CartPole environment, the reward of the unsupervised novice (blue) is the lowest, and it increases when the novice is supervised. An increased number of safety rules improves the performance of the agent.

Evaluation of the mode of operation for continuous action spaces. To evaluate the safety layer in environments with continuous action spaces, the hyperparameters for the rule sets (filtered/unfiltered) are the same as before. The divider for the discretisation is set to two. The results are shown in Figure 7.

(Figure 7: Evaluation for continuous action spaces using the BipedalWalker environment.)

When looking at the performance of the expert (green), it is noticeable that it shows strong drops. The expert trained with the help of the PPO2 learning algorithm is not able to solve the environment. This contradicts the assumption that the expert demonstrates safe behaviour. The unsafe behaviour affects the quality of the expert demonstrations and thus the relevance of the safety rules: if the demonstrations contain errors, these are also reflected in the safety rules. The positive effect of the safety layer therefore depends on the quality of the safety rules. The curve of the novice with the unfiltered rule set (red) lies clearly below the performance of the expert (green). In contrast to the discrete action spaces, no improvement can be seen even in comparison to the unsupervised novice (blue); the same applies to the novice with the filtered rule set (orange curve). However, the poorer performance of the two monitored novices (red, orange) is not solely due to the quality of the demonstrations. The two curves (red, orange) show no upward fluctuations. Even with non-optimal demonstrations, approaches of good behaviour should produce better performance than that of the unsupervised novice (blue). However, this is not the case, and thus it is assumed that the complexity of the BipedalWalker environment is too high to achieve good performance without a strategy.
===Lessons Learned===

Current developments in technology make it possible to use advances in the field of artificial intelligence in a wide variety of areas. The focus is on tasks of high complexity where errors can have fatal consequences. In addition to the aspect of safety, trust in intelligent systems is a hurdle that must be overcome. In order to strengthen trust, there are approaches in research that attempt to increase the explainability of systems. RL has achieved particular milestones in the past; this was made possible by using RL in conjunction with deep learning. Deep RL is considered a suitable candidate for complex tasks. However, the use of a neural network as a policy, as well as learning it, poses two problems: the black-box problem of the neural network (focus on explainability) and the trial-and-error approach of the learning process (focus on safety). In current research, there are several approaches that address these problems. The approach in this paper differs from existing ones in many ways. For one, both problems are addressed simultaneously. In addition, the concept can be used independently of the structure of the RL system: neither the agent nor the environment itself needs to be modified. Another advantage is that the concept can be applied to different environments and agents simply by adjusting the hyperparameters. It is also possible to use existing data sets, which reduces the time needed from human experts and offers the possibility to build on existing solutions.

We developed a concrete framework for deriving safety rules from expert demonstrations. It is able to derive comprehensible rules from a set of trajectories. We used a decision tree based on the CART algorithm to derive the rules. The result of the decision tree is a rule set with a high number of rules. By using concepts from the field of association rules, the rule set can be filtered for relevant rules. We used the evaluation to show how the parameters affect the shape of the rule set and gave an intuition for the interpretation of the framework's hyperparameters. Using the metrics from the association rules domain, we developed a termination criterion that gives clear conclusions about the statistical relevance and uniqueness of the rules. With the help of the decision tree, these rules also have a comprehensible form. The framework is able to derive a compact rule set from expert demonstrations; the rule set reflects the behaviour of the expert in a comprehensible way. We implemented the integration of the safety rules as follows: a safety layer that monitors the agent ensures that the rules are followed at all times. The rules are derived from the expert's demonstrations and thus reflect his or her behaviour. They can therefore be interpreted as safety rules, provided that the expert's behaviour is considered safe. The results of the evaluation have shown that in less complex environments, such as the LunarLander environment, the expert's performance can be achieved using the safety layer alone. In general, a positive effect on performance could be achieved by using the safety layer. A guarantee to avoid wrong decisions is only possible if the expert demonstrations cover all critical states; only then can the rule set contain all the necessary rules. In order to make the framework as universally applicable as possible, it was extended with the ability to handle environments with continuous actions. Based on the significantly more complex environments, it could be determined that the application reaches its limits under the set goals of compactness and comprehensibility of a rule set.
===Conclusion and Prospects===

The results of this paper show the great potential of the concept Safety Aware Reinforcement Learning by Identifying Comprehensible Constraints in Expert Demonstrations, but also reveal first weaknesses. Only if the experts demonstrate how to deal with all critical states is it possible that for each state there is a rule specifying safe behaviour. Whether this is the case can only be verified by the expert. In less complex environments, such as the LunarLander environment, a rule set with a low number of rules could also achieve a high reward; however, this reward had a high variance. This is because the rules reflect the behaviour of the expert. The expert tries to land on the landing site even when it is far away from it, and the rules resulting from this behaviour force the supervised novice into risky manoeuvres that are difficult to intercept by the rule set alone. So instead of a simple but safe landing, what follows is an unnecessarily risky manoeuvre that leads to a crash. The results of the LunarLander environment have shown that the expert consciously takes risks. However, such risks should not be included in the safety rules. A possible solution could be the prioritisation of behaviour; in the case of the LunarLander, this would mean that an accident-free landing is necessary for safety, while landing on the landing pad is secondary. One way to implement this is to use targeted demonstrations that are limited to critical situations. That is, the expert demonstrations are not composed of an arbitrary set of trajectories, but contain only those that show safety-relevant behaviour. In order to further investigate the functioning in complex environments with continuous action spaces, it should be examined whether a finer subdivision of the continuous interval into discrete actions achieves a positive effect. Other approaches could be a better-performing expert or optimising the hyperparameters (number of trajectories, support, confidence, etc.).

For a large number of the application areas commonly used in reinforcement learning, the framework developed in this work offers the possibility to learn clear and understandable rules from demonstrations. With the help of the framework, the rules can be integrated into the learning and deployment process of the agent. As with the agents themselves, the high dimensionality of environments and continuous action spaces pose a challenge for this framework.
===References===

Bastani, O.; Pu, Y.; and Solar-Lezama, A. 2018. Verifiable Reinforcement Learning via Policy Extraction. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, NeurIPS'18, 2499–2509.

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. CoRR, abs/1606.01540.

Finn, C.; Levine, S.; and Abbeel, P. 2016. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, 49–58.

García, J.; and Fernández, F. 2015. A Comprehensive Survey on Safe Reinforcement Learning. Journal of Machine Learning Research, 16(42): 1437–1480.

Hidber, C. 1999. Online Association Rule Mining. ACM SIGMOD Record, 28(2): 145–156.

Hill, A.; Raffin, A.; Ernestus, M.; Gleave, A.; Kanervisto, A.; Traore, R.; Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; and Wu, Y. 2018. Stable Baselines.

Hussein, A.; Gaber, M. M.; Elyan, E.; and Jayne, C. 2017. Imitation Learning. ACM Computing Surveys, 50(2): 1–35.

Kaya, M.; and Alhajj, R. 2005. Fuzzy OLAP Association Rules Mining-Based Modular Reinforcement Learning Approach for Multiagent Systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(2): 326–338.

Liu, B.; Hsu, W.; and Ma, Y. 1998. Integrating Classification and Association Rule Mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, KDD'98, 80–86.

Longo, L.; Goebel, R.; Lecue, F.; Kieseberg, P.; and Holzinger, A. 2020. Explainable Artificial Intelligence: Concepts, Applications, Research Challenges and Visions. In Machine Learning and Knowledge Extraction, 1–16.

Menda, K.; Driggs-Campbell, K. R.; and Kochenderfer, M. J. 2019. EnsembleDAgger: A Bayesian Approach to Safe Imitation Learning. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS'19, 5041–5048.

Moldovan, T. M.; and Abbeel, P. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on Machine Learning, ICML'12.

Puiutta, E.; and Veith, E. M. S. P. 2020. Explainable Reinforcement Learning: A Survey. In Machine Learning and Knowledge Extraction, International Cross-Domain Conference, CD-MAKE'20, 77–95. Springer.

Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M. I.; and Moritz, P. 2015. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, 1889–1897.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. CoRR, abs/1707.06347.