=Paper= {{Paper |id=Vol-3087/paper_14 |storemode=property |title=Safety Aware Reinforcement Learning by Identifying Comprehensible Constraints in Expert Demonstrations |pdfUrl=https://ceur-ws.org/Vol-3087/paper_14.pdf |volume=Vol-3087 |authors=Leopold Müller,Lars Böcking,Michael Färber |dblpUrl=https://dblp.org/rec/conf/aaai/MullerBF22 }} ==Safety Aware Reinforcement Learning by Identifying Comprehensible Constraints in Expert Demonstrations== https://ceur-ws.org/Vol-3087/paper_14.pdf
Safety Aware Reinforcement Learning by Identifying Comprehensible Constraints in Expert Demonstrations

Leopold Müller, Lars Böcking, Michael Färber
Karlsruhe Institute of Technology
leopold.mueller@student.kit.edu, lars.boecking@student.kit.edu, michael.faerber@kit.edu


Abstract

When used in real-world environments, agents must meet high safety requirements, as errors have direct consequences. Besides the safety aspect, the explainability of such systems is of particular importance. Therefore, not only should errors be avoided during the learning process, but the decision process should also be made transparent. Existing approaches are limited to solving a single one of these problems; for real-world use, however, several criteria must be fulfilled at the same time. In this paper we derive comprehensible rules from expert demonstrations which can be used to monitor the agent. The developed approach uses state-of-the-art classification and regression trees for deriving safety rules, combined with concepts from the field of association rule mining. The result is a compact and comprehensible rule set that explains the expert's behavior and ensures safety. We evaluate our framework in common OpenAI environments. Results show that the elaborated approach is able to identify safety-relevant rules and imitate expert behavior, especially in edge cases. Evaluations on higher-dimensional observation spaces and continuous action spaces highlight the transferability of the approach to new tasks while maintaining the compactness and comprehensibility of the rule set.¹

¹ https://github.com/leopoldmueller/safety-aware-reinforcement-learning

Copyright © 2022, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Introduction

In real applications, such as autonomous vehicles, the explainability of a system's decisions is highly relevant. When the algorithm is in charge, we want to know what it is doing and why it is doing it. The explainability of such systems is not only relevant for the legal framework but also for social acceptance. Reinforcement learning approaches based on deep learning achieve excellent results in terms of their target function, such as reward, but do not offer explainability and traceability. Association rule mining is concerned with identifying patterns in data and formalizing them for reproducibility and explainability. Existing approaches are mainly concerned with modeling the complete context in the data, which leads to comprehensibility problems. The approach presented here combines these research areas in a new way, with a special focus on application.

In the area of reinforcement learning, there are various approaches to increase the safety of the agent. However, these are limited either to modifying the optimization criterion or to restricting the exploration process. First approaches to draw on existing knowledge in the form of expert demonstrations focus mainly on accelerated convergence via so-called warm starting. So far, there is a lack of approaches that specifically extract existing knowledge from previously unused data and integrate this knowledge into the learning process.

This work attempts to make use of existing expert demonstrations in two ways. Firstly, it tries to make the decision-making process of the system more transparent by identifying human-understandable rules. Secondly, the rules found are used to monitor the system: the system is prohibited from violating the given safety rules.

The combination of state-of-the-art classification and regression trees with association rule mining represents a novelty in the field of safe reinforcement learning and opens up new possibilities to transfer the algorithms to new application areas. In critical domains such as medicine or transportation, the presented framework fulfills key requirements in terms of comprehensibility and transparency. The key contributions of this work are:

• applying association rule mining to RL demonstrations
• combining classification and regression trees with association rule mining
• a framework for deriving a comprehensible set of rules
• integrating a safety layer into the learning process


Related Work

Modification of the optimization criterion. In the simplest case, the optimization criterion is the risk-neutral criterion. Trust Region Policy Optimization (TRPO) is a further development of policy gradient methods, which are used to optimize strategies in the form of neural networks (Schulman et al. 2015). The change of parameters must not exceed a step size called the trust region, which restricts the set of possible strategies. A further development of TRPO is the Proximal Policy Optimization (PPO) algorithm presented in (Schulman et al. 2017). While the process of strategy discovery by these two algorithms is significantly more effective and goal-oriented than that of a stochastic learning algorithm,
they rely on the assumption of ergodicity (Moldovan and Abbeel 2012). However, this assumption does not hold in reality. Fast and effective strategy discovery does not protect against errors or irreversible states. The approach of this work does not try to replace learning algorithms like TRPO or PPO; rather, it tries to complement them. Thus, we use the PPO algorithm in combination with the approach of this paper.

Modification of the exploration process. To avoid undesirable states, these must be marked as undesirable without visiting them. This is impossible without external knowledge, since otherwise they would have to be visited during the random exploration process (García and Fernández 2015). In general, external knowledge can be used in two different ways: (1) derive a strategy from a set of trajectories or (2) guide the exploration process by recommendations from a teacher (García and Fernández 2015). The first subjects the agent to a kind of initialisation procedure. This is done on the basis of prior knowledge about the task. The RL agent learns a strategy based on expert demonstrations. In doing so, it imitates the behaviour that the expert demonstrated. In the literature, this approach is referred to as Imitation Learning (Hussein et al. 2017). Another approach to imitation learning is Inverse RL (Finn, Levine, and Abbeel 2016). Instead of the strategy, a reward signal is learned using the state-action pairs. Similarly, an approach is possible where the agent is guided by an expert when the chosen action falls below a safety value. In (Menda, Driggs-Campbell, and Kochenderfer 2019) an approach is presented in which a trained agent takes over the role of the expert. However, in many application areas, no agent exists that can take on this role.

Explainability of reinforcement learning systems. Explainability is often considered a trade-off to performance (Puiutta and Veith 2020; Longo et al. 2020). In (Bastani, Pu, and Solar-Lezama 2018) an agent is deployed in an environment and its behaviour is recorded. Based on the demonstrations, a decision tree is then trained and used to verify the agent's strategy. The most common group of algorithms in the field of association rule mining are Apriori algorithms. They are designed to identify rules that have a minimum level of confidence and support. Algorithms like ID3 were continuously improved, resulting in C4.5, which removed the restriction that features must be categorical, and in memory-optimized versions like C5.0. The most recent advancements were achieved by CART (Classification and Regression Trees), which constructs binary trees. There are existing approaches that combine the research field of association rule mining with other fields of application, such as integrating classification and association rule mining (Liu, Hsu, and Ma 1998). Further research deals with online learning, where association rules are derived from a continuous stream of information (Hidber 1999). Combined with reinforcement learning, there are first applications of association rule mining to solve, for example, multi-agent environments (Kaya and Alhajj 2005).


Approach

This paper attempts to make reinforcement learning safer. For this purpose, safety rules are derived from existing data sets. On the one hand, these serve to increase the transparency of the agent's decision-making, and on the other hand, they can be used to monitor the agent.

The starting point for this approach is a set of expert demonstrations. It is expected that for the task to be solved there is already an instance of some kind that can interact with the environment. This instance is called an expert. In the case of autonomous driving, a human driver could control the vehicle. Figure 1 summarises the complete systematics of the approach.

Figure 1: Systematics of the approach of this paper.

The derivation of the safety rules and the integration of these, in the form of a safety layer, form the basis of the approach and are explained below.

Derive Safety Rules

The starting point for deriving safety rules is the set of expert demonstrations (see Figure 1). They consist of a finite set of trajectories containing a sequence of states and actions. This behaviour is assumed to be safe.

In the first step, the CART algorithm is applied to the expert demonstrations. This algorithm creates a decision tree based on the states and the actions that follow them. The structure of the decision tree provides the basis for the decision rules, which are interpreted as safety rules. The decision tree classifies states according to their characteristics. By querying the edges, the states are grouped according to the same actions. It is first assumed that a finite set of actions, such as "braking" and "accelerating", is available for selection; the actions are thus discrete. In the next step, the paths of the tree are converted into decision rules and summarised as a rule set.

The result of the framework is a rule set consisting of all decision rules or paths of the decision tree. The length and number of rules are determined by the depth of the tree. Therefore, the rule set can become arbitrarily large depending on the complexity and amount of labelled data. To make the decision process comprehensible, the rule set should be kept as compact as possible. This means that the size of the tree must be limited with the help of a termination criterion.
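As a minimal sketch of this first step, assuming the demonstrations are already available as arrays of states and discrete actions (the file names, feature names, and depth limit are illustrative assumptions, not taken from the paper), the CART step could look as follows with scikit-learn:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical demonstration data: one row per recorded time step.
states = np.load("expert_states.npy")    # shape (n_steps, n_state_features)
actions = np.load("expert_actions.npy")  # shape (n_steps,), discrete expert actions

# CART-style decision tree over states; the depth limit is only one possible
# termination criterion (alternatives are discussed below).
tree = DecisionTreeClassifier(criterion="gini", max_depth=4)
tree.fit(states, actions)

# Every root-to-leaf path is a candidate rule "IF conditions THEN action".
feature_names = [f"s{i}" for i in range(states.shape[1])]
print(export_text(tree, feature_names=feature_names))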
One possibility is to set a maximum tree depth. In this case, the splitting of a node stops as soon as its depth corresponds to the maximum tree depth. If, for example, the algorithm stops at a leaf node that specifies "braking" for one half of the states and "accelerating" for the other, the rule found is not a suitable candidate for a safety rule. A safety rule should specify an action that is as unambiguous as possible. Other factors must therefore be taken into account.

An alternative is to use the Gini coefficient. In this case, the equality of all selected actions is quantified. Trying to apply this criterion in larger action spaces, however, means that rules in which only a small subset of the possible actions is used can also be included in the set of rules due to their low Gini coefficient. This distribution-based criterion does not ensure unambiguous actions.

To achieve this, metrics from the area of association rules are considered. Decision rules are understood as association rules of the form X ⇒ e_k. The queries of the characteristics of states s_t (referred to as x_t in the decision tree) along the edges of a path represent X. The action a_t (represented in the decision tree by y), to which most states in the leaf node of the path are assigned, forms the consequence e_k of X. Now the rule set can be filtered for relevant rules. For this purpose, minimum values for support and confidence are set. Algorithm 1 describes the framework for discrete action spaces.

Algorithm 1: Constraints Identifier
Input: set of trajectories Ω
Parameter: supp_min, conf_min
Output: set of rules C
 1: X = list of states in Ω
 2: Y = list of actions in Ω
 3: C = create empty list of rules
 4: filtered_paths = create empty list of filtered paths
 5: tree = DecisionTreeClassifier(X, Y)
 6: for path in tree do
 7:   for node in path do
 8:     if supp(node) > supp_min and conf(node) > conf_min then
 9:       cut off subsequent nodes
10:       append shortened path to filtered_paths
11:     else
12:       if node is leaf then
13:         go to next path in tree
14:       else
15:         go to next node in path
16:       end if
17:     end if
18:   end for
19: end for
20: filtered_paths = make filtered_paths unique
21: for path in filtered_paths do
22:   c = ConvertPathIntoRule(path)
23:   append c to C
24: end for
25: return set of rules C

Rules that do not meet the minimum values are removed from the rule set. The support describes the statistical relevance of a rule: a rule with a support of 0.1 applies to 10% of all states. The confidence of a rule indicates its uniqueness: a rule with a confidence of 0.8 means that in a state where the rule applies, the action specified by the rule is performed in 80% of the cases. The use of support and confidence as termination criteria leads to a rule set in which all rules have both statistical relevance and a safe action instruction determined by the confidence. After filtering the paths of the decision tree using the hyperparameters supp_min and conf_min, redundant rules are removed in line 20 and the remaining paths are converted into safety rules in line 22. For this purpose, all queries of the decision nodes are linked via logical ANDs, starting from the root of the tree to the currently considered leaf. The most frequently selected action in the leaf is assigned to the rule.

Adaptations for continuous action spaces. We differentiate between two cases: (1) there are two discrete actions "braking" and "accelerating", and (2) there are two continuous actions "braking" and "accelerating", both of which can take a value from [0, 1]. In case (1), Algorithm 1 can be applied without adaptation: a decision tree is created for the classification of states. In case (2), we extend our framework to categorize continuous values into discrete ranges. Then Algorithm 1 is applied to each action. In order to specify the granularity of the subdivision of the action space, we add the hyperparameter divider, which specifies the number of intervals to divide into.

Integration of Safety Rules

The final component uses the set of rules to monitor an agent during the learning/deployment process. The approach can therefore be used both during the learning process and during the deployment process, because the agent itself is not directly modified. If a chosen action violates a rule, it is adjusted accordingly. In Figure 1, this safety layer is shown in green. The edges show input and output as well as access to the rule set. Algorithm 2 depicts how the safety layer works. In the case of continuous actions, these must first be converted into discrete values in order to be checked against the safety rules in the next step.

Algorithm 2: Safety Layer
Input: set of rules C, state s, action a
Output: action a
 1: for rule in C do
 2:   if features of s fulfil all conditions of rule then
 3:     if a == action of rule then
 4:       break
 5:     else
 6:       a = action of rule
 7:       break
 8:     end if
 9:   else
10:     continue {conditions of rule not fulfilled}
11:   end if
12: end for
13: return a
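A possible realisation of Algorithm 1 and Algorithm 2 on top of a fitted scikit-learn tree is sketched below. This is only an illustration under the assumptions of the previous sketch; the reference implementation is in the linked repository and may differ in details such as how node support and confidence are computed.

import numpy as np

def extract_safety_rules(tree, n_samples, supp_min=0.005, conf_min=0.95):
    """Algorithm 1 (sketch): walk every root-to-leaf path of a fitted
    DecisionTreeClassifier and cut it off at the first node whose subset
    already satisfies the minimum support and confidence."""
    t = tree.tree_
    rules = []

    def recurse(node, conditions):
        counts = t.value[node][0]                     # class distribution in the node
        support = t.n_node_samples[node] / n_samples  # share of all states reaching the node
        confidence = counts.max() / counts.sum()      # share of the majority action
        if support > supp_min and confidence > conf_min:
            action = int(tree.classes_[np.argmax(counts)])
            rules.append((tuple(conditions), action))  # cut off subsequent nodes (line 9)
            return
        if t.children_left[node] == -1:                # leaf that never met the thresholds
            return                                     # go to next path (line 13)
        f, thr = int(t.feature[node]), float(t.threshold[node])
        recurse(t.children_left[node], conditions + [(f, "<=", thr)])
        recurse(t.children_right[node], conditions + [(f, ">", thr)])

    recurse(0, [])
    return list(set(rules))                            # make filtered paths unique (line 20)

def safety_layer(rules, state, action):
    """Algorithm 2 (sketch): if the state fulfils all conditions of a rule,
    the rule's action replaces the chosen action."""
    for conditions, rule_action in rules:
        if all(state[f] <= thr if op == "<=" else state[f] > thr
               for f, op, thr in conditions):
            return rule_action
    return action

With these sketches, rules = extract_safety_rules(tree, len(states)) yields the rule set, and safety_layer(rules, s, a) performs the monitoring step for a single state-action pair.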
Evaluation

In this section, the approach is used in different scenarios to evaluate it in terms of function and versatility. The evaluation is performed in three steps:
(1) setting up a suitable test environment,
(2) evaluating the framework to derive safety rules, and
(3) evaluating the safety layer.

Setup of Test Environment

For the evaluation, environments from OpenAI (Brockman et al. 2016) are used. Expert demonstrations serve as input for the algorithm. Trained agents are used as experts. For the following evaluations, the PPO2 algorithm was used together with an MlpPolicy (Hill et al. 2018). The number of time steps t determines the number of trajectories |Ω| included in the expert demonstrations and is kept unchanged at 15,000 for all following experiments. The agent is deployed in the environment until the specified number of time steps has been recorded.
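A sketch of this setup, assuming the Stable Baselines (v2) and classic Gym APIs cited above; the expert's own training budget is an assumption, and only the 15,000 recorded demonstration steps are taken from the paper:

import gym
import numpy as np
from stable_baselines import PPO2

# Train an expert with PPO2 and an MlpPolicy, as in the paper's setup.
env = gym.make("CartPole-v1")
expert = PPO2("MlpPolicy", env, verbose=0)
expert.learn(total_timesteps=100_000)  # assumed training budget

# Record 15,000 time steps of expert demonstrations (states and actions).
states, actions = [], []
obs = env.reset()
while len(states) < 15_000:
    action, _ = expert.predict(obs, deterministic=True)
    states.append(obs)
    actions.append(action)
    obs, _, done, _ = env.step(action)
    if done:
        obs = env.reset()

np.save("expert_states.npy", np.asarray(states))
np.save("expert_actions.npy", np.asarray(actions))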

Evaluation of the Derivation of Safety Rules

In this section, the framework for deriving safety rules described in Algorithm 1 is evaluated. Three different investigations are carried out.

Influence of the hyperparameters on the number of rules in a rule set. First, the influence of the minimum values of support and confidence as termination criteria on the number of safety rules is investigated. For this, the framework from Algorithm 1 is applied to the created expert demonstrations of the different environments. Depending on the set parameters, rule sets are generated as a result. Figure 2 shows the results for the CartPole environment.

Figure 2: Comparison of rule sets by number of rules contained. The data is based on the CartPole-v1 environment.

At the point (supp_min = 0, conf_min = 1), the rule set is unfiltered and thus has the highest number of rules. If the algorithm terminates before the subset of a node is unambiguously assigned, the confidence of the rule is less than one. This becomes clear when the course of the graph for supp_min = 0 is considered: the number of rules decreases continuously as the minimum value for confidence is reduced. This shows that if a higher uniqueness of the rule is demanded (a high minimum confidence conf_min), the number of rules that are identified is lower. The same applies when a more frequent occurrence of a rule is demanded (a higher minimum support supp_min).

Influence of the hyperparameters on the average length of the rules of a rule set. Comparable to the previous procedure, the influence of the hyperparameters on the average length of safety rules is now examined. Figure 3 illustrates the result of the investigation using the CartPole environment.

Figure 3: Comparison of rule sets according to the average length of the rules they contain, using the CartPole-v1 environment.

At first glance, parallels to Figure 2 can be seen. Here, too, the maximum lies at the point (supp_min = 0, conf_min = 1). For supp_min > 0 the maxima move to the "middle" of the confidence scale. From the investigations it can be concluded that the two hyperparameters influence both the average length and the absolute number of rules of a rule set in the same way.

Relevance of a rule set for the behaviour of agents during the learning process. In order to draw conclusions about the relevance of the rules of a rule set, the learning process of an agent is considered. The agent goes through a learning phase and a test phase alternately. It is first deployed in the environment without prior knowledge and tested for 20,000 time steps. The number of times it enters a state for which a safety rule exists is recorded. In addition, it is documented how often it fulfils or violates these rules. This is followed by a learning phase in which the agent is trained for a certain number of time steps (set to 500). After completion of the learning phase, another test phase follows. Two rule sets are examined using the CartPole environment as an example: one is the unfiltered rule set and the other is a filtered rule set. Table 1 summarises the parameters:

Hyperparameter                                        Value
Minimum value for support (filtered, unfiltered)      0.0050, 0
Minimum value for confidence (filtered, unfiltered)   0.95, 1

Table 1: Hyperparameters of the study on the relevance of a rule set for the behaviour of agents during the learning process.
The data were collected during the test phases at different levels of training progress. The solid lines show the progression for the filtered rule set, the dashed lines for the unfiltered rule set (cf. Table 1). Figure 4 shows how often a rule is adhered to (green line) or not adhered to (red line) in relative frequencies.

Figure 4: Relevance of a rule set for the behaviour of agents during the learning process.

For both cases (filtered and unfiltered rule set) the number of rule violations decreases with increasing training progress. A striking feature is the difference between the relative curves of the filtered and unfiltered rule sets. While the green curve of the filtered rule set converges towards conf_min = 0.95, this cannot be observed for the unfiltered rule set. One possible reason for this is the high number of rules. The rule set specifies an action for each state. If the agent chooses this action in every state, its behaviour corresponds in theory to that of the expert. However, since the agent and the expert independently learn a strategy to solve the task, it is unlikely that their behaviour will match exactly. As a result, the choice of actions differs in some states. The effect is also present with the filtered rule set, but it is clearly smaller than with the unfiltered rule set. This suggests that the effect increases with the number of rules in a rule set.

Evaluation of Safety Layer

This section evaluates the safety layer. In the algorithms above, a procedure has been formulated to monitor the agent using a set of safety rules. The framework is evaluated in common OpenAI environments.

Evaluation of the mode of operation for discrete action spaces. In order to draw conclusions about the functioning of the safety layer, different combinations of agent and rule set are compared. The object of the investigation is the aggregated reward that the agent receives at the end of an episode. For wrong decisions that lead to critical states (e.g. crashing the LunarLander), the agent receives a high negative reward. This means that the amount and variance of the reward are linked to safe behaviour of the agent. For the evaluation, an untrained agent (hereafter referred to as a novice) is used in three different ways. The novice, by its random strategy, represents the most uncertain state during the learning process. It is considered:

• Novice, unsupervised.
• Novice, monitored by safety layer with access to a filtered rule set.
• Novice, monitored by safety layer with access to an unfiltered rule set.

The hyperparameters for the rule sets (filtered/unfiltered) correspond to those in Table 1. The reference value for the final reward is the expert from whom the expert demonstrations originate. Figure 5 shows the results of the different runs using the CartPole environment as an example.
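As an illustration of how such a run can be set up — reusing the hypothetical safety_layer function and the rules derived in the sketches above, so this is an assumed interface rather than the paper's exact code — a monitored novice could be evaluated as follows:

import gym
import numpy as np

env = gym.make("CartPole-v1")

def run_episode(env, rules, policy=None):
    """One test episode: a random novice (policy=None) or a given policy,
    with every chosen action passed through the safety layer first."""
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = env.action_space.sample() if policy is None else policy(obs)
        action = safety_layer(rules, obs, action)  # monitor / adapt the action
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

# Aggregated reward per episode for the monitored novice.
episode_rewards = [run_episode(env, rules) for _ in range(100)]
print(np.mean(episode_rewards), np.std(episode_rewards))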
Figure 5: Evaluation for discrete action spaces based on the CartPole environment.

It can be clearly seen that the expert (green curve) has the best performance. In contrast, the reward of the unsupervised novice (blue curve) varies between 8 and 95. The comparatively poor performance can be explained by the fact that the novice chooses actions randomly; the novice has not learned a strategy for processing the given states. If the novice is monitored using the safety layer from Algorithm 2, the course changes depending on the rule set used. The red curve shows the performance when an unfiltered rule set is used. It corresponds in parts to the performance of the expert, but shows sharp dips (reward of 167) for some episodes. Nevertheless, the performance in these cases is clearly better compared to the unsupervised novice (blue curve). It can be concluded that the safety layer with access to the unfiltered rule set exerts a consistently positive influence on the novice's performance. When using a filtered rule set, the result is similar. Figure 6 presents the results using the LunarLander environment as an example.

Figure 6: Evaluation for discrete action spaces using the LunarLander environment.

Again, the average reward of the expert (green) is the highest. In second place is the performance of the novice with the unfiltered rule set (red), closely followed by the novice with the filtered rule set (orange). In direct comparison, the performance of the unsupervised novice (blue) is the worst. A closer look reveals significant differences between Figures 6 and 5. The supervised novice with the filtered rule set (orange) reaches the level of the expert (green) in places. The downward fluctuations are more pronounced than in the CartPole environment. This suggests that a wrong decision in the LunarLander environment is difficult to compensate for. In concrete terms, this means that if the flying object gets into an unfavourable position, a controlled landing is hardly possible without a strategy. Analogous to the CartPole environment, the reward of the unsupervised novice (blue) is lowest; it increases when the novice is supervised. An increased number of safety rules improves the performance of the agent.

Evaluation of the mode of operation for continuous action spaces. To evaluate the safety layer in environments with continuous action spaces, the hyperparameters for the rule sets (filtered/unfiltered) are the same as before. The divider for the discretisation is set to two. The results are shown in Figure 7.

Figure 7: Evaluation for continuous action spaces using the BipedalWalker environment.

When looking at the performance of the expert (green), it is noticeable that it shows strong drops. The expert trained with the help of the PPO2 learning algorithm is not able to solve the environment. This contradicts the assumption that the expert demonstrates safe behaviour. The unsafe behaviour affects the quality of the expert demonstrations and thus the relevance of the safety rules: if the demonstrations contain errors, these are also reflected in the safety rules. The positive effect of the safety layer is therefore dependent on the quality of the safety rules. If the curve of the novice with the unfiltered rule set (red) is considered, it is noticeable that it lies clearly below the performance of the expert (green). In contrast to the discrete action spaces, no improvement can be seen even in comparison to the unsupervised novice (blue). The same applies to the novice with the filtered rule set (orange curve). However, the poorer performance of the two monitored novices (red, orange) is not solely due to the quality of the demonstrations. The two curves (red, orange) show no upward fluctuations. Even with non-optimal demonstrations, approaches of good behaviour should produce better performance than that of the unsupervised novice (blue). However, this is not the case, and thus it is assumed that the complexity of the BipedalWalker environment is too high to achieve good performance without a strategy.
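A sketch of this extension for continuous action spaces, assuming actions normalised to [-1, 1] as in BipedalWalker (the bounds and the per-dimension tree depth are assumptions; the divider of two matches the evaluation above):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discretise_actions(actions, divider=2, low=-1.0, high=1.0):
    """Map each continuous action dimension to one of `divider` equal-width bins."""
    edges = np.linspace(low, high, divider + 1)[1:-1]  # inner bin edges
    return np.digitize(actions, edges)                 # same shape, values in 0..divider-1

def fit_trees_per_action_dim(states, actions, divider=2, max_depth=4):
    """One decision tree (and hence one rule set) per action dimension,
    mirroring the extension of Algorithm 1 to continuous action spaces."""
    labels = discretise_actions(actions, divider)
    return [DecisionTreeClassifier(max_depth=max_depth).fit(states, labels[:, d])
            for d in range(labels.shape[1])]

Each per-dimension rule set can then be filtered and applied as in the discrete case; before checking the rules, the safety layer converts a continuous action into its discrete bin.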
Lessons Learned

Current developments in technology make it possible to use advances in the field of artificial intelligence in a wide variety of areas. The focus is on tasks of high complexity where errors can have fatal consequences. In addition to the aspect of safety, trust in the intelligent systems is a hurdle that must be overcome. In order to strengthen trust, there are approaches in research that attempt to increase the explainability of systems. RL has achieved particular milestones in the past. This was made possible by using RL in conjunction with deep learning. Deep RL is considered a suitable candidate for complex tasks. However, the use of a neural network as a policy, as well as learning it, poses two problems: the black-box problem of the neural network (focus on explainability) and the trial-and-error approach of the learning process (focus on safety). In current research, there are several approaches that address these problems. The approach in this paper differs from existing ones in many ways. For one, both problems are addressed simultaneously. In addition, the concept can be used independently of the structure of the RL system. This means that neither the agent nor the environment itself needs to be modified. Another advantage is that the concept can be applied to different environments and agents simply by adjusting the hyperparameters. It is also possible to use existing data sets. This reduces the time needed by human experts and offers the possibility to build on existing solutions.

We developed a concrete framework for deriving safety rules from expert demonstrations. It is able to derive comprehensible rules from a set of trajectories. We used a decision tree based on the CART algorithm to derive the rules. The result of the decision tree is a rule set with a high number of rules. By using concepts from the field of association rules, the rule set can be filtered for relevant rules. We used the evaluation to show how the parameters affect the shape of the rule set and gave an intuition for the interpretation of the framework's hyperparameters. Using the metrics from the association rules domain, we developed a termination criterion that gives clear conclusions about the statistical relevance and uniqueness of the rules. With the help of the decision tree, these also have a comprehensible form. The framework is able to derive a compact rule set from expert demonstrations, and the rule set reflects the behaviour of the expert in a comprehensible way. We implemented the integration of the safety rules as follows: a safety layer that monitors the agent ensures that the rules are followed at all times. The rules are derived from the expert's demonstrations and thus reflect his or her behaviour. They can therefore be interpreted as safety rules, provided that the expert's behaviour is considered safe. The results of the evaluation have shown that in less complex environments, such as the LunarLander environment, the expert's performance can be achieved using the safety layer alone. In general, a positive effect regarding performance could be achieved by using the safety layer. A guarantee to avoid wrong decisions is only possible if the expert demonstrations cover all critical states; only then can the rule set contain all the necessary rules. In order to make the framework as universally valid as possible, it was extended to include the ability to handle environments with continuous actions. Based on the significantly more complex environments, it could be determined that the application reaches its limits under the set goals of compactness and comprehensibility of a rule set.


Conclusion and Prospects

The results of this paper show the great potential of the concept Safety Aware Reinforcement Learning by Identifying Comprehensible Constraints in Expert Demonstrations, but also reveal first weaknesses. Only if the expert demonstrates how to deal with all critical states is it possible that for each state there is a rule that specifies safe behaviour. Whether this is the case can only be verified by the expert. In less complex environments, such as the LunarLander environment, a rule set with a low number of rules could also achieve a high reward; however, this reward had a high variance. This is because the rules reflect the behaviour of the expert. The expert tries to land on the landing site even if it is far away from it. The rules resulting from this behaviour force the supervised novice into risky manoeuvres that are difficult to intercept by the rule set alone. So instead of a simple but safe landing, what follows is an unnecessarily risky manoeuvre that leads to a crash. The results of the LunarLander environment have shown that the expert consciously takes risks. However, these risks should not be included in the safety rules. A possible solution could be the prioritisation of behaviour. In the case of the LunarLander, this would mean that an accident-free landing is necessary for safety, while landing on the landing pad is secondary. One way to implement this is to use targeted demonstrations that are limited to critical situations. That is, the expert demonstrations are not composed of an arbitrary set of trajectories, but contain only those that show safety-relevant behaviour. In order to further investigate the functioning in complex environments with continuous action spaces, it should be examined whether a finer subdivision of the continuous interval into discrete actions achieves a positive effect. Other approaches could be a better-performing expert or optimising the hyperparameters (number of trajectories, support, confidence, etc.).

For a large number of the application areas commonly used in reinforcement learning, the framework developed in this work offers the possibility to learn clear and understandable rules from demonstrations. With the help of the framework, the rules can be integrated into the learning and deployment process of the agent. As with the agents themselves, the high dimensionality of environments and the continuous action spaces pose a challenge for this framework.


References

Bastani, O.; Pu, Y.; and Solar-Lezama, A. 2018. Verifiable Reinforcement Learning via Policy Extraction. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, NeurIPS'18, 2499–2509.

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. CoRR, abs/1606.01540.

Finn, C.; Levine, S.; and Abbeel, P. 2016. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, 49–58.

García, J.; and Fernández, F. 2015. A Comprehensive Survey on Safe Reinforcement Learning. Journal of Machine Learning Research, 16(42): 1437–1480.

Hidber, C. 1999. Online Association Rule Mining. ACM SIGMOD Record, 28(2): 145–156.

Hill, A.; Raffin, A.; Ernestus, M.; Gleave, A.; Kanervisto, A.; Traore, R.; Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; and Wu, Y. 2018. Stable Baselines.

Hussein, A.; Gaber, M. M.; Elyan, E.; and Jayne, C. 2017. Imitation Learning. ACM Computing Surveys, 50(2): 1–35.

Kaya, M.; and Alhajj, R. 2005. Fuzzy OLAP Association Rules Mining-Based Modular Reinforcement Learning Approach for Multiagent Systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(2): 326–338.

Liu, B.; Hsu, W.; and Ma, Y. 1998. Integrating Classification and Association Rule Mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, KDD'98, 80–86.

Longo, L.; Goebel, R.; Lecue, F.; Kieseberg, P.; and Holzinger, A. 2020. Explainable Artificial Intelligence: Concepts, Applications, Research Challenges and Visions. In Machine Learning and Knowledge Extraction, 1–16.

Menda, K.; Driggs-Campbell, K. R.; and Kochenderfer, M. J. 2019. EnsembleDAgger: A Bayesian Approach to Safe Imitation Learning. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS'19, 5041–5048.

Moldovan, T. M.; and Abbeel, P. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on Machine Learning, ICML'12.

Puiutta, E.; and Veith, E. M. S. P. 2020. Explainable Reinforcement Learning: A Survey. In Machine Learning and Knowledge Extraction, International Cross-Domain Conference, CD-MAKE'20, 77–95. Springer.

Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M. I.; and Moritz, P. 2015. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, 1889–1897.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. CoRR, abs/1707.06347.