What criminal and civil law tells us about Safe RL techniques to generate law-abiding behaviour

Hal Ashton*
University College London, Gower Street, London
ucabha5@ucl.ac.uk

*Supported by the EPSRC. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Safe Reinforcement Learning (Safe RL) aims to produce constrained policies, with constraints typically motivated by issues of physical safety. This paper considers the issues that arise from regulatory constraints, or issues of legal safety. Without guarantees of safety, autonomous systems or agents (A-bots) trained through RL are expensive or dangerous to train and deploy. Many potential applications for RL involve acting in regulated environments, and here existing research is thin. Regulations impose behavioural restrictions which can be more complex than those engendered by considerations of physical safety. They are often inter-temporal, require planning on behalf of the learner, and involve concepts of causality and intent. By examining the typical types of laws present in a regulated arena, this paper identifies design features that the RL learning process should possess in order to ensure that it is able to generate legally safe or compliant policies.

Introduction

In this position paper I consider the problem of learning a solution to a sequential decision making problem, in an environment governed by some laws, via Reinforcement Learning (RL). I assume that the learned policy should not break these laws, because doing so would incur sanctions from the environment's regulator or law enforcer. By presenting a taxonomy of laws which exist in real life, whose features are relevant to RL, I am able to make some inferences about the general design of an RL process that can produce legal policies.

RL can produce novel policies to solve sequential decision problems. Its potential has been demonstrated in the super-human mastery of Go (Silver et al. 2017) and advanced performance in more complicated games like Starcraft (Vinyals et al. 2019), but adoption in real-life settings has been slowed by safety (including legal) considerations. This is noticeable in financial trading applications, which already use algorithms extensively but have been slow (publicly at least) to adopt RL.

RL requires an environment which allows ample exploration and feedback. In game applications such as Go and computer games, the training environment is the same as the deployment environment and training is costless (excepting the ecological impact of the vast computational power that is often used). Potential real-world applications of RL are often more complex, almost certainly regulated, and the cost of mistakes made in training or deployment could be catastrophic. In such applications, where safety, cost or legality are at issue, one approach is to conduct learning in a simulator of the environment, where the cost of bad policies is negligible. The use of any simulator raises the risk of misspecification and poor generalisation on deployment. An alternative to using a simulator is placing the RL agent very carefully in the target environment with a human overseer ready to take over in tricky spots. This approach has limitations according to the complexity of the task (Saunders et al. 2018). It might not be feasible in applications like trading, where the speed of decision making is beyond the ability of a human overseer to monitor.

Whether learning takes place in a simulator or carefully in the target arena, the ability to generate legal policies with high probability is highly desirable if the policy is to be deployed in a regulated setting. Laws can present different challenges to other types of constraint. A legally transgressive policy might not be obvious in the way a physically transgressive one might be. The nature of laws will dictate the methods of RL used to generate optimal, legal policies.

Background

Markov Decision Processes (MDPs) are a common framework underpinning RL.
In this formulation time is discretised and labelled t = 1, 2, 3, . . .. An MDP is described by a tuple (S, S0, A, T, R, γ) where:

1. S is the set of states in the environment.
2. S0 is a distribution over the initial states of the environment, p(s) for s ∈ S.
3. A is the set of all actions available.
4. T(s, a, s') = P(s'|s, a) is the transition probability distribution: the probability of transitioning to state s' when in state s ∈ S and choosing action a ∈ A.
5. R : S × A → R is the reward function, the feedback mechanism through which learning is possible.
6. γ ∈ (0, 1] is the discount factor, which differentiates the value of rewards received now from those received in the future. In finite horizon cases γ = 1 and can be ignored.

The learner then has the objective of finding a policy function, from the set of all policy functions Π : S → A, which maximises the expected discounted sum of rewards:

\[ \pi^* = \arg\max_{\pi \in \Pi} \; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, \pi\Big] \]

The policy function is often a probability distribution over actions, π(a|s) = P(a|s) for all a ∈ A, s ∈ S.

The Markovian property of this process comes from the transition function. It is satisfied if the probability of transitioning to a new state is determined only by the current state and the chosen action.

An extension of the MDP is the Partially Observable MDP (POMDP). This covers the very probable contingency where the full state of the world is not visible to the decision maker. It is described by the tuple (S, S0, A, T, R, γ, Ω, O). The two additions to the tuple are as follows:

7. Ω is the set of all observations that the learner can receive. For convenience we assume that it includes the reward r_t received in any time period.
8. O = P(ω|s', a) is the probability distribution of receiving the observation ω ∈ Ω after transitioning to state s' when action a was chosen.

The domain of the policy function then becomes the history of all observations and actions, which we write π(a|h_t), where h_t is shorthand for (o_1, a_1, o_2, a_2, . . . , a_{t-1}, o_t). Consequently the complexity of solving a POMDP is much higher than that of an MDP (Abel, MacGlashan, and Littman 2016). Our interest in introducing POMDPs is not so much the partial observability of the problem but the enlarged domain of the policy function which it necessitates. This paper will show that law-abiding policies are likely to depend on the history of observations, regardless of whether there is partial observability or not.
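To make these objects concrete, the following minimal Python sketch (mine, not the paper's; all names are illustrative) represents an MDP as a small data structure, computes the discounted return of a realised trajectory, and contrasts a Markov policy π(a|s) with the history-dependent policy π(a|h_t) that the POMDP view, and later the legal setting, requires.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State, Action, Obs = str, str, str

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    transition: Callable[[State, Action], Dict[State, float]]  # T(s, a) -> distribution over s'
    reward: Callable[[State, Action], float]                   # R(s, a)
    gamma: float = 0.99                                         # discount factor

def discounted_return(trajectory: List[Tuple[State, Action]], mdp: MDP) -> float:
    """Sum_t gamma^t R(s_t, a_t) along one realised trajectory."""
    return sum(mdp.gamma ** t * mdp.reward(s, a) for t, (s, a) in enumerate(trajectory))

# A Markov policy needs only the current state ...
MarkovPolicy = Callable[[State], Dict[Action, float]]           # pi(a | s)

# ... whereas a history-dependent policy, as required once the past matters
# (partial observability, or laws defined over sequences), conditions on
# the whole history h_t = (o_1, a_1, ..., a_{t-1}, o_t).
History = List[Tuple[Obs, Action]]
HistoryPolicy = Callable[[History], Dict[Action, float]]        # pi(a | h_t)

def sample_action(dist: Dict[Action, float]) -> Action:
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]
```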
Structural Causal Models

This paper will show that the legality of behaviour can depend on establishing the causal effects of an action. The definition of causality is a complicated topic, and there is a distinction between predictive causality, which refers to predicting the effect of actions, and actual causality, which refers to evidential analysis after actions have been taken. Structural Causal Models (SCMs) (Pearl 2000) can be used in both senses. SCMs are a special case of Bayesian causal networks in which directed arcs between nodes express direct causality as well as the usual independence statements. Pearl introduces the concept and effect of the atomic intervention, which corresponds to setting a variable to a value in the model; this has strong connections to taking actions in RL. Actual causality is a harder undertaking, since it necessarily involves counterfactual reasoning and has to deal with issues like preemption and overdetermination. Halpern-Pearl causality (Halpern 2016) is a general-purpose definition of actual causality. Alternative, simpler definitions are discussed in Liepiņa, Sartor, and Wyner (2020).
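As a concrete illustration (mine, not drawn from the paper), the sketch below builds a toy SCM in plain Python with explicit structural equations, and compares observational sampling with an atomic intervention do(X = x), implemented by overwriting the structural equation of the intervened variable. The variables (wind, agent action, drone position) are invented placeholders.

```python
import random

def scm_sample(do: dict | None = None) -> dict:
    """One sample from a toy structural causal model.

    Structural equations (illustrative only):
      WIND         := exogenous noise
      ACTION       := the agent's flight command (normally set by its policy)
      OVER_VOLCANO := f(WIND, ACTION)
    An atomic intervention do(X = x) replaces the structural equation for X
    with the constant x, severing X from its usual causes.
    """
    do = do or {}
    wind = do.get("WIND", "strong" if random.random() < 0.2 else "calm")
    action = do.get("ACTION", random.choice(["fly_north", "hold_position"]))
    over_volcano = do.get(
        "OVER_VOLCANO",
        (action == "fly_north") or (wind == "strong" and random.random() < 0.5),
    )
    return {"WIND": wind, "ACTION": action, "OVER_VOLCANO": over_volcano}

def estimate(event, n: int = 10_000, do: dict | None = None) -> float:
    return sum(bool(event(scm_sample(do))) for _ in range(n)) / n

# Observational probability of the restricted state ...
p_obs = estimate(lambda s: s["OVER_VOLCANO"])
# ... versus its probability under the intervention do(ACTION = hold_position).
p_do = estimate(lambda s: s["OVER_VOLCANO"], do={"ACTION": "hold_position"})
print(f"P(over volcano) = {p_obs:.2f}, P(over volcano | do(hold)) = {p_do:.2f}")
```

A deployed system would need a learned or supplied SCM over its real state variables; the point here is only the mechanical difference between conditioning on an action and intervening with it.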
Intent in RL

RL has been used to infer the intent of others (Qi and Zhu 2018), and even, in IRL, to define a reward function that corresponds to the intentions of an expert demonstrator (MacGlashan and Littman 2015). However, it has never defined what intent means for the learner itself. Ashton (2020b) presents a definition of direct intent in terms of causality and the desirability of realised states: an agent directly intends a state s by committing an action a if a foreseeably causes s and the agent aims at or desires state s. In the context of RL, where an agent has a value function over every possible state, inferences can be made about what a learner desires. Within criminal law, different levels of intent are required for different crimes (Loveless 2013). Direct intent is the strictest, being required for murder, but lower levels such as oblique intent, recklessness and negligence also exist. Whilst these lower levels of intent do not necessarily have any requirements about desire, their definitions often include subjective and objective tests of foreseeability vis-à-vis the prohibited outcomes of actions. Subjective tests raise interesting questions in model-free modes of RL, since the learner does not explicitly expect any outcome of its action. Objective tests require an external judgement about the probability of an outcome given an action. If a consequence of an action was foreseeable, then the offender can be thought of as intending the outcome. Lagioia and Sartor (2020) discuss this method of intent inference and consider it sufficient, albeit principally in the context of Belief-Desire-Intention (BDI) agents (Cohen and Levesque 1990). An intriguing corollary of the use of objective tests is that the predictive model that the RL agent uses or learns should be accurate. This bypasses the danger of the learner developing a 'delusion box' type model (Ring and Orseau 2011) to justify otherwise illegal policies.

A non-exhaustive taxonomy of laws

In this section I present a number of law-types which are likely to exist in a regulated environment. I differentiate between states and actions. Actions are assumed to be originated by the learner only, and their commission is voluntary. States refer to some measurable property of the environment, stable for the duration of the time period. Actions cause a measurable change in the environment, but I assume that their duration is instantaneous, so there is no state that records an action in progress.

Simple state restriction laws

This is the simplest type of law and the one which Safe RL research has concentrated on, as many physical safety constraints are of this type. Examples might include 'drive below 30 miles per hour in urban environments' or 'do not fly a drone in the vicinity of the airport'. For a state restriction to be a law, its realisation should be avoidable by the learner through its actions. This is the case for the speed restriction example. Even though I have described these laws as simple, they might be context dependent. Roads with a solid central line generally do not allow overtaking, but the presence of a stationary vehicle blocking progress might allow it, providing it is safe to do so.

Caused state restriction laws

Some states exist which could be caused both through the learner's actions and through an external mechanism. This marks the first departure from conventional Safe RL research, because the safety constraints traditionally considered do not differentiate between those states caused by the learner and those that are not. It does not matter whether the drone caused itself to fly over the volcano or whether it was blown there by the wind; the state of being located over the volcano is the one to be avoided. For legal restrictions, certain states might only be restricted if they were caused by the learner¹. A concrete example of such a causally dependent state restriction can be found within financial markets, where the UK's financial regulator prohibits trading algorithms from creating or contributing to disorderly market conditions. Such conditions could arise independently of the behaviour of the learner; if the learner has no mechanism for determining whether this is the case, learning efficiency will be compromised. For a caused state restriction law to be broken, two conditions should be satisfied: firstly, a restricted state ŝ should occur, and secondly, the actions of the learner should have foreseeably caused the state to occur.

Action sequence laws

Laws exist in a variety of settings which are restrictions on conduct, with no necessary requirement for a lasting change in the state of the world. Examples in the UK include the offence of Careless Driving and, more seriously, Dangerous Driving. There is no causation requirement, since I assume that the A-bot has the freedom to choose its own actions at any time². Action sequence laws could be transformed into a simple state restriction law by adding a state variable that indicates whether a restricted action sequence has occurred, as sketched below. Such an approach might not be efficient if a large number of conduct laws exist in the environment, as it would cause the dimension of the state space to grow.

¹ Death is not generally prohibited, but causing it generally is!
² Situations where there is no action which won't break a law are analogous to the concept of deadlock in model checking (Baier and Katoen 2008). Laws can still be broken when the perpetrator had no choice but to break the law, but the defence of necessity might then be valid as mitigation.
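The state-variable transformation described above can be pictured as running a small monitor automaton alongside the environment and appending its memory to the state. The sketch below is my own illustration: the three-step 'weaving' pattern stands in for a careless-driving-style conduct rule, and real conduct offences would need far richer monitors.

```python
from typing import List, Tuple

# Hypothetical restricted conduct: three consecutive lane changes
# in alternating directions ("weaving"), regardless of the resulting state.
RESTRICTED_PATTERNS: List[Tuple[str, ...]] = [
    ("lane_left", "lane_right", "lane_left"),
    ("lane_right", "lane_left", "lane_right"),
]

class ConductMonitor:
    """Finite-state monitor that flags a restricted action sequence.

    Running this monitor next to the MDP and appending `self.violated`
    to the state turns an action-sequence law into a (larger) simple
    state restriction on the augmented state space.
    """

    def __init__(self, window: int = 3):
        self.window = window
        self.recent: List[str] = []
        self.violated = False

    def step(self, action: str) -> bool:
        self.recent = (self.recent + [action])[-self.window:]
        if tuple(self.recent) in RESTRICTED_PATTERNS:
            self.violated = True
        return self.violated

monitor = ConductMonitor()
for a in ["accelerate", "lane_left", "lane_right", "lane_left"]:
    flag = monitor.step(a)
print("restricted sequence occurred:", flag)
```

Each additional conduct law adds another monitor (or a product of automata), which is exactly the state-space growth the paragraph above warns about.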
Mixed state action sequence laws

Laws exist which combine a sequence of actions with the restricted state(s) that the sequence causes. Continuing the driving examples from the previous section, in the UK there exist statutory offences of causing death by careless or dangerous driving. These laws constitute a restriction on how certain states are arrived at.

Inchoate offences: laws restricting behaviour that may induce future restricted states

Inchoate offences are restrictions on action sequences and states which may lead to restricted states that are not necessarily realised. Examples in the UK include attempt crimes, such as attempted murder or possession of drugs with intent to supply. In the USA, conspiracy and solicitation (the request, encouragement or payment for someone else to commit a felony) are major classes of inchoate offence.

Laws requiring intent

Common law as practised in the UK, USA, India and Canada, amongst others, requires that the accused had mens rea (the mental element of intent to commit a crime) for certain criminal offences to have been committed by them. Different levels of intent exist, ranging from direct intent, where the accused deliberately caused and wanted to cause a prohibited outcome, at the extreme, through oblique intent, where prohibited outcomes were caused as a side effect of their behaviour, to recklessness and negligence, where the prohibited outcomes were foreseeable outcomes of their behaviour to various degrees. Certain offences specify what level of mens rea is required; murder, for instance, requires direct or oblique intent.

Aside from crimes of specific intent, certain laws exist which require establishing for what purpose the accused did something. This is called basis intent by Bathaee (2011). Examples cited include US market anti-spoofing laws, spoofing being defined as the placement of orders with the intent to cancel them. Another related and topical example is termed gatekeeping intent by Bathaee: laws or systems which are discriminatory in effect are only unlawful if discriminatory intent behind them is established.
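The definition of direct intent given in the Background (foreseeable causation plus desire) suggests one mechanical reading that an overseer could apply to an RL agent: an action directly intends a restricted state if the agent's own model makes that state a foreseeable consequence of the action, and the agent's value estimates mark the state as desired. The sketch below is my own illustration of that reading, with an invented foreseeability threshold; it is not a legal test.

```python
from typing import Callable, Dict

State, Action = str, str

def directly_intends(
    state: State,
    action: Action,
    current: State,
    predict: Callable[[State, Action], Dict[State, float]],  # agent's world model P(s' | s, a)
    value: Callable[[State], float],                          # agent's learned value function V(s)
    foreseeability_threshold: float = 0.3,                    # illustrative cut-off, not a legal standard
) -> bool:
    """Crude proxy for direct intent: foreseeable causation plus desire.

    (i) Foreseeable causation: under the agent's own predictive model,
        `action` makes `state` sufficiently likely.
    (ii) Desire: the agent values `state` above the status quo, so reaching
        it plausibly forms part of its aim.
    """
    foreseeable = predict(current, action).get(state, 0.0) >= foreseeability_threshold
    desired = value(state) > value(current)
    return foreseeable and desired
```

Oblique intent, recklessness and negligence would relax the desire condition and vary the foreseeability test (the agent's subjective model versus an objective, externally supplied one).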
Implications of the taxonomy

There exist a number of challenges in developing an RL method which produces legal policies under a rich set of laws. I classify them into three areas:

1. Encoding The environment's laws need to be described in such a way as to be machine interpretable.
2. Determination A mechanism needs to exist which can determine the legality of any behaviour, either in advance or in retrospect.
3. Constrained policy learning There should exist a method to constrain the policy of the A-bot to be law-abiding when it is acting or learning.

Generally these problems should be solved in the order displayed. The determination of legality requires an encoding of law against which to reference (planned) behaviour, and constrained policy learning relies on knowing which policies are acceptable and which are not. By looking at the taxonomy of laws, some inferences can be made about all three elements of this process. I will run through each of the three tasks and make comments on how laws affect them. In practice the three tasks might be heavily intertwined.

Encoding

Safe RL research is only beginning to pay attention to how laws should be described. This is because the types of restrictions considered have largely been of the simple state type, which can be encoded using simple algebraic expressions. This approach becomes untenable when considering more complicated laws like the ones identified in the previous section. Furthermore, the quantity of applicable rules in regulated environments is larger than most rulesets hitherto considered in research. I identify four desirable features of an encoding which should be used to convey laws.

1. Temporal A number of laws restrict sequences of states or actions. Moreover, there is no requirement that these sequences are contiguous. An encoding needs to be rich enough to express multiple states and actions with temporal relations like always, until, next, etc.
2. Probabilistic As in the case of inchoate offences, some laws refer to future states that are not realised. Since the space of all possible future events is a large one, law reasonably concerns itself with restricting foreseeable consequences of behaviours. An encoding of laws should be rich enough to express this.
3. Causal As we saw in the previous section, laws will often prohibit the causation of a state, not the state itself per se. Our death is not usually prohibited, but causing it normally is. Causation, when considered in prospect, will also normally require some sort of probabilistic reasoning.
4. Intent Certain laws require establishing levels of intent on the part of the transgressor. Different levels of intent exist and are applicable to different offences.

In Table 1 I summarise what level of encoding expressiveness is required for each of the law types described in the previous section. Nearly all of the laws require temporal expressiveness. Temporal logic systems allow us to express when conditions should be true. A wide variety exist, such as Linear Temporal Logic (LTL) (Pnueli 1977), Computation Tree Logic (CTL) (Clarke and Emerson 1981), which considers multiple future paths, and Probabilistic CTL (PCTL) (Hansson and Jonsson 1994), which, as the name suggests, accounts for probabilistic transitions. Kleinberg and Mishra (2009) extend this to provide a language capable of expressing causal relationships. To our knowledge there is no similar extension to express intention, and this is an immediate project. Alves, Dennis, and Fisher (2020) succeed in encoding the road junction rules for an autonomous vehicle using a variant of LTL. An illustrative encoding is sketched at the end of this subsection.

The analysis is not exhaustive. For example, in any given regulated environment there are likely to be a large number of rules that the learner should obey simultaneously. This is likely to mean deadlock situations arise, where not breaking one law may result in the breaking of another. A meta-ordering of laws may be required to deal with this situation.

Table 1: The complexity of the encoding language required is dependent on the law type.

Law type | Temporal | Probabilistic | Causal | Intent
State restriction | – | – | – | –
Caused state restriction | Maybe | Maybe | Yes | –
Action sequence | Yes | – | – | –
Action state sequence | Yes | Maybe | Maybe | –
Inchoate | Yes | Yes | Yes | Maybe
Intent | Yes | Yes | Yes | Yes
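To give a flavour of what such encodings might look like (my illustration, not drawn from the paper; the propositions urban, speed>30, overtake, manoeuvre_complete and prohibited are invented placeholders), a simple state restriction, a non-contiguous conduct restriction and an inchoate-style foreseeability restriction could be written, respectively, as

\[ \mathbf{G}\,\big(\textit{urban} \rightarrow \neg(\textit{speed} > 30)\big), \]
\[ \mathbf{G}\,\neg\big(\textit{overtake} \wedge \mathbf{X}\,(\neg\textit{manoeuvre\_complete} \;\mathbf{U}\; \textit{overtake})\big), \]
\[ \mathcal{P}_{< p}\,\big[\,\mathbf{F}\,\textit{prohibited}\,\big]. \]

The first two are checkable against a trace, the third needs a probabilistic model, and, as the text notes, caused-state and intent-based laws are not expressible in these logics as they stand: they additionally reference a causal model and the agent's own objectives.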
Determination

Given that we have a codified set of rules, we require a device to determine when a sequence of behaviour has contravened, or is contravening, a law. I assume that the states referred to in the encoding are either the same as those perceived by the learner, or that there is an available mapping function between the learning and encoding state spaces. This is of course not a given, since the learner may be perceiving continuous states while the encoding is likely to refer to high-level states. In the case of simple state restrictions this is not a difficult task, but it becomes increasingly complex with richer laws. I have separated the requirements into four main features.

1. Domain Consider an arbiter function Γ which determines legality, mapping behaviour to the binary set {legal, illegal}. The domain of this function is dependent on the type of law it is considering. Laws which reference more than one state or action, for example, require a domain which includes the history of states and actions h_t = (s_1, a_1, s_2, a_2, . . . , s_t). Laws which reference future paths will also require the policy function of the learner π(·). Laws which require intent may also require information about the reward function of the learner or its estimation of state values.
2. Future projection A model of the environment is required for almost all laws. In the case where strong safety is required in training, the legality of any action needs to be assessed before choosing it, and this means assessing the likely state transitions which occur as a result. Projection is also a requirement in causal reasoning.
3. Causal reasoning As certain laws are defined by causation, a method is required to determine whether restricted states have been, or are likely to be, caused by the learner's action. This requires a causal model of the environment.
4. Intentional reasoning In environments where laws are defined by intent, a learner must be aware of what it is intending to do (i.e. its likely policy trajectory) by choosing a particular action at any moment in time.

In Table 2 I show the necessary features of a legality determination process according to the law type present. Reasoning about intent requires an algorithmic definition of intent. This is an open area of research, since the concept of intent has been deliberately left as a primitive by legal practitioners. Care must be taken to ensure that the definition of intent used in Safe RL corresponds to what a court would find sufficient.

Table 2: Taxonomy implications for the determination process. The arbiter function's domain comprises the state path, action path and policy; reasoning covers causation and intent.

Law type | Future trajectory prediction | State path | Action path | Policy | Causal | Intent
State restriction | – | Maybe | – | – | – | –
Caused state | Yes | Yes | – | – | Yes | –
Action sequence | – | – | Yes | – | – | –
Action state sequence | Maybe | Yes | Yes | – | Maybe | –
Inchoate | Yes | – | Yes | Probably | – | –
Intent | Yes | – | – | Yes | Yes | Yes
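One possible shape for the arbiter Γ, reflecting the law-type-dependent domain described in item 1, is sketched below. The signature, the law-type flags and the example law are my own illustration, not a prescribed interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State, Action = str, str
History = List[Tuple[State, Action]]

@dataclass
class Law:
    name: str
    test: Callable[..., bool]         # returns True if the behaviour is legal
    needs_history: bool = False       # state/action paths (Table 2)
    needs_policy: bool = False        # future paths, e.g. inchoate offences
    needs_values: bool = False        # intent: reward function / value estimates
    needs_causal_model: bool = False  # caused-state restrictions

def gamma(
    law: Law,
    history: History,
    policy: Optional[Callable[[History], Dict[Action, float]]] = None,
    value: Optional[Callable[[State], float]] = None,
    causal_model: Optional[Callable[[History, State], bool]] = None,
) -> bool:
    """Arbiter: True (legal) or False (illegal) for the behaviour so far.

    Only the ingredients a law actually needs are handed to its test: a simple
    state restriction inspects the latest state alone, while richer law types
    receive the history, the learner's policy, its value estimates and/or a
    causal model of the environment.
    """
    kwargs = {"history": history}
    if law.needs_policy:
        kwargs["policy"] = policy
    if law.needs_values:
        kwargs["value"] = value
    if law.needs_causal_model:
        kwargs["causal_model"] = causal_model
    return law.test(**kwargs)

# Example: a simple state restriction only inspects the most recent state.
speed_law = Law(
    name="urban speed limit",
    test=lambda history, **_: not history or history[-1][0] != "urban_over_30",
    needs_history=True,
)
print(gamma(speed_law, history=[("urban_over_30", "accelerate")]))  # False: illegal
```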
Learning process

The taxonomy of laws informs us how those laws should be described and what the process which determines the legality of behaviour requires. Finally, it can also inform us about the properties of RL methods which will generate legally constrained policies.

1. Memory Reinforcement learning approaches typically use an MDP formulation to model their task. Whilst a record of the current state might still be valid for the transition model, most of the law types that I have identified rely to some extent on sequences of states and actions. Thus the device which chooses actions at any state (most likely the policy function) must include histories in its domain. Otherwise the standard MDP learner would not be able to determine whether its current action is legal or not. Including the history of actions and states is something that POMDP methods already do, in order to make inferences about the hidden states of the model.
2. Model for planning Determining the legality of any action requires predicting the likelihood of future states. How far prediction is expected to go into the future depends on the laws present - avoiding inchoate offences presumably requires greater foresight. Much of RL is 'model free', and successfully so, but models seem unavoidable here. Established methods like Dyna-Q simultaneously learn to act and create a world model (Sutton 1990).
3. Causal model Determining whether a law has been broken or not will often require some test of causality ex post (Turner 2019). The task of the learner is to make sure ex ante that it does not cause a restricted state to occur, and, if such a state does occur, that it is not subsequently adjudged to have been a legal cause of it. The presence of causal restrictions necessitates that a causal model of the world be formed for predictions. Bayesian causal models, or equivalently Structural Causal Models (SCMs) (Pearl 2000), can be used to predict causal effects and to determine causality ex post. They readily accept techniques like counterfactual analysis, which allows off-policy data treatment (Bareinboim and Pearl 2016) - important for off-policy RL methods. There also exist definitions of causality based on SCMs, such as the actual causality of Halpern (2016), which are capable of dealing with the trickier causal problems of overdetermination, preemption and omission. See Bareinboim (2020) for an introduction to causal RL. A minimal sketch combining these three requirements appears below.
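The following sketch (mine; it assumes a hypothetical arbiter and a one-step world model like those sketched earlier) shows how the three requirements fit together at decision time: the chooser conditions on the history, rolls candidate actions forward through a model, and discards actions whose projected continuations the arbiter judges illegal.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

State, Action = str, str
History = List[Tuple[State, Optional[Action]]]

def choose_legal_action(
    history: History,                                      # memory of past states and actions
    current: State,
    candidates: List[Action],
    model: Callable[[State, Action], Dict[State, float]],  # learned or supplied world model P(s'|s,a)
    is_legal: Callable[[History], bool],                   # arbiter applied to projected histories
    q_value: Callable[[State, Action], float],             # ordinary learned action values
    n_samples: int = 20,
) -> Action:
    """Pick the highest-value action whose projected continuations stay legal.

    Only a sketch: real laws may need multi-step rollouts (inchoate offences),
    causal attribution of the projected states, and a principled fallback when
    no action is projected legal (the deadlock/necessity situation noted in
    the taxonomy's footnotes).
    """
    legal = []
    for a in candidates:
        next_dist = model(current, a)
        sampled_next = random.choices(
            list(next_dist), weights=list(next_dist.values()), k=n_samples
        )
        if all(is_legal(history + [(current, a), (s_next, None)]) for s_next in sampled_next):
            legal.append(a)
    if not legal:
        raise RuntimeError("no projected-legal action available; defer to a human overseer")
    return max(legal, key=lambda a: q_value(current, a))
```

Structurally this resembles the shields discussed in the related work below, with the legality check supplied by the arbiter rather than by a hand-written safety automaton.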
Related work

The task of learning a legally constrained policy through RL has seldom been addressed in isolation; instead it is usually cited as a possible use case in more general Safe RL work. Surprisingly, the learning of ethical policies has loomed larger in published research. For an RL approach see Abel, MacGlashan, and Littman (2016), or Winfield et al. (2019) for a more general discussion on ethically constraining autonomous systems. Ethical constraint is a harder task, since there is no agreed source of ethical constraints to apply to the learner. In contrast to those of ethics, Hildebrandt (2019) points out that questions of legality always have closure. It is important to observe that there is no single source of the law; the determination of legality will likely rely on referencing multiple sources (Boella et al. 2014).

García and Fernández (2015) provide a general survey of Safe RL, dividing approaches into those that modify the reward structure and those that modify the exploration process. Constrained Markov Decision Processes (CMDPs) do the former, by adding a finite set of auxiliary cost functions C_i : S × A → R to the vanilla MDP. Policies should then achieve a discounted total cost, in expectation, less than some scalar d_i (whilst maximising the normal reward function). This is largely the approach of Constrained Policy Optimisation presented by Achiam et al. (2017). A drawback of such an approach is that bad states can be reached during exploration, making learning outside a simulator potentially expensive. Constraints introduced as cost functions also need to be differentiable and Markovian if certain gradient methods are to be used. Neither of these restrictions applies to the Constrained Cross-Entropy method of Wen and Topcu (2018), though perhaps at the cost of data parsimony.
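Written out (my notation, following the CMDP description above rather than any one paper), the constrained problem is

\[ \pi^* = \arg\max_{\pi \in \Pi} \; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, \pi\Big] \quad \text{subject to} \quad \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t) \,\Big|\, \pi\Big] \le d_i, \qquad i = 1, \dots, k. \]

The per-step, Markovian form of each auxiliary cost C_i is precisely what makes the history-dependent, causal and intent-based laws of the taxonomy awkward to express in this framework.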
Safe RL methods which constrain exploration include approaches where a policy is learnt from observing a safe policy, in a process known as Inverse RL or Apprenticeship RL (Abbeel and Ng 2004). Recent examples include Noothigattu et al. (2018), who train a learner to play Pac-Man following the rule 'don't eat the ghosts' through expert demonstration and a bandit policy which alternates between observed 'safe' behaviour and optimal self-taught behaviour. Abel, MacGlashan, and Littman (2016) present a method where the ethical preferences of an expert are derived through observation and then used to develop policies accordingly. IRL approaches such as these obviate the requirement for an explicit representation of rules. This could be seen as a good feature in constrained tasks such as the learning of ethical behaviour or customs, where there is no written source of what the constraints should be. This is not the case in regulated settings. Moreover, IRL is an ill-posed problem - many reward functions exist to explain any observed behaviour. To make the problem tractable, simplifying assumptions must be made about the reward function's form. The resulting reward function might not be rich enough to encode the preferences required not to break all laws. In particular, Arnold, Kasenberg, and Scheutz (2017) note that IRL does not infer intertemporal rules.

A developing area of Safe RL comprises methods which combine formal methods based on symbolic logic with the learning machinery of RL. Many of these techniques originate from the research area of formal verification and model checking (Baier and Katoen 2008). These are the techniques developed to error-check software systems and provide stronger guarantees of correctness. As discussed, temporal logic allows a richer expressiveness of laws. Different temporal logic systems have been applied to the learning of policies in MDPs where transitions are known or not. Linear Temporal Logic (LTL) is used in Hasanbeig and Kroening (2020), Fu and Topcu (2015) and Wen, Ehlers, and Topcu (2015), and Differential Dynamic Logic is used in Fulton and Platzer (2018). Probabilistic Computation Tree Logic (PCTL) is used in Mason et al. (2017).

Alshiekh et al. (2018) and Jansen et al. (2018) use a structure called a shield to create safe policies through RL. This is a system which sits between the learner and the environment and either filters the choice of available actions for the learner at learning time, or replaces unwise actions in deployment. Ashton (2020a) calls the design a legal oracle and explores its necessary features in a legal setting. A shield has a model of the environment, knows the required constraints (which are described in temporal logic), and is able to use formal program verification methods to check the legality of any action at any moment in time. An attractive feature of this method is that the mechanism of constraint is separate from, and somewhat agnostic to, the method of learning. Jansen et al. (2020) identify three challenges to this approach: model checking is computationally expensive; safety in a probabilistic environment is not binary, so thresholds need to be considered; and finally shielding may obstruct efficient exploration, thereby generating sub-optimal policies.

Seldonian Reinforcement Learning (Thomas et al. 2019) is a recent technique that aims to produce RL algorithms that only output safe policies with a certain (high) probability. It differs from the other methods discussed in this paper in that the technique searches for learning algorithms, not policies. The RL example presented in the paper has restrictions of limited complexity, so we will have to wait for more published research to assess this method properly.

Conclusion

This paper is motivated by an aim to design Safe RL processes which are capable of producing policies constrained under a general rule set. By creating a brief taxonomy of laws in the language of states and actions, specifically for this application, I have been able to draw some conclusions about the requirements of legally safe RL. Laws are commonly defined in inter-temporal ways over actions and states. This means that a learning process must include a memory of past states and actions; thus the domain of a legal policy function will include history, just as it does in RL under a POMDP. Causality and intent can be key concepts in determining whether, and which, laws have been broken. Whilst RL is beginning to tackle causality, it has not done so in the context of constrained learning. Intent is barely defined quantitatively, but it will have to be if generally legal RL systems are to be produced. Causality, intent and the existence of inchoate offences mean that a legally safe RL algorithm will require prediction of likely future trajectories. This in turn requires some type of environment model to be learned or supplied to the learner, and planning to take place.

This work could be viewed as an application of legal requirements engineering. Care should be taken, since it originates singularly from a computer scientist and not a legal practitioner (Boella et al. 2014). Yet it is a starting point which can at least begin to inform engineers.

References

Abbeel, P.; and Ng, A. Y. 2004. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the 21st International Conference on Machine Learning (ICML). doi:10.1145/1015330.1015430.

Abel, D.; MacGlashan, J.; and Littman, M. L. 2016. Reinforcement learning as a framework for ethical decision making. AAAI Workshop - Technical Report WS-16-01: 54-61.

Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained Policy Optimization. URL http://arxiv.org/abs/1705.10528.

Alshiekh, M.; Bloem, R.; Ehlers, R.; Könighofer, B.; Niekum, S.; and Topcu, U. 2018. Safe Reinforcement Learning via Shielding. In AAAI Conference on Artificial Intelligence. URL https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17211/16534.

Alves, G. V.; Dennis, L.; and Fisher, M. 2020. Formalisation and Implementation of Road Junction Rules on an Autonomous Vehicle Modelled as an Agent. In Formal Methods. FM 2019 International Workshops, volume 1, 217-232. Springer International Publishing. doi:10.1007/978-3-030-54994-7_16.

Arnold, T.; Kasenberg, D.; and Scheutz, M. 2017. Value alignment or misalignment - What will keep systems accountable? AAAI Workshop - Technical Report WS-17-01: 81-88.

Ashton, H. 2020a. AI Legal Counsel to train and regulate legally constrained Autonomous systems. In IEEE Big Data 2020 Workshop on Applications of Artificial Intelligence in the Legal Industry, forthcoming.

Ashton, H. 2020b. Definitions of intent for AI derived from common law. In Jurisin 2020: 14th International Workshop on Juris-informatics. URL https://easychair.org/publications/preprint/GfCZ.

Baier, C.; and Katoen, J.-P. 2008. Principles of Model Checking. MIT Press. URL http://mitpress.mit.edu/books/principles-model-checking.

Bareinboim, E. 2020. Causal Reinforcement Learning (CRL). URL https://crl.causalai.net/.

Bareinboim, E.; and Pearl, J. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113(27): 7345-7352. doi:10.1073/pnas.1510507113.

Bathaee, Y. 2011. The artificial intelligence black box and the failure of intent and causation. Harvard Journal of Law & Technology 2(4): 31-40.

Boella, G.; Humphreys, L.; Muthuri, R.; Rossi, P.; and van der Torre, L. 2014. A critical analysis of legal requirements engineering from the perspective of legal practice. In 2014 IEEE 7th International Workshop on Requirements Engineering and Law (RELAW), 14-21. doi:10.1109/RELAW.2014.6893476.

Clarke, E. M.; and Emerson, E. A. 1981. Design and synthesis of synchronization skeletons using branching time temporal logic. In Workshop on Logic of Programs, 52-71.

Cohen, P. R.; and Levesque, H. J. 1990. Intention is choice with commitment. Artificial Intelligence 42(2-3): 213-261. doi:10.1016/0004-3702(90)90055-5.
Fu, J.; and Topcu, U. 2015. Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints. doi:10.15607/rss.2014.x.039.

Fulton, N.; and Platzer, A. 2018. Safe reinforcement learning via formal methods: Toward safe control through proof and learning. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, 6485-6492.

García, J.; and Fernández, F. 2015. A Comprehensive Survey on Safe Reinforcement Learning. Journal of Machine Learning Research 16: 1437-1480.

Halpern, J. Y. 2016. Actual Causality. MIT Press, 1st edition.

Hansson, H.; and Jonsson, B. 1994. A logic for reasoning about time and reliability. Formal Aspects of Computing 6(5): 512-535. doi:10.1007/BF01211866.

Hasanbeig, M.; and Kroening, D. 2020. Cautious Reinforcement Learning with Logical Constraints. doi:10.5555/3398761.3398821.

Hildebrandt, M. 2019. Closure: on ethics, code and law. In Law for Computer Scientists, chapter 11. Oxford University Press.

Jansen, N.; Junges, S.; Könighofer, B.; and Bloem, R. 2018. Shielded Decision-Making in MDPs.

Jansen, N.; Könighofer, B.; Junges, S.; Serban, A.; and Bloem, R. 2020. Safe Reinforcement Learning Using Probabilistic Shields. In 31st International Conference on Concurrency Theory, CONCUR 2020, 1-3. doi:10.4230/LIPIcs.CONCUR.2020.3.

Kleinberg, S.; and Mishra, B. 2009. The temporal logic of causal structures. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, UAI 2009, 303-312. URL https://arxiv.org/abs/1205.2634v1.

Lagioia, F.; and Sartor, G. 2020. AI Systems Under Criminal Law: a Legal Analysis and a Regulatory Perspective. Philosophy and Technology 33(3): 433-465. doi:10.1007/s13347-019-00362-x.

Liepiņa, R.; Sartor, G.; and Wyner, A. 2020. Arguing about causes in law: a semi-formal framework for causal arguments. Artificial Intelligence and Law 28(1): 69-89. doi:10.1007/s10506-019-09246-z.

Loveless, J. 2013. Mens Rea: Intention, Recklessness, Negligence and Gross Negligence. In Complete Criminal Law: Text, Cases and Materials, chapter 3, 91-150. OUP Oxford. doi:10.1093/he/9780199646418.003.0003.

MacGlashan, J.; and Littman, M. L. 2015. Between imitation and intention learning. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Mason, G.; Calinescu, R.; Kudenko, D.; and Banks, A. 2017. Assured reinforcement learning with formally verified abstract policies. In ICAART 2017 - Proceedings of the 9th International Conference on Agents and Artificial Intelligence, volume 2, 105-117. doi:10.5220/0006156001050117.

Noothigattu, R.; Bouneffouf, D.; Mattei, N.; Chandra, R.; Madan, P.; Varshney, K.; ...; and Rossi, F. 2018. Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration. URL http://arxiv.org/abs/1809.08343.

Pearl, J. 2000. Causality: Models, Reasoning and Inference. Cambridge University Press.

Pnueli, A. 1977. The temporal logic of programs. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS 1977), 46-57. doi:10.1109/sfcs.1977.32.
Qi, S.; and Zhu, S.-C. 2018. Intent-aware Multi-agent Reinforcement Learning. In IEEE International Conference on Robotics and Automation (ICRA), 7533-7540. doi:10.1109/ICRA.2018.8463211.

Ring, M.; and Orseau, L. 2011. Delusion, Survival, and Intelligent Agents. In Conference on Artificial General Intelligence (AGI-11). doi:10.1007/978-3-642-22887-2.

Saunders, W.; Stuhlmüller, A.; Sastry, G.; and Evans, O. 2018. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2067-2069.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; ...; and Hassabis, D. 2017. Mastering the game of Go without human knowledge. Nature 550(7676): 354-359. doi:10.1038/nature24270.

Sutton, R. S. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning, 216-224.

Thomas, P. S.; da Silva, B. C.; Barto, A. G.; Giguere, S.; Brun, Y.; and Brunskill, E. 2019. Preventing undesirable behavior of intelligent machines. Science 366(6468). doi:10.1126/science.aag3311.

Turner, J. 2019. Robot Rules. Palgrave Macmillan.

Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; ...; and Silver, D. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782): 350-354. doi:10.1038/s41586-019-1724-z.

Wen, M.; Ehlers, R.; and Topcu, U. 2015. Correct-by-synthesis reinforcement learning with temporal logic constraints. In IEEE International Conference on Intelligent Robots and Systems (IROS 2015), 4983-4990. doi:10.1109/IROS.2015.7354078.

Wen, M.; and Topcu, U. 2018. Constrained Cross-Entropy Method for Safe Reinforcement Learning. In Advances in Neural Information Processing Systems 31 (NeurIPS), 7461-7471. URL http://papers.nips.cc/paper/7974-constrained-cross-entropy-method-for-safe-reinforcement-learning.pdf.

Winfield, A. F. T.; Michael, K.; Pitt, J.; and Evers, V. 2019. Machine ethics: The design and governance of ethical AI and autonomous systems. Proceedings of the IEEE 107(3): 509-517. doi:10.1109/JPROC.2019.2900622.