<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>What criminal and civil law tells us about Safe RL techniques to generate law-abiding behaviour</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hal Ashton</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University College London, Gower Street, London</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Safe Reinforcement Learning (Safe RL) aims to produce constrained policies with constraints typically motivated by issues of physical safety. This paper considers the issues that arise from regulatory constraints or issues of legal safety. Without guarantees of safety, autonomous systems or agents (A-bots) trained through RL are expensive or dangerous to train and deploy. Many potential applications for RL involve acting in regulated environments and here existing research is thin. Regulations impose behavioural restrictions which can be more complex than those engendered by considerations of physical safety. They are often inter-temporal, require planning on behalf of the learner and involve concepts of causality and intent. By examining the typical types of laws present in a regulated arena, this paper identifies design features that the RL learning process should possess in order to ensure that it is able to generate legally safe or compliant policies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In this position paper I will consider the problem of learning
a solution to a sequential decision making problem in an
environment governed by some laws via Reinforcement
learning (RL). I will assume that the learned policy should not
break these laws because doing so would impose sanctions
by the environment’s regulator or law enforcer. By
presenting a taxonomy of laws which exist in real life, whose
features are relevant to RL, I am able to make some inferences
about the general design of an RL process that can produce
legal policies.</p>
      <p>
        RL can produce novel policies to solve sequential
decision problems. Its potential has been demonstrated in the
super-human mastery of Go
        <xref ref-type="bibr" rid="ref48">(Silver et al. 2017)</xref>
        and
advanced performance in more complicated games like
Starcraft
        <xref ref-type="bibr" rid="ref52">(Vinyals et al. 2019)</xref>
        but adoption in real-life settings
has been slowed by safety (including legal) considerations.
This is noticeable in financial trading applications, which
already use algorithms extensively but have been slow
(publicly at least) to adopt RL.
      </p>
      <p>RL requires an environment which allows ample
exploration and feedback. In game applications such as Go and
computer games the training environment is the same as the
deployment environment and training is costless (excepting
the ecological impact of the vast computational power that is
often used). Potential real world applications of RL are
often more complex, almost certainly regulated, and the cost
of mistakes made in training or deployment could be
catastrophic. In such applications where safety, cost or legality
are at issue, one approach is to conduct learning in a simulator
of the environment where the cost of bad policies is
negligible. The use of any simulator raises the risk of
misspecification and poor generalisation on deployment. An alternative
to using a simulator is placing the RL agent very carefully
in the target environment with a human overseer ready to
take over in tricky spots. This approach has limitations
according to the complexity of the task (Saunders et al. 2018).
It might not be feasible to use this approach in applications
like trading because the speed of decision making is beyond
the ability of a human overseer to monitor.</p>
      <p>Whether learning takes place in a simulator or carefully
in the target arena, the ability to generate legal policies with
high probability is highly desirable if the policy is to be
deployed in a regulated setting. Laws can present different
challenges to other types of constraint. A legally
transgressive policy might not be obvious in the way a physically
transgressive one might be. The nature of laws will dictate
the methods of RL used to generate optimal, legal policies.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>Markov Decision Processes (MDPs) are a common
framework underpinning RL. In this formulation time is
discretised and labelled t = 1, 2, 3, .... An MDP is described by a
tuple (S, s₀, A, T, R, γ) where:
1. S is the set of states in the environment.
2. s₀ is a distribution over the initial states of the
environment, p(s) for s ∈ S.
3. A is the set of all actions available.
4. T(s, a, s′) = P(s′ | s, a) is the transition probability
distribution; the probability of transitioning to state s′ when in
state s ∈ S and choosing action a ∈ A.
5. R : S × A → ℝ is the reward function, the feedback
mechanism through which learning is possible.
6. γ ∈ (0, 1] is the discount factor which differentiates the value
of rewards received now from those received in the future. In finite
horizon cases γ = 1 and it can be ignored.</p>
      <p>The learner then has the objective of finding a policy
function π* from the set Π of all policy functions π : S → A
which solves the maximisation of the expected discounted
sum of rewards:
π* = arg max_{π∈Π} E[ Σₜ γᵗ R(sₜ, aₜ) | π ]</p>
      <p>The policy function is often a probability distribution over
actions, π(a|s) = P(a|s), ∀a ∈ A, s ∈ S.</p>
      <p>The Markovian property of this process comes from the
transition function. It is satisfied if the probability of
transition to a new state is determined only by the current state
and chosen action.</p>
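      <p>As a toy illustration (mine, not drawn from the paper) of the objective above, the discounted return of a single episode's reward sequence can be computed directly; here γ = 0.9 over three steps:</p>
      <preformat>
# A minimal sketch: the discounted return G = sum_t gamma^t * r_t
# for one episode, matching the maximisation objective above.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 1.0]))  # 1 + 0 + 0.81 = 1.81
      </preformat>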
      <p>An extension of the MDP is the Partially Observable MDP
(POMDP). This covers the very probable contingency where
the full state of the world is not visible to the decision maker.
It is described by the tuple (S, s₀, A, T, R, γ, Ω, O). The
two additions to the tuple are as follows:</p>
      <p>7. Ω is the set of all observations that the learner can receive.
For convenience we assume that it includes the reward rₜ
received in any time period.
8. O(s′, a, ω) = P(ω | s′, a) is the probability distribution of receiving
observation ω ∈ Ω after transition to state s′ when action a
was chosen.</p>
      <p>
        The domain of the policy function then becomes the
history of all observations and actions, which we write π(a|hₜ)
where hₜ is shorthand for (o₁, a₁, o₂, a₂, ..., aₜ₋₁, oₜ).
Consequently the complexity of solving a POMDP is much
higher than that of an MDP
        <xref ref-type="bibr" rid="ref14 ref3">(Abel, MacGlashan, and Littman
2016)</xref>
        . Our interest in introducing POMDPs is not so much
the partial observability of this problem but more the
enlarged domain of the policy function which they necessitate.
This paper will show that law-abiding policies are likely to
be dependent on the history of observations regardless of
whether there is partial observability or not.
      </p>
      <sec id="sec-2-1">
        <title>Structural Causal Models</title>
        <p>
          This paper will show that the legality of behaviour can
depend on establishing the causal effects of an action. The
definition of causality is a complicated topic and there is a
distinction between predictive causality, which refers to
predicting the effect of actions, and actual causality, which refers
to evidential analysis after actions have been taken.
Structural Causal Models (SCMs)
          <xref ref-type="bibr" rid="ref42">(Pearl 2000)</xref>
          can be used in
both senses. SCMs are a special case of Bayesian Causal
Networks where directed arcs between nodes express direct
causality as well as the usual independence statements. Pearl
introduces the concept of the atomic
intervention, which corresponds to setting a variable to a value in the
model; this has strong connections to taking actions in RL.
Actual causality is a harder undertaking since it necessarily
involves counterfactual reasoning and has to deal with
issues like preemption and overdetermination. Halpern-Pearl
causality
          <xref ref-type="bibr" rid="ref24">(Halpern 2016)</xref>
          is a general-purpose definition of
actual causality. Alternative, simpler definitions are
discussed in
          <xref ref-type="bibr" rid="ref27 ref33 ref35">(Liepin¸a, Sartor, and Wyner 2020)</xref>
          .
        </p>
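        <p>As a minimal sketch (my own, assuming only Pearl's notion of the atomic intervention), the following Python fragment encodes a two-variable SCM and contrasts the observational distribution of Y with the interventional distribution under do(X = 1):</p>
        <preformat>
import random

# A two-variable SCM: X := U_X,  Y := X or U_Y
def u_x(): return bool(random.getrandbits(1))  # exogenous noise for X
def u_y(): return random.random() > 0.9        # exogenous noise for Y (10%)

def sample(do_x=None):
    """Sample (X, Y); do_x, if supplied, is an atomic intervention that
    overrides X's structural equation (Pearl's do-operator)."""
    x = u_x() if do_x is None else do_x
    y = x or u_y()                              # structural equation for Y
    return x, y

obs = [sample()[1] for _ in range(10000)]           # observational P(Y)
intv = [sample(do_x=True)[1] for _ in range(10000)] # P(Y given do(X = 1))
print(sum(obs) / len(obs), sum(intv) / len(intv))   # roughly 0.55 vs 1.0
        </preformat>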
      </sec>
      <sec id="sec-2-2">
        <title>Intent in RL</title>
        <p>
          RL has been used to infer the intent of others
          <xref ref-type="bibr" rid="ref22 ref45 ref55">(Qi and Zhu
2018)</xref>
          , and even in IRL to define a reward function that
corresponds to the intentions of an expert demonstrator
          <xref ref-type="bibr" rid="ref21 ref23 ref37">(MacGlashan and Littman 2015)</xref>
          . However, it has never been defined
what intent means for the learner.
          <xref ref-type="bibr" rid="ref10 ref11">Ashton (2020</xref>
          b) presents
a definition of direct intent in terms of causality and
desire of realised states. An agent directly intends a state s by
committing an action a if a foreseeably causes s and the
agent aims or desires state s. In the context of RL, where
an agent has a value function over every possible state,
inferences can be made about what a learner desires. Within
criminal law, different levels of intent are required for
different crimes
          <xref ref-type="bibr" rid="ref36">(Loveless 2013)</xref>
          . Direct intent is the strictest,
being required for murder, but lower levels such as oblique
intent, recklessness and negligence also exist.
lower levels of intent do not necessarily have any
requirements about desire, their definitions often include subjective
and objective tests of foreseeability vis-a-vis the prohibited
outcomes of actions. Subjective tests raise interesting
questions in model-free modes of RL since the learner does not
explicitly expect any outcome to their action. Objective tests
require an external judgement about the probability of an
outcome given an action. If a consequence of an action was
foreseeable then the offender can be thought of intending the
outcome.
          <xref ref-type="bibr" rid="ref33">Lagioia and Sartor (2020)</xref>
          discuss this method of
intent inference and consider it sufficient, albeit principally
in the context of Belief-Desire-Intention (BDI) agents
          <xref ref-type="bibr" rid="ref19">(Cohen
and Levesque 1990)</xref>
          . An intriguing corollary of the use of
objective tests is that the predictive model that the RL agent
uses or learns should be accurate. This bypasses the danger
of the learner developing a 'delusion box'-type model
          <xref ref-type="bibr" rid="ref46">(Ring
and Orseau 2011)</xref>
          to justify otherwise illegal policies.
        </p>
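        <p>As an illustrative, hedged sketch (the function and both thresholds are my own assumptions, not a settled legal test), the direct-intent definition above can be operationalised for a learner that has a transition model and a state-value estimate:</p>
        <preformat>
def directly_intends(state, action, transition, value,
                     foreseeability=0.5, desire=0.0):
    """Sketch of direct intent in the sense of Ashton (2020b): action a
    directly intends state s if (i) a foreseeably causes s and (ii) the
    learner desires s (here: values it above a threshold).
    transition(action) returns a dict of successor-state probabilities;
    value(state) is the learner's state-value estimate."""
    foreseeable = transition(action).get(state, 0.0) >= foreseeability
    desired = value(state) > desire
    return foreseeable and desired
        </preformat>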
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A non-exhaustive taxonomy of laws</title>
      <p>In this section I present a number of law-types which are
likely to exist in a regulated environment. I differentiate
between states and actions. Actions are assumed to be
originated by the learner only and their commission is
voluntary. States refer to some measurable property of the
environment, stable for the duration of the time period. Actions
cause a measurable change in the environment but I assume
that their duration is instantaneous so there is no state that
records an action in progress.</p>
      <sec id="sec-3-1">
        <title>Simple State restriction laws</title>
        <p>This is the simplest type of law and the one which Safe RL
research has concentrated on as many physical safety
constraints are of this type. Examples might include ’drive
below 30 miles per hour in urban environments’ or ’do not fly
a drone in the vicinity of the airport’. For a state restriction
to be a law, its realisation should be avoidable by the learner
through its actions. This is the case for the speed restriction
example. Even though I have described these laws as simple,
they might be context dependent. Roads with a solid central
line generally do not allow overtaking, but the presence of a
stationary vehicle blocking progress might allow it,
provided it is safe to do so.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Caused state restriction laws</title>
        <p>Some states exist which could be both caused through the
learner’s actions and through an external mechanism. This
marks the first departure from conventional safe-RL research
because the safety constraints traditionally considered do
not differentiate between those states caused by the learner
and those that are not. It does not matter whether the drone
caused itself to fly over the volcano or whether it was blown
by the wind, the state of being located over the volcano is
the one to be avoided. For legal restrictions, certain states
might only be restricted if they were caused by the learner
(death is not generally prohibited, but causing it generally is!).
A concrete example of such a causal dependent state
restriction can be found within financial markets where the UK’s
financial regulator prohibits trading algorithms from
creating or contributing to disorderly market conditions. Such
conditions could arise independently of the behaviour of
the learner; if the learner has no mechanism for determining
whether this is the case, learning efficiency will be
compromised. For caused state restriction laws to be broken, two
conditions should be satisfied: firstly, a restricted state ŝ
should occur, and secondly, the actions of the learner must have
foreseeably caused the state to occur.</p>
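        <p>A minimal sketch of this two-condition test (the names and the probability-raising proxy for causation are my own assumptions, not a legal standard):</p>
        <preformat>
def caused_state_violation(state_occurred, p_given_action, p_baseline,
                           foreseeability=0.5):
    """The law is broken only if the restricted state occurred AND the
    learner's action foreseeably caused it; here 'caused' is proxied by
    the action raising the state's probability above an action-free
    baseline, a crude stand-in for a proper causal test."""
    caused = p_given_action >= foreseeability and p_given_action > p_baseline
    return state_occurred and caused

print(caused_state_violation(True, 0.7, 0.05))   # True: learner liable
print(caused_state_violation(True, 0.04, 0.05))  # False: external cause
        </preformat>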
      </sec>
      <sec id="sec-3-3">
        <title>Action sequence laws</title>
        <p>Laws exist in a variety of settings which are restrictions on
conduct with no necessary requirement for a lasting change
in the state of the world. Examples in the UK include the
offence of Careless Driving and more seriously Dangerous
driving. There is no causation requirement since I assume
that the A-bot has the freedom to choose its own actions at
any time. (Situations where there is no action which won't break
a law are analogous to the concept of deadlock in model checking
<xref ref-type="bibr" rid="ref12">(Baier and Katoen 2008)</xref>;
laws can still be broken when the perpetrator had no choice but to
break the law, but the defence of necessity might then be valid as
mitigation.) Action sequence laws could be transformed into
a simple state restriction law by adding a state variable that
indicates whether a restricted action sequence has occurred.
Such an approach might not be efficient if a large number of
conduct laws exist in the environment as it would cause the
dimension of the state space to grow.</p>
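        <p>A sketch of that transformation (the restricted two-action sequence here is hypothetical): the flag latches once the sequence is committed, at the cost of one extra state dimension per conduct law.</p>
        <preformat>
def augment(state, prev_action, action, flag):
    """Fold an action-sequence law into the state: flag latches to True
    once the restricted two-action sequence has occurred, so a simple
    state restriction (flag must stay False) can express the conduct law."""
    flag = flag or (prev_action == "overtake" and action == "brake_hard")
    return (state, flag)
        </preformat>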
      </sec>
      <sec id="sec-3-4">
        <title>Mixed State Action sequence laws</title>
        <p>Laws exist which combine a sequence of actions that cause
restricted state(s). Continuing the driving examples from the
previous section, in the UK there exist statutory offences of
causing death by careless or reckless driving. These laws
constitute a restriction on how certain states are arrived at.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Inchoate offences: Laws restricting behaviour that may induce future restricted states</title>
        <p>
          Inchoate offences are restrictions on action sequences and
states which may lead to restricted states which are not
necessarily realised. Examples in the UK include attempt crimes
such as attempted murder or possession of drugs with intent
to supply. In the USA, conspiracy and solicitation (the
request, encouragement or payment for someone else to
commit a felony) are major classes of inchoate offence.
        </p>
      </sec>
      <sec id="sec-3-6">
        <title>Laws requiring intent</title>
        <p>Common law as practised in the UK, USA, India and Canada,
amongst others, requires that the accused had mens rea (the
mental element of intent to commit a crime) for certain
criminal offences to have been committed by them. Different
levels of intent exist, ranging from direct intent, where the
accused deliberately caused and wanted to cause a prohibited
outcome, at the extreme, to oblique intent, where prohibited
outcomes were caused as a side-effect of their behaviour, to
recklessness and negligence, where the prohibited outcomes
were foreseeable outcomes of their behaviour to various
degrees. Certain offences will specify what level of mens rea
is required; murder, for example, requires direct or oblique intent.</p>
        <p>
          Aside from the crimes of specific intent, certain laws exist
which require establishing for what purpose the accused did
something. This is called basis intent by
          <xref ref-type="bibr" rid="ref15">Bathaee (2011)</xref>
          . US
examples cited include market anti-spoofing laws; spoofing
is defined as the placement of orders with the intent to
cancel them. Another related and topical example is termed
Gatekeeping Intent by Bathaee. Laws or systems which are
discriminatory in effect are only unlawful if discriminatory
intent behind them is established.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Implications of the taxonomy</title>
      <p>There exist a number of challenges to developing an RL
method which produces legal policies under a rich set of
laws. I classify them into three areas:
1. Encoding The environment’s laws need to be described
in such a way as to be machine interpretable.
2. Determination A mechanism needs to exist which can
determine the legality of any behaviour either in advance
or in retrospect.
3. Constrained policy learning There should exist a
method to constrain the policy of the A-bot to be law-abiding
when it is acting or learning.</p>
        <p>Generally these problems should be solved in the order
displayed. The determination of legality requires an
encoding of law to reference (planned) behaviour against, and
constrained policy learning relies on knowing which policies are
acceptable and which are not. By looking at the taxonomy of
laws, some inferences can be made about all three elements
of this process. I will run through each of the three tasks and
make comments on how laws affect them. In practice the
three tasks might be heavily intertwined.</p>
      <sec id="sec-4-2">
        <title>Encoding</title>
        <p>Safe RL research is only beginning to pay attention to how
laws should be described. This is because the types of
restrictions considered have largely been of the simple state
type which can be encoded using simple algebraic
expressions. This approach becomes untenable when considering
more complicated laws like the ones identified in the
previous section. Furthermore, the quantity of applicable rules in
regulated environments is larger than most rulesets hitherto
considered in research. I identify four desirable features of
an encoding which should be used to convey laws.
1. Temporal A number of laws restrict sequences of states
or actions. Moreover there is no requirement that these
sequences are contiguous. An encoding needs to be rich
enough to express multiple states and actions with
temporal relations like always, until, next etc.
2. Probabilistic As in the case of Inchoate offences, some
laws refer to future states not realised. Since the space of
all possible future events is a large one, law reasonably
concerns itself with restricting foreseeable consequences
of behaviours. An encoding of laws should be rich enough
to express this.
3. Causal As we saw in the previous section, laws will
often prohibit the causation of a state, not the state itself per
se. Our death is not usually prohibited, but causing it
normally is. Causation when considered in prospect will also
normally require some sort of probabilistic reasoning.
4. Intent Certain laws require establishing levels of intent
on the part of the transgressor. Different levels of intent
exist and are applicable to different offences.</p>
        <p>
          In Table 1 I summarise what level of encoding
expressiveness is required for each of the law types described in
the previous section. Nearly all of the laws require
temporal expressiveness. Temporal Logic systems allow us to
express when conditions should be true. A wide variety
exist such as Linear Temporal Logic (LTL)
          <xref ref-type="bibr" rid="ref44">(Pnueli 1977)</xref>
          ,
Computation Tree Logic (CTL)
          <xref ref-type="bibr" rid="ref18">(Clarke and Emerson 1981)</xref>
          which considers multiple future paths and Probabilistic CTL
(PCTL)
          <xref ref-type="bibr" rid="ref26">(Hansson and Jonsson 1994)</xref>
          which as the name
suggests accounts for probabilistic transitions.
          <xref ref-type="bibr" rid="ref32">Kleinberg and
Mishra (2009)</xref>
          extend this to provide a language capable
of expressing causal relationships. To our knowledge there
is no similar extension to express intention; this is an
immediate project.
          <xref ref-type="bibr" rid="ref6">Alves, Dennis, and Fisher (2020</xref>
          ) succeed
in encoding the road junction rules for an autonomous
vehicle using a variant of LTL.
        </p>
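        <p>As a small illustration (my own construction, not drawn from the cited works) of the temporal expressiveness required, the simple state-restriction law 'always: if urban, speed below 30' - in LTL, G(urban → speed below 30) - can be checked by a runtime monitor over a finite trace:</p>
        <preformat>
def monitor(trace):
    """Check the 'always' property G(urban implies speed below 30) on a
    finite trace; trace is an iterable of dicts such as
    {'urban': True, 'speed': 28.0}. Returns (ok, violating_step)."""
    for t, state in enumerate(trace):
        if state["urban"] and state["speed"] >= 30:
            return False, t          # law broken at step t
    return True, None

print(monitor([{"urban": True, "speed": 28.0},
               {"urban": True, "speed": 31.5}]))  # (False, 1)
        </preformat>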
        <p>The analysis is not exhaustive. For example in any given
regulated environment there are likely to be a large number
of rules that the learner should obey simultaneously. This is
likely to mean deadlock situations arise where not breaking
one law may result in the breaking of another. A meta-ordering
of laws may be required to deal with this situation.</p>
        <p>[Table 1: the encoding expressiveness (temporal, probabilistic, causal, intent) required by each law type.]</p>
      </sec>
      <sec id="sec-4-3">
        <title>Determination</title>
        <p>Given that we have a codified set of rules, we require a
device to determine when a sequence of behaviour has contravened or is
contravening a law. I assume that the states referred to in
the encoding are either the same as those perceived by the
learner or there is an available mapping function between
the learning and encoding statespaces. This is of course not a
given since the learner may be perceiving continuous states
and the encoding is likely to refer to high level states. In
the case of simple state restrictions this is not a difficult task
but it becomes increasingly complex with richer laws. I have
separated the requirements into four main features.
1. Domain Consider an arbiter function which maps
behaviour to a binary set denoting legal or illegal (a minimal
sketch follows this list). The
domain of this function is dependent on the type of law it
is considering. Laws which reference more than one state
or action, for example, require a domain which includes the
history of states and actions hₜ = (s₁, a₁, s₂, a₂, ..., sₜ).
Laws which reference future paths will also require the
policy function of the learner (π). Laws which require
intent may also require information about the reward
function of the learner or its estimation of state values.
2. Future Projection A model of the environment is
required for almost all laws. In the case where strong safety
is required in training, the legality of any action needs to
be assessed before choosing it and this means assessing
the likely state transitions which occur as a result.
Projection is also a requirement in causal reasoning.
3. Causal Reasoning As certain laws are defined by
causation, a method is required to determine whether restricted
states have or are likely to be caused by the learner’s
action. This requires a causal model of the environment.
4. Intentional Reasoning In environments where laws are
defined by intent, a learner must be aware of what they
are intending to do (i.e. their likely policy trajectory) by
choosing a particular action at any moment in time.</p>
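        <p>A minimal sketch (the interfaces are my own illustration, not a proposed standard) of an arbiter whose domain grows with the richness of the law type; an action-sequence law needs only the history, while richer laws would also consume the policy:</p>
        <preformat>
def make_arbiter(rules):
    """rules: functions (history, policy) -> bool, True meaning legal.
    history is [(s1, a1), ..., (st, at)]; policy is only consulted by
    rules covering future paths or intent."""
    def legal(history, policy=None):
        return all(rule(history, policy) for rule in rules)
    return legal

# An action-sequence law: the action 'b' may not be taken twice in a row.
def no_double_b(history, _policy):
    actions = [a for _, a in history]
    return all(not (x == "b" and y == "b")
               for x, y in zip(actions, actions[1:]))

arbiter = make_arbiter([no_double_b])
print(arbiter([("s0", "a"), ("s1", "b"), ("s2", "b")]))  # False
        </preformat>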
        <p>In Table 2 I show the necessary features of a legality
determination process according to the law type present.
Reasoning about intent requires an algorithmic definition of intent.
This is an open area of research since the concept of intent
has been deliberately left as a primitive by legal
practitioners. Care must be taken to ensure that the definition of intent
used in safe RL corresponds to what a court would find
sufficient.</p>
        <p>[Table 2: the features required of the legality determination
process by law type - future trajectory prediction; the arbiter
function domain (state path, action path, policy); and causal and
intent reasoning.]</p>
      </sec>
      <sec id="sec-4-4">
        <title>Constrained policy learning</title>
        <p>
The taxonomy of laws informs us how those laws should be
described and the process which determines the legality of
behaviour. Finally, it can also inform us about the
properties of RL methods which will generate legally constrained
policies.
1. Memory Reinforcement Learning approaches typically
use an MDP formulation to model their task (a sketch of a
history-augmented alternative follows this list). Whilst a
record of the current state might still be valid for the
transition model, most of the law types that I have identified
rely to some extent on sequences of states and actions.
Thus the device which chooses actions at any state (most
likely the policy function) must include histories in its
domain. Otherwise the standard MDP learner would not be
able to understand whether its current action is legal or
not. Including the history of actions and states is
something that POMDPs do in order to make inference on the
hidden states of this model.
2. Model for planning Determining the legality of any
action requires predicting the likelihood of future states.
How far prediction is expected to go into the future
depends on the laws present - avoiding inchoate offences
presumably requires greater foresight. Much of RL is
’model free’ and successfully so, but they seem
unavoidable here. Established methods like Dyna-Q
simultaneously learn to act and create a world model
          <xref ref-type="bibr" rid="ref49">(Sutton 1990)</xref>
.
3. Causal model Determining whether a law has been
broken or not will often require some test of causality ex post
          <xref ref-type="bibr" rid="ref51">(Turner 2019)</xref>
          . The task of the learner is to make sure that
they do not cause a restricted state to occur ex-ante, and
if it does occur that they are not subsequently adjudged
to have been a legal cause of it. The presence of causal
restrictions necessitates a causal model of the world to be
formed for predictions. Bayesian Causal Models or
equivalently Structural Causal Models (SCMs)
          <xref ref-type="bibr" rid="ref42">(Pearl 2000)</xref>
can be used to predict causal effects and to
determine causality ex post. They readily accept techniques
like counterfactual analysis, which allows off-policy data
treatment
          <xref ref-type="bibr" rid="ref14">(Bareinboim and Pearl 2016)</xref>
          which is
important for off-policy RL methods. There also exist
definitions of causality based on SCMs, such as the Actual Causality
of
          <xref ref-type="bibr" rid="ref24">Halpern (2016)</xref>
          which are capable of dealing with the
trickier causal problems of overdetermination,
preemption and omission. See
          <xref ref-type="bibr" rid="ref13">Bareinboim (2020)</xref>
          for an
introduction to causal RL.
        </p>
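        <p>As an illustrative sketch of the memory requirement (my construction, not the paper's), the learner's state can be augmented with a bounded window of past observations and actions, so that a history-dependent legality check has what it needs:</p>
        <preformat>
from collections import deque

class HistoryWrapper:
    """Wrap an environment so the 'state' handed to the policy is the
    last k (observation, action) pairs, as history-dependent laws require.
    env is assumed to expose reset() and step(action)."""
    def __init__(self, env, k=4):
        self.env, self.h = env, deque(maxlen=2 * k)

    def reset(self):
        self.h.clear()
        self.h.append(("obs", self.env.reset()))
        return tuple(self.h)

    def step(self, action):
        self.h.append(("act", action))
        obs, reward, done = self.env.step(action)
        self.h.append(("obs", obs))
        return tuple(self.h), reward, done
        </preformat>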
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Related work</title>
      <p>
        The task of learning a legally constrained policy through
RL has seldom been mentioned in isolation but instead cited
as a possible use case in more general Safe RL work.
Surprisingly the learning of ethical policies has loomed larger
in published research. For an RL approach see Abel,
MacGlashan, and Littman (2016) or
        <xref ref-type="bibr" rid="ref56">Winfield et al. (2019)</xref>
        for
a more general discussion on ethically constraining
autonomous systems. Ethical constraint is a harder task since
there is no agreed source of ethical constraints to apply to
the learner. In contrast to those of ethics,
        <xref ref-type="bibr" rid="ref28">Hildebrandt (2019)</xref>
        points out that questions of legality always have closure. It is
important to observe, however, that there is no single source of the law;
the determination of legality will likely rely on referencing
multiple sources
        <xref ref-type="bibr" rid="ref16">(Boella et al. 2014)</xref>
        .
      </p>
      <p>
        <xref ref-type="bibr" rid="ref23">Garc´ıa and Ferna´ndez (2015)</xref>
        provide a general survey of
Safe RL, dividing approaches into those that modify the
reward structure and those that modify the exploration
process. Constrained Markov Decision Processes (CMDPS) do
the former, by adding a finite set of auxiliary cost functions
Ci : S A ! R to the vanilla MDP. Policies should then
achieve a discounted total cost in expectation less than some
scalar di (whilst maximising the normal reward function).
This is largely the approach of Constrained Policy
Optimisation presented by
        <xref ref-type="bibr" rid="ref4">Achiam et al. (2017)</xref>
        . A drawback with
such an approach is that bad states can be reached in
exploration making learning outside a simulator potentially
expensive. Constraints introduced as cost functions also need
to be differentiable and Markovian if certain gradient
methods are to be used. Neither of these restrictions apply to
the Constrained Cross-Entropy method of
        <xref ref-type="bibr" rid="ref55">Wen and Topcu
(2018)</xref>
        , though perhaps at the cost of data parsimony.
      </p>
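      <p>A Monte Carlo sketch of the CMDP acceptability test just described (the function names and the estimator are mine): a policy's sampled trajectories must keep every expected discounted cost below its threshold dᵢ.</p>
      <preformat>
def within_cost_budgets(trajectories, costs, thresholds, gamma=0.99):
    """trajectories: list of episodes, each [(s, a), ...]; costs: list of
    auxiliary cost functions C_i(s, a); thresholds: the scalars d_i.
    True if every estimated expected discounted cost is within budget."""
    for cost, d in zip(costs, thresholds):
        estimates = [sum(gamma**t * cost(s, a)
                         for t, (s, a) in enumerate(episode))
                     for episode in trajectories]
        if sum(estimates) / len(estimates) > d:   # Monte Carlo E[cost]
            return False
    return True
      </preformat>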
      <p>
        Safe RL methods which constrain exploration include
approaches where a policy is learnt from observing a safe
policy in a process known as Inverse RL or
Apprenticeship RL
        <xref ref-type="bibr" rid="ref1">(Abbeel and Ng 2004)</xref>
        . Recent examples include
        <xref ref-type="bibr" rid="ref41">Noothigattu et al. (2018)</xref>
        who train a learner to play
PacMan following the rule 'don't eat the ghosts' through
expert demonstration and a bandit policy which alternates
between observed 'safe' behaviour and optimal self-taught
behaviour. Abel, MacGlashan, and Littman (2016) present a
method where the ethical-preferences of an expert are
derived through observation and then used to develop policies
accordingly. IRL approaches such as these obviate the
requirement for an explicit representation of rules. This could
be seen as a good feature in constrained tasks such as the
learning of ethical behaviour or customs where there is no
written source of what the constraints should be. This is not
the case in regulated settings. Moreover, IRL is an ill-posed
problem: many reward functions exist to explain any
observed behaviour. To make the problem tractable,
simplifying assumptions must be made about its form. The resulting
reward function might not be rich enough to encode the
preference required not to break all laws. In particular,
        <xref ref-type="bibr" rid="ref9">Arnold,
Kasenberg, and Scheutz (2017</xref>
        ) note that IRL does not infer
intertemporal rules.
      </p>
      <p>
        A developing area of Safe RL comprises those methods which
integrate formal methods based on symbolic logic into the
learning machinery of RL. Many of these techniques
originate from the research area of formal verification methods
and model checking
        <xref ref-type="bibr" rid="ref12">(Baier and Katoen 2008)</xref>
        . These are the
techniques developed to error-check software systems and
to provide stronger guarantees of correctness. As discussed,
temporal logic allows a richer expressiveness of laws.
Different temporal logic systems have been applied to the
learning of policies in MDPs, whether or not transitions are
known. Linear Temporal Logic (LTL) is used in
        <xref ref-type="bibr" rid="ref27">Hasanbeig and
Kroening (2020)</xref>
        ,
        <xref ref-type="bibr" rid="ref21">Fu and Topcu (2015)</xref>
        and
        <xref ref-type="bibr" rid="ref54">Wen, Ehlers,
and Topcu (2015</xref>
        ) and Differential Dynamic logic is used
in
        <xref ref-type="bibr" rid="ref22">Fulton and Platzer (2018)</xref>
        . Probabilistic computation tree
logic (PCTL) is used in
        <xref ref-type="bibr" rid="ref39">Mason et al. (2017)</xref>
        .
      </p>
      <p>
        <xref ref-type="bibr" rid="ref5">Alshiekh et al. (2018)</xref>
        , and
        <xref ref-type="bibr" rid="ref29">Jansen et al. (2018)</xref>
        use a
structure called a shield to create safe policies through RL. This
is a system which sits between the learner and the environment and
either filters the choice of available actions for the learner
in learning time, or replaces unwise actions in deployment.
        <xref ref-type="bibr" rid="ref10 ref11">Ashton (2020</xref>
        a) calls the design a legal oracle and explores
its necessary features in a legal setting. A Shield has a model
of the environment, knows the required constraints which
are described in temporal logic, and is able to use formal
program verification methods to check the legality of any
action at any moment in time. An attractive feature of this
method is that the method of constraint is separate and
somewhat agnostic to the method of learning.
        <xref ref-type="bibr" rid="ref30">Jansen et al. (2020)</xref>
identify three challenges to this approach: model checking is
computationally expensive; safety in a probabilistic
environment is not binary, so thresholds need to be considered; and
finally, shielding may obstruct efficient exploration, thereby
generating sub-optimal policies.
      </p>
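      <p>A minimal sketch of the shielding idea (the interfaces are assumptions of mine, not those of the cited implementations): before the learner chooses, the shield removes actions whose predicted successor states violate the encoded constraints.</p>
      <preformat>
def shielded_actions(state, actions, model, legal):
    """model(state, a) returns the predicted next state; legal(s) is the
    constraint check. In learning time the learner samples only from the
    filtered set; in deployment an unwise chosen action would instead be
    replaced by a member of this set."""
    return [a for a in actions if legal(model(state, a))]
      </preformat>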
      <p>
        Seldonian Reinforcement learning
        <xref ref-type="bibr" rid="ref50">(Thomas et al. 2019)</xref>
        is
a recent technique that aims to produce RL algorithms that
only output safe policies with a certain (high) probability. It
differs from other methods discussed in this paper in that the
technique searches for learning algorithms not policies. The
RL example presented in the paper has restrictions of
limited complexity so we will have to wait for more published
research to assess this method properly.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>The paper is motivated by an aim to design Safe RL
processes which are capable of producing policies constrained
under a general rule set. By creating a brief taxonomy of
laws in the language of states and actions specifically for
the application I have been able to draw some conclusions
about the requirements of legally-safe RL. Laws are
commonly defined in inter-temporal ways over actions and state.
This means that a learning process must include a memory
of past states and actions. Thus the domain of a legal
policy function will include history just as it does in RL
under a POMDP. Causality and Intent can be key concepts
in determining whether and which laws have been broken.
Whilst RL is beginning to tackle causality, it has not yet done so in
the context of constrained learning. Intent is barely defined
quantitatively but it will have to be if generally legal RL
systems are to be produced. Causality, Intent and the existence
of inchoate offences mean that a legally-safe RL algorithm
will require prediction about likely future trajectories. This
will require some type of environment model to be learned
or supplied to the learner, and planning to take place.</p>
      <p>
        This work could be viewed as an application of legal
requirements engineering. Care should be taken since it
originates singularly from a computer scientist and not a legal
practitioner
        <xref ref-type="bibr" rid="ref16">(Boella et al. 2014)</xref>
        . Yet it is a starting point
which can at least begin to inform engineers.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Apprenticeship Learning via Inverse Reinforcement Learning</article-title>
          .
          <source>In Proceedings of the 21st International Conference on Machine Learning (ICML). ISBN 1581138285. doi:10.1145/1015330.1015430.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>ISBN 1581138285. doi:10.1145/1015330</source>
          .1015430.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Abel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; MacGlashan, J.; and
          <string-name>
            <surname>Littman</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Reinforcement learning as a framework for ethical decision making</article-title>
          .
          <source>AAAI Workshop - Technical Report WS-16-01</source>
          :
          <fpage>54</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Achiam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Held</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tamar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and Abbeel,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Constrained Policy Optimization</article-title>
          . URL http://arxiv.org/abs/1705.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Alshiekh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bloem</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ehlers</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Könighofer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Niekum</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Topcu</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Safe Reinforcement Learning via Shielding</article-title>
          .
          <source>In AAAI Conference on Artifical Intelligence</source>
          . URL https://aaai.org/ocs/index.php/AAAI/AAAI18/ paper/view/17211/16534.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Alves</surname>
            ,
            <given-names>G. V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dennis</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and Fisher,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>Formalisation and Implementation of Road Junction Rules on an Autonomous Vehicle Modelled as an Agent</article-title>
          .
          <source>In Formal Methods. FM 2019 International Workshops</source>
          , volume
          <volume>1</volume>
          ,
          <fpage>217</fpage>
          -
          <lpage>232</lpage>
          . Springer International Publishing. ISBN 9783030549930. ISSN 16113349. doi:10.1007/978-3-030-54994-7_16.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>Springer International Publishing. ISBN 9783030549930.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>ISSN</surname>
          </string-name>
          <year>16113349</year>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -
          <fpage>54994</fpage>
          -7n16.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Arnold</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kasenberg</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and Scheutz,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Value alignment or misalignment - What will keep systems accountable</article-title>
          ? AAAI Workshop - Technical Report WS-17-01:
          <fpage>81</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Ashton</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2020a</year>
          .
          <article-title>AI Legal Counsel to train and regulate legally constrained Autonomous systems</article-title>
          .
          <source>In IEEE Big Data 2020 workshop on applications of artificial intelligence in the legal industry, Forthcoming.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Ashton</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2020b</year>
          .
          <article-title>Definitions of intent for AI derived from common law</article-title>
          .
          <source>In Jurisin</source>
          <year>2020</year>
          : 14th International Workshop on Juris-informatics. URL https://easychair.org/publications/preprint/GfCZ.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Baier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Katoen</surname>
          </string-name>
          , J.-P.
          <year>2008</year>
          .
          <article-title>Principles Of Model Checking</article-title>
          . MIT Press. ISBN 9780262026499. URL http://mitpress.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Bareinboim</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Causal Reinforcement Learning (CRL).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Bareinboim</surname>
          </string-name>
          , E.; and
          <string-name>
            <surname>Pearl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Causal inference and the data-fusion problem</article-title>
          .
          <source>Proceedings of the National Academy of Sciences of the United States of America</source>
          <volume>113</volume>
          (
          <issue>27</issue>
          ):
          <fpage>7345</fpage>
          -
          <lpage>7352</lpage>
          . doi:10.1073/pnas.1510507113.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Bathaee</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>The artificial intelligence black box and the failure of intent and causation</article-title>
          .
          <source>Harvard Journal of Law &amp; Technology</source>
          <volume>31</volume>
          (
          <issue>2</issue>
          ):
          <fpage>889</fpage>
          -
          <lpage>938</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Boella</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Humphreys</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Muthuri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; Rossi,
          <string-name>
            <given-names>P.</given-names>
            ; and
            <surname>Van Der Torre</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          <year>2014</year>
          .
          <article-title>A critical analysis of legal requirements engineering from the perspective of legal practice</article-title>
          . In 2014 IEEE 7th International Workshop on Requirements Engineering and Law, RELAW 2014 - Proceedings, 14-21. doi:10.1109/RELAW.2014.6893476.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>2014 IEEE 7th International Workshop on Requirements Engineering and Law</source>
          ,
          <source>RELAW 2014 - Proceedings 14-21</source>
          . doi:
          <volume>10</volume>
          .1109/RELAW.
          <year>2014</year>
          .
          <volume>6893476</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Clarke</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Emerson</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          <year>1981</year>
          .
          <article-title>Design and synthesis of synchronization skeletons using branching time temporal logic</article-title>
          .
          <source>In Workshop on Logic of Programs</source>
          ,
          <fpage>52</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>P. R.</given-names>
          </string-name>
          ; and Levesque,
          <string-name>
            <surname>H. J.</surname>
          </string-name>
          <year>1990</year>
          .
          <article-title>Intention is choice with commitment</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>42</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>213</fpage>
          -
          <lpage>261</lpage>
          . ISSN 00043702. doi:10.1016/0004-3702(90)90055-5.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>ISSN</surname>
          </string-name>
          <year>00043702</year>
          . doi:
          <volume>10</volume>
          .1016/
          <fpage>0004</fpage>
          -
          <lpage>3702</lpage>
          (
          <issue>90</issue>
          )
          <fpage>90055</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Fu</surname>
            , J.; and Topcu,
            <given-names>U.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints</article-title>
          . doi:10.15607/rss.2014.x.039.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Fulton</surname>
          </string-name>
          , N.; and
          <string-name>
            <surname>Platzer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Safe reinforcement learning via formal methods: Toward safe control through proof and learning</article-title>
          .
          <source>32nd AAAI Conference on Artificial Intelligence</source>
          , AAAI 2018
          <fpage>6485</fpage>
          -
          <lpage>6492</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>García</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and Fernández,
          <string-name>
            <surname>F.</surname>
          </string-name>
          <year>2015</year>
          .
          <article-title>A Comprehensive Survey on Safe Reinforcement Learning</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>16</volume>
          :
          <fpage>1437</fpage>
          -
          <lpage>1480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Halpern</surname>
            ,
            <given-names>J. Y.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Actual Causality</article-title>
          . MIT Press, 1st edition. ISBN 9780262035026.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>ISBN</surname>
          </string-name>
          <year>9780262035026</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Hansson</surname>
          </string-name>
          , H.; and
          <string-name>
            <surname>Jonsson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>1994</year>
          .
          <article-title>A logic for reasoning about time and reliability</article-title>
          .
          <source>Formal Aspects of Computing</source>
          <volume>6</volume>
          (
          <issue>5</issue>
          ):
          <fpage>512</fpage>
          -
          <lpage>535</lpage>
          . ISSN 09345043. doi:10.1007/BF01211866.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Hasanbeig</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kroening</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Cautious Reinforcement Learning with Logical Constraints</article-title>
          . doi:10.5555/3398761.3398821.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Hildebrandt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Closure: on ethics, code and law</article-title>
          . In Law for Computer Scientists, chapter
          <volume>11</volume>
          . Oxford University Press. ISBN 9780198860877.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Junges</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Könighofer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and Bloem,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ; Könighofer, B.;
          <string-name>
            <surname>Junges</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Serban</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and Bloem,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>Safe Reinforcement Learning Using Probabilistic Shields</article-title>
          .
          <source>In 31st International Conference on Concurrency Theory, CONCUR</source>
          <year>2020</year>
          ,
          <volume>1</volume>
          -
          <fpage>3</fpage>
          . doi:10.4230/LIPIcs.CONCUR.2020.3.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>CONCUR.</surname>
          </string-name>
          <year>2020</year>
          .
          <volume>3</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Kleinberg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Mishra</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>The temporal logic of causal structures</article-title>
          .
          <source>Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence</source>
          , UAI 2009,
          <fpage>303</fpage>
          -
          <lpage>312</lpage>
          . URL https://arxiv.org/abs/1205.2634v1.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Lagioia</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and Sartor,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>AI Systems Under Criminal Law: a Legal Analysis and a Regulatory Perspective</article-title>
          .
          <source>Philosophy and Technology</source>
          <volume>33</volume>
          (
          <issue>3</issue>
          ):
          <fpage>433</fpage>
          -
          <lpage>465</lpage>
          . ISSN 22105441. doi:10.1007/s13347-019-00362-x.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <source>doi:10</source>
          .1007/s13347-019-00362-x.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>Liepiņa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sartor</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wyner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Arguing about causes in law: a semi-formal framework for causal arguments</article-title>
          .
          <source>Artificial Intelligence and Law</source>
          <volume>28</volume>
          (
          <issue>1</issue>
          ):
          <fpage>69</fpage>
          -
          <lpage>89</lpage>
          . ISSN 15728382. doi:10.1007/s10506-019-09246-z.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <surname>Loveless</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Mens Rea: Intention, Recklessness, Negligence and Gross Negligence</article-title>
          . In Complete Criminal Law: Text, Cases and Materials, chapter 3,
          <fpage>91</fpage>
          -
          <lpage>150</lpage>
          . OUP Oxford. ISBN 0198848463. doi:10.1093/he/9780199646418.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>MacGlashan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Littman</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Between imitation and intention learning</article-title>
          .
          <source>In Twenty-Fourth International Joint Conference on Artificial Intelligence</source>
          . ISBN 9781577357384. ISSN 10450823.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <surname>ISSN</surname>
          </string-name>
          <year>10450823</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <surname>Mason</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Calinescu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kudenko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Banks</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Assured reinforcement learning with formally verified abstract policies</article-title>
          .
          <source>ICAART 2017 - Proceedings of the 9th International Conference on Agents and Artificial Intelligence</source>
          <volume>2</volume>
          :
          <fpage>105</fpage>
          -
          <lpage>117</lpage>
          . doi:10.5220/0006156001050117.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>Noothigattu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bouneffouf</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mattei</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chandra</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Madan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Varshney</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; ...; and
          <string-name>
            <surname>Rossi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration</article-title>
          . URL http://arxiv.org/abs/1809.08343.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>Pearl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2000</year>
          .
          <source>Causality: Models, Reasoning and Inference</source>
          . Cambridge University Press. ISBN 0521773628.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <surname>Pnueli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>1977</year>
          .
          <article-title>The temporal logic of programs</article-title>
          .
          <source>Proceedings - Annual IEEE Symposium on Foundations of Computer Science</source>
          , FOCS 1977-October:
          <fpage>46</fpage>
          -
          <lpage>57</lpage>
          . doi:10.1109/sfcs.1977.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.-C.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Intent-aware Multi-agent Reinforcement Learning</article-title>
          .
          <source>In IEEE International Conference on Robotics and Automation (ICRA)</source>
          ,
          <fpage>7533</fpage>
          -
          <lpage>7540</lpage>
          . doi:10.1109/ICRA.2018.8463211.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <surname>Ring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Orseau</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Delusion, Survival, and Intelligent Agents</article-title>
          .
          <source>In Conference on Artificial General Intelligence (AGI-11)</source>
          . ISBN 9783642228872. doi:10.1007/978-3-642-22887-2.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <surname>Saunders</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Stuhlmüller</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Trial without error: Towards safe reinforcement learning via human intervention</article-title>
          .
          <source>Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS</source>
          <volume>3</volume>
          :
          <fpage>2067</fpage>
          -
          <lpage>2069</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schrittwieser</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Antonoglou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          ...; and
          <string-name>
            <surname>Hassabis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Mastering the game of Go without human knowledge</article-title>
          .
          <source>Nature</source>
          <volume>550</volume>
          (
          <issue>7676</issue>
          ):
          <fpage>354</fpage>
          -
          <lpage>359</lpage>
          . doi:10.1038/nature24270.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          <year>1990</year>
          .
          <article-title>Integrated architectures for learning, planning, and reacting based on approximating dynamic programming</article-title>
          .
          <source>In Proceedings of the 7th International Conference on Machine Learning</source>
          ,
          <fpage>216</fpage>
          -
          <lpage>224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>P. S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>da Silva</surname>
            ,
            <given-names>B. C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Giguere</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Brun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Brunskill</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Preventing undesirable behavior of intelligent machines</article-title>
          .
          <source>Science</source>
          <volume>366</volume>
          (
          <issue>6468</issue>
          ). doi:10.1126/science.aag3311.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <string-name>
            <surname>Turner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2019</year>
          .
          <source>Robot Rules</source>
          . Palgrave Macmillan. ISBN 978-3-319-96234-4.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Babuschkin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Czarnecki</surname>
            ,
            <given-names>W. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mathieu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dudzik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          ...; and
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Grandmaster level in StarCraft II using multi-agent reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>575</volume>
          (
          <issue>7782</issue>
          ):
          <fpage>350</fpage>
          -
          <lpage>354</lpage>
          . doi:10.1038/s41586-019-1724-z.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ehlers</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Topcu</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Correct-by-synthesis reinforcement learning with temporal logic constraints</article-title>
          .
          <source>IEEE International Conference on Intelligent Robots and Systems</source>
          2015-December:
          <fpage>4983</fpage>
          -
          <lpage>4990</lpage>
          . doi:10.1109/IROS.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Topcu</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Constrained Cross-Entropy Method for Safe Reinforcement Learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>31</volume>
          , NeurIPS,
          <fpage>7461</fpage>
          -
          <lpage>7471</lpage>
          . URL http://papers.nips.cc/paper/7974-constrained-cross-entropy-method-for-safe-reinforcement-learning.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          <string-name>
            <surname>Winfield</surname>
            ,
            <given-names>A. F. T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pitt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Evers</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Machine ethics: The design and governance of ethical AI and autonomous systems</article-title>
          .
          <source>Proceedings of the IEEE</source>
          <volume>107</volume>
          (
          <issue>3</issue>
          ):
          <fpage>509</fpage>
          -
          <lpage>517</lpage>
          . doi:10.1109/JPROC.2019.2900622.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>