What criminal and civil law tells us about Safe RL techniques to generate law-abiding behaviour

Hal Ashton*
University College London, Gower Street, London
ucabha5@ucl.ac.uk

*Supported by the EPSRC. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Safe Reinforcement Learning (Safe RL) aims to produce constrained policies, with constraints typically motivated by issues of physical safety. This paper considers the issues that arise from regulatory constraints, or issues of legal safety. Without guarantees of safety, autonomous systems or agents (A-bots) trained through RL are expensive or dangerous to train and deploy. Many potential applications for RL involve acting in regulated environments, and here existing research is thin. Regulations impose behavioural restrictions which can be more complex than those engendered by considerations of physical safety. They are often inter-temporal, require planning on behalf of the learner, and involve concepts of causality and intent. By examining the typical types of laws present in a regulated arena, this paper identifies design features that the RL learning process should possess in order to ensure that it is able to generate legally safe or compliant policies.

Introduction

In this position paper I consider the problem of learning a solution to a sequential decision making problem, in an environment governed by some laws, via Reinforcement Learning (RL). I assume that the learned policy should not break these laws, because doing so would incur sanctions from the environment's regulator or law enforcer. By presenting a taxonomy of laws which exist in real life, whose features are relevant to RL, I am able to make some inferences about the general design of an RL process that can produce legal policies.

RL can produce novel policies to solve sequential decision problems. Its potential has been demonstrated in the super-human mastery of Go (Silver et al. 2017) and advanced performance in more complicated games like Starcraft (Vinyals et al. 2019), but adoption in real-life settings has been slowed by safety (including legal) considerations. This is noticeable in financial trading applications, which already use algorithms extensively but have been slow (publicly at least) to adopt RL.

RL requires an environment which allows ample exploration and feedback. In game applications such as Go and computer games, the training environment is the same as the deployment environment and training is costless (excepting the ecological impact of the vast computational power that is often used). Potential real-world applications of RL are often more complex, almost certainly regulated, and the cost of mistakes made in training or deployment could be catastrophic. In such applications, where safety, cost or legality are at issue, one approach is to conduct learning in a simulator of the environment, where the cost of bad policies is negligible. The use of any simulator raises the risk of misspecification and poor generalisation on deployment. An alternative to using a simulator is placing the RL agent very carefully in the target environment with a human overseer ready to take over in tricky spots. This approach has limitations according to the complexity of the task (Saunders et al. 2018). It might not be feasible in applications like trading, where the speed of decision making is beyond the ability of a human overseer to monitor.

Whether learning takes place in a simulator or carefully in the target arena, the ability to generate legal policies with high probability is highly desirable if the policy is to be deployed in a regulated setting. Laws can present different challenges to other types of constraint. A legally transgressive policy might not be obvious in the way a physically transgressive one might be. The nature of laws will dictate the methods of RL used to generate optimal, legal policies.

Background

Markov Decision Processes (MDPs) are a common framework underpinning RL.
In this formulation time is discretised and labelled t = 1, 2, 3, . . .. An MDP is described by a tuple (S, S0, A, T, R, γ) where:

1. S is the set of states in the environment.
2. S0 is a distribution over the initial states of the environment, p(s) for s ∈ S.
3. A is the set of all actions available.
4. T(s, a, s') = P(s'|s, a) is the transition probability distribution: the probability of transitioning to state s' when in state s ∈ S and choosing action a ∈ A.
5. R : S × A → R is the reward function, the feedback mechanism through which learning is possible.
6. γ ∈ (0, 1] is the discount factor, which differentiates the value of rewards received now from those received in the future. In finite horizon cases γ = 1 and can be ignored.

The learner then has the objective of finding a policy function, from the set of all policy functions Π : S → A, which maximises the expected discounted sum of rewards:

\[ \pi^* = \arg\max_{\pi \in \Pi} \; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, \pi\Big] \]

The policy function is often a probability distribution over actions, π(a|s) = P(a|s) for all a ∈ A, s ∈ S.

The Markovian property of this process comes from the transition function. It is satisfied if the probability of transitioning to a new state is determined only by the current state and the chosen action.

An extension of the MDP is the Partially Observable MDP (POMDP). This covers the very probable contingency where the full state of the world is not visible to the decision maker. It is described by the tuple (S, S0, A, T, R, γ, Ω, O). The two additions to the tuple are as follows:

7. Ω is the set of all observations that the learner can receive. For convenience we assume that it includes the reward r_t received in any time period.
8. O = P(ω|s', a) is the probability distribution of receiving the observation ω ∈ Ω after transitioning to state s' when action a was chosen.

The domain of the policy function then becomes the history of all observations and actions, which we write π(a|h_t), where h_t is shorthand for (o_1, a_1, o_2, a_2, . . . , a_{t-1}, o_t). Consequently the complexity of solving a POMDP is much higher than that of an MDP (Abel, MacGlashan, and Littman 2016). Our interest in introducing POMDPs is not so much the partial observability of the problem but the enlarged domain of the policy function which it necessitates. This paper will show that law-abiding policies are likely to depend on the history of observations, regardless of whether there is partial observability or not.
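To make these objects concrete, the following minimal Python sketch (mine, not the paper's; all names are illustrative) represents an MDP as a small data structure, computes the discounted return of a realised trajectory, and contrasts a Markov policy π(a|s) with the history-dependent policy π(a|h_t) that the POMDP view, and later the legal setting, requires.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State, Action, Obs = str, str, str

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    transition: Callable[[State, Action], Dict[State, float]]  # T(s, a) -> distribution over s'
    reward: Callable[[State, Action], float]                   # R(s, a)
    gamma: float = 0.99                                         # discount factor

def discounted_return(trajectory: List[Tuple[State, Action]], mdp: MDP) -> float:
    """Sum_t gamma^t R(s_t, a_t) along one realised trajectory."""
    return sum(mdp.gamma ** t * mdp.reward(s, a) for t, (s, a) in enumerate(trajectory))

# A Markov policy needs only the current state ...
MarkovPolicy = Callable[[State], Dict[Action, float]]           # pi(a | s)

# ... whereas a history-dependent policy, as required once the past matters
# (partial observability, or laws defined over sequences), conditions on
# the whole history h_t = (o_1, a_1, ..., a_{t-1}, o_t).
History = List[Tuple[Obs, Action]]
HistoryPolicy = Callable[[History], Dict[Action, float]]        # pi(a | h_t)

def sample_action(dist: Dict[Action, float]) -> Action:
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]
```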
Structural Causal Models

This paper will show that the legality of behaviour can depend on establishing the causal effects of an action. The definition of causality is a complicated topic, and there is a distinction between predictive causality, which refers to predicting the effect of actions, and actual causality, which refers to evidential analysis after actions have been taken. Structural Causal Models (SCMs) (Pearl 2000) can be used in both senses. SCMs are a special case of Bayesian causal networks in which directed arcs between nodes express direct causality as well as the usual independence statements. Pearl introduces the concept and effect of the atomic intervention, which corresponds to setting a variable to a value in the model; this has strong connections to taking actions in RL. Actual causality is a harder undertaking, since it necessarily involves counterfactual reasoning and has to deal with issues like preemption and overdetermination. Halpern-Pearl causality (Halpern 2016) is a general-purpose definition of actual causality. Alternative, simpler definitions are discussed in Liepiņa, Sartor, and Wyner (2020).
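As a concrete illustration (mine, not drawn from the paper), the sketch below builds a toy SCM in plain Python with explicit structural equations, and compares observational sampling with an atomic intervention do(X = x), implemented by overwriting the structural equation of the intervened variable. The variables (wind, agent action, drone position) are invented placeholders.

```python
import random

def scm_sample(do: dict | None = None) -> dict:
    """One sample from a toy structural causal model.

    Structural equations (illustrative only):
      WIND         := exogenous noise
      ACTION       := the agent's flight command (normally set by its policy)
      OVER_VOLCANO := f(WIND, ACTION)
    An atomic intervention do(X = x) replaces the structural equation for X
    with the constant x, severing X from its usual causes.
    """
    do = do or {}
    wind = do.get("WIND", "strong" if random.random() < 0.2 else "calm")
    action = do.get("ACTION", random.choice(["fly_north", "hold_position"]))
    over_volcano = do.get(
        "OVER_VOLCANO",
        (action == "fly_north") or (wind == "strong" and random.random() < 0.5),
    )
    return {"WIND": wind, "ACTION": action, "OVER_VOLCANO": over_volcano}

def estimate(event, n: int = 10_000, do: dict | None = None) -> float:
    return sum(bool(event(scm_sample(do))) for _ in range(n)) / n

# Observational probability of the restricted state ...
p_obs = estimate(lambda s: s["OVER_VOLCANO"])
# ... versus its probability under the intervention do(ACTION = hold_position).
p_do = estimate(lambda s: s["OVER_VOLCANO"], do={"ACTION": "hold_position"})
print(f"P(over volcano) = {p_obs:.2f}, P(over volcano | do(hold)) = {p_do:.2f}")
```

A deployed system would need a learned or supplied SCM over its real state variables; the point here is only the mechanical difference between conditioning on an action and intervening with it.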
Intent in RL

RL has been used to infer the intent of others (Qi and Zhu 2018), and even, in IRL, to define a reward function that corresponds to the intentions of an expert demonstrator (MacGlashan and Littman 2015). However, it has never defined what intent means for the learner itself. Ashton (2020b) presents a definition of direct intent in terms of causality and the desirability of realised states: an agent directly intends a state s by committing an action a if a foreseeably causes s and the agent aims at or desires state s. In the context of RL, where an agent has a value function over every possible state, inferences can be made about what a learner desires. Within criminal law, different levels of intent are required for different crimes (Loveless 2013). Direct intent is the strictest, being required for murder, but lower levels such as oblique intent, recklessness and negligence also exist. Whilst these lower levels of intent do not necessarily have any requirements about desire, their definitions often include subjective and objective tests of foreseeability vis-à-vis the prohibited outcomes of actions. Subjective tests raise interesting questions in model-free modes of RL, since the learner does not explicitly expect any outcome of its action. Objective tests require an external judgement about the probability of an outcome given an action. If a consequence of an action was foreseeable, then the offender can be thought of as intending the outcome. Lagioia and Sartor (2020) discuss this method of intent inference and consider it sufficient, albeit principally in the context of Belief-Desire-Intention (BDI) agents (Cohen and Levesque 1990). An intriguing corollary of the use of objective tests is that the predictive model that the RL agent uses or learns should be accurate. This bypasses the danger of the learner developing a 'delusion box' type model (Ring and Orseau 2011) to justify otherwise illegal policies.

A non-exhaustive taxonomy of laws

In this section I present a number of law-types which are likely to exist in a regulated environment. I differentiate between states and actions. Actions are assumed to be originated by the learner only, and their commission is voluntary. States refer to some measurable property of the environment, stable for the duration of the time period. Actions cause a measurable change in the environment, but I assume that their duration is instantaneous, so there is no state that records an action in progress.

Simple state restriction laws

This is the simplest type of law and the one which Safe RL research has concentrated on, as many physical safety constraints are of this type. Examples might include 'drive below 30 miles per hour in urban environments' or 'do not fly a drone in the vicinity of the airport'. For a state restriction to be a law, its realisation should be avoidable by the learner through its actions. This is the case for the speed restriction example. Even though I have described these laws as simple, they might be context dependent. Roads with a solid central line generally do not allow overtaking, but the presence of a stationary vehicle blocking progress might allow it, providing it is safe to do so.

Caused state restriction laws

Some states exist which could be caused both through the learner's actions and through an external mechanism. This marks the first departure from conventional Safe RL research, because the safety constraints traditionally considered do not differentiate between those states caused by the learner and those that are not. It does not matter whether the drone caused itself to fly over the volcano or whether it was blown there by the wind; the state of being located over the volcano is the one to be avoided. For legal restrictions, certain states might only be restricted if they were caused by the learner¹. A concrete example of such a causally dependent state restriction can be found within financial markets, where the UK's financial regulator prohibits trading algorithms from creating or contributing to disorderly market conditions. Such conditions could arise independently of the behaviour of the learner; if the learner has no mechanism for determining whether this is the case, learning efficiency will be compromised. For a caused state restriction law to be broken, two conditions should be satisfied: firstly, a restricted state ŝ should occur, and secondly, the actions of the learner should have foreseeably caused the state to occur.

Action sequence laws

Laws exist in a variety of settings which are restrictions on conduct, with no necessary requirement for a lasting change in the state of the world. Examples in the UK include the offence of Careless Driving and, more seriously, Dangerous Driving. There is no causation requirement, since I assume that the A-bot has the freedom to choose its own actions at any time². Action sequence laws could be transformed into a simple state restriction law by adding a state variable that indicates whether a restricted action sequence has occurred, as sketched below. Such an approach might not be efficient if a large number of conduct laws exist in the environment, as it would cause the dimension of the state space to grow.

¹ Death is not generally prohibited, but causing it generally is!
² Situations where there is no action which won't break a law are analogous to the concept of deadlock in model checking (Baier and Katoen 2008). Laws can still be broken when the perpetrator had no choice but to break the law, but the defence of necessity might then be valid as mitigation.
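The state-variable transformation described above can be pictured as running a small monitor automaton alongside the environment and appending its memory to the state. The sketch below is my own illustration: the three-step 'weaving' pattern stands in for a careless-driving-style conduct rule, and real conduct offences would need far richer monitors.

```python
from typing import List, Tuple

# Hypothetical restricted conduct: three consecutive lane changes
# in alternating directions ("weaving"), regardless of the resulting state.
RESTRICTED_PATTERNS: List[Tuple[str, ...]] = [
    ("lane_left", "lane_right", "lane_left"),
    ("lane_right", "lane_left", "lane_right"),
]

class ConductMonitor:
    """Finite-state monitor that flags a restricted action sequence.

    Running this monitor next to the MDP and appending `self.violated`
    to the state turns an action-sequence law into a (larger) simple
    state restriction on the augmented state space.
    """

    def __init__(self, window: int = 3):
        self.window = window
        self.recent: List[str] = []
        self.violated = False

    def step(self, action: str) -> bool:
        self.recent = (self.recent + [action])[-self.window:]
        if tuple(self.recent) in RESTRICTED_PATTERNS:
            self.violated = True
        return self.violated

monitor = ConductMonitor()
for a in ["accelerate", "lane_left", "lane_right", "lane_left"]:
    flag = monitor.step(a)
print("restricted sequence occurred:", flag)
```

Each additional conduct law adds another monitor (or a product of automata), which is exactly the state-space growth the paragraph above warns about.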
Mixed state action sequence laws

Laws exist which combine a sequence of actions with the restricted state(s) that the sequence causes. Continuing the driving examples from the previous section, in the UK there exist statutory offences of causing death by careless or dangerous driving. These laws constitute a restriction on how certain states are arrived at.

Inchoate offences: laws restricting behaviour that may induce future restricted states

Inchoate offences are restrictions on action sequences and states which may lead to restricted states that are not necessarily realised. Examples in the UK include attempt crimes, such as attempted murder or possession of drugs with intent to supply. In the USA, conspiracy and solicitation (the request, encouragement or payment for someone else to commit a felony) are major classes of inchoate offence.

Laws requiring intent

Common law as practised in the UK, USA, India and Canada, amongst others, requires that the accused had mens rea (the mental element of intent to commit a crime) for certain criminal offences to have been committed by them. Different levels of intent exist, ranging from direct intent, where the accused deliberately caused and wanted to cause a prohibited outcome, at the extreme, through oblique intent, where prohibited outcomes were caused as a side effect of their behaviour, to recklessness and negligence, where the prohibited outcomes were foreseeable outcomes of their behaviour to various degrees. Certain offences specify what level of mens rea is required; murder, for instance, requires direct or oblique intent.

Aside from crimes of specific intent, certain laws exist which require establishing for what purpose the accused did something. This is called basis intent by Bathaee (2011). Examples cited include US market anti-spoofing laws, spoofing being defined as the placement of orders with the intent to cancel them. Another related and topical example is termed gatekeeping intent by Bathaee: laws or systems which are discriminatory in effect are only unlawful if discriminatory intent behind them is established.
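The definition of direct intent given in the Background (foreseeable causation plus desire) suggests one mechanical reading that an overseer could apply to an RL agent: an action directly intends a restricted state if the agent's own model makes that state a foreseeable consequence of the action, and the agent's value estimates mark the state as desired. The sketch below is my own illustration of that reading, with an invented foreseeability threshold; it is not a legal test.

```python
from typing import Callable, Dict

State, Action = str, str

def directly_intends(
    state: State,
    action: Action,
    current: State,
    predict: Callable[[State, Action], Dict[State, float]],  # agent's world model P(s' | s, a)
    value: Callable[[State], float],                          # agent's learned value function V(s)
    foreseeability_threshold: float = 0.3,                    # illustrative cut-off, not a legal standard
) -> bool:
    """Crude proxy for direct intent: foreseeable causation plus desire.

    (i) Foreseeable causation: under the agent's own predictive model,
        `action` makes `state` sufficiently likely.
    (ii) Desire: the agent values `state` above the status quo, so reaching
        it plausibly forms part of its aim.
    """
    foreseeable = predict(current, action).get(state, 0.0) >= foreseeability_threshold
    desired = value(state) > value(current)
    return foreseeable and desired
```

Oblique intent, recklessness and negligence would relax the desire condition and vary the foreseeability test (the agent's subjective model versus an objective, externally supplied one).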
Implications of the taxonomy

There exist a number of challenges in developing an RL method which produces legal policies under a rich set of laws. I classify them into three areas:

1. Encoding The environment's laws need to be described in such a way as to be machine interpretable.
2. Determination A mechanism needs to exist which can determine the legality of any behaviour, either in advance or in retrospect.
3. Constrained policy learning There should exist a method to constrain the policy of the A-bot to be law-abiding when it is acting or learning.

Generally these problems should be solved in the order displayed. The determination of legality requires an encoding of law against which to reference (planned) behaviour, and constrained policy learning relies on knowing which policies are acceptable and which are not. By looking at the taxonomy of laws, some inferences can be made about all three elements of this process. I will run through each of the three tasks and make comments on how laws affect them. In practice the three tasks might be heavily intertwined.

Encoding

Safe RL research is only beginning to pay attention to how laws should be described. This is because the types of restrictions considered have largely been of the simple state type, which can be encoded using simple algebraic expressions. This approach becomes untenable when considering more complicated laws like the ones identified in the previous section. Furthermore, the quantity of applicable rules in regulated environments is larger than most rulesets hitherto considered in research. I identify four desirable features of an encoding which should be used to convey laws.

1. Temporal A number of laws restrict sequences of states or actions. Moreover, there is no requirement that these sequences are contiguous. An encoding needs to be rich enough to express multiple states and actions with temporal relations like always, until, next, etc.
2. Probabilistic As in the case of inchoate offences, some laws refer to future states that are not realised. Since the space of all possible future events is a large one, law reasonably concerns itself with restricting foreseeable consequences of behaviours. An encoding of laws should be rich enough to express this.
3. Causal As we saw in the previous section, laws will often prohibit the causation of a state, not the state itself per se. Our death is not usually prohibited, but causing it normally is. Causation, when considered in prospect, will also normally require some sort of probabilistic reasoning.
4. Intent Certain laws require establishing levels of intent on the part of the transgressor. Different levels of intent exist and are applicable to different offences.

In Table 1 I summarise what level of encoding expressiveness is required for each of the law types described in the previous section. Nearly all of the laws require temporal expressiveness. Temporal logic systems allow us to express when conditions should be true. A wide variety exist, such as Linear Temporal Logic (LTL) (Pnueli 1977), Computation Tree Logic (CTL) (Clarke and Emerson 1981), which considers multiple future paths, and Probabilistic CTL (PCTL) (Hansson and Jonsson 1994), which, as the name suggests, accounts for probabilistic transitions. Kleinberg and Mishra (2009) extend this to provide a language capable of expressing causal relationships. To our knowledge there is no similar extension to express intention, and this is an immediate project. Alves, Dennis, and Fisher (2020) succeed in encoding the road junction rules for an autonomous vehicle using a variant of LTL. An illustrative encoding is sketched at the end of this subsection.

The analysis is not exhaustive. For example, in any given regulated environment there are likely to be a large number of rules that the learner should obey simultaneously. This is likely to mean deadlock situations arise, where not breaking one law may result in the breaking of another. A meta-ordering of laws may be required to deal with this situation.

Table 1: The complexity of the encoding language required is dependent on the law type.

Law type | Temporal | Probabilistic | Causal | Intent
State restriction | – | – | – | –
Caused state restriction | Maybe | Maybe | Yes | –
Action sequence | Yes | – | – | –
Action state sequence | Yes | Maybe | Maybe | –
Inchoate | Yes | Yes | Yes | Maybe
Intent | Yes | Yes | Yes | Yes
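To give a flavour of what such encodings might look like (my illustration, not drawn from the paper; the propositions urban, speed>30, overtake, manoeuvre_complete and prohibited are invented placeholders), a simple state restriction, a non-contiguous conduct restriction and an inchoate-style foreseeability restriction could be written, respectively, as

\[ \mathbf{G}\,\big(\textit{urban} \rightarrow \neg(\textit{speed} > 30)\big), \]
\[ \mathbf{G}\,\neg\big(\textit{overtake} \wedge \mathbf{X}\,(\neg\textit{manoeuvre\_complete} \;\mathbf{U}\; \textit{overtake})\big), \]
\[ \mathcal{P}_{< p}\,\big[\,\mathbf{F}\,\textit{prohibited}\,\big]. \]

The first two are checkable against a trace, the third needs a probabilistic model, and, as the text notes, caused-state and intent-based laws are not expressible in these logics as they stand: they additionally reference a causal model and the agent's own objectives.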
Determination

Given that we have a codified set of rules, we require a device to determine when a sequence of behaviour has contravened, or is contravening, a law. I assume that the states referred to in the encoding are either the same as those perceived by the learner, or that there is an available mapping function between the learning and encoding state spaces. This is of course not a given, since the learner may be perceiving continuous states while the encoding is likely to refer to high-level states. In the case of simple state restrictions this is not a difficult task, but it becomes increasingly complex with richer laws. I have separated the requirements into four main features.

1. Domain Consider an arbiter function Γ which determines legality, mapping behaviour to the binary set {legal, illegal}. The domain of this function is dependent on the type of law it is considering. Laws which reference more than one state or action, for example, require a domain which includes the history of states and actions h_t = (s_1, a_1, s_2, a_2, . . . , s_t). Laws which reference future paths will also require the policy function of the learner π(·). Laws which require intent may also require information about the reward function of the learner or its estimation of state values.
2. Future projection A model of the environment is required for almost all laws. In the case where strong safety is required in training, the legality of any action needs to be assessed before choosing it, and this means assessing the likely state transitions which occur as a result. Projection is also a requirement in causal reasoning.
3. Causal reasoning As certain laws are defined by causation, a method is required to determine whether restricted states have been, or are likely to be, caused by the learner's action. This requires a causal model of the environment.
4. Intentional reasoning In environments where laws are defined by intent, a learner must be aware of what it is intending to do (i.e. its likely policy trajectory) by choosing a particular action at any moment in time.

In Table 2 I show the necessary features of a legality determination process according to the law type present. Reasoning about intent requires an algorithmic definition of intent. This is an open area of research, since the concept of intent has been deliberately left as a primitive by legal practitioners. Care must be taken to ensure that the definition of intent used in Safe RL corresponds to what a court would find sufficient.

Table 2: Taxonomy implications for the determination process. The arbiter function's domain comprises the state path, action path and policy; reasoning covers causation and intent.

Law type | Future trajectory prediction | State path | Action path | Policy | Causal | Intent
State restriction | – | Maybe | – | – | – | –
Caused state | Yes | Yes | – | – | Yes | –
Action sequence | – | – | Yes | – | – | –
Action state sequence | Maybe | Yes | Yes | – | Maybe | –
Inchoate | Yes | – | Yes | Probably | – | –
Intent | Yes | – | – | Yes | Yes | Yes
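One possible shape for the arbiter Γ, reflecting the law-type-dependent domain described in item 1, is sketched below. The signature, the law-type flags and the example law are my own illustration, not a prescribed interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State, Action = str, str
History = List[Tuple[State, Action]]

@dataclass
class Law:
    name: str
    test: Callable[..., bool]         # returns True if the behaviour is legal
    needs_history: bool = False       # state/action paths (Table 2)
    needs_policy: bool = False        # future paths, e.g. inchoate offences
    needs_values: bool = False        # intent: reward function / value estimates
    needs_causal_model: bool = False  # caused-state restrictions

def gamma(
    law: Law,
    history: History,
    policy: Optional[Callable[[History], Dict[Action, float]]] = None,
    value: Optional[Callable[[State], float]] = None,
    causal_model: Optional[Callable[[History, State], bool]] = None,
) -> bool:
    """Arbiter: True (legal) or False (illegal) for the behaviour so far.

    Only the ingredients a law actually needs are handed to its test: a simple
    state restriction inspects the latest state alone, while richer law types
    receive the history, the learner's policy, its value estimates and/or a
    causal model of the environment.
    """
    kwargs = {"history": history}
    if law.needs_policy:
        kwargs["policy"] = policy
    if law.needs_values:
        kwargs["value"] = value
    if law.needs_causal_model:
        kwargs["causal_model"] = causal_model
    return law.test(**kwargs)

# Example: a simple state restriction only inspects the most recent state.
speed_law = Law(
    name="urban speed limit",
    test=lambda history, **_: not history or history[-1][0] != "urban_over_30",
    needs_history=True,
)
print(gamma(speed_law, history=[("urban_over_30", "accelerate")]))  # False: illegal
```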
Learning process

The taxonomy of laws informs us how those laws should be described and what the process which determines the legality of behaviour requires. Finally, it can also inform us about the properties of RL methods which will generate legally constrained policies.

1. Memory Reinforcement learning approaches typically use an MDP formulation to model their task. Whilst a record of the current state might still be valid for the transition model, most of the law types that I have identified rely to some extent on sequences of states and actions. Thus the device which chooses actions at any state (most likely the policy function) must include histories in its domain. Otherwise the standard MDP learner would not be able to determine whether its current action is legal or not. Including the history of actions and states is something that POMDP methods already do, in order to make inferences about the hidden states of the model.
2. Model for planning Determining the legality of any action requires predicting the likelihood of future states. How far prediction is expected to go into the future depends on the laws present - avoiding inchoate offences presumably requires greater foresight. Much of RL is 'model free', and successfully so, but models seem unavoidable here. Established methods like Dyna-Q simultaneously learn to act and create a world model (Sutton 1990).
3. Causal model Determining whether a law has been broken or not will often require some test of causality ex post (Turner 2019). The task of the learner is to make sure ex ante that it does not cause a restricted state to occur, and, if such a state does occur, that it is not subsequently adjudged to have been a legal cause of it. The presence of causal restrictions necessitates that a causal model of the world be formed for predictions. Bayesian causal models, or equivalently Structural Causal Models (SCMs) (Pearl 2000), can be used to predict causal effects and to determine causality ex post. They readily accept techniques like counterfactual analysis, which allows off-policy data treatment (Bareinboim and Pearl 2016) - important for off-policy RL methods. There also exist definitions of causality based on SCMs, such as the actual causality of Halpern (2016), which are capable of dealing with the trickier causal problems of overdetermination, preemption and omission. See Bareinboim (2020) for an introduction to causal RL. A minimal sketch combining these three requirements appears below.
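The following sketch (mine; it assumes a hypothetical arbiter and a one-step world model like those sketched earlier) shows how the three requirements fit together at decision time: the chooser conditions on the history, rolls candidate actions forward through a model, and discards actions whose projected continuations the arbiter judges illegal.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

State, Action = str, str
History = List[Tuple[State, Optional[Action]]]

def choose_legal_action(
    history: History,                                      # memory of past states and actions
    current: State,
    candidates: List[Action],
    model: Callable[[State, Action], Dict[State, float]],  # learned or supplied world model P(s'|s,a)
    is_legal: Callable[[History], bool],                   # arbiter applied to projected histories
    q_value: Callable[[State, Action], float],             # ordinary learned action values
    n_samples: int = 20,
) -> Action:
    """Pick the highest-value action whose projected continuations stay legal.

    Only a sketch: real laws may need multi-step rollouts (inchoate offences),
    causal attribution of the projected states, and a principled fallback when
    no action is projected legal (the deadlock/necessity situation noted in
    the taxonomy's footnotes).
    """
    legal = []
    for a in candidates:
        next_dist = model(current, a)
        sampled_next = random.choices(
            list(next_dist), weights=list(next_dist.values()), k=n_samples
        )
        if all(is_legal(history + [(current, a), (s_next, None)]) for s_next in sampled_next):
            legal.append(a)
    if not legal:
        raise RuntimeError("no projected-legal action available; defer to a human overseer")
    return max(legal, key=lambda a: q_value(current, a))
```

Structurally this resembles the shields discussed in the related work below, with the legality check supplied by the arbiter rather than by a hand-written safety automaton.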
Related work

The task of learning a legally constrained policy through RL has seldom been addressed in isolation; instead it is usually cited as a possible use case in more general Safe RL work. Surprisingly, the learning of ethical policies has loomed larger in published research. For an RL approach see Abel, MacGlashan, and Littman (2016), or Winfield et al. (2019) for a more general discussion on ethically constraining autonomous systems. Ethical constraint is a harder task, since there is no agreed source of ethical constraints to apply to the learner. In contrast to those of ethics, Hildebrandt (2019) points out that questions of legality always have closure. It is important to observe that there is no single source of the law; the determination of legality will likely rely on referencing multiple sources (Boella et al. 2014).

García and Fernández (2015) provide a general survey of Safe RL, dividing approaches into those that modify the reward structure and those that modify the exploration process. Constrained Markov Decision Processes (CMDPs) do the former, by adding a finite set of auxiliary cost functions C_i : S × A → R to the vanilla MDP. Policies should then achieve a discounted total cost, in expectation, less than some scalar d_i (whilst maximising the normal reward function). This is largely the approach of Constrained Policy Optimisation presented by Achiam et al. (2017). A drawback of such an approach is that bad states can be reached during exploration, making learning outside a simulator potentially expensive. Constraints introduced as cost functions also need to be differentiable and Markovian if certain gradient methods are to be used. Neither of these restrictions applies to the Constrained Cross-Entropy method of Wen and Topcu (2018), though perhaps at the cost of data parsimony.
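Written out (my notation, following the CMDP description above rather than any one paper), the constrained problem is

\[ \pi^* = \arg\max_{\pi \in \Pi} \; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, \pi\Big] \quad \text{subject to} \quad \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t) \,\Big|\, \pi\Big] \le d_i, \qquad i = 1, \dots, k. \]

The per-step, Markovian form of each auxiliary cost C_i is precisely what makes the history-dependent, causal and intent-based laws of the taxonomy awkward to express in this framework.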
Safe RL methods which constrain exploration include approaches where a policy is learnt from observing a safe policy, in a process known as Inverse RL or Apprenticeship RL (Abbeel and Ng 2004). Recent examples include Noothigattu et al. (2018), who train a learner to play Pac-Man following the rule 'don't eat the ghosts' through expert demonstration and a bandit policy which alternates between observed 'safe' behaviour and optimal self-taught behaviour. Abel, MacGlashan, and Littman (2016) present a method where the ethical preferences of an expert are derived through observation and then used to develop policies accordingly. IRL approaches such as these obviate the requirement for an explicit representation of rules. This could be seen as a good feature in constrained tasks such as the learning of ethical behaviour or customs, where there is no written source of what the constraints should be. This is not the case in regulated settings. Moreover, IRL is an ill-posed problem - many reward functions exist to explain any observed behaviour. To make the problem tractable, simplifying assumptions must be made about the reward function's form. The resulting reward function might not be rich enough to encode the preferences required not to break all laws. In particular, Arnold, Kasenberg, and Scheutz (2017) note that IRL does not infer intertemporal rules.

A developing area of Safe RL comprises methods which combine formal methods based on symbolic logic with the learning machinery of RL. Many of these techniques originate from the research area of formal verification and model checking (Baier and Katoen 2008). These are the techniques developed to error-check software systems and provide stronger guarantees of correctness. As discussed, temporal logic allows a richer expressiveness of laws. Different temporal logic systems have been applied to the learning of policies in MDPs where transitions are known or not. Linear Temporal Logic (LTL) is used in Hasanbeig and Kroening (2020), Fu and Topcu (2015) and Wen, Ehlers, and Topcu (2015), and Differential Dynamic Logic is used in Fulton and Platzer (2018). Probabilistic Computation Tree Logic (PCTL) is used in Mason et al. (2017).

Alshiekh et al. (2018) and Jansen et al. (2018) use a structure called a shield to create safe policies through RL. This is a system which sits between the learner and the environment and either filters the choice of available actions for the learner at learning time, or replaces unwise actions in deployment. Ashton (2020a) calls the design a legal oracle and explores its necessary features in a legal setting. A shield has a model of the environment, knows the required constraints (which are described in temporal logic), and is able to use formal program verification methods to check the legality of any action at any moment in time. An attractive feature of this method is that the mechanism of constraint is separate from, and somewhat agnostic to, the method of learning. Jansen et al. (2020) identify three challenges to this approach: model checking is computationally expensive; safety in a probabilistic environment is not binary, so thresholds need to be considered; and finally shielding may obstruct efficient exploration, thereby generating sub-optimal policies.

Seldonian Reinforcement Learning (Thomas et al. 2019) is a recent technique that aims to produce RL algorithms that only output safe policies with a certain (high) probability. It differs from the other methods discussed in this paper in that the technique searches for learning algorithms, not policies. The RL example presented in the paper has restrictions of limited complexity, so we will have to wait for more published research to assess this method properly.

Conclusion

This paper is motivated by an aim to design Safe RL processes which are capable of producing policies constrained under a general rule set. By creating a brief taxonomy of laws in the language of states and actions, specifically for this application, I have been able to draw some conclusions about the requirements of legally safe RL. Laws are commonly defined in inter-temporal ways over actions and states. This means that a learning process must include a memory of past states and actions; thus the domain of a legal policy function will include history, just as it does in RL under a POMDP. Causality and intent can be key concepts in determining whether, and which, laws have been broken. Whilst RL is beginning to tackle causality, it has not done so in the context of constrained learning. Intent is barely defined quantitatively, but it will have to be if generally legal RL systems are to be produced. Causality, intent and the existence of inchoate offences mean that a legally safe RL algorithm will require prediction of likely future trajectories. This in turn requires some type of environment model to be learned or supplied to the learner, and planning to take place.

This work could be viewed as an application of legal requirements engineering. Care should be taken, since it originates singularly from a computer scientist and not a legal practitioner (Boella et al. 2014). Yet it is a starting point which can at least begin to inform engineers.

References

Abbeel, P.; and Ng, A. Y. 2004. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the 21st International Conference on Machine Learning (ICML). doi:10.1145/1015330.1015430.

Abel, D.; MacGlashan, J.; and Littman, M. L. 2016. Reinforcement learning as a framework for ethical decision making. AAAI Workshop - Technical Report WS-16-01: 54-61.

Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained Policy Optimization. URL http://arxiv.org/abs/1705.10528.

Alshiekh, M.; Bloem, R.; Ehlers, R.; Könighofer, B.; Niekum, S.; and Topcu, U. 2018. Safe Reinforcement Learning via Shielding. In AAAI Conference on Artificial Intelligence. URL https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17211/16534.

Alves, G. V.; Dennis, L.; and Fisher, M. 2020. Formalisation and Implementation of Road Junction Rules on an Autonomous Vehicle Modelled as an Agent. In Formal Methods. FM 2019 International Workshops, volume 1, 217-232. Springer International Publishing. doi:10.1007/978-3-030-54994-7_16.

Arnold, T.; Kasenberg, D.; and Scheutz, M. 2017. Value alignment or misalignment - What will keep systems accountable? AAAI Workshop - Technical Report WS-17-01: 81-88.

Ashton, H. 2020a. AI Legal Counsel to train and regulate legally constrained Autonomous systems. In IEEE Big Data 2020 Workshop on Applications of Artificial Intelligence in the Legal Industry, forthcoming.

Ashton, H. 2020b. Definitions of intent for AI derived from common law. In Jurisin 2020: 14th International Workshop on Juris-informatics. URL https://easychair.org/publications/preprint/GfCZ.

Baier, C.; and Katoen, J.-P. 2008. Principles of Model Checking. MIT Press. URL http://mitpress.mit.edu/books/principles-model-checking.

Bareinboim, E. 2020. Causal Reinforcement Learning (CRL). URL https://crl.causalai.net/.

Bareinboim, E.; and Pearl, J. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113(27): 7345-7352. doi:10.1073/pnas.1510507113.

Bathaee, Y. 2011. The artificial intelligence black box and the failure of intent and causation. Harvard Journal of Law & Technology 2(4): 31-40.

Boella, G.; Humphreys, L.; Muthuri, R.; Rossi, P.; and van der Torre, L. 2014. A critical analysis of legal requirements engineering from the perspective of legal practice. In 2014 IEEE 7th International Workshop on Requirements Engineering and Law (RELAW), 14-21. doi:10.1109/RELAW.2014.6893476.

Clarke, E. M.; and Emerson, E. A. 1981. Design and synthesis of synchronization skeletons using branching time temporal logic. In Workshop on Logic of Programs, 52-71.

Cohen, P. R.; and Levesque, H. J. 1990. Intention is choice with commitment. Artificial Intelligence 42(2-3): 213-261. doi:10.1016/0004-3702(90)90055-5.
Fu, J.; and Topcu, U. 2015. Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints. doi:10.15607/rss.2014.x.039.

Fulton, N.; and Platzer, A. 2018. Safe reinforcement learning via formal methods: Toward safe control through proof and learning. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, 6485-6492.

García, J.; and Fernández, F. 2015. A Comprehensive Survey on Safe Reinforcement Learning. Journal of Machine Learning Research 16: 1437-1480.

Halpern, J. Y. 2016. Actual Causality. MIT Press, 1st edition.

Hansson, H.; and Jonsson, B. 1994. A logic for reasoning about time and reliability. Formal Aspects of Computing 6(5): 512-535. doi:10.1007/BF01211866.

Hasanbeig, M.; and Kroening, D. 2020. Cautious Reinforcement Learning with Logical Constraints. doi:10.5555/3398761.3398821.

Hildebrandt, M. 2019. Closure: on ethics, code and law. In Law for Computer Scientists, chapter 11. Oxford University Press.

Jansen, N.; Junges, S.; Könighofer, B.; and Bloem, R. 2018. Shielded Decision-Making in MDPs.

Jansen, N.; Könighofer, B.; Junges, S.; Serban, A.; and Bloem, R. 2020. Safe Reinforcement Learning Using Probabilistic Shields. In 31st International Conference on Concurrency Theory, CONCUR 2020, 1-3. doi:10.4230/LIPIcs.CONCUR.2020.3.

Kleinberg, S.; and Mishra, B. 2009. The temporal logic of causal structures. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, UAI 2009, 303-312. URL https://arxiv.org/abs/1205.2634v1.

Lagioia, F.; and Sartor, G. 2020. AI Systems Under Criminal Law: a Legal Analysis and a Regulatory Perspective. Philosophy and Technology 33(3): 433-465. doi:10.1007/s13347-019-00362-x.

Liepiņa, R.; Sartor, G.; and Wyner, A. 2020. Arguing about causes in law: a semi-formal framework for causal arguments. Artificial Intelligence and Law 28(1): 69-89. doi:10.1007/s10506-019-09246-z.

Loveless, J. 2013. Mens Rea: Intention, Recklessness, Negligence and Gross Negligence. In Complete Criminal Law: Text, Cases and Materials, chapter 3, 91-150. OUP Oxford. doi:10.1093/he/9780199646418.003.0003.

MacGlashan, J.; and Littman, M. L. 2015. Between imitation and intention learning. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Mason, G.; Calinescu, R.; Kudenko, D.; and Banks, A. 2017. Assured reinforcement learning with formally verified abstract policies. In ICAART 2017 - Proceedings of the 9th International Conference on Agents and Artificial Intelligence, volume 2, 105-117. doi:10.5220/0006156001050117.

Noothigattu, R.; Bouneffouf, D.; Mattei, N.; Chandra, R.; Madan, P.; Varshney, K.; ...; and Rossi, F. 2018. Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration. URL http://arxiv.org/abs/1809.08343.

Pearl, J. 2000. Causality: Models, Reasoning and Inference. Cambridge University Press.

Pnueli, A. 1977. The temporal logic of programs. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS 1977), 46-57. doi:10.1109/sfcs.1977.32.
Qi, S.; and Zhu, S.-C. 2018. Intent-aware Multi-agent Reinforcement Learning. In IEEE International Conference on Robotics and Automation (ICRA), 7533-7540. doi:10.1109/ICRA.2018.8463211.

Ring, M.; and Orseau, L. 2011. Delusion, Survival, and Intelligent Agents. In Conference on Artificial General Intelligence (AGI-11). doi:10.1007/978-3-642-22887-2.

Saunders, W.; Stuhlmüller, A.; Sastry, G.; and Evans, O. 2018. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2067-2069.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; ...; and Hassabis, D. 2017. Mastering the game of Go without human knowledge. Nature 550(7676): 354-359. doi:10.1038/nature24270.

Sutton, R. S. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning, 216-224.

Thomas, P. S.; da Silva, B. C.; Barto, A. G.; Giguere, S.; Brun, Y.; and Brunskill, E. 2019. Preventing undesirable behavior of intelligent machines. Science 366(6468). doi:10.1126/science.aag3311.

Turner, J. 2019. Robot Rules. Palgrave Macmillan.

Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; ...; and Silver, D. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782): 350-354. doi:10.1038/s41586-019-1724-z.

Wen, M.; Ehlers, R.; and Topcu, U. 2015. Correct-by-synthesis reinforcement learning with temporal logic constraints. In IEEE International Conference on Intelligent Robots and Systems (IROS 2015), 4983-4990. doi:10.1109/IROS.2015.7354078.

Wen, M.; and Topcu, U. 2018. Constrained Cross-Entropy Method for Safe Reinforcement Learning. In Advances in Neural Information Processing Systems 31 (NeurIPS), 7461-7471. URL http://papers.nips.cc/paper/7974-constrained-cross-entropy-method-for-safe-reinforcement-learning.pdf.

Winfield, A. F. T.; Michael, K.; Pitt, J.; and Evers, V. 2019. Machine ethics: The design and governance of ethical AI and autonomous systems. Proceedings of the IEEE 107(3): 509-517. doi:10.1109/JPROC.2019.2900622.