<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Case for Robust AI in Robotics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shashank Pathak</string-name>
          <email>shashank.pathak@iit.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Pulina</string-name>
          <email>lpulina@uniss.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Armando Tacchella</string-name>
          <email>armando.tacchella@unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Context</institution>
          ,
          <addr-line>Motivation, Objectives</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universita degli Studi di Genova</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>iCub Facility</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Researchers envision a world wherein robots are free to interact with the external environment, which includes human beings, other living creatures, other robots and a variety of inanimate objects. It is always tacitly assumed that interactions will be smooth, i.e., that they will fulfill several desirable properties ranging from safety to appropriateness. We posit that a reasonable mathematical model to frame such a vision is that of Markov decision processes, and that ensuring smooth interactions amounts to endowing robots with control policies that are provably compliant with side conditions expressed in probabilistic temporal logic.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Context, Motivation, Objectives</title>
      <p>
        A policy π is greater than or equal to a policy π′ (π ≥ π′) exactly when
V_π(s) ≥ V_π′(s) for all s ∈ S. Given some reasonable definition of the value
V_π(s) for all s ∈ S, solving an MDP amounts to finding a policy π* such that
π* ≥ π for all possible policies π; more about MDPs and related
decision problems can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
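      <p>As a concrete illustration of what solving an MDP means in this setting, the following minimal sketch (ours, not part of the original text) computes a dominating policy for a toy three-state MDP by value iteration, assuming a standard discounted definition of V_π(s); the transition probabilities and rewards are invented purely for illustration.</p>
      <preformat>
# Minimal sketch: value iteration on a toy MDP, assuming a discounted value V(s).
# The toy model below is invented for illustration only.
import numpy as np

gamma = 0.95                                    # discount factor (an assumption)
# p[a, s, t] = transition probability p(t | s, a); each row sums to 1
p = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])
r = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])   # r[s, a]: immediate reward

V = np.zeros(p.shape[1])
for _ in range(1000):                           # iterate the Bellman optimality operator
    Q = r + gamma * np.tensordot(p, V, axes=([2], [0])).T   # Q[s, a]
    V_new = Q.max(axis=1)
    if np.allclose(V_new, V, atol=1e-10):
        break
    V = V_new
pi_star = Q.argmax(axis=1)   # a greedy policy: it dominates every other policy state-wise
print("V*:", V, "pi*:", pi_star)
      </preformat>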
      <p>
        From a modeling point of view, Markov decision processes and the associated
optimization problems capture a broad set of approaches to the analysis and
synthesis of intelligent behavior for autonomous agents, which can be put to
good use in the field of Robotics. For instance, in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] it is argued that many
problems of AI planning under uncertainty can be modeled as MDPs. In the
learning community (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for a recent perspective), reinforcement learning
(RL) is viewed as one of the key techniques to synthesize intelligent behavior
for interactive agents, and the mathematical underpinning of RL is also given
by Markov decision processes. Even closer to field robotics, the area of optimal
control has a long tradition of leveraging MDPs for problems involving sequential
decision making under uncertainty; see, e.g., [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Given the widespread adoption
of MDPs in AI and related (sub)fields, proposing techniques to achieve robustness
of autonomous agents based on Markov decision processes is bound to have a
broad impact. We believe that Robotics might benefit the most from robust AI
techniques, since robots ought to be functional but also dependable, and the
trade-off between these two aspects needs to be fully understood and explored.
      </p>
      <p>Our key proposition is to extend the modeling framework of MDPs to one
that includes explicit side conditions expressed in probabilistic computation tree
logic (PCTL). The syntax of PCTL is defined considering the set Φ of state
formulas and the set Ψ of path formulas. Given a set of atomic propositions AP,
Φ is defined inductively as: (i) if φ ∈ AP then φ ∈ Φ; (ii) ⊤ ∈ Φ; if φ, ψ ∈ Φ
then also φ ∧ ψ ∈ Φ and ¬φ ∈ Φ; and (iii) P⋈p[ψ] ∈ Φ where ⋈ ∈ {≤, &lt;, ≥, &gt;},
p ∈ [0, 1] and ψ ∈ Ψ, where P⋈p[ψ] is the probabilistic path operator. The set Ψ
contains exactly the expressions of type X φ (next), φ U≤k ψ (bounded until) and
φ U ψ (until), where φ, ψ ∈ Φ and k ∈ ℕ; more on PCTL can be found in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Given an MDP M, a definition of value V_π(s) for all states of M, and a PCTL
formula φ, the Markov decision problem with probabilistic side condition (mdpp)
can be defined as the problem of finding π* such that π* ≥ π for all policies π
and D_{M,π*} ⊨ φ, i.e., φ is always satisfied in the discrete-time Markov chain
(DTMC) D_{M,π*} corresponding to the combination of M and π*; see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for
details about DTMCs and PCTL semantics, and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for details about combining
MDPs with policies to yield DTMCs.
      </p>
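      <p>To make the combination of an MDP with a policy concrete, the following minimal sketch (ours, a sketch of the standard construction rather than the specific one of [6]) builds the transition matrix of the induced DTMC D_{M,π} from the MDP transition probabilities and a memoryless stochastic policy; the toy numbers are invented.</p>
      <preformat>
# Minimal sketch: inducing the DTMC D_{M,pi} from an MDP M and a memoryless
# stochastic policy pi. Toy numbers are illustrative only.
import numpy as np

def induced_dtmc(p, pi):
    """p[a, s, t] holds the MDP probabilities p(t | s, a); pi[s, a] is a
    stochastic policy; the result is the DTMC matrix P[s, t]."""
    # P(t | s) = sum over a of pi(a | s) * p(t | s, a)
    return (pi[:, :, None] * p.transpose(1, 0, 2)).sum(axis=1)

p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.4, 0.6], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3], [1.0, 0.0]])
P = induced_dtmc(p, pi)
assert np.allclose(P.sum(axis=1), 1.0)   # each row of a DTMC matrix sums to 1
print(P)
      </preformat>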
    </sec>
    <sec id="sec-2">
      <title>State of the art</title>
      <p>
        We can distinguish current approaches to mdpp into two broad categories, the
first one oblivious of formal techniques and the second one deeply rooted in
formal verification and reasoning. In the former category we can list
multiobjective reinforcement learning [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], with Geibel and Wysotzki's approach as a
special case [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The main idea of these approaches is to encode the requirement
expressed by φ in the value function V_π(s). Neither approach requires
knowledge of p(·|s, a), because the relevant information is learned by interacting
with (a physical realisation of) M. In this way, solving the Markov decision
problem yields a policy that most probably also satisfies φ, although formal
guarantees are not provided. Still in the first category, we can consider Gillula
and Tomlin's safe online learning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which, albeit restricted to safety properties,
provides a mathematically precise way to combine side conditions with the online
solution of an MDP via reinforcement learning (RL) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Decision-theoretic
planning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a.k.a. indirect RL [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], model-based RL [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], or controller synthesis
on MDPs [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is another approach in which mdpp can be formalised by taking
into account both the elements related to the optimisation of V_π and the side conditions
expressed by φ. In this case the solution is precise, but it requires knowledge
of p(·|s, a) in advance, and the policy synthesised is always deterministic. Yet
another way to incorporate side conditions is to consider the overall model as a
constrained MDP [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], where one type of cost is to be optimised while keeping the
other types of costs within a bound. As before, the approach requires knowledge
of the model. None of the methods listed above covers the case in which
p(·|s, a) is unknown and the optimal policy is stochastic. Also, the learning-based
ones leave the relation between the logical specification of φ and the functional
specification of rewards unclear.
      </p>
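      <p>The following minimal sketch (ours, not the actual algorithms of [7] or [8]) illustrates the general idea of folding the requirement expressed by φ into the quantity being optimised: a task reward and a penalty for entering error states are scalarised and then learned with ordinary tabular Q-learning; the weight xi, the state encoding and the toy transition are assumptions made for illustration.</p>
      <preformat>
# Minimal sketch: scalarising task reward and a safety penalty, then learning
# on the combined signal with standard tabular Q-learning. All numbers and the
# weight xi are invented for illustration.
def combined_reward(task_reward, in_error_state, xi=10.0):
    return task_reward - (xi if in_error_state else 0.0)

def q_update(Q, s, a, s_next, r, actions, alpha=0.1, gamma=0.95):
    # one standard Q-learning step on the scalarised reward
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

# toy usage: a transition ending in an error state is discouraged
Q = {}
q_update(Q, s=0, a=1, s_next=2, r=combined_reward(1.0, in_error_state=True), actions=(0, 1))
print(Q)   # the visited (state, action) pair now has a negative estimate
      </preformat>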
      <p>
        Another set of approaches to mdpp is based on formal methods, which
can be used to solve parts of the mdpp problem. For a known model M and
a given policy π, probabilistic model checking, supported by efficient tools like
PRISM [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or MRMC [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], can be applied to check whether the side condition φ
is satisfied by the controller policy π. If a policy does not satisfy the PCTL
property φ, model repair [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] can be applied to modify the policy such that φ
becomes true. First, the DTMC model resulting from the MDP under
the given policy π is parameterised, using linear combinations of real-valued
parameters in the transition probabilities, where the parameter domains define
the allowed areas for the repair. Additionally, a cost function over the parameters
can be given. Model repair can then be applied to find (if it exists) a parameter
valuation within the parameter domains which, on the one hand, induces the
satisfaction of the property φ and, on the other hand, minimises the value of the
cost function, i.e., it changes the transition probabilities and thereby repairs
the DTMC at minimal cost. Unfortunately, this approach needs non-linear
optimisation and therefore does not scale to larger models. An approach that
we recently proposed with other researchers uses a greedy repair algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Instead of global optimisation, it uses local repair steps iteratively. Though it
needs to invoke probabilistic model checking repeatedly, this approach scales well
even to large models. However, it can incorporate rewards and values V_π(s) only
heuristically, in a quite restricted manner.
      </p>
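      <p>The shape of such an iterative procedure can be sketched as follows (ours, only loosely inspired by the greedy repair idea of [6]); check_property and perturb_candidates are hypothetical stand-ins for an external probabilistic model checker and for the admissible local parameter changes.</p>
      <preformat>
# Minimal sketch of a greedy, local repair loop: model-check, apply the cheapest
# admissible local change, and iterate. check_property and perturb_candidates
# are hypothetical callbacks, not a real tool's API.
def greedy_repair(dtmc, check_property, perturb_candidates, max_iters=100):
    for _ in range(max_iters):
        ok, diagnostics = check_property(dtmc)      # probabilistic model checking
        if ok:
            return dtmc                             # property satisfied: done
        steps = perturb_candidates(dtmc, diagnostics)
        if not steps:
            return None                             # no admissible repair step left
        cost, repaired = min(steps, key=lambda t: t[0])
        dtmc = repaired                             # greedily apply the cheapest step
    return None
      </preformat>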
    </sec>
    <sec id="sec-3">
      <title>Towards Robust AI in Robotics: a Challenge</title>
      <p>Potentially, an AI agent embodied in a robot may face a wide variety of scenarios,
each characterized by different safety constraints and learning objectives.</p>
      <p>[Figure 1. Two flows of activities. Left (production stage): configure robot;
robot training starts; observe trainer (acquire D1); robot training ends; simulate
learning (compute π); check M_{D1},π ⊨ φ; if the check, the learning or the data are
not OK, the configuration, the acquisition or the reward profile are improved and
the flow repeats; otherwise the robot behaves safely and is shipped. Right
(deployment stage): initialize robot; robot calibration starts; observe user
(acquire D2); robot calibration ends; simulate learning (compute π); check
M_{D2},π ⊨ φ.]</p>
      <p>To
obtain a quantitative assessment of our capability to attack the mdpp problem,
it is useful to focus on a specific scenario which contains all the basic ingredients
found in more complex ones, yet is significant and amenable to a relatively
simple implementation. In particular, the case of a single robot interacting with
a single human across a common workspace is considered. It is assumed that the
robot observes the human while she is accomplishing a given task which, at some
point, requires the robot to chip in and, e.g., finalize the task alone or help the
human to do so. The task must be learned by the robot, but RL is run offline in a
simulator to avoid the risk of injuries to the human during the trial-and-error process
which characterizes RL. As shown in Figure 1, two different flows of activities are
considered. The first one, Figure 1 (left), is thought to happen at the end of
the production stage (factory), where the robot is configured, trained and checked
by experts to accomplish a given task. The second one, Figure 1 (right), is
thought to happen during the deployment stage (e.g., a household), where the user
is allowed to (i) calibrate the robot, i.e., adapt its behavior to the contingencies
of the environment to be found at the user's place, and (ii) modify the robot's
behavior, i.e., customize the robot according to specific preferences.</p>
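      <p>In such a scenario, a side condition φ could take the form of a probabilistic until property; for instance (our illustration, not a formula from the original text), P≥0.99[¬ collision U task_completed] would require that, under the learned policy, the task is completed without ever colliding with the human, with probability at least 0.99.</p>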
      <p>
        We believe that the current state of the art is unable to solve the mdpp
problem in a totally satisfactory way in cases like the one exemplified in
Figure 1. However, there is strong potential in combining efficient but potentially
imprecise engineering approaches with precise but potentially inefficient formal
methods. For example, RL-based methods are well established for MDP controller
synthesis, where the optimality criteria are encoded by rewards and the value
function V_π(s). To assure that the controller learned by RL is safe, we could use
probabilistic model checking during RL learning. If the current (not
yet necessarily optimal) controller turns out to be unsafe, we could repair the
controller. Additionally, it might also be necessary to modify the rewards and/or
the value function to direct RL towards safe solutions. This could be done, for
example, based on probabilistic counterexamples [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In contrast with
reward-shaping approaches that guarantee invariance of the learned optimal policy [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
such a reward or value-function repair aims to obtain a sub-optimal but safe policy.
      </p>
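      <p>A minimal sketch of the envisioned combination is given below (ours; learn_episode, induce_dtmc, model_check, repair_policy and reshape_rewards are hypothetical stand-ins for the RL, model-checking, repair and reward-modification components discussed above): the policy is learned by RL, the induced DTMC is periodically model checked against φ, and repair or reward modification is triggered whenever the check fails.</p>
      <preformat>
# Minimal sketch of the envisioned loop: RL learning with a model-checking
# safety gate. All callbacks below are hypothetical stand-ins, not a real API.
def safe_rl(mdp, phi, learn_episode, induce_dtmc, model_check,
            repair_policy, reshape_rewards, episodes=1000, check_every=50):
    policy, rewards = None, mdp.initial_rewards
    for episode in range(episodes):
        policy = learn_episode(mdp, rewards, policy)           # one RL improvement step
        if episode % check_every == 0:
            dtmc = induce_dtmc(mdp, policy)                    # combine MDP and policy
            if not model_check(dtmc, phi):                     # probabilistic model checking
                policy = repair_policy(dtmc, phi, policy)      # repair the unsafe controller
                rewards = reshape_rewards(rewards, dtmc, phi)  # steer RL towards safe solutions
    return policy
      </preformat>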
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Puterman</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          :
          <article-title>Markov Decision Processes: Discrete Stochastic Dynamic Programming</article-title>
          . John Wiley and Sons (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Boutilier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Decision-theoretic planning: Structural assumptions and computational leverage</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ) (
          <year>1999</year>
          )
          <fpage>94</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wiering</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Otterlo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Reinforcement learning</article-title>
          .
          <source>In: Adaptation, Learning, and Optimization</source>
          . Volume
          <volume>12</volume>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bertsekas</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          :
          <article-title>Dynamic programming and optimal control</article-title>
          .
          <source>Athena Scientific</source>
          , Belmont, MA (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Baier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katoen</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          : Principles of Model Checking. The MIT Press (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abraham</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tacchella</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katoen</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A greedy approach for the efficient repair of stochastic models</article-title>
          .
          <source>In: Proc. of NFM'15. Volume 9058 of LNCS</source>
          , Springer (
          <year>2015</year>
          )
          <fpage>295</fpage>
          -
          <lpage>309</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Natarajan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tadepalli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Dynamic preferences in multi-criteria reinforcement learning</article-title>
          .
          <source>In: Proc. of ICML'05</source>
          ,
          ACM
          (
          <year>2005</year>
          )
          <fpage>601</fpage>
          -
          <lpage>608</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Geibel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wysotzki</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Risk-Sensitive Reinforcement Learning Applied to Control under Constraints</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>24</volume>
          (
          <year>2005</year>
          )
          <fpage>81</fpage>
          -
          <lpage>108</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gillula</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomlin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor</article-title>
          .
          <source>In: Proc. of ICRA'12</source>
          ,
          IEEE
          (
          <year>2012</year>
          )
          <fpage>2723</fpage>
          -
          <lpage>2730</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Reinforcement Learning: An Introduction</article-title>
          . MIT Press (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Drager,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Forejt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Kwiatkowska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Parker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Ujma</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Permissive controller synthesis for probabilistic systems</article-title>
          .
          <source>In: Proc. of TACAS'14</source>
          . Springer (
          <year>2014</year>
          )
          <fpage>531</fpage>
          -
          <lpage>546</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Altman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Constrained Markov decision processes. Volume 7</article-title>
          . CRC Press (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kwiatkowska</surname>
            ,
            <given-names>M.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parker</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>PRISM 4.0: Verification of probabilistic real-time systems</article-title>
          .
          <source>In: Proc. of CAV. Volume 6806 of LNCS</source>
          , Springer (
          <year>2011</year>
          )
          <fpage>585</fpage>
          -
          <lpage>591</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Katoen</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zapreev</surname>
            ,
            <given-names>I.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hahn</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hermanns</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>D.N.:</given-names>
          </string-name>
          <article-title>The ins and outs of the probabilistic model checker MRMC</article-title>
          .
          <source>Performance Evaluation</source>
          <volume>68</volume>
          (
          <issue>2</issue>
          ) (
          <year>2011</year>
          )
          <fpage>90</fpage>
          -
          <lpage>104</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Bartocci</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grosu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katsaros</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smolka</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>Model repair for probabilistic systems</article-title>
          .
          <source>In: Proc. of TACAS. Volume 6605 of LNCS</source>
          , Springer (
          <year>2011</year>
          )
          <fpage>326</fpage>
          -
          <lpage>340</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Abraham</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Becker</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dehnert</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katoen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wimmer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Counterexample generation for discrete-time Markov models: An introductory survey</article-title>
          .
          <source>In: Proc. of SFM. Volume 8483 of LNCS</source>
          , Springer (
          <year>2014</year>
          )
          <fpage>65</fpage>
          -
          <lpage>121</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harada</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Policy invariance under reward transformations: Theory and application to reward shaping</article-title>
          .
          <source>In: ICML</source>
          . Volume
          <volume>99</volume>
          (
          <year>1999</year>
          )
          <fpage>278</fpage>
          -
          <lpage>287</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>