
Policy Interpretation for Partially Observable Monte-Carlo Planning: A Rule-Based Approach

Giulio Mazzi

giulio.mazzi@univr.it

Alberto Castellini

alberto.castellini@univr.it

Alessandro Farinelli

alessandro.farinelli@univr.it

Università degli Studi di Verona, Department of Computer Science, Strada Le Grazie 15, 37134, Verona, Italy

Partially Observable Monte-Carlo Planning (POMCP) is a powerful online algorithm that can generate policies for large Partially Observable Markov Decision Processes. The lack of an explicit representation of the policy, however, hinders interpretability. In this work, we present a MAX-SMT-based methodology to iteratively explore local properties of the policy. Our approach generates a compact and informative representation that describes the system under investigation.


1. Introduction

than 5%. To answer this kind of question, our approach allows the expert to express partially defined assumptions as logical formulas, called rule templates. The quantitative details are computed by encoding the template into a MAX-SMT [11, 12] problem and by analyzing the execution trace, a set of POMCP executions stored as (belief, action) pairs, one for each decision of the policy. The result is a compact and informative representation of the system, called a rule. Another key feature of the methodology is the identification of states in which the planner does not respect the assumptions of the expert (“Is there a state in which the robot moves at high speed even if it is likely that the environment is cluttered?”). To achieve this, our methodology quantifies the divergence between the decision boundaries of the rule and the decisions that do not satisfy the rule, and it identifies the decisions that violate the expert's assumptions. In this work, we describe the methodology and show how to use it to interpret a policy generated by POMCP. As a case study, we consider a problem in which a robot moves as fast as possible in a (possibly) cluttered environment while avoiding collisions.

2. Method

Figure 1 provides a summary of our methodology. As a first step, a logical formula with free variables is defined (see box 2 in Figure 1) to describe a property of interest of the policy under investigation. This formula, called a rule template, defines a relationship between some properties of the belief (e.g., the probability of being in a specific state) and an action. Free variables in the formula allow the expert to avoid quantifying the limits of this relationship. These limits are then determined by analyzing a trace (see box 1). For instance, a template saying “Do this when the probability of avoiding collisions is at least x”, with x a free variable, is transformed into “Do this when the probability of avoiding collisions is at least 0.85”. By defining a rule template, the expert provides useful prior knowledge about the structure of the investigated property. We encode the template as a MAX-SMT problem (see box 3), which computes optimal values for the free variables so that the formula explains as many decisions as possible (without the requirement of describing every single decision). The result of the computation is a rule (see box 4) that provides a human-readable local representation of the policy function and incorporates the prior knowledge specified by the expert. The approach then analyzes the unsatisfiable steps to identify unexpected decisions (see box 6), i.e., actions that violate the logical rule (and therefore do not match the expert’s assumption). The approach quantifies the violation, i.e., the distance between the rule boundary and the unsatisfiable step, to support the analysis.
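The threshold synthesis described above can be illustrated with a minimal Python sketch. It does not use a MAX-SMT solver: it swaps in an exhaustive search over candidate thresholds taken from the trace, which pursues the same objective (explain as many steps as possible) on a toy scale. The trace values, the function names, and the assumption that a step is explained when the template holds exactly on the steps where the target action was chosen are all hypothetical and introduced only for illustration.

```python
from itertools import product

# Hypothetical trace: (belief over difficulties (p0, p1, p2), chosen speed).
TRACE = [
    ((0.95, 0.04, 0.01), 2),
    ((0.90, 0.09, 0.01), 2),
    ((0.60, 0.398, 0.002), 2),
    ((0.70, 0.20, 0.10), 1),
    ((0.34, 0.33, 0.33), 0),
    ((0.97, 0.02, 0.01), 0),  # slow despite a confident "clear" belief
]


def template_holds(belief, x1, x2):
    """Template body: p0 >= x1 or p2 <= x2 (x1, x2 are the free variables)."""
    p0, _, p2 = belief
    return p0 >= x1 or p2 <= x2


def is_explained(step, x1, x2):
    """Assumed semantics: the template must hold exactly on speed-2 steps."""
    belief, action = step
    return template_holds(belief, x1, x2) == (action == 2)


def synthesize_rule(trace, x1_min=0.8):
    """Pick thresholds explaining the most steps, with hard bound x1 >= x1_min."""
    # Only values observed in the trace (plus the bounds) can change the outcome.
    x1_candidates = sorted({x1_min, 1.0} | {b[0] for b, _ in trace if b[0] >= x1_min})
    x2_candidates = sorted({0.0} | {b[2] for b, _ in trace})
    best = max(
        product(x1_candidates, x2_candidates),
        key=lambda xs: sum(is_explained(step, *xs) for step in trace),
    )
    unexplained = [step for step in trace if not is_explained(step, *best)]
    return best, unexplained


(x1, x2), unexplained = synthesize_rule(TRACE)
print(f"select speed 2 when p0 >= {x1} or p2 <= {x2}")
print("unexplained steps:", unexplained)
```

In this toy trace no choice of thresholds explains every step, so the search returns a rule together with the step it cannot explain, which plays the role of an unexpected decision.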

3. Results

We present a problem of velocity regulation in robotic platforms as a case study. A robot travels on a pre-specified path divided into eight segments, which are in turn divided into subsegments of different sizes, as shown in Figure 2. Each segment has a (hidden) difficulty value among clear ($f = 0$, where $f$ is used to identify the difficulty), lightly obstructed ($f = 1$), or heavily obstructed ($f = 2$). All subsegments of a segment share the same difficulty, hence the hidden state space has $3^8$ states. The goal of the robot is to travel on this path as fast as possible while avoiding collisions. In each subsegment, the robot must decide a speed level (i.e., an action). We consider three different speed levels, namely 0 (slow), 1 (medium speed), and 2 (fast). The reward received for traversing a subsegment is equal to the length of the subsegment multiplied by $1 + a$, where $a$ is the speed of the agent, namely the action that it selects. The higher the speed, the higher the reward, but a higher speed carries a greater risk of collision (see the collision probability table $p(c = 1 \mid f, a)$ in Figure 2.c). The real difficulty of each segment is unknown to the robot (i.e., it is the hidden part of the state), but in each subsegment the robot receives an observation, which is 0 (no obstacles) or 1 (obstacles) with a probability depending on the segment difficulty (see Figure 2.b). The state of the problem contains eight hidden variables (i.e., the difficulty of each segment) and two observable variables (the current segment and subsegment).
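As a small illustration of the problem size and of the traversal reward stated above, the following Python sketch enumerates the hidden states and computes the reward for one subsegment. The names are hypothetical, and collisions are not modeled here because the collision probabilities are only given in Figure 2.

```python
from itertools import product

DIFFICULTIES = (0, 1, 2)  # clear, lightly obstructed, heavily obstructed
SPEEDS = (0, 1, 2)        # slow, medium, fast
N_SEGMENTS = 8

# One hidden difficulty value per segment.
hidden_states = list(product(DIFFICULTIES, repeat=N_SEGMENTS))


def traversal_reward(length, speed):
    """Reward for traversing a subsegment: length multiplied by (1 + speed)."""
    return length * (1 + speed)


print(len(hidden_states))                      # number of hidden states
print([traversal_reward(2.0, a) for a in SPEEDS])
```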

To obtain rules that are compact and informative, we want them to be a local approximation of the behavior of the robot. We introduce the diff function, which takes a distribution over the possible difficulties distr, a segment seg, and a required difficulty value d as input, and returns the probability that segment seg has difficulty d in the distribution distr.
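A minimal sketch of such a function is given below. It assumes, purely for illustration, that the distribution is stored as a dictionary mapping hidden states (tuples with one difficulty per segment) to probabilities and that segments are numbered from 1; the source does not specify the representation.

```python
def diff(distr, seg, d):
    """Probability that segment `seg` (1-based) has difficulty `d` under `distr`.

    `distr` maps hidden states (tuples of per-segment difficulties) to
    probabilities; this representation is an assumption of the sketch.
    """
    return sum(prob for state, prob in distr.items() if state[seg - 1] == d)


# Toy belief over a two-segment path, kept short for readability.
belief = {(0, 0): 0.5, (0, 1): 0.25, (2, 1): 0.25}

p_clear_first = diff(belief, 1, 0)   # mass of states whose first segment is clear
p_light_second = diff(belief, 2, 1)  # mass of states whose second segment is lightly obstructed
print(p_clear_first, p_light_second)
```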

Iteration 1. We start with a rule describing when the robot travels at maximum speed (i.e., $a = 2$). We expect the robot to move at that speed only if it is confident enough that it is in an easy-to-navigate segment. We express this with the template:

$r_2$: select $a_2$ when $p_0 \geq x_1 \lor p_2 \leq x_2$; where $x_1 \geq 0.8 \land p_0 = \mathrm{diff}(distr, seg, 0) \land p_2 = \mathrm{diff}(distr, seg, 2)$

This template is satisfied if the probability of being in a clear segment ($p_0$) is above a certain threshold or the probability of being in a heavily obstructed segment ($p_2$) is below another threshold. We expect $x_1$ to be above 0.8, thus we add this information in the where statement. Our methodology provides the rule:

$r_2$: select $a_2$ when $p_0 \geq 0.858 \lor p_2 \leq 0.004$;

which fails to satisfy 6 of the 370 steps.

Iteration 2. By analyzing the unsatisfiable steps, we notice that three of them are in subsegment 8.12 (the robot moves at low speed with beliefs [$p_0 = 0.895$, $p_1 = 0.102$, $p_2 = 0.003$], [$p_0 = 0.955$, $p_1 = 0.045$, $p_2 = 0.0$], and [$p_0 = 0.879$, $p_1 = 0.120$, $p_2 = 0.002$], respectively). Figure 2 shows that this is the shortest subsegment on the map. Our template is approximate and does not consider the length of the subsegment. This local rule cannot describe the behavior of the policy in subsegment 8.12: it is too short, and POMCP decides that it is better to move slowly even if it is nearly certain that the subsegment is safe. Hence, we want to exclude the subsegment from the rule. Finally, by analyzing the other three steps that do not satisfy the rule, we notice that they are close to the rule boundary but cannot be described with this simple template (in these steps, the robot moves at speed 2 with beliefs [$p_0 = 0.789$, $p_1 = 0.181$, $p_2 = 0.031$], [$p_0 = 0.819$, $p_1 = 0.164$, $p_2 = 0.017$], and [$p_0 = 0.828$, $p_1 = 0.162$, $p_2 = 0.010$]). To improve the template, we add a more complex literal ($p_0 \geq x_3 \land p_1 \geq x_4$) that uses both difficulty 0 (clear) and difficulty 1 (lightly obstructed) to describe the behavior of the policy. We obtain the template:

$r_2$: select $a_2$ when $(subseg \neq 8.12) \land (p_0 \geq x_1 \lor p_2 \leq x_2 \lor (p_0 \geq x_3 \land p_1 \geq x_4))$; where $x_1 \geq 0.8 \land p_i = \mathrm{diff}(distr, seg, i),\ i \in \{0, 1, 2\}$

and the rule:

$r_2$: select $a_2$ when $(subseg \neq 8.12) \land (p_0 \geq 0.841 \lor p_2 \leq 0.004 \lor (p_0 \geq 0.789 \land p_1 \geq 0.156))$;

which fails to satisfy only 2 steps (speed 1 with belief [$p_0 = 0.801$, $p_1 = 0.190$, $p_2 = 0.009$], and speed 0 with belief [$p_0 = 0.826$, $p_1 = 0.162$, $p_2 = 0.013$]). These steps were satisfied by the first iteration of the template, but now we have a stronger rule that describes more steps. We could further refine the template, but this result is a good compromise between simplicity and correctness.
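The effect of the refinement can be checked directly on the beliefs reported in Iterations 1 and 2. The Python sketch below hard-codes the thresholds of the two rules and evaluates them on those beliefs. The function names and the subsegment label "other" (used for steps whose subsegment is not reported, assumed here not to be 8.12) are hypothetical.

```python
def rule_iter1(p0, p1, p2):
    """Iteration 1 rule: speed 2 when p0 >= 0.858 or p2 <= 0.004."""
    return p0 >= 0.858 or p2 <= 0.004


def rule_iter2(p0, p1, p2, subsegment):
    """Iteration 2 rule: excludes subsegment 8.12 and adds a literal on (p0, p1)."""
    if subsegment == "8.12":
        return False
    return p0 >= 0.841 or p2 <= 0.004 or (p0 >= 0.789 and p1 >= 0.156)


# Speed-2 steps that the first rule could not explain.
fast_steps = [(0.789, 0.181, 0.031), (0.819, 0.164, 0.017), (0.828, 0.162, 0.010)]
# A slow step observed in the short subsegment 8.12.
short_step = (0.895, 0.102, 0.003)
# Steps taken at speed 1 and speed 0 that the second rule still fails to explain.
remaining = [(0.801, 0.190, 0.009), (0.826, 0.162, 0.013)]

for belief in fast_steps:
    print(belief, rule_iter1(*belief), rule_iter2(*belief, "other"))
print(short_step, rule_iter1(*short_step), rule_iter2(*short_step, "8.12"))
for belief in remaining:
    print(belief, rule_iter1(*belief), rule_iter2(*belief, "other"))
```

A step is consistent with a rule when the rule holds and the robot chose speed 2, or when the rule does not hold and the robot chose another speed.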

Iteration 3. We write a template to describe when the robot moves at slow speed. We identify three important situations that can lead the robot to move at slow speed: i) the robot is uncertain about the current difficulty (the belief is close to a uniform distribution), ii) the robot knows that the current segment is hard, and iii) the robot is in the short subsegment 8.12. We use the literals $p_1 \geq y_1$ and $p_2 \geq y_2$ to describe the first two situations. The template is the following:

$r_0$: select $a_0$ when $(subseg = 8.12) \lor p_1 \geq y_1 \lor p_2 \geq y_2$;
$r_2$: select $a_2$ when $(subseg \neq 8.12) \land (p_0 \geq x_1 \lor p_2 \leq x_2 \lor (p_0 \geq x_3 \land p_1 \geq x_4))$;
where $x_1 \geq 0.8 \land p_i = \mathrm{diff}(distr, seg, i),\ i \in \{0, 1, 2\}$

which yields the rule:

$r_0$: select $a_0$ when $(subseg = 8.12) \lor p_1 \geq 0.244 \lor p_2 \geq 0.024$;
$r_2$: select $a_2$ when $(subseg \neq 8.12) \land (p_0 \geq 0.841 \lor p_2 \leq 0.004 \lor (p_0 \geq 0.789 \land p_1 \geq 0.156))$;

which fails to satisfy 38 of the 370 steps. Notice that the low values of $y_1$ and $y_2$ (i.e., 0.244 and 0.024) cover all the beliefs close to the uniform distribution. By analyzing the 38 unsatisfiable steps, we notice that 35 of them are situations in which the robot decides to move at speed 1 even if the conditions for moving at speed 0 are satisfied (e.g., three of these steps have beliefs [$p_0 = 0.319$, $p_1 = 0.342$, $p_2 = 0.338$], [$p_0 = 0.345$, $p_1 = 0.337$, $p_2 = 0.318$], and [$p_0 = 0.335$, $p_1 = 0.333$, $p_2 = 0.332$], respectively). This analysis tells us that POMCP considers it a worthwhile risk to move at medium speed (i.e., speed 1) even if it does not have a strong understanding of the current difficulty. If we consider this an unacceptable risk, we should modify the design of POMCP, e.g., by increasing the number of particles used in the simulation.
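The unexpected speed-1 decisions discussed above can be reproduced with a short Python sketch that applies the slow-speed condition, with the reported thresholds 0.244 and 0.024, to the three near-uniform beliefs. The margin computed here is only an illustrative measure of how far inside the rule boundary a belief lies; it is an assumption of this sketch and not necessarily the distance used by the methodology. The subsegment label "other" is likewise hypothetical.

```python
Y1, Y2 = 0.244, 0.024  # thresholds reported for y1 and y2


def slow_rule_holds(p1, p2, subsegment):
    """Slow-speed condition: subsegment 8.12, or p1 >= Y1, or p2 >= Y2."""
    return subsegment == "8.12" or p1 >= Y1 or p2 >= Y2


def margin(p1, p2):
    """Illustrative margin: largest amount by which a literal exceeds its threshold."""
    return max(p1 - Y1, p2 - Y2)


# (p0, p1, p2) beliefs of three steps in which the robot chose speed 1.
near_uniform = [(0.319, 0.342, 0.338), (0.345, 0.337, 0.318), (0.335, 0.333, 0.332)]
CHOSEN_SPEED = 1

unexpected = [
    (belief, round(margin(belief[1], belief[2]), 3))
    for belief in near_uniform
    # The rule prescribes speed 0, but the robot chose a different speed.
    if slow_rule_holds(belief[1], belief[2], "other") and CHOSEN_SPEED != 0
]

for belief, m in unexpected:
    print(f"belief {belief}: rule prescribes speed 0, robot chose {CHOSEN_SPEED}, margin {m}")
```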

4. Conclusions and future work

In this work, we present a methodology that combines high-level indications provided by a human expert with an automatic procedure that analyzes an execution trace to synthesize key properties of a policy in the form of rules. This work paves the way towards several research directions. We aim to improve the expressiveness of the logical formulas used to formalize the indications of the expert and to integrate POMCP and the methodology online.

[1] Cassandra, M. L. Littman, N. L. Zhang, Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes, in: UAI '97, 1997, pp. 54-61.

[2] C. H. Papadimitriou, J. N. Tsitsiklis, The Complexity of Markov Decision Processes, Math. Oper. Res. 12 (1987) 441-450.

[3] Silver, Veness, Monte-Carlo Planning in large POMDPs, in: NIPS 2010, Curran Associates, Inc., 2010, pp. 2164-2172.

[4] Castellini, Chalkiadakis, Farinelli, Influence of State-Variable Constraints on Partially Observable Monte Carlo Planning, in: IJCAI-19, 2019, pp. 5540-5546.

[5] Castellini, Marchesini, Farinelli, Online monte carlo planning for autonomous robots: Exploiting prior knowledge on task similarities, in: Proceedings of AIRO 2019, volume 2594 of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 25-32.

[6] Castellini, E. Marchesini, Mazzi, Farinelli, Explaining the influence of prior knowledge on POMCP policies, in: EUMAS 2020, Springer, Cham, 2020, pp. 261-276.

[7] Gunning, DARPA's Explainable Artificial Intelligence (XAI) Program, 2019, pp. ii-ii.

[8] Fox, Long, Magazzeni, Explainable Planning, CoRR abs/1709.10256 (2017).

[9] Cashmore, A. Collins, Krarup, Krivic, Magazzeni, Smith, Towards Explainable AI Planning as a Service, 2019. 2nd ICAPS Workshop on Explainable Planning, XAIP 2019.

[10] Anjomshoae, Najjar, Calvaresi, Främling, Explainable Agents and Robots: Results from a Systematic Literature Review, in: AAMAS, IFAAMAS, 2019, pp. 1078-1088.

[11] L. De Moura, N. Bjørner, Z3: An Efficient SMT Solver, in: Proceedings of the 14th ETAPS, TACAS'08/ETAPS'08, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 337-340.

[12] Bjørner, A.-D. Phan, L. Fleckenstein, vZ - An Optimizing SMT Solver, in: Proceedings of the 21st TACAS - Volume 9035, Springer-Verlag, Berlin, Heidelberg, 2015, pp. 194-199.