<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Policy Interpretation for Partially Observable Monte-Carlo Planning: A Rule-Based Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giulio Mazzi</string-name>
          <email>giulio.mazzi@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Castellini</string-name>
          <email>alberto.castellini@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Farinelli</string-name>
          <email>alessandro.farinelli@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Verona, Department of Computer Science</institution>
          ,
          <addr-line>Strada Le Grazie 15, 37134, Verona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Partially Observable Monte-Carlo Planning (POMCP) is a powerful online algorithm that can generate online policies for large Partially Observable Markov Decision Processes. The lack of an explicit representation of the policy, however, hinders interpretability. In this work, we present a MAX-SMT based methodology to iteratively explore local properties of the policy. Our approach generates a compact and informative representation that describes the system under investigation.</p>
      </abstract>
      <kwd-group>
        <kwd>Approach</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        than 5%. To answer this kind of questions our approach allows expressing partially defined
assumptions employing logical formulas, called rule templates. The quantitative details are
computed by encoding the template into a MAX-SMT [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] problem, and by analyzing the
execution trace, a set of POMCP executions stored as (belief, action) pairs for each decision of the
policy. The result is a compact and informative representation of the system called rule. Another
key feature of the methodology is to identify states in which the planner does not respect the
assumptions of the expert (“Is there a state in which the robot moves at high speed even if it
is likely that the environment is cluttered?”). To achieve this, our methodology quantifies the
divergence between rule decision boundaries and decisions that do not satisfy the rules and
identifies decisions that violate expert assumptions. In this work, we describe the methodology,
and we show how to use the approach to interpret a policy generated by POMCP. As a case
study, we consider a problem in which a robot moves as fast as possible in a (possibly) cluttered
environment while avoiding collisions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>Figure 1 provides a summary of our methodology. As a first step, a logical formula with free
variables is defined (see box 2 in Figure 1) to describe a property of interest of the policy under
investigation. This formula, called rule template, defines a relationship between some properties
of the belief (e.g., the probability to be in a specific state) and an action. Free variables in the
formula allow the expert to avoid quantifying the limits of this relationship. These limits are
then determined by analyzing a trace (see box 1). For instance, a template saying “Do this when
the probability of avoiding collisions is at least x”, with x free variable, is transformed into “Do
this when the probability of avoiding collisions is at least 0.85”. By defining a rule template
the expert provides useful prior knowledge about the structure of the investigated property.
We encode the template as a MAX-SMT problem (see box 3) which computes optimal values
for the free variables to make the formula explain as many decisions as possible (without the
requirement of describing every single decision). The result of the computation is a rule (see box
4) that provides a human-readable local representation of the policy function that incorporates
the prior knowledge specified by the expert. The approach then analyzes the unsatisfiable steps
to identify unexpected decisions (see box 6), related to actions that violate the logical rule (i.e.,
that do not verify the expert’s assumption). The approach quantifies the violation, i.e., the
distance between the rule boundary and the unsatisfiable step, to support the analysis.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>We present a problem of velocity regulation in robotic platforms as a case study. A robot travels
on a pre-specified path divided into eight segments which are in turn divided into subsegments
of diferent sizes, as shown in Figure 2. Each segment has a (hidden) dificulty value among clear
( = 0 , where  is used to identify the dificulty), lightly obstructed ( = 1 ) or heavily obstructed
( = 2 ). All subsegments share the same dificulty, hence the hidden state-space has 38 states.
The goal of the robot is to travel on this path as fast as possible while avoiding collisions. In
each subsegment, the robot must decide a speed level  (i.e., action). We consider three diferent
speed levels, namely 0 (slow), 1 (medium speed), and 2 (fast). The reward received for traversing
a subsegment is equal to the length of the subsegment multiplied by 1 +  , where  is the speed
of the agent, namely the action that it selects. The higher the speed, the higher the reward, but
a higher speed sufers a greater risk of collision (see the collision probability table ( = 1 |  , )
in Figure 2.c). The real dificulty of each segment is unknown to the robot (i.e., hidden part of
the state), but in each subsegment, the robot receives an observation, which is 0 (no obstacles)
or 1 (obstacles) with a probability depending on segment dificulty (see Figure 2.b). The state
of the problem contains eight hidden variables (i.e., the dificulty of each segment), and two
observable variables (current segment and subsegment).</p>
      <p>To obtain rules that are compact and informative, we want them to be a local approximation
of the behavior of the robot. We introduce the dif function which takes a distribution on the
possible dificulties distr, a segment seg, and a required dificulty value d as input, and returns
the probability that segment s has dificulty d in the distribution distr.</p>
      <p>Iteration 1. We start with a rule describing when the robot travels at maximum speed (i.e.,
 = 2 ). We expect that the robot should move at that speed only if it is confident enough to be
in an easy-to-navigate segment. We express this with the template:
 2 ∶ s e l e c t  2 w h e n  0 ≥ x1 ∨  2 ≤ x2;
w h e r e x1 ≥ 0.8 ∧  0 = d i f f ( d i s t r , s e g , 0 ) ∧  2 = d i f f ( d i s t r , s e g , 2 )
this template can be satisfied if the probability of being in a clear segment (  0) is above a certain
threshold or the probability of being in a heavily obstructed segment ( 2) is below another
threshold. We expect x1 to be above 0.8, thus we add this information in the w h e r e statement.
Our methodology provides the rule:</p>
      <p>2 ∶ s e l e c t  2 w h e n :  0 ≥ 0.858 ∨  2 ≤ 0.004;
that fails to satisify 6 out of the 370 steps.</p>
      <p>Iteration 2. By analyzing the unsatisfiable steps, we notice that three of them are in
subsegment 8.12 (the robot moves at low speed with belief [ 0 = 0.895,  1 = 0.102,  2 = 0.003], [ 0 =
0.955,  1 = 0.045,  2 = 0.0], [ 0 = 0.879, and  1 = 0.120,  2 = 0.002] respectively). Figure 2
shows that this step is the shortest subsegment on the map. Our template is approximate and
does not consider the length of the subsegment. This local rule cannot describe the behavior of
the policy in segment 8.12, it is too short and POMCP decides that it is better to move slowly even
if it is nearly certain that the subsegment is safe. Hence, we want to exclude the subsegment
from the rule. Finally, by analyzing the other three steps which do not satisfy the rule, we
notice that they are close to the rule, but cannot be described with this simple template (in these
steps, the robot move a speed 2 with belief [ 0 = 0.789,  1 = 0.181,  2 = 0.031, [ 0 = 0.819,  1 =
0.164,  2 = 0.017], and [ 0 = 0.828,  1 = 0.162,  2 = 0.010]). To improve the template, we add
a more complex literal ( 0 ≥ x3 ∧  1 ≥ x4), that use both dificulty 0 (clear) and 1 (lightly
obstructed) to describe the behavior of the policy. We obtain the template:
 2 ∶ s e l e c t  2 w h e n ( ≠ 8.12) ∧ (</p>
      <p>0 ≥ x1 ∨  2 ≤ x2 ∨ ( 0 ≥ x3 ∧  1 ≥ x4));
w h e r e x1 ≥ 0.8 ∧   = d i f f ( d i s t r , s e g , i ) ,  ∈ {1, 2, 3}
and the rule:
 2 ∶ s e l e c t  2 w h e n ( ≠ 8.12) ∧ (</p>
      <p>0 ≥ 0.841 ∨  2 ≤ 0.004 ∨ ( 0 ≥ 0.789 ∧  1 ≥ 0.156));
that only fails to satisfy 2 steps (speed 1 with belief [ 0 = 0.801,  1 = 0.190,  2 = 0.009], and
speed 0 with belief [ 0 = 0.826,  1 = 0.162,  2 = 0.013]). These steps were satisfied by the first
iteration of the template, but now we have a stronger rule that describes more steps. We further
refine the template, but this result is a good compromise between simplicity and correctness.</p>
      <p>Iteration 3. We write a template to describe when the robot moves at slow speed. We
identify three important situations that can lead the robot to move at slow speed i) the robot is
uncertain about the current dificulty (the belief is close to a uniform distribution), ii) the robot
knows that the current segment is hard iii) the robot is in the short subsegment 8.12. We try to
use  1 ≥ y1 and  2 ≥ y2) to describe the first two situations. The template is the following:
 2 ∶ s e l e c t  2 w h e n ( ≠ 8.12) ∧ (
 0 ∶ s e l e c t  0 w h e n ( = 8.12) ∨ 
0 ≥ x1 ∨  2 ≤ x2 ∨ ( 0 ≥ x3 ∧  0 ≥ x4));
1 ≥ y1 ∨  2 ≥ y2;
w h e r e x1 ≥ 0.8 ∧   = d i f f ( d i s t r , s e g , i ) ,  ∈ {1, 2, 3}
that yields the rule:
 2 ∶ s e l e c t  2 w h e n ( ≠ 8.12) ∧ (
 0 ∶ s e l e c t  0 w h e n ( = 8.12) ∨ 
0 ≥ 0.841 ∨  2 ≤ 0.004 ∨ ( 0 ≥ 0.789 ∧  1 ≥ 0.156));
which fail to satisfy 38 out of 370 steps. Notice that the low value for y1, y2 (i.e., 0.244, 0.024)
describes all the belief close to the uniform distribution. By analyzing the 38 unsatisfiable
steps, we notice that 35 of them are situations in which the robot decides to move at speed 1
even if the condition for moving at speed 0 are satisfied (e.g, three of these steps have belief
[ 0 = 0.319,  1 = 0.342,  2 = 0.338], [ 0 = 0.345,  1 = 0.337,  2 = 0.318], and [ 0 = 0.335,  1 =
0.333,  2 = 0.332] respectively). This analysis tell us that the POMCP considers a worthy risk
to move at medium speed (i.e., speed 1) even if it does not have strong understanding of the
current dificulty. If we consider this to be a non acceptable risk, we should modify the design
of POMCP, e.g., by increasing the number of particles used in the simulation.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and future work</title>
      <p>In this work, we present a methodology that combines high-level indications provided by a
human expert with an automatic procedure that analyzes an execution trace to synthesize key
properties of a policy in the form of rules. This work paves the way towards several research
directions. We aim at improving the expressiveness of the logical formulas used to formalize
the indications of the expert and to integrate POMCP and the methodology online.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cassandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Littman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes</article-title>
          , in: UAI '
          <fpage>97</fpage>
          ,
          <year>1997</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Papadimitriou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Tsitsiklis</surname>
          </string-name>
          ,
          <source>The Complexity of Markov Decision Processes, Math. Oper. Res</source>
          .
          <volume>12</volume>
          (
          <year>1987</year>
          )
          <fpage>441</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <article-title>Monte-Carlo Planning in large POMDPs</article-title>
          ,
          <source>in: NIPS</source>
          <year>2010</year>
          , Curran Associates, Inc.,
          <year>2010</year>
          , pp.
          <fpage>2164</fpage>
          -
          <lpage>2172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chalkiadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Influence of State-Variable Constraints on Partially Observable Monte Carlo Planning</article-title>
          ,
          <source>in: IJCAI-19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5540</fpage>
          -
          <lpage>5546</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marchesini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Online monte carlo planning for autonomous robots: Exploiting prior knowledge on task similarities</article-title>
          ,
          <source>in: Proceedings of AIRO</source>
          <year>2019</year>
          , volume
          <volume>2594</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          , E. Marchesini,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Explaining the influence of prior knowledge on POMCP policies</article-title>
          ,
          <source>in: EUMAS 2020</source>
          , Springer, Cham,
          <year>2020</year>
          , pp.
          <fpage>261</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gunning</surname>
          </string-name>
          ,
          <source>DARPA's Explainable Artificial Intelligence (XAI) Program</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>ii</fpage>
          -ii.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Magazzeni</surname>
          </string-name>
          , Explainable Planning,
          <source>CoRR abs/1709</source>
          .10256 (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cashmore</surname>
          </string-name>
          , A. Collins,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krarup</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krivic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Magazzeni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <source>Towards Explainable AI Planning as a Service</source>
          ,
          <year>2019</year>
          . 2nd ICAPS Workshop on Explainable Planning,
          <string-name>
            <surname>XAIP</surname>
          </string-name>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Anjomshoae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Najjar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvaresi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Främling</surname>
          </string-name>
          , Explainable Agents and
          <article-title>Robots: Results from a Systematic Literature Review</article-title>
          , in: AAMAS,
          <string-name>
            <surname>IFAAMAS</surname>
          </string-name>
          ,
          <year>2019</year>
          , p.
          <fpage>1078</fpage>
          -
          <lpage>1088</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>L. De Moura</surname>
            ,
            <given-names>N. Bjørner,</given-names>
          </string-name>
          <article-title>Z3: An Eficient SMT Solver</article-title>
          ,
          <source>in: Proceedings of the 14th ETAPS, TACAS'08/ETAPS'08</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2008</year>
          , p.
          <fpage>337</fpage>
          -
          <lpage>340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bjørner</surname>
          </string-name>
          , A.
          <string-name>
            <surname>-D. Phan</surname>
          </string-name>
          , L. Fleckenstein, vZ - An
          <source>Optimizing SMT Solver, in: Proceedings of the 21st TACAS -</source>
          Volume
          <volume>9035</volume>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2015</year>
          , p.
          <fpage>194</fpage>
          -
          <lpage>199</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>