<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Artificial Intelligence and Formal Verification, Logic, Automata, and Synthesis,
September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Rule-based Shield Synthesis for Partially Observable Monte Carlo Planning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giulio Mazzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Castellini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Farinelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Verona</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>22</volume>
      <issue>2021</issue>
      <abstract>
        <p>Partially Observable Monte-Carlo Planning (POMCP) is a powerful online algorithm able to generate approximate policies for large Partially Observable Markov Decision Processes. The online nature of this method supports scalability by avoiding complete policy representation. The lack of an explicit representation, however, hinders policy interpretability and makes policy verification very complex. In this work, we propose two contributions. The first is a method for identifying unexpected actions selected by POMCP with respect to expert prior knowledge of the task. The second is a shielding approach that prevents POMCP from selecting unexpected actions. The first method is based on Maximum Satisfiability Modulo Theory (MAX-SMT). It inspects traces (i.e., sequences of belief-action-observation triplets) generated by POMCP to compute the parameters of logical formulas about policy properties defined by the expert. The second contribution is a module that uses the logical formulas online to identify anomalous actions selected by POMCP and substitutes those actions with actions that satisfy the logical formulas, fulfilling the expert knowledge. We evaluate our approach in two domains. Results show that the shielded POMCP outperforms the standard POMCP in a case study in which a wrong parameter of POMCP makes it select wrong actions from time to time.</p>
      </abstract>
      <kwd-group>
        <kwd>POMCP</kwd>
        <kwd>SMT</kwd>
        <kwd>Shielding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Planning in partially observable environments while satisfying safety guarantees is a challenging
problem. Partially Observable Markov Decision Processes (POMDPs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are a popular framework
to model systems with uncertainty. Computing an optimal solution for POMDPs is hard [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
However, it is possible to compute an approximate solution, and state-of-the-art algorithms
achieve great performance in real-world instances of POMDPs. A pioneering algorithm for this
purpose is Partially Observable Monte-Carlo Planning (POMCP) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] which uses a particle filter
to represent the belief and a Monte-Carlo Tree Search based strategy to compute the policy
online. The online nature of the policy, however, makes the task of analyzing the decisions
taken by POMCP very difficult [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. In general, with a high number of particles POMCP
yields great performance, but sometimes the simulation does not properly assess the risk of
certain actions, especially if the number of particles used in the simulation is limited due to
engineering constraints. Moreover, in POMCP the policy is never fully computed or stored,
hence it is very difficult to identify the reasons for possible unexpected decisions of the system.
However, Explainability [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ] is becoming a key feature of artificial intelligence systems, since
humans need to understand why specific decisions are taken by the agent.
      </p>
      <p>In this work, we present a methodology for generating a safety mechanism from high-level
descriptions of the desired behavior of a POMCP-generated policy. In this approach, a human
expert provides qualitative information on a property of the system, enriched with an indication
of the expected behavior that the system should have in specific situations (e.g., “the robot
should move fast if it is highly confident that the path is not cluttered”). With this information,
our methodology analyzes a set of execution traces of the system and turns these statements
into quantitative ones (e.g., “the robot moves fast
if its confidence of being in an uncluttered segment is at least 93.4%”). The proposed approach
formalizes the parameter computation problem as a MAX-SMT problem, which makes it possible to
express complex logical formulas and to compute optimal assignments when the template is not
fully satisfiable (which happens in the majority of cases in real policy analysis). This quantitative
answer is then used to synthesize a shield, namely a safety mechanism that forces the POMCP
to satisfy the constraints expressed by the expert. The shield works alongside the Monte Carlo
Tree Search (MCTS) by preemptively blocking actions that violate the rules.</p>
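      <p>To make the mechanism concrete, the shielding step can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the rule thresholds, action names, and belief encoding are hypothetical placeholders.</p>
      <preformat>
```python
# Sketch of a shield that checks POMCP's chosen action against a set of
# fitted rules and substitutes a rule-satisfying action when the choice
# violates them. Thresholds and action names are illustrative only.

def shield(belief, pomcp_action, rules):
    """Return pomcp_action if some rule allows it, else a rule-satisfying fallback."""
    allowed = {a for a, predicate in rules.items() if predicate(belief)}
    if pomcp_action in allowed:
        return pomcp_action
    # Unexpected action: fall back to an action the rules permit.
    return next(iter(allowed)) if allowed else pomcp_action

# Tiger-style rules with thresholds already instantiated from traces:
rules = {
    "listen":     lambda b: 0.934 >= b["left"] and 0.934 >= b["right"],
    "open_left":  lambda b: b["left"] >= 0.934,
    "open_right": lambda b: b["right"] >= 0.934,
}

belief = {"left": 0.5, "right": 0.5}
print(shield(belief, "open_left", rules))  # prints "listen": opening is blocked
```
      </preformat>
      <p>Because the rules are evaluated on the current belief before the action is executed, the check runs alongside the tree search rather than inside it.</p>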
      <p>In summary, we propose an SMT-based methodology that combines a logic-based description
of a system with the real execution traces of a POMCP policy to create a set of rules describing
the behaviors of an agent. This description can be used to synthesize a shield. We empirically
evaluate the shielding mechanism in two domains, namely, the well-known Tiger problem and
a robotic navigation problem, showing that it can exploit the knowledge provided by the expert
to achieve higher performance than standard POMCP when its parameters are imprecise.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology Overview</title>
      <p>The proposed methodology is summarized in Figure 1. It leverages the expressiveness of
logical formulas to represent specific properties of the system under investigation, and this
representation is used to automatically generate a shield, a safety mechanism that forces the
POMCP system to satisfy a set of high-level requirements. As a first step, a logical formula
with free variables is defined (see box 2 in Figure 1) to describe a property of interest of the
policy under investigation. This formula, called rule template, defines a relationship between
some properties of the belief (e.g., the probability to be in a specific state) and an action. Free
variables in the formula allow the expert to avoid quantifying the limits of this relationship.
These limits are then computed by analyzing a set of observed traces (see box 1). For instance,
to describe the behavior of the Tiger problem we can use the rule template presented in Figure 2.
The first template says that we must listen when the confidence in finding the treasure is
below a certain threshold for both doors (i.e., left or right). The other two say that
we must open the proper door when the confidence of finding the treasure is above a certain
threshold. By defining a rule template the expert provides useful prior knowledge about the
structure of the investigated property. This is combined with the real execution of a POMCP
system collected into a trace. The methodology computes a rule (i.e., a rule template with all the
free variables instantiated) using a MAX-SMT based algorithm. This algorithm finds a model
for the free variables that explains as many of the decisions taken by POMCP as possible while
satisfying the requirement defined in the template (box 3 of Figure 1). A set of rules is then used
to create a shield, a safety mechanism that we integrate into POMCP to preemptively block
actions that do not respect the details defined by the expert with the template (box 4).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        We test our methodology in two domains, namely, the standard POMDP domain Tiger [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
and a robotic-inspired problem (velocity regulation) in which a robot travels a pre-specified
path divided into segments with a (hidden) difficulty. The goal is to travel as fast as possible
while avoiding collisions. The higher the speed, the higher the reward, but a higher speed
carries a greater risk of collision. A full description of the problem is presented in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. To
test the robustness of the shield in different scenarios, we injected an error in the POMCP
implementation of the two domains. We modified the RewardRange parameter (called c in the
following) in POMCP. This parameter is used by UCT to balance exploration and exploitation.
If this value is lower than the correct one, the algorithm can find a reward that exceeds
the expected maximum, leading to a wrong state: the agent believes it has
identified the best possible action and stops exploring new actions. This is an interesting
error because it is hard to detect: it randomly affects the exploration-exploitation trade-off
without introducing any systematic mistakes. The code of the shielding mechanism is available
at https://github.com/GiuMaz/XPOMCP.
      </p>
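      <p>The role of the RewardRange parameter can be illustrated with the UCB1 rule used inside UCT, whose exploration bonus is scaled by a constant derived from the range of possible returns. The sketch below uses made-up visit statistics to show how an underestimated constant makes the search commit to the action that currently looks best; it illustrates the mechanism only and is not the paper's experiment.</p>
      <preformat>
```python
# UCB1 selection as used inside UCT: value + c * sqrt(ln N / n_a).
# With a too-small exploration constant c (derived from RewardRange),
# the bonus of the under-sampled action no longer offsets the value gap,
# so the search stops revisiting it. Statistics below are made up.
import math

def ucb1(value, n_parent, n_action, c):
    return value + c * math.sqrt(math.log(n_parent) / n_action)

stats = {"fast": (90.0, 100), "slow": (80.0, 2)}  # action: (mean return, visits)
n = sum(visits for _, visits in stats.values())

for c in (103.0, 5.0):
    best = max(stats, key=lambda a: ucb1(stats[a][0], n, stats[a][1], c))
    print("c =", c, "-> explores", best)
```
      </preformat>
      <p>With c = 103 the under-sampled "slow" action is still revisited, while with c = 5 the search locks onto "fast" and never reassesses the alternative.</p>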
      <p>In Tiger, the average return achieved using the shield is the same in all four cases, and this
is also identical to the return achieved by the correct policy. This is because in tiger we can
write a shield that perfectly recreates the behavior of the correct policy, a goal that is difficult to
achieve in real-world problems. This is particularly interesting because the shields in the cases
of c ∈ {80, 60, 40} are obtained by using traces generated with a POMCP implementation that
does make some mistakes. As a consequence, the execution traces contain wrong decisions.
However, the combination of insight provided by the expert with the MAX-SMT-based analysis
of the traces results in a shield with extremely good performance.</p>
      <p>In velocity regulation, the first row shows that the usage of a shield can improve the
performance even when c is correct (i.e., c = 103). In this case, the shield intervenes only 7 times
(over the 3500 analyzed steps), yielding a 5.38% increment in the return. This happens because
the shield blocks the rare cases in which the POMCP simulations are not enough to properly
assess the risk of moving at high speed. When c decreases, the shield intervenes more often
(see column #SA) since the error due to the limited number of simulations is combined with
the errors generated by an incorrect value of c. Table 1.b also shows that a higher number of
interventions leads to a bigger relative increase in the performance (column RI). The difference
is statistically significant in the case of c ∈ {103, 90, 70}, and shows that the introduction of the
shield improves the performance by up to 81%, even in cases in which the shield is trained
using traces generated by a POMCP process that makes some mistakes. In the case of c = 50
the return increases but the difference is not statistically significant. The shield intervenes 171
times by blocking risky high-speed moves, but unlike Tiger, in which we use a rule for every
possible action, here POMCP makes many wrong decisions when moving at low or medium
speed (for example, by moving slowly when the path is clear).</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and Future Work</title>
      <p>In this work, we present a methodology that generates a shielding mechanism for POMCP
exploiting a high-level representation of expected policy behavior provided by human experts.
The shielding mechanism preemptively blocks unexpected actions. We aim to further improve
the integration between POMCP and the shielding mechanism (e.g., by considering the effect of
shielding on other actions besides the first one of the simulation) and to develop an approach
for synthesizing logical rules online, i.e., while the POMCP algorithm is running.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cassandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Littman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes</article-title>
          , in:
          <source>Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence</source>
          ,
          <year>1997</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Papadimitriou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Tsitsiklis</surname>
          </string-name>
          ,
          <article-title>The Complexity of Markov Decision Processes</article-title>
          ,
          <source>Math. Oper. Res.</source>
          <volume>12</volume>
          (
          <year>1987</year>
          )
          <fpage>441</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <article-title>Monte-Carlo Planning in large POMDPs</article-title>
          , in:
          <string-name><given-names>J. D.</given-names> <surname>Lafferty</surname></string-name>
          ,
          <string-name><given-names>C. K. I.</given-names> <surname>Williams</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Shawe-Taylor</surname></string-name>
          ,
          <string-name><given-names>R. S.</given-names> <surname>Zemel</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Culotta</surname></string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          <volume>23</volume>
          ,
          Curran Associates, Inc.,
          <year>2010</year>
          , pp.
          <fpage>2164</fpage>
          -
          <lpage>2172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Marchesini</surname></string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Explaining the influence of prior knowledge on POMCP policies</article-title>
          ,
          <source>in: Proceedings of the 17th European Conference on Multi-Agents Systems, volume 12520 of Lecture Notes in Artificial Intelligence</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marchesini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Online monte carlo planning for autonomous robots: Exploiting prior knowledge on task similarities</article-title>
          ,
          <source>in: Proceedings of the 6th Italian Workshop on Artificial Intelligence and Robotics (AIRO 2019@AI*IA 2019), volume 2594 of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chalkiadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Influence of State-Variable Constraints on Partially Observable Monte Carlo Planning</article-title>
          ,
          <source>in: Proc. 28-th International Joint Conference on Artificial Intelligence, IJCAI-19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5540</fpage>
          -
          <lpage>5546</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gunning</surname>
          </string-name>
          ,
          <source>DARPA's Explainable Artificial Intelligence (XAI) Program</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>ii</fpage>
          -
          <lpage>ii</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Magazzeni</surname>
          </string-name>
          ,
          <article-title>Explainable Planning</article-title>
          ,
          <source>CoRR abs/1709.10256</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cashmore</surname>
          </string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Collins</surname></string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krarup</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krivic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Magazzeni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <source>Towards Explainable AI Planning as a Service</source>
          ,
          <year>2019</year>
          . 2nd ICAPS Workshop on Explainable Planning, XAIP
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Kaelbling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Littman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Cassandra</surname>
          </string-name>
          ,
          <article-title>Planning and Acting in Partially Observable Stochastic Domains</article-title>
          ,
          <source>Artif. Intell.</source>
          <volume>101</volume>
          (
          <year>1998</year>
          )
          <fpage>99</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Identification of Unexpected Decisions in Partially Observable Monte Carlo Planning: A Rule-Based Approach</article-title>
          , in: Proceedings of the 20th
          <source>International Conference on Autonomous Agents and MultiAgent Systems</source>
          , AAMAS '21,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>