<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From POMDP executions to policy specifications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniele Meli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulio Mazzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Castellini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Farinelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Verona</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Partially Observable Markov Decision Processes (POMDPs) allow modeling systems with uncertain state using probability distributions over states (called beliefs). However, in complex domains, POMDP solvers must explore large belief spaces, which is computationally intractable. One solution is to introduce domain knowledge to drive exploration, in the form of logic specifications. However, defining efective specifications may be challenging even for domain experts. We propose an approach based on inductive logic programming to learn specifications with confidence level from observed POMDP executions. We show that the learning approach converges to robust specifications as the number of examples increases.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Partially Observable MDPs</kwd>
        <kwd>Inductive Logic Programming</kwd>
        <kwd>Answer Set Programming</kwd>
        <kwd>Explainable AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>already been used to enhance interpretability of black-box models [14, 15, 16] and mine robotic
task knowledge [17, 18].</p>
      <p>We show that state-of-the-art ILASP software [19, 20] is able to automatically learn logic rules
representing the policy strategy. These rules are human-readable, since they use human-defined
features.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Rocksample</title>
        <p>In the rocksample domain, an agent can move in cardinal directions (north, south, east and west)
on a  ×  grid, with the goal to reach and sample a set of  rocks with known positions.
Rocks may have a good or bad value, yielding to positive or negative reward when sampled,
respectively. The observable part of the state is the position of the agent and rocks, while the
unobservable part is the value of rocks, modeled with belief distribution. The agent can check
the value of rocks to reduce uncertainty. Finally, the agent gets a positive reward exiting the
grid from the right-hand side. In this paper, we use  = 12 and  = 4.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Answer Set Programming (ASP) and Inductive Logic Programming (ILP)</title>
        <p>
          As of standard syntax [21], an ASP program represents a domain of interest with a signature and
axioms. The signature is the alphabet of the domain, defining variables with their ranges (e.g.,
rock identifiers R∈ {1..4} in rocksample); and atoms, i.e., predicates of variables (e.g., actions
or environmental features as dist(R,D) for representing distance D between agent and rock
R). A variable with assigned value is ground, and an atom is ground if its variables are ground.
Axioms define logical implications between atoms in the form h :- b1.., (i.e., ⋀︀=1b →h). For
instance, in rocksample, axiom sample(R) :- dist(R,0) means that a rock can be sampled if
the agent is on it. ASP solvers start from known ground atoms (determined from observable
state and belief in POMDP traces) and propagate them through axioms to compute a ground
program (i.e., with ground atoms). Ground atoms are interpreted as Booleans and true atoms are
returned. For instance, in previous example, if dist(
          <xref ref-type="bibr" rid="ref1">1,0</xref>
          ), then sample(
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) becomes executable.
        </p>
        <p>A generic ILP problem  under the ASP semantics is defined as  = ⟨,  , ⟩, where 
is the background knowledge, i.e., a set of ASP axioms, atoms and variables;  is the search
space, i.e., the set of candidate ASP axioms to be learned; and  is a set of examples, i.e.,
contextdependent partial interpretations, namely couples ⟨, ⟩, being  a partial interpretation and 
the context. A Partial Interpretation (PI) is a pair of sets of ground atoms (actions in this paper)
⟨, ⟩, being  the included set and  the excluded set. The context is a set of ground
atoms, here representing domain features. The goal of  is to find  ∈  such that 
can be grounded from  ∪  ∪ , while  cannot. ILASP [19] finds  which satisfies most
examples, also returning the number of counterexamples (i.e., examples whose  /  is not
/ is a PI of ).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Learning ASP rules from POMDP traces</title>
      <sec id="sec-3-1">
        <title>3.1. ASP representation of the task</title>
        <p>The first step of our methodology for learning ASP specifications from POMDP executions is to
define a map  : ℬ → (ℱ ) from the belief space ℬ to the set (ℱ ) of possible groundings of
ℱ , i.e., the set of user-defined features. Map  thus translates the belief distribution to ground
ASP atoms. In rocksample, we define the following features: guess(R,V), i.e., probability
V∈ {0, 10, ..., 100} that R is valuable; dist(R,D), i.e., the 1-norm D∈ N between positions of
agent and R; delta_x(R,D) and delta_y(R,D), i.e., D∈ Z is - or -coordinate of R with respect
to agent; bounds on D,V (e.g., D&lt;1); sampled(R) to mark sampled rocks; and num_sampled(N),
i.e., N∈ {0, 10, ..., 100} percentage of rocks were sampled.</p>
        <p>We represent the set A of actions as ASP atoms, e.g., sample(R), north. We also introduce
atom target(R) to identify next rock to sample and capture intention of the agent.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. ILASP problem definition</title>
        <p>We are now interested in finding ASP axioms matching features to actions. With reference to
the notation of Section 2.2,  contains variables and ranges defined in Section 3.1.  is the set
of all possible axioms a :- b1.., being a an action and b1.. features. The set  is composed
as follows: whenever an action a is executed, an example is generated with  = {a} and
 = ∅; on the contrary, when a is not executed,  = {a} and  = ∅. The context is
computed from belief with map  . In this way, we learn axioms which explain not only why an
action was executed, but also cases when it was not.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental results</title>
      <p>We generate 1000 diferent rocksample executions with a state-of-the-art planner [ 22],
randomizing positions and values of rocks and initial positions for the agent. We then construct
ILASP examples only from executions with returns greater than or equal to the average of all
executions in order to learn only from “good” evidence. Overall, 8487 examples for each action
are generated. ILASP tasks are run separately for each action for computational eficiency. As
an example of learned axioms, the one for sample(R) follows:</p>
      <p>
        sample(R) :- target(R),dist(R,D),D ≤ 0,not sampled(R),guess(R,V),V ≥ 90. (
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
meaning that an unsampled rock (not sampled(R)) can be sampled when the agent is on
it (dist(R,D), D ≤ 0) and the rock is valuable with high probability (guess(R,V),V ≥
90). Figure 1 shows the learning results, selecting diferent percentages of examples in the
training set. We want to discover ASP axioms underlying policy computation from patterns
in execution traces, hence we do not have access to ground truth specifications to compute
standard evaluation metrics and assess learning performance. Instead, we evaluate percentage
of counterexamples (on the left) and distance between learned axioms (on the right) for diferent
number (#) of training examples, with respect to axioms learned from the full dataset. Given 2
rules 1, 2, each one made of a set of atoms {a},  ∈ {1, 2}, we define distance 1 − 2 =
|{a1}∪{a2}|−|{ a1}∩{a2}|. For instance, sample(R) :- dist(R,V), V≤ 2 has a distance 5 from
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), due to the missing not sampled(R), guess(R, V),V ≥ 90 and target(R) and the diferent
upper bound on distance D. The chart on the right of Figure 1 reports distances normalized
with respect to the number of atoms in final axioms ( i.e., axioms using 100% examples). We
observe that using ≥ 80% of the dataset, the percentage of counterexamples stabilizes and
the distance becomes null for all actions, thus learning successfully converges to a specific
hypothesis. Overall, examples learned from the full dataset cover more than 73% of examples.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We have proposed a method based on ILP and ASP to induce logic specifications explaining
POMDP policies, starting from examples of POMDP executions. Our approach only requires the
definition of high-level domain-dependent features from an expert user, which is easier than
defining the structure of specifications. Our axioms are enriched with a level of confidence,
corresponding to the number of covered examples in the dataset. The confidence level converges
as the number of considered examples in the dataset increases, as well as the distance between
learned axioms. Furthermore, at least 73% of &gt; 8400 examples are covered, proving the goodness
of our axioms. In the future, we aim to include learned specifications in online POMDP solvers
to specify new kinds of constraints [23] able to improve performance and eficiency.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This project has received funding from the Italian Ministry for University and Research, under
the PON “Ricerca e Innovazione” 2014-2020 (grant agreement No. 40-G-14702-1).
[16] G. De Giacomo, M. Favorito, L. Iocchi, F. Patrizi, Imitation learning over heterogeneous
agents with restraining bolts, in: Proceedings of the international conference on automated
planning and scheduling, volume 30, 2020, pp. 517–521.
[17] D. Meli, P. Fiorini, M. Sridharan, Towards inductive learning of surgical task knowledge:
A preliminary case study of the peg transfer task, Procedia Computer Science 176 (2020)
440–449.
[18] D. Meli, M. Sridharan, P. Fiorini, Inductive learning of answer set programs for autonomous
surgical task planning, Machine Learning 110 (2021) 1739–1763.
[19] M. Law, A. Russo, K. Broda, The ILASP system for learning answer set programs, www.</p>
      <p>ilasp.com, 2015.
[20] M. Law, A. Russo, K. Broda, Iterative learning of answer set programs from context
dependent examples, Theory and Practice of Logic Programming 16 (2016) 834–848.
[21] F. Calimeri, W. Faber, M. Gebser, G. Ianni, R. Kaminski, T. Krennwallner, N. Leone,
M. Maratea, F. Ricca, T. Schaub, Asp-core-2 input language format, Theory and Practice
of Logic Programming 20 (2020) 294–309.
[22] D. Silver, J. Veness, Monte-carlo planning in large pomdps, Advances in neural information
processing systems 23 (2010).
[23] A. Castellini, G. Chalkiadakis, A. Farinelli, Influence of State-Variable Constraints on
Partially Observable Monte Carlo Planning, in: IJCAI 2019, Macao, China, August 10-16,
2019, ijcai.org, 2019, pp. 5540–5546.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Cassandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Littman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Incremental pruning: A simple, fast, exact method for partially observable markov decision processes</article-title>
          ,
          <source>arXiv preprint arXiv:1302.1525</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Papadimitriou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Tsitsiklis</surname>
          </string-name>
          ,
          <article-title>The complexity of markov decision processes</article-title>
          ,
          <source>Mathematics of operations research 12</source>
          (
          <year>1987</year>
          )
          <fpage>441</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Leonetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <article-title>A synthesis of automated planning and reinforcement learning for eficient, robust decision-making</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>241</volume>
          (
          <year>2016</year>
          )
          <fpage>103</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sridharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gelfond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Wyatt,
          <article-title>Reba: A refinement-based architecture for knowledge representation and reasoning in robotics</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>65</volume>
          (
          <year>2019</year>
          )
          <fpage>87</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Favorito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Patrizi</surname>
          </string-name>
          ,
          <article-title>Foundations for restraining bolts: Reinforcement learning with ltlf/ldlf restraining specifications</article-title>
          ,
          <source>in: Proceedings of the international conference on automated planning and scheduling</source>
          , volume
          <volume>29</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Identification of unexpected decisions in partially observable monte-carlo planning: A rule-based approach</article-title>
          ,
          <source>in: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems</source>
          (AAMAS),
          <source>International Foundation for Autonomous Agents and Multiagent Systems</source>
          ,
          <year>2021</year>
          , p.
          <fpage>889</fpage>
          -
          <lpage>897</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Rule-based shielding for partially observable montecarlo planning</article-title>
          ,
          <source>in: Proceedings of the International Conference on Automated Planning and Scheduling</source>
          , volume
          <volume>31</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Active generation of logical rules for pomcp shielding</article-title>
          ,
          <source>in: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems</source>
          , AAMAS '22,
          <string-name>
            <surname>International</surname>
            <given-names>Foundation</given-names>
          </string-name>
          <source>for Autonomous Agents and Multiagent Systems</source>
          , Richland,
          <string-name>
            <surname>SC</surname>
          </string-name>
          ,
          <year>2022</year>
          , p.
          <fpage>1696</fpage>
          -
          <lpage>1698</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lifschitz</surname>
          </string-name>
          ,
          <article-title>Answer set planning</article-title>
          ,
          <source>in: International Conference on Logic Programming and Nonmonotonic Reasoning</source>
          , Springer,
          <year>1999</year>
          , pp.
          <fpage>373</fpage>
          -
          <lpage>374</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Erdem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patoglu</surname>
          </string-name>
          ,
          <article-title>Applications of asp in robotics</article-title>
          ,
          <source>KI-Künstliche Intelligenz</source>
          <volume>32</volume>
          (
          <year>2018</year>
          )
          <fpage>143</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ginesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sansonetto</surname>
          </string-name>
          , P. Fiorini,
          <article-title>Autonomous task planning and situation awareness in robotic surgery</article-title>
          ,
          <source>in: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>3144</fpage>
          -
          <lpage>3150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tagliabue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dall'Alba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fiorini</surname>
          </string-name>
          ,
          <article-title>Deliberation in autonomous robotic surgery: a framework for handling anatomical uncertainty</article-title>
          ,
          <source>arXiv preprint arXiv:2203</source>
          .05438, in publication for
          <source>IEEE ICRA</source>
          <year>2022</year>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Muggleton</surname>
          </string-name>
          ,
          <article-title>Inductive logic programming</article-title>
          ,
          <source>New generation computing 8</source>
          (
          <year>1991</year>
          )
          <fpage>295</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rabold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siebers</surname>
          </string-name>
          , U. Schmid,
          <article-title>Explaining black-box classifiers with ilp-empowering lime with aleph to approximate non-linear decisions with relational rules</article-title>
          ,
          <source>in: International Conference on Inductive Logic Programming</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>A. D'Asaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spezialetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Raggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <article-title>Towards an inductive logic programming approach for explaining black-box preference learning systems</article-title>
          ,
          <source>in: Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning</source>
          , volume
          <volume>17</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>855</fpage>
          -
          <lpage>859</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>