
From POMDP executions to policy specifications

Daniele Meli

Giulio Mazzi

Alberto Castellini

Alessandro Farinelli

Department of Computer Science, University of Verona, Italy

Partially Observable Markov Decision Processes (POMDPs) allow modeling systems with uncertain state using probability distributions over states (called beliefs). However, in complex domains, POMDP solvers must explore large belief spaces, which is computationally intractable. One solution is to introduce domain knowledge, in the form of logic specifications, to drive exploration. However, defining effective specifications may be challenging even for domain experts. We propose an approach based on inductive logic programming to learn specifications with a confidence level from observed POMDP executions. We show that the learning approach converges to robust specifications as the number of examples increases.

Keywords: Partially Observable MDPs, Inductive Logic Programming, Answer Set Programming, Explainable AI

1. Introduction

already been used to enhance interpretability of black-box models [14, 15, 16] and mine robotic task knowledge [17, 18].

We show that state-of-the-art ILASP software [19, 20] is able to automatically learn logic rules representing the policy strategy. These rules are human-readable, since they use human-defined features.

2. Background

2.1. Rocksample

In the rocksample domain, an agent can move in the cardinal directions (north, south, east and west) on an $N \times N$ grid, with the goal of reaching and sampling a set of $M$ rocks with known positions. Rocks may have a good or bad value, yielding a positive or negative reward when sampled, respectively. The observable part of the state is the position of the agent and rocks, while the unobservable part is the value of rocks, modeled with a belief distribution. The agent can check the value of rocks to reduce uncertainty. Finally, the agent gets a positive reward for exiting the grid from the right-hand side. In this paper, we use $N = 12$ and $M = 4$.

2.2. Answer Set Programming (ASP) and Inductive Logic Programming (ILP)

Following the standard syntax [21], an ASP program represents a domain of interest with a signature and axioms. The signature is the alphabet of the domain, defining variables with their ranges (e.g., rock identifiers R ∈ {1..4} in rocksample) and atoms, i.e., predicates of variables (e.g., actions, or environmental features such as dist(R,D), representing the distance D between the agent and rock R). A variable with an assigned value is ground, and an atom is ground if its variables are ground. Axioms define logical implications between atoms in the form h :- b1, ..., bn (i.e., $\bigwedge_{i=1}^{n} b_i \rightarrow h$). For instance, in rocksample, the axiom sample(R) :- dist(R,0) means that a rock can be sampled if the agent is on it. ASP solvers start from known ground atoms (determined from the observable state and the belief in POMDP traces) and propagate them through axioms to compute a ground program (i.e., one containing only ground atoms). Ground atoms are interpreted as Booleans, and the true atoms are returned. For instance, in the previous example, if dist(1,0) holds, then sample(1) becomes executable.
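The grounding step described above can be illustrated with a minimal Python sketch. It is not an ASP solver: it only hard-codes the single axiom sample(R) :- dist(R,0) and applies it to a set of known ground atoms. The tuple encoding of atoms and the names `ROCK_IDS` and `apply_sample_axiom` are assumptions made for illustration.

```python
# Toy signature (assumption): rock identifiers R in {1..4}, as in the text.
ROCK_IDS = range(1, 5)


def apply_sample_axiom(ground_atoms):
    """Ground `sample(R) :- dist(R,0)` against a set of known ground atoms.

    Atoms are encoded as tuples, e.g. ("dist", 1, 0) stands for dist(1,0).
    Returns the set of sample(R) atoms that become true.
    """
    return {("sample", r) for r in ROCK_IDS if ("dist", r, 0) in ground_atoms}


# Known ground atoms: the agent is on rock 1 and at distance 3 from rock 2.
known = {("dist", 1, 0), ("dist", 2, 3)}
derived = apply_sample_axiom(known)
print(sorted(derived))  # [('sample', 1)]
```

Only sample(1) is derived, matching the example in the text; a real solver would perform this propagation for every axiom in the program.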

A generic ILP problem under the ASP semantics is defined as $\mathcal{T} = \langle B, S_M, E \rangle$, where $B$ is the background knowledge, i.e., a set of ASP axioms, atoms and variables; $S_M$ is the search space, i.e., the set of candidate ASP axioms to be learned; and $E$ is a set of examples, i.e., context-dependent partial interpretations, namely pairs $\langle e, C \rangle$, where $e$ is a partial interpretation and $C$ is the context. A Partial Interpretation (PI) is a pair of sets of ground atoms (actions in this paper) $\langle e^{inc}, e^{exc} \rangle$, where $e^{inc}$ is the included set and $e^{exc}$ is the excluded set. The context is a set of ground atoms, here representing domain features. The goal of $\mathcal{T}$ is to find $H \in S_M$ such that $e^{inc}$ can be grounded from $B \cup H \cup C$, while $e^{exc}$ cannot. ILASP [19] finds the $H$ that satisfies most examples, also returning the number of counterexamples (i.e., examples whose $e^{inc}$ is not, or whose $e^{exc}$ is, a PI of $H$).
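To make the notion of a counterexample concrete, the following Python sketch represents a context-dependent example as an included set, an excluded set and a context, and counts the examples that a hypothesis fails to cover. The hypothesis is modeled as a plain Python function from a context to the set of derived actions, which is a stand-in for actual ASP solving; all names (`Example`, `is_covered`, `toy_hypothesis`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    included: frozenset  # ground actions that must be derivable
    excluded: frozenset  # ground actions that must not be derivable
    context: frozenset   # ground feature atoms


def is_covered(example, derive):
    """An example is covered if all included atoms and no excluded atom are derived."""
    derived = derive(example.context)
    return example.included <= derived and not (example.excluded & derived)


def count_counterexamples(examples, derive):
    return sum(not is_covered(e, derive) for e in examples)


def toy_hypothesis(context):
    """Stand-in for the axiom `sample(R) :- dist(R,0)`."""
    return {("sample", a[1]) for a in context if a[0] == "dist" and a[2] == 0}


examples = [
    # sample(1) was executed while standing on rock 1: covered.
    Example(frozenset({("sample", 1)}), frozenset(), frozenset({("dist", 1, 0)})),
    # sample(1) was not executed at distance 2: covered.
    Example(frozenset(), frozenset({("sample", 1)}), frozenset({("dist", 1, 2)})),
    # sample(2) was not executed although the agent was on rock 2: counterexample.
    Example(frozenset(), frozenset({("sample", 2)}), frozenset({("dist", 2, 0)})),
]
n_counterexamples = count_counterexamples(examples, toy_hypothesis)
```

In this toy setting the third example is the only counterexample, since the hypothesis derives an action that the example excludes.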

3. Learning ASP rules from POMDP traces

3.1. ASP representation of the task

The first step of our methodology for learning ASP specifications from POMDP executions is to define a map $F : \mathcal{B} \rightarrow G(\mathcal{F})$ from the belief space $\mathcal{B}$ to the set $G(\mathcal{F})$ of possible groundings of $\mathcal{F}$, i.e., the set of user-defined features. The map $F$ thus translates the belief distribution to ground ASP atoms. In rocksample, we define the following features: guess(R,V), i.e., the probability V ∈ {0, 10, ..., 100} that R is valuable; dist(R,D), i.e., the 1-norm distance D ∈ N between the positions of the agent and R; delta_x(R,D) and delta_y(R,D), i.e., D ∈ Z is the x- or y-coordinate of R with respect to the agent; bounds on D and V (e.g., D<1); sampled(R) to mark sampled rocks; and num_sampled(N), i.e., the percentage N ∈ {0, 10, ..., 100} of rocks that were sampled.
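A possible implementation of such a feature map is sketched below in Python. The input encoding (agent position as a pair, dictionaries for rock positions and for the probability that each rock is valuable) and the discretization choices (rounding the probability to the nearest ten, rounding the sampled percentage down to a multiple of ten) are assumptions; the text only fixes the feature names and their ranges.

```python
def belief_to_atoms(agent, rocks, belief, sampled):
    """Translate one belief snapshot into ground ASP feature atoms (as strings).

    agent:   (x, y) position of the agent
    rocks:   {rock_id: (x, y)} known rock positions
    belief:  {rock_id: probability in [0, 1] that the rock is valuable}
    sampled: set of rock ids already sampled
    """
    atoms = []
    for r, (rx, ry) in sorted(rocks.items()):
        dx, dy = rx - agent[0], ry - agent[1]
        v = 10 * round(belief[r] * 10)  # assumed: nearest value in {0, 10, ..., 100}
        atoms += [
            f"guess({r},{v})",
            f"dist({r},{abs(dx) + abs(dy)})",  # 1-norm distance
            f"delta_x({r},{dx})",
            f"delta_y({r},{dy})",
        ]
        if r in sampled:
            atoms.append(f"sampled({r})")
    # Assumed: percentage of sampled rocks, rounded down to a multiple of 10.
    n = (100 * len(sampled) // len(rocks)) // 10 * 10
    atoms.append(f"num_sampled({n})")
    return atoms


atoms = belief_to_atoms(
    agent=(2, 3),
    rocks={1: (2, 3), 2: (5, 1)},
    belief={1: 0.92, 2: 0.5},
    sampled={2},
)
```

For this snapshot the agent stands on rock 1, so the result contains dist(1,0) and guess(1,90), which are exactly the kind of atoms that bodies of learned axioms refer to.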

We represent the set A of actions as ASP atoms, e.g., sample(R), north. We also introduce the atom target(R) to identify the next rock to sample and capture the intention of the agent.

3.2. ILASP problem definition

We are now interested in finding ASP axioms matching features to actions. With reference to the notation of Section 2.2, the background knowledge $B$ contains the variables and ranges defined in Section 3.1. The search space $S_M$ is the set of all possible axioms a :- b1, ..., bn, where a is an action and b1, ..., bn are features. The set of examples $E$ is composed as follows: whenever an action a is executed, an example is generated with included set $e^{inc}$ = {a} and excluded set $e^{exc}$ = ∅; on the contrary, when a is not executed, $e^{exc}$ = {a} and $e^{inc}$ = ∅. The context $C$ is computed from the belief with the feature map of Section 3.1. In this way, we learn axioms which explain not only why an action was executed, but also the cases in which it was not.
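The example construction above can be sketched in Python as follows. The sketch emits one context-dependent example per action for a single execution step, using the ILASP `#pos(id, {inclusions}, {exclusions}, {context}).` form. The function name, the example identifiers and the absence of example weights are assumptions, not necessarily the encoding used by the authors.

```python
def make_examples(step_id, context_atoms, executed, all_actions):
    """Build one ILASP-style example string per action for one execution step.

    The executed action goes in the included set; every other action goes
    in the excluded set. The context lists the ground feature atoms as facts.
    """
    ctx = " ".join(f"{a}." for a in context_atoms)
    lines = []
    for k, action in enumerate(all_actions):
        inc, exc = (action, "") if action == executed else ("", action)
        lines.append(f"#pos(e{step_id}_{k}, {{{inc}}}, {{{exc}}}, {{{ctx}}}).")
    return lines


lines = make_examples(
    step_id=0,
    context_atoms=["dist(1,0)", "guess(1,90)"],
    executed="sample(1)",
    all_actions=["sample(1)", "north"],
)
for line in lines:
    print(line)
# #pos(e0_0, {sample(1)}, {}, {dist(1,0). guess(1,90).}).
# #pos(e0_1, {}, {north}, {dist(1,0). guess(1,90).}).
```

Since learning is run separately for each action, the lines for a given action can then be collected into that action's own task file.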

4. Experimental results

We generate 1000 different rocksample executions with a state-of-the-art planner [22], randomizing the positions and values of rocks and the initial position of the agent. We then construct ILASP examples only from executions with returns greater than or equal to the average of all executions, in order to learn only from “good” evidence. Overall, 8487 examples for each action are generated. ILASP tasks are run separately for each action for computational efficiency. As an example of learned axioms, the one for sample(R) follows:
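The selection of “good” executions can be sketched in a few lines of Python. The record layout (a dictionary with a `return` field) and the function name are hypothetical; the text only states that executions with a return greater than or equal to the average are kept.

```python
from statistics import mean


def select_good_executions(executions):
    """Keep only executions whose return is at least the average return."""
    avg = mean(e["return"] for e in executions)
    return [e for e in executions if e["return"] >= avg]


executions = [
    {"id": 0, "return": 10.0},
    {"id": 1, "return": 20.0},
    {"id": 2, "return": 30.0},
]
good = select_good_executions(executions)  # average is 20.0, so ids 1 and 2 remain
```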

sample(R) :- target(R), dist(R,D), D ≤ 0, not sampled(R), guess(R,V), V ≥ 90.    (1)

meaning that an unsampled rock (not sampled(R)) can be sampled when the agent is on it (dist(R,D), D ≤ 0) and the rock is valuable with high probability (guess(R,V), V ≥ 90). Figure 1 shows the learning results when selecting different percentages of examples for the training set.

We want to discover the ASP axioms underlying policy computation from patterns in execution traces, hence we do not have access to ground-truth specifications to compute standard evaluation metrics and assess learning performance. Instead, we evaluate the percentage of counterexamples (on the left) and the distance between learned axioms (on the right) for different numbers (#) of training examples, with respect to the axioms learned from the full dataset. Given two rules $R_1, R_2$, each one made of a set of atoms $\{a_i\}$, $i \in \{1, 2\}$, we define the distance $\|R_1 - R_2\| = |\{a_1\} \cup \{a_2\}| - |\{a_1\} \cap \{a_2\}|$. For instance, sample(R) :- dist(R,D), D ≤ 2 has a distance of 5 from (1), due to the missing not sampled(R), guess(R,V), V ≥ 90 and target(R), and the different upper bound on the distance D. The chart on the right of Figure 1 reports distances normalized with respect to the number of atoms in the final axioms (i.e., the axioms learned using 100% of the examples). We observe that, using ≥ 80% of the dataset, the percentage of counterexamples stabilizes and the distance becomes null for all actions, thus learning successfully converges to a specific hypothesis. Overall, the axioms learned from the full dataset cover more than 73% of the examples.
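The distance between two rules, defined above as the size of the union minus the size of the intersection of their atom sets, can be computed with the short Python sketch below. It assumes that a feature together with its bound (e.g., dist(R,D), D ≤ 0) counts as a single atom; this is one reading under which the value of 5 quoted in the text is reproduced, not a convention confirmed by the source. The string encoding of atoms is likewise an assumption.

```python
def rule_distance(atoms_1, atoms_2):
    """Distance between two rules: |union| - |intersection| of their atom sets."""
    a1, a2 = set(atoms_1), set(atoms_2)
    return len(a1 | a2) - len(a1 & a2)


# Body of axiom (1); each feature is grouped with its bound (assumption).
axiom_1 = {
    "target(R)",
    "dist(R,D), D <= 0",
    "not sampled(R)",
    "guess(R,V), V >= 90",
}
# Body of the example rule with upper bound 2 on the distance.
candidate = {"dist(R,D), D <= 2"}

distance = rule_distance(axiom_1, candidate)
print(distance)  # 5
```

Two identical rules have distance 0, which corresponds to the convergence condition reported for training sets of at least 80% of the examples.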

5. Conclusion

We have proposed a method based on ILP and ASP to induce logic specifications explaining POMDP policies, starting from examples of POMDP executions. Our approach only requires an expert user to define high-level domain-dependent features, which is easier than defining the structure of specifications. Our axioms are enriched with a level of confidence, corresponding to the number of covered examples in the dataset. Both the confidence level and the distance between learned axioms converge as the number of considered examples in the dataset increases. Furthermore, at least 73% of the more than 8400 examples are covered, which supports the quality of our axioms. In the future, we aim to include learned specifications in online POMDP solvers to specify new kinds of constraints [23] able to improve performance and efficiency.

Acknowledgments

This project has received funding from the Italian Ministry for University and Research, under the PON “Ricerca e Innovazione” 2014-2020 (grant agreement No. 40-G-14702-1).

References

[1] A. R. Cassandra, M. L. Littman, N. L. Zhang, Incremental pruning: A simple, fast, exact method for partially observable markov decision processes, arXiv preprint arXiv:1302.1525 (2013).

[2] C. H. Papadimitriou, J. N. Tsitsiklis, The complexity of markov decision processes, Mathematics of operations research 12 (1987) 441–450.

[3] Leonetti, Iocchi, Stone, A synthesis of automated planning and reinforcement learning for efficient, robust decision-making, Artificial Intelligence 241 (2016) 103–130.

[4] Sridharan, Gelfond, Zhang, J. Wyatt, Reba: A refinement-based architecture for knowledge representation and reasoning in robotics, Journal of Artificial Intelligence Research 65 (2019) 87–180.

[5] De Giacomo, Iocchi, Favorito, Patrizi, Foundations for restraining bolts: Reinforcement learning with ltlf/ldlf restraining specifications, in: Proceedings of the international conference on automated planning and scheduling, volume 29, 2019, pp. 128–136.

[6] Mazzi, Castellini, Farinelli, Identification of unexpected decisions in partially observable monte-carlo planning: A rule-based approach, in: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), International Foundation for Autonomous Agents and Multiagent Systems, 2021, pp. 889–897.

[7] Mazzi, Castellini, Farinelli, Rule-based shielding for partially observable monte-carlo planning, in: Proceedings of the International Conference on Automated Planning and Scheduling, volume 31, 2021, pp. 243–251.

[8] Mazzi, Castellini, Farinelli, Active generation of logical rules for pomcp shielding, in: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS '22, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2022, pp. 1696–1698.

[9] Lifschitz, Answer set planning, in: International Conference on Logic Programming and Nonmonotonic Reasoning, Springer, 1999, pp. 373–374.

[10] Erdem, Patoglu, Applications of asp in robotics, KI-Künstliche Intelligenz 32 (2018) 143–149.

[11] Ginesi, Meli, Roberti, Sansonetto, P. Fiorini, Autonomous task planning and situation awareness in robotic surgery, in: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020, pp. 3144–3150.

[12] Tagliabue, Meli, Dall'Alba, Fiorini, Deliberation in autonomous robotic surgery: a framework for handling anatomical uncertainty, arXiv preprint arXiv:2203.05438, in publication for IEEE ICRA 2022 (2022).

[13] Muggleton, Inductive logic programming, New generation computing 8 (1991) 295–318.

[14] Rabold, Siebers, U. Schmid, Explaining black-box classifiers with ilp-empowering lime with aleph to approximate non-linear decisions with relational rules, in: International Conference on Inductive Logic Programming, Springer, 2018, pp. 105–117.

[15] A. D'Asaro, Spezialetti, Raggioli, Rossi, Towards an inductive logic programming approach for explaining black-box preference learning systems, in: Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, volume 17, 2020, pp. 855–859.

[16] G. De Giacomo, M. Favorito, L. Iocchi, F. Patrizi, Imitation learning over heterogeneous agents with restraining bolts, in: Proceedings of the international conference on automated planning and scheduling, volume 30, 2020, pp. 517–521.

[17] D. Meli, P. Fiorini, M. Sridharan, Towards inductive learning of surgical task knowledge: A preliminary case study of the peg transfer task, Procedia Computer Science 176 (2020) 440–449.

[18] D. Meli, M. Sridharan, P. Fiorini, Inductive learning of answer set programs for autonomous surgical task planning, Machine Learning 110 (2021) 1739–1763.

[19] M. Law, A. Russo, K. Broda, The ILASP system for learning answer set programs, www.ilasp.com, 2015.

[20] M. Law, A. Russo, K. Broda, Iterative learning of answer set programs from context dependent examples, Theory and Practice of Logic Programming 16 (2016) 834–848.

[21] F. Calimeri, W. Faber, M. Gebser, G. Ianni, R. Kaminski, T. Krennwallner, N. Leone, M. Maratea, F. Ricca, T. Schaub, Asp-core-2 input language format, Theory and Practice of Logic Programming 20 (2020) 294–309.

[22] D. Silver, J. Veness, Monte-carlo planning in large pomdps, Advances in neural information processing systems 23 (2010).

[23] A. Castellini, G. Chalkiadakis, A. Farinelli, Influence of State-Variable Constraints on Partially Observable Monte Carlo Planning, in: IJCAI 2019, Macao, China, August 10-16, 2019, ijcai.org, 2019, pp. 5540–5546.