From POMDP executions to policy specifications

Daniele Meli, Giulio Mazzi, Alberto Castellini and Alessandro Farinelli
Department of Computer Science, University of Verona, Italy

Abstract
Partially Observable Markov Decision Processes (POMDPs) allow modeling systems with uncertain state using probability distributions over states (called beliefs). However, in complex domains, POMDP solvers must explore large belief spaces, which is computationally intractable. One solution is to introduce domain knowledge to drive exploration, in the form of logic specifications. However, defining effective specifications may be challenging even for domain experts. We propose an approach based on inductive logic programming to learn specifications with a confidence level from observed POMDP executions. We show that the learning approach converges to robust specifications as the number of examples increases.

Keywords
Partially Observable MDPs, Inductive Logic Programming, Answer Set Programming, Explainable AI

1. Introduction

Partially Observable Markov Decision Processes (POMDPs) are a popular framework for modeling systems with state uncertainty [1]. The state cannot be completely observed by the agent, hence it is modeled as a probability distribution called the belief. A POMDP agent executes actions depending on the current belief, and each belief-action pair is mapped to a new belief according to a belief update rule that considers all possible state transitions defined by the transition map. Finally, a reward is assigned to each state-action pair. The goal of the agent is to maximize the cumulative reward (return) by selecting optimal actions (through a policy function).
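In its standard formulation (with T the transition model, O the observation model and η a normalization constant; this notation is ours, as the paper does not spell the update out), the belief update after executing action a and receiving observation o takes the form:

```latex
b'(s') \;=\; \eta \, O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s),
\qquad
\eta \;=\; \Big( \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s) \Big)^{-1}
```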
Unfortunately, computing optimal policies for POMDPs is complex [2], because many different belief realizations are possible at each step. In the field of MDPs and reinforcement learning [3, 4, 5], and more recently of POMDPs [6, 7, 8], one solution is to bound the policy search with logic specifications. However, defining such specifications requires knowledge about the policy that is usually unavailable. We propose to learn logic specifications offline from traces (belief-action pairs) of POMDP executions. We express specifications in Answer Set Programming (ASP) [9], a state-of-the-art paradigm for logic planning [10, 11, 12]. We convert the belief to an ASP representation, in terms of higher-level domain features specified by an expert user, in order to find logic relationships between beliefs and actions. Defining features requires only basic domain knowledge (e.g., the relevant domain quantities used to represent any POMDP task instance). We then exploit Inductive Logic Programming (ILP) [13] to infer ASP rules from feature-action pairs. ILP has already been used to enhance the interpretability of black-box models [14, 15, 16] and to mine robotic task knowledge [17, 18]. We show that the state-of-the-art ILASP system [19, 20] is able to automatically learn logic rules representing the policy strategy. These rules are human-readable, since they use human-defined features.

2. Background

2.1. Rocksample

In the rocksample domain, an agent can move in the four cardinal directions (north, south, east and west) on an N × N grid, with the goal of reaching and sampling a set of M rocks with known positions. Rocks may have a good or bad value, yielding a positive or negative reward when sampled, respectively. The observable part of the state is the position of the agent and of the rocks, while the unobservable part is the value of the rocks, modeled with a belief distribution. The agent can check the value of rocks to reduce uncertainty. Finally, the agent receives a positive reward for exiting the grid from the right-hand side. In this paper, we use N = 12 and M = 4.

2.2. Answer Set Programming (ASP) and Inductive Logic Programming (ILP)

Following the standard syntax [21], an ASP program represents a domain of interest with a signature and axioms. The signature is the alphabet of the domain, defining variables with their ranges (e.g., rock identifiers R ∈ {1..4} in rocksample) and atoms, i.e., predicates over variables (e.g., actions or environmental features, such as dist(R,D) representing the distance D between the agent and rock R). A variable with an assigned value is ground, and an atom is ground if all its variables are ground. Axioms define logical implications between atoms in the form h :- b1, ..., bn (i.e., b1 ∧ ... ∧ bn → h). For instance, in rocksample, the axiom sample(R) :- dist(R,0) means that a rock can be sampled if the agent is on it. ASP solvers start from known ground atoms (determined from the observable state and the belief in POMDP traces) and propagate them through the axioms to compute a ground program (i.e., a program with ground atoms). Ground atoms are interpreted as Booleans and the true atoms are returned. For instance, in the previous example, if dist(1,0) holds, then sample(1) becomes executable.

A generic ILP problem 𝒯 under the ASP semantics is defined as 𝒯 = ⟨B, S_M, E⟩, where B is the background knowledge, i.e., a set of ASP axioms, atoms and variables; S_M is the search space, i.e., the set of candidate ASP axioms to be learned; and E is a set of examples, i.e., context-dependent partial interpretations, namely pairs ⟨e, C⟩, where e is a partial interpretation and C the context. A partial interpretation (PI) is a pair of sets of ground atoms (actions in this paper) ⟨e_inc, e_exc⟩, where e_inc is the included set and e_exc the excluded set. The context is a set of ground atoms, here representing domain features. The goal of 𝒯 is to find H ∈ S_M such that e_inc can be grounded from B ∪ H ∪ C, while e_exc cannot. ILASP [19] finds the H which satisfies most examples, also returning the number of counterexamples (i.e., examples whose e_inc is not, or whose e_exc is, a PI of H).
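The grounding-and-propagation mechanism described above can be reproduced on the toy axiom in a few lines of Python; the sketch below is only an illustration of Section 2.2 and assumes the clingo Python bindings are installed (the paper's learning pipeline relies on ILASP, not on this code).

```python
# Minimal grounding sketch for the example of Section 2.2 (illustrative only).
# Assumption: the `clingo` Python bindings are available in the environment.
import clingo

program = """
dist(1,0).                % ground feature atom: the agent stands on rock 1
sample(R) :- dist(R,0).   % a rock can be sampled if the agent is on it
"""

ctl = clingo.Control()
ctl.add("base", [], program)        # load the tiny ASP program
ctl.ground([("base", [])])          # compute the ground program
ctl.solve(on_model=lambda m: print("Answer set:", m))
# Prints the atoms of the answer set, including the derived sample(1).
```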
3. Learning ASP rules from POMDP traces

3.1. ASP representation of the task

The first step of our methodology for learning ASP specifications from POMDP executions is to define a map 𝐹 : ℬ → 𝐺(ℱ) from the belief space ℬ to the set 𝐺(ℱ) of possible groundings of ℱ, i.e., the set of user-defined features. The map 𝐹 thus translates the belief distribution into ground ASP atoms.

In rocksample, we define the following features: guess(R,V), i.e., the probability V ∈ {0, 10, ..., 100} that rock R is valuable; dist(R,D), i.e., the 1-norm distance D ∈ ℕ between the positions of the agent and R; delta_x(R,D) and delta_y(R,D), i.e., the x- or y-coordinate D ∈ ℤ of R with respect to the agent; comparison bounds on D and V (e.g., D < 1); sampled(R), marking already sampled rocks; and num_sampled(N), i.e., the percentage N ∈ {0, 10, ..., 100} of rocks that have been sampled. We represent the set 𝒜 of actions as ASP atoms, e.g., sample(R) and north. We also introduce the atom target(R) to identify the next rock to sample and capture the intention of the agent.

3.2. ILASP problem definition

We are now interested in finding ASP axioms matching features to actions. With reference to the notation of Section 2.2, B contains the variables and ranges defined in Section 3.1. S_M is the set of all possible axioms a :- b1, ..., bn, where a is an action and b1, ..., bn are features. The set E is composed as follows: whenever an action a is executed, an example is generated with e_inc = {a} and e_exc = ∅; on the contrary, when a is not executed, e_exc = {a} and e_inc = ∅. The context is computed from the belief with the map 𝐹. In this way, we learn axioms which explain not only why an action was executed, but also the cases in which it was not.
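As a concrete illustration of the map 𝐹 and of the example-generation scheme above, the sketch below builds the feature context and the pairs ⟨e_inc, e_exc⟩ for a single trace step. It is not the code used in the paper: the function names (belief_to_context, make_examples), the action subset, the sign conventions and the discretization are our assumptions.

```python
# Illustrative sketch of map F (Section 3.1) and example generation (Section 3.2).
# Names, data layout, sign conventions and discretization are assumptions.

ACTIONS = ["north", "south", "east", "west",
           "sample(1)", "sample(2)", "sample(3)", "sample(4)"]  # subset, for illustration

def belief_to_context(agent_pos, rock_pos, rock_value_prob, sampled):
    """Map F: observable state + belief -> ground ASP feature atoms (as strings)."""
    ax, ay = agent_pos
    atoms = []
    for r, (rx, ry) in rock_pos.items():
        atoms.append(f"guess({r},{round(rock_value_prob[r] * 10) * 10}).")  # V in {0,10,...,100}
        atoms.append(f"dist({r},{abs(rx - ax) + abs(ry - ay)}).")           # 1-norm distance
        atoms.append(f"delta_x({r},{rx - ax}).")
        atoms.append(f"delta_y({r},{ry - ay}).")
        if r in sampled:
            atoms.append(f"sampled({r}).")
    atoms.append(f"num_sampled({round(len(sampled) / len(rock_pos) * 10) * 10}).")
    return atoms  # the target(R) atom used in the paper is omitted here for brevity

def make_examples(executed_action, context):
    """One example per action: e_inc={a}, e_exc={} if a was executed, the converse otherwise."""
    examples = []
    for a in ACTIONS:
        e_inc, e_exc = ({a}, set()) if a == executed_action else (set(), {a})
        examples.append((e_inc, e_exc, context))
    return examples

# Usage on a single (belief, action) step of a trace:
ctx = belief_to_context(agent_pos=(2, 3),
                        rock_pos={1: (2, 3), 2: (7, 0), 3: (0, 9), 4: (11, 5)},
                        rock_value_prob={1: 0.95, 2: 0.4, 3: 0.1, 4: 0.7},
                        sampled=set())
for example in make_examples("sample(1)", ctx):
    print(example)
```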
4. Experimental results

We generate 1000 different rocksample executions with a state-of-the-art planner [22], randomizing the positions and values of the rocks and the initial position of the agent. We then construct ILASP examples only from executions with a return greater than or equal to the average over all executions, in order to learn only from "good" evidence. Overall, 8487 examples are generated for each action. ILASP tasks are run separately for each action for computational efficiency. As an example of learned axioms, the one for sample(R) follows:

sample(R) :- target(R), dist(R,D), D ≤ 0, not sampled(R), guess(R,V), V ≥ 90.    (1)

meaning that an unsampled rock (not sampled(R)) can be sampled when the agent is on it (dist(R,D), D ≤ 0) and the rock is valuable with high probability (guess(R,V), V ≥ 90).

Figure 1 shows the learning results obtained when selecting different percentages of examples for the training set. We want to discover the ASP axioms underlying policy computation from patterns in execution traces, hence we do not have access to ground-truth specifications to compute standard evaluation metrics and assess learning performance. Instead, we evaluate the percentage of counterexamples (on the left of Figure 1) and the distance between learned axioms (on the right) for different numbers of training examples, with respect to the axioms learned from the full dataset. Given two rules R1 and R2, each made of a set of atoms {a_i}, i ∈ {1, 2}, we define the distance R1 − R2 = |{a_1} ∪ {a_2}| − |{a_1} ∩ {a_2}|. For instance, sample(R) :- dist(R,D), D ≤ 2 has distance 5 from (1), due to the missing not sampled(R), guess(R,V), V ≥ 90 and target(R), and the different upper bound on the distance D.

[Figure 1: Learning results for different amounts of examples, expressed as percentage of the full dataset.]

The chart on the right of Figure 1 reports distances normalized with respect to the number of atoms in the final axioms (i.e., the axioms learned using 100% of the examples). We observe that, using ≥ 80% of the dataset, the percentage of counterexamples stabilizes and the distance becomes null for all actions, thus learning successfully converges to a specific hypothesis. Overall, the axioms learned from the full dataset cover more than 73% of the examples.
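The rule distance can be stated in a few lines of code; the sketch below is our own illustration, where each rule body is a set of atom strings and, following the worked example above, a comparison bound is attached to the feature it constrains (so changing the bound makes the whole atom differ).

```python
# Rule distance |A1 ∪ A2| - |A1 ∩ A2| from Section 4 (illustrative sketch).
# Each rule body is a set of atom strings; bounds are attached to their feature,
# so axiom (1) contributes 4 atoms and the candidate rule below contributes 1.

def rule_distance(atoms1: set, atoms2: set) -> int:
    return len(atoms1 | atoms2) - len(atoms1 & atoms2)

axiom_1 = {"target(R)", "dist(R,D), D <= 0", "not sampled(R)", "guess(R,V), V >= 90"}
candidate = {"dist(R,D), D <= 2"}

print(rule_distance(axiom_1, candidate))  # 5, as in the example of Section 4
```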
5. Conclusion

We have proposed a method based on ILP and ASP to induce logic specifications explaining POMDP policies, starting from examples of POMDP executions. Our approach only requires the definition of high-level domain-dependent features by an expert user, which is easier than defining the structure of the specifications themselves. The learned axioms are enriched with a confidence level, corresponding to the number of covered examples in the dataset. The confidence level converges as the number of considered examples increases, and so does the distance between learned axioms. Furthermore, at least 73% of the more than 8400 examples are covered, supporting the quality of our axioms. In the future, we aim to include the learned specifications in online POMDP solvers as new kinds of constraints [23] able to improve performance and efficiency.

Acknowledgments

This project has received funding from the Italian Ministry for University and Research, under the PON "Ricerca e Innovazione" 2014-2020 (grant agreement No. 40-G-14702-1).

References

[1] A. R. Cassandra, M. L. Littman, N. L. Zhang, Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes, arXiv preprint arXiv:1302.1525 (2013).
[2] C. H. Papadimitriou, J. N. Tsitsiklis, The complexity of Markov decision processes, Mathematics of Operations Research 12 (1987) 441–450.
[3] M. Leonetti, L. Iocchi, P. Stone, A synthesis of automated planning and reinforcement learning for efficient, robust decision-making, Artificial Intelligence 241 (2016) 103–130.
[4] M. Sridharan, M. Gelfond, S. Zhang, J. Wyatt, REBA: A refinement-based architecture for knowledge representation and reasoning in robotics, Journal of Artificial Intelligence Research 65 (2019) 87–180.
[5] G. De Giacomo, L. Iocchi, M. Favorito, F. Patrizi, Foundations for restraining bolts: Reinforcement learning with LTLf/LDLf restraining specifications, in: Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, 2019, pp. 128–136.
[6] G. Mazzi, A. Castellini, A. Farinelli, Identification of unexpected decisions in partially observable Monte-Carlo planning: A rule-based approach, in: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), International Foundation for Autonomous Agents and Multiagent Systems, 2021, pp. 889–897.
[7] G. Mazzi, A. Castellini, A. Farinelli, Rule-based shielding for partially observable Monte-Carlo planning, in: Proceedings of the International Conference on Automated Planning and Scheduling, volume 31, 2021, pp. 243–251.
[8] G. Mazzi, A. Castellini, A. Farinelli, Active generation of logical rules for POMCP shielding, in: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS '22, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2022, pp. 1696–1698.
[9] V. Lifschitz, Answer set planning, in: International Conference on Logic Programming and Nonmonotonic Reasoning, Springer, 1999, pp. 373–374.
[10] E. Erdem, V. Patoglu, Applications of ASP in robotics, KI-Künstliche Intelligenz 32 (2018) 143–149.
[11] M. Ginesi, D. Meli, A. Roberti, N. Sansonetto, P. Fiorini, Autonomous task planning and situation awareness in robotic surgery, in: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020, pp. 3144–3150.
[12] E. Tagliabue, D. Meli, D. Dall'Alba, P. Fiorini, Deliberation in autonomous robotic surgery: a framework for handling anatomical uncertainty, arXiv preprint arXiv:2203.05438, in publication for IEEE ICRA 2022 (2022).
[13] S. Muggleton, Inductive logic programming, New Generation Computing 8 (1991) 295–318.
[14] J. Rabold, M. Siebers, U. Schmid, Explaining black-box classifiers with ILP: empowering LIME with Aleph to approximate non-linear decisions with relational rules, in: International Conference on Inductive Logic Programming, Springer, 2018, pp. 105–117.
[15] F. A. D'Asaro, M. Spezialetti, L. Raggioli, S. Rossi, Towards an inductive logic programming approach for explaining black-box preference learning systems, in: Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, volume 17, 2020, pp. 855–859.
[16] G. De Giacomo, M. Favorito, L. Iocchi, F. Patrizi, Imitation learning over heterogeneous agents with restraining bolts, in: Proceedings of the International Conference on Automated Planning and Scheduling, volume 30, 2020, pp. 517–521.
[17] D. Meli, P. Fiorini, M. Sridharan, Towards inductive learning of surgical task knowledge: A preliminary case study of the peg transfer task, Procedia Computer Science 176 (2020) 440–449.
[18] D. Meli, M. Sridharan, P. Fiorini, Inductive learning of answer set programs for autonomous surgical task planning, Machine Learning 110 (2021) 1739–1763.
[19] M. Law, A. Russo, K. Broda, The ILASP system for learning answer set programs, www.ilasp.com, 2015.
[20] M. Law, A. Russo, K. Broda, Iterative learning of answer set programs from context dependent examples, Theory and Practice of Logic Programming 16 (2016) 834–848.
[21] F. Calimeri, W. Faber, M. Gebser, G. Ianni, R. Kaminski, T. Krennwallner, N. Leone, M. Maratea, F. Ricca, T. Schaub, ASP-Core-2 input language format, Theory and Practice of Logic Programming 20 (2020) 294–309.
[22] D. Silver, J. Veness, Monte-Carlo planning in large POMDPs, Advances in Neural Information Processing Systems 23 (2010).
[23] A. Castellini, G. Chalkiadakis, A. Farinelli, Influence of state-variable constraints on partially observable Monte Carlo planning, in: IJCAI 2019, Macao, China, August 10-16, 2019, ijcai.org, 2019, pp. 5540–5546.