=Paper=
{{Paper
|id=Vol-1433/tc_30
|storemode=property
|title=Learning Probabilistic Action Models from Interpretation Transitions
|pdfUrl=https://ceur-ws.org/Vol-1433/tc_30.pdf
|volume=Vol-1433
|dblpUrl=https://dblp.org/rec/conf/iclp/0004RIAT15
}}
==Learning Probabilistic Action Models from Interpretation Transitions==
Technical Communications of ICLP 2015. Copyright with the Authors.

Learning Probabilistic Action Models from Interpretation Transitions

David Martínez, Institut de Robòtica i Informàtica Industrial (CSIC-UPC) (e-mail: dmartinez@iri.upc.edu)
Tony Ribeiro, The Graduate University for Advanced Studies (Sokendai) (e-mail: tony_ribeiro@nii.ac.jp)
Katsumi Inoue, National Institute of Informatics (e-mail: inoue@nii.ac.jp)
Guillem Alenyà, Carme Torras, Institut de Robòtica i Informàtica Industrial (CSIC-UPC) (e-mail: {galenya,torras}@iri.upc.edu)

submitted 29 April 2015; accepted 5 June 2015

Abstract

There have been great advances in the probabilistic planning community during recent years, and planners can now provide solutions for very complex probabilistic tasks. However, planners require a model that represents the dynamics of the system, and in general these models are built by hand. In this paper, we present a framework to automatically infer probabilistic models from observations of the state transitions of a dynamic system. We propose an extension of previous works that perform learning from interpretation transitions. These works take as input a set of state transitions and build a logic program that realizes the given transition relations. Here we extend this method to learn a compact set of probabilistic planning operators that capture probabilistic dynamics. Finally, we provide experimental validation of the quality of the learned models.

KEYWORDS: inductive logic programming, learning from interpretation transitions, probabilistic planning operators, action model learning, probabilistic planning

1 Introduction

Lately the performance of probabilistic task planners has improved to the point where they can find solutions in many complex scenarios (Kolobov et al. 2012; Keller and Eyerich 2012). Planners have been applied successfully in several fields, such as robotics (Chanel et al. 2014; Martínez et al. 2015) and aerospace (Surovik and Scheeres 2015). However, in most cases an expert is required to define the action model to be used by the planner. Creating a model is an error-prone task that requires in-depth knowledge about the problem and the planner, as well as extensive testing to assure the correctness of the model. We propose a new method to automatically learn action models from a set of given state transitions.

Most previous approaches have only tackled the problem of learning deterministic action models (Zhuo et al. 2010), including uncertainty in the perceptions (Mourao et al. 2012). Regarding logic programming, abductive action learning has been studied based on abductive event calculus (Eshghi 1988), an abductive extension of event calculus, and has been extended for applications to planning, e.g., (Shanahan 2000). (Moyle 2003) uses an ILP technique to learn a causal theory based on event calculus (Kowalski and Sergot 1989), given examples of input-output relations. Probabilistic logic programs that maximize the probabilities of observations are learned by (Corapi et al. 2011; Sykes et al. 2013). They use parameter estimation to find the probabilities associated with each atom and rule, but this requires manually coding the restrictions for the initial set of rule candidates. Action models with uncertain effects have also been learned by (Rodrigues et al. 2012; Pasula et al. 2007) in STRIPS-like languages.
Their optimization looks for planning operators that define as many effects as possible at once, which makes it a better choice for precondition-effect based PPDDL domains (Younes and Littman 2004), such as the domains in past planning competitions (IPPC 2004-2008). In contrast, our approach finds planning operators for each single action effect (i.e. rule head), which makes it a better choice to learn variable-based RDDL domains (Sanner 2010) used in newer planning competitions (IPPC 2011-2014).

Our approach learns an action model encoded as a set of planning operators. Each operator describes how the value of a predicate changes based on a set of preconditions, given that an action is executed. We present a novel method to learn at two levels, as shown in Fig. 1. On the right, the Learning From Interpretation Transitions (LFIT) framework (Inoue et al. 2014; Ribeiro and Inoue 2014) is used to generate rules. Given a set of state transitions of a system, it can learn a normal logic program that captures the system dynamics. The resulting rules are all the candidates that have to be considered to select the best planning operators. On the left, the data is generalized between different objects by using a relational representation, and an optimization method is used to select the best subsets of planning operators that explain the input transitions while maintaining generality. The dependency relations between the planning operators are used to efficiently select the best candidates. Our method is designed to learn RDDL-like (Sanner 2010) operators, where each variable is updated separately based on a set of preconditions. To that end, in this paper we present the following novel improvements:

• The LUST algorithm: an extension of the LFIT framework to learn probabilistic dynamics from uncertain state transitions. The new algorithm that we propose can construct a model of a non-deterministic system by learning probabilistic rules.
• LUST also integrates multi-valued variables, which allows us to represent actions more efficiently, since every transition involves exactly one action out of the many possible actions.
• The integration of logic programming to efficiently limit the number of rule candidates, with a relational representation that provides better generalization, and an optimization method to select the best subsets of planning operators.

The paper is organized as follows. First we introduce some background required to understand the logic rule learning, and the algorithms proposed to that end. Afterwards, the action model and the translations needed between logic programs and planning domains are explained. Finally, we show experimental results and conclusions of the presented approach.

Fig. 1. Overview of the learning framework. The modules that interact with planning data and operators are in the blue rectangle on the left, while the modules related to logic programming are included inside the red rectangle on the right. The input of the method is a set of grounded transitions, which are then converted to symbolic transitions to generalize between different objects. To obtain rule candidates, LFIT requires that the symbolic transitions are represented with propositional atoms.
Finally, after rule candidates are obtained, they are transformed to planning operators to select the best set of operators.

2 Preliminaries

In this section we recall some preliminaries of logic programming. We also explain the basis of learning from interpretation transitions, in order to make it easier to understand its extension to multi-valued variables and probabilistic rule learning.

2.1 Logic Programming

We consider a first-order language and denote the Herbrand base (the set of all ground atoms) as B. A (normal) logic program (NLP) is a set of rules of the form

A ← A_1 ∧ · · · ∧ A_m ∧ ¬A_{m+1} ∧ · · · ∧ ¬A_n   (1)

where A and the A_i are atoms (n ≥ m ≥ 0). For any rule R of the form (1), the atom A is called the head of R and is denoted as h(R), and the conjunction to the right of ← is called the body of R. An (Herbrand) interpretation I is a subset of B. For a logic program P and an Herbrand interpretation I, the immediate consequence operator (or T_P operator) (Apt et al. 1988) is the mapping T_P : 2^B → 2^B:

T_P(I) = { h(R) | R ∈ ground(P), b^+(R) ⊆ I, b^-(R) ∩ I = ∅ }.   (2)

2.2 Learning from Interpretation Transitions

LF1T (Inoue et al. 2014) is an anytime algorithm that takes a set of one-step state transitions E as input. These one-step state transitions can be considered as positive examples. From these transitions, the algorithm learns a logic program P that represents the dynamics of E. To perform this learning process, we can iteratively consider one-step transitions. In LF1T, the set of all atoms B is assumed to be finite. In the input E, a state transition is represented by a pair of interpretations. The output of LF1T is a logic program that realizes all state transitions of E.

Learning from 1-Step Transitions (LF1T)
Input: E, a set of state transitions (I, J) of a system S, and B, the set of all possible atoms that can appear in I and J.
Output: A logic program P such that J = next(I) holds for any (I, J) ∈ E.

To build a logic program with LF1T, in (Ribeiro and Inoue 2014) we used a bottom-up method that generates hypotheses by specialization from the most general rules, that is, fact rules, until the logic program is consistent with all input state transitions. Learning by specialization ensures that the most general valid hypothesis is output.

2.3 Learning Multi-valued Logic Programs

Research in multi-valued logic programming has proceeded along three different directions (Kifer and Subrahmanian 1992): bilattice-based logics (Fitting 1991; Ginsberg 1988), quantitative rule sets (Van Emden 1986) and annotated logics (Blair and Subrahmanian 1989; Blair and Subrahmanian 1988). The multi-valued logic representation used in our new algorithm is based on annotated logics. Here, to each atom corresponds a given set of values. In a rule, a literal is an atom annotated with one of these values. This allows us to represent annotated atoms simply as classical atoms, and thus to remain in the normal logic program semantics. In order to represent multi-valued variables, we now restrict all atoms of a logic program to the form var^val. The intuition behind this form is that var represents some variable of the system and val represents the value of this variable. In annotated logics, the atom var is said to be annotated by the constant val.
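To make this representation concrete, the following is a minimal sketch (our own, not taken from the paper's implementation) of one way annotated atoms var^val and multi-valued interpretations could be encoded in Python; the names Atom, Interpretation and is_valid_interpretation are hypothetical and anticipate Definition 1 below.

```python
# A minimal sketch (assumption: atoms encoded as (variable, value) string pairs).
from typing import Dict, Set, Tuple

Atom = Tuple[str, str]          # an annotated atom var^val, e.g. ("a", "1")
Interpretation = Set[Atom]      # a set of annotated atoms

def is_valid_interpretation(i: Interpretation, variables: Set[str]) -> bool:
    """Intuition behind Definition 1: every variable of the system
    appears in I with exactly one value."""
    values: Dict[str, Set[str]] = {}
    for var, val in i:
        values.setdefault(var, set()).add(val)
    return all(len(values.get(var, set())) == 1 for var in variables)

# Example: a system with variables a and b, each holding one value.
state = {("a", "1"), ("b", "0")}
print(is_valid_interpretation(state, {"a", "b"}))                  # True
print(is_valid_interpretation(state | {("a", "2")}, {"a", "b"}))   # False: a has two values
```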
We consider a multi-valued logic program as a set of rules of the form

var^val ← var_1^val_1 ∧ · · · ∧ var_n^val_n   (3)

where var^val and the var_i^val_i are atoms (n ≥ 1). As before, for any rule R of the form (3), the left part of ← is called the head of R and is denoted as h(R), and the conjunction to the right of ← is called the body of R. We represent the set of literals in the body of R of the form (3) as b(R) = {var_1^val_1, . . . , var_n^val_n}. A rule R of the form (3) is interpreted as follows: the variable var takes the value val in the next state if all variables var_i have the value val_i in the current state. An interpretation of a multi-valued program provides the value of each variable of the system and is defined as follows.

Definition 1 (Multi-valued Interpretation)
Let B be a set of atoms where each element has the form var^val. An interpretation I of a set of atoms B is a subset of B where ∀ var^val ∈ B, ∃ var^val' ∈ I, and ∀ var^val' ∈ I, ∄ var^val'' ∈ I with val' ≠ val''.

For a system S represented by a multi-valued logic program P and a state s_1 represented by an interpretation I, the successor of s_1 is represented by the interpretation

next(I) = {h(R) | R ∈ P, b(R) ⊆ I}.

The state transitions of a logic program P are represented by a set of pairs of interpretations (I, next(I)).

Definition 2 (Multi-valued Consistency)
Let R be a rule and (I, J) be a state transition. R is consistent with (I, J) iff b(R) ⊆ I implies h(R) ∈ J. Let E be a set of state transitions; R is consistent with E if R is consistent with all state transitions of E. A logic program P is consistent with E if all rules of P are consistent with E.

Definition 3 (Subsumption)
Let R_1 and R_2 be two rules. If h(R_1) = h(R_2) and b(R_1) ⊆ b(R_2) then R_1 subsumes R_2. Let P be a logic program and R be a rule. P subsumes R if there exists a rule R' ∈ P that subsumes R. We say that a rule R_1 is more general than another rule R_2 if R_1 subsumes R_2. In particular, a rule R is most general if there is no rule R' (≠ R) that subsumes R, that is, if b(R) = ∅.

Example 1
Let R_1 and R_2 be the two following rules: R_1 = (a^1 ← b^1), R_2 = (a^1 ← a^0 ∧ b^1). R_1 subsumes R_2 because (b(R_1) = {b^1}) ⊂ (b(R_2) = {a^0, b^1}). When R_1 appears in a logic program P, R_2 is useless for P, because whenever R_2 can be applied, R_1 can be applied.

To learn multi-valued logic programs with LF1T we need to adapt the ground resolution of (Inoue et al. 2014) and the least specialization of (Ribeiro and Inoue 2014) to handle non-boolean variables.

Definition 4 (Complement)
Let R_1 and R_2 be two rules. R_2 is a complement of R_1 on var^val if var^val ∈ b(R_1), var^val' ∈ b(R_2), val ≠ val', and (b(R_2) \ {var^val'}) ⊆ (b(R_1) \ {var^val}).

Definition 5 (Multi-valued ground resolution)
Let R be a rule, P be a logic program and B be a set of atoms. R can be generalized on var^val if ∀ var^val' ∈ B with val ≠ val', ∃ R' ∈ P such that R' is a complement of R on var^val:

generalise(R, P) = h(R) ← b(R) \ {var^val}.

Definition 6 (Multi-valued least specialization)
Let R_1 and R_2 be two rules such that h(R_1) = h(R_2) and R_1 subsumes R_2. Let B be a set of atoms. The least specialization ls(R_1, R_2, B) of R_1 over R_2 w.r.t. B is

ls(R_1, R_2, B) = {h(R_1) ← b(R_1) ∧ var^val' | var^val ∈ b(R_2) \ b(R_1), var^val' ∈ B, val' ≠ val}.

Least specialization can be used on a rule R to avoid the subsumption of another rule with a minimal reduction of the generality of R.
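As an illustration of Definitions 3 and 6, here is a minimal Python sketch (our own; the Rule encoding and function names are assumptions, reusing the (variable, value) atom encoding above) of rule subsumption and multi-valued least specialization.

```python
# Sketch only: rules as (head, frozen body) pairs over (variable, value) atoms.
from typing import FrozenSet, Set, Tuple

Atom = Tuple[str, str]                     # annotated atom (variable, value)
Rule = Tuple[Atom, FrozenSet[Atom]]        # (head, body)

def subsumes(r1: Rule, r2: Rule) -> bool:
    """Definition 3: r1 subsumes r2 iff same head and body(r1) ⊆ body(r2)."""
    return r1[0] == r2[0] and r1[1] <= r2[1]

def least_specialization(r1: Rule, r2: Rule, atoms: Set[Atom]) -> Set[Rule]:
    """Definition 6: specialize r1 just enough so it no longer subsumes r2.
    For each atom var^val in body(r2) \\ body(r1), add to body(r1) every atom
    var^val' of the atom base with the same variable and val' != val."""
    head, body1 = r1
    _, body2 = r2
    specialized: Set[Rule] = set()
    for var, val in body2 - body1:
        for var2, val2 in atoms:
            if var2 == var and val2 != val:
                specialized.add((head, body1 | {(var2, val2)}))
    return specialized

# Example 1 revisited: R1 = (a^1 <- b^1) subsumes R2 = (a^1 <- a^0 ∧ b^1).
B = {("a", "0"), ("a", "1"), ("b", "0"), ("b", "1")}
R1: Rule = (("a", "1"), frozenset({("b", "1")}))
R2: Rule = (("a", "1"), frozenset({("a", "0"), ("b", "1")}))
print(subsumes(R1, R2))                    # True
for r in least_specialization(R1, R2, B):
    print(r)                               # the single specialization a^1 <- b^1 ∧ a^1
```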
By extension, least specialization can be used on the rules of a logic program P to avoid the subsumption of a rule with a minimal reduction of the generality of P. Let P be a logic program, B be a set of atoms, R be a rule, and S be the set of all rules of P that subsume R. The least specialization ls(P, R, B) of P by R w.r.t. B is as follows:

ls(P, R, B) = (P \ S) ∪ ⋃_{R_P ∈ S} ls(R_P, R, B).

LF1T starts with an initial logic program P = {var^val ← | var^val ∈ B}. Then LF1T iteratively analyzes each transition (I, J) ∈ E. For each labeled atom A that does not appear in J, LF1T infers an anti-rule R_A^I:

R_A^I = A ← ⋀_{B_i ∈ I} B_i.

A rule of P that subsumes such an anti-rule is not consistent with the transitions of E and must be revised. The idea is to use least specialization to make P consistent with the new transition (I, J) by avoiding the subsumption of all anti-rules R_A^I inferred from (I, J). After least specialization, P becomes consistent with the new transition while remaining consistent with all previously analyzed transitions (Theorem 3 of (Ribeiro and Inoue 2014)). When all transitions of E have been analyzed, LF1T outputs the rules of the system that realize E.

3 Learning from Uncertain State Transitions

In this section we extend the LFIT framework to learn probabilistic dynamics from uncertain state transitions. Where other works like (Gutmann et al. 2011) perform inference from a probabilistic logic program, what we do is infer the rules of such a logic program. The programs inferred by our new algorithm are similar to the paraconsistent logic programs of (Blair and Subrahmanian 1988). The use of annotated atoms allows the learned programs to induce multiple values for the same represented variable. This allows us to represent multi-valued models and capture non-deterministic state transitions. Our semantics also differs from previous work like (Fierens et al. 2013). There, the authors consider probabilistic logic programs as logic programs in which some of the facts are annotated with probabilities. In our method, it is the rules that have probabilities, and they are independent.

3.1 Formalization

A non-deterministic system can be represented by a set of logic programs where the rules have the following form:

R = value(var, val, i, j) ← var_1^val_1 ∧ · · · ∧ var_n^val_n

where var^val and the var_i^val_i are atoms (n ≥ 1), value(var, val, i, j) is the head of R, again denoted h(R), and i, j are natural numbers with i ≤ j. Let I be a multi-valued interpretation. R means that i times over j, var takes the value val in the successor of I if b(R) ⊆ I.

Definition 7 (Non-deterministic successors)
Let I be the multi-valued interpretation of a state of a non-deterministic system S represented by a set of logic programs P. Let P' be a logic program; one of the successors of I according to P' is

next(I, P') = {var^val | R ∈ P', b(R) ⊆ I, h(R) = value(var, val, i, j), 0 < i ≤ j}.

The set of successors of I in S according to P is

successor(I, P) = {J | J = next(I, P_i), P_i ∈ P}.

Example 2
Let R_1 and R_2 be two rules such that R_1 = (value(a, 1, 10, 100) ← a^1) and R_2 = (value(a, 2, 90, 100) ← a^1). Let S be a probabilistic system represented by a set of logic programs P such that P = {{R_1}, {R_2}}, and let I be an interpretation that contains a^1. The possible next states of I in S are successor(I, P) = {{a^1}, {a^2}}. The likelihood of having a^1 in the next state of I is 10% and that of having a^2 is 90%.
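As a rough illustration of Definition 7 and Example 2, the sketch below (our own encoding; names are hypothetical and not the paper's code) represents probabilistic rules value(var, val, i, j) and enumerates the successors of an interpretation under a set of deterministic programs.

```python
# Sketch only: probabilistic rules as (var, val, i, j, body) tuples.
from typing import FrozenSet, List, Set, Tuple

Atom = Tuple[str, str]                                  # annotated atom (variable, value)
ProbRule = Tuple[str, str, int, int, FrozenSet[Atom]]   # (var, val, i, j, body)

def next_state(i: Set[Atom], program: List[ProbRule]) -> FrozenSet[Atom]:
    """next(I, P'): heads of rules whose body holds in I (with 0 < i <= j)."""
    return frozenset((var, val) for var, val, hits, total, body in program
                     if 0 < hits <= total and body <= i)

def successors(i: Set[Atom], programs: List[List[ProbRule]]) -> Set[FrozenSet[Atom]]:
    """successor(I, P): one possible successor per deterministic program."""
    return {next_state(i, p) for p in programs}

# Example 2: R1 = value(a,1,10,100) <- a^1 and R2 = value(a,2,90,100) <- a^1.
R1: ProbRule = ("a", "1", 10, 100, frozenset({("a", "1")}))
R2: ProbRule = ("a", "2", 90, 100, frozenset({("a", "1")}))
P = [[R1], [R2]]
I = {("a", "1")}
print(successors(I, P))   # {frozenset({('a','1')}), frozenset({('a','2')})}
# a^1 holds in the next state 10 times out of 100 (10%); a^2, 90 times out of 100 (90%).
```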
3.2 Algorithm

We now present LUST, an extension of LFIT for Learning from Uncertain State Transitions. LUST learns a set of deterministic logic programs. The main idea is that when two transitions are not consistent, we need two different programs to realize them: the first program realizes the first transition and the second one realizes the second transition. The algorithm outputs a set of logic programs such that every transition given as input is realized by at least one of those programs. The learned rules also provide the probability of the variable values in the next state. The probability of each rule R = value(var, val, i, j) ← b(R) is simply obtained by counting how many transitions (I, J) it realizes (when b(R) ⊆ I and var^val ∈ J), represented by i, over how many transitions it matches (when b(R) ⊆ I), represented by j.

LUST algorithm:
• Input: a set of pairs of interpretations E and a set of atoms B.
• Step 1: Initialize a set of logic programs P with one program P_1 containing fact rules for each atom of B.
• Step 2: Pick (I, J) in E and check the consistency of (I, J) with all programs of P. If there is no logic program in P that realizes (I, J):
— copy one of the logic programs P_i into a program P_i' and add rules to P_i' to realize (I, J);
— use full ground resolution to generalize P_i'.
• Step 3: Revise all logic programs that realize (I, J) by using least specialization.
• Step 4: If there is a remaining transition in E, go to Step 2.
• Step 5: Compute the probability of each rule of all programs P_i according to E.
• Output: P, a set of multi-valued logic programs that realizes E.

4 Integration of Logic Programming and Planning Domains

In this section we describe the formalization used to represent planning operators, as well as the data conversions needed to learn symbolic operators from grounded transitions while using propositional logic programming.

4.1 Planning Model

Although LUST uses a propositional representation, the planning model uses a relational representation to provide better generalization between different states. Relational domains represent the state structure and objects explicitly. These domains are described by using a vocabulary of predicates P and actions A, and a set of objects C_π. Predicates and actions take objects as arguments to define their ground counterparts.

Example 3
Let on(X,Y) be a symbolic predicate, and {box1, box2, box3} be a set of objects. Three possible groundings of the symbolic predicate on(X,Y) with the given objects are on(box1, box2), on(box1, box3) and on(box3, box1).

A ground predicate or action is equivalent to an atom. In a planning domain, a planning state s is a set of ground predicates s = {p_1, p_2, ..., p_n}, which is equivalent to an Herbrand interpretation. Our planning operators represent a subset of RDDL domains (Sanner 2010). For each variable, we define the probability that it will become true in the next state based on a set of preconditions. In contrast to the full RDDL specification, these preconditions can only consist of an action and a set of predicates. Thus we define a planning operator as a tuple o = ⟨o_p^val, o_act, o_prec, o_prob⟩ where:
• o_p^val is a predicate p whose value can change to val by applying the operator. It is equivalent to the head of a logic rule.
• o_act is the action that has to be executed for the operator to be applicable.
• o_prec is a set of predicates that have to be satisfied in the current state so that the planning operator is applicable.
It is equivalent to the body of a logic rule.
• o_prob is the probability that p^val will be true.

4.2 Data Representation

To have a more general and compact model, we use a relational representation at the planning level. The input of our method consists of state transitions, which are tuples t = ⟨s, act, s'⟩ where s and s' are the states before and after executing the action act. The states consist of sets of ground predicates, and act is a grounded action. On the other hand, the output is a set of symbolic (i.e. non-grounded) planning operators. Therefore, our method transforms the initial grounded data into symbolic planning operators. Moreover, LUST works with propositional atoms, so a transformation from symbolic predicates to atoms and back to symbolic predicates is also needed. Figure 1 shows the needed data conversions, which are explained in more detail below.

Transform grounded transitions to symbolic transitions:
• Input: a set of grounded transitions T = [t_1, t_2, ..., t_n].
• For each transition t = ⟨s, act, s'⟩:
— Take every argument χ of the action act.
— Substitute χ with a default symbolic parameter in s, act, and s'.
— Create a new transition t' with the symbolic predicates in s and s' and the symbolic action act.
— Add the new transition t' to T'.
• Output: a set of symbolic transitions T'.

Transform a symbolic transition to an interpretation transition:
• Input: a symbolic transition t = ⟨s, act, s'⟩.
• Assign an atom to each predicate in s, act and s'.
• I = the atoms that correspond to s.
• Add act as an atom in I that represents the action.
• J = the atoms that correspond to s'.
• Output: an interpretation transition (I, J).

To transform a planning symbolic transition to an interpretation transition, a labeled atom that encodes the action is added to the body of the interpretation transition. As each transition has exactly one action, a multi-valued variable represents it more efficiently than booleans; otherwise every action would have to be represented as a different variable with only one being true. Moreover, each symbolic predicate value is represented by one labeled atom. After the logic programs are generated, the labeled atoms are translated back to symbolic predicates by using the same conversion.

Transform a logic program to a set of planning operators:
• Input: a logic program P.
• For every rule R ∈ P such that h(R) = value(var, val, i, j), a planning operator o is created so that:
— o_p^val = var^val.
— o_act = the action in the atoms of b(R).
— o_prec = the set of atoms in b(R) that represent predicates.
— o_prob = i/j.
— Add o to O.
• Output: a set of planning operators O.

5 Selecting the Set of Planning Operators

In this section we present the method to select the best subset of probabilistic planning operators by using the set of logic programs generated by LF1T. First, the requirements that the planning operators have to satisfy are presented. Afterwards, we explain the preferences used to decide which are the best planning operators. Finally, the method to obtain the desired planning operators is described.

5.1 Planning Operator Requirements

Probabilistic planners require that only one planning operator can be applied in each state-action pair. Therefore the model has to be defined with a set of non-conflicting operators, so the planner can always decide which operator to apply for each state-action pair.
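Definition 8 below formalizes when two operators conflict. As a rough sketch (our own simplification, not the paper's code; the Operator fields and example names are hypothetical), two operators with the same action and effect but different probabilities conflict when their precondition sets can hold in the same state, i.e. they never require different values for a shared variable.

```python
# Sketch only: planning operators as a small NamedTuple, conflict check per Definition 8.
from typing import FrozenSet, NamedTuple, Tuple

Atom = Tuple[str, str]   # predicate/value pair, e.g. ("flattire", "true")

class Operator(NamedTuple):
    effect: Atom                      # o_p^val: the predicate value set by the operator
    action: str                       # o_act
    preconditions: FrozenSet[Atom]    # o_prec
    probability: float                # o_prob

def compatible(prec1: FrozenSet[Atom], prec2: FrozenSet[Atom]) -> bool:
    """True if some state can satisfy both precondition sets, i.e. the two
    sets never assign different values to the same variable."""
    vals1 = dict(prec1)
    return all(vals1.get(var, val) == val for var, val in prec2)

def in_conflict(o1: Operator, o2: Operator) -> bool:
    """Approximation of Definition 8: same action and effect, different
    probabilities, and jointly satisfiable preconditions."""
    return (o1.action == o2.action and o1.effect == o2.effect
            and o1.probability != o2.probability
            and compatible(o1.preconditions, o2.preconditions))

# Two hypothetical operators for the same action and effect:
o1 = Operator(("flattire", "true"), "movecar", frozenset({("road", "true")}), 0.4)
o2 = Operator(("flattire", "true"), "movecar", frozenset(), 0.1)
print(in_conflict(o1, o2))   # True: both can apply in a state where road=true
```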
Definition 8 (Conflicting planning operators)
Let o_1 and o_2 be two planning operators that represent the same action (o_1,act = o_2,act) and change the same predicate (o_1,p^val = o_2,p^val) with different probabilities (o_1,prob ≠ o_2,prob). A conflict exists between both planning operators if ∃ s such that o_1,prec ⊂ s and o_2,prec ⊂ s.

5.2 Score Function

LUST provides the minimal set of rules that describe all possible transitions. However, the best subset of non-conflicting planning operators has to be selected to create the model. To decide which is the best set of operators, the algorithm uses a score function. The algorithm prefers operators with a high likelihood (i.e. that can successfully explain the state transitions), but it also has a regularization term to avoid overfitting to the training data. The regularization is based on the number of planning operators and the number of preconditions in those operators, so that general operators are preferred when the likelihoods are similar. This regularization penalty is larger when there are few training transitions to estimate each operator, as general operators are preferred to poorly estimated specific ones. On the other hand, the regularization penalty decreases as our estimate improves with more transitions, because the method can be more confident about the operators. The following functions are used to define the score function.

• The likelihood P(t|O) = P(s'|s, act, o), where o ∈ O, o_prec ⊆ s and o_act = act, is the probability that the transition t is covered by the set of planning operators O.
• The penalty term Pen(O) = |O| + (1/|O|) Σ_{o∈O} |o_prec| is the number of planning operators plus the average number of preconditions that they have.
• The confidence in a planning operator Conf(O, T) is bounded by using Hoeffding's inequality. The probability that our estimate ô_prob is accurate enough, |ô_prob − o_prob| ≤ ε, with a number of samples |T| is bounded by Conf(O, T) ≤ 1 − 2e^(−2ε²|T|).

Finally, the proposed score function is defined as

S(O, T) = (1/|T|) Σ_{t∈T} P(t|O) − α · Pen(O) / Conf(O, T)   (4)

where α is a scaling parameter for the penalty term. Also note that Conf(O, T) ≃ 1 when the number of input transitions is large, so the penalty term will always be present to ensure that general operators are preferred.

5.3 Selecting the Best Planning Operator Subset

In this section we present the algorithm used to select the subset of non-conflicting planning operators that maximizes the score function. To do it efficiently we make use of the dependency relations between operators by using the subsumption relation graph defined below.

Definition 9 (Planning operator subsumption relation)
Let o_1 and o_2 be two planning operators that represent the same action (o_1,act = o_2,act) and the same effect (o_1,p^val = o_2,p^val). o_1 subsumes the operator o_2 if o_1,prec ⊂ o_2,prec.

Fig. 2. Example of a subsumption graph. In each node, the planning operator preconditions are shown. Leaves are the nodes painted in blue.

Definition 10 (Subsumption graph)
The subsumption graph G_O of a set of planning operators O = {o_1, ..., o_n} is a directed graph with arcs (o_i, o_j) when o_i subsumes o_j and |o_j,prec| − |o_i,prec| = 1.

Figure 2 shows an example of a subsumption graph. We will call leaves L(G_O) of the graph all nodes that do not have a child.
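As a rough sketch (our own; the reduced operator encoding and function names are assumptions, not the paper's code), the subsumption graph of Definition 10 and its leaves could be computed as follows.

```python
# Sketch only: operators reduced to (action, effect, preconditions), which is
# all the subsumption relation of Definition 9 looks at.
from typing import Dict, FrozenSet, List, Set, Tuple

Atom = Tuple[str, str]
Operator = Tuple[str, Atom, FrozenSet[Atom]]   # (action, effect, preconditions)

def subsumes_op(o1: Operator, o2: Operator) -> bool:
    """Definition 9: same action and effect, and prec(o1) ⊂ prec(o2)."""
    return o1[0] == o2[0] and o1[1] == o2[1] and o1[2] < o2[2]

def subsumption_graph(ops: List[Operator]) -> Dict[int, List[int]]:
    """Definition 10: arc (i, j) when ops[i] subsumes ops[j] and the
    precondition sets differ by exactly one predicate."""
    graph: Dict[int, List[int]] = {i: [] for i in range(len(ops))}
    for i, oi in enumerate(ops):
        for j, oj in enumerate(ops):
            if subsumes_op(oi, oj) and len(oj[2]) - len(oi[2]) == 1:
                graph[i].append(j)
    return graph

def leaves(graph: Dict[int, List[int]]) -> Set[int]:
    """L(G_O): nodes without children."""
    return {node for node, children in graph.items() if not children}
```

The greedy selection procedure described next would operate on such leaves, repeatedly scoring non-conflicting leaf subsets and replacing operators by their parents when that improves the score.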
The subsumption graph orders the rules in levels that represent the generality of the operators: the fewer predicates in the preconditions, the more general the operator is. This graph provides two advantages: generalizing operators is easier, as we only have to follow the arcs, and it reduces the number of conflicts to check, because general operators are not checked if other more specific operators that represent the same dynamics exist. Below, a greedy approach to efficiently obtain a set of good planning operators is provided.

Select set of planning operators:
• Create the subsumption graph.
• Repeat until nothing changes:
— Select the graph leaves, and choose the subset of non-conflicting leaves with the highest score. Remove the other leaves.
— Check if replacing any operator by a more general one (i.e. the operator's parents in the subsumption graph) improves the score. If so, remove the specific operator.
• The result is the set of leaves in the remaining graph.

6 Evaluation

In this section we provide an experimental evaluation of our approach by learning two domains of the 2014 International Probabilistic Planning Competition (IPPC). The experiments use transitions t = ⟨s, act, s'⟩ generated by randomly constructing a state s, randomly picking the arguments of the action act, and then applying the action to generate the state s'. The distribution of samples is biased to guarantee that half of the samples have a chance to change the state. To measure the quality of the learned models, the errors shown represent the differences between the learned operators and the ground truth ones. For each incorrect predicate in an operator's preconditions, the error count was increased by 1.

Fig. 3. Results of learning the Triangle Tireworld domain (left plot: ChangeTire, LoadTire, MoveCar) and the Elevators domain (right plot: closeDoor, openDoorGoingUp/Down, moveCurrentDir) from IPPC 2014. Each plot shows the errors against the number of training transitions per action. The results shown are the means and standard deviations obtained from 10 runs.

The left plot in Fig. 3 shows the results obtained in the Triangle Tireworld domain. In this domain, a car has to move to its destination, but it has a probability of getting a flat tire while it moves. The domain has 3 actions: a “Change Tire” action that has only one precondition, a “Load Tire” action that has 2 preconditions to change one predicate, and a “MoveCar” action that changes 3 predicates based on 3 different preconditions. The results show that easy actions can be learned with just a few transitions, while very complex actions require 100 transitions until average errors below 1 are obtained.

The right plot in Fig. 3 shows the results obtained in the Elevators domain. In this domain, we are only learning the actions that interact with the elevators, and not the dynamic effects related to people using them. The “openDoorGoing*” and “closeDoor” actions are easy to learn, but the “moveCurrentDir” action has two effects with 3 different preconditions each, and another two effects with 4 preconditions. Therefore, to successfully learn the dynamics of “moveCurrentDir” a large number of input transitions is required.

7 Conclusions

We have presented a new approach to learn planning operators.
In contrast to previous approaches, it learns the dynamics of every predicate independently, and thus can be applied to learn RDDL domains. The presented approach can find good sets of planning operators efficiently, as it combines logic programming to restrict the set of candidates, and then an optimization method to select a good subset of operators (i.e. a set that generalizes as much as possible while explaining the input transitions well enough). To that end, the LFIT algorithm has been extended to learn probabilistic dynamics and multi-valued variables. Finally, experimental validation is provided to show that the planning operators can be learned with a reduced number of transitions.

As future work, we would like to work on planning operators independent of actions. This would require a more intelligent method to generate symbolic transitions from grounded transitions, so that changes not represented by actions could be learned. Moreover, optimal solutions for selecting the best subset of planning operators should be analyzed.

8 Acknowledgments

This research is supported in part by the JSPS 2014-2017 Grants-in-Aid for Scientific Research (B) No. 26540122 and 2014-2015 Challenging Exploratory Research No. 26280092. D. Martínez is also supported by the Spanish Ministry of Education, Culture and Sport via an FPU doctoral grant (FPU12-04173).

References

Apt, K. R., Blair, H. A., and Walker, A. 1988. Towards a theory of declarative knowledge. Foundations of Deductive Databases and Logic Programming, 89.
Blair, H. A. and Subrahmanian, V. 1988. Paraconsistent foundations for logic programming. Journal of Non-Classical Logic 5, 2, 45–73.
Blair, H. A. and Subrahmanian, V. 1989. Paraconsistent logic programming. Theoretical Computer Science 68, 2, 135–154.
Chanel, C. P. C., Lesire, C., and Teichteil-Königsbuch, F. 2014. A robotic execution framework for online probabilistic (re)planning. In Twenty-Fourth International Conference on Automated Planning and Scheduling. 454–462.
Corapi, D., Sykes, D., Inoue, K., and Russo, A. 2011. Probabilistic rule learning in non-monotonic domains. In Computational Logic in Multi-Agent Systems. Springer, 243–258.
Eshghi, K. 1988. Abductive planning with event calculus. In ICLP/SLP. 562–579.
Fierens, D., Van den Broeck, G., Renkens, J., Shterionov, D., Gutmann, B., Thon, I., Janssens, G., and De Raedt, L. 2013. Inference and learning in probabilistic logic programs using weighted boolean formulas. Theory and Practice of Logic Programming, 1–44.
Fitting, M. 1991. Bilattices and the semantics of logic programming. The Journal of Logic Programming 11, 2, 91–116.
Ginsberg, M. L. 1988. Multivalued logics: A uniform approach to reasoning in artificial intelligence. Computational Intelligence 4, 3, 265–316.
Gutmann, B., Thon, I., Kimmig, A., Bruynooghe, M., and De Raedt, L. 2011. The magic of logical inference in probabilistic programming. Theory and Practice of Logic Programming 11, 4-5, 663–680.
Inoue, K., Ribeiro, T., and Sakama, C. 2014. Learning from interpretation transition. Machine Learning 94, 1, 51–79.
Keller, T. and Eyerich, P. 2012. PROST: Probabilistic planning based on UCT. In Proceedings of the International Conference on Automated Planning and Scheduling. 119–127.
Kifer, M. and Subrahmanian, V. 1992. Theory of generalized annotated logic programming and its applications.
Journal of Logic Programming 12, 4, 335–367.
Kolobov, A., Mausam, and Weld, D. S. 2012. LRTDP versus UCT for online probabilistic planning. In Proceedings of the AAAI Conference on Artificial Intelligence.
Kowalski, R. and Sergot, M. 1989. A logic-based calculus of events. In Foundations of Knowledge Base Management. Springer, 23–55.
Martínez, D., Alenyà, G., and Torras, C. 2015. Planning robot manipulation to clean planar surfaces. Engineering Applications of Artificial Intelligence 39, 23–32.
Mourao, K., Zettlemoyer, L. S., Petrick, R., and Steedman, M. 2012. Learning STRIPS operators from noisy and incomplete observations. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. 614–623.
Moyle, S. 2003. Using theory completion to learn a robot navigation control program. In Inductive Logic Programming. Springer, 182–197.
Pasula, H. M., Zettlemoyer, L. S., and Kaelbling, L. P. 2007. Learning symbolic models of stochastic domains. Journal of Artificial Intelligence Research 29, 1, 309–352.
Ribeiro, T. and Inoue, K. 2014. Learning prime implicant conditions from interpretation transition. In The 24th International Conference on Inductive Logic Programming. To appear (long paper) (http://tony.research.free.fr/paper/ILP2014long).
Rodrigues, C., Gérard, P., Rouveirol, C., and Soldano, H. 2012. Active learning of relational action models. In Inductive Logic Programming. Springer, 302–316.
Sanner, S. 2010. Relational Dynamic Influence Diagram Language (RDDL): Language description. Unpublished ms., Australian National University.
Shanahan, M. 2000. An abductive event calculus planner. The Journal of Logic Programming 44, 1, 207–240.
Surovik, D. A. and Scheeres, D. J. 2015. Heuristic search and receding-horizon planning in complex spacecraft orbit domains. In Proceedings of the International Conference on Automated Planning and Scheduling.
Sykes, D., Corapi, D., Magee, J., Kramer, J., Russo, A., and Inoue, K. 2013. Learning revised models for planning in adaptive systems. In Proceedings of the International Conference on Software Engineering. 63–71.
Van Emden, M. H. 1986. Quantitative deduction and its fixpoint theory. The Journal of Logic Programming 3, 1, 37–53.
Younes, H. L. and Littman, M. L. 2004. PPDDL1.0: An extension to PDDL for expressing planning domains with probabilistic effects. Technical Report CMU-CS-04-162.
Zhuo, H. H., Yang, Q., Hu, D. H., and Li, L. 2010. Learning complex action models with quantifiers and logical implications. Artificial Intelligence 174, 18, 1540–1569.