=Paper=
{{Paper
|id=Vol-1433/tc_30
|storemode=property
|title=Learning Probabilistic Action Models from Interpretation Transitions
|pdfUrl=https://ceur-ws.org/Vol-1433/tc_30.pdf
|volume=Vol-1433
|dblpUrl=https://dblp.org/rec/conf/iclp/0004RIAT15
}}
==Learning Probabilistic Action Models from Interpretation Transitions==
Technical Communications of ICLP 2015. Copyright with the Authors.

Learning Probabilistic Action Models from Interpretation Transitions

David Martínez, Institut de Robòtica i Informàtica Industrial (CSIC-UPC) (e-mail: dmartinez@iri.upc.edu)
Tony Ribeiro, The Graduate University for Advanced Studies (Sokendai) (e-mail: tony_ribeiro@nii.ac.jp)
Katsumi Inoue, National Institute of Informatics (e-mail: inoue@nii.ac.jp)
Guillem Alenyà, Carme Torras, Institut de Robòtica i Informàtica Industrial (CSIC-UPC) (e-mail: {galenya,torras}@iri.upc.edu)

submitted 29 April 2015; accepted 5 June 2015

Abstract

There have been great advances in the probabilistic planning community during recent years, and planners can now provide solutions for very complex probabilistic tasks. However, planners require a model that represents the dynamics of the system, and in general these models are built by hand. In this paper, we present a framework to automatically infer probabilistic models from observations of the state transitions of a dynamic system. We propose an extension of previous works that perform learning from interpretation transitions. These works take as input a set of state transitions and build a logic program that realizes the given transition relations. Here we extend this method to learn a compact set of probabilistic planning operators that capture probabilistic dynamics. Finally, we provide experimental validation of the quality of the learned models.

KEYWORDS: inductive logic programming, learning from interpretation transitions, probabilistic planning operators, action model learning, probabilistic planning

1 Introduction

Lately the performance of probabilistic task planners has improved to the point where they can find solutions in many complex scenarios (Kolobov et al. 2012; Keller and Eyerich 2012). Planners have been applied successfully in several fields, such as robotics (Chanel et al. 2014; Martínez et al. 2015) and aerospace (Surovik and Scheeres 2015). However, in most cases an expert is required to define the action model to be used by the planner. Creating a model is an error-prone task that requires in-depth knowledge about the problem and the planner, as well as extensive testing to assure the correctness of the model. We propose a new method to automatically learn action models from a set of given state transitions.

Most previous approaches have only tackled the problem of learning deterministic action models (Zhuo et al. 2010), including uncertainty in the perceptions (Mourao et al. 2012). Regarding logic programming, abductive action learning has been studied based on abductive event calculus (Eshghi 1988), an abductive extension of event calculus, and has been extended for applications to planning, e.g., (Shanahan 2000). (Moyle 2003) uses an ILP technique to learn a causal theory based on event calculus (Kowalski and Sergot 1989), given examples of input-output relations. Probabilistic logic programs that maximize the probabilities of observations are learned by (Corapi et al. 2011; Sykes et al. 2013). They use parameter estimation to find the probabilities associated with each atom and rule, but this requires manually coding the restrictions for the initial set of rule candidates. Action models with uncertain effects have also been learned by (Rodrigues et al. 2012; Pasula et al. 2007) in STRIPS-like languages.
Their optimization looks for planning operators that define as many effects as possible at once, which makes it a better choice for precondition-effect based PPDDL domains (Younes and Littman 2004), such as the domains in past planning competitions (IPPC 2004-2008). In contrast, our approach finds planning operators for each single action effect (i.e. rule head), which makes it a better choice to learn variable-based RDDL domains (Sanner 2010) used in newer planning competitions (IPPC 2011-2014).

Our approach learns an action model encoded as a set of planning operators. Each operator describes how the value of a predicate changes based on a set of preconditions, given that an action is executed. We present a novel method to learn at two levels, as shown in Fig. 1. On the right, the Learning From Interpretation Transitions (LFIT) framework (Inoue et al. 2014; Ribeiro and Inoue 2014) is used to generate rules. Given a set of state transitions of a system, it can learn a normal logic program that captures the system dynamics. The resulting rules are all the candidates that have to be considered to select the best planning operators. On the left, the data is generalized between different objects by using a relational representation, and an optimization method is used to select the best subsets of planning operators that explain the input transitions while maintaining generality. The dependency relations between the planning operators are used to efficiently select the best candidates. Our method is designed to learn RDDL-like (Sanner 2010) operators, where each variable is updated separately based on a set of preconditions. To that end, in this paper we present the following novel improvements:

• The LUST algorithm: an extension of the LFIT framework to learn probabilistic dynamics from uncertain state transitions. The new algorithm that we propose can construct a model of a non-deterministic system by learning probabilistic rules.
• LUST also integrates multi-valued variables, which allows us to represent actions more efficiently, since every transition involves exactly one action out of the many possible actions.
• The integration of logic programming to efficiently limit the number of rule candidates, with a relational representation that provides better generalization, and an optimization method to select the best subsets of planning operators.

The paper is organized as follows. First we introduce some background required to understand the logic rule learning, and the algorithms proposed to that end. Afterwards, the action model and the translations needed between logic programs and planning domains are explained. Finally, we show experimental results and conclusions of the presented approach.

Fig. 1. Overview of the learning framework. The modules that interact with planning data and operators are in the blue rectangle on the left, while the modules related to logic programming are included inside the red rectangle on the right. The input of the method is a set of grounded transitions, which are then converted to symbolic transitions to generalize between different objects. To obtain rule candidates, LFIT requires that the symbolic transitions are represented with propositional atoms.
Finally, after rule candidates are obtained, they are transformed to planning operators to select the best set of operators.

2 Preliminaries

In this section we recall some preliminaries of logic programming. We also explain the basis of learning from interpretation transitions, in order to make it easier to understand its extension to multi-valued variables and probabilistic rule learning.

2.1 Logic Programming

We consider a first-order language and denote the Herbrand base (the set of all ground atoms) as B. A (normal) logic program (NLP) is a set of rules of the form

A ← A_1 ∧ · · · ∧ A_m ∧ ¬A_{m+1} ∧ · · · ∧ ¬A_n   (1)

where A and the A_i are atoms (n ≥ m ≥ 0). For any rule R of the form (1), the atom A is called the head of R and is denoted as h(R), and the conjunction to the right of ← is called the body of R. An (Herbrand) interpretation I is a subset of B. For a logic program P and an Herbrand interpretation I, the immediate consequence operator (or T_P operator) (Apt et al. 1988) is the mapping T_P : 2^B → 2^B:

T_P(I) = { h(R) | R ∈ ground(P), b^+(R) ⊆ I, b^-(R) ∩ I = ∅ }.   (2)

2.2 Learning from Interpretation Transitions

LF1T (Inoue et al. 2014) is an anytime algorithm that takes a set of one-step state transitions E as input. These one-step state transitions can be considered as positive examples. From these transitions, the algorithm learns a logic program P that represents the dynamics of E. To perform this learning process, we can iteratively consider one-step transitions. In LF1T, the set of all atoms B is assumed to be finite. In the input E, a state transition is represented by a pair of interpretations. The output of LF1T is a logic program that realizes all state transitions of E.

Learning from 1-Step Transitions (LF1T)
Input: E, a set of state transitions (I, J) of a system S, and B, the set of all possible atoms that can appear in I and J.
Output: A logic program P such that J = next(I) holds for any (I, J) ∈ E.

To build a logic program with LF1T, in (Ribeiro and Inoue 2014) we used a bottom-up method that generates hypotheses by specialization from the most general rules, that is, fact rules, until the logic program is consistent with all input state transitions. Learning by specialization ensures that the most general valid hypothesis is output.

2.3 Learning Multi-valued Logic Programs

Research in multi-valued logic programming has proceeded along three different directions (Kifer and Subrahmanian 1992): bilattice-based logics (Fitting 1991; Ginsberg 1988), quantitative rule sets (Van Emden 1986) and annotated logics (Blair and Subrahmanian 1989; Blair and Subrahmanian 1988). The multi-valued logic representation used in our new algorithm is based on annotated logics. Here, to each atom corresponds a given set of values. In a rule, a literal is an atom annotated with one of these values. This allows us to represent annotated atoms simply as classical atoms, and thus to remain in the normal logic program semantics. In order to represent multi-valued variables, we now restrict all atoms of a logic program to the form var^val. The intuition behind this form is that var represents some variable of the system and val represents the value of this variable. In annotated logics, the atom var is said to be annotated by the constant val.
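To make this representation concrete, the following is a minimal sketch (our own, not taken from the paper's implementation) of one way annotated atoms var^val and multi-valued interpretations could be encoded in Python; the names Atom, Interpretation and is_valid_interpretation are hypothetical and anticipate Definition 1 below.

```python
# A minimal sketch (assumption: atoms encoded as (variable, value) string pairs).
from typing import Dict, Set, Tuple

Atom = Tuple[str, str]          # an annotated atom var^val, e.g. ("a", "1")
Interpretation = Set[Atom]      # a set of annotated atoms

def is_valid_interpretation(i: Interpretation, variables: Set[str]) -> bool:
    """Intuition behind Definition 1: every variable of the system
    appears in I with exactly one value."""
    values: Dict[str, Set[str]] = {}
    for var, val in i:
        values.setdefault(var, set()).add(val)
    return all(len(values.get(var, set())) == 1 for var in variables)

# Example: a system with variables a and b, each holding one value.
state = {("a", "1"), ("b", "0")}
print(is_valid_interpretation(state, {"a", "b"}))                  # True
print(is_valid_interpretation(state | {("a", "2")}, {"a", "b"}))   # False: a has two values
```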
We consider a multi-valued logic program as a set of rules of the form

var^val ← var_1^val_1 ∧ · · · ∧ var_n^val_n   (3)

where var^val and the var_i^val_i are atoms (n ≥ 1). As before, for any rule R of the form (3), the left part of ← is called the head of R and is denoted as h(R), and the conjunction to the right of ← is called the body of R. We represent the set of literals in the body of R of the form (3) as b(R) = {var_1^val_1, . . . , var_n^val_n}. A rule R of the form (3) is interpreted as follows: the variable var takes the value val in the next state if all variables var_i have the value val_i in the current state. An interpretation of a multi-valued program provides the value of each variable of the system and is defined as follows.

Definition 1 (Multi-valued Interpretation)
Let B be a set of atoms where each element has the form var^val. An interpretation I of a set of atoms B is a subset of B where ∀ var^val ∈ B, ∃ var^val' ∈ I, and ∀ var^val' ∈ I, ∄ var^val'' ∈ I with val' ≠ val''.

For a system S represented by a multi-valued logic program P and a state s_1 represented by an interpretation I, the successor of s_1 is represented by the interpretation

next(I) = {h(R) | R ∈ P, b(R) ⊆ I}.

The state transitions of a logic program P are represented by a set of pairs of interpretations (I, next(I)).

Definition 2 (Multi-valued Consistency)
Let R be a rule and (I, J) be a state transition. R is consistent with (I, J) iff b(R) ⊆ I implies h(R) ∈ J. Let E be a set of state transitions; R is consistent with E if R is consistent with all state transitions of E. A logic program P is consistent with E if all rules of P are consistent with E.

Definition 3 (Subsumption)
Let R_1 and R_2 be two rules. If h(R_1) = h(R_2) and b(R_1) ⊆ b(R_2) then R_1 subsumes R_2. Let P be a logic program and R be a rule. P subsumes R if there exists a rule R' ∈ P that subsumes R. We say that a rule R_1 is more general than another rule R_2 if R_1 subsumes R_2. In particular, a rule R is most general if there is no rule R' (≠ R) that subsumes R, that is, if b(R) = ∅.

Example 1
Let R_1 and R_2 be the two following rules: R_1 = (a^1 ← b^1), R_2 = (a^1 ← a^0 ∧ b^1). R_1 subsumes R_2 because (b(R_1) = {b^1}) ⊂ (b(R_2) = {a^0, b^1}). When R_1 appears in a logic program P, R_2 is useless for P, because whenever R_2 can be applied, R_1 can be applied.

To learn multi-valued logic programs with LF1T we need to adapt the ground resolution of (Inoue et al. 2014) and the least specialization of (Ribeiro and Inoue 2014) to handle non-boolean variables.

Definition 4 (Complement)
Let R_1 and R_2 be two rules. R_2 is a complement of R_1 on var^val if var^val ∈ b(R_1), var^val' ∈ b(R_2), val ≠ val', and (b(R_2) \ {var^val'}) ⊆ (b(R_1) \ {var^val}).

Definition 5 (Multi-valued ground resolution)
Let R be a rule, P be a logic program and B be a set of atoms. R can be generalized on var^val if ∀ var^val' ∈ B with val ≠ val', ∃ R' ∈ P such that R' is a complement of R on var^val:

generalise(R, P) = h(R) ← b(R) \ {var^val}.

Definition 6 (Multi-valued least specialization)
Let R_1 and R_2 be two rules such that h(R_1) = h(R_2) and R_1 subsumes R_2. Let B be a set of atoms. The least specialization ls(R_1, R_2, B) of R_1 over R_2 w.r.t. B is

ls(R_1, R_2, B) = {h(R_1) ← b(R_1) ∧ var^val' | var^val ∈ b(R_2) \ b(R_1), var^val' ∈ B, val' ≠ val}.

Least specialization can be used on a rule R to avoid the subsumption of another rule with a minimal reduction of the generality of R.
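As an illustration of Definitions 3 and 6, here is a minimal Python sketch (our own; the Rule encoding and function names are assumptions, reusing the (variable, value) atom encoding above) of rule subsumption and multi-valued least specialization.

```python
# Sketch only: rules as (head, frozen body) pairs over (variable, value) atoms.
from typing import FrozenSet, Set, Tuple

Atom = Tuple[str, str]                     # annotated atom (variable, value)
Rule = Tuple[Atom, FrozenSet[Atom]]        # (head, body)

def subsumes(r1: Rule, r2: Rule) -> bool:
    """Definition 3: r1 subsumes r2 iff same head and body(r1) ⊆ body(r2)."""
    return r1[0] == r2[0] and r1[1] <= r2[1]

def least_specialization(r1: Rule, r2: Rule, atoms: Set[Atom]) -> Set[Rule]:
    """Definition 6: specialize r1 just enough so it no longer subsumes r2.
    For each atom var^val in body(r2) \\ body(r1), add to body(r1) every atom
    var^val' of the atom base with the same variable and val' != val."""
    head, body1 = r1
    _, body2 = r2
    specialized: Set[Rule] = set()
    for var, val in body2 - body1:
        for var2, val2 in atoms:
            if var2 == var and val2 != val:
                specialized.add((head, body1 | {(var2, val2)}))
    return specialized

# Example 1 revisited: R1 = (a^1 <- b^1) subsumes R2 = (a^1 <- a^0 ∧ b^1).
B = {("a", "0"), ("a", "1"), ("b", "0"), ("b", "1")}
R1: Rule = (("a", "1"), frozenset({("b", "1")}))
R2: Rule = (("a", "1"), frozenset({("a", "0"), ("b", "1")}))
print(subsumes(R1, R2))                    # True
for r in least_specialization(R1, R2, B):
    print(r)                               # the single specialization a^1 <- b^1 ∧ a^1
```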
By extension, least specialization can be used on the rules of a logic program P to avoid the subsumption of a rule with a minimal reduction of the generality of P. Let P be a logic program, B be a set of atoms, R be a rule, and S be the set of all rules of P that subsume R. The least specialization ls(P, R, B) of P by R w.r.t. B is as follows:

ls(P, R, B) = (P \ S) ∪ ⋃_{R_P ∈ S} ls(R_P, R, B).

LF1T starts with an initial logic program P = {var^val ← | var^val ∈ B}. Then LF1T iteratively analyzes each transition (I, J) ∈ E. For each labeled atom A that does not appear in J, LF1T infers an anti-rule R_A^I:

R_A^I = A ← ⋀_{B_i ∈ I} B_i.

A rule of P that subsumes such an anti-rule is not consistent with the transitions of E and must be revised. The idea is to use least specialization to make P consistent with the new transition (I, J) by avoiding the subsumption of all anti-rules R_A^I inferred from (I, J). After least specialization, P becomes consistent with the new transition while remaining consistent with all previously analyzed transitions (Theorem 3 of (Ribeiro and Inoue 2014)). When all transitions of E have been analyzed, LF1T outputs the rules of the system that realize E.

3 Learning from Uncertain State Transitions

In this section we extend the LFIT framework to learn probabilistic dynamics from uncertain state transitions. Where other works like (Gutmann et al. 2011) perform inference from a probabilistic logic program, what we do is infer the rules of such a logic program. The programs inferred by our new algorithm are similar to the paraconsistent logic programs of (Blair and Subrahmanian 1988). The use of annotated atoms allows the learned programs to induce multiple values for the same represented variable. This allows us to represent multi-valued models and capture non-deterministic state transitions. Our semantics also differs from previous work like (Fierens et al. 2013). There, the authors consider probabilistic logic programs as logic programs in which some of the facts are annotated with probabilities. In our method, it is the rules that have probabilities, and they are independent.

3.1 Formalization

A non-deterministic system can be represented by a set of logic programs where the rules have the following form:

R = value(var, val, i, j) ← var_1^val_1 ∧ · · · ∧ var_n^val_n

where var^val and the var_i^val_i are atoms (n ≥ 1), value(var, val, i, j) is the head of R, again denoted h(R), and i, j are natural numbers with i ≤ j. Let I be a multi-valued interpretation. R means that i times over j, var takes the value val in the successor of I if b(R) ⊆ I.

Definition 7 (Non-deterministic successors)
Let I be the multi-valued interpretation of a state of a non-deterministic system S represented by a set of logic programs P. Let P' be a logic program; one of the successors of I according to P' is

next(I, P') = {var^val | R ∈ P', b(R) ⊆ I, h(R) = value(var, val, i, j), 0 < i ≤ j}.

The set of successors of I in S according to P is

successor(I, P) = {J | J = next(I, P_i), P_i ∈ P}.

Example 2
Let R_1 and R_2 be two rules such that R_1 = (value(a, 1, 10, 100) ← a^1) and R_2 = (value(a, 2, 90, 100) ← a^1). Let S be a probabilistic system represented by a set of logic programs P such that P = {{R_1}, {R_2}}, and let I be an interpretation that contains a^1. The possible next states of I in S are successor(I, P) = {{a^1}, {a^2}}. The likelihood of having a^1 in the next state of I is 10% and that of having a^2 is 90%.
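As a rough illustration of Definition 7 and Example 2, the sketch below (our own encoding; names are hypothetical and not the paper's code) represents probabilistic rules value(var, val, i, j) and enumerates the successors of an interpretation under a set of deterministic programs.

```python
# Sketch only: probabilistic rules as (var, val, i, j, body) tuples.
from typing import FrozenSet, List, Set, Tuple

Atom = Tuple[str, str]                                  # annotated atom (variable, value)
ProbRule = Tuple[str, str, int, int, FrozenSet[Atom]]   # (var, val, i, j, body)

def next_state(i: Set[Atom], program: List[ProbRule]) -> FrozenSet[Atom]:
    """next(I, P'): heads of rules whose body holds in I (with 0 < i <= j)."""
    return frozenset((var, val) for var, val, hits, total, body in program
                     if 0 < hits <= total and body <= i)

def successors(i: Set[Atom], programs: List[List[ProbRule]]) -> Set[FrozenSet[Atom]]:
    """successor(I, P): one possible successor per deterministic program."""
    return {next_state(i, p) for p in programs}

# Example 2: R1 = value(a,1,10,100) <- a^1 and R2 = value(a,2,90,100) <- a^1.
R1: ProbRule = ("a", "1", 10, 100, frozenset({("a", "1")}))
R2: ProbRule = ("a", "2", 90, 100, frozenset({("a", "1")}))
P = [[R1], [R2]]
I = {("a", "1")}
print(successors(I, P))   # {frozenset({('a','1')}), frozenset({('a','2')})}
# a^1 holds in the next state 10 times out of 100 (10%); a^2, 90 times out of 100 (90%).
```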
3.2 Algorithm

We now present LUST, an extension of LFIT for Learning from Uncertain State Transitions. LUST learns a set of deterministic logic programs. The main idea is that when two transitions are not consistent, we need two different programs to realize them: the first program realizes the first transition and the second one realizes the second transition. The algorithm outputs a set of logic programs such that every transition given as input is realized by at least one of those programs. The learned rules also provide the probability of the variable values in the next state. The probability of each rule R = value(var, val, i, j) ← b(R) is simply obtained by counting how many transitions (I, J) it realizes (when b(R) ⊆ I and var^val ∈ J), represented by i, over how many transitions it matches (when b(R) ⊆ I), represented by j.

LUST algorithm:
• Input: a set of pairs of interpretations E and a set of atoms B.
• Step 1: Initialize a set of logic programs P with one program P_1 containing fact rules for each atom of B.
• Step 2: Pick (I, J) in E and check the consistency of (I, J) with all programs of P. If there is no logic program in P that realizes (I, J):
— copy one of the logic programs P_i into a program P_i' and add rules to P_i' to realize (I, J);
— use full ground resolution to generalize P_i'.
• Step 3: Revise all logic programs that realize (I, J) by using least specialization.
• Step 4: If there is a remaining transition in E, go to Step 2.
• Step 5: Compute the probability of each rule of all programs P_i according to E.
• Output: P, a set of multi-valued logic programs that realizes E.

4 Integration of Logic Programming and Planning Domains

In this section we describe the formalization used to represent planning operators, as well as the data conversions needed to learn symbolic operators from grounded transitions while using propositional logic programming.

4.1 Planning Model

Although LUST uses a propositional representation, the planning model uses a relational representation to provide better generalization between different states. Relational domains represent the state structure and objects explicitly. These domains are described by using a vocabulary of predicates P and actions A, and a set of objects C_π. Predicates and actions take objects as arguments to define their ground counterparts.

Example 3
Let on(X,Y) be a symbolic predicate, and {box1, box2, box3} be a set of objects. Three possible groundings of the symbolic predicate on(X,Y) with the given objects are on(box1, box2), on(box1, box3) and on(box3, box1).

A ground predicate or action is equivalent to an atom. In a planning domain, a planning state s is a set of ground predicates s = {p_1, p_2, ..., p_n}, which is equivalent to an Herbrand interpretation. Our planning operators represent a subset of RDDL domains (Sanner 2010). For each variable, we define the probability that it will become true in the next state based on a set of preconditions. In contrast to the full RDDL specification, these preconditions can only consist of an action and a set of predicates. Thus we define a planning operator as a tuple o = ⟨o_p^val, o_act, o_prec, o_prob⟩ where:
• o_p^val is a predicate p whose value can change to val by applying the operator. It is equivalent to the head of a logic rule.
• o_act is the action that has to be executed for the operator to be applicable.
• o_prec is a set of predicates that have to be satisfied in the current state so that the planning operator is applicable.
It is equivalent to the body of a logic rule.
• o_prob is the probability that p^val will be true.

4.2 Data Representation

To have a more general and compact model, we use a relational representation at the planning level. The input of our method consists of state transitions, which are tuples t = ⟨s, act, s'⟩ where s and s' are the states before and after executing the action act. The states consist of sets of ground predicates, and act is a grounded action. On the other hand, the output is a set of symbolic (i.e. non-grounded) planning operators. Therefore, our method transforms the initial grounded data into symbolic planning operators. Moreover, LUST works with propositional atoms, so a transformation from symbolic predicates to atoms and back to symbolic predicates is also needed. Figure 1 shows the needed data conversions, which are explained in more detail below.

Transform grounded transitions to symbolic transitions:
• Input: a set of grounded transitions T = [t_1, t_2, ..., t_n].
• For each transition t = ⟨s, act, s'⟩:
— Take every argument χ of the action act.
— Substitute χ with a default symbolic parameter in s, act, and s'.
— Create a new transition t' with the symbolic predicates in s and s' and the symbolic action act.
— Add the new transition t' to T'.
• Output: a set of symbolic transitions T'.

Transform a symbolic transition to an interpretation transition:
• Input: a symbolic transition t = ⟨s, act, s'⟩.
• Assign an atom to each predicate in s, act and s'.
• I = the atoms that correspond to s.
• Add act as an atom in I that represents the action.
• J = the atoms that correspond to s'.
• Output: an interpretation transition (I, J).

To transform a planning symbolic transition to an interpretation transition, a labeled atom that encodes the action is added to the body of the interpretation transition. As each transition has exactly one action, a multi-valued variable represents it more efficiently than booleans; otherwise every action would have to be represented as a different variable with only one being true. Moreover, each symbolic predicate value is represented by one labeled atom. After the logic programs are generated, the labeled atoms are translated back to symbolic predicates by using the same conversion.

Transform a logic program to a set of planning operators:
• Input: a logic program P.
• For every rule R ∈ P such that h(R) = value(var, val, i, j), a planning operator o is created so that:
— o_p^val = var^val.
— o_act = the action in the atoms of b(R).
— o_prec = the set of atoms in b(R) that represent predicates.
— o_prob = i/j.
— Add o to O.
• Output: a set of planning operators O.

5 Selecting the Set of Planning Operators

In this section we present the method to select the best subset of probabilistic planning operators by using the set of logic programs generated by LF1T. First, the requirements that the planning operators have to satisfy are presented. Afterwards, we explain the preferences used to decide which are the best planning operators. Finally, the method to obtain the desired planning operators is described.

5.1 Planning Operator Requirements

Probabilistic planners require that only one planning operator can be applied in each state-action pair. Therefore the model has to be defined with a set of non-conflicting operators, so the planner can always decide which operator to apply for each state-action pair.
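Definition 8 below formalizes when two operators conflict. As a rough sketch (our own simplification, not the paper's code; the Operator fields and example names are hypothetical), two operators with the same action and effect but different probabilities conflict when their precondition sets can hold in the same state, i.e. they never require different values for a shared variable.

```python
# Sketch only: planning operators as a small NamedTuple, conflict check per Definition 8.
from typing import FrozenSet, NamedTuple, Tuple

Atom = Tuple[str, str]   # predicate/value pair, e.g. ("flattire", "true")

class Operator(NamedTuple):
    effect: Atom                      # o_p^val: the predicate value set by the operator
    action: str                       # o_act
    preconditions: FrozenSet[Atom]    # o_prec
    probability: float                # o_prob

def compatible(prec1: FrozenSet[Atom], prec2: FrozenSet[Atom]) -> bool:
    """True if some state can satisfy both precondition sets, i.e. the two
    sets never assign different values to the same variable."""
    vals1 = dict(prec1)
    return all(vals1.get(var, val) == val for var, val in prec2)

def in_conflict(o1: Operator, o2: Operator) -> bool:
    """Approximation of Definition 8: same action and effect, different
    probabilities, and jointly satisfiable preconditions."""
    return (o1.action == o2.action and o1.effect == o2.effect
            and o1.probability != o2.probability
            and compatible(o1.preconditions, o2.preconditions))

# Two hypothetical operators for the same action and effect:
o1 = Operator(("flattire", "true"), "movecar", frozenset({("road", "true")}), 0.4)
o2 = Operator(("flattire", "true"), "movecar", frozenset(), 0.1)
print(in_conflict(o1, o2))   # True: both can apply in a state where road=true
```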
Definition 8 (Conflicting planning operators)
Let o_1 and o_2 be two planning operators that represent the same action (o_1,act = o_2,act) and change the same predicate (o_1,p^val = o_2,p^val) with different probabilities (o_1,prob ≠ o_2,prob). A conflict exists between both planning operators if ∃ s such that o_1,prec ⊂ s and o_2,prec ⊂ s.

5.2 Score Function

LUST provides the minimal set of rules that describe all possible transitions. However, the best subset of non-conflicting planning operators has to be selected to create the model. To decide which is the best set of operators, the algorithm uses a score function. The algorithm prefers operators with a high likelihood (i.e. that can successfully explain the state transitions), but it also has a regularization term to avoid overfitting to the training data. The regularization is based on the number of planning operators and the number of preconditions in those operators, so that general operators are preferred when the likelihoods are similar. This regularization penalty is larger when there are few training transitions to estimate each operator, as general operators are preferred to poorly estimated specific ones. On the other hand, the regularization penalty decreases as our estimate improves with more transitions, because the method can be more confident about the operators. The following functions are used to define the score function.

• The likelihood P(t|O) = P(s'|s, act, o), where o ∈ O, o_prec ⊆ s and o_act = act, is the probability that the transition t is covered by the set of planning operators O.
• The penalty term Pen(O) = |O| + (1/|O|) Σ_{o∈O} |o_prec| is the number of planning operators plus the average number of preconditions that they have.
• The confidence in a planning operator Conf(O, T) is bounded by using Hoeffding's inequality. The probability that our estimate ô_prob is accurate enough, |ô_prob − o_prob| ≤ ε, with a number of samples |T| is bounded by Conf(O, T) ≤ 1 − 2e^(−2ε²|T|).

Finally, the proposed score function is defined as

S(O, T) = (1/|T|) Σ_{t∈T} P(t|O) − α · Pen(O) / Conf(O, T)   (4)

where α is a scaling parameter for the penalty term. Also note that Conf(O, T) ≃ 1 when the number of input transitions is large, so the penalty term will always be present to ensure that general operators are preferred.

5.3 Selecting the Best Planning Operator Subset

In this section we present the algorithm used to select the subset of non-conflicting planning operators that maximizes the score function. To do it efficiently we make use of the dependency relations between operators by using the subsumption relation graph defined below.

Definition 9 (Planning operator subsumption relation)
Let o_1 and o_2 be two planning operators that represent the same action (o_1,act = o_2,act) and the same effect (o_1,p^val = o_2,p^val). o_1 subsumes the operator o_2 if o_1,prec ⊂ o_2,prec.

Fig. 2. Example of a subsumption graph. In each node, the planning operator preconditions are shown. Leaves are the nodes painted in blue.

Definition 10 (Subsumption graph)
The subsumption graph G_O of a set of planning operators O = {o_1, ..., o_n} is a directed graph with arcs (o_i, o_j) when o_i subsumes o_j and |o_j,prec| − |o_i,prec| = 1.

Figure 2 shows an example of a subsumption graph. We will call leaves L(G_O) of the graph all nodes that do not have a child.
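As a rough sketch (our own; the reduced operator encoding and function names are assumptions, not the paper's code), the subsumption graph of Definition 10 and its leaves could be computed as follows.

```python
# Sketch only: operators reduced to (action, effect, preconditions), which is
# all the subsumption relation of Definition 9 looks at.
from typing import Dict, FrozenSet, List, Set, Tuple

Atom = Tuple[str, str]
Operator = Tuple[str, Atom, FrozenSet[Atom]]   # (action, effect, preconditions)

def subsumes_op(o1: Operator, o2: Operator) -> bool:
    """Definition 9: same action and effect, and prec(o1) ⊂ prec(o2)."""
    return o1[0] == o2[0] and o1[1] == o2[1] and o1[2] < o2[2]

def subsumption_graph(ops: List[Operator]) -> Dict[int, List[int]]:
    """Definition 10: arc (i, j) when ops[i] subsumes ops[j] and the
    precondition sets differ by exactly one predicate."""
    graph: Dict[int, List[int]] = {i: [] for i in range(len(ops))}
    for i, oi in enumerate(ops):
        for j, oj in enumerate(ops):
            if subsumes_op(oi, oj) and len(oj[2]) - len(oi[2]) == 1:
                graph[i].append(j)
    return graph

def leaves(graph: Dict[int, List[int]]) -> Set[int]:
    """L(G_O): nodes without children."""
    return {node for node, children in graph.items() if not children}
```

The greedy selection procedure described next would operate on such leaves, repeatedly scoring non-conflicting leaf subsets and replacing operators by their parents when that improves the score.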
The subsumption graph orders the rules in levels that represent the generality of the operators: the fewer predicates in the preconditions, the more general the operator is. This graph provides two advantages: generalizing operators is easier, as we only have to follow the arcs, and it reduces the number of conflicts to check, because general operators are not checked if other more specific operators that represent the same dynamics exist. Below, a greedy approach to efficiently obtain a set of good planning operators is provided.

Select set of planning operators:
• Create the subsumption graph.
• Repeat until nothing changes:
— Select the graph leaves, and choose the subset of non-conflicting leaves with the highest score. Remove the other leaves.
— Check if replacing any operator by a more general one (i.e. the operator's parents in the subsumption graph) improves the score. If so, remove the specific operator.
• The result is the set of leaves in the remaining graph.

6 Evaluation

In this section we provide an experimental evaluation of our approach by learning two domains of the 2014 International Probabilistic Planning Competition (IPPC). The experiments use transitions t = ⟨s, act, s'⟩ generated by randomly constructing a state s, randomly picking the arguments of the action act, and then applying the action to generate the state s'. The distribution of samples is biased to guarantee that half of the samples have a chance to change the state. To measure the quality of the learned models, the errors shown represent the differences between the learned operators and the ground truth ones. For each incorrect predicate in an operator's preconditions, the error count was increased by 1.

Fig. 3. Results of learning the Triangle Tireworld domain (left plot: ChangeTire, LoadTire, MoveCar) and the Elevators domain (right plot: closeDoor, openDoorGoingUp/Down, moveCurrentDir) from IPPC 2014. Each plot shows the errors against the number of training transitions per action. The results shown are the means and standard deviations obtained from 10 runs.

The left plot in Fig. 3 shows the results obtained in the Triangle Tireworld domain. In this domain, a car has to move to its destination, but it has a probability of getting a flat tire while it moves. The domain has 3 actions: a “Change Tire” action that has only one precondition, a “Load Tire” action that has 2 preconditions to change one predicate, and a “MoveCar” action that changes 3 predicates based on 3 different preconditions. The results show that easy actions can be learned with just a few transitions, while very complex actions require 100 transitions until average errors below 1 are obtained.

The right plot in Fig. 3 shows the results obtained in the Elevators domain. In this domain, we are only learning the actions that interact with the elevators, and not the dynamic effects related to people using them. The “openDoorGoing*” and “closeDoor” actions are easy to learn, but the “moveCurrentDir” action has two effects with 3 different preconditions each, and another two effects with 4 preconditions. Therefore, to successfully learn the dynamics of “moveCurrentDir” a large number of input transitions is required.

7 Conclusions

We have presented a new approach to learn planning operators.
In contrast to previous approaches, it learns the dynamics of every predicate independently, and thus can be applied to learn RDDL domains. The presented approach can find good sets of planning operators efficiently, as it combines logic programming to restrict the set of candidates, and then an optimization method to select a good subset of operators (i.e. a set that generalizes as much as possible while explaining the input transitions well enough). To that end, the LFIT algorithm has been extended to learn probabilistic dynamics and multi-valued variables. Finally, experimental validation is provided to show that the planning operators can be learned with a reduced number of transitions.

As future work, we would like to work on planning operators independent of actions. This would require a more intelligent method to generate symbolic transitions from grounded transitions, so that changes not represented by actions could be learned. Moreover, optimal solutions for selecting the best subset of planning operators should be analyzed.

8 Acknowledgments

This research is supported in part by the JSPS 2014-2017 Grants-in-Aid for Scientific Research (B) No. 26540122 and 2014-2015 Challenging Exploratory Research No. 26280092. D. Martínez is also supported by the Spanish Ministry of Education, Culture and Sport via an FPU doctoral grant (FPU12-04173).

References

Apt, K. R., Blair, H. A., and Walker, A. 1988. Towards a theory of declarative knowledge. Foundations of Deductive Databases and Logic Programming, 89.
Blair, H. A. and Subrahmanian, V. 1988. Paraconsistent foundations for logic programming. Journal of Non-Classical Logic 5, 2, 45–73.
Blair, H. A. and Subrahmanian, V. 1989. Paraconsistent logic programming. Theoretical Computer Science 68, 2, 135–154.
Chanel, C. P. C., Lesire, C., and Teichteil-Königsbuch, F. 2014. A robotic execution framework for online probabilistic (re)planning. In Twenty-Fourth International Conference on Automated Planning and Scheduling. 454–462.
Corapi, D., Sykes, D., Inoue, K., and Russo, A. 2011. Probabilistic rule learning in non-monotonic domains. In Computational Logic in Multi-Agent Systems. Springer, 243–258.
Eshghi, K. 1988. Abductive planning with event calculus. In ICLP/SLP. 562–579.
Fierens, D., Van den Broeck, G., Renkens, J., Shterionov, D., Gutmann, B., Thon, I., Janssens, G., and De Raedt, L. 2013. Inference and learning in probabilistic logic programs using weighted boolean formulas. Theory and Practice of Logic Programming, 1–44.
Fitting, M. 1991. Bilattices and the semantics of logic programming. The Journal of Logic Programming 11, 2, 91–116.
Ginsberg, M. L. 1988. Multivalued logics: A uniform approach to reasoning in artificial intelligence. Computational Intelligence 4, 3, 265–316.
Gutmann, B., Thon, I., Kimmig, A., Bruynooghe, M., and De Raedt, L. 2011. The magic of logical inference in probabilistic programming. Theory and Practice of Logic Programming 11, 4-5, 663–680.
Inoue, K., Ribeiro, T., and Sakama, C. 2014. Learning from interpretation transition. Machine Learning 94, 1, 51–79.
Keller, T. and Eyerich, P. 2012. PROST: Probabilistic planning based on UCT. In Proceedings of the International Conference on Automated Planning and Scheduling. 119–127.
Kifer, M. and Subrahmanian, V. 1992. Theory of generalized annotated logic programming and its applications.
Journal of Logic Programming 12, 4, 335–367.
Kolobov, A., Mausam, and Weld, D. S. 2012. LRTDP versus UCT for online probabilistic planning. In Proceedings of the AAAI Conference on Artificial Intelligence.
Kowalski, R. and Sergot, M. 1989. A logic-based calculus of events. In Foundations of Knowledge Base Management. Springer, 23–55.
Martínez, D., Alenyà, G., and Torras, C. 2015. Planning robot manipulation to clean planar surfaces. Engineering Applications of Artificial Intelligence 39, 23–32.
Mourao, K., Zettlemoyer, L. S., Petrick, R., and Steedman, M. 2012. Learning STRIPS operators from noisy and incomplete observations. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. 614–623.
Moyle, S. 2003. Using theory completion to learn a robot navigation control program. In Inductive Logic Programming. Springer, 182–197.
Pasula, H. M., Zettlemoyer, L. S., and Kaelbling, L. P. 2007. Learning symbolic models of stochastic domains. Journal of Artificial Intelligence Research 29, 1, 309–352.
Ribeiro, T. and Inoue, K. 2014. Learning prime implicant conditions from interpretation transition. In The 24th International Conference on Inductive Logic Programming. To appear (long paper) (http://tony.research.free.fr/paper/ILP2014long).
Rodrigues, C., Gérard, P., Rouveirol, C., and Soldano, H. 2012. Active learning of relational action models. In Inductive Logic Programming. Springer, 302–316.
Sanner, S. 2010. Relational Dynamic Influence Diagram Language (RDDL): Language description. Unpublished ms., Australian National University.
Shanahan, M. 2000. An abductive event calculus planner. The Journal of Logic Programming 44, 1, 207–240.
Surovik, D. A. and Scheeres, D. J. 2015. Heuristic search and receding-horizon planning in complex spacecraft orbit domains. In Proceedings of the International Conference on Automated Planning and Scheduling.
Sykes, D., Corapi, D., Magee, J., Kramer, J., Russo, A., and Inoue, K. 2013. Learning revised models for planning in adaptive systems. In Proceedings of the International Conference on Software Engineering. 63–71.
Van Emden, M. H. 1986. Quantitative deduction and its fixpoint theory. The Journal of Logic Programming 3, 1, 37–53.
Younes, H. L. and Littman, M. L. 2004. PPDDL1.0: An extension to PDDL for expressing planning domains with probabilistic effects. Technical Report CMU-CS-04-162.
Zhuo, H. H., Yang, Q., Hu, D. H., and Li, L. 2010. Learning complex action models with quantifiers and logical implications. Artificial Intelligence 174, 18, 1540–1569.