LIFT-UP: Lifted First-Order Planning Under Uncertainty

Steffen Hölldobler and Olga Skvortsova
International Center for Computational Logic
Technische Universität Dresden, Dresden, Germany
{sh,skvortsova}@iccl.tu-dresden.de

Abstract

We present a new approach for solving first-order Markov decision processes that combines first-order state abstraction and heuristic search. In contrast to existing systems, which first propositionalize the decision process and then perform state abstraction on the propositionalized version, we apply state abstraction directly to the decision process, avoiding propositionalization. Secondly, guided by an admissible heuristic, the search is restricted to those states that are reachable from the initial state. We demonstrate the usefulness of these techniques for solving first-order Markov decision processes within a domain-dependent system called FluCaP, which participated in the probabilistic track of the 2004 International Planning Competition. Working toward a domain-independent implementation, we present novel approaches to θ-subsumption involving literal and object contexts.

1 Introduction

We are interested in solving probabilistic planning problems, i.e., planning problems where the execution of an action leads to the desired effects only with a certain probability. For such problems, Markov decision processes have been adopted as a representational and computational model in much recent work, e.g., by [BBS95]. They are usually solved using the so-called dynamic programming principle [BDH99] employing a value iteration algorithm. Classical dynamic programming algorithms explicitly enumerate the state space and thus scale exponentially. In recent years several methods have been developed which avoid an explicit enumeration of the state space. The most prominent are state abstraction [BDH99], heuristic search (e.g., [BBS95, DKKN95]) and a combination of both as used, for example, in symbolic LAO∗ [FH02].

A common feature of these approaches is that a Markov decision process is propositionalized before state abstraction techniques and heuristic search algorithms are applied within a value iteration algorithm. Unfortunately, the propositionalization step itself may increase the problem size significantly. To overcome this problem, it was first proposed in [BRP01] to solve first-order Markov decision processes by applying a first-order value iteration algorithm and first-order state abstraction techniques. Whereas this symbolic dynamic programming approach was rooted in a version of the Situation Calculus [Rei91], we have reformulated and extended these ideas in a variant of the fluent calculus [HS04]. In this system, which is now called LIFT-UP, lifted first-order planning under uncertainty can be performed.

In the LIFT-UP system, states and actions are expressed in the language of the fluent calculus [HS90], which is slightly extended to handle probabilities. In addition, value functions and policies are represented by constructing first-order formulas which partition the state space into clusters, referred to as abstract states. Value iteration can then be performed on top of these clusters, so that first-order Markov decision processes are solved without explicit state enumeration or propositionalization. In addition, heuristics are used to guide the search, and normalization techniques are applied to eliminate redundant states.
The LIFT-UP approach can thus be viewed as a first-order generalization of symbolic LAO∗ or, alternatively, as symbolic dynamic programming enhanced by heuristic search and state space normalization.

To evaluate the LIFT-UP system we have developed a domain-dependent implementation called FluCaP. It can solve probabilistic blocksworld problems as they appeared, for example, in the colored blocksworld domain of the 2004 International Planning Competition. FluCaP is quite successful, outperforming other systems on truly first-order problems. Working towards a domain-independent implementation, we have also studied θ-subsumption algorithms. θ-subsumption problems arise at various places in the LIFT-UP system: the normalization process requires checking whether one abstract state subsumes another one; checking whether an action is applicable to some abstract state and computing the set of successor or predecessor states also require subsumption. One should observe that the latter application requires computing a complete set of substitutions.

In this paper we give an overview of the LIFT-UP approach.

2 First-order Markov Decision Processes

A Markov decision process is a tuple (Z, A, P, R, C), where Z is a finite set of states, A is a finite set of actions, and P : Z × Z × A → [0, 1], written P(z′ | z, a), specifies transition probabilities. In particular, P(z′ | z, a) denotes the probability of ending up at state z′ given that the agent was in state z and action a was executed. R : Z → ℝ is a real-valued reward function associating with each state z its immediate utility R(z). C : A → ℝ is a real-valued cost function associating a cost C(a) with each action a. A sequential decision problem consists of a Markov decision process and is the problem of finding a policy π : Z → A that maximizes the total expected discounted reward received when executing the policy π over an infinite (or indefinite) horizon. A Markov decision process is said to be first-order if the expressions used to define Z, A and P are first-order.

The value Vπ(z) of a state z with respect to the policy π is defined as

  Vπ(z) = R(z) + C(π(z)) + γ Σ_{z′∈Z} P(z′ | z, π(z)) Vπ(z′),

where 0 ≤ γ ≤ 1 is a discount factor. We take γ equal to 1 for indefinite-horizon problems only, i.e., when a goal is reached the system enters an absorbing state in which no further rewards or costs are accrued. A value function V is said to be optimal if it satisfies

  V∗(z) = R(z) + max_{a∈A} { C(a) + γ Σ_{z′∈Z} P(z′ | z, a) V∗(z′) }

for each z ∈ Z; in this case the value function is usually denoted by V∗. The optimal policy is extracted from the optimal value function.

We assume that planning problems meet the following requirements:
1. Each problem has a goal statement, identifying a set of absorbing goal states.
2. A positive reward is associated with each action ending in a goal state; otherwise it is 0.
3. A cost is associated with each action.
4. A “done” action is available in all states. The “done” action can be used to end any further accumulation of reward.

Together, these conditions ensure that an MDP model of a planning problem is a positive bounded model as described by [Put94]. Such planning problems are also often called stochastic shortest path problems.
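To make these equations concrete, here is a minimal ground value-iteration sketch in Python. The tiny two-state MDP (one "move" action toward an absorbing goal) is invented purely for illustration and is not part of the formalism above.

```python
# Minimal ground value iteration for the MDP model above: rewards on
# states, costs on actions, discount gamma, and a "done" action that
# ends any further accumulation of reward. The toy MDP is invented
# for illustration only.
GAMMA = 1.0   # indefinite horizon; the goal state is absorbing

states  = ["s0", "goal"]
actions = ["move", "done"]
R = {"s0": 0.0, "goal": 500.0}        # immediate utility R(z)
C = {"move": -3.0, "done": 0.0}       # action costs C(a)
# P[(z, a)] -> list of (z', probability); an empty list models an
# absorbing outcome in which nothing further is accrued.
P = {("s0", "move"):   [("goal", 0.75), ("s0", 0.25)],
     ("s0", "done"):   [],
     ("goal", "move"): [],            # goal is absorbing
     ("goal", "done"): []}

V = {z: 0.0 for z in states}
for _ in range(1000):
    V_new = {z: R[z] + max(C[a] + GAMMA * sum(p * V[z2]
                                              for z2, p in P[(z, a)])
                           for a in actions)
             for z in states}
    residual = max(abs(V_new[z] - V[z]) for z in states)
    V = V_new
    if residual < 1e-9:
        break
print(V)   # V["goal"] = 500; V["s0"] converges to 496
```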
3 Probabilistic Fluent Calculus

States, actions, transition probabilities, and the cost and reward functions are specified in a probabilistic and sorted extension of the fluent calculus [HS90, Thi98].

Fluents and States Let Σ denote a set of function symbols containing the binary function symbol ◦ and the nullary function symbol 1, where ◦ is an AC1-symbol with 1 as unit element. Let Σ− = Σ \ {◦, 1}. Non-variable Σ−-terms are called fluents. Let f(t1, ..., tn) be a fluent. The terms ti, 1 ≤ i ≤ n, are called objects. A state is a finite set of ground fluents. Let D be the set of all states.

Fluent Terms and Abstract States Fluent terms are defined inductively as follows: 1 is a fluent term; each fluent is a fluent term; if G1 and G2 are fluent terms, then so is G1 ◦ G2. Let F be the set of all fluent terms. We assume that each fluent term obeys the singularity condition: each fluent may occur at most once in a fluent term. Because of the latter, there is a bijection [·]^M between ground fluent terms and states. Some care must be taken when instantiating a non-ground fluent term F by a substitution θ because Fθ may violate the singularity condition. A substitution θ is allowed for a fluent term F if Fθ meets the singularity condition.

Abstract states are expressions of the form F or F ◦ X, where F is a fluent term and X is a variable of sort fluent term. Let S denote the set of abstract states. Abstract states denote sets of states as defined by the mapping [·]^I : S → 2^D: let Z be an abstract state; then [Z]^I = {[Zθ]^M | θ is an allowed grounding substitution for Z}. This is illustrated in Figure 1.

Figure 1: The interpretations of the abstract states (a) Z1 = on(X1, a) ◦ on(a, table), (b) Z2 = on(X2, a) ◦ on(a, table) ◦ Y2, (c) Z3 = on(X3, a) ◦ on(a, table) ◦ clear(X3) and (d) Z4 = on(X4, a) ◦ on(a, table) ◦ clear(X4) ◦ Y4, where a is an object denoting a block, table is an object denoting a table, X1, X2, X3 and X4 are variables of sort object, Y2 and Y4 are variables of sort fluent term, on(Xi, a), i = 1...4, is a fluent denoting that some block Xi is on a, and clear(Xi), i = 3, 4, is a fluent denoting that block Xi is clear.

In other words, abstract states are characterized by means of positive conditions that must hold in each ground instance thereof and, thus, they represent clusters of states. In this way, abstract states embody a form of state space abstraction, which is called first-order state abstraction.

As a running example, we consider problems taken from the colored Blocksworld scenario, which is an extension of the classical Blocksworld scenario in the sense that, along with the unique identifier, each block is now assigned a specific color. Thus, a state description provides an arrangement of colors instead of an arrangement of blocks. For example, the state Z defined as the fluent term

  Z = red(X0) ◦ green(X1) ◦ blue(X2) ◦ red(X3) ◦ red(X4) ◦ red(X5) ◦ green(X6) ◦ green(X7) ◦ Tower(X0, ..., X7)

specifies a tower that is comprised of eight colored blocks.

Subsumption Let Z1 and Z2 be abstract states. Then Z1 is subsumed by Z2, in symbols Z1 ⊑ Z2, if there exists an allowed substitution θ such that Z2θ =AC1 Z1. Intuitively, Z1 is subsumed by Z2 iff [Z1]^I ⊆ [Z2]^I. In the LIFT-UP system we are often concerned with the problem of finding a complete set of allowed substitutions solving the AC1-matching problem Z2θ =AC1 Z1. For example, consider the abstract states mentioned in Figure 1. Then Z1 ⊑ Z2 with θ = {X2 ↦ X1, Y2 ↦ 1} and Z3 ⊑ Z2 with θ = {X2 ↦ X3, Y2 ↦ clear(X3)}. However, Z1 ⋢ Z3 and Z3 ⋢ Z1.
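To make the AC1-matching problem concrete, the following sketch is our own illustration (not the LIFT-UP implementation): fluent terms are lists of fluents under the singularity condition, variables are strings starting with "?", and the complete set of allowed substitutions for Z2θ =AC1 Z1, where Z2 has the form F ◦ Y, is enumerated by simple backtracking.

```python
# Illustrative sketch (not the LIFT-UP implementation): a fluent term is
# a list of fluents, each fluent a tuple ("name", arg1, ..., argn), and
# variables are strings starting with "?". For Z2 of the form F ◦ Y,
# Z2·theta =AC1 Z1 amounts to an injective matching of the fluents of F
# into the fluents of Z1; the remainder variable Y absorbs the rest.
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def match_fluent(pat, flu, theta):
    """Extend substitution theta so that pat·theta = flu, else None."""
    if pat[0] != flu[0] or len(pat) != len(flu):
        return None
    theta = dict(theta)
    for p, f in zip(pat[1:], flu[1:]):
        p = theta.get(p, p)
        if is_var(p):
            theta[p] = f
        elif p != f:
            return None
    return theta

def all_matchers(Z2, Z1):
    """Complete set of substitutions mapping the fluents of Z2
    injectively into Z1 (the singularity condition keeps this sound)."""
    results = []
    def rec(i, used, theta):
        if i == len(Z2):
            results.append(theta)
            return
        for j, flu in enumerate(Z1):
            if j not in used:
                t2 = match_fluent(Z2[i], flu, theta)
                if t2 is not None:
                    rec(i + 1, used | {j}, t2)
    rec(0, frozenset(), {})
    return results

# Z1 and Z2 from Figure 1; the remainder variable Y2 is implicit.
Z1 = [("on", "?X1", "a"), ("on", "a", "table")]
Z2 = [("on", "?X2", "a")]
print(all_matchers(Z2, Z1))   # [{'?X2': '?X1'}]
```

Running it on Z1 and Z2 from Figure 1 yields exactly the substitution {X2 ↦ X1}, with Y2 implicitly absorbing the unmatched fluent on(a, table).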
Actions Let Σa denote a set of action names, where Σa ∩ Σ = ∅. An action space A is a set of expressions of the form (a(X1, ..., Xn), C, E), where a ∈ Σa, the Xi, 1 ≤ i ≤ n, are variables or constants, C ∈ F is called the precondition and E ∈ F the effect of the action a(X1, ..., Xn). E.g., a pickup-action in the blocksworld can be specified by

  (pickup(X, Y), on(X, Y) ◦ clear(X) ◦ empty, holding(X) ◦ clear(Y)),

where empty denotes that the robot arm is empty and holding(X) that the block X is in the gripper. For simplicity, we will often suppress the parameters, precondition and effect of an action (a(X1, ..., Xn), C, E) and refer to it as a instead.

Nature’s Choice and Probabilities In analogy to the approach in [BRP01], stochastic actions are decomposed into deterministic primitives under nature’s control, referred to as nature’s choices. This can be modelled with the help of a binary relation symbol choice as follows. Consider the action pickup(X, Y):

  choice(pickup(X, Y), a) ↔ (a = pickupS(X, Y) ∨ a = pickupF(X, Y)),

where pickupS and pickupF define the two nature’s choices for action pickup, viz., that it succeeds or fails. For simplicity, we denote the set of nature’s choices of an action a as Ch(a) := {aj | choice(a, aj)}. For each of nature’s choices aj associated with an action a we define the probability prob(aj, a, Z), denoting the probability with which aj is chosen in a state Z. For example, prob(pickupS(X, Y), pickup(X, Y), Z) = 0.75 states that the probability for the successful execution of the pickup action in state Z is 0.75. We require that for each action the probabilities of all its nature’s choices sum up to 1.

Rewards and Costs Reward and cost functions are defined for abstract states using the unary relation symbols reward and cost. For example, we might want to give a reward of 500 to all states in which some block X is on block a, and 0 otherwise:

  reward(Z) = 500 ↔ Z ⊑ on(X, a) ◦ Y,
  reward(Z) = 0 ↔ Z ⋢ on(X, a) ◦ Y,

where Y is a variable of sort fluent term. In other words, the state space is divided into two abstract states depending on whether or not a block X is on block a. Likewise, value functions can be specified with respect to abstract states only. Action costs can be defined analogously. E.g., with cost(pickup(X, Y)) = 3 the execution of the pickup-action is penalized with 3.

Forward and Backward Application of Actions An action (a(X1, ..., Xn), C, E) is forward applicable with θ to an abstract state Z ∈ S, denoted as forward(Z, a, θ), if (C ◦ U)θ =AC1 Z, where U is a new variable of sort fluent term and θ is an allowed substitution. If applicable, the action progresses to, or yields, the state (E ◦ U)θ. In this case, (E ◦ U)θ is called a successor state of Z and denoted as succ(Z, a, θ).

An action (a(X1, ..., Xn), C, E) is backward applicable with θ to an abstract state Z ∈ S, denoted as backward(Z, a, θ), if (E ◦ U)θ =AC1 Z, where U is a new variable of sort fluent term and θ is an allowed substitution. If applicable, the action regresses to the state (C ◦ U)θ. In this case, (C ◦ U)θ is called a predecessor state of Z and denoted as pred(Z, a, θ).

One should observe that the AC1-matching problems involved in the application of actions are subsumption problems, viz. Z ⊑ C ◦ U and Z ⊑ E ◦ U. Moreover, in order to determine all possible successor or predecessor states of some state with respect to some action we have to compute complete sets of allowed substitutions solving the corresponding subsumption problems.
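As an illustration (again our own sketch, with hypothetical names such as forward_apply, not LIFT-UP code), forward application enumerates every allowed θ with (C ◦ U)θ =AC1 Z and returns the corresponding successor (E ◦ U)θ; the remainder variable U is represented implicitly by the fluents of Z not consumed by the precondition. Backward application is symmetric, with the roles of C and E exchanged.

```python
# Sketch of forward action application on top of all_matchers above
# (illustrative only). The remainder variable U of sort fluent term is
# implicit: the fluents of Z not consumed by the precondition C form
# U·theta.
def apply_substitution(term, theta):
    return [tuple(theta.get(t, t) for t in flu) for flu in term]

def forward_apply(Z, action):
    """Yield (theta, successor) for every allowed match of C into Z."""
    name, C, E = action
    for theta in all_matchers(C, Z):
        consumed = set(apply_substitution(C, theta))
        remainder = [flu for flu in Z if flu not in consumed]   # U·theta
        yield theta, apply_substitution(E, theta) + remainder

pickup = ("pickup",
          [("on", "?X", "?Y"), ("clear", "?X"), ("empty",)],    # C
          [("holding", "?X"), ("clear", "?Y")])                 # E

Z = [("on", "b1", "b2"), ("on", "b2", "table"),
     ("clear", "b1"), ("empty",)]
for theta, succ in forward_apply(Z, pickup):
    print(theta, succ)
# One match, X ↦ b1, Y ↦ b2, with successor
# holding(b1) ◦ clear(b2) ◦ on(b2, table).
```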
  policyExpansion(π, S0, G)
    E := F := ∅
    from := S0
    repeat
      to := ⋃_{Z ∈ from} ⋃_{aj ∈ Ch(a)} {succ(Z, aj, θ)}, where (a, θ) := π(Z)
      F := F ∪ (to − G)
      E := E ∪ from
      from := (to ∩ G) − E
    until from = ∅
    E := E ∪ F
    G := G ∪ F
    return (E, F, G)

  FOVI(E, A, prob, reward, cost, γ, V)
    repeat
      V′ := V
      for each Z ∈ E
        for each a ∈ A
          for each θ such that forward(Z, a, θ)
            Q(Z, a, θ) := reward(Z) + cost(a) + γ Σ_{aj ∈ Ch(a)} prob(aj, a, Z) · V′(succ(Z, aj, θ))
        V(Z) := max_{(a,θ)} Q(Z, a, θ)
      V := normalize(V)
      r := ‖V − V′‖
    until stopping criterion
    π := extractPolicy(V)
    return (V, π, r)

  FOLAO∗(A, prob, reward, cost, γ, S0, h, ε)
    V := h
    G := ∅
    for each Z ∈ S0, initialize π with an arbitrary action
    repeat
      (E, F, G) := policyExpansion(π, S0, G)
      (V, π, r) := FOVI(E, A, prob, reward, cost, γ, V)
    until F = ∅ and r ≤ ε
    return (π, V)

Figure 2: The LIFT-UP algorithm.

4 LIFT-UP Algorithm

In order to solve first-order MDPs, we have developed a new algorithm that combines heuristic search and first-order state abstraction techniques. Our algorithm, referred to as LIFT-UP, can be seen as a generalization of the symbolic LAO∗ algorithm by [FH02]. Given an initial state, LIFT-UP uses an admissible heuristic to focus computation on the parts of the state space that are reachable from the initial state. Moreover, it specifies MDP components, value functions, policies, and admissible heuristics in the first-order language of the Probabilistic Fluent Calculus. This allows LIFT-UP to manipulate abstract states instead of individual states.

The algorithm itself is presented in Figure 2. Like symbolic LAO∗, LIFT-UP has two phases that alternate until a complete solution is found, which is guaranteed to be optimal. First, it expands the best partial policy and evaluates the states on its fringe using an admissible heuristic function. Then it performs dynamic programming on the states visited by the best partial policy, to update their values and possibly revise the current best partial policy. We note that we focus on partial policies that map a subcollection of states into actions.

In the policy expansion step, we perform reachability analysis to find the set F of states that have not yet been expanded, but are reachable from the set S0 of initial states by following the partial policy π. The set of states G contains the states that have been expanded so far. By expanding a partial policy we mean that it will be defined for a larger set of states in the dynamic programming step.

In symbolic LAO∗, reachability analysis is performed on propositional algebraic decision diagrams (ADDs). Therefore, an additional preprocessing of a first-order MDP is required at the outset of any solution attempt. This preprocessing involves propositionalization of the first-order structure of the MDP, viz., instantiation of the MDP components with all possible combinations of domain objects. LIFT-UP, in contrast, relies on lifted first-order reasoning, that is, computations are kept on the first-order level, avoiding propositionalization. In particular, the action applicability check and the computation of successors as well as predecessors are accomplished on abstract states directly.

In the dynamic programming step of LIFT-UP, we employ a modified first-order value iteration algorithm (FOVI) that computes the value only on those states which are reachable from the initial states. More precisely, we call FOVI on the set E of states that are visited by the best current partial policy.
In this way, we improve the efficiency of the original FOVI algorithm by [HS04] by using symbolic dynamic programming together with reachability analysis. Given a FOMDP and a value function represented in PFC, FOVI returns the best partial value function V, the best partial policy π and the residual r. In order to update the values of the states Z in E, we assign the values from the current value function to the successors of Z. We compute the successors with respect to all nature’s choices aj. The residual r is computed as the absolute value of the largest difference between the current and the newly computed value functions V′ and V, respectively. We note that the newly computed value function V is taken in its normalized form, i.e., as the result of the normalize procedure that will be described in Section 4.2.1. Extraction of the best partial policy π is straightforward: one simply needs to extract the maximizing actions from the best partial value function V.

As with symbolic LAO∗, LIFT-UP converges to an ε-optimal policy when three conditions are met: (1) its current policy does not have any unexpanded states, (2) the residual r is less than the predefined threshold ε, and (3) the value function is initialized with an admissible heuristic. The original convergence proofs for LAO∗ and symbolic LAO∗ by [HZ01] carry over in a straightforward way to LIFT-UP.

When calling LIFT-UP, we initialize the value function with an admissible heuristic function h that focuses the search on a subset of reachable states. A simple way to create an admissible heuristic is to use dynamic programming to compute an approximate value function. Therefore, in order to obtain an admissible heuristic h in LIFT-UP, we perform several iterations of the original FOVI. We start the algorithm on an initial value function that is admissible. Since each step of FOVI preserves admissibility, the resulting value function is admissible as well. The initial value function assigns the goal reward to each state, thereby overestimating the optimal value, since the goal reward is the maximal possible reward.

Since all computations in LIFT-UP are performed on abstract states instead of individual states, FOMDPs are solved avoiding explicit state and action enumeration and propositionalization. Lifted first-order reasoning leads to better performance of LIFT-UP in comparison to symbolic LAO∗, as shown in Section 5.2.

[Figure 3: Policy Expansion. Three stages (a)–(c) of expanding the best partial policy starting from the initial state Z0, showing the sets from, to, E, F and G at each stage; FOVI is called on the set E after each expansion.]

4.1 Policy Expansion

We illustrate the policy expansion procedure in LIFT-UP by means of an example. Assume that we start from the initial state Z0 and two nondeterministic actions a1 and a2 are applicable in Z0, each having two outcomes a11, a12 and a21, a22, respectively. Without loss of generality, we assume that the current best policy π chooses a1 as the optimal action at state Z0. We construct the successors Z1 and Z2 of Z0 with respect to both outcomes a11 and a12 of the action a1.
The fringe set F as well as the set G of states expanded so far contain the states Z1 and Z2 only, whereas the set E of states visited by the best current partial policy in addition contains the state Z0; see Figure 3a. In the next step, FOVI is performed on the set E. We assume that the values have been updated in such a way that a2 becomes the optimal action in Z0. Thus, the successors of Z0 have to be recomputed with respect to the optimal action a2; see Figure 3b. One should observe that one of the a2-successors of Z0, namely Z2, is an element of the set G and thus was already contained in the fringe F during the previous expansion step. Hence, the state Z2 should be expanded and its value recomputed. This is shown in Figure 3c, where the states Z4 and Z5 are a1-successors of Z2, under the assumption that a1 is the optimal action in Z2. As a result, the fringe set F contains the newly discovered states Z3, Z4 and Z5, and we perform FOVI on E = {Z0, Z2, Z3, Z4, Z5}. The state Z1 is not contained in E because it does not belong to the best current partial policy, and the dynamic programming step is performed only on the states that were visited by the best current partial policy.

  N   Supdate   Snorm   Update, ms   Norm, ms   Runtime, ms   Runtime w/o norm, ms
  0         9       6          144          1           145                    144
  1        24      14          393          3           396                    593
  2        94      23          884         12           896                   2219
  3       129      33         1377         16          1393                  13293
  4       328      39         2079         46          2125                  77514
  5       361      48         2519         51          2570                 805753
  6       604      52         3268        107          3375                    n/a
  7       627      54         3534        110          3644                    n/a
  8       795      56         3873        157          4030                    n/a
  9       811      59         4131        154          4285                    n/a

Table 1: Representative timing results for the first ten iterations of first-order value iteration with the normalization procedure switched on or off. Supdate and Snorm give the number of states after the value-update and normalization steps, respectively; Update and Norm give the corresponding times.

4.2 First-order Value Iteration

The first-order value iteration algorithm (FOVI) produces a first-order representation of the optimal value function and policy by exploiting the logical structure of a first-order MDP. Thus, FOVI can be seen as a first-order counterpart of the classical value iteration algorithm by [Bel57]. In LIFT-UP, the first-order value iteration algorithm serves two purposes: first, we perform several iterations of FOVI in order to create an admissible heuristic h; second, in the dynamic programming step of LIFT-UP, we apply FOVI to the states visited by the best partial policy in order to update their values and possibly revise the current best partial policy.

4.2.1 Normalization

Several authors have already observed that value iteration adds dramatic computational overhead to a solution technique for first-order MDPs if no care is taken to avoid redundant computations [BRP01, HS04]. Recently, an automated normalization procedure has been proposed that, given a state space, delivers an equivalent one that contains no redundancy [HS04]. This procedure, referred to as normalize in the LIFT-UP algorithm, is always called before the value function is transmitted to the next iteration step, thereby preventing the propagation of redundancy to the subsequent computation steps. The technique employs the subsumption relation defined in Section 3: informally, given two abstract states Z1 and Z2 such that Z1 ⊑ Z2 and the values associated with the two states are identical, Z1 can be removed from the state space because it contains only redundant information.
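As a simple illustration of this idea (our own sketch reusing all_matchers from above, not the normalize procedure of [HS04]), redundancy can be removed by dropping every abstract state that is subsumed by another state carrying the same value:

```python
# Illustrative normalization sketch (reuses all_matchers from the
# earlier sketch; not the procedure of [HS04]). Abstract states are
# taken to be of the form F ◦ X, so Z1 ⊑ Z2 holds when the fluent part
# of Z2 matches into Z1 under some allowed substitution.
def subsumed(Z1, Z2):
    """Z1 ⊑ Z2: some allowed substitution maps the fluents of Z2 into Z1."""
    return bool(all_matchers(Z2, Z1))

def normalize(value_function):
    """value_function: list of (abstract_state, value) pairs."""
    kept = []
    for Zi, vi in value_function:
        # drop Zi if an already-kept state with the same value subsumes it
        if any(vi == vj and subsumed(Zi, Zj) for Zj, vj in kept):
            continue
        # drop previously kept states that Zi subsumes with the same value
        kept = [(Zj, vj) for Zj, vj in kept
                if not (vi == vj and subsumed(Zj, Zi))]
        kept.append((Zi, vi))
    return kept

V = [([("on", "?X", "a"), ("clear", "?X")], 500.0),   # more specific state
     ([("on", "?X", "a")], 500.0)]                    # subsumes the first
print(normalize(V))   # only the more general abstract state survives
```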
Table 1 illustrates the importance of the normalization algorithm by providing some representative timing results for the first ten iterations of first-order value iteration. The experiments were carried out on a problem taken from the colored Blocksworld scenario consisting of ten blocks. Even on such a relatively simple problem, FOVI with the normalization switched off does not scale beyond the sixth iteration. The results in Table 1 demonstrate that the normalization performed during one iteration of FOVI dramatically shrinks the computational effort of the subsequent iterations. The columns labelled Supdate and Snorm show the size of the state space after performing the value updates and the normalization, respectively. For example, the normalization factor, i.e., the ratio between the number Supdate of states obtained after performing one update step and the number Snorm of states obtained after performing the normalization step, is 11.6 at the seventh iteration. This means that more than ninety percent of the state space contained redundant information.

The fourth and fifth columns in Table 1 contain the times Update and Norm spent on performing the value updates and the normalization, respectively. The total runtime Runtime, when the normalization is switched on, is given in the sixth column. The seventh column, labelled Runtime w/o norm, depicts the total runtime of FOVI when the normalization is switched off. If we sum up all values in the seventh column and the values in the sixth column up to the sixth iteration inclusive, subtract the latter from the former, and divide the result by the total time Norm needed for performing normalization during the first six iterations, we obtain a normalization gain of about three orders of magnitude.

5 The Planning System FluCaP

To evaluate the LIFT-UP approach we have developed a domain-dependent implementation called FluCaP. It can solve probabilistic Blocksworld problems as they appeared, for example, in the colored Blocksworld domain of the 2004 International Planning Competition.

5.1 Domain-dependent Optimizations

So far, we have presented the general LIFT-UP theory for finding solutions in uncertain planning environments which are represented as first-order MDPs. However, several domain-driven optimizations or relaxations have been imposed on the general theory. As a result, FluCaP has demonstrated competitive computational behaviour.

Action Applicability Since in the Blocksworld domain all states are fully specified, abstract states can be described by fluent terms only. This allows us to relax the forward and backward action applicability conditions. Since the cases are symmetric, we concentrate on the forward action applicability condition, which was initially defined as follows: an action (a(X1, ..., Xn), C, E) is forward applicable with θ to an abstract state Z ∈ S, denoted as forward(Z, a, θ), if (C ◦ U)θ =AC1 Z, where U is a new variable of sort fluent term and θ is an allowed substitution. Since C and Z are fluent terms under the singularity condition, this AC1-matching problem can be transformed into a θ-subsumption problem [Rob65].

Moreover, we optimize the obtained θ-subsumption problem further. Since a state description in a colored Blocksworld represents a number of towers, we compare towers of blocks and their color distributions instead of matching the respective fluent terms. The experiments have shown that it is much faster to manipulate towers than fluent terms. For example, assume that an action precondition contains a fluent clear(X) and let a state describe three towers of blocks. By inspecting the uppermost blocks in the towers, we conclude that there are only three blocks which can satisfy the precondition, as sketched below.
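The following sketch is a hypothetical rendering of this idea (not FluCaP's actual data structures): a fully specified colored state is kept as a list of towers, each a bottom-to-top list of colors; candidates for clear(X) are then exactly the tower tops, and a cheap tower-height and color-distribution comparison can precede any full matching.

```python
# Hypothetical sketch of the tower representation (not FluCaP's actual
# code). A state is a list of towers; each tower is a bottom-to-top
# list of block colors.
from collections import Counter

state = [["red", "green"], ["blue"], ["red", "red", "green"]]

def clear_blocks(state):
    """Candidates for clear(X): exactly the top block of each tower."""
    return [(i, tower[-1]) for i, tower in enumerate(state) if tower]

def may_match_states(s1, s2):
    """Cheap necessary test before full tower matching: the multisets of
    tower heights and the overall color distributions must agree."""
    return (sorted(map(len, s1)) == sorted(map(len, s2)) and
            Counter(c for t in s1 for c in t) ==
            Counter(c for t in s2 for c in t))

print(clear_blocks(state))             # three candidate bindings for X
print(may_match_states(state, state))  # True
```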
It would be interesting to check whether this optimization technique can be successfully applied in other planning domains as well. Meanwhile, in Section 6, we present some results towards efficient domain-independent solution methods for the θ-subsumption problem.

Normalization A similar situation occurs in the case of normalization, which relies on the subsumption relation defined in Section 3. The AC1-matching problem underlying the subsumption relation again reduces to a θ-subsumption problem, and again it is much faster to operate on towers than on fluent terms. For example, assume that one state describes two towers of four and three blocks, respectively, whereas another state also describes two towers, but of five and two blocks. In order to decide whether one state subsumes the other, we try to match the corresponding towers and their color distributions. As experiments have shown, this optimization speeds up the normalization immensely.

5.2 Experimental Evaluation

We demonstrate the advantages of combining heuristic search with first-order state abstraction using the FluCaP system, which successfully entered the domain-dependent track of the probabilistic part of the 2004 International Planning Competition (IPC'2004). The experimental results were all obtained using RedHat Linux running on a 3.4GHz Pentium IV machine with 3GB of RAM.

In Table 2, we present the performance comparison of FluCaP with symbolic LAO∗ on examples taken from the colored Blocksworld (BW) scenario. Our main objective was to investigate whether first-order state abstraction using logic could improve the computational behaviour of a planning system for solving FOMDPs. The colored BW problems were our main interest since they were the only ones represented in first-order terms and hence the only ones that allowed us to make use of first-order state abstraction.

At the outset of solving a colored BW problem, symbolic LAO∗ starts by propositionalizing its components, namely the goal statement and the actions. Only after that is the abstraction using propositional ADDs applied. In contrast, FluCaP performs first-order abstraction on a colored BW problem directly, avoiding unnecessary grounding.

In the following, we show how the abstraction technique affects the computation of a heuristic function. To create an admissible heuristic, FluCaP performs twenty iterations of FOVI and symbolic LAO∗ performs twenty iterations of an approximate value iteration algorithm similar to APRICODD by [SAHB00]. The columns labelled H.time and NAS show the time needed for computing a heuristic function and the number of abstract states it covers, respectively. In comparison to FluCaP, symbolic LAO∗ needs to evaluate fewer abstract states in the heuristic function but takes considerably more time. One can conclude that abstract states in symbolic LAO∗ have a more complex structure than those in FluCaP. We note that, in comparison to FOVI, FluCaP restricts the value iteration to a smaller state space. Intuitively, the value function delivered by FOVI covers a larger state space, because the time that is allocated for the heuristic search in FluCaP is used for performing additional iterations in FOVI.
         Total av. reward             Total time, sec               H.time, sec       NAS                   NGS, ×10³          %
 B  C    LAO*  FluCaP  FOVI  FluCaP–  LAO*   FluCaP  FOVI   FluCaP–  LAO*    FluCaP   LAO*  FluCaP  FOVI   LAO*    FluCaP
 5  4    494   494     494   494      22.3   22.0    23.4   31.1     8.7     4.2      35    410     1077   0.86    0.82       2.7
 5  3    496   495     495   496      23.1   17.8    22.7   25.1     9.5     1.3      34    172     687    0.86    0.68       2.1
 5  2    496   495     495   495      27.3   11.7    15.7   16.5     12.7    0.3      32    55      278    0.86    0.66       1.9
 6  4    493   493     493   493      137.6  78.5    261.6  285.4    76.7    21.0     68    1061    3847   7.05    4.24       3.1
 6  3    493   492     493   492      150.5  33.0    119.1  128.5    85.0    9.3      82    539     1738   7.05    6.50       2.3
 6  2    495   494     495   496      221.3  16.6    56.4   63.3     135.0   1.2      46    130     902    7.05    6.24       2.0
 7  4    492   491     491   491      1644   198.1   2776   n/a      757.0   171.3    143   2953    12014  65.9    23.6       3.5
 7  3    494   494     494   494      1265   161.6   1809   2813     718.3   143.6    112   2133    7591   65.9    51.2       2.4
 7  2    494   494     494   494      2210   27.3    317.7  443.6    1241    12.3     101   425     2109   65.9    61.2       2.0
 8  4    n/a   490     n/a   n/a      n/a    1212    n/a    n/a      n/a     804.1    n/a   8328    n/a    n/a     66.6       4.1
 8  3    n/a   490     n/a   n/a      n/a    598.5   n/a    n/a      n/a     301.2    n/a   3956    n/a    n/a     379.7      3.0
 8  2    n/a   492     n/a   n/a      n/a    215.3   1908   n/a      n/a     153.2    n/a   2019    7251   n/a     1121       2.3
15  3    n/a   486     n/a   n/a      n/a    1809    n/a    n/a      n/a     1733     n/a   7276    n/a    n/a     1.2 · 10⁷  5.7
17  4    n/a   481     n/a   n/a      n/a    3548    n/a    n/a      n/a     1751     n/a   15225   n/a    n/a     2.5 · 10⁷  6.1

Table 2: Performance comparison of FluCaP and symbolic LAO∗ (denoted LAO*), together with the two FluCaP variants FOVI and FluCaP– described below; n/a denotes that a planner did not deliver a solution within the time limit of one hour. NAS and NGS are the numbers of abstract and ground states, respectively.

The results in the column labelled % show that the harder the problem (that is, the more colors it contains), the higher the percentage of runtime spent on normalization. On almost all test problems, the effort spent on normalization amounts to about three percent of the total runtime on average.

In order to compare heuristic accuracy, we present in the column labelled NGS the number of ground states to which the heuristic assigns non-zero values. One can see that the heuristics returned by FluCaP and symbolic LAO∗ have similar accuracy, but FluCaP takes much less time to compute them. This reflects the advantage of plain first-order abstraction in comparison to the marriage of propositionalization with abstraction using propositional ADDs. In some examples, we gain several orders of magnitude in H.time.

The column labelled Total time presents the time needed to solve a problem. During this time, a planner must execute 30 runs from an initial state to a goal state. A one-hour block is allocated for each problem. We note that, in comparison to FluCaP, the time required by heuristic search in symbolic LAO∗ (i.e., the difference between Total time and H.time) grows considerably faster with the size of the problem. This reflects the potential of employing first-order abstraction instead of abstraction based on propositional ADDs during heuristic search.

The average reward obtained over 30 runs, shown in the column Total av. reward, is the planner's evaluation score. A reward value close to 500 (the maximum possible reward) simply indicates that a planner found a reasonably good policy. Each time the number of blocks B increases by 1, the running time for symbolic LAO∗ increases roughly tenfold. Thus, it could not scale to problems having more than seven blocks. This is in contrast to FluCaP, which could solve problems of seventeen blocks.
 B    Total av. reward, ≤500   Total time, sec   H.time, sec   NAS    NGS, ×10²¹
 20         489.0                   137.5             56.8      711    1.7
 22         487.4                   293.8            110.2      976    1.1 · 10³
 24         492.0                   757.3            409.8     1676    1.0 · 10⁶
 26         482.8                   817.0            117.2     1141    4.6 · 10⁸
 28         493.0                  2511.3            823.3     2832    8.6 · 10¹¹
 30         491.2                  3580.4           1174.0     4290    1.1 · 10¹⁵
 32         476.0                  3953.8            781.8     2811    7.4 · 10¹⁷
 34         475.6                  3954.1            939.4     3248    9.6 · 10²⁰
 36          n/a                      n/a              n/a      n/a    n/a

Table 3: Performance of FluCaP on larger instances of one-color Blocksworld problems, where n/a denotes that the planner did not deliver a solution within the time limit.

We note that the number of colors C in a problem affects the efficiency of the abstraction technique. In FluCaP, as C decreases, the abstraction rate increases, which in turn is reflected by a dramatic decrease in runtime. The opposite holds for symbolic LAO∗.

In addition, we compare FluCaP with two variants. The first one, denoted as FOVI, performs no heuristic search at all but rather employs FOVI to compute the ε-optimal total value function, from which a policy is extracted. The second one, denoted as FluCaP–, performs a 'trivial' heuristic search starting with an initial value function as an admissible heuristic. As expected, FluCaP, which combines heuristic search and FOVI, demonstrates an advantage over plain FOVI and trivial heuristic search. These results illustrate the significance of heuristic search in general (FluCaP vs. FOVI) and the importance of heuristic accuracy in particular (FluCaP vs. FluCaP–). FOVI and FluCaP– do not scale to problems with more than seven blocks.

Table 3 presents the performance results of FluCaP on larger instances of one-color BW problems with the number of blocks varying from twenty to thirty-four. We believe that FluCaP does not scale to problems of larger size because the implementation is not yet well optimized. In general, we believe that the FluCaP system should not be as sensitive to the size of a problem as propositional planners are.

Our experiments were targeted at the one-color problems only because they are, on the one hand, the simplest ones for us and, on the other hand, the bottleneck for propositional planners. The structure of one-color problems allows us to apply first-order state abstraction in its full power. For example, for a 34-block problem FluCaP operates on about 3.3 thousand abstract states, which explode to 9.6 × 10⁴¹ individual states after propositionalization. A propositional planner must be highly optimized in order to cope with this non-trivial state space.

We note that additional colors in larger instances (more than 20 blocks) of BW problems cause a dramatic increase in computational time, so we consider these problems as being unsolved. One should also observe that the number of abstract states NAS increases non-monotonically with the number of blocks because the problems are generated randomly. For example, the 30-block problem happens to be harder than the 34-block one. Finally, we note that all results in Tables 2 and 3 were obtained using the new version of the evaluation software, which, in contrast to the initial version used during the competition, does not rely on propositionalization. The competition domains and results are available in [YLWA05].

6 Domain-independent Methods for θ-subsumption

Given two fluent terms Z1 and Z2 under the singularity condition, Z1 θ-subsumes Z2, written Z1 ⊒θ Z2, iff there exists an allowed substitution θ such that (Z1 ◦ U)θ =AC1 Z2. Initially, θ-subsumption was defined on clauses.
Given two clauses C and D, C θ-subsumes D iff there exists a substitution θ such that Cθ ⊆ D [Rob65]. In general, θ-subsumption is NP-complete [KN86]. In the domain-dependent implementation of the LIFT-UP approach described in the previous section, we employed domain-driven optimization techniques that allowed us to reduce the complexity of θ-subsumption. This section is devoted to efficient domain-independent solution methods for θ-subsumption which cope with its NP-completeness.

One approach to cope with the NP-completeness of θ-subsumption is deterministic subsumption. A state is said to be determinate if there is an ordering of fluents such that in each step there is a fluent which has exactly one match that is consistent with the previously matched fluents [KL94]. However, in practice there may be only few fluents, or none at all, that can be matched deterministically.

Recently, another approach was developed in [SHW96], which we refer to as literal context, LitCon for short, to cope with the complexity of θ-subsumption. The authors propose to reduce the number of matching candidates for each fluent by using contextual information. The method is based on the idea that fluents may only be matched to those fluents that possess the same relations up to an arbitrary depth in a clause. As a result, a certain superset of determinate states can be tested for subsumption in polynomial time. Unfortunately, as shown in [KRS06], LitCon does not scale very well to large depths. Because in some planning problems the size of state descriptions can be relatively large, it might be necessary to compute the contextual information for large values of the depth parameter. Therefore, we are strongly interested in a technique that scales better than LitCon. In this section, we present an approach, referred to as object context, ObjCon for short, which demonstrates better computational behaviour than LitCon. Based on the idea of ObjCon, we develop a new θ-subsumption algorithm and compare it with the LitCon-based approach.

6.1 Object Context

In general, a fluent f in a state Z1 can be matched with several fluents in a state Z2, which are referred to as matching candidates of f. LitCon is based on the idea that fluents in Z1 can only be matched to those fluents in Z2 whose context includes the context of the fluents in Z1 [SHW96]. The context is given by occurrences of identical objects (variables Vars(Z) and constants Const(Z)) or chains of such occurrences, and it is defined up to some fixed depth. In effect, matching candidates that do not meet this context condition can be effortlessly pruned. In most cases, such pruning results in deterministic subsumption, thereby considerably extending the tractable class of states.

The computation of the context itself is dramatically affected by the depth parameter: the larger the depth, the longer the chains of object occurrences, and thus the more effort has to be devoted to building them. Unfortunately, LitCon does not scale very well to large depths [KRS06].

For example, consider the state

  Z = on(X, Y) ◦ on(Y, table) ◦ r(X) ◦ b(Y) ◦ h(X) ◦ h(Y) ◦ w(X) ◦ l(Y),

which can informally be read as: a block X is on the block Y, which is on the table, and both blocks enjoy various properties, like color (red r or blue b) or weight (heavy h or light l), and they can be wet (w). Z contains eight fluents and only three objects.
In LitCon, the context has to be computed for each of the eight fluents in order to keep track of all occurrences of identical objects. What if we were to compute the context for each object instead? In our running example, we would then need to perform the computation only three times. Herein, we propose a more efficient approach, referred to as ObjCon, for computing the contextual information, and we incorporate it into a new context-based θ-subsumption algorithm.

More formally, we build the object occurrence graph GZ = (V, E, ℓ) for a state Z, where the vertices are the objects of Z, denoted Obj(Z), and the edges are

  E = {(o1, π1, f, π2, o2) | Z contains f(t1, ..., tn) and o1 = tπ1 and o2 = tπ2}

with o1, o2 ∈ Obj(Z), f(t1, ..., tn) being a fluent and π1, π2 being the positions of the objects o1, o2 in f. The labeling function ℓ(o) = {f | Z contains f(o)} associates with each object o the unary fluent names f this object occurs in. The object occurrence graph for the state Z from our running example contains three vertices X, Y and table with the labels {r, h, w}, {b, h, l} and {}, respectively, and two edges (X, 1, on, 2, Y) and (Y, 1, on, 2, table).

The object context ObjCon(o, Z, d) of depth d > 0 is defined for each object o of a state Z as a set of chains of labels:

  ℓ(o) −π1¹·f¹·π2¹→ ℓ(o1) −π1²·f²·π2²→ ... −π1ᵈ·fᵈ·π2ᵈ→ ℓ(od) ∈ ObjCon(o, Z, d)

iff o −π1¹·f¹·π2¹→ o1 −π1²·f²·π2²→ ... −π1ᵈ·fᵈ·π2ᵈ→ od is a path in GZ of length d starting at o. In our running example, the object context ObjCon(X, Z, 1) of depth 1 of the variable X in Z contains the single chain {r, h, w} −1·on·2→ {b, h, l}.

Following the ideas of [SHW96], we define the embedding of object contexts for states Z1 and Z2, which serves as a pruning condition for reducing the space of matching candidates for Z1 and Z2. Briefly, let OC1 = ObjCon(o1, Z1, d) and OC2 = ObjCon(o2, Z2, d). Then OC1 is embedded in OC2, written OC1 ≼ OC2, iff for every chain of labels in OC1 there exists a chain of labels in OC2 which preserves the positions of the objects in the fluents and such that the labels of each object in OC1 are included in the respective labels in OC2, up to the depth d.

Finally, if ObjCon(X, Z1, d) ⋠ ObjCon(o, Z2, d), then there exists no θ such that (Z1 ◦ U)µθ =AC1 Z2, where µ = {X ↦ o} and U is a new variable of sort fluent term. In other words, a variable X in Z1 cannot be matched against an object o in Z2 within a globally consistent match if the variable's context cannot be embedded in the object's context. Therefore, substitutions that meet the above condition can be effortlessly pruned from the search space. For any context depth d > 0, context inclusion is an additional condition that reduces the number of candidates, so that more often at most one matching candidate remains.

Based on the idea of the object context, we describe a new θ-subsumption algorithm in Algorithm 1. Please note that this algorithm provides a complete set of all allowed substitutions, which is used later on for determining the set of all possible successors or predecessors of some state with respect to some action. Due to lack of space, we omit the algorithm for computing all cliques in a substitution graph; it can be found in [KRS06].
Algorithm 1 (ObjCon-alltheta)
Input: Two fluent terms Z1, Z2.
Output: A complete set of substitutions θ such that Z1 ⊒θ Z2.
1. Deterministically match as many fluents of Z1 as possible to fluents of Z2. Substitute Z1 with the substitution found. If some fluent of Z1 does not match any fluent of Z2, decide Z1 ⋣θ Z2.
2. ObjCon-based deterministically match as many fluents of Z1 as possible to fluents of Z2. Substitute Z1 with the substitution found. If some fluent of Z1 does not match any fluent of Z2, decide Z1 ⋣θ Z2.
3. Build the substitution graph (V, E) for Z1 and Z2 with nodes v = (µ, i) ∈ V, where µ is a matching candidate for Z1 and Z2, i.e., it matches the fluent at position i in Z1 to some fluent in Z2, and i ≥ 1 is referred to as the layer of v. Two nodes (µ1, i1) and (µ2, i2) are connected by an edge iff µ1µ2 = µ2µ1 and i1 ≠ i2. Delete all nodes (µ, i) with Xµ = o, for some X ∈ Vars(Z1) and o ∈ Obj(Z2), such that ObjCon(X, Z1, d) ⋠ ObjCon(o, Z2, d) for some d. Find all cliques of size |Z1| in (V, E).

6.2 Experimental Evaluation

Figure 4 depicts comparative timing results for the LitCon-based subsumption reasoner, referred to as AllTheta, and its ObjCon-based counterpart, referred to as FluCaP. The results were obtained using RedHat Linux running on a 2.4GHz Pentium IV machine with 2GB of RAM.

We demonstrate the advantages of exploiting the object-based context information on problems that stem from the colored Blocksworld and Pipesworld planning scenarios. The Pipesworld domain models the flow of oil-derivative liquids through pipeline segments connecting areas and is inspired by applications in the oil industry. Liquids are modeled as batches of a certain unit size. A segment must always contain a certain number of batches (i.e., it must always be full). Batches can be pushed into pipelines from either side, leading to the batch at the opposite end "falling" into the incident area. Batches have associated product types, and batches of certain types may never be adjacent to each other in a pipeline. Moreover, areas may have constraints on how many batches of a certain product type they can hold.

For each problem, 1000 subsumption tests were performed, with a time limit of 100 minutes. The results show that FluCaP scales better than AllTheta; this is best observed on the problems with fourteen, twenty, and thirty blocks. As the empirical results demonstrate, the optimal value of the depth parameter for Blocksworld and Pipesworld is four. The main reason for the computational gain of FluCaP is that it is less sensitive to the growth of the depth parameter: under the condition that the number of objects in a state is strictly less than the number of fluents, and with all other parameters fixed, the amount of object-based context information is strictly less than the amount of literal-based context information. Moreover, on the Pipesworld problems, FluCaP requires two orders of magnitude less time than AllTheta.
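To make the construction concrete, the following sketch is our own simplified rendering of the ObjCon idea for unary and binary fluents at depth 1 (it is not the implementation of [KRS06], and all names are illustrative): it builds the object occurrence graph and uses context embedding as a necessary condition for a variable-to-object match.

```python
# Simplified ObjCon sketch for unary and binary fluents at depth 1
# (our own illustration of the pruning idea, not the algorithm of
# [KRS06]). Variables are strings starting with "?".
def object_graph(Z):
    """labels[o]: unary fluent names of o; edges[o]: labelled steps
    (pos_of_o, fluent_name, pos_of_o2, o2) for binary fluents."""
    labels, edges = {}, {}
    for flu in Z:
        name, args = flu[0], flu[1:]
        for o in args:
            labels.setdefault(o, set())
            edges.setdefault(o, set())
        if len(args) == 1:
            labels[args[0]].add(name)
        elif len(args) == 2:
            o1, o2 = args
            edges[o1].add((1, name, 2, o2))
            edges[o2].add((2, name, 1, o1))
    return labels, edges

def context1(o, labels, edges):
    """Depth-1 object context: the label of o plus one labelled step."""
    return (frozenset(labels[o]),
            {(p1, f, p2, frozenset(labels[o2]))
             for p1, f, p2, o2 in edges[o]})

def may_match(var, obj, Z1, Z2):
    """Necessary condition for var ↦ obj: the depth-1 context of var
    in Z1 must embed into the depth-1 context of obj in Z2."""
    lab1, ch1 = context1(var, *object_graph(Z1))
    lab2, ch2 = context1(obj, *object_graph(Z2))
    return (lab1 <= lab2 and
            all(any(p1 == q1 and f == g and p2 == q2 and s1 <= s2
                    for q1, g, q2, s2 in ch2)
                for p1, f, p2, s1 in ch1))

Z1 = [("on", "?X", "?Y"), ("r", "?X"), ("h", "?X")]
Z2 = [("on", "a", "c"), ("on", "c", "t"),
      ("r", "a"), ("h", "a"), ("b", "c"), ("h", "c"), ("l", "c")]
print(may_match("?X", "a", Z1, Z2))   # True: the context embeds
print(may_match("?X", "c", Z1, Z2))   # False: c lacks the label r
```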
[Figure 4: Comparison timing results for FluCaP and AllTheta: the average time per subsumption test (in ms) plotted against the context depth (2–5) for Blocksworld problems with 10, 12, 14, 20 and 30 blocks and for Pipesworld problems with 4, 6, 8 and 10 areas. Please note that the plots for Pipesworld are shown in logscale; small differences in these plots may therefore indicate a substantial difference in runtimes.]

7 Related Work

We follow the symbolic dynamic programming (SDP) approach within the Situation Calculus (SC) of [BRP01] in using first-order state abstraction for FOMDPs. In the course of first-order value iteration, a state space may contain redundant abstract states that dramatically affect the algorithm's efficiency. In order to achieve computational savings, normalization must be performed to remove this redundancy. However, in the original work by [BRP01] this was done by hand; to the best of our knowledge, the preliminary implementation of the SDP approach within SC uses human-provided rewrite rules for logical simplification. In contrast, [HS04] have developed an automated normalization procedure for FOVI that brings a computational gain of several orders of magnitude. Another crucial difference is that our algorithm uses heuristic search to limit the number of states for which a policy is computed.

The ReBel algorithm by [KvOdR04] relates to LIFT-UP in that it also uses a representation language that is simpler than the Situation Calculus. This feature makes the state space normalization computationally feasible.

All the above algorithms can be classified as deductive approaches to solving FOMDPs. They can be characterized by the following features: (1) they are model-based, (2) they aim at exact solutions, and (3) logical reasoning methods are used to compute abstractions. We should note that FOVI aims at an exact solution for a FOMDP, whereas LIFT-UP, due to the heuristic search that avoids evaluating all states, seeks an approximate solution. Therefore, it is more appropriate to classify LIFT-UP as an approximate deductive approach to FOMDPs.

In another vein, there is some research on developing inductive approaches to solving FOMDPs, e.g., by [FYG03]. The authors propose the approximate policy iteration (API) algorithm, where they replace the use of cost-function approximations as policy representations in API with direct, compact state-action mappings, and use a standard relational learner to learn these mappings. A recent approach by [GT04] proposes an inductive policy construction algorithm that strikes a middle ground between deductive and inductive techniques.

8 Conclusions

We have proposed a new approach that combines heuristic search and first-order state abstraction for solving first-order MDPs more efficiently.
In contrast to existing systems, which propositionalize the decision problem at the outset of any solution attempt, we perform lifted reasoning on the first-order structure of an MDP directly. However, plenty remains to be done. For example, we are interested in the question to what extent the optimization techniques applied in modern propositional planners can be combined with first-order state abstraction.

Acknowledgements

We thank the anonymous reviewers for their comments on previous versions of this paper. We also thank Zhengzhu Feng for fruitful discussions and for providing us with the executable of the symbolic LAO∗ planner. Olga Skvortsova was supported by a grant within the Graduate Programme GRK 334 "Specification of discrete processes and systems of processes by operational models and logics" under the auspices of the Deutsche Forschungsgemeinschaft (DFG).

References

[BBS95] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.

[BDH99] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.

[Bel57] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1957.

[BRP01] C. Boutilier, R. Reiter, and B. Price. Symbolic dynamic programming for first-order MDPs. In Bernhard Nebel, editor, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'2001), pages 690–700. Morgan Kaufmann, 2001.

[DKKN95] T. Dean, L. Kaelbling, J. Kirman, and A. Nicholson. Planning under time constraints in stochastic domains. Artificial Intelligence, 76:35–74, 1995.

[FH02] Z. Feng and E. Hansen. Symbolic heuristic search for factored Markov decision processes. In R. Dechter, M. Kearns, and R. Sutton, editors, Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI'2002), pages 455–460, Edmonton, Canada, 2002. AAAI Press.

[FYG03] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias. In S. Thrun, L. Saul, and B. Schölkopf, editors, Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems (NIPS'2003), Vancouver, Canada, 2003. MIT Press.

[GT04] C. Gretton and S. Thiebaux. Exploiting first-order regression in inductive policy selection. In M. Chickering and J. Halpern, editors, Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI'2004), Banff, Canada, July 2004. Morgan Kaufmann.

[HS90] S. Hölldobler and J. Schneeberger. A new deductive approach to planning. New Generation Computing, 8:225–244, 1990.

[HS04] S. Hölldobler and O. Skvortsova. A logic-based approach to dynamic programming. In Proceedings of the Workshop on "Learning and Planning in Markov Processes – Advances and Challenges" at the Nineteenth National Conference on Artificial Intelligence (AAAI'04), pages 31–36, San Jose, CA, July 2004. AAAI Press.

[HZ01] E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35–62, 2001.

[KL94] J.-U. Kietz and M. Lübbe. An efficient subsumption algorithm for inductive logic programming. In Proceedings of the Eleventh International Conference on Machine Learning, pages 130–138, 1994.

[KN86] D. Kapur and P. Narendran. NP-completeness of the set unification and matching problems.
In J. H. Siekmann, editor, Proceedings of the Conference on Automated Deduction, volume 230 of Lecture Notes in Computer Science, pages 489–495. Springer, Berlin, 1986.

[KRS06] E. Karabaev, G. Rammé, and O. Skvortsova. Efficient symbolic reasoning for first-order MDPs. In Proceedings of the Workshop on "Planning, Learning and Monitoring with Uncertainty and Dynamic Worlds" at the Seventeenth European Conference on Artificial Intelligence (ECAI'2006), Riva del Garda, Italy, 2006. To appear.

[KvOdR04] K. Kersting, M. van Otterlo, and L. de Raedt. Bellman goes relational. In C. E. Brodley, editor, Proceedings of the Twenty-First International Conference on Machine Learning (ICML'2004), pages 465–472, Banff, Canada, July 2004. ACM.

[Put94] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

[Rei91] R. Reiter. The frame problem in the Situation Calculus: A simple solution (sometimes) and a completeness result for goal regression. In Vladimir Lifschitz, editor, Artificial Intelligence and Mathematical Theory of Computation: Papers in Honor of John McCarthy, pages 359–380. Academic Press, San Diego, CA, 1991.

[Rob65] J. A. Robinson. A machine-oriented logic based on the resolution principle. Journal of the Association for Computing Machinery, 12(1):23–41, 1965.

[SAHB00] R. St-Aubin, J. Hoey, and C. Boutilier. APRICODD: Approximate policy construction using decision diagrams. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Proceedings of the Fourteenth Annual Conference on Neural Information Processing Systems (NIPS'2000), pages 1089–1095, Denver, 2000. MIT Press.

[SHW96] T. Scheffer, R. Herbrich, and F. Wysotzki. Efficient θ-subsumption based on graph algorithms. In Proceedings of the Sixth International Workshop on Inductive Logic Programming, volume 1314 of LNAI, pages 212–228, Berlin, August 1996.

[Thi98] M. Thielscher. Introduction to the fluent calculus. Electronic Transactions on Artificial Intelligence, 2(3-4):179–192, 1998.

[YLWA05] H. L. S. Younes, M. L. Littman, D. Weissman, and J. Asmuth. The first probabilistic track of the International Planning Competition. Journal of Artificial Intelligence Research, 24:851–887, 2005.