Determining a proper initial configuration of Red-Black planning by machine learning

Otakar Trunda and Roman Barták
Charles University in Prague, Faculty of Mathematics and Physics, email: otakar.trunda@mff.cuni.cz, roman.bartak@mff.cuni.cz

1 INTRODUCTION

Planning deals with finding a sequence of actions that transforms the world from a given initial state to a state that satisfies a certain goal condition [8]. For the purposes of this paper we can define a planning problem simply as a state-transition system where states are the world states and transitions correspond to applications of actions. States are defined by values of state variables. Let X be the set of state variables, where each variable x_i has a finite domain D_i of its possible values. A state s is then a mapping from X to ∪_i D_i, s : X → ∪_i D_i, where ∀i : s(x_i) ∈ D_i.

The state space has the form of the Cartesian product of the variables' domains: Space = ∏_i D_i. Every state s ∈ Space has an associated (possibly empty) set of successor states, denoted succ(s); every t ∈ succ(s) is labeled by the action that transforms s into t (i.e. performing actions changes the values of state variables). The task is to find a path p in this state-transition system that leads from a given initial state to some state satisfying a goal condition (a goal state): p = (s_0, s_1, ..., s_n), where s_0 is the initial state, s_n is some goal state, and ∀ 0 ≤ i < n : s_{i+1} ∈ succ(s_i). Such a path is called a solution plan. The goal is to reach a state where some variables have specified values.

One of the most promising approaches to solving the planning problem (based on the results of several International Planning Competitions [2]) is heuristic-guided forward search, mostly in the form of A* or hill-climbing. These approaches make use of a heuristic estimate during search, and the accuracy of the heuristic estimator has a great impact on performance. Hence designing a powerful and easy-to-compute heuristic is of paramount importance.

Heuristics are usually based on relaxations of the problem. When estimating the quality of the best solution, we relax the problem by ignoring some constraints (making the problem easier), then solve the relaxed problem and use the quality of that solution as a lower bound on the quality of the best solution to the original problem. In planning, this principle is represented by the well-known delete relaxation heuristic and its variants [8, 3, 4]. Heuristics based on this principle often work well, but in some situations they greatly underestimate the real value, making them inaccurate (see [6] for example).

Delete relaxation allows the state variables to hold several values simultaneously, so a relaxed state subsumes several ordinary states. Furthermore, performing actions (i.e. making transitions) only adds new elements to the set of values that each variable currently holds (it never removes any value). Hence the set of ordinary states that the relaxed state subsumes monotonically increases along every path. A path is a relaxed solution plan if it leads to a relaxed state which subsumes some goal state. The length of the relaxed plan is then used to estimate the length of the real plan.
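To make the relaxation concrete, the following minimal Python sketch computes relaxed reachability for a problem encoded as variable-value assignments and returns the number of parallel action layers needed before the relaxed state subsumes the goal. The encoding (dictionaries for states, pairs of partial assignments for actions) and the function name are our own assumptions for illustration; practical delete-relaxation heuristics such as the FF heuristic extract an explicit relaxed plan rather than merely counting layers.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed encoding (not from the paper): a state is a total assignment of
# values to variables, an action is a pair (preconditions, effects) of
# partial assignments.
Assignment = Dict[str, str]
Action = Tuple[Assignment, Assignment]

def relaxed_layers(init: Assignment, goal: Assignment,
                   actions: List[Action]) -> Optional[int]:
    """Number of parallel action layers until the relaxed state subsumes the
    goal; None if the goal is unreachable even under the relaxation."""
    # Relaxed state: every variable holds the *set* of values reached so far.
    relaxed: Dict[str, Set[str]] = {v: {val} for v, val in init.items()}
    layers = 0
    while not all(val in relaxed.get(v, set()) for v, val in goal.items()):
        new_facts = []
        for pre, eff in actions:
            # An action is relaxed-applicable if each precondition value has
            # already been reached by the corresponding variable.
            if all(val in relaxed.get(v, set()) for v, val in pre.items()):
                for v, val in eff.items():
                    if val not in relaxed.get(v, set()):
                        new_facts.append((v, val))
        if not new_facts:         # fixpoint: the goal will never be subsumed
            return None
        for v, val in new_facts:  # effects only add values, never delete them
            relaxed.setdefault(v, set()).add(val)
        layers += 1
    return layers
```

The monotone growth of the sets in `relaxed` is exactly the property described above: once a value is reached, it is never lost along the relaxed path.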
2 RED-BLACK PLANNING

Red-Black planning is a new approach to heuristic design which generalizes the delete relaxation and compensates for many of its shortcomings with a reasonable computational effort [7, 6, 5]. It divides the set of state variables into two disjoint subsets, Red and Black, which are treated differently during planning. The Red variables are treated as in the delete relaxation, while the Black variables are not relaxed. If all variables are Red, then the heuristic works the same as the delete relaxation.

The authors showed that by a proper selection of Red variables we can reduce the underestimation (in most cases) and still keep the method polynomial. They also observed that the selection of Red variables has a great impact on the overall performance: while a proper selection leads to good performance, a poor selection degrades it. Selecting the proper variables, however, appears to be a hard problem.

The authors performed several tests with intuitive and counter-intuitive variable selection methods (where the intuitive method relaxes the least important variables, while the counter-intuitive method relaxes the most important ones). It turned out that the counter-intuitive method often beats the intuitive one (with respect to the time required for solving the problem), which makes the problem quite unpredictable. This led the authors to hypothesize that no simple and efficient method for selecting the variables can be found.

3 OUR METHOD

We believe that different domains require different ways of selecting the variables. We propose a method based on machine learning that works as follows: first it creates a set of small sub-problems of the original problem, then it determines the proper variable selection for these sub-problems (by enumerating all possibilities), and finally it uses the solutions of the sub-problems to derive a solution for the original problem.

3.1 Creating samples

We create the sub-problems by selecting small subsets of variables and restricting the original problem to these variables only. The restriction has the form of a projection which preserves paths, i.e. if there is a path from s to t in the original state-transition system, then there is a path from restriction(s) to restriction(t) in the new system. Of course, new paths may emerge during the restriction that were not present before.

Formally, let A be a planning problem as defined earlier, X its state variables, and Space = ∏_i D_i its state space. Then for every P ⊆ X, called a pattern, and every state s ∈ Space we define the restriction of s to P as s^P : P → ∪_i D_i, where ∀ x_i ∈ P : s^P(x_i) = s(x_i). The restriction of A to a pattern P is a planning problem A^P with state variables P, state space Space^P = ∏_{i | x_i ∈ P} D_i, and for s, t ∈ Space^P : s ∈ succ(t) if and only if there exist u, v ∈ Space such that s = u^P, t = v^P and u ∈ succ(v). The initial state and goal states of the restricted problem are restrictions of the originals.

In each sub-problem induced by a pattern, we create samples by enumerating all ways of selecting the Red variables. A sample thus consists of a pair (pattern, selected Red variables).
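The projection itself is straightforward to implement. The sketch below reuses the assumed encoding from the earlier example and is only our illustration of the construction; the helper names are hypothetical and not part of the paper.

```python
from typing import Dict, List, Set, Tuple

Assignment = Dict[str, str]
Action = Tuple[Assignment, Assignment]   # (preconditions, effects)

def restrict_assignment(a: Assignment, pattern: Set[str]) -> Assignment:
    """Project a (partial) assignment onto the variables in the pattern."""
    return {v: val for v, val in a.items() if v in pattern}

def restrict_problem(init: Assignment, goal: Assignment, actions: List[Action],
                     pattern: Set[str]) -> Tuple[Assignment, Assignment, List[Action]]:
    """Restriction of a planning problem to a pattern.  Preconditions on
    variables outside the pattern are dropped (some full state can always
    satisfy them), matching the over-approximated transition relation of the
    definition above.  Actions whose restricted effect is empty only produce
    self-loops, so they are omitted."""
    restricted_actions = []
    for pre, eff in actions:
        r_eff = restrict_assignment(eff, pattern)
        if r_eff:
            restricted_actions.append((restrict_assignment(pre, pattern), r_eff))
    return (restrict_assignment(init, pattern),
            restrict_assignment(goal, pattern),
            restricted_actions)
```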
3.2 Evaluating samples

Let X be the set of state variables of the original planning problem. A sample q is given in the form q = (P_q, R_q), where P_q is the pattern and R_q is the set of selected Red variables, R_q ⊆ P_q ⊆ X. To evaluate the sample q, we chose the following procedure:

1. Restrict the original problem to the pattern P_q.
2. Solve the restricted problem by A* with the Red-Black heuristic using R_q as the set of Red variables.
3. Measure the time (in seconds) required to perform step 2 and use it to evaluate the sample. (Val(q) denotes the value of a sample q.)

We decided to use the run-time to evaluate the sample rather than other characteristics such as the number of heuristic calls or expanded nodes. We believe that such characteristics would bias the selection in favor of large patterns and small Red sets, since such a combination leads to a very accurate heuristic. However, such a heuristic might take a long time to compute and probably would not be the best alternative. Since the run-time of the whole process is the criterion we want to optimize, it seems appropriate to use it to evaluate samples as well.

3.3 Learning from samples

After evaluating enough samples, we have to select the Red variables for the original problem. In our preliminary experiments we used the following simple procedure, but we believe that this phase can still be improved by a more sophisticated approach.

1. Given the set of samples Q, where a sample is q = (P_q, R_q), divide the samples into groups by the pattern they use: Q_P = {q ∈ Q | P_q = P}.
2. Select the best sample in each group Q_P (the one with the lowest evaluation), and denote its Red set by Best_P.
3. For each state variable count how many times it appears in some Best set: val(x_i) = |{Best_P | x_i ∈ Best_P}|.
4. Select the variables with the highest evaluation.

In step four, the number of variables to select could be a fixed constant or a fixed ratio, but we chose a different approach. Suppose there are n state variables. We sort the variables non-increasingly by their evaluation: x_1, x_2, ..., x_n, where val(x_i) ≥ val(x_{i+1}). We add the first variable and then keep adding more until val(x_i) − val(x_{i+1}) > (val(x_1) − val(x_n)) / n. This stopping criterion should find the gap between the good variables and the bad variables. We intend to test other selection policies as well.
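The following sketch puts steps one to four and the gap-based cutoff together, again under our own assumed data layout: a sample is a (pattern, Red set, measured run-time) triple, which the paper does not prescribe. It implements the simple unweighted count of step three; the weighted generalizations discussed next would replace that count.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

# A sample: (pattern, chosen Red variables, measured run-time in seconds).
# This triple layout is our own assumption; at least one sample is expected.
Sample = Tuple[FrozenSet[str], FrozenSet[str], float]

def select_red_variables(samples: List[Sample]) -> List[str]:
    """Steps 1-4 of the learning procedure with the gap-based cutoff."""
    # Step 1: group the samples by the pattern they use.
    groups: Dict[FrozenSet[str], List[Sample]] = defaultdict(list)
    for sample in samples:
        groups[sample[0]].append(sample)

    # Step 2: in each group, the best sample has the lowest run-time.
    best_sets = [min(group, key=lambda s: s[2])[1] for group in groups.values()]

    # Step 3: val(x) = number of Best sets in which variable x appears.
    variables = sorted(set().union(*groups.keys()))   # variables seen in any pattern
    val = {x: sum(1 for best in best_sets if x in best) for x in variables}

    # Step 4: sort non-increasingly; keep adding variables until the drop
    # val(x_i) - val(x_{i+1}) exceeds the average gap (val(x_1) - val(x_n)) / n.
    ranked = sorted(variables, key=lambda x: val[x], reverse=True)
    n = len(ranked)
    threshold = (val[ranked[0]] - val[ranked[-1]]) / n
    selected = [ranked[0]]
    for i in range(n - 1):
        if val[ranked[i]] - val[ranked[i + 1]] > threshold:
            break
        selected.append(ranked[i + 1])
    return selected
```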
Step three can be generalized by introducing weights for the Best sets. Currently, each Best set has a weight of 1, but larger patterns give us more information since they are closer to the original problem. Step three can then be modified to val(x_i) = Σ_{Best : x_i ∈ Best} w(Best), where w is a weight function. We used w(Best) = |Best|, but other functions are also possible.

Finally, step two can be modified to work with more samples than just the best one. Imagine a variable which is rarely in the best sample but often in the second-best one; it would still be a good candidate to pick. In step three we would then average the evaluations of all samples that contain the variable x_i, possibly weighted according to the size of the pattern they use. This modification should lead to more accurate results, but it takes more time to compute. It is therefore not yet clear whether it will improve the overall performance.

4 CONCLUSIONS AND FUTURE WORK

We present the parameter-learning method in a very simple form, and many issues remain unresolved. Preliminary experiments show promising results, but the method still needs to be adjusted and properly tested on a larger set of planning domains.

One part we have not addressed yet is the selection of patterns during the creation of samples. Unlike in typical machine learning applications, here we can decide which samples to use for learning. Patterns should be selected iteratively, and the selection should be based on previous results while supporting both exploration and exploitation. We intend to make use of some Monte-Carlo technique, possibly Monte-Carlo Tree Search [1]. The method is guaranteed to converge to the optimal solution if patterns are chosen incrementally (as the size of the pattern grows, the sub-problem converges to the original problem). The speed of convergence, however, has yet to be determined for various domains.

The proposed method of learning from pattern-induced sub-problems is not bound to the Red-Black planning heuristic only; it can be used to gain information about other features of the planning problem as well. Such information might then help to improve various search methods.

ACKNOWLEDGEMENTS

The research is supported by the Grant Agency of Charles University under contract no. 390214 and by the SVV project number 260 104.

REFERENCES

[1] C. B. Browne et al., 'A survey of Monte Carlo tree search methods', IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1-43, (March 2012).
[2] ICAPS Competitions. http://ipc.icaps-conference.org, June 2014.
[3] Jörg Hoffmann, 'Where "ignoring delete lists" works: Local search topology in planning benchmarks', J. Artif. Int. Res., 24(1), 685-758, (November 2005).
[4] Jörg Hoffmann, 'Where Ignoring Delete Lists Works, Part II: Causal Graphs', in 21st International Conference on Automated Planning and Scheduling, Freiburg, Germany, (2011).
[5] Michael Katz and Jörg Hoffmann, 'Red-black relaxed plan heuristics reloaded', in SOCS, eds., Malte Helmert and Gabriele Röger, AAAI Press, (2013).
[6] Michael Katz, Jörg Hoffmann, and Carmel Domshlak, 'Red-black relaxed plan heuristics', in AAAI'13, (2013).
[7] Michael Katz, Jörg Hoffmann, and Carmel Domshlak, 'Who said we need to relax all variables?', (2013).
[8] Dana Nau, Malik Ghallab, and Paolo Traverso, Automated Planning: Theory & Practice, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2004.