Determining a proper initial configuration of Red-Black planning by machine learning

Otakar Trunda and Roman Barták
Charles University in Prague, Faculty of Mathematics and Physics, email: otakar.trunda@mff.cuni.cz, roman.bartak@mff.cuni.cz

1 INTRODUCTION

Planning deals with finding a sequence of actions that transforms the world from a given initial state to a state that satisfies a certain goal condition [8]. For the purposes of this paper we can define a planning problem simply as a state-transition system where states are the world states and transitions correspond to applications of actions. States are defined by values of state variables. Let X be the set of state variables, where each variable x_i has a finite domain D_i of its possible values. A state s is then a mapping from X to ∪_i D_i, s : X → ∪_i D_i, where ∀i : s(x_i) ∈ D_i.

The state space has the form of the Cartesian product of the variables' domains: Space = ∏_i D_i. Every state s ∈ Space has an associated (possibly empty) set of successor states, denoted succ(s); every t ∈ succ(s) is labeled by the action that transforms s into t (i.e. performing actions changes the values of state variables). The task is to find a path p in this state-transition system that leads from a given initial state to some state satisfying a goal condition (a goal state): p = (s_0, s_1, ..., s_n), where s_0 is the initial state, s_n is some goal state, and ∀ 0 ≤ i < n : s_{i+1} ∈ succ(s_i). Such a path is called a solution plan. The goal is to reach a state where some variables have specified values.

One of the most promising approaches to solving the planning problem (based on the results of several International Planning Competitions [2]) is heuristic-guided forward search, mostly in the form of A* or hill-climbing. These approaches make use of a heuristic estimate during search, and the accuracy of the heuristic estimator has a great impact on performance. Hence designing a powerful and easy-to-compute heuristic is of paramount importance.

Heuristics are usually based on relaxations of the problem. When estimating the quality of the best solution, we relax the problem by ignoring some constraints (making the problem easier), then solve the relaxed problem and use the quality of that solution as a lower bound on the quality of the best solution to the original problem. In planning, this principle is represented by the well-known delete relaxation heuristic and its variants [8, 3, 4]. Heuristics based on this principle often work well, but in some situations they greatly underestimate the real value, making them inaccurate (see [6] for example).

Delete relaxation allows the state variables to hold several values simultaneously, so a relaxed state subsumes several ordinary states. Furthermore, performing actions (i.e. making transitions) only adds new elements to the set of values that each variable currently holds (it never removes any value). Hence the set of ordinary states that the relaxed state subsumes monotonically increases along every path. A path is a relaxed solution plan if it leads to a relaxed state which subsumes some goal state. The length of the relaxed plan is then used to estimate the length of the real plan.
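To make the relaxation concrete, the following minimal Python sketch computes relaxed reachability for a problem encoded as variable-value assignments and returns the number of parallel action layers needed before the relaxed state subsumes the goal. The encoding (dictionaries for states, pairs of partial assignments for actions) and the function name are our own assumptions for illustration; practical delete-relaxation heuristics such as the FF heuristic extract an explicit relaxed plan rather than merely counting layers.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed encoding (not from the paper): a state is a total assignment of
# values to variables, an action is a pair (preconditions, effects) of
# partial assignments.
Assignment = Dict[str, str]
Action = Tuple[Assignment, Assignment]

def relaxed_layers(init: Assignment, goal: Assignment,
                   actions: List[Action]) -> Optional[int]:
    """Number of parallel action layers until the relaxed state subsumes the
    goal; None if the goal is unreachable even under the relaxation."""
    # Relaxed state: every variable holds the *set* of values reached so far.
    relaxed: Dict[str, Set[str]] = {v: {val} for v, val in init.items()}
    layers = 0
    while not all(val in relaxed.get(v, set()) for v, val in goal.items()):
        new_facts = []
        for pre, eff in actions:
            # An action is relaxed-applicable if each precondition value has
            # already been reached by the corresponding variable.
            if all(val in relaxed.get(v, set()) for v, val in pre.items()):
                for v, val in eff.items():
                    if val not in relaxed.get(v, set()):
                        new_facts.append((v, val))
        if not new_facts:         # fixpoint: the goal will never be subsumed
            return None
        for v, val in new_facts:  # effects only add values, never delete them
            relaxed.setdefault(v, set()).add(val)
        layers += 1
    return layers
```

The monotone growth of the sets in `relaxed` is exactly the property described above: once a value is reached, it is never lost along the relaxed path.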
2 RED-BLACK PLANNING

Red-Black planning is a new approach to heuristic design which generalizes the delete relaxation and compensates for many of its shortcomings with a reasonable computational effort [7, 6, 5]. It divides the set of state variables into two disjoint subsets, Red and Black, which are treated differently during planning. The Red variables are treated as in the delete relaxation, while the Black variables are not relaxed. If all variables are Red, then the heuristic works the same as the delete relaxation.

The authors showed that by a proper selection of Red variables we can reduce the underestimation (in most cases) and still keep the method polynomial. They also observed that the selection of Red variables has a great impact on the overall performance: while a proper selection leads to good performance, a poor selection degrades it. Selecting the proper variables, however, appears to be a hard problem.

The authors performed several tests with intuitive and counter-intuitive variable selection methods (where the intuitive method relaxes the least important variables, while the counter-intuitive method relaxes the most important ones). It turned out that the counter-intuitive method often beats the intuitive one (with respect to the time required for solving the problem), which makes the problem quite unpredictable. This led the authors to hypothesize that no simple and efficient method for selecting the variables can be found.

3 OUR METHOD

We believe that different domains require different ways of selecting the variables. We propose a method based on machine learning that works as follows: first it creates a set of small sub-problems of the original problem, then it determines the proper variable selection for these sub-problems (by enumerating all possibilities), and finally it uses the solutions of the sub-problems to derive a solution for the original problem.

3.1 Creating samples

We create the sub-problems by selecting small subsets of variables and restricting the original problem to these variables only. The restriction has the form of a projection which preserves paths, i.e. if there is a path from s to t in the original state-transition system, then there is a path from restriction(s) to restriction(t) in the new system. Of course, new paths may emerge during the restriction that were not present before.

Formally, let A be a planning problem as defined earlier, X its state variables, and Space = ∏_i D_i its state space. Then for every P ⊆ X, called a pattern, and every state s ∈ Space we define the restriction of s to P as s^P : P → ∪_i D_i, where ∀ x_i ∈ P : s^P(x_i) = s(x_i). The restriction of A to a pattern P is a planning problem A^P with state variables P, state space Space^P = ∏_{i | x_i ∈ P} D_i, and for s, t ∈ Space^P : s ∈ succ(t) if and only if there exist u, v ∈ Space such that s = u^P, t = v^P and u ∈ succ(v). The initial state and goal states of the restricted problem are restrictions of the originals.

In each sub-problem induced by a pattern, we create samples by enumerating all ways of selecting the Red variables. A sample thus consists of a pair (pattern, selected Red variables).
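The projection itself is straightforward to implement. The sketch below reuses the assumed encoding from the earlier example and is only our illustration of the construction; the helper names are hypothetical and not part of the paper.

```python
from typing import Dict, List, Set, Tuple

Assignment = Dict[str, str]
Action = Tuple[Assignment, Assignment]   # (preconditions, effects)

def restrict_assignment(a: Assignment, pattern: Set[str]) -> Assignment:
    """Project a (partial) assignment onto the variables in the pattern."""
    return {v: val for v, val in a.items() if v in pattern}

def restrict_problem(init: Assignment, goal: Assignment, actions: List[Action],
                     pattern: Set[str]) -> Tuple[Assignment, Assignment, List[Action]]:
    """Restriction of a planning problem to a pattern.  Preconditions on
    variables outside the pattern are dropped (some full state can always
    satisfy them), matching the over-approximated transition relation of the
    definition above.  Actions whose restricted effect is empty only produce
    self-loops, so they are omitted."""
    restricted_actions = []
    for pre, eff in actions:
        r_eff = restrict_assignment(eff, pattern)
        if r_eff:
            restricted_actions.append((restrict_assignment(pre, pattern), r_eff))
    return (restrict_assignment(init, pattern),
            restrict_assignment(goal, pattern),
            restricted_actions)
```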
3.2 Evaluating samples

Let X be the set of state variables of the original planning problem. A sample q is given in the form q = (P_q, R_q), where P_q is the pattern and R_q is the set of selected Red variables, R_q ⊆ P_q ⊆ X. To evaluate the sample q, we chose the following procedure:

1. Restrict the original problem to the pattern P_q.
2. Solve the restricted problem by A* with the Red-Black heuristic using R_q as the set of Red variables.
3. Measure the time (in seconds) required to perform step 2 and use it to evaluate the sample. (Val(q) denotes the value of a sample q.)

We decided to use the run-time to evaluate the sample rather than other characteristics such as the number of heuristic calls or expanded nodes. We believe that such characteristics would bias the selection in favor of large patterns and small Red sets, since such a combination leads to a very accurate heuristic. However, such a heuristic might take a long time to compute and probably would not be the best alternative. Since the run-time of the whole process is the criterion we want to optimize, it seems appropriate to use it to evaluate samples as well.

3.3 Learning from samples

After evaluating enough samples, we have to select the Red variables for the original problem. In our preliminary experiments we used the following simple procedure, but we believe that this phase can still be improved by a more sophisticated approach.

1. Given the set of samples Q, where a sample is q = (P_q, R_q), divide the samples into groups by the pattern they use: Q_P = {q ∈ Q | P_q = P}.
2. Select the best sample in each group Q_P (the one with the lowest evaluation), and denote its Red set by Best_P.
3. For each state variable count how many times it appears in some Best set: val(x_i) = |{Best_P | x_i ∈ Best_P}|.
4. Select the variables with the highest evaluation.

In step four, the number of variables to select could be a fixed constant or a fixed ratio, but we chose a different approach. Suppose there are n state variables. We sort the variables non-increasingly by their evaluation: x_1, x_2, ..., x_n, where val(x_i) ≥ val(x_{i+1}). We add the first variable and then keep adding more until val(x_i) − val(x_{i+1}) > (val(x_1) − val(x_n)) / n. This stopping criterion should find the gap between the good variables and the bad variables. We intend to test other selection policies as well.
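The following sketch puts steps one to four and the gap-based cutoff together, again under our own assumed data layout: a sample is a (pattern, Red set, measured run-time) triple, which the paper does not prescribe. It implements the simple unweighted count of step three; the weighted generalizations discussed next would replace that count.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

# A sample: (pattern, chosen Red variables, measured run-time in seconds).
# This triple layout is our own assumption; at least one sample is expected.
Sample = Tuple[FrozenSet[str], FrozenSet[str], float]

def select_red_variables(samples: List[Sample]) -> List[str]:
    """Steps 1-4 of the learning procedure with the gap-based cutoff."""
    # Step 1: group the samples by the pattern they use.
    groups: Dict[FrozenSet[str], List[Sample]] = defaultdict(list)
    for sample in samples:
        groups[sample[0]].append(sample)

    # Step 2: in each group, the best sample has the lowest run-time.
    best_sets = [min(group, key=lambda s: s[2])[1] for group in groups.values()]

    # Step 3: val(x) = number of Best sets in which variable x appears.
    variables = sorted(set().union(*groups.keys()))   # variables seen in any pattern
    val = {x: sum(1 for best in best_sets if x in best) for x in variables}

    # Step 4: sort non-increasingly; keep adding variables until the drop
    # val(x_i) - val(x_{i+1}) exceeds the average gap (val(x_1) - val(x_n)) / n.
    ranked = sorted(variables, key=lambda x: val[x], reverse=True)
    n = len(ranked)
    threshold = (val[ranked[0]] - val[ranked[-1]]) / n
    selected = [ranked[0]]
    for i in range(n - 1):
        if val[ranked[i]] - val[ranked[i + 1]] > threshold:
            break
        selected.append(ranked[i + 1])
    return selected
```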
Step three can be generalized by introducing weights for the Best sets. Currently, each Best set has a weight of 1, but larger patterns give us more information since they are closer to the original problem. Step three can then be modified to val(x_i) = Σ_{Best : x_i ∈ Best} w(Best), where w is a weight function. We used w(Best) = |Best|, but other functions are also possible.

Finally, step two can be modified to work with more samples than just the best one. Imagine a variable which is rarely in the best sample but often in the second-best one; it would still be a good candidate to pick. In step three we would then average the evaluations of all samples that contain the variable x_i, possibly weighted according to the size of the pattern they use. This modification should lead to more accurate results, but it takes more time to compute. It is therefore not yet clear whether it will improve the overall performance.

4 CONCLUSIONS AND FUTURE WORK

We present the parameter-learning method in a very simple form, and many issues remain unresolved. Preliminary experiments show promising results, but the method still needs to be adjusted and properly tested on a larger set of planning domains.

One part we have not addressed yet is the selection of patterns during the creation of samples. Unlike in typical machine learning applications, here we can decide which samples to use for learning. Patterns should be selected iteratively, and the selection should be based on previous results while supporting both exploration and exploitation. We intend to make use of some Monte-Carlo technique, possibly Monte-Carlo Tree Search [1]. The method is guaranteed to converge to the optimal solution if patterns are chosen incrementally (as the size of the pattern grows, the sub-problem converges to the original problem). The speed of convergence, however, has yet to be determined for various domains.

The proposed method of learning from pattern-induced sub-problems is not bound to the Red-Black planning heuristic only; it can be used to gain information about other features of the planning problem as well. Such information might then help to improve various search methods.

ACKNOWLEDGEMENTS

The research is supported by the Grant Agency of Charles University under contract no. 390214 and by the SVV project number 260 104.

REFERENCES

[1] C. B. Browne et al., 'A survey of Monte Carlo tree search methods', IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1-43, (March 2012).
[2] ICAPS Competitions. http://ipc.icaps-conference.org, June 2014.
[3] Jörg Hoffmann, 'Where "ignoring delete lists" works: Local search topology in planning benchmarks', J. Artif. Int. Res., 24(1), 685-758, (November 2005).
[4] Jörg Hoffmann, 'Where Ignoring Delete Lists Works, Part II: Causal Graphs', in 21st International Conference on Automated Planning and Scheduling, Freiburg, Germany, (2011).
[5] Michael Katz and Jörg Hoffmann, 'Red-black relaxed plan heuristics reloaded', in SOCS, eds., Malte Helmert and Gabriele Röger, AAAI Press, (2013).
[6] Michael Katz, Jörg Hoffmann, and Carmel Domshlak, 'Red-black relaxed plan heuristics', in AAAI'13, (2013).
[7] Michael Katz, Jörg Hoffmann, and Carmel Domshlak, 'Who said we need to relax all variables?', (2013).
[8] Dana Nau, Malik Ghallab, and Paolo Traverso, Automated Planning: Theory & Practice, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2004.