<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Gradient-based Optimization for Planning with Deep Q-Networks in Parametrized Action Spaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonas Ehrhardt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Schmidt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>René Heesch</string-name>
          <email>rene.heesch@hsu-hh.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oliver Niggemann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Planning, Parametrized Markov Decision Processes, Offline Reinforcement Learning, Deep Q-Networks</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CAIPI'25: ECAI Workshop on AI-based Planning for Complex Real-World Applications</institution>
          ,
          <addr-line>Bologna, Italy, 2025</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HSU-AI Institute for Artificial Intelligence, Helmut-Schmidt-University</institution>
          ,
          <addr-line>Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Automation, Helmut-Schmidt-University</institution>
          ,
          <addr-line>Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Many real-world planning problems feature parametrized action spaces, where each action is augmented by continuous parameters. Though deep Reinforcement Learning has achieved remarkable results in solving control and planning problems, it falls short at two central challenges of real-world planning problems with parametrized action spaces: (i) There is an infinite number of action-parameter candidates in every step of solving a planning problem, (ii) interacting with the planning domain is typically prohibitively expensive and available recordings from the planning domain are sparse. To counter these challenges, we introduce our novel Goal-Conditioned Model-Augmented Deep Q-Networks algorithm (GCM-DQN). The intuition behind GCM-DQN is to use gradient-based optimization on the surface of the Q-Function, instead of blunt estimators, to estimate the optimal parameters of an action in a state. In combination with a goal-conditioning of the DQN, and a state transition model, this allows us to find plans for planning problems in planning domains with parametrized action spaces. Our algorithm outperforms state-of-the-art Reinforcement Learning algorithms for planning in parametrized action spaces.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Planning, the combinatorial problem of finding a sequence of actions that transitions an initial state
into a goal state, is a fundamental problem in many real-world applications [1, 2]. Conventional
planning and Reinforcement Learning methods typically feature either purely discrete action spaces
(i.e. a finite set of actions, like moving up, down, left, or right in a grid world) or purely continuous
action spaces (i.e. an infinite set of actions, like controlling the acceleration of a cart on a slope) [2, 3].
However, many real-world problems feature parametrized action spaces. In a parametrized action space,
a finite set of actions is augmented by real-valued parameters, which influence the effects of the actions
[
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. During planning in parametrized action spaces, a planner hence must not only select from the
finite action set, but also real-valued parameters, to reach its goal [3]. For example, consider injection
molding, where there is a finite set of actions (e.g. close mold, inject, hold, cool, eject), which are each
augmented by real-valued parameters (e.g. heating/cooling energy, velocity, pressure, etc.). Both the
combinatorial aspect of finite action selection (e.g., injecting material before closing the mold would
lead to a mess) and the parametrization aspect (e.g., injecting too cold material leads to poor
surface characteristics of the molded product) have a major influence on the molded product.
Getting both aspects right is the task of planning in parametrized action spaces. Besides this simplified
example, many other real-world problems, from robotics to factory planning, feature parametrized
action spaces [
        <xref ref-type="bibr" rid="ref4 ref3 ref6 ref7 ref8">4, 3, 6, 7, 8</xref>
        ].
      </p>
      <p>There are two central challenges in solving planning problems in real-world parametrized action
spaces: (i) Due to the continuous nature of the parameter space, there is an infinite number of action-parameter
tuples a planner has to choose from in every state. This infinite branching of action-parameter
tuples in every state poses a challenge for selecting the optimal action-parameter tuple [9]. Typically,
infinite branching is either countered by parameter estimators [10], which have the risk of being
imprecise, or search [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which has the risk of being computationally expensive. (ii) Often there is
no sufficient model of the planning domain available, interaction with the domain is prohibitively
expensive or unsafe, and recorded data is scarce [12]. Hence, solving planning problems typically either
requires a manually crafted, expensive, and error-prone planning domain model [13, 5], or requires
advanced Reinforcement Learning algorithms which can be trained offline, meaning without interaction
with the planning domain, but which strongly rely on the assumption that the distribution of the recorded
data does not shift strongly from the application cases [12].
      </p>
      <p>
        In this paper, we tackle the challenges of infinite branching and training data scarcity in real-world
parametrized action spaces. Therefore, we propose to extend the well known Deep Q-Network (DQN)
algorithm [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. DQN uses a Neural Network to approximate the action value function, which returns the
expected cumulative return of taking an action in a state. In combination with a greedy policy, DQN can
solve even complex planning and control problems [14]. We propose to transfer DQN into a novel, offline
and model-augmented Reinforcement Learning setup, which allows us to use it for solving planning
problems in planning domains with parametrized action spaces [3] (cf. Figure 1). More precisely, we
propose three extensions to the DQN algorithm: (a) To tackle infinite branching, we introduce paramOpt,
a novel gradient-based optimization algorithm, to efficiently find optimal parameters for a given action
in a given state. (b) To make our algorithm applicable to unseen planning problems, we integrate a
goal-conditioning into the DQN [15]. (c) To allow using the DQN for planning without interacting with
the environment, we propose a novel state-transition model, which is trained along the DQN and allows
for planning in deterministic and probabilistic domains. We reduce the amount of training data needed to fit
the models by employing Hindsight Experience Replay [16] and Conservative Q-Learning [17].
      </p>
      <p>[Figure 1: Overview of GCM-DQN: (i) a planning problem, given as initial state, goal state, and current
state; (ii) the goal-conditioned DQN returning expected returns; (iii) gradient-based optimization and a greedy
policy selecting (action, param) tuples; (iv) the state transition model, which together yield a plan.]</p>
      <p>As a result, we present our Goal-Conditioned Model-Augmented DQN algorithm (GCM-DQN).
GCM-DQN can be trained on a sparse dataset of recorded plans from a planning domain. It returns a DQN
which can either be used as a policy in probabilistic scenarios, or, in combination with the parallelly
trained state transition model, as a planner for deterministic domains. In contrast to estimator- or
search-based algorithms for planning in parametrized action spaces, GCM-DQN converges quickly to optimal
parameters due to the gradient-based parameter optimization. The main contributions of our paper are:
• paramOpt, a novel gradient-based optimization algorithm to efficiently counter infinite branching
in planning domains with parametrized action spaces.
• A novel integration of paramOpt, goal-conditioning, and a novel state-transition model into DQN
to allow harnessing it for planning.
• A systematic and comprehensive evaluation of our approach against state-of-the-art
Reinforcement Learning paradigms for parametrized action spaces.</p>
      <p>The remaining paper is structured as follows: In Section 2 we review related research in the domain
of Reinforcement Learning for planning in parametrized action spaces. In Section 3 we introduce
the formalization of our problem. Section 4 introduces our solution, followed by its theoretical and
empirical evaluation in Section 5 and discussion in Section 6. We conclude our paper in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In Deep Reinforcement Learning, there are two directions when handling parameterized action spaces:
using Neural Networks as estimators that suggest parameters for actions, and using search or
optimization to find optimal parameters for an action. Typically, policy network approaches are grounded in
the Deep Deterministic Policy Gradient (DDPG) paradigm [10]. DDPG is an Actor-Critic approach, in
which the actor is a deep policy network that, given a state, suggests actions, and the critic is a deep
Q-network that calculates the cumulative expected return of the suggested action and state. Using
backpropagation over both networks allows for adapting their weights to converge to an optimal
policy- and Q-network. To solve planning problems in parametrized action spaces, Hausknecht and
Stone [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] extended the DDPG paradigm by expanding the deep policy network with an additional
non-binary output for suggesting parameter values, resulting in the P-DDPG algorithm. Fan et al.
[18] propose a similar approach. They use individual separate heads for selecting an action from the
finite action set, and individual separate heads for estimating its numerical parameters [18]. However,
both approaches neglect that there is a dependency between an action and the numerical parameters
[19]. Hence, Li et al. [19] proposed to encode the finite set of actions and numerical parameters into a
joint latent representation space on which the policy operates, and from which discrete and continuous
components are decoded for interaction with the environment. While the introduced approaches can
handle parametrized action spaces, they remain restricted to online settings, which require the agent to
interact directly with the environment, and are not well suited to an offline scenario with only little
available training data.
      </p>
      <p>
        Optimization- or search-based approaches typically follow a value-based paradigm, in which a greedy
policy selects the action-parameter tuple with the highest expected return. While methods like [20] use
a divide-and-conquer approach for complex action-parameter tuples that operates on a joint latent
representation, Xiong et al. [6] use a separate parameter estimation network which feeds into a DQN,
forming a parametrized DQN or P-DQN. Thereby, they can select a discrete action directly using a
greedy policy and do not rely on a continuous relaxation of the discrete action components (as, e.g.,
Hausknecht and Stone [4]) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Finally, Ma et al. [11] use an evolutionary optimization algorithm
for estimating an optimal action from a continuous action space. While such approaches can also be
adapted to parametrized action spaces, they are computationally expensive due to the uninformed
optimization paradigm.
      </p>
      <p>
        In contrast to typical Reinforcement Learning tasks, e.g., control, the reward structure in planning
problems is sparse. Typically, the reward for solving a planning problem is formalized by a single reward
signal upon reaching the goal state. This sparse reward signal hence is exclusively dependent on the goal
state, and changes for planning problems with diverging goal states. To make Reinforcement Learning
agents applicable to altering reward functions, Schaul et al. [15] introduced Universal Value Function
Approximators. Universal Value Function Approximators condition the value function approximator
on an embedding of the goal state, hence making it generalizable across altering planning problems
within the same domain [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Other methods for countering sparsity of reward signals, especially in
offline settings, include data augmentation, such as Hindsight Experience Replay [16], or regularization
in training by additional loss terms, such as Conservative Q-Learning [17].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Formalization</title>
      <p>Reinforcement Learning follows the assumption that there is an underlying MDP within all planning
domains. As we focus on planning problems in parametrized action spaces, we consider Parametrized
Action Markov Decision Processes (PAMDPs) [3].</p>
      <sec id="sec-3-1">
        <title>3.1. Parametrized Action Markov Decision Processes</title>
        <p>PAMDPs extend continuous Markov Decision Processes by introducing a hybrid, so-called parametrized
action space. They can be formalized as a tuple ⟨S, A, Ψ, T, ℛ, γ⟩, where S ⊆ ℝ^n is the continuous
state space, A = {a_0, ..., a_i, ..., a_K}, K ∈ ℕ, is a finite set of actions, in which each action a_i is
extended by a continuous parameter space Ψ_i ⊆ ℝ^{m_i}, and the union of all parameter spaces is given
as Ψ = ⋃_{i=1}^{K} Ψ_i. Together they form the parametrized action space
𝒜 = ⋃_{a_i ∈ A} {(a_i, ψ_i) | ψ_i ∈ Ψ_i}. (1)
T is the transition function
T = p(s_{t+1} | s_t, a_t, ψ_t) (2)
that describes the probability of transitioning into state s_{t+1} ∈ S given state s_t ∈ S, action a_t ∈ A,
and a parameter ψ_t ∈ Ψ at time t. ℛ is the reward function ℛ: S × S → ℝ that returns the scalar
reward r when transitioning from s_t into s_{t+1} using an action, and γ ∈ ℝ is a discount factor. We will
further refer to T as the dynamics of the MDP.</p>
        <p>As the transition dynamics in real-world PAMDPs can grow very complex, large models and large
datasets are needed to properly capture them. Leveraging the parametrized action space, we propose
to manage the complexity of real-world dynamics by a modular factorization of the parametrized action
space. Therefore, we split T into a finite set of transition functions T_{a_i}, which each are related to
one individual action a_i:
T = {T_{a_i} | T_{a_i} = p(s_{t+1} | s_t, ψ_t), ψ_t ∈ Ψ_i, i = 1, ..., K}. (3)
This allows us to model the transition dynamics for each action in one individual model f_{a_i} ≈ T_{a_i},
reducing the complexity of the modeling problem, while overall not affecting the PAMDP dynamics.
We can denote the collection of all f_{a_i} as ℱ = {f_{a_i}}_{i=1}^{K}. During planning, we can infer
state transitions by sampling from the transition models,
s_{t+1} ∼ f_{a_i}(s_t, ψ_t). (4)
In deterministic scenarios, the transition probabilities collapse to a Dirac delta distribution, which
effectively turns f_{a_i} into a deterministic function
f_{a_i}(s_t, ψ_t) = s_{t+1}. (5)</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Describing Planning Problems with PAMDPs</title>
        <p>Planning describes the task of finding a sequence P = {(a_t, ψ_t)}_{t=0}^{T−1} of T action-parameter
tuples that transitions an initial state s_0 into a goal state s_g ∈ G ⊂ S. Hence, a planning problem in a
PAMDP can be denoted as
⟨S, A, Ψ, ℱ, ℛ_g, γ, s_0, G⟩, (6)
where ℛ_g is a goal-conditioned, sparse reward function
ℛ_g(s) = r, if s ∈ G; 0, else, (7)
with the numerical reward value r ∈ ℝ.</p>
        <p>Reinforcement Learning typically solves planning problems by iteratively applying a policy on the
planning problem. Hence, a plan can be seen as a trajectory-level instantiation of a policy. A policy
in a PAMDP is a mapping from the current state s and goal state g to an action-parameter tuple. For
deterministic planning domains, the mapping is a function π_det(s, g) = (a, ψ). For probabilistic planning
domains, the mapping is a conditional distribution π((a, ψ) | s, g), where s ∈ S, g ∈ G, a ∈ A, ψ ∈ Ψ.</p>
        <p>For deterministic domains, the solution of a planning problem is a plan, which, when executed from
s_0, reaches a s_g ∈ G. For probabilistic domains, the solution of a planning problem is a proper policy.
A proper policy optimizes the discounted return of the planning problem and results in a goal state
s_g ∈ G. The sequence of action-parameter tuples selected by the policy during execution forms a plan.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Solution</title>
      <p>
        In this section, we introduce our GCM-DQN algorithm. GCM-DQN tackles the challenges of infinite
branching, prohibitively expensive domain interactions, and data scarcity in real-world planning domains
with parametrized action spaces. The intuition of GCM-DQN is to leverage the differentiability
of a DQN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] during planning for finding the optimal parameters and actions via gradient-based
optimization, instead of using estimators or search. Therefore, we add three extensions to the DQN
algorithm [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]: (a) To tackle the problem of infinite branching, we introduce the paramOpt algorithm,
a gradient-based optimization algorithm inspired by [21, 8], for finding a (leastwise locally) optimal
action-parameter tuple during planning (cf. Section 4.3). (b) To make GCM-DQN applicable to any
planning problem within the planning domain, we introduce a goal-conditioning to the DQN, as proposed
in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. We tackle data scarcity in training the goal-conditioned DQN by using Hindsight Experience
Replay [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and Conservative Q-Learning [17] (cf. Section 4.2). (c) Finally, to counter prohibitively
expensive domain interaction, we propose a novel state transition model which is trained in parallel
to the DQN on the same dataset (cf. Section 4.4), allowing to simulate state transitions without any
interaction with the planning domain.
      </p>
      <p>By combining the three proposed extensions, we arrive at our novel GCM-DQN algorithm (cf. Section
4.1). GCM-DQN can operate in planning domains with parametrized action spaces. It can either be used
as a policy for probabilistic planning domains or, when using the state transition model, as a planner
for deterministic planning domains (cf. Figure 2).</p>
      <p>[Figure 2: GCM-DQN during planning: the state transition models and the value function approximator are
used to calculate a decision value, from which a greedy policy builds the plan.]</p>
      <sec id="sec-4-1">
        <title>4.1. Planning with Goal-Conditioned Model-augmented Deep Q-Networks</title>
        <p>In this Section we provide an overview of the GCM-DQN algorithm (cf. Algorithm 1 and Figure 2).
In its essence, GCM-DQN is a goal-conditioned greedy policy, which is trained in an offline setting.
Hence, the first step includes training the DQN Q_θ and the state transition models ℱ = {f_{a_i} | i = 1, ..., K}
using a dataset of recorded plans 𝒟. During planning, GCM-DQN uses the paramOpt algorithm (cf.
Algorithm 2) on Q_θ to calculate the optimal parameter ψ̃* for every action. To guide the selection of
optimal action-parameter tuples, we calculate a decision value d for every action. d includes the Q-value,
the weighted variance of the succeeding state var (cf. Equation 18), and a potential-based shaping term.
Using d instead of the pure Q-values counters the selection of actions which would lead into
non-permissible states, e.g., colliding with boundaries. A greedy policy π_greedy then picks the highest d and
adds the corresponding action-parameter tuple (a, ψ̃*) to the plan. By sampling from the associated
state transition model, the next state s_{t+1} can be inferred and passed to the next iteration. The
iterations stop, when s_{t+1} becomes a state within G (or G ± ε, where ε is an error margin). In cases, in
which there is no solution to the planning problem, a stopping criterion can be introduced to bound
the maximum number of iterations. The complete GCM-DQN algorithm is outlined in Algorithm 1.
The following sections introduce the extensions of GCM-DQN in detail.</p>
        <sec id="sec-4-1-1">
          <title>Algorithm 1: GCM-DQN during planning</title>
          <p>Require: initial state s_0, goal G, trained goal-conditioned DQN Q_θ, trained state transition models ℱ, error margin ε, maximum number of iterations.</p>
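          <p>The planning loop of Algorithm 1 can be illustrated with a minimal Python sketch. All names below
(plan_gcm_dqn, param_opt, decision_value, the q_net and transition-model interfaces, and the stopping
parameters) are illustrative assumptions of this sketch, not the exact implementation.</p>
          <preformat>
# Minimal sketch of the GCM-DQN planning loop (Algorithm 1), assuming the
# helpers param_opt (Section 4.3) and decision_value (Q-value plus weighted
# successor-state variance and potential-based shaping) and the per-action
# transition models (Section 4.4) are available.
import numpy as np

def plan_gcm_dqn(s0, goal, q_net, transition_models, actions,
                 eps=0.05, max_iters=100):
    plan, s = [], s0
    for _ in range(max_iters):                        # bound for unsolvable problems
        candidates = []
        for a in actions:
            psi = param_opt(q_net, s, a, goal)        # gradient-based parameter optimization
            d = decision_value(q_net, transition_models, s, a, psi, goal)
            candidates.append((d, a, psi))
        _, a, psi = max(candidates, key=lambda c: c[0])   # greedy policy over decision values
        plan.append((a, psi))
        s = transition_models[a].sample(s, psi)       # simulate the step with the learned model
        if np.linalg.norm(s - goal) &lt;= eps:           # goal reached within error margin
            return plan
    return plan                                       # budget exhausted
          </preformat>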
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Goal-Conditioned DQN for Parametrized Action Spaces</title>
        <p>In this section, we describe our adaptations to DQN to allow using it for planning in planning domains
with parametrized action spaces. We achieve this by including the goal state into the input of the
DQN, thereby conditioning it on the goal state, and handling continuous per-action parameters via
gradient-based optimization.</p>
        <p>
          The original DQN uses a Neural Network to approximate the action value function Q(s, a) of a domain
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which describes the expected discounted return for taking action a in state s, and satisfies the
Bellman equation in the optimal case,
Q(s_t, a_t) = E[ℛ(s_{t+1}) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})].
        </p>
        <p>
          We extend the input of the Q-function by the goal state g of the planning
problem [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and the parametrized actions, so that Q = Q(s, a, ψ, g),
where a ∈ A is an action from the finite action set, ψ is an associated continuous parameter, and g is
the goal state of the planning problem. For our updated Q-function, the Bellman equation becomes
Q(s_t, a_t, ψ_t, g) = E[ℛ_g(s_{t+1}) + γ max_{a_{t+1}} max_{ψ_{t+1}} Q(s_{t+1}, a_{t+1}, ψ_{t+1}, g)]. (11)
        </p>
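        <p>A goal-conditioned Q-network of this form can, for instance, be realized as an MLP over the
concatenation of state, parameter, and goal, with one output head per discrete action. The following
PyTorch sketch is a minimal illustration; the layer sizes and the concatenation scheme are assumptions
of this sketch.</p>
        <preformat>
# Minimal sketch of a goal-conditioned Q-network Q(s, a, psi, g), assuming the
# parameter vector is padded to a common width and the discrete action selects
# one of the output heads.
import torch
import torch.nn as nn

class GoalConditionedQNet(nn.Module):
    def __init__(self, state_dim, param_dim, goal_dim, n_actions, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim + param_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),             # one Q-value head per discrete action
        )

    def forward(self, state, param, goal):
        x = torch.cat([state, param, goal], dim=-1)   # condition on parameters and the goal
        return self.mlp(x)                            # shape: (batch, n_actions)

# Q(s, a_i, psi, g) is then q_net(state, param, goal)[:, i].
        </preformat>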
        <p>As the inner maximization over ψ_{t+1} is non-convex when Q is approximated by a Neural Network,
solving it is intractable. Hence, we propose to leverage global optimization algorithms for finding
leastwise local optima for ψ and to solve Equation 11 in two steps. In the first step, we find optimal
action-parameter tuples for each action in the current state,
ψ*_i = arg max_{ψ_i ∈ Ψ_i} Q(s_{t+1}, a_i, ψ_i, g) ∀ a_i ∈ A, (12)
using projected gradient ascent (cf. Section 4.3). As we cannot guarantee a global optimum, we denote
the resulting parameters with ψ̃*. This first step allows us to reformulate Equation 11 as
Q(s_t, a_t, ψ_t, g) = E[ℛ_g(s_{t+1}) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, ψ̃*_{t+1}, g)], (13)
which resembles the Bellman equation with a goal conditioning and an approximate inner maximization.</p>
        <p>
          We train our goal-conditioned DQN Q_θ, with parameters θ, for parametrized action spaces in an
offline Reinforcement Learning setup, to cater to the restriction of prohibitively expensive domain
interactions in real-world planning domains. Therefore, we assume a training dataset of recorded plans
𝒟 = {P_j}_{j=0}^{N}. A major problem in offline Reinforcement Learning is the distributional shift between
training data and the application domain [12]. We counter this problem by augmenting 𝒟 with Hindsight
Experience Replay [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and Conservative Action Sampling [23] (cf. Figure 3). Hindsight Experience
Replay augments the available dataset by sampling sub-traces from the recorded plans, relabeling the
final state as the goal state [16]. Conservative Action Sampling also samples sub-traces from the
recorded plans, however, labeling their final state as a miss, therefore artificially creating negative samples
for the dataset [23]. Using both augmentation techniques results in the datasets 𝒟̃ [16] and 𝒟̄ [23].
        </p>
        <p>[Figure 3: Hindsight Experience Replay [16] and Conservative Action Sampling [23] for augmenting our
training dataset: sub-sequences are sampled and re-labeled with new goals (HER) or with random negative
goals (Conservative Action Sampling), before conservative Q-learning on the augmented data.]</p>
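        <p>The two augmentation steps can be sketched as follows on a dataset of recorded plans; the trajectory
format (lists of (state, action, parameter, next_state) tuples) and the sampling strategy are assumptions
of this sketch.</p>
        <preformat>
# Minimal sketch of the dataset augmentation, assuming each plan is a list of
# (s, a, psi, s_next) steps. Hindsight Experience Replay relabels the final
# state of a sampled sub-trace as the goal (hit, reward 1); Conservative Action
# Sampling labels a random goal as missed (reward 0).
import random

def hindsight_relabel(plan):
    start = random.randrange(len(plan))
    sub = plan[start:]
    goal = sub[-1][3]                                 # final state becomes the goal
    return [(s, a, psi, s_next, goal, 1.0 if i == len(sub) - 1 else 0.0)
            for i, (s, a, psi, s_next) in enumerate(sub)]

def conservative_relabel(plan, random_goal):
    start = random.randrange(len(plan))
    sub = plan[start:]                                # the random goal is never reached
    return [(s, a, psi, s_next, random_goal, 0.0) for (s, a, psi, s_next) in sub]
        </preformat>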
        <p>
          Following [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], we use an off-policy training setup, using an online network Q_θ and a target network
Q_{θ⁻}. During training, only the weights θ of Q_θ are updated via gradient descent, whereas the weights of
Q_{θ⁻} are copied from Q_θ every n steps. We use a composite loss function
ℒ = ℒ_TD + α ℒ_CQL, (14)
consisting of the squared TD-loss ℒ_TD [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and a conservative penalty term ℒ_CQL [17]. The conservative
penalty term ℒ_CQL helps to regularize Q_θ not to overestimate Q-values of unseen or underrepresented actions
[17]. We denote the squared TD-loss as
ℒ_TD = E_{(s_t, a_t, ψ_t, g, s_{t+1}) ∼ 𝒟̃ ∪ 𝒟̄} [(r_t + γ (1 − d_t) max_{a_{t+1}} Q_{θ⁻}(s_{t+1}, a_{t+1}, ψ̃*_{t+1}, g) − Q_θ(s_t, a_t, ψ_t, g))²], (15)
where d_t ∈ {0, 1} indicates whether the plan terminates at time t, so that d_t = 1, if s_{t+1} ∈ G. Following [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], we
formulate the conservative penalty term as
ℒ_CQL = E_{(s_t, a_t, ψ_t, g) ∼ 𝒟̃ ∪ 𝒟̄} [log(∑_{i=1}^{K} ∑_{j=1}^{M} exp(Q_θ(s_t, a_i, ψ_i^{(j)}, g))) − Q_θ(s_t, a_t, ψ_t, g)], (16)
where α is the trade-off factor between Bellman-fit and conservatism, K = |A| is the number of discrete
actions, and M is the number of parameter samples per action used in the log-sum-exp penalty. For our
offline training, we draw the samples ψ_i^{(j)} uniformly from the empirical pool of parameters for action a_i
to approximate ∫_{Ψ_i} exp(Q_θ(s_t, a_i, ψ, g)) dψ.
        </p>
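        <p>A sketch of the composite loss of Equations 14–16 is given below; the batch layout, the use of M
uniformly drawn parameter samples per action, and the tensor shapes are assumptions of this illustration.</p>
        <preformat>
# Minimal sketch of the composite loss L = L_TD + alpha * L_CQL (Eq. 14-16),
# assuming q_net(s, psi, g) returns Q-values for all K discrete actions and
# psi_samples holds M parameter tensors drawn uniformly from the dataset.
# psi_next would be obtained via paramOpt in the full algorithm.
import torch
import torch.nn.functional as F

def gcm_dqn_loss(q_net, target_net, batch, psi_samples, alpha, gamma):
    s, a, psi, g, r, s_next, psi_next, done = batch   # tensors from the augmented dataset
    q_sa = q_net(s, psi, g).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():                             # TD target from the target network (Eq. 15)
        q_next = target_net(s_next, psi_next, g).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    l_td = F.mse_loss(q_sa, target)

    # Conservative penalty (Eq. 16): log-sum-exp over all K actions and the M
    # sampled parameters, minus the Q-value of the action seen in the data.
    q_all = torch.stack([q_net(s, psi_m, g) for psi_m in psi_samples], dim=-1)
    l_cql = (torch.logsumexp(q_all.flatten(1), dim=1) - q_sa).mean()

    return l_td + alpha * l_cql
        </preformat>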
      <p>Regarding 𝒟, three edge cases must be considered: (i) 𝒟 including no data, (ii) 𝒟 including little
data, and (iii) 𝒟 including infinite data. In case (i), where no data is available, Q_θ cannot be trained.
Hence, data must be collected by random exploration or through sampling state transitions from the
domain. Case (ii) describes the normal operation of GCM-DQN. We note that the higher the variance
in the dataset, the better the approximation of Q_θ to the real Q. Case (iii) describes a special case, where
all data are available. Given a large enough Q_θ, this allows fitting Q exactly.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Gradient-based Parameter Estimation</title>
        <p>
          For finding the optimal parameters for an action in a given state, we propose to leverage the
differentiability of the DQN and use gradient ascent in a nested optimization loop for finding optimal
parameters for a given action (cf. Equation 12). Therefore, we introduce the paramOpt algorithm, which
draws inspiration from [24] and its applications in [9, 
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>The idea of paramOpt is to use the same algorithm, which is used to adapt the weights θ of Q_θ during
training, for finding the optimal action-parameter tuples during execution. However, instead of
optimizing the weights of the DQN, we optimize the parameter component ψ of its input. Therefore, we
initialize the parameter component ψ with a guess ψ̂, e.g., random numbers, zeros, or values from 𝒟.
After calculating Q_θ(s, a, g, ψ̂), we use backpropagation to derive the gradient with respect to ψ̂, allowing
us to use gradient ascent with a learning rate η to update ψ̂ in a direction which increases the Q-value.
The optimization stops after the updates of the Q-value, ΔQ, fall below a threshold ε, returning the last
update of ψ̂ as ψ̃*. Algorithm 2 summarizes our parameter estimation loop through input optimization.</p>
        <p>Algorithm 2: paramOpt Gradient‐Based Parameter Optimization</p>
        <preformat>
Require: s, a, g                                  // state, action, goal
         Q_θ                                      // goal-conditioned DQN
         η                                        // learning rate
         ε                                        // stopping threshold
   ψ̂ ← init()                                    // initial guess
1  ΔQ ← +∞
2  Q(prev) ← −∞
3  while ΔQ &gt; ε do
4      δ ← ∇_ψ̂ Q_θ(s, a, g, ψ̂)                   // backprop wrt. parameters
5      ψ̂ ← clip[ψ_min, ψ_max](ψ̂ + η δ)           // projected gradient ascent
6      Q(val) ← Q_θ(s, a, g, ψ̂)                  // calculate action value
7      ΔQ ← Q(val) − Q(prev)
8      Q(prev) ← Q(val)
9  return ψ̃* ← ψ̂                                 // optimized parameter
        </preformat>
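        <p>Algorithm 2 translates almost directly into autograd frameworks. The following PyTorch sketch is a
minimal illustration; the Q-network interface, the initialization from dataset statistics, and the default
learning rate and stopping threshold are assumptions of this sketch.</p>
        <preformat>
# Minimal sketch of paramOpt (Algorithm 2): projected gradient ascent on the
# parameter component of the Q-network input. q_net(s, psi, g)[a] is assumed
# to return Q(s, a, psi, g); psi_min and psi_max are the parameter bounds
# collected from the dataset.
import torch

def param_opt(q_net, s, a, g, psi_init, psi_min, psi_max, lr=0.01, eps=1e-4):
    psi = psi_init.clone().requires_grad_(True)    # initial guess (e.g. dataset mean)
    q_prev = float("-inf")
    delta = float("inf")
    while delta &gt; eps:
        q_val = q_net(s, psi, g)[a]                # action value for the current guess
        grad, = torch.autograd.grad(q_val, psi)    # backprop w.r.t. the parameters
        with torch.no_grad():
            psi += lr * grad                       # gradient ascent step
            psi.clamp_(psi_min, psi_max)           # projection onto the parameter bounds
        delta = q_val.item() - q_prev              # change of the action value
        q_prev = q_val.item()
    return psi.detach()                            # locally optimal parameters
        </preformat>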
        <p>As we are using gradient ascent as optimization algorithm over the DQN, we cannot guarantee
to find the true global optimum ψ*. This is due to the non-convex shape of Q_θ. The result of the
optimization hence can be strongly dependent on the initialization of ψ̂ and on the learning rate η. As there
are different options for initialization, e.g., zeros, ones, or random numbers, we suggest incorporating
prior knowledge from the dataset, in the form of estimators like the mean over observed parameter
settings, as starting guesses.</p>
        <p>Additionally, parameters are typically bound to value ranges, e.g., a temperature cannot fall below
0 Kelvin. To incorporate this, we use projected gradient ascent [25] during optimization, effectively
clipping values that exceed the bounds. As one naïve solution for retrieving the bounds, we suggest
iterating through the dataset 𝒟 and collecting the minima and maxima of each parameter.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Learning State Transition Dynamics</title>
        <p>In real-world planning problems, directly interacting with the planning domain to predict action
effects is rarely possible or prohibitively expensive [12]. Hence, planning requires a model of the
state transition dynamics [1] which maps a current state s_t and parameters ψ_t to a successor state s_{t+1}.</p>
        <sec id="sec-4-4-1">
          <title>In deterministic domains this is a funct (io</title>
          <p>n,   ) =  +1 (cf. Eq. 5); in probabilistic domains it is a
conditional distributi(on+1 ∣   ,   ) from which +1 is sampled (cf. Eq.4).</p>
          <p>Following the modular per–action factorization of PAMDP dynamics (c3f)., wEqe. learn one
we use the same datas et, which is also used for training the DQN.
transition model actioℱn,= {</p>
          <p>}=1 , each predicting the next state for a ctigoinven (  ,   ). Thereby,</p>
          <p>We propose to capture the stochasticity of probabilistic planning domains with a novel conditional
latent-variable state transition model, inspir2e6d]. bTyh[ereby, each per-action model comprises an
encoder  and a decode r part.</p>
          <p>During training , the encoder processes the inp u, t  , and +1 into the parametersand of a latent
posterio r (|  ,  +1 ,   ). Using the reparametrization trick, it sa mp=les+  ⊙ ,
 ∼  (0,  )
The decoder  reconstruc t+s1 from   ,   , and under a standard normal pri o(r) =  (0,  ) . As
.
training criterion, we minimize the negative Evidence Lower Bound,
A high variance indicates a high predictive uncertai nt̂+y1 ,inwhich indicates boundaries or
nonpermissible states, like obstacles.</p>
          <p>For deterministic domains, the stochastic latceannt be omitted and reduces to a standard</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>Multilayer Perceptron.</title>
        </sec>
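          <p>A per-action transition model of this kind can be sketched as a small conditional variational
autoencoder. The following sketch is a minimal illustration; layer sizes, the latent dimension, and the
interface of the sample method are assumptions of this sketch.</p>
          <preformat>
# Minimal sketch of one per-action latent-variable transition model f_a,
# assuming a conditional VAE: the encoder infers q(z | s, psi, s_next), the
# decoder reconstructs s_next from (s, psi, z). At planning time only the
# decoder is used with z ~ N(0, I).
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    def __init__(self, state_dim, param_dim, latent_dim=8, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(2 * state_dim + param_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))        # outputs mu and log-variance
        self.dec = nn.Sequential(
            nn.Linear(state_dim + param_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))
        self.latent_dim = latent_dim

    def forward(self, s, psi, s_next):                # training: terms of the ELBO (Eq. 17)
        mu, logvar = self.enc(torch.cat([s, psi, s_next], -1)).chunk(2, -1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparametrization trick
        recon = self.dec(torch.cat([s, psi, z], -1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon, kl

    def sample(self, s, psi, n=1):                    # planning: decode with z ~ N(0, I)
        z = torch.randn(n, self.latent_dim)
        s_rep = s.view(1, -1).expand(n, -1)
        psi_rep = psi.view(1, -1).expand(n, -1)
        return self.dec(torch.cat([s_rep, psi_rep, z], -1))

# The predictive variance over n decoded samples (Eq. 18) can be used to flag
# non-permissible states: samples = model.sample(s, psi, n); samples.var(0).sum().
          </preformat>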
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>We evaluate our GCM-DQN algorithm empirically against offline versions of state-of-the-art baselines
for planning in parametrized action spaces [4, 6]. As performance metric, we use the rate of successfully
solved planning problems from a set of unseen planning problems. Therefore, we used domains with
navigation problems and domains from the international planning competition's (IPC) reinforcement
learning track [27] (cf. Figure 4). We hypothesize that (1) GCM-DQN shows a higher performance
than the baselines, when trained on the same limited dataset of plans 𝒟, and (2) GCM-DQN maintains
a higher performance longer than the baselines, when systematically reducing the number of samples
in 𝒟.</p>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
        <p>For setting up our experiments, we follow the experimental design guidelines for empirical Machine
Learning research by Vranješ et al. [28]. We generate samples for the datasets 𝒟 by running either
an A* search or JaxPlan [29] for randomly initialized planning problems of the chosen planning
domains. We used Optuna [30] for hyperparameter optimization of GCM-DQN and the baselines
to allow for a fair comparison. We repeated all experiments on eight different seeds to rule out
lucky initializations. All code and datasets for replicating the experiments can be found under
https://github.com/j-ehrhardt/gcmdqn. We used the following planning domains for evaluation:</p>
        <p>Navigation Domains The navigation domains feature two-dimensional path finding problems in a
continuous space with obstacles. The goal is to find a sequence of actions that leads from the start state
to the goal state. There is a set of four actions - up, down, left, right - in which each action can be
augmented with a plus/minus ten-degree tilt. The step-width is fixed and collisions with the obstacles
are forbidden. The planning problems are non-trivial, as the reward function is sparse and planners
need to deal with linear and non-linear obstacles.</p>
        <p>IPC Domains The IPC domains feature domains from the International Planning Competition's
Probabilistic and Reinforcement Learning Track from 2023 [27]. We picked the reservoir, powergen,
and HVAC domains.</p>
        <p>While the navigation domains have a stronger focus on the combinatorial aspect of finding a correct
action to solve the planning problems, the IPC domains put a stronger emphasis on finding the correct
parameters. As it is highly unrealistic that a learning algorithm trained on a scarce dataset 𝒟 could match
classical evaluation metrics like optimality, soundness, efficiency, and completeness1, we chose the
planning success rate, describing the number of successfully solved planning problems from a set of
unseen test planning problems.</p>
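        <p>To make the parametrized action space of the navigation domains concrete, a single step could be
sketched as follows; the exact state representation, the step width, and the collision handling are
assumptions of this sketch.</p>
        <preformat>
# Minimal sketch of a navigation-domain step: four discrete actions (up, down,
# left, right), each augmented by a tilt parameter in [-10, 10] degrees, with a
# fixed step width. Collision checking is assumed to be provided by the domain.
import math

BASE_ANGLES = {"up": 90.0, "down": 270.0, "left": 180.0, "right": 0.0}
STEP_WIDTH = 0.1                                  # fixed step width (assumption)

def nav_step(pos, action, tilt_deg, collides):
    tilt_deg = max(-10.0, min(10.0, tilt_deg))    # parameter range of the action
    angle = math.radians(BASE_ANGLES[action] + tilt_deg)
    new_pos = (pos[0] + STEP_WIDTH * math.cos(angle),
               pos[1] + STEP_WIDTH * math.sin(angle))
    return pos if collides(new_pos) else new_pos  # collisions with obstacles are forbidden
        </preformat>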
        <p>
          As baselines we used P-DQN [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and P-DDPG [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] from the literature, as, to our knowledge, there are no
offline Reinforcement Learning algorithms for solving planning problems in PAMDPs. While P-DDPG
is a policy-based approach which is trained in an actor-critic setup [4], P-DQN is closer related to our
approach, using a DQN for evaluating different action-parameter tuples. However, instead of finding
optimal parameter values via gradient-based search, it uses a Neural Network as heuristic for suggesting
parameter values [6]. We transferred both baselines into an offline setting, using Conservative Q-learning,
Hindsight Experience Replay, and potential-based shaping as for our algorithm.
1 As our algorithm is grounded in the Bellman Equation, its solutions will converge to optimal, sound, and complete results
with an infinitely large dataset. However, this is not its operational scenario. We hence do not consider very large datasets
for evaluation.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluating the Planning Performance of GCM-DQN</title>
        <p>For evaluating the performance of GCM-DQN in comparison to the baselines, we created a training
dataset 𝒟 of 128 solved planning problems and a test dataset of 100 solved problems per domain. We
ran a hyperparameter search with 64 trials for each algorithm and domain and subsequently tested each
algorithm with the best hyperparameter setup on eight different seeds, to rule out lucky initialization.
The results are reported in Table 1.</p>
        <p>We hypothesized that GCM-DQN shows a higher performance than the baselines, when trained on the
same limited dataset 𝒟. For the navigation domains, our results indicate that GCM-DQN shows a higher
mean planning success rate over the eight different seeds than the baselines, when trained on a limited
dataset of 128 plans. For the IPC domains, either P-DDPG or GCM-DQN shows the highest performance,
with only narrow differences. As the IPC domains have a stronger emphasis on the parametrization
than on the combinatorial action selection, it is to be expected that the Actor-Critic approach performs well
in the IPC domains, while underperforming in the navigation domains. Overall, all algorithms show
declining performance with increasing complexity of the planning domains. Yet, our GCM-DQN shows
the most stable results, in comparison to the baselines.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Evaluating the Planning Performance of GCM-DQN on Succeedingly Scarce</title>
      </sec>
      <sec id="sec-5-4">
        <title>Data</title>
        <p>The application scenario for GCM-DQN is planning under circumstances where only little data is
available and interactions with the environment are not possible. For evaluating the behavior of
GCM-DQN on scarce data, we trained GCM-DQN and the baselines on succeedingly fewer samples in 𝒟.
Therefore, we created subsets of 𝒟 containing {64, 32, 16, 8, 4, 2} samples and trained GCM-DQN and
the baselines with the hyperparameter settings from above. For each algorithm and dataset, we repeated
the procedure on eight different seeds. Figure 5 shows the results for the navigation and IPC domains.</p>
        <p>We hypothesized that GCM-DQN maintains a higher performance under progressive sample reduction
compared to the baseline methods. For the navigation domains, we could mostly confirm this.
GCM-DQN shows an increase in planning success rates, when increasing the number of plans in the training
dataset. In the navigation domains, GCM-DQN gets overtaken by P-DQN in the lower sample area of
the circle domains and very closely in the higher sample area of the bars domain. Yet, it shows
in general lower variance across the seeds, suggesting a more stable outcome. In the IPC domains,
GCM-DQN and P-DDPG are consistently strong and stable, with an exception for GCM-DQN in the
reservoir domain, while P-DQN performs low. Overall, GCM-DQN tends to improve, sometimes sharply
in the higher sample area, while P-DDPG is competitive but more variable. P-DQN underperforms
across the IPC domains.</p>
        <p>[Figure 5: Planning success rates over the number of training plans (2, 4, 8, 16, 32, 64, 128) for the
navigation and IPC domains.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>In the following, we discuss the findings from our evaluation in Section 5. We place special emphasis
on discussing architectural limitations of GCM-DQN, the distributional shift to the application
scenarios, and the implications of aleatoric uncertainty from latent factors in the planning domains.
Architectural Limitations of GCM-DQN Given the architecture we chose for our GCM-DQN
algorithm, there are inherent limitations. Our gradient-based paramOpt function for estimating the
parameters for actions can converge to local optima in the Q-function. Especially in complex, non-convex
Q-functions, this poses a serious problem. Mitigation strategies could include ensemble approaches with
differently seeded optimizers, multi-start optimization with different initial guesses, or a combination of
both. Additionally, in essence, our GCM-DQN algorithm is one-step greedy (though implicitly operating
on the expected returns of the DQN). Especially for domains in which long plans are necessary to
reach a goal, the sparse reward signal of the training data might lead to wrong results. Using the transition
model for look-ahead methods, like Monte Carlo Tree Search, might result in better performance of the
planner. Alternatively, a hierarchical perspective where GCM-DQN plans between intermediate goals
might lead to increased performance with longer plans. As some hyperparameters, e.g., the weight
of Conservative Q-Learning or the number of Conservative Action Samples, have a strong impact
on the performance and stability of the planner, including them as parameters in the training loop to
dynamically adapt the conservatism or data augmentation level of the model during training might be
a future improvement.</p>
      <p>Data Quantity and Diversity The quantity and diversity of the training data in the training dataset
𝒟 had a significant impact on the performance of the tested algorithms. Our results support the intuition
that more and diverse data improves the approximation of the true Q-function and true transition
dynamics. The planning success rate of our GCM-DQN algorithm continuously improved as the number
of plans in 𝒟 increased. All methods struggled in scenarios where only few samples in the training
dataset were available. We deliberately focused on scarce data scenarios in our evaluation, as they
reflect the real-world application of planners, where collecting more data and an interaction with the
environment is prohibitively expensive. In this context, including Conservative Q-Learning and
Hindsight Experience Replay as mitigations for scarce data was important. Even though Hindsight
Experience Replay did not raise the mean planning success rates, it reduced the outcome variability
and thus improved the reliability of GCM-DQN on small data. This implies that when working with
scarce data and the performance is insufficient, adding additional data may be more effective than
tweaking the algorithms in isolation.</p>
      <sec id="sec-6-1">
        <title>Distributional Shift in Offline Reinforcement Learning</title>
        <p>
          One of the core challenges in Offline Reinforcement Learning is the distributional shift between the training data and application scenarios
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Especially in the context of planning, the planner is likely to encounter state, action, and parameter
combinations that lie outside the support of the training data, which can lead to extrapolation errors. We
mitigated this risk using three mechanisms from the Offline Reinforcement Learning literature: Using
Hindsight Experience Replay [16], Conservative Action Sampling [23], and Conservative Q-Learning
[17]. Our results indicate that all measures improved training stability and planning performance.
Aleatoric Uncertainty from Latent Factors in the Planning Domain Real-world application
scenarios for planners, e.g., industrial processes, often show hidden factors and randomness that offline
training cannot fully predict. For example, in a manufacturing domain, tool wear can alter a system's
dynamics, introducing aleatoric uncertainty. Though our GCM-DQN approach attempts to accommodate
stochasticity in its state transition models, systematic latent factor shifts over time will lead to
mispredictions of future transitions as the underlying transition dynamics changed. This limitation,
however, is not unique to our approach but shared by all offline learning methods. Mitigating it could
involve a periodic re-training with "fresh" data or designing the model to capture these factors explicitly
or in latent variables.
        </p>
        <p>
          Evaluation Fairness of Offline Baselines Finally, we discuss the evaluation fairness of the employed
baselines. The employed baselines P-DDPG [4] and P-DQN [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] were originally designed for online
Reinforcement Learning, where extensive interaction with the environment shapes the policy and
DQN. Conversely, we evaluated them in an offline setting. However, to ensure a fair evaluation with
our GCM-DQN algorithm, we adapted both baselines to the offline setup, by incorporating the same
techniques that we used in GCM-DQN to improve the training performance of the models. Namely, we
used the same state transition models, Conservative Q-Learning, Hindsight Experience Replay, and
Conservative Action Sampling, creating a common and fair ground for evaluation.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion &amp; Outlook</title>
      <p>In this paper, we introduced the Goal-Conditioned Model-Augmented DQN algorithm (GCM-DQN), a
model-augmented Offline Reinforcement Learning algorithm for planning in parametrized action spaces,
where no model of the planning domain and only a limited dataset of recorded plans are available.
GCM-DQN tackles three central challenges of planning with Reinforcement Learning in parametrized
action spaces: (i) infinite branching of action-parameter tuples, (ii) goal-dependent reward functions,
and (iii) substituting domain interactions with a model during planning time. To address these challenges,
we introduce paramOpt, a novel gradient-based optimization algorithm over the DQN for finding the
optimal parameters for an action in a state, a goal-conditioning of the DQN that allows for planning
with changing and sparse reward functions, and a novel state transition model that allows to capture the
inherent stochasticity of probabilistic planning domains. We evaluate GCM-DQN against
offline versions of two closely related algorithms. GCM-DQN shows significantly higher performance
than the baselines, especially in data-scarce scenarios. Future work will include the refinement of
GCM-DQN's architecture and its application to real-world industrial planning scenarios.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgement</title>
      <p>This research, as part of the projects LaiLa and EKI, is funded by dtec.bw – Digitalization and Technology
Research Center of the Bundeswehr, which we gratefully acknowledge. dtec.bw is funded by the
European Union – NextGenerationEU.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>Any use of generative AI in this manuscript adheres to ethical guidelines of IEEE for use and
acknowledgement of generative AI. Each author has made a substantial contribution to the work, using LLMs
exclusively for language refinement, formatting purposes, and for non-substantial coding, e.g., for
creating plots.</p>
      <p>[17] A. Kumar, A. Zhou, G. Tucker, S. Levine, Conservative q-learning for offline reinforcement learning,
in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information
Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1179–1191.
[18] Z. Fan, R. Su, W. Zhang, Y. Yu, Hybrid actor-critic reinforcement learning in parameterized
action space, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial
Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019,
pp. 2279–2285. doi:10.24963/ijcai.2019/316.
[19] B. Li, H. Tang, Y. Zheng, J. Hao, P. Li, Z. Wang, Z. Meng, L. Wang, Hyar: Addressing
discrete-continuous action reinforcement learning via hybrid action representation, 2021. doi:
10.48550/ARXIV.2109.05490.
[20] A. Tavakoli, F. Pardo, P. Kormushev, Action branching architectures for deep reinforcement
learning, Proceedings of the AAAI Conference on Artificial Intelligence 32 (2018). doi:10.1609/
aaai.v32i1.11798.
[21] G. Wu, B. Say, S. Sanner, Scalable planning with tensorflow for hybrid nonlinear domains, in:
I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.),
Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[22] A. Y. Ng, D. Harada, S. J. Russell, Policy invariance under reward transformations: Theory and
application to reward shaping, in: Proceedings of the Sixteenth International Conference on
Machine Learning, ICML ’99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999, p.
278–287.
[23] Y. Chebotar, K. Hausman, Y. Lu, T. Xiao, D. Kalashnikov, J. Varley, A. Irpan, B. Eysenbach, R. C.</p>
      <p>Julian, C. Finn, S. Levine, Actionable models: Unsupervised ofline reinforcement learning of
robotic skills, in: Proceedings of the 38th International Conference on Machine Learning, volume
139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 1518–1528.
[24] D. P. Kingma, S. Mohamed, D. J. Rezende, M. Welling, Semi-supervised learning with deep
generative models, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (Eds.),
Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014.
[25] P. H. Calamai, J. J. Moré, Projected gradient methods for linearly constrained problems,
Mathematical Programming 39 (1987) 93–116. doi:10.1007/bf02592073.
[26] K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional
generative models, in: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances
in Neural Information Processing Systems, volume 28, Curran Associates, Inc., 2015.
[27] A. Taitler, R. Alford, J. Espasa, G. Behnke, D. Fišer, M. Gimelfarb, F. Pommerening, S. Sanner,
E. Scala, D. Schreiber, J. Segovia‐Aguas, J. Seipp, The 2023 international planning competition, AI
Magazine 45 (2024) 280–296. doi:10.1002/aaai.12169.
[28] D. Vranješ, J. Ehrhardt, R. Heesch, L. Moddemann, H. S. Steude, O. Niggemann, Design Principles
for Falsifiable, Replicable and Reproducible Empirical Machine Learning Research, in: 35th
International Conference on Principles of Diagnosis and Resilient Systems (DX 2024), volume
125, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2024. doi:10.4230/
OASIcs.DX.2024.7.
[29] M. Gimelfarb, A. Taitler, S. Sanner, Jaxplan and gurobiplan: Optimization baselines for replanning
in discrete and mixed discrete-continuous probabilistic domains, Proceedings of the International
Conference on Automated Planning and Scheduling 34 (2024) 230–238. doi:10.1609/icaps.v34i1.
31480.
[30] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter
optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghallab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nau</surname>
          </string-name>
          , P. Traverso,
          <source>Automated Planning and Acting</source>
          , Cambridge University Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: An introduction</article-title>
          , MIT press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Masson</surname>
          </string-name>
          , P. Ranchod, G. Konidaris,
          <article-title>Reinforcement learning with parameterized actions</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>30</volume>
          (
          <year>2016</year>
          )
          . doi:10.1609/aaai.v30i1.10226.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hausknecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning in parameterized action space</article-title>
          ,
          2016. doi:10.48550/ARXIV.1511.04143.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Heesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ehrhardt</surname>
          </string-name>
          ,
          <string-name>
            <surname>O. Niggemann,</surname>
          </string-name>
          <article-title>Integrating machine learning into an smt-based planning approach for production planning in cyber-physical production systems</article-title>
          ,
          <source>in: Artificial Intelligence. ECAI 2023 International Workshops</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>318</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          , L. Han,
          <string-name>
            <surname>Y</surname>
          </string-name>
          . Zheng,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Liu, H. Liu,
          <article-title>Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space</article-title>
          ,
          <year>2018</year>
          . doi:10.48550/ARXIV.1810.06394.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ehrhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Niggemann</surname>
          </string-name>
          ,
          <article-title>Learning process steps as dynamical systems for a subsymbolic approach of process planning in cyber-physical production systems</article-title>
          ,
          <source>in: Artificial Intelligence. ECAI 2023 International Workshops</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>332</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Heesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ehrhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Diedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Niggemann</surname>
          </string-name>
          ,
          <article-title>A lazy approach to neural numerical planning with control parameters</article-title>
          ,
          <source>in: European Conference on Artificial Intelligence (ECAI)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Say</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanner</surname>
          </string-name>
          ,
          <article-title>Scalable planning with deep neural network learned transition models</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>68</volume>
          (
          <year>2020</year>
          )
          <fpage>571</fpage>
          -
          <lpage>606</lpage>
          . doi: 10.1613/jair.1.11829.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Hunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pritzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tassa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <article-title>Continuous control with deep reinforcement learning</article-title>
          ,
          <year>2016</year>
          . doi: 10.48550/ARXIV.1509.02971.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <source>Evolutionary Action Selection for Gradient-Based Policy Learning</source>
          , Springer International Publishing,
          <year>2023</year>
          , pp.
          <fpage>579</fpage>
          -
          <lpage>590</lpage>
          . doi: 10.1007/978-3-031-30111-7_49.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>Offline reinforcement learning: Tutorial, review, and perspectives on open problems</article-title>
          ,
          <year>2020</year>
          . doi: 10.48550/ARXIV.2005.01643.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pellier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fiorino</surname>
          </string-name>
          ,
          <article-title>TempAMLSI: Temporal action model learning based on STRIPS translation</article-title>
          ,
          <source>Proceedings of the International Conference on Automated Planning and Scheduling</source>
          <volume>32</volume>
          (
          <year>2022</year>
          )
          <fpage>597</fpage>
          -
          <lpage>605</lpage>
          . doi: 10.1609/icaps.v32i1.19847.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Beattie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sadik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>518</volume>
          (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
          doi: 10.1038/nature14236.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schaul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Horgan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gregor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <article-title>Universal value function approximators</article-title>
          ,
          <source>in: Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research</source>
          , PMLR, Lille, France,
          <year>2015</year>
          , pp.
          <fpage>1312</fpage>
          -
          <lpage>1320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Andrychowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McGrew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tobin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <article-title>Hindsight experience replay</article-title>
          , in:
          <string-name>
            <given-names>I.</given-names>
            <surname>Guyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          Curran Associates, Inc.,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>