<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Gradient-based Optimization for Planning with Deep Q-Networks in Parametrized Action Spaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonas Ehrhardt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Schmidt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>René Heesch</string-name>
          <email>rene.heesch@hsu-hh.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oliver Niggemann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Planning, Parametrized Markov Decision Processes, Offline Reinforcement Learning, Deep Q-Networks</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CAIPI'25: ECAI Workshop on AI-based Planning for Complex Real-World Applications</institution>
          ,
          <addr-line>Bologna, Italy, 2025</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HSU-AI Institute for Artificial Intelligence, Helmut-Schmidt-University</institution>
          ,
          <addr-line>Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Automation, Helmut-Schmidt-University</institution>
          ,
          <addr-line>Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Many real-world planning problems feature parametrized action spaces, where each action is augmented by continuous parameters. Though deep Reinforcement Learning has achieved remarkable results in solving control and planning problems, it falls short at two central challenges of real-world planning problems with parametrized action spaces: (i) There is an infinite number of action-parameter candidates in every step of solving a planning problem, (ii) interacting with the planning domain is typically prohibitively expensive and available recordings from the planning domain are sparse. To counter these challenges, we introduce our novel Goal-Conditioned Model-Augmented Deep Q-Networks algorithm (GCM-DQN). The intuition behind GCM-DQN is to use gradient-based optimization on the surface of the Q-Function, instead of blunt estimators, to estimate the optimal parameters of an action in a state. In combination with a goal-conditioning of the DQN, and a state transition model, this allows us to find plans for planning problems in planning domains with parametrized action spaces. Our algorithm outperforms state-of-the-art Reinforcement Learning algorithms for planning in parametrized action spaces.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Planning, the combinatorial problem of finding a sequence of actions that transitions an initial state
into a goal state, is a fundamental problem in many real-world applications [1, 2]. Conventional
planning and Reinforcement Learning methods typically feature either purely discrete action spaces
(i.e. a finite set of actions, like moving up, down, left, or right in a grid world) or purely continuous
action spaces (i.e. an infinite set of actions, like controlling the acceleration of a cart on a slope) [2, 3].
However, many real-world problems feature parametrized action spaces. In a parametrized action space,
a finite set of actions is augmented by real-valued parameters, which influence the effects of the actions
[
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. During planning in parametrized action spaces, a planner hence must not only select from the
finite action set, but also real-valued parameters, to reach its goal [3]. For example, consider injection
molding, where there is a finite set of actions (e.g. close mold, inject, hold, cool, eject), which are each
augmented by real-valued parameters (e.g. heating/cooling energy, velocity, pressure, etc.). Both the
combinatorial aspect of finite action selection (e.g., injecting material before closing the mold would
lead to a mess) and the parametrization aspect (e.g., injecting too cold material leads to poor
surface characteristics of the molded product) have a major influence on the molded product.
Getting both aspects right is the task of planning in parametrized action spaces. Besides this simplified
example, many other real-world problems, from robotics to factory planning, feature parametrized
action spaces [
        <xref ref-type="bibr" rid="ref4 ref3 ref6 ref7 ref8">4, 3, 6, 7, 8</xref>
        ].
      </p>
      <p>There are two central challenges in solving planning problems in real-world parametrized action
spaces: (i) Due to the continuous nature of the parameter space, there is an infinite number of action-parameter
tuples a planner has to choose from in every state. This infinite branching of action-parameter
tuples in every state poses a challenge for selecting the optimal action-parameter tuple [9]. Typically,
infinite branching is either countered by parameter estimators [10], which have the risk of being
imprecise, or search [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which has the risk of being computationally expensive. (ii) Often there is
no sufficient model of the planning domain available, interaction with the domain is prohibitively
expensive or unsafe, and recorded data is scarce [12]. Hence, solving planning problems typically either
requires a manually crafted, expensive, and error-prone planning domain model [13, 5], or requires
advanced Reinforcement Learning algorithms which can be trained offline, meaning without interaction
with the planning domain, but which strongly rely on the assumption that the distribution of the recorded
data does not shift strongly from the application cases [12].
      </p>
      <p>
        In this paper, we tackle the challenges of infinite branching and training data scarcity in real-world
parametrized action spaces. Therefore, we propose to extend the well known Deep Q-Network (DQN)
algorithm [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. DQN uses a Neural Network to approximate the action value function, which returns the
expected cumulative return of taking an action in a state. In combination with a greedy policy, DQN can
solve even complex planning and control problems [14]. We propose to transfer DQN into a novel, offline
and model-augmented Reinforcement Learning setup, which allows us to use it for solving planning
problems in planning domains with parametrized action spaces [3] (cf. Figure 1). More precisely, we
propose three extensions to the DQN algorithm: (a) To tackle infinite branching, we introduce paramOpt,
a novel gradient-based optimization algorithm, to efficiently find optimal parameters for a given action
in a given state. (b) To make our algorithm applicable to unseen planning problems, we integrate a
goal-conditioning into the DQN [15]. (c) To allow using the DQN for planning without interacting with
the environment, we propose a novel state-transition model, which is trained along the DQN and allows
for planning in deterministic and probabilistic domains. We reduce the amount of training data needed to fit
the models by employing Hindsight Experience Replay [16] and Conservative Q-Learning [17].
      </p>
      <p>[Figure 1: Overview of GCM-DQN: (i) a planning problem, given as initial state, goal state, and current
state; (ii) the goal-conditioned DQN returning expected returns; (iii) gradient-based optimization and a greedy
policy selecting (action, param) tuples; (iv) the state transition model, which together yield a plan.]</p>
      <p>As a result, we present our Goal-Conditioned Model-Augmented DQN algorithm (GCM-DQN).
GCM-DQN can be trained on a sparse dataset of recorded plans from a planning domain. It returns a DQN
which can either be used as a policy in probabilistic scenarios, or, in combination with the parallelly
trained state transition model, as a planner for deterministic domains. In contrast to estimator- or
search-based algorithms for planning in parametrized action spaces, GCM-DQN converges quickly to optimal
parameters due to the gradient-based parameter optimization. The main contributions of our paper are:
• paramOpt, a novel gradient-based optimization algorithm to efficiently counter infinite branching
in planning domains with parametrized action spaces.
• A novel integration of paramOpt, goal-conditioning, and a novel state-transition model into DQN
to allow harnessing it for planning.
• A systematic and comprehensive evaluation of our approach against state-of-the-art
Reinforcement Learning paradigms for parametrized action spaces.</p>
      <p>The remaining paper is structured as follows: In Section 2 we review related research in the domain
of Reinforcement Learning for planning in parametrized action spaces. In Section 3 we introduce
the formalization of our problem. Section 4 introduces our solution, followed by its theoretical and
empirical evaluation in Section 5 and discussion in Section 6. We conclude our paper in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In Deep Reinforcement Learning, there are two directions when handling parameterized action spaces:
using Neural Networks as estimators that suggest parameters for actions, and using search or
optimization to find optimal parameters for an action. Typically, policy network approaches are grounded in
the Deep Deterministic Policy Gradient (DDPG) paradigm [10]. DDPG is an Actor-Critic approach, in
which the actor is a deep policy network that, given a state, suggests actions, and the critic is a deep
Q-network that calculates the cumulative expected return of the suggested action and state. Using
backpropagation over both networks allows for adapting their weights to converge to an optimal
policy- and Q-network. To solve planning problems in parametrized action spaces, Hausknecht and
Stone [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] extended the DDPG paradigm by expanding the deep policy network with an additional
non-binary output for suggesting parameter values, resulting in the P-DDPG algorithm. Fan et al.
[18] propose a similar approach. They use individual separate heads for selecting an action from the
finite action set, and individual separate heads for estimating its numerical parameters [18]. However,
both approaches neglect that there is a dependency between an action and the numerical parameters
[19]. Hence, Li et al. [19] proposed to encode the finite set of actions and numerical parameters into a
joint latent representation space on which the policy operates, and from which discrete and continuous
components are decoded for interaction with the environment. While the introduced approaches can
handle parametrized action spaces, they remain restricted to online settings, which require the agent to
interact directly with the environment, and are not well suited to an offline scenario with only little
available training data.
      </p>
      <p>
        Optimization- or search-based approaches typically follow a value-based paradigm, in which a greedy
policy selects the action-parameter tuple with the highest expected return. While methods like [20] use
a divide-and-conquer approach for complex action-parameter tuples that operates on a joint latent
representation, Xiong et al. [6] use a separate parameter estimation network which feeds into a DQN,
forming a parametrized DQN or P-DQN. Thereby, they can select a discrete action directly using a
greedy policy and do not rely on a continuous relaxation of the discrete action components (as, e.g.,
Hausknecht and Stone [4]) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Finally, Ma et al. [11] use an evolutionary optimization algorithm
for estimating an optimal action from a continuous action space. While such approaches can also be
adapted to parametrized action spaces, they are computationally expensive due to the uninformed
optimization paradigm.
      </p>
      <p>
        In contrast to typical Reinforcement Learning tasks, e.g., control, the reward structure in planning
problems is sparse. Typically, the reward for solving a planning problem is formalized by a single reward
signal upon reaching the goal state. This sparse reward signal hence is exclusively dependent on the goal
state, and changes for planning problems with diverging goal states. To make Reinforcement Learning
agents applicable to altering reward functions, Schaul et al. [15] introduced Universal Value Function
Approximators. Universal Value Function Approximators condition the value function approximator
on an embedding of the goal state, hence making it generalizable across altering planning problems
within the same domain [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Other methods for countering sparsity of reward signals, especially in
offline settings, include data augmentation, such as Hindsight Experience Replay [16], or regularization
in training by additional loss terms, such as Conservative Q-Learning [17].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Formalization</title>
      <p>Reinforcement Learning follows the assumption that there is an underlying MDP within all planning
domains. As we focus on planning problems in parametrized action spaces, we consider Parametrized
Action Markov Decision Processes (PAMDPs) [3].</p>
      <sec id="sec-3-1">
        <title>3.1. Parametrized Action Markov Decision Processes</title>
        <p>PAMDPs extend continuous Markov Decision Processes by introducing a hybrid, so-called parametrized
action space. They can be formalized as a tuple ⟨S, A, Ψ, T, ℛ, γ⟩, where S ⊆ ℝ^n is the continuous
state space, A = {a_0, ..., a_i, ..., a_K}, K ∈ ℕ, is a finite set of actions, in which each action a_i is
extended by a continuous parameter space Ψ_i ⊆ ℝ^{m_i}, and the union of all parameter spaces is given
as Ψ = ⋃_{i=1}^{K} Ψ_i. Together they form the parametrized action space
𝒜 = ⋃_{a_i ∈ A} {(a_i, ψ_i) | ψ_i ∈ Ψ_i}. (1)
T is the transition function
T = p(s_{t+1} | s_t, a_t, ψ_t) (2)
that describes the probability of transitioning into state s_{t+1} ∈ S given state s_t ∈ S, action a_t ∈ A,
and a parameter ψ_t ∈ Ψ at time t. ℛ is the reward function ℛ: S × S → ℝ that returns the scalar
reward r when transitioning from s_t into s_{t+1} using an action, and γ ∈ ℝ is a discount factor. We will
further refer to T as the dynamics of the MDP.</p>
        <p>As the transition dynamics in real-world PAMDPs can grow very complex, large models and large
datasets are needed to properly capture them. Leveraging the parametrized action space, we propose
to manage the complexity of real-world dynamics by a modular factorization of the parametrized action
space. Therefore, we split T into a finite set of transition functions T_{a_i}, which each are related to
one individual action a_i:
T = {T_{a_i} | T_{a_i} = p(s_{t+1} | s_t, ψ_t), ψ_t ∈ Ψ_i, i = 1, ..., K}. (3)
This allows us to model the transition dynamics for each action in one individual model f_{a_i} ≈ T_{a_i},
reducing the complexity of the modeling problem, while overall not affecting the PAMDP dynamics.
We can denote the collection of all f_{a_i} as ℱ = {f_{a_i}}_{i=1}^{K}. During planning, we can infer
state transitions by sampling from the transition models,
s_{t+1} ∼ f_{a_i}(s_t, ψ_t). (4)
In deterministic scenarios, the transition probabilities collapse to a Dirac delta distribution, which
effectively turns f_{a_i} into a deterministic function
f_{a_i}(s_t, ψ_t) = s_{t+1}. (5)</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Describing Planning Problems with PAMDPs</title>
        <p>Planning describes the task of finding a sequence P = {(a_t, ψ_t)}_{t=0}^{T−1} of T action-parameter
tuples that transitions an initial state s_0 into a goal state s_g ∈ G ⊂ S. Hence, a planning problem in a
PAMDP can be denoted as
⟨S, A, Ψ, ℱ, ℛ_g, γ, s_0, G⟩, (6)
where ℛ_g is a goal-conditioned, sparse reward function
ℛ_g(s) = r, if s ∈ G; 0, else, (7)
with the numerical reward value r ∈ ℝ.</p>
        <p>Reinforcement Learning typically solves planning problems by iteratively applying a policy on the
planning problem. Hence, a plan can be seen as a trajectory-level instantiation of a policy. A policy
in a PAMDP is a mapping from the current state s and goal state g to an action-parameter tuple. For
deterministic planning domains, the mapping is a function π_det(s, g) = (a, ψ). For probabilistic planning
domains, the mapping is a conditional distribution π((a, ψ) | s, g), where s ∈ S, g ∈ G, a ∈ A, ψ ∈ Ψ.</p>
        <p>For deterministic domains, the solution of a planning problem is a plan, which, when executed from
s_0, reaches a s_g ∈ G. For probabilistic domains, the solution of a planning problem is a proper policy.
A proper policy optimizes the discounted return of the planning problem and results in a goal state
s_g ∈ G. The sequence of action-parameter tuples selected by the policy during execution forms a plan.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Solution</title>
      <p>
        In this section, we introduce our GCM-DQN algorithm. GCM-DQN tackles the challenges of infinite
branching, prohibitively expensive domain interactions, and data scarcity in real-world planning domains
with parametrized action spaces. The intuition of GCM-DQN is to leverage the differentiability
of a DQN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] during planning for finding the optimal parameters and actions via gradient-based
optimization, instead of using estimators or search. Therefore, we add three extensions to the DQN
algorithm [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]: (a) To tackle the problem of infinite branching, we introduce the paramOpt algorithm,
a gradient-based optimization algorithm inspired by [21, 8], for finding a (leastwise locally) optimal
action-parameter tuple during planning (cf. Section 4.3). (b) To make GCM-DQN applicable to any
planning problem within the planning domain, we introduce a goal-conditioning to the DQN, as proposed
in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. We tackle data scarcity in training the goal-conditioned DQN by using Hindsight Experience
Replay [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and Conservative Q-Learning [17] (cf. Section 4.2). (c) Finally, to counter prohibitively
expensive domain interaction, we propose a novel state transition model which is trained in parallel
to the DQN on the same dataset (cf. Section 4.4), allowing to simulate state transitions without any
interaction with the planning domain.
      </p>
      <p>By combining the three proposed extensions, we arrive at our novel GCM-DQN algorithm (cf. Section
4.1). GCM-DQN can operate in planning domains with parametrized action spaces. It can either be used
as a policy for probabilistic planning domains or, when using the state transition model, as a planner
for deterministic planning domains (cf. Figure 2).</p>
      <p>[Figure 2: GCM-DQN during planning: the state transition models and the value function approximator are
used to calculate a decision value, from which a greedy policy builds the plan.]</p>
      <sec id="sec-4-1">
        <title>4.1. Planning with Goal-Conditioned Model-augmented Deep Q-Networks</title>
        <p>In this Section we provide an overview of the GCM-DQN algorithm (cf. Algorithm 1 and Figure 2).
In its essence, GCM-DQN is a goal-conditioned greedy policy, which is trained in an offline setting.
Hence, the first step includes training the DQN Q_θ and the state transition models ℱ = {f_{a_i} | i = 1, ..., K}
using a dataset of recorded plans 𝒟. During planning, GCM-DQN uses the paramOpt algorithm (cf.
Algorithm 2) on Q_θ to calculate the optimal parameter ψ̃* for every action. To guide the selection of
optimal action-parameter tuples, we calculate a decision value d for every action. d includes the Q-value,
the weighted variance of the succeeding state var (cf. Equation 18), and a potential-based shaping term.
Using d instead of the pure Q-values counters the selection of actions which would lead into
non-permissible states, e.g., colliding with boundaries. A greedy policy π_greedy then picks the highest d and
adds the corresponding action-parameter tuple (a, ψ̃*) to the plan. By sampling from the associated
state transition model, the next state s_{t+1} can be inferred and passed to the next iteration. The
iterations stop, when s_{t+1} becomes a state within G (or G ± ε, where ε is an error margin). In cases, in
which there is no solution to the planning problem, a stopping criterion can be introduced to bound
the maximum number of iterations. The complete GCM-DQN algorithm is outlined in Algorithm 1.
The following sections introduce the extensions of GCM-DQN in detail.</p>
        <sec id="sec-4-1-1">
          <title>Algorithm 1: GCM-DQN during planning</title>
          <p>Require: initial state s_0, goal G, trained goal-conditioned DQN Q_θ, trained state transition models ℱ, error margin ε, maximum number of iterations.</p>
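          <p>The planning loop of Algorithm 1 can be illustrated with a minimal Python sketch. All names below
(plan_gcm_dqn, param_opt, decision_value, the q_net and transition-model interfaces, and the stopping
parameters) are illustrative assumptions of this sketch, not the exact implementation.</p>
          <preformat>
# Minimal sketch of the GCM-DQN planning loop (Algorithm 1), assuming the
# helpers param_opt (Section 4.3) and decision_value (Q-value plus weighted
# successor-state variance and potential-based shaping) and the per-action
# transition models (Section 4.4) are available.
import numpy as np

def plan_gcm_dqn(s0, goal, q_net, transition_models, actions,
                 eps=0.05, max_iters=100):
    plan, s = [], s0
    for _ in range(max_iters):                        # bound for unsolvable problems
        candidates = []
        for a in actions:
            psi = param_opt(q_net, s, a, goal)        # gradient-based parameter optimization
            d = decision_value(q_net, transition_models, s, a, psi, goal)
            candidates.append((d, a, psi))
        _, a, psi = max(candidates, key=lambda c: c[0])   # greedy policy over decision values
        plan.append((a, psi))
        s = transition_models[a].sample(s, psi)       # simulate the step with the learned model
        if np.linalg.norm(s - goal) &lt;= eps:           # goal reached within error margin
            return plan
    return plan                                       # budget exhausted
          </preformat>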
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Goal-Conditioned DQN for Parametrized Action Spaces</title>
        <p>In this section, we describe our adaptations to DQN to allow using it for planning in planning domains
with parametrized action spaces. We achieve this by including the goal state into the input of the
DQN, thereby conditioning it on the goal state, and handling continuous per-action parameters via
gradient-based optimization.</p>
        <p>
          The original DQN uses a Neural Network to approximate the action value function Q(s, a) of a domain
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which describes the expected discounted return for taking action a in state s, and satisfies the
Bellman equation in the optimal case,
Q(s_t, a_t) = E[ℛ(s_{t+1}) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})].
        </p>
        <p>
          We extend the input of the Q-function by the goal state g of the planning
problem [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and the parametrized actions, so that Q = Q(s, a, ψ, g),
where a ∈ A is an action from the finite action set, ψ is an associated continuous parameter, and g is
the goal state of the planning problem. For our updated Q-function, the Bellman equation becomes
Q(s_t, a_t, ψ_t, g) = E[ℛ_g(s_{t+1}) + γ max_{a_{t+1}} max_{ψ_{t+1}} Q(s_{t+1}, a_{t+1}, ψ_{t+1}, g)]. (11)
        </p>
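        <p>A goal-conditioned Q-network of this form can, for instance, be realized as an MLP over the
concatenation of state, parameter, and goal, with one output head per discrete action. The following
PyTorch sketch is a minimal illustration; the layer sizes and the concatenation scheme are assumptions
of this sketch.</p>
        <preformat>
# Minimal sketch of a goal-conditioned Q-network Q(s, a, psi, g), assuming the
# parameter vector is padded to a common width and the discrete action selects
# one of the output heads.
import torch
import torch.nn as nn

class GoalConditionedQNet(nn.Module):
    def __init__(self, state_dim, param_dim, goal_dim, n_actions, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim + param_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),             # one Q-value head per discrete action
        )

    def forward(self, state, param, goal):
        x = torch.cat([state, param, goal], dim=-1)   # condition on parameters and the goal
        return self.mlp(x)                            # shape: (batch, n_actions)

# Q(s, a_i, psi, g) is then q_net(state, param, goal)[:, i].
        </preformat>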
        <p>As the inner maximization over ψ_{t+1} is non-convex when Q is approximated by a Neural Network,
solving it is intractable. Hence, we propose to leverage global optimization algorithms for finding
leastwise local optima for ψ and to solve Equation 11 in two steps. In the first step, we find optimal
action-parameter tuples for each action in the current state,
ψ*_i = arg max_{ψ_i ∈ Ψ_i} Q(s_{t+1}, a_i, ψ_i, g) ∀ a_i ∈ A, (12)
using projected gradient ascent (cf. Section 4.3). As we cannot guarantee a global optimum, we denote
the resulting parameters with ψ̃*. This first step allows us to reformulate Equation 11 as
Q(s_t, a_t, ψ_t, g) = E[ℛ_g(s_{t+1}) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, ψ̃*_{t+1}, g)], (13)
which resembles the Bellman equation with a goal conditioning and an approximate inner maximization.</p>
        <p>
          We train our goal-conditioned DQN Q_θ, with parameters θ, for parametrized action spaces in an
offline Reinforcement Learning setup, to cater to the restriction of prohibitively expensive domain
interactions in real-world planning domains. Therefore, we assume a training dataset of recorded plans
𝒟 = {P_j}_{j=0}^{N}. A major problem in offline Reinforcement Learning is the distributional shift between
training data and the application domain [12]. We counter this problem by augmenting 𝒟 with Hindsight
Experience Replay [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and Conservative Action Sampling [23] (cf. Figure 3). Hindsight Experience
Replay augments the available dataset by sampling sub-traces from the recorded plans, relabeling the
final state as the goal state [16]. Conservative Action Sampling also samples sub-traces from the
recorded plans, however, labeling their final state as a miss, therefore artificially creating negative samples
for the dataset [23]. Using both augmentation techniques results in the datasets 𝒟̃ [16] and 𝒟̄ [23].
        </p>
        <p>[Figure 3: Hindsight Experience Replay [16] and Conservative Action Sampling [23] for augmenting our
training dataset: sub-sequences are sampled and re-labeled with new goals (HER) or with random negative
goals (Conservative Action Sampling), before conservative Q-learning on the augmented data.]</p>
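        <p>The two augmentation steps can be sketched as follows on a dataset of recorded plans; the trajectory
format (lists of (state, action, parameter, next_state) tuples) and the sampling strategy are assumptions
of this sketch.</p>
        <preformat>
# Minimal sketch of the dataset augmentation, assuming each plan is a list of
# (s, a, psi, s_next) steps. Hindsight Experience Replay relabels the final
# state of a sampled sub-trace as the goal (hit, reward 1); Conservative Action
# Sampling labels a random goal as missed (reward 0).
import random

def hindsight_relabel(plan):
    start = random.randrange(len(plan))
    sub = plan[start:]
    goal = sub[-1][3]                                 # final state becomes the goal
    return [(s, a, psi, s_next, goal, 1.0 if i == len(sub) - 1 else 0.0)
            for i, (s, a, psi, s_next) in enumerate(sub)]

def conservative_relabel(plan, random_goal):
    start = random.randrange(len(plan))
    sub = plan[start:]                                # the random goal is never reached
    return [(s, a, psi, s_next, random_goal, 0.0) for (s, a, psi, s_next) in sub]
        </preformat>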
        <p>
          Following [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], we use an off-policy training setup, using an online network Q_θ and a target network
Q_{θ⁻}. During training, only the weights θ of Q_θ are updated via gradient descent, whereas the weights of
Q_{θ⁻} are copied from Q_θ every n steps. We use a composite loss function
ℒ = ℒ_TD + α ℒ_CQL, (14)
consisting of the squared TD-loss ℒ_TD [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and a conservative penalty term ℒ_CQL [17]. The conservative
penalty term ℒ_CQL helps to regularize Q_θ not to overestimate Q-values of unseen or underrepresented actions
[17]. We denote the squared TD-loss as
ℒ_TD = E_{(s_t, a_t, ψ_t, g, s_{t+1}) ∼ 𝒟̃ ∪ 𝒟̄} [(r_t + γ (1 − d_t) max_{a_{t+1}} Q_{θ⁻}(s_{t+1}, a_{t+1}, ψ̃*_{t+1}, g) − Q_θ(s_t, a_t, ψ_t, g))²], (15)
where d_t ∈ {0, 1} indicates whether the plan terminates at time t, so that d_t = 1, if s_{t+1} ∈ G. Following [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], we
formulate the conservative penalty term as
ℒ_CQL = E_{(s_t, a_t, ψ_t, g) ∼ 𝒟̃ ∪ 𝒟̄} [log(∑_{i=1}^{K} ∑_{j=1}^{M} exp(Q_θ(s_t, a_i, ψ_i^{(j)}, g))) − Q_θ(s_t, a_t, ψ_t, g)], (16)
where α is the trade-off factor between Bellman-fit and conservatism, K = |A| is the number of discrete
actions, and M is the number of parameter samples per action used in the log-sum-exp penalty. For our
offline training, we draw the samples ψ_i^{(j)} uniformly from the empirical pool of parameters for action a_i
to approximate ∫_{Ψ_i} exp(Q_θ(s_t, a_i, ψ, g)) dψ.
        </p>
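        <p>A sketch of the composite loss of Equations 14–16 is given below; the batch layout, the use of M
uniformly drawn parameter samples per action, and the tensor shapes are assumptions of this illustration.</p>
        <preformat>
# Minimal sketch of the composite loss L = L_TD + alpha * L_CQL (Eq. 14-16),
# assuming q_net(s, psi, g) returns Q-values for all K discrete actions and
# psi_samples holds M parameter tensors drawn uniformly from the dataset.
# psi_next would be obtained via paramOpt in the full algorithm.
import torch
import torch.nn.functional as F

def gcm_dqn_loss(q_net, target_net, batch, psi_samples, alpha, gamma):
    s, a, psi, g, r, s_next, psi_next, done = batch   # tensors from the augmented dataset
    q_sa = q_net(s, psi, g).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():                             # TD target from the target network (Eq. 15)
        q_next = target_net(s_next, psi_next, g).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    l_td = F.mse_loss(q_sa, target)

    # Conservative penalty (Eq. 16): log-sum-exp over all K actions and the M
    # sampled parameters, minus the Q-value of the action seen in the data.
    q_all = torch.stack([q_net(s, psi_m, g) for psi_m in psi_samples], dim=-1)
    l_cql = (torch.logsumexp(q_all.flatten(1), dim=1) - q_sa).mean()

    return l_td + alpha * l_cql
        </preformat>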
      <p>Regarding 𝒟, three edge cases must be considered: (i) 𝒟 including no data, (ii) 𝒟 including little
data, and (iii) 𝒟 including infinite data. In case (i), where no data is available, Q_θ cannot be trained.
Hence, data must be collected by random exploration or through sampling state transitions from the
domain. Case (ii) describes the normal operation of GCM-DQN. We note that the higher the variance
in the dataset, the better the approximation of Q_θ to the real Q. Case (iii) describes a special case, where
all data are available. Given a large enough Q_θ, this allows fitting Q exactly.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Gradient-based Parameter Estimation</title>
        <p>
          For finding the optimal parameters for an action in a given state, we propose to leverage the
differentiability of the DQN and use gradient ascent in a nested optimization loop for finding optimal
parameters for a given action (cf. Equation 12). Therefore, we introduce the paramOpt algorithm, which
draws inspiration from [24] and its applications in [9, 
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>The idea of paramOpt is to use the same algorithm, which is used to adapt the weights θ of Q_θ during
training, for finding the optimal action-parameter tuples during execution. However, instead of
optimizing the weights of the DQN, we optimize the parameter component ψ of its input. Therefore, we
initialize the parameter component ψ with a guess ψ̂, e.g., random numbers, zeros, or values from 𝒟.
After calculating Q_θ(s, a, g, ψ̂), we use backpropagation to derive the gradient with respect to ψ̂, allowing
us to use gradient ascent with a learning rate η to update ψ̂ in a direction which increases the Q-value.
The optimization stops after the updates of the Q-value, ΔQ, fall below a threshold ε, returning the last
update of ψ̂ as ψ̃*. Algorithm 2 summarizes our parameter estimation loop through input optimization.</p>
        <p>Algorithm 2: paramOpt Gradient‐Based Parameter Optimization</p>
        <preformat>
Require: s, a, g                                  // state, action, goal
         Q_θ                                      // goal-conditioned DQN
         η                                        // learning rate
         ε                                        // stopping threshold
   ψ̂ ← init()                                    // initial guess
1  ΔQ ← +∞
2  Q(prev) ← −∞
3  while ΔQ &gt; ε do
4      δ ← ∇_ψ̂ Q_θ(s, a, g, ψ̂)                   // backprop wrt. parameters
5      ψ̂ ← clip[ψ_min, ψ_max](ψ̂ + η δ)           // projected gradient ascent
6      Q(val) ← Q_θ(s, a, g, ψ̂)                  // calculate action value
7      ΔQ ← Q(val) − Q(prev)
8      Q(prev) ← Q(val)
9  return ψ̃* ← ψ̂                                 // optimized parameter
        </preformat>
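        <p>Algorithm 2 translates almost directly into autograd frameworks. The following PyTorch sketch is a
minimal illustration; the Q-network interface, the initialization from dataset statistics, and the default
learning rate and stopping threshold are assumptions of this sketch.</p>
        <preformat>
# Minimal sketch of paramOpt (Algorithm 2): projected gradient ascent on the
# parameter component of the Q-network input. q_net(s, psi, g)[a] is assumed
# to return Q(s, a, psi, g); psi_min and psi_max are the parameter bounds
# collected from the dataset.
import torch

def param_opt(q_net, s, a, g, psi_init, psi_min, psi_max, lr=0.01, eps=1e-4):
    psi = psi_init.clone().requires_grad_(True)    # initial guess (e.g. dataset mean)
    q_prev = float("-inf")
    delta = float("inf")
    while delta &gt; eps:
        q_val = q_net(s, psi, g)[a]                # action value for the current guess
        grad, = torch.autograd.grad(q_val, psi)    # backprop w.r.t. the parameters
        with torch.no_grad():
            psi += lr * grad                       # gradient ascent step
            psi.clamp_(psi_min, psi_max)           # projection onto the parameter bounds
        delta = q_val.item() - q_prev              # change of the action value
        q_prev = q_val.item()
    return psi.detach()                            # locally optimal parameters
        </preformat>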
        <p>As we are using gradient ascent as optimization algorithm over the DQN, we cannot guarantee
to find the true global optimum ψ*. This is due to the non-convex shape of Q_θ. The result of the
optimization hence can be strongly dependent on the initialization of ψ̂ and on the learning rate η. As there
are different options for initialization, e.g., zeros, ones, or random numbers, we suggest incorporating
prior knowledge from the dataset, in the form of estimators like the mean over observed parameter
settings, as starting guesses.</p>
        <p>Additionally, parameters are typically bound to value ranges, e.g., a temperature cannot fall below
0 Kelvin. To incorporate this, we use projected gradient ascent [25] during optimization, effectively
clipping values that exceed the bounds. As one naïve solution for retrieving the bounds, we suggest
iterating through the dataset 𝒟 and collecting the minima and maxima of each parameter.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Learning State Transition Dynamics</title>
        <p>In real-world planning problems, directly interacting with the planning domain to predict action
effects is rarely possible or prohibitively expensive [12]. Hence, planning requires a model of the
state transition dynamics [1] which maps a current state s_t and parameters ψ_t to a successor state s_{t+1}.</p>
        <sec id="sec-4-4-1">
          <title>In deterministic domains this is a funct (io</title>
          <p>n,   ) =  +1 (cf. Eq. 5); in probabilistic domains it is a
conditional distributi(on+1 ∣   ,   ) from which +1 is sampled (cf. Eq.4).</p>
          <p>Following the modular per–action factorization of PAMDP dynamics (c3f)., wEqe. learn one
we use the same datas et, which is also used for training the DQN.
transition model actioℱn,= {</p>
          <p>}=1 , each predicting the next state for a ctigoinven (  ,   ). Thereby,</p>
          <p>We propose to capture the stochasticity of probabilistic planning domains with a novel conditional
latent-variable state transition model, inspir2e6d]. bTyh[ereby, each per-action model comprises an
encoder  and a decode r part.</p>
          <p>During training , the encoder processes the inp u, t  , and +1 into the parametersand of a latent
posterio r (|  ,  +1 ,   ). Using the reparametrization trick, it sa mp=les+  ⊙ ,
 ∼  (0,  )
The decoder  reconstruc t+s1 from   ,   , and under a standard normal pri o(r) =  (0,  ) . As
.
training criterion, we minimize the negative Evidence Lower Bound,
A high variance indicates a high predictive uncertai nt̂+y1 ,inwhich indicates boundaries or
nonpermissible states, like obstacles.</p>
          <p>For deterministic domains, the stochastic latceannt be omitted and reduces to a standard</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>Multilayer Perceptron.</title>
        </sec>
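          <p>A per-action transition model of this kind can be sketched as a small conditional variational
autoencoder. The following sketch is a minimal illustration; layer sizes, the latent dimension, and the
interface of the sample method are assumptions of this sketch.</p>
          <preformat>
# Minimal sketch of one per-action latent-variable transition model f_a,
# assuming a conditional VAE: the encoder infers q(z | s, psi, s_next), the
# decoder reconstructs s_next from (s, psi, z). At planning time only the
# decoder is used with z ~ N(0, I).
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    def __init__(self, state_dim, param_dim, latent_dim=8, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(2 * state_dim + param_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))        # outputs mu and log-variance
        self.dec = nn.Sequential(
            nn.Linear(state_dim + param_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))
        self.latent_dim = latent_dim

    def forward(self, s, psi, s_next):                # training: terms of the ELBO (Eq. 17)
        mu, logvar = self.enc(torch.cat([s, psi, s_next], -1)).chunk(2, -1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparametrization trick
        recon = self.dec(torch.cat([s, psi, z], -1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon, kl

    def sample(self, s, psi, n=1):                    # planning: decode with z ~ N(0, I)
        z = torch.randn(n, self.latent_dim)
        s_rep = s.view(1, -1).expand(n, -1)
        psi_rep = psi.view(1, -1).expand(n, -1)
        return self.dec(torch.cat([s_rep, psi_rep, z], -1))

# The predictive variance over n decoded samples (Eq. 18) can be used to flag
# non-permissible states: samples = model.sample(s, psi, n); samples.var(0).sum().
          </preformat>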
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>We evaluate our GCM-DQN algorithm empirically against offline versions of state-of-the-art baselines
for planning in parametrized action spaces [4, 6]. As performance metric, we use the rate of successfully
solved planning problems from a set of unseen planning problems. Therefore, we used domains with
navigation problems and domains from the international planning competition's (IPC) reinforcement
learning track [27] (cf. Figure 4). We hypothesize that (1) GCM-DQN shows a higher performance
than the baselines, when trained on the same limited dataset of plans 𝒟, and (2) GCM-DQN maintains
a higher performance longer than the baselines, when systematically reducing the number of samples
in 𝒟.</p>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
        <p>For setting up our experiments, we follow the experimental design guidelines for empirical Machine
Learning research by Vranješ et al. [28]. We generate samples for the datasets 𝒟 by running either
an A* search or JaxPlan [29] for randomly initialized planning problems of the chosen planning
domains. We used Optuna [30] for hyperparameter optimization of GCM-DQN and the baselines
to allow for a fair comparison. We repeated all experiments on eight different seeds to rule out
lucky initializations. All code and datasets for replicating the experiments can be found under
https://github.com/j-ehrhardt/gcmdqn. We used the following planning domains for evaluation:</p>
        <p>Navigation Domains The navigation domains feature two-dimensional path finding problems in a
continuous space with obstacles. The goal is to find a sequence of actions that leads from the start state
to the goal state. There is a set of four actions - up, down, left, right - in which each action can be
augmented with a plus/minus ten-degree tilt. The step-width is fixed and collisions with the obstacles
are forbidden. The planning problems are non-trivial, as the reward function is sparse and planners
need to deal with linear and non-linear obstacles.</p>
        <p>IPC Domains The IPC domains feature domains from the International Planning Competition's
Probabilistic and Reinforcement Learning Track from 2023 [27]. We picked the reservoir, powergen,
and HVAC domains.</p>
        <p>While the navigation domains have a stronger focus on the combinatorial aspect of finding a correct
action to solve the planning problems, the IPC domains put a stronger emphasis on finding the correct
parameters. As it is highly unrealistic that a learning algorithm trained on a scarce dataset 𝒟 could match
classical evaluation metrics like optimality, soundness, efficiency, and completeness1, we chose the
planning success rate, describing the number of successfully solved planning problems from a set of
unseen test planning problems.</p>
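        <p>To make the parametrized action space of the navigation domains concrete, a single step could be
sketched as follows; the exact state representation, the step width, and the collision handling are
assumptions of this sketch.</p>
        <preformat>
# Minimal sketch of a navigation-domain step: four discrete actions (up, down,
# left, right), each augmented by a tilt parameter in [-10, 10] degrees, with a
# fixed step width. Collision checking is assumed to be provided by the domain.
import math

BASE_ANGLES = {"up": 90.0, "down": 270.0, "left": 180.0, "right": 0.0}
STEP_WIDTH = 0.1                                  # fixed step width (assumption)

def nav_step(pos, action, tilt_deg, collides):
    tilt_deg = max(-10.0, min(10.0, tilt_deg))    # parameter range of the action
    angle = math.radians(BASE_ANGLES[action] + tilt_deg)
    new_pos = (pos[0] + STEP_WIDTH * math.cos(angle),
               pos[1] + STEP_WIDTH * math.sin(angle))
    return pos if collides(new_pos) else new_pos  # collisions with obstacles are forbidden
        </preformat>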
        <p>
          As baselines we used P-DQN [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and P-DDPG [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] from the literature, as, to our knowledge, there are no
offline Reinforcement Learning algorithms for solving planning problems in PAMDPs. While P-DDPG
is a policy-based approach which is trained in an actor-critic setup [4], P-DQN is closer related to our
approach, using a DQN for evaluating different action-parameter tuples. However, instead of finding
optimal parameter values via gradient-based search, it uses a Neural Network as heuristic for suggesting
parameter values [6]. We transferred both baselines into an offline setting, using Conservative Q-learning,
Hindsight Experience Replay, and potential-based shaping as for our algorithm.
1 As our algorithm is grounded in the Bellman Equation, its solutions will converge to optimal, sound, and complete results
with an infinitely large dataset. However, this is not its operational scenario. We hence do not consider very large datasets
for evaluation.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluating the Planning Performance of GCM-DQN</title>
        <p>For evaluating the performance of GCM-DQN in comparison to the baselines, we created a training
dataset 𝒟 of 128 solved planning problems and a test dataset of 100 solved problems per domain. We
ran a hyperparameter search with 64 trials for each algorithm and domain and subsequently tested each
algorithm with the best hyperparameter setup on eight different seeds, to rule out lucky initialization.
The results are reported in Table 1.</p>
        <p>We hypothesized that GCM-DQN shows a higher performance than the baselines, when trained on the
same limited dataset 𝒟. For the navigation domains, our results indicate that GCM-DQN shows a higher
mean planning success rate over the eight different seeds than the baselines, when trained on a limited
dataset of 128 plans. For the IPC domains, either P-DDPG or GCM-DQN shows the highest performance,
with only narrow differences. As the IPC domains have a stronger emphasis on the parametrization
than on the combinatorial action selection, it is to be expected that the Actor-Critic approach performs well
in the IPC domains, while underperforming in the navigation domains. Overall, all algorithms show
declining performance with increasing complexity of the planning domains. Yet, our GCM-DQN shows
the most stable results, in comparison to the baselines.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Evaluating the Planning Performance of GCM-DQN on Succeedingly Scarce</title>
      </sec>
      <sec id="sec-5-4">
        <title>Data</title>
        <p>The application scenario for GCM-DQN is planning under circumstances where only little data is
available and interactions with the environment are not possible. For evaluating the behavior of
GCM-DQN on scarce data, we trained GCM-DQN and the baselines on succeedingly fewer samples in 𝒟.
Therefore, we created subsets of 𝒟 containing {64, 32, 16, 8, 4, 2} samples and trained GCM-DQN and
the baselines with the hyperparameter settings from above. For each algorithm and dataset, we repeated
the procedure on eight different seeds. Figure 5 shows the results for the navigation and IPC domains.</p>
        <p>We hypothesized that GCM-DQN maintains a higher performance under progressive sample reduction
compared to the baseline methods. For the navigation domains, we could mostly confirm this.
GCM-DQN shows an increase in planning success rates, when increasing the number of plans in the training
dataset. In the navigation domains, GCM-DQN gets overtaken by P-DQN in the lower sample area of
the circle domains and very closely in the higher sample area of the bars domain. Yet, it shows
in general lower variance across the seeds, suggesting a more stable outcome. In the IPC domains,
GCM-DQN and P-DDPG are consistently strong and stable, with an exception for GCM-DQN in the
reservoir domain, while P-DQN performs low. Overall, GCM-DQN tends to improve, sometimes sharply
in the higher sample area, while P-DDPG is competitive but more variable. P-DQN underperforms
across the IPC domains.</p>
        <p>[Figure 5: Planning success rates over the number of training plans (2, 4, 8, 16, 32, 64, 128) for the
navigation and IPC domains.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>In the following, we discuss the findings from our evaluation in Section 5. We place special emphasis
on discussing architectural limitations of GCM-DQN, the distributional shift to the application
scenarios, and the implications of aleatoric uncertainty from latent factors in the planning domains.
Architectural Limitations of GCM-DQN Given the architecture we chose for our GCM-DQN
algorithm, there are inherent limitations. Our gradient-based paramOpt function for estimating the
parameters for actions can converge to local optima in the Q-function. Especially in complex, non-convex
Q-functions, this poses a serious problem. Mitigation strategies could include ensemble approaches with
differently seeded optimizers, multi-start optimization with different initial guesses, or a combination of
both. Additionally, in essence, our GCM-DQN algorithm is one-step greedy (though implicitly operating
on the expected returns of the DQN). Especially for domains in which long plans are necessary to
reach a goal, the sparse reward signal of the training data might lead to wrong results. Using the transition
model for look-ahead methods, like Monte Carlo Tree Search, might result in better performance of the
planner. Alternatively, a hierarchical perspective where GCM-DQN plans between intermediate goals
might lead to increased performance with longer plans. As some hyperparameters, e.g., the weight
of Conservative Q-Learning or the number of Conservative Action Samples, have a strong impact
on the performance and stability of the planner, including them as parameters in the training loop to
dynamically adapt the conservatism or data augmentation level of the model during training might be
a future improvement.</p>
      <p>Data Quantity and Diversity The quantity and diversity of the training data in the training dataset
𝒟 had a significant impact on the performance of the tested algorithms. Our results support the intuition
that more and diverse data improves the approximation of the true Q-function and true transition
dynamics. The planning success rate of our GCM-DQN algorithm continuously improved as the number
of plans in 𝒟 increased. All methods struggled in scenarios where only few samples in the training
dataset were available. We deliberately focused on scarce data scenarios in our evaluation, as they
reflect the real-world application of planners, where collecting more data and an interaction with the
environment is prohibitively expensive. In this context, including Conservative Q-Learning and
Hindsight Experience Replay as mitigations for scarce data was important. Even though Hindsight
Experience Replay did not raise the mean planning success rates, it reduced the outcome variability
and thus improved the reliability of GCM-DQN on small data. This implies that when working with
scarce data and the performance is insufficient, adding additional data may be more effective than
tweaking the algorithms in isolation.</p>
      <sec id="sec-6-1">
        <title>Distributional Shift in Offline Reinforcement Learning</title>
        <p>
          One of the core challenges in Offline Reinforcement Learning is the distributional shift between the training data and application scenarios
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Especially in the context of planning, the planner is likely to encounter state, action, and parameter
combinations that lie outside the support of the training data, which can lead to extrapolation errors. We
mitigated this risk using three mechanisms from the Offline Reinforcement Learning literature: Using
Hindsight Experience Replay [16], Conservative Action Sampling [23], and Conservative Q-Learning
[17]. Our results indicate that all measures improved training stability and planning performance.
Aleatoric Uncertainty from Latent Factors in the Planning Domain Real-world application
scenarios for planners, e.g., industrial processes, often show hidden factors and randomness that offline
training cannot fully predict. For example, in a manufacturing domain, tool wear can alter a system's
dynamics, introducing aleatoric uncertainty. Though our GCM-DQN approach attempts to accommodate
stochasticity in its state transition models, systematic latent factor shifts over time will lead to
mispredictions of future transitions as the underlying transition dynamics changed. This limitation,
however, is not unique to our approach but shared by all offline learning methods. Mitigating it could
involve a periodic re-training with "fresh" data or designing the model to capture these factors explicitly
or in latent variables.
        </p>
        <p>
          Evaluation Fairness of Offline Baselines Finally, we discuss the evaluation fairness of the employed
baselines. The employed baselines P-DDPG [4] and P-DQN [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] were originally designed for online
Reinforcement Learning, where extensive interaction with the environment shapes the policy and
DQN. Conversely, we evaluated them in an offline setting. However, to ensure a fair evaluation with
our GCM-DQN algorithm, we adapted both baselines to the offline setup, by incorporating the same
techniques that we used in GCM-DQN to improve the training performance of the models. Namely, we
used the same state transition models, Conservative Q-Learning, Hindsight Experience Replay, and
Conservative Action Sampling, creating a common and fair ground for evaluation.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion &amp; Outlook</title>
      <p>In this paper, we introduced the Goal-Conditioned Model-Augmented DQN algorithm (GCM-DQN), a
model-augmented Offline Reinforcement Learning algorithm for planning in parametrized action spaces,
where no model of the planning domain and only a limited dataset of recorded plans are available.
GCM-DQN tackles three central challenges of planning with Reinforcement Learning in parametrized
action spaces: (i) infinite branching of action-parameter tuples, (ii) goal-dependent reward functions,
and (iii) substituting domain interactions with a model during planning time. To address these challenges,
we introduce paramOpt, a novel gradient-based optimization algorithm over the DQN for finding the
optimal parameters for an action in a state, a goal-conditioning of the DQN that allows for planning
with changing and sparse reward functions, and a novel state transition model that allows to capture the
inherent stochasticity of probabilistic planning domains. We evaluate GCM-DQN against
offline versions of two closely related algorithms. GCM-DQN shows significantly higher performance
than the baselines, especially in data-scarce scenarios. Future work will include the refinement of
GCM-DQN's architecture and its application to real-world industrial planning scenarios.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgement</title>
      <p>This research, as part of the projects LaiLa and EKI, is funded by dtec.bw – Digitalization and Technology
Research Center of the Bundeswehr, which we gratefully acknowledge. dtec.bw is funded by the
European Union – NextGenerationEU.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>Any use of generative AI in this manuscript adheres to ethical guidelines of IEEE for use and
acknowledgement of generative AI. Each author has made a substantial contribution to the work, using LLMs
exclusively for language refinement, formatting purposes, and for non-substantial coding, e.g., for
creating plots.</p>
      <p>[17] A. Kumar, A. Zhou, G. Tucker, S. Levine, Conservative q-learning for offline reinforcement learning,
in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information
Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1179–1191.
[18] Z. Fan, R. Su, W. Zhang, Y. Yu, Hybrid actor-critic reinforcement learning in parameterized
action space, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial
Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019,
pp. 2279–2285. doi:10.24963/ijcai.2019/316.
[19] B. Li, H. Tang, Y. Zheng, J. Hao, P. Li, Z. Wang, Z. Meng, L. Wang, Hyar: Addressing
discrete-continuous action reinforcement learning via hybrid action representation, 2021. doi:
10.48550/ARXIV.2109.05490.
[20] A. Tavakoli, F. Pardo, P. Kormushev, Action branching architectures for deep reinforcement
learning, Proceedings of the AAAI Conference on Artificial Intelligence 32 (2018). doi:10.1609/
aaai.v32i1.11798.
[21] G. Wu, B. Say, S. Sanner, Scalable planning with tensorflow for hybrid nonlinear domains, in:
I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.),
Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[22] A. Y. Ng, D. Harada, S. J. Russell, Policy invariance under reward transformations: Theory and
application to reward shaping, in: Proceedings of the Sixteenth International Conference on
Machine Learning, ICML ’99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999, p.
278–287.
[23] Y. Chebotar, K. Hausman, Y. Lu, T. Xiao, D. Kalashnikov, J. Varley, A. Irpan, B. Eysenbach, R. C.</p>
      <p>Julian, C. Finn, S. Levine, Actionable models: Unsupervised ofline reinforcement learning of
robotic skills, in: Proceedings of the 38th International Conference on Machine Learning, volume
139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 1518–1528.
[24] D. P. Kingma, S. Mohamed, D. J. Rezende, M. Welling, Semi-supervised learning with deep
generative models, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (Eds.),
Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014.
[25] P. H. Calamai, J. J. Moré, Projected gradient methods for linearly constrained problems,
Mathematical Programming 39 (1987) 93–116. doi:10.1007/bf02592073.
[26] K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional
generative models, in: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances
in Neural Information Processing Systems, volume 28, Curran Associates, Inc., 2015.
[27] A. Taitler, R. Alford, J. Espasa, G. Behnke, D. Fišer, M. Gimelfarb, F. Pommerening, S. Sanner,
E. Scala, D. Schreiber, J. Segovia‐Aguas, J. Seipp, The 2023 international planning competition, AI
Magazine 45 (2024) 280–296. doi:10.1002/aaai.12169.
[28] D. Vranješ, J. Ehrhardt, R. Heesch, L. Moddemann, H. S. Steude, O. Niggemann, Design Principles
for Falsifiable, Replicable and Reproducible Empirical Machine Learning Research, in: 35th
International Conference on Principles of Diagnosis and Resilient Systems (DX 2024), volume
125, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2024. doi:10.4230/
OASIcs.DX.2024.7.
[29] M. Gimelfarb, A. Taitler, S. Sanner, Jaxplan and gurobiplan: Optimization baselines for replanning
in discrete and mixed discrete-continuous probabilistic domains, Proceedings of the International
Conference on Automated Planning and Scheduling 34 (2024) 230–238. doi:10.1609/icaps.v34i1.
31480.
[30] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter
optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghallab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nau</surname>
          </string-name>
          , P. Traverso,
          <source>Automated Planning and Acting</source>
          , Cambridge University Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: An introduction</article-title>
          , MIT press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Masson</surname>
          </string-name>
          , P. Ranchod, G. Konidaris,
          <article-title>Reinforcement learning with parameterized actions</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>30</volume>
          (
          <year>2016</year>
          )
          . doi:10.1609/aaai.v30i1.10226.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hausknecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning in parameterized action space</article-title>
          ,
          2016. doi:10.48550/ARXIV.1511.04143.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Heesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ehrhardt</surname>
          </string-name>
          ,
          <string-name>
            <surname>O. Niggemann,</surname>
          </string-name>
          <article-title>Integrating machine learning into an smt-based planning approach for production planning in cyber-physical production systems</article-title>
          ,
          <source>in: Artificial Intelligence. ECAI 2023 International Workshops</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>318</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          , L. Han,
          <string-name>
            <surname>Y</surname>
          </string-name>
          . Zheng,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Liu, H. Liu,
          <article-title>Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space</article-title>
          ,
          <year>2018</year>
          . doi:10.48550/ARXIV.1810.06394.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ehrhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Niggemann</surname>
          </string-name>
          ,
          <article-title>Learning process steps as dynamical systems for a subsymbolic approach of process planning in cyber-physical production systems</article-title>
          ,
          <source>in: Artificial Intelligence. ECAI 2023 International Workshops</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>332</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Heesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ehrhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Diedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Niggemann</surname>
          </string-name>
          ,
          <article-title>A lazy approach to neural numerical planning with control parameters</article-title>
          ,
          <source>in: European Conference on Artificial Intelligence (ECAI)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Say</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanner</surname>
          </string-name>
          ,
          <article-title>Scalable planning with deep neural network learned transition models</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>68</volume>
          (
          <year>2020</year>
          )
          <fpage>571</fpage>
          -
          <lpage>606</lpage>
          . doi: 10.1613/jair.1.11829.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Hunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pritzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tassa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <article-title>Continuous control with deep reinforcement learning</article-title>
          ,
          <year>2016</year>
          . doi: 10.48550/ARXIV.1509.02971.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <source>Evolutionary Action Selection for Gradient-Based Policy Learning</source>
          , Springer International Publishing,
          <year>2023</year>
          , pp.
          <fpage>579</fpage>
          -
          <lpage>590</lpage>
          . doi: 10.1007/978-3-031-30111-7_49.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>Offline reinforcement learning: Tutorial, review, and perspectives on open problems</article-title>
          ,
          <year>2020</year>
          . doi: 10.48550/ARXIV.2005.01643.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pellier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fiorino</surname>
          </string-name>
          ,
          <article-title>TempAMLSI: Temporal action model learning based on STRIPS translation</article-title>
          ,
          <source>Proceedings of the International Conference on Automated Planning and Scheduling</source>
          <volume>32</volume>
          (
          <year>2022</year>
          )
          <fpage>597</fpage>
          -
          <lpage>605</lpage>
          . doi: 10.1609/icaps.v32i1.19847.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Beattie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sadik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>518</volume>
          (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
          doi: 10.1038/nature14236.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schaul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Horgan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gregor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <article-title>Universal value function approximators</article-title>
          ,
          <source>in: Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research</source>
          , PMLR, Lille, France,
          <year>2015</year>
          , pp.
          <fpage>1312</fpage>
          -
          <lpage>1320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Andrychowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McGrew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tobin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <article-title>Hindsight experience replay</article-title>
          , in:
          <string-name>
            <given-names>I.</given-names>
            <surname>Guyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          Curran Associates, Inc.,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>