<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hybrid Rule-based QoS-Optimized Modification of Deep Q-Learning Network: Properties and Specifics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Irina Grishanova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Rogushina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pylyp An</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Digitalisation of Education, National Academy of Educational Sciences of Ukraine</institution>
          ,
          <addr-line>9 M. Berlynskoho Str., Kyiv, 04060</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Software Systems, National Academy of Sciences of Ukraine</institution>
          ,
          <addr-line>44 Glushkov Pr., Kyiv, 03680</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Learning</institution>
          ,
          <addr-line>Deep Q-learning Network, Personal Learning Pathway, multigoal planning</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The study analyzes the capabilities of reinforcement learning methods and their components. We propose a hybrid rule-based QoS-optimized modification of the Deep Q-learning network (DQN) method, developed to facilitate automatic determination of key learning parameters and to incorporate mechanisms for continuous learning. This method enables the agent to adapt to environment dynamics, such as changes in the availability of possible actions and in their parameters, without requiring complete relearning, and ensures flexibility, universality and enhanced efficiency of the machine learning method across various domains. We test the proposed algorithm and its characteristics on the practical task of constructing a personalized learning pathway for the educational process.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Reinforcement Learning (RL) is a machine learning approach where an agent interacts with an environment
and learns on the basis of rewards and penalties [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. RL is used to solve optimization and decision-making
tasks in complex dynamic environments. The fundamental components of RL are: agents, the environment,
and interactions.
      </p>
      <p>An agent is an autonomous entity that makes learning-based decisions and adapts its behavior
according to the feedback between its actions and changes in the environment state. The agent state
represents its internal belief about the environment, which can be partially or fully observable.
The agent's action policy π is the strategy that the agent uses to select its next action based on its current state.
The agent's task is defined by an initial state and a goal state.</p>
      <p>The environment <inline-formula><tex-math>E = \langle S, A, P, R, s_{init}, S_{target} \rangle</tex-math></inline-formula> is the space in which the agent operates, where: S is the space of
states that the agent can observe; A is the action space, defined as the set of all possible actions the agent can
take; the transition function P determines the new state of the agent after performing action a; the reward
function R evaluates the agent's actions; the initial state <inline-formula><tex-math>s_{init}</tex-math></inline-formula> marks the starting point; the target states
<inline-formula><tex-math>S_{target}</tex-math></inline-formula> define the conditions for completing the learning process. If the agent reaches a target state
<inline-formula><tex-math>s_{goal} \in S_{target}</tex-math></inline-formula>, the episode ends, and the agent receives a reward.</p>
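      <p>The state space of the environment <inline-formula><tex-math>S = \{ s_q = \langle x_1, \ldots, x_n \rangle \},\ q = \overline{1, 2^n}</tex-math></inline-formula>, is the set of all possible state
configurations that the agent can recognize. A state <inline-formula><tex-math>s = \langle x_1, \ldots, x_n \rangle,\ s \in S</tex-math></inline-formula>, is defined as an n-dimensional
vector.</p>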
      <p>It is important to distinguish between the state of the environment and the state of the agent. Every
state of the environment s ∈ S is a combination of available input and output parameters that vary
depending on the actions performed by the agent. The state of the agent <inline-formula><tex-math>s_a \in S</tex-math></inline-formula> is an internal set of
parameters of the learning model, which can include knowledge of prior experience (e.g., an action-value
table Q(s, a) or neural network parameters), the current level of environment exploration (the ε value of the
ε-greedy policy), knowledge about executed transitions, etc. In further explanations, we treat s as an
environment state, because it serves as the input data for the learning algorithm.</p>
      <p>In this work, we introduce additional terms to clarify the proposed modification of the learning algorithm.</p>
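      <p>State signals <inline-formula><tex-math>x_i \in \{0, 1\},\ i = \overline{1, n}</tex-math></inline-formula>, are state attributes that characterize the state of the environment: <inline-formula><tex-math>x_i = 1</tex-math></inline-formula>
if the agent has information about this environmental parameter; <inline-formula><tex-math>x_i = 0</tex-math></inline-formula> if the agent has no information
about this parameter of the environment.</p>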
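      <p>Terminal states <inline-formula><tex-math>S_{term} \subset S</tex-math></inline-formula> are a subset of states that conclude training episodes. They are divided into
several non-overlapping subsets: <inline-formula><tex-math>S_{target} \cup S_{deadlock} \cup S_{limit} = S_{term}</tex-math></inline-formula>.</p>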
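      <p>Target states <inline-formula><tex-math>S_{target} \subseteq S_{term}</tex-math></inline-formula> are represented by the set of states in which all signals defined as the
agent's learning goal are equal to 1 (other signal values can be 1 or 0). The goal state <inline-formula><tex-math>s_{goal} \in S_{target}</tex-math></inline-formula> must
meet the following condition: for all elements <inline-formula><tex-math>s_{target} = \langle x_{target_1}, \ldots, x_{target_n} \rangle \in S_{target}</tex-math></inline-formula>
and <inline-formula><tex-math>s_{goal} = \langle x_{goal_1}, \ldots, x_{goal_n} \rangle</tex-math></inline-formula>, if <inline-formula><tex-math>x_{goal_k} = 1</tex-math></inline-formula>, then <inline-formula><tex-math>x_{target_k} = 1,\ k = \overline{1, n}</tex-math></inline-formula>.</p>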
      <p>Deadlock states <inline-formula><tex-math>S_{deadlock} \subseteq S_{term}</tex-math></inline-formula> are a set of states where the agent cannot perform any action to
expand its knowledge. These states can arise if the agent has no available services to execute or if all
possible actions either do not change the state or repeat actions that have already been performed.</p>
      <p>Limit states <inline-formula><tex-math>S_{limit} \subseteq S_{term}</tex-math></inline-formula> are a set of states in which the agent has used up the allowed number of steps
without completing the assigned task.</p>
      <p>
        The action space A is the set of all possible actions that the agent can perform within the environment. The
agent's actions can be represented as services invoked by the agent, where each service is characterized by
input parameters, output parameters, a description of their transformation, and quality of service (QoS)
parameters that describe this transformation. A service w ∈ W is a specific entity that, in the presence of the appropriate
input parameters, changes the state of the environment and determines the values of the output
parameters (either modifies or computes them) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
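      <p>A service of this kind can be sketched as a small Python record. The class name Service, its fields, and the rule that executing a service adds its outputs to the set of known signals are illustrative assumptions, not the authors' data model.</p>
      <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical service record: required inputs, produced outputs, QoS values."""
    name: str
    inputs: frozenset
    outputs: frozenset
    qos: dict = field(default_factory=dict)

    def is_applicable(self, known):
        # A service can run only when all of its input signals are known.
        return self.inputs.issubset(known)

    def apply(self, known):
        # Assumed effect: executing the service adds its outputs to the known signals.
        return frozenset(known) | self.outputs


w = Service("w1", frozenset({"a"}), frozenset({"b"}), {"time": 3.0})
state = frozenset({"a"})
if w.is_applicable(state):
    state = w.apply(state)
print(sorted(state))
```
      </preformat>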
      <p>A flow is a sequence of actions, starting from the initial state, where each action is permissible in the
state in which it is performed. These actions move the environment from one state to another, and the flow is complete when the
environment reaches a target state.</p>
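      <p>The agent finds a set of flows <inline-formula><tex-math>M = \{ m_j \},\ j = \overline{1, p}</tex-math></inline-formula>, that ensure the transition from the initial state <inline-formula><tex-math>s_{init}</tex-math></inline-formula> to
the target state <inline-formula><tex-math>s_{goal}</tex-math></inline-formula> and selects the optimal one based on the QoS parameters of the services within the
flow: <inline-formula><tex-math>f(m_{opt}, QoS(m_{opt})) \ge f(m_j, QoS(m_j)),\ j = \overline{1, p}</tex-math></inline-formula>.</p>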
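      <p>Selecting the optimal flow amounts to taking the maximum of f over the candidate flows. The sketch below assumes a particular utility (mean rating minus total time) purely for illustration; the text does not fix the form of f, and the flow names and QoS values are made up.</p>
      <preformat>
```python
# Hypothetical flows: each flow is a list of (service_name, qos) pairs.
flows = {
    "m1": [("w1", {"time": 4.0, "rating": 4.5}), ("w2", {"time": 2.0, "rating": 3.5})],
    "m2": [("w3", {"time": 5.0, "rating": 5.0})],
}


def flow_score(flow, time_weight=1.0, rating_weight=1.0):
    """Assumed utility f: reward a high mean rating, penalize total time."""
    total_time = sum(q["time"] for _, q in flow)
    mean_rating = sum(q["rating"] for _, q in flow) / len(flow)
    return rating_weight * mean_rating - time_weight * total_time


# m_opt is the flow whose score is not lower than that of any other flow.
best = max(flows, key=lambda name: flow_score(flows[name]))
print(best)
```
      </preformat>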
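      <p>A composite service <inline-formula><tex-math>w_{compoz} = \langle w_1, \ldots, w_r \rangle</tex-math></inline-formula> is an ordered sequence of services <inline-formula><tex-math>w_k \in W,\ k = \overline{1, r}</tex-math></inline-formula>, such that
after the execution of each service the set of environmental signals becomes suitable for executing the next
service. After all services have been completed, the state of the environment belongs to the set of target states
<inline-formula><tex-math>S_{target} \subseteq S_{term}</tex-math></inline-formula>.</p>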
      <p><disp-formula><tex-math>w_{compoz} : s_{init} \to s_{goal}</tex-math></disp-formula></p>
      <p>To find the optimal composite service that satisfies the given initial and target states, it is necessary to
define the optimal conditions for QoS parameters, i.e., their minimization or maximization.</p>
      <p>
        The mechanism for updating the agent's knowledge about the environment depends on the ML method
used. The Q-learning method [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] represents this mechanism by updating the Q-table according to the
Bellman equation. In DQN (Deep Q-learning) [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], it involves updating the parameters of a neural
network that approximates the Q-function Q(s, a) through gradient descent, utilizing a Replay Buffer and
a Target Network (this approach enables the agent to handle large or continuous state spaces).
      </p>
      <p>The transition function T(s, a, s′), denoted P in the environment tuple, determines the state s′ that the agent reaches after performing action a in state s.
Depending on the transition function, environments are categorized as either deterministic or stochastic.</p>
      <p>The reward function R(s, a, s′) evaluates the agent's actions.</p>
      <p>Interaction is the process by which the agent explores the environment: by choosing an action
(based on its ε-greedy policy), the agent receives feedback (a positive or negative reward R), updates
its knowledge according to the ML method (e.g., Q-learning or DQN), and passes to a new state. The
ε-greedy policy is used to maintain a balance between exploration and exploitation.</p>
      <p>The RL components form a closed loop of interactions between the agent and the environment: the
agent selects an action → the environment responds by changing its state and providing a reward → the
agent updates its strategy based on the experience gained, using an optimization algorithm such as
Q-value table updates or gradient descent. This cycle repeats until the agent learns to perform the task
optimally or achieves its goal in the environment. The agent works with the environment states, perceiving
them as input data for decision-making. At the same time, the ML process itself is guided by the agent's
internal parameters, which change with each interaction.</p>
      <p>Additionally, most RL methods (except for classical Q-learning) use optimization algorithms that
enhance learning stability and accelerate convergence. DQN uses a Replay Buffer to reduce correlations
between consecutive steps, while parameter updates are stabilized through the use of a Target Network.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Deep Q-Learning network method</title>
      <p>
        Deep Reinforcement Learning (DRL) is a group of ML methods that combine Deep Learning and Reinforcement Learning [
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ]. These
methods avoid the need to store a large Q-table (as in traditional Q-learning) by utilizing neural networks
to approximate the Q-value function. Instead of an explicit table entry for every state and action (which becomes
computationally inefficient for large state spaces), DQN uses a neural network that takes a state as input
and predicts Q-values for all possible actions. This approach enables the modeling and processing of
continuous or very large state spaces where classical Q-tables are impractical. DQN is a modification of
Q-learning that employs a deep neural network to approximate the Q-function. Examples of DRL methods
are DDPG (Deep Deterministic Policy Gradient), A2C/A3C (Advantage Actor-Critic), PPO (Proximal Policy
Optimization), and DQN (Deep Q-Network).
      </p>
      <p>The fundamental components of DQN are the Policy Q-Network, the Target Q-Network, the Replay
Buffer, and the action selection mechanism (ε-greedy strategy).</p>
      <p>The Q-value function Q(s, a) evaluates the expected total reward for selecting action a
in state s, considering future steps. In Q-learning it is updated according to the Bellman equation:
<disp-formula><tex-math>Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]</tex-math></disp-formula>
where α is the learning rate, γ is the discount factor, r is the reward, and s′, a′ are the next state and action.
DQN uses an analogue of this update.</p>
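      <p>A single tabular Q-learning update can be worked through in a few lines of Python. The state and action names and the numeric values are illustrative assumptions; the arithmetic follows the standard Q-learning rule.</p>
      <preformat>
```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) += alpha * (TD target - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)          # unseen (state, action) pairs start at 0
Q[("s1", "right")] = 2.0
value = q_update(Q, "s0", "right", r=1.0, s_next="s1", actions=["left", "right"])
print(round(value, 3))          # 0.1 * (1.0 + 0.9 * 2.0 - 0.0) = 0.28
```
      </preformat>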
      <p>Unlike Q-learning, where the function values are stored in a table, in DQN the function is approximated
by a neural network Q(s, a; θ) that takes the current state s (as a vector of attributes) as input and
transforms it into Q-values for each possible action a based on the network parameters θ.</p>
      <p>The Replay Buffer is a specialized memory space where past interactions between the agent and the
environment are stored (as opposed to classical Q-learning, which updates Q-values on the basis of the current
experience only).</p>
      <p>The Target Network Q′ is a separate copy of the policy network that is updated more slowly (it uses older
parameters <inline-formula><tex-math>\theta^{-}</tex-math></inline-formula>) to avoid instability during training. The target values are computed as the
sum of the reward r for action a in state s and the highest Q-value for the future state s′, discounted by
the factor γ. The target value Y for the loss function is calculated as follows:
<disp-formula><tex-math>Y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})</tex-math></disp-formula>
This approach helps to reduce training instability, smooths Q-value updates, and prevents abrupt changes
in the agent policy.</p>
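      <p>The target computation, together with the mean squared error that is commonly minimized against it, can be sketched in plain Python. The function names are hypothetical, and skipping the bootstrap term on terminal transitions is a common convention assumed here rather than something stated in the text.</p>
      <preformat>
```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Y = r + gamma * max over a' of Q(s', a'; theta^-)."""
    if done:
        # Assumed convention: no future reward after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


def mse_loss(predicted, targets):
    """Mean squared error between predicted Q(s, a; theta) and targets Y."""
    return sum((p - y) ** 2 for p, y in zip(predicted, targets)) / len(targets)


# next_q_values would come from the target network; here they are made-up numbers.
y1 = td_target(1.0, [0.5, 2.0], gamma=0.9)
y2 = td_target(-1.0, [0.0, 0.0], gamma=0.9, done=True)
loss = mse_loss([2.0, 0.0], [y1, y2])
print(y1, y2, loss)
```
      </preformat>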
      <p>The ε-greedy policy defines how the agent selects actions (similar to classical Q-learning): it selects a
random action (exploration) with probability ε and the best action according to the Q-values
(exploitation) with probability 1 − ε. The value of ε gradually decreases (ε-decay), which allows the agent to focus
more on exploration at the beginning and to optimize its behavior later.</p>
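      <p>The selection rule can be written as a short Python function. This is a generic sketch of ε-greedy selection over a list of Q-values; the function name and the example numbers are assumptions.</p>
      <preformat>
```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the argmax."""
    if rng.random() >= epsilon:
        # Exploitation: index of the largest Q-value.
        return max(range(len(q_values)), key=lambda i: q_values[i])
    # Exploration: uniformly random action index.
    return rng.randrange(len(q_values))


rng = random.Random(0)
picks = [epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.2, rng=rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # expected to be near 1 - 0.2 + 0.2 / 3
```
      </preformat>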
      <p>DQN learning is based on gradient descent that minimizes a loss function, typically the Mean
Squared Error (MSE) between the predicted Q-values and the target values from the Bellman
equation.</p>
      <p>The neural network update uses an optimizer (e.g., Adam, RMSprop) to update the weights of the model. Through this
deep learning process, the neural network gradually improves its estimates of the Q-values.</p>
      <p>DQN hyperparameters that influence its efficiency include:</p>
      <list list-type="bullet">
        <list-item><p>Discount factor γ determines the importance of future rewards;</p></list-item>
        <list-item><p>Learning rate α defines the speed at which the neural network updates its parameters;</p></list-item>
        <list-item><p>Exploration rate ε indicates how often the agent explores random actions;</p></list-item>
        <list-item><p>Size of the Replay Buffer specifies how many interactions the agent stores in memory;</p></list-item>
        <list-item><p>Batch size is the number of samples selected for a training step;</p></list-item>
        <list-item><p>Target Network update frequency determines how often the target network is synchronized.</p></list-item>
      </list>
      <p>Let us consider the Replay Buffer in more detail. It is a key mechanism in DQN that addresses the
problem of correlated data and stabilizes learning. The Replay Buffer enables the agent to effectively utilize
accumulated experiences, reducing randomness in Q-network updates and accelerating algorithm
convergence. It is a data structure that stores the agent's experiences during interactions with the
environment as a set of tuples B = {(s, a, r, s′)}, where:</p>
      <list list-type="bullet">
        <list-item><p>s is the state of the environment before the agent performs action a;</p></list-item>
        <list-item><p>a is the action chosen by the agent in state s;</p></list-item>
        <list-item><p>r is the reward received by the agent for performing action a in state s;</p></list-item>
        <list-item><p>s′ is the state of the environment after the agent performs action a in state s.</p></list-item>
      </list>
      <p>Experience in the form of transitions (s, a, r, s′) is stored in memory and selectively used to update the
model, which improves training stability. At each step, DQN randomly selects a mini-batch from the memory
B to train the policy neural network. The main advantages of the Replay Buffer are the ability to
reuse past data for learning, which reduces the correlation between consecutive states, and the efficient use of
experience through the selection of the most useful samples.</p>
      <p>The most common types of Replay Buffer are the Fixed-Size Replay Buffer (FIFO, First In, First Out) and
the Prioritized Experience Replay Buffer (PER). The classical implementation of DQN uses the FIFO
type, which is recommended for slowly changing environments and for cases where preserving recent experiences is
important. FIFO is advised for initial use because it is simple to implement. PER is recommended
for rapid learning and for cases where the agent needs to focus on key states. We have implemented
both types, which makes it possible to determine the optimal one for a specific case depending on the
task environment.</p>
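      <p>A fixed-size FIFO Replay Buffer with uniform mini-batch sampling can be sketched with the Python standard library. The class and method names are assumptions; the prioritized variant (PER) is not shown.</p>
      <preformat>
```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal FIFO replay buffer; the oldest transition is dropped when full."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal ordering.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.push(step, 0, 1.0, step + 1)
print(len(buffer), sorted(t[0] for t in buffer.memory))  # only the 3 newest remain
batch = buffer.sample(2)
```
      </preformat>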
    </sec>
    <sec id="sec-3">
      <title>3. The DQN neural network architecture</title>
      <p>We consider a neural network with a single hidden layer for DQN, implemented in the DQNAgent class as a
Multilayer Perceptron (MLP). This neural network takes an encoded state vector as input and
outputs Q-values for all possible actions. Its architecture consists of input, hidden and output layers.</p>
      <p>The input layer accepts the state, which is represented as a vector of attributes whose length equals
the number of possible state signals in the environment. The layer transforms this input into a
64-dimensional feature vector. The state is encoded using the one-hot method, where each bit indicates
whether the corresponding input signal is present in the current state.</p>
      <p>The hidden layer consists of 64 neurons and uses the ReLU activation function to introduce
nonlinearity.</p>
      <p>The output layer provides the Q-values for each possible action. It maps the 64 hidden neurons to a vector of
action size, which corresponds to the number of possible actions (available services). The output layer does not use
an activation function, as the output values are interpreted as Q-values for each action.</p>
      <p>The Adam optimizer is used to update the weights. The standard Mean
Squared Error (MSE) is employed as the loss function to minimize the squared differences between the
predicted and target Q-values.</p>
      <p>During training, we use the ε-greedy strategy to balance exploration and
exploitation. The formula for calculating ε is not predetermined; it is selected automatically by an
optimizer that evaluates the effectiveness of various approaches (linear, exponential, or adaptive decay)
and selects the most efficient one for the given environment.</p>
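      <p>Two of the decay strategies can be sketched as follows. The function names, default values and exact formulas are common textbook forms assumed for illustration; the adaptive variant is left out because its formula is not specified in the text.</p>
      <preformat>
```python
def linear_epsilon(episode, total_episodes, eps_start=1.0, eps_min=0.05):
    """Decrease epsilon linearly from eps_start to eps_min over the run."""
    fraction = min(episode / total_episodes, 1.0)
    return eps_start + fraction * (eps_min - eps_start)


def exponential_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.99):
    """Multiply epsilon by a constant decay rate each episode, floored at eps_min."""
    return max(eps_min, eps_start * decay ** episode)


for ep in (0, 50, 100):
    print(ep, round(linear_epsilon(ep, 100), 3), round(exponential_epsilon(ep), 3))
```
      </preformat>
      <p>Both schedules use the same three parameters (initial value, minimum value, decay rate or horizon), so an outer search can treat the choice of schedule as one more categorical parameter.</p>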
      <p>The Target Network is implemented as a separate copy of the main neural network, which is periodically
updated to stabilize learning. The update frequency is determined automatically on the basis of the
environment parameters and the task specifics.</p>
      <p>This simple neural network is sufficiently effective for solving the task of service composition. For
more complex tasks, the number of hidden layers can be increased to improve Q-value approximation.
This architecture enables the agent to learn optimal actions by predicting long-term rewards for each
possible choice, effectively approximating the Q-function values for RL tasks.</p>
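      <p>The forward pass of such a network can be sketched in pure Python. This is only an illustration of the layer shapes (state vector, 64 ReLU units, one linear output per action); the class name TinyQNetwork and the weight initialization are assumptions, and a practical implementation would normally rely on a deep learning framework, which the text does not name.</p>
      <preformat>
```python
import random


class TinyQNetwork:
    """Pure-Python forward pass: state -> 64 ReLU units -> one Q-value per action."""

    def __init__(self, state_size, action_size, hidden=64, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w1, self.b1 = init(hidden, state_size), [0.0] * hidden
        self.w2, self.b2 = init(action_size, hidden), [0.0] * action_size

    @staticmethod
    def _linear(weights, bias, x):
        return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

    def forward(self, state):
        hidden = [max(0.0, h) for h in self._linear(self.w1, self.b1, state)]  # ReLU
        return self._linear(self.w2, self.b2, hidden)  # no output activation


net = TinyQNetwork(state_size=5, action_size=3)
q_values = net.forward([1, 0, 1, 0, 0])  # binary state signals as input
print(len(q_values))
```
      </preformat>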
    </sec>
    <sec id="sec-4">
      <title>4. Using RL to construct the optimal personalized learning pathway</title>
      <p>
        The application of RL methods and the choice of their modifications depend significantly on the specifics
of the tasks they are used for and on the rules of the corresponding domains. Let us consider the task of
constructing a Personal Learning Pathway (PLP), which defines a personalized sequence for studying Learning Objects
(LO) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (this problem is defined in detail in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) that a student follows to acquire a set of
competencies for some Learning Course (LC). Each LO is considered as a service [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], where the inputs are the
competencies required for the student to perceive the information from this LO, the outputs are the
competencies the student acquires as a result of studying this LO, and QoS defines the quality
characteristics of the LO, such as study time, cost and rating. The PLP is constructed as a composite service
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
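      <p>The state of the agent <inline-formula><tex-math>s = \langle x_1, \ldots, x_n \rangle,\ s \in S</tex-math></inline-formula>, indicates whether the student has acquired specific LC
competencies, i.e., <inline-formula><tex-math>x_i = 1</tex-math></inline-formula> if the student has the corresponding competency. The initial state <inline-formula><tex-math>s_{init} \in S</tex-math></inline-formula>
represents the set of competencies that a particular student has before starting the LC study. Achieving the
target state <inline-formula><tex-math>s_{goal} \in S</tex-math></inline-formula> means acquiring the whole set of LC competencies.</p>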
      <p>The task is to construct a set of composite services that propose pathways for moving the agent from
the initial state to a target state from the subset <inline-formula><tex-math>S_{target}</tex-math></inline-formula>. Subsequently, the optimal pathway should be
selected according to the task specifics and the needs of the particular student (e.g., minimizing study
time or maximizing the ratings of the studied LOs).</p>
      <p>The features of this task are:</p>
      <list list-type="bullet">
        <list-item><p>A very large number of LOs whose outputs contain elements of <inline-formula><tex-math>s_{goal}</tex-math></inline-formula> are available;</p></list-item>
        <list-item><p>A student who has acquired a specific competency does not need to study it again. Thus,
pathway construction has to prioritize LO services that do not duplicate each other's outputs;</p></list-item>
        <list-item><p>Each student begins the LC study with an individual set of competencies, which also do not require
repeated study;</p></list-item>
        <list-item><p>The availability of LO services, their inputs, outputs, and QoS characteristics can change during
the study process;</p></list-item>
        <list-item><p>The set of competencies used as inputs and outputs for LO services is fixed before the start of the
learning process, and this set is relatively small.</p></list-item>
      </list>
      <p>In ML terms, we work with a dynamic environment (the properties of LO services can
change), and this environment has a large number of states (all possible configurations of available
characteristics of LO services).</p>
      <p>
        Previously, we examined the solution to the PLP construction task based on the Q-learning method and
identified the need for significant modifications of this algorithm according to the task requirements [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Overall, the obtained solutions are satisfactory, but we identify certain issues that are not inherent to
DQN.
      </p>
      <p>Classic Q-learning uses a Q(s, a) table that stores values for each state s and action a. However, for the PLP
construction task in real-world conditions, such a table becomes extremely large, making its use impractical.
DQN addresses this issue by replacing the Q-value table with a neural network that approximates the
Q(s, a) function. Thus, we decided to explore the feasibility of applying DQN to this practical task and
to determine the specific modifications of this algorithm that are necessary for efficient PLP construction.</p>
      <p>Additionally, while studying the operation of the DQN method, we encountered the need to set certain
rules (as in the previous Q-learning modification) that significantly improve the results (speed and
quality). Moreover, the task requires the inclusion of QoS in the reward calculation, taking into account
both their values and their semantics. These features made it necessary to modify DQN to meet the
task requirements. The proposed modified DQN can be applied to other tasks from various domains with
similar requirements.</p>
    </sec>
    <sec id="sec-5">
      <title>5. DQN modification</title>
      <p>In previous studies, we proposed a modification of the Q-learning method aimed at constructing the PLP
as a composite service. However, despite these improvements, Q-learning has several fundamental
limitations related to scalability and the handling of large dynamic structures. These limitations constrain its
effectiveness in complex environments with a high number of states and actions, where policy
generalization is required.</p>
      <p>
        Thus, we propose to use the DQN method, which provides better generalizability, enhanced efficiency in
dynamic environments, and stabilization of the learning process through advanced mechanisms such as
the experience Replay Buffer [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and the Target Network, both built around the neural network that approximates
the Q-function.
      </p>
      <p>DQN removes the following limitations of classical Q-learning:</p>
      <list list-type="bullet">
        <list-item><p>Eliminates dependency on the Q-table size, as the neural network approximates the Q-value
function, reducing memory requirements and enabling work with a large state space;</p></list-item>
        <list-item><p>Generalizes knowledge across similar states, ensuring adaptation to new conditions;</p></list-item>
        <list-item><p>Works more effectively in complex dynamic environments due to the ability of the model to learn
regularities. However, DQN still operates with a discrete set of actions, so it requires additional
modifications or alternative methods for continuous action spaces;</p></list-item>
        <list-item><p>Provides more efficient utilization of experience through the Replay Buffer, which stores and reuses
transitions to stabilize the learning process;</p></list-item>
        <list-item><p>Enhances the stability of learning through the use of the Target Network;</p></list-item>
        <list-item><p>Optimizes the learning process by using random samples from the Replay Buffer, improving efficiency
and the uniformity of model updates.</p></list-item>
      </list>
      <p>The DQN method offers important advantages compared to classical Q-learning, but it has certain
drawbacks, including issues with stability, scalability, adaptability and efficiency in complex and specific
environments. To address these, we propose the following modifications:</p>
      <list list-type="bullet">
        <list-item><p>incorporating action constraints and implementing mechanisms to detect deadlock states and
loops. Such rules help to avoid undesirable situations and improve
decision-making efficiency;</p></list-item>
        <list-item><p>a more accurate state space representation, which contributes to better alignment with the modeled
environment, because the lack of structured state space modeling in DQN can lead to suboptimal
solutions;</p></list-item>
        <list-item><p>automated selection of hyperparameters, instead of manual tuning of parameters such as the learning
rate or discount factor, to optimize model performance;</p></list-item>
        <list-item><p>an adaptive reward function, enabled through the normalization of QoS metrics, the use of global
weights to account for the importance of each parameter, and the ability to customize the
optimality criteria (maximization or minimization). It improves model flexibility and efficiency,
stabilizes the process and enhances learning outcomes;</p></list-item>
        <list-item><p>alternative state transition modes: an accumulative mode, which allows the agent to retain and
expand its knowledge, is added to the traditional DQN mode, in which the agent fully updates its knowledge
during each transition. The ability to switch between these modes ensures the algorithm's
adaptability to tasks with varying requirements and improves training stability;</p></list-item>
        <list-item><p>automated determination of key parameters, such as the buffer type, memory size, number of episodes,
number of steps without progress, and the ε-function, which ensures better adaptation to diverse
environments;</p></list-item>
        <list-item><p>support for different learning modes: adding modes with fixed and random initial states makes
the learning process more universal and effective across a wide range of environments;</p></list-item>
        <list-item><p>Multi-Goal Learning, which enables efficient operation in environments with complex
performance metrics (base DQN does not support learning with multiple objectives);</p></list-item>
        <list-item><p>retraining during exploitation, which allows adaptation to changes in the environment.</p></list-item>
      </list>
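      <p>The adaptive reward idea (normalized QoS metrics, global weights, and a per-metric choice between maximization and minimization) can be sketched as follows. Min-max normalization, the weighted average, and all names and numbers are assumptions chosen for illustration; the text does not give the exact formula.</p>
      <preformat>
```python
def normalize(value, low, high):
    """Min-max normalization to [0, 1]; a constant range maps to 0."""
    return 0.0 if high == low else (value - low) / (high - low)


def qos_reward(qos, bounds, weights, maximize):
    """Weighted average of normalized QoS values; minimized criteria are inverted."""
    total = 0.0
    for name, weight in weights.items():
        score = normalize(qos[name], *bounds[name])
        if not maximize[name]:
            score = 1.0 - score  # a smaller raw value should give a larger reward
        total += weight * score
    return total / sum(weights.values())


bounds = {"time": (1.0, 10.0), "rating": (1.0, 5.0)}   # hypothetical QoS ranges
weights = {"time": 2.0, "rating": 1.0}                 # hypothetical global weights
maximize = {"time": False, "rating": True}
best = qos_reward({"time": 1.0, "rating": 5.0}, bounds, weights, maximize)
worse = qos_reward({"time": 10.0, "rating": 3.0}, bounds, weights, maximize)
print(round(best, 3), round(worse, 3))
```
      </preformat>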
      <p>When the agent selects an action, we propose to apply task-specific optimization rules that improve the
efficiency and stability of learning:</p>
      <list list-type="bullet">
        <list-item><p>Limit the number of steps without progress. If the selected available action does not lead to progress, a counter of
non-progressive steps increases. In the accumulating mode, progress is defined as adding new
signals to the agent's state. In the non-accumulating mode, progress is defined as the agent's
transition to a new state different from the previous one. If the counter exceeds a set threshold,
the agent terminates the episode or changes its strategy;</p></list-item>
        <list-item><p>Check the action benefit. If the available action does not provide new information, it is discarded,
and another action is selected. In the accumulating mode, an action is considered useful if it adds
new signals (expands the agent's knowledge). In the non-accumulating mode, an action is
considered useful if it changes the agent's state;</p></list-item>
        <list-item><p>Avoid repeated actions. The agent avoids repeating actions already performed within the current
episode. Unlike standard DQN, where the agent can repeat the same actions without restrictions, in
this case such actions are deemed unnecessary. They are discarded, and an alternative action is
selected. This optimization reduces the number of operations and speeds up the learning process;</p></list-item>
        <list-item><p>Handle deadlock states. If the agent has no available action, it performs forced exploration by
selecting a random action. If this situation repeats and the non-progressive step counter exceeds
the threshold, the episode ends with a penalty;</p></list-item>
        <list-item><p>Check the action feasibility. The agent checks the availability of each action before executing it. If
the environment has changed, certain services may have become unavailable or their input parameters may have
changed. In such cases, the agent searches for an alternative action.</p></list-item>
      </list>
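      <p>Several of these rules can be combined into a single filtering step that runs before the greedy choice. The sketch below assumes the accumulating mode, represents each service as a pair of input and output signal sets, and uses hypothetical names throughout; the non-progress counter and penalties are left out for brevity.</p>
      <preformat>
```python
import random


def select_action(known, services, used, q_values, rng=random):
    """Apply feasibility, benefit and no-repeat rules before the greedy choice.

    services maps a name to (inputs, outputs), both sets of signals.
    Returns (action, forced): forced is True when no action passed the rules.
    """
    candidates = [
        name for name, (inputs, outputs) in services.items()
        if inputs.issubset(known)          # feasibility: all inputs are known
        and not outputs.issubset(known)    # benefit: adds at least one new signal
        and name not in used               # no repeats within the episode
    ]
    if not candidates:
        # Deadlock handling: forced exploration with a random action.
        return rng.choice(sorted(services)), True
    return max(candidates, key=lambda name: q_values.get(name, 0.0)), False


services = {
    "w1": ({"a"}, {"b"}),
    "w2": ({"b"}, {"c"}),
    "w3": ({"a"}, {"a"}),   # adds nothing new, so it is filtered out
}
print(select_action({"a"}, services, used=set(), q_values={"w1": 0.2, "w3": 0.9}))
```
      </preformat>
      <p>In the example, w3 has the highest Q-value but is rejected by the benefit rule, so the only remaining candidate w1 is chosen.</p>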
      <p>We use two learning modes: one with a fixed initial state and another with a random initial state. In the
fixed initial state mode, the agent starts learning with a predefined set of initial and target states. The
main goal of this approach is to achieve a stable solution for a specific scenario. A fixed state allows for
quicker evaluation of the model efficiency and makes it possible to verify policy convergence. However,
this mode has limitations, as the model can become overly specific to one scenario and generalize poorly to
other conditions.</p>
      <p>The random initial state mode enhances the adaptability of the model, enabling the agent to work with
various sets of initial states. This approach ensures policy generalization across a wide range of
environment variations. However, due to the variability of rewards and states, the learning curve can be
unstable, complicating the assessment of the agent’s progress. Nevertheless, this mode allows the agent to
operate under changing environment parameters.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Tuning parameters and the learning strategy function of the Modified DQN</title>
      <p>The efficiency of the modified DQN algorithm relies heavily on fine-tuning of the learning parameters. Manual
selection of these parameters is often labor-intensive and can lead to suboptimal results, especially in
complex environments with a large number of states and actions. Automating this process significantly
simplifies the task and ensures more accurate value selection. Dynamic parameter adjustment during
the learning process allows the algorithm to adapt to changes in the environment, improving its effectiveness.</p>
      <p>We propose automating the determination of the required number of episodes based on the number of
available actions in the environment. In addition to optimizing the main DQN hyperparameters, such as the
learning rate, the discount factor, and the strategy for calculating the ε-function (linear, exponential,
or adaptive) with predefined parameters (initial value, minimum value, decay rate), our modification
requires fine-tuning the following parameters:</p>
      <list list-type="bullet">
        <list-item>
          <p>Type of Replay Buffer (FIFO or Prioritized) for optimal memory and experience management;</p>
        </list-item>
        <list-item>
          <p>Batch size: the number of data samples processed by the agent simultaneously during a single
learning step;</p>
        </list-item>
        <list-item>
          <p>Replay Buffer size: the capacity of the Replay Buffer to store experiences, i.e., the number of
records the agent can hold to accumulate transitions;</p>
        </list-item>
        <list-item>
          <p>Maximum number of steps without progress: a threshold for detecting looping or deadlock
situations;</p>
        </list-item>
        <list-item>
          <p>Target network update frequency: the update interval used to stabilize the learning process and reduce
correlation between updates.</p>
        </list-item>
      </list>
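      <p>The three ε-function strategies named above can be sketched as follows. The exact formulas used in HR-QoS-DQN are not given in this section, so the schedules below are common textbook forms, and the adaptive rule in particular is a hypothetical illustration.</p>
      <preformat><![CDATA[
```python
import math


def epsilon_linear(episode, eps_start, eps_min, decay_rate):
    """Decrease epsilon by a constant amount per episode, down to eps_min."""
    return max(eps_min, eps_start - decay_rate * episode)


def epsilon_exponential(episode, eps_start, eps_min, decay_rate):
    """Decay epsilon exponentially from eps_start towards eps_min."""
    return eps_min + (eps_start - eps_min) * math.exp(-decay_rate * episode)


def epsilon_adaptive(current_eps, eps_min, decay_rate, improved):
    """Hypothetical adaptive rule: decay only when the recent reward improved."""
    if improved:
        return max(eps_min, current_eps * (1.0 - decay_rate))
    return current_eps
```
]]></preformat>
      <p>Each function takes the predefined parameters listed above (initial value, minimum value, decay rate), so the strategy itself can be treated as one more categorical parameter during tuning.</p>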
      <p>Automated tuning of these parameters enhances the model efficiency and adaptability to complex
environments. To automate parameter selection, we use the Optuna library, which enables efficient
search over the parameter space based on Bayesian optimization methods. The search
for optimal parameters is carried out in the mode with random initial states.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Agent Learning and Exploitation Algorithm</title>
      <p>The agent learning and exploitation algorithm consists of three main stages.</p>
      <p>Stage 1. Optimization of parameters using Optuna, where Bayesian optimization is employed for
efficient parameter search. Based on the input data (the spaces of environment states and actions, a set
of hyperparameters to search over with their respective ranges or value sets, and ε-function strategies), the
search function returns the agent's average reward over the last episodes and determines the best parameter
set for the agent. Multiple workers can be used for parallel acceleration. This stage uses the random initial state
mode.</p>
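      <p>A minimal sketch of how Stage 1 could be expressed with Optuna is given below. The parameter names, ranges and the train_and_evaluate callback are assumptions for illustration; only the Optuna calls themselves (create_study, optimize, suggest_float, suggest_int, suggest_categorical) are part of the library.</p>
      <preformat><![CDATA[
```python
def objective(trial, train_and_evaluate):
    """Sample one parameter set and return the score to be maximized."""
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "gamma": trial.suggest_float("gamma", 0.90, 0.999),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128]),
        "buffer_type": trial.suggest_categorical("buffer_type", ["fifo", "per"]),
        "buffer_size": trial.suggest_int("buffer_size", 10_000, 100_000, step=10_000),
        "target_update": trial.suggest_int("target_update", 100, 2_000, step=100),
        "eps_strategy": trial.suggest_categorical(
            "eps_strategy", ["linear", "exponential", "adaptive"]
        ),
    }
    # train_and_evaluate is a hypothetical user-supplied function: it trains the
    # agent in random initial state mode and returns the average reward over
    # the last episodes.
    return train_and_evaluate(params)


def run_search(train_and_evaluate, n_trials=50, n_jobs=1):
    """Run the search and return the best parameter set and its score."""
    # Imported lazily so that `objective` can be exercised without Optuna.
    import optuna

    study = optuna.create_study(direction="maximize")
    study.optimize(
        lambda trial: objective(trial, train_and_evaluate),
        n_trials=n_trials,
        n_jobs=n_jobs,  # more than one worker runs trials in parallel
    )
    return study.best_params, study.best_value
```
]]></preformat>
      <p>Treating the buffer type and the ε-function strategy as categorical parameters lets the same search select strategies as well as numeric values.</p>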
      <p>Stage 2. Learning of the agent with a random initial state using the optimal parameters obtained in Stage
1. Upon completion of the learning process, the agent model is saved in a format that includes
hyperparameters and settings for exploitation, model and target network weights, and the memory buffer that
contains transition experiences.</p>
      <p>Stage 3. Exploitation with Fine-Tuning mode. After loading the saved agent model, the following steps
are performed:</p>
      <list list-type="bullet">
        <list-item>
          <p>Initialization of input data: setting the initial and target states;</p>
        </list-item>
        <list-item>
          <p>Learning with Fine-Tuning mode, where the agent adapts its policy to new environmental
conditions through weight optimization;</p>
        </list-item>
        <list-item>
          <p>Optimal flow search, performed using parameter tuning to improve the flows found by the
agent.</p>
        </list-item>
      </list>
      <p>This approach allows for effective agent learning in a large environment, ensuring model adaptability
and policy generalization. The use of automated parameter search enables optimization not only of
hyperparameters but also of strategies for balancing exploration and exploitation. As a result, the agent
achieves high accuracy in exploitation mode.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Hybrid Rule-based QoS-Optimized DQN properties</title>
      <p>DQN modified according to the elements described above differs from the classical approach and expands
its capabilities. The proposed Hybrid Rule-based QoS-Optimized DQN (HR-QoS-DQN) method combines a
rule-based action selection approach with Quality of Service (QoS) optimization. The algorithm not only
selects actions with the highest Q-values but also integrates additional QoS parameters and a set of rules to
construct a PLP, where the final state is represented as a flexible set of knowledge. This enables the agent
to effectively adapt its policy to a dynamic environment, ensuring high decision-making stability.</p>
      <p>1. The universality of the knowledge representation format. The vector of binary signals, which describes
the environment state, can easily be customized for various types of tasks and environments, including
environments with high semantic complexity. This format ensures scalability, as it allows the addition of
new signal features without altering the core structure of the model, facilitating its application in
large-scale systems. Unlike classical methods focused on coordinates or gradients, the format proposed in HR-QoS-DQN
minimizes reliance on complex specific data such as images or state vectors, ensuring universality
for diverse types of data. For example, in the task of PLP construction, the state is represented as a binary
value vector, where each feature indicates the presence (value equal to 1) or absence (value equal to 0)
of some competency.</p>
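      <p>The binary state representation can be sketched as follows. The competency names are hypothetical placeholders, and encode_state is an illustrative helper rather than a function from the described implementation.</p>
      <preformat><![CDATA[
```python
def encode_state(competencies, signal_index):
    """Map a set of competency names to a binary vector (1 = present)."""
    state = [0] * len(signal_index)
    for name in competencies:
        state[signal_index[name]] = 1
    return state


# Hypothetical signal vocabulary; a real environment lists all of its signals.
signals = ["python_basics", "linear_algebra", "statistics", "neural_networks"]
signal_index = {name: i for i, name in enumerate(signals)}

state = encode_state({"python_basics", "statistics"}, signal_index)
```
]]></preformat>
      <p>Adding a new signal only extends the vocabulary and the vector length, which is the scalability property described above.</p>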
      <p>2. The flexible reward function. Unlike classical DQN, which typically uses fixed rewards (e.g., +1 for a
correct action, -1 for a mistake), HR-QoS-DQN also utilizes QoS parameters, their global importance
weights, and optimization functions: cost and time are minimized, while reliability is maximized. Thus,
QoS parameters influence learning by altering the agent's strategic decisions, which is a significant departure
from classical DQN.</p>
      <p>Such a reward function can be customized to meet various requirements. Additionally, we incorporate
the ability to set aggregate functions for calculating QoS parameters for the resulting flow. For example, in
the task of PLP construction, cost and time are summed, while rating is determined as the arithmetic mean.</p>
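      <p>A minimal sketch of a QoS-weighted score and of the flow aggregation follows. The weights, the bounds and the min-max normalization are assumptions for illustration; the section only states that cost and time are minimized, that a quality parameter is maximized, and that cost and time are summed while rating is averaged.</p>
      <preformat><![CDATA[
```python
from statistics import mean

# Hypothetical global importance weights of the QoS parameters.
WEIGHTS = {"cost": 0.4, "time": 0.3, "rating": 0.3}
# Optimization direction: True means the parameter is minimized.
MINIMIZE = {"cost": True, "time": True, "rating": False}
# Aggregate functions applied over all services in a resulting flow.
AGGREGATE = {"cost": sum, "time": sum, "rating": mean}


def qos_score(qos, bounds):
    """Weighted QoS utility in [0, 1] after min-max normalization."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        low, high = bounds[name]
        norm = (qos[name] - low) / (high - low)
        if MINIMIZE[name]:
            norm = 1.0 - norm  # lower raw value gives a higher utility
        score += weight * norm
    return score


def aggregate_flow(services):
    """Combine the QoS values of the services that make up one flow."""
    return {name: fn([s[name] for s in services]) for name, fn in AGGREGATE.items()}


flow = aggregate_flow([
    {"cost": 10, "time": 2, "rating": 4.0},
    {"cost": 30, "time": 3, "rating": 5.0},
])
```
]]></preformat>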
      <p>3. The alternative modes of the state transition. We support two modes of state
update: a mode without accumulation, which imitates the traditional DQN learning system, and a mode with
accumulation, which enables extended functionality. The accumulating mode can expand the
applicability of the proposed method.</p>
      <p>In classical DQN, during each action the state is updated by replacing the current state with the one
returned by the environment after executing the action, and this approach does not allow for accumulation
or retention of parts of the previous state. We implemented an accumulating mode in HR-QoS-DQN,
where the agent retains and accumulates all obtained signals for future use. While this approach
complicates the model, it provides a more realistic representation of the knowledge acquisition process and
in some cases simplifies the format of the environment presentation and makes it more understandable for
humans.</p>
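      <p>The difference between the two state update modes can be shown in a few lines. This is a sketch under the assumption that both the state and the signals produced by an action are binary vectors of equal length; next_state is an illustrative name.</p>
      <preformat><![CDATA[
```python
def next_state(current, produced, accumulate):
    """Combine the current binary state with the signals an action produces."""
    if accumulate:
        # Accumulating mode: keep every signal obtained so far (element-wise OR).
        return [c | p for c, p in zip(current, produced)]
    # Classical mode: the new state simply replaces the previous one.
    return list(produced)
```
]]></preformat>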
      <p>4. The rules preventing loops and deadlock states. We propose adding the following
checks to HR-QoS-DQN: a limit on steps without progress to prevent the agent from continuously taking ineffective actions,
a benefit check for actions to ensure they contribute new information or progress, avoidance of repeated
actions to eliminate redundant steps, and handling of deadlock states by implementing emergency exit
mechanisms when no progress can be made. The proposed rules accelerate learning by avoiding
unnecessary actions, whereas classical DQN does not check for such situations.</p>
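      <p>The four checks could be organized as in the sketch below. The class ProgressGuard and its method names are hypothetical; the sketch assumes binary state vectors and treats an action as beneficial when it produces at least one signal that is not yet present.</p>
      <preformat><![CDATA[
```python
class ProgressGuard:
    """Episode-level bookkeeping for the loop and deadlock rules."""

    def __init__(self, max_steps_without_progress):
        self.limit = max_steps_without_progress
        self.stalled = 0          # consecutive steps without progress
        self.used_actions = set() # actions already taken in this episode

    def is_beneficial(self, state, produced):
        """Benefit check: the action must add at least one new signal."""
        return any(p and not s for s, p in zip(state, produced))

    def allow(self, action, state, produced):
        """Reject repeated actions and actions that add nothing new."""
        return action not in self.used_actions and self.is_beneficial(state, produced)

    def record(self, action, made_progress):
        """Update the counters after an action has been executed."""
        self.used_actions.add(action)
        self.stalled = 0 if made_progress else self.stalled + 1

    def should_abort(self, any_action_allowed):
        """Emergency exit: deadlock, or too many steps without progress."""
        return (not any_action_allowed) or self.stalled >= self.limit
```
]]></preformat>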
      <p>5. Different types of Replay Buffer. HR-QoS-DQN supports two types, FIFO and PER, and determines
the optimal type for a specific environment in advance, during the parameter optimization stage
(classical DQN typically uses FIFO, where all experiences have equal importance). We set the buffer type
as a parameter of the agent's learning function, adding flexibility for applying the method to different
tasks. This modification also makes it possible to add further experience buffer types without additional
complexity.</p>
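      <p>A minimal sketch of a buffer selected by a parameter is shown below. The prioritized variant is a deliberately simplified proportional scheme without importance-sampling weights; the class names, the default alpha value and the make_buffer factory are assumptions for illustration.</p>
      <preformat><![CDATA[
```python
import random
from collections import deque


class FifoBuffer:
    """Uniform replay: every stored transition is equally likely to be drawn."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # the oldest transitions drop out first

    def add(self, transition, priority=None):
        self.data.append(transition)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.data), batch_size)


class PrioritizedBuffer:
    """Simplified proportional prioritization (sampling with replacement)."""

    def __init__(self, capacity, alpha=0.6):
        self.data = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.alpha = alpha

    def add(self, transition, priority=1.0):
        self.data.append(transition)
        # A small constant keeps zero-error transitions sampleable.
        self.priorities.append(abs(priority) + 1e-6)

    def sample(self, batch_size, rng=random):
        weights = [p ** self.alpha for p in self.priorities]
        return rng.choices(list(self.data), weights=weights, k=batch_size)


def make_buffer(kind, capacity):
    """Create the buffer type chosen during parameter optimization."""
    return {"fifo": FifoBuffer, "per": PrioritizedBuffer}[kind](capacity)
```
]]></preformat>
      <p>Because both classes expose the same add and sample methods, a further buffer type only needs a new entry in the factory.</p>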
      <p>6. The adaptive selection of the agent actions. As opposed to classical DQN, where actions are selected
solely based on Q-values, in HR-QoS-DQN the agent additionally verifies the feasibility of the chosen
action according to predefined rules. These rules eliminate repeated actions and prevent unnecessary
transitions, improving efficiency and decision-making.</p>
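      <p>Rule-verified action selection can be sketched as an ε-greedy choice restricted to feasible actions. The feasibility flags are assumed to come from the predefined rules; select_action is an illustrative name, not the implementation's API.</p>
      <preformat><![CDATA[
```python
import random


def select_action(q_values, feasible, epsilon, rng=random):
    """Epsilon-greedy choice restricted to actions that pass the rule checks."""
    allowed = [a for a, ok in enumerate(feasible) if ok]
    if not allowed:
        return None  # no feasible action: the caller handles the deadlock
    if rng.random() < epsilon:
        return rng.choice(allowed)  # explore among feasible actions only
    return max(allowed, key=lambda a: q_values[a])  # best feasible Q-value
```
]]></preformat>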
      <p>7. The modified learning process differs from classical DQN, where the agent is trained on
predefined initial states. In HR-QoS-DQN, two learning modes are provided: fixed initial state and random
initial state, with the additional option for explicit specification.</p>
      <p>In the first stage, the Optuna library is used for automated selection of the optimal key parameters with
random initial states. The agent is then trained in an environment with a random initial state, which promotes
model generalization. However, this approach requires an increased number of episodes to achieve stable
policies due to the agent's interaction with a wider range of initial conditions.</p>
      <p>After learning, the model is saved to a file for further use. During the exploitation phase, the model
is loaded from the file, and fine-tuning is carried out in the fixed initial state mode. Fine-tuning
stabilizes the policy and searches for optimal parameters through additional adjustments. This approach
ensures model adaptability, improves generalization, and enhances performance under operational
conditions.</p>
      <p>8. The dynamic set of target states (Multi-Goal Learning) is used in HR-QoS-DQN for target state
formation, as opposed to a single fixed target state in classical DQN.</p>
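      <p>One possible reading of a dynamic set of target states is sketched below: a target is treated as a set of required signals, and the episode ends when any target in the current set is covered by the state. This interpretation and both function names are assumptions.</p>
      <preformat><![CDATA[
```python
def goal_reached(state, target):
    """True when every signal required by the target is present in the state."""
    return all(s >= t for s, t in zip(state, target))


def any_goal_reached(state, targets):
    """Multi-goal check over the current (possibly changing) set of targets."""
    return any(goal_reached(state, t) for t in targets)
```
]]></preformat>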
      <p>9. The automated selection of optimal parameters and strategies for subsequent use in the learning and
exploitation processes of HR-QoS-DQN.</p>
      <p>10. The integration of learning and exploitation with fine-tuning support differs from classical DQN,
where the learning and exploitation processes are separated. In HR-QoS-DQN, a flexible three-stage
algorithm is implemented. The first stage involves selecting optimal parameters and functions. In the
second stage, the agent is trained in an environment with various initial states, promoting policy
generalization. The third stage operates the model with the possibility of fine-tuning, enabling
adaptation to gradual changes in the environment, such as changes in service availability, changes in QoS
parameter values and graph modifications (altering the input and output parameters of services).</p>
      <p>11. The continuous learning of the agent. After learning is complete, HR-QoS-DQN saves
not only the weights of the Q-network (as in classical DQN) but also additional elements such as
the memory buffer and other configuration parameters. This approach simplifies the relearning process
and supports the algorithm adaptability. It significantly enhances the agent's flexibility and stability,
allowing it to function effectively under gradual changes in environmental parameters and data structures.</p>
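      <p>The idea of saving more than the Q-network weights can be sketched with the standard library alone. The checkpoint layout and the function names are assumptions; in a PyTorch implementation the weights would typically be state_dict objects written with torch.save, while plain pickle is used here only to keep the sketch dependency-free.</p>
      <preformat><![CDATA[
```python
import os
import pickle
import tempfile


def save_agent(path, hyperparams, q_weights, target_weights, buffer):
    """Store everything needed to resume learning, not only the Q-network."""
    checkpoint = {
        "hyperparams": hyperparams,
        "q_network": q_weights,
        "target_network": target_weights,
        "replay_buffer": list(buffer),  # transitions collected so far
    }
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)


def load_agent(path):
    """Load a checkpoint written by save_agent."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "agent.pkl")
    # Placeholder values; a transition is (state, action, reward, next_state, done).
    save_agent(path, {"gamma": 0.99}, [0.1, 0.2], [0.1, 0.2],
               [((0, 1), 3, 1.0, (1, 1), False)])
    restored = load_agent(path)
```
]]></preformat>
      <p>Restoring the buffer together with the weights lets fine-tuning continue from earlier experience instead of starting with an empty memory.</p>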
    </sec>
    <sec id="sec-9">
      <title>9. Software Implementation of HR-QoS-DQN</title>
      <p>We developed an implementation of the modified DQN algorithm HR-QoS-DQN within an environment
that simulates the execution of service sequences to achieve a target state, with the environment
changing dynamically in our modification. The project is implemented in Python, one of the most
popular programming languages in the fields of AI and ML. Python offers an extensive set of libraries and
frameworks that significantly simplify the development and deployment of deep learning algorithms. The
use of Python in this project ensured the efficient implementation of the DQN algorithm along with the
modifications for PLP construction and facilitated the development, optimization, and scaling of
the solution, as well as its adaptation to environmental changes.</p>
      <p>In this work, we use several Python libraries to enable the effective DQN implementation,
along with automated hyperparameter tuning and the formula for strategy calculation: NumPy (Numerical
Python), a library for working with multidimensional arrays and performing numerical operations;
PyTorch, one of the most popular libraries for implementing deep learning; Optuna, a
library for automated parameter tuning; and collections (deque), a part of Python's standard
library that provides double-ended queues, used to implement the
Replay Buffer as a FIFO queue that stores the recent experiences of the agent and supports neural
network learning on buffer samples. These libraries enable the implementation of the agent learning in a
dynamic environment of services. NumPy ensures fast numerical computations, PyTorch implements the
neural network and gradient descent, deque optimizes memory management for storing experiences, and
Matplotlib is used for analyzing learning results. This combination of libraries makes the system efficient
and scalable.</p>
      <p>We test the HR-QoS-DQN algorithm in an environment with 1029 services and 126 signals that
generate 526 states. The printout fragment in Figure 1 describes the execution of the algorithm in this
environment (number of states and services) with selected initial and goal states, and displays one of the resulting
optimal paths with its final QoS values, the calculation time (optimal path search time) and the automatically
determined optimal learning parameters.</p>
      <p>The learning stage of HR-QoS-DQN with a random initial state is characterized by the graphs (Figure 2)
of the learning curve defined by the QoS-based reward (Figure 2, left) and the TD error (Figure 2, right). The
learning curve graph shows the cumulative reward an agent receives for completing tasks. The learning
curve is unstable due to the variability of rewards and states, which complicates the assessment of the agent's
progress, but the graph demonstrates that the goal is reached after 9999 episodes.</p>
      <p>The TD-error graph displays the average value of the Temporal Difference error (TD-error), i.e., the difference
between the predicted Q-value and the adjusted value from the Bellman equation, which is a key indicator
of the accuracy of the agent's policy during training. The TD-error curve shows large values for
episodes 2000–4000 but registers learning progress after 6200 episodes. The decrease in TD-error over time
demonstrates that the model successfully adapts to changes in the environment and becomes increasingly
efficient in decision-making. Error spikes are expected and even beneficial, as they may indicate that the
agent encounters new scenarios and refines its policy.</p>
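      <p>For a single transition, the TD error can be computed as in the sketch below. The sign convention (Bellman target minus prediction) and the argument names are assumptions; a plot of the average error would normally use its absolute or squared value.</p>
      <preformat><![CDATA[
```python
def td_error(q_pred, reward, next_q_values, gamma, done):
    """TD error: Bellman target minus the predicted Q-value."""
    # No bootstrapping from the next state once the episode has terminated.
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward + bootstrap - q_pred
```
]]></preformat>
      <p>Here next_q_values stands for the target network outputs for the next state, q_pred for the online network output for the executed action, and gamma for the discount factor.</p>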
      <p>Learning in the random initial state mode demonstrates effectiveness in creating a universal model that
can adapt to dynamic changes in the environment. Fluctuations at the beginning are natural and indicate
an active exploration process. The gradual increase in reward and achievement of the target indicator,
together with the gradual decrease and stabilization of TD error, confirm that the agent is successfully
mastering the policy and is capable of generalization. The graphs confirm that this learning strategy
increases the flexibility and adaptability of the algorithm, which is critically important for complex
environments.</p>
    </sec>
    <sec id="sec-10">
      <title>10. Conclusion</title>
      <p>The study analyzes the capabilities of reinforcement learning methods and their components. We consider
the Deep Q-Network (DQN) method and its distinctions from Q-learning. Based on this analysis, the
challenges that necessitate modifications to these methods are identified. These challenges involve
practical applications such as adaptation to complex and dynamic environments with changing service
parameters, consideration of multiple goals and diverse success metrics, adaptation of the transition function
and automation of optimal learning parameter selection.</p>
      <p>The proposed solutions ensure flexibility, universality and enhanced efficiency of the method across various
domains.</p>
      <p>We propose the Hybrid Rule-based QoS-Optimized modification of the Deep Q-learning network method,
HR-QoS-DQN, developed to facilitate automatic determination of key learning parameters and to incorporate
mechanisms for continuous learning. This method enables the agent to adapt to environment changes,
such as the removal of services or modification of their parameters, without requiring complete relearning.
The algorithm is implemented in Python and tested on the practical task of constructing personalized
learning pathways used in educational process management. Experimental results indicate that
HR-QoS-DQN delivers better adaptability to environmental changes compared to Q-learning results obtained
from the same datasets. A significant advantage of HR-QoS-DQN is its ease of adaptation to diverse
application domains: it requires adjustments to learning parameters, policies, and domain-specific rules,
but does not necessitate reprogramming the entire algorithm.</p>
      <p>These modifications aim to adapt the DQN algorithm to tasks related to constructing
the personalized learning pathways used in the educational process.</p>
      <p>During the preparation of this work, the authors used AI programs Chat GPT 4o and Copilot for
grammar and spelling check. After using these services, the authors reviewed and edited the
content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <source>Reinforcement Learning: An Introduction</source>
          , The MIT Press: Cambridge, Massachusetts,
          <year>2018</year>
          . https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Uc-Cetina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moo-Mena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hernandez-Ucan</surname>
          </string-name>
          ,
          <article-title>Composition of web services using Markov decision processes and dynamic programming</article-title>
          .
          <source>In: The Scientific World Journal, (1)</source>
          , (
          <year>2015</year>
          ):
          <fpage>545308</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Watkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dayan</surname>
          </string-name>
          ,
          <article-title>Q-learning</article-title>
          .
          <source>In: Machine learning</source>
          ,
          <volume>8</volume>
          (
          <issue>3-4</issue>
          ), pp.
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          , (
          <year>1992</year>
          ): 10.1007/BF00992698, https://www.researchgate.net/publication/220344150_Technical_Note_QLearning
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Harerimana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Q-learning algorithms: A comprehensive classification and applications</article-title>
          . In: IEEE Access, (
          <year>2019</year>
          ): 10.1109/ACCESS.2019.2941229, https://www.researchgate.net/publication/335805245_QLearning_Algorithms_A_Comprehensive_Classification_and_Applications.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <article-title>Playing Atari with Deep Reinforcement Learning</article-title>
          ,
          <source>Deep Mind Technologies</source>
          , (
          <year>2013</year>
          ) arXiv preprint arXiv:1312.5602, https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          , et al.,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          ,
          <volume>518</volume>
          (
          <issue>7540</issue>
          ), pp.
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          , (
          <year>2015</year>
          ): 10.1038/nature14236.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>A Theoretical Analysis of Deep Q-Learning</article-title>
          ,
          <source>Proceedings of Machine Learning Research</source>
          vol
          <volume>120</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          ,
          <source>2nd Annual Conference on Learning for Dynamics and Control</source>
          , (
          <year>2020</year>
          ): 10.48550/arXiv.1901.00137, https://arxiv.org/pdf/1901.00137.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rogushina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gladun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Anishchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pryima</surname>
          </string-name>
          ,
          <article-title>Semantic Support of Personal Learning Trajectory Development</article-title>
          , In: UkrPROG 2024
          <source>International Scientific and Practical Programming Conference, CEUR Workshop Proceedings</source>
          (
          <year>2024</year>
          ), CEUR Vol-
          <volume>3806</volume>
          , (
          <year>2024</year>
          ):
          <fpage>487</fpage>
          -
          <lpage>505</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Gryshanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rogushina</surname>
          </string-name>
          ,
          <article-title>Planning Sequences of Learning Object Services Based on Reinforcement Machine Learning</article-title>
          . In: Materials of the XXIV
          <source>International Scientific and Practical Conference ITB2024</source>
          . (
          <year>2024</year>
          ):
          <fpage>74</fpage>
          -
          <lpage>77</lpage>
          https://drive.google.com/file/d/1wsLVnueNRd62g-k179IHihuJNOYZqR9G/view
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xuan</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiang</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bouguettaya</surname>
          </string-name>
          ,
          <article-title>Adaptive Service Composition Based on Reinforcement Learning</article-title>
          , In: Maglio, P.P., Weske, M., Yang, J., Fantinato, M. (eds) Service-Oriented Computing.
          <source>ICSOC 2010. Lecture Notes in Computer Science</source>
          , vol
          <volume>6470</volume>
          , pp
          <fpage>92</fpage>
          -
          <lpage>107</lpage>
          , (
          <year>2010</year>
          ). Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17358-5_7.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Uc-Cetina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moo-Mena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hernandez-Ucan</surname>
          </string-name>
          ,
          <article-title>Composition of web services using Markov decision processes and dynamic programming</article-title>
          .
          <source>In: The Scientific World Journal (1)</source>
          , (
          <year>2015</year>
          ):
          <fpage>545308</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Self-improving reactive agents based on reinforcement learning, planning and teaching</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>8</volume>
          , (
          <year>1992</year>
          ):
          <fpage>293</fpage>
          -
          <lpage>321</lpage>
          . https://doi.org/10.1007/bf00992699.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>