<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Seventh International Workshop on Computer Modeling and Intelligent Systems, May</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multi-Agent Reinforcement Learning Methods with Dynamic Parameters for Logistic Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eugene Fedorov</string-name>
          <email>fedorovee75@ukr.net</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Nechyporenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaroslav Korpan</string-name>
          <email>y.korpan@chdtu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tetiana Neskorodieva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cherkasy State Technological University</institution>
          ,
          <addr-line>Shevchenko blvd., 460, Cherkasy, 18006</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Uman National University of Horticulture, Uman</institution>
          ,
          <addr-line>Instytutska str., 1, Uman, 20305</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>3</volume>
      <issue>2024</issue>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Part of Industry 4.0 is building computer systems by combining artificial intelligence with robotics. Such computer systems play an important role in the planning of cargo transportation in supply chain management. One of the approaches to building such computer systems is the use of multi-agent systems. The aim of the work is to create a methodology for constructing proactive agents based on reinforcement learning to solve the problem of optimal planning of cargo transportation. To solve the problem of insufficient efficiency of computer agents, the existing methods of statistical and machine learning were investigated. To date, the most efficient approaches to creating proactive agents are reinforcement learning approaches. The formalization of the functioning of proactive agents is performed. As a part of creating a model for the functioning of proactive agents based on reinforcement learning, a procedure for generating a quasi-optimal action plan is proposed that models the planning function of a proactive agent, which speeds up the decision-making process. Multi-agent reinforcement learning methods are proposed, which are close to random search at the initial iterations, and close to directed search at the final iterations. This is ensured by the use of dynamic parameters and allows the increase in the learning rate by approximately 10 times while maintaining the mean squared error of the method.</p>
      </abstract>
      <kwd-group>
        <kwd>supply chain management</kwd>
        <kwd>multi-agent system</kwd>
        <kwd>proactive agent</kwd>
        <kwd>reinforcement learning</kwd>
        <kwd>dynamic programming</kwd>
        <kwd>Monte Carlo</kwd>
        <kwd>Temporal-Difference Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The fourth industrial revolution or Industry 4.0 has brought about rapid changes in technology,
manufacturing and social processes in the 21st century due to increasing interconnection and
intelligent automation. Part of this phase of industrial change is the integration of artificial intelligence
with robotics, which blurs the boundaries between the physical, digital and biological worlds and is
based on parallel and distributed computing [1].</p>
      <p>Such computer systems play an important role in the planning of cargo transportation in supply
chain management (SCM) [2] and audit [3-4]. One of the approaches to building such computer
systems is the use of multi-agent systems.</p>
      <p>Despite a large number of studies on the problem of improving the efficiency of supply chains and
reducing logistics costs, some questions remain open. The complexity of supply chains is constantly
increasing due to globalization and other factors. Whereas goods used to be purchased in centralized
hypermarkets, online trading is now developing, with its own distinct SCM stages.</p>
      <p>The aim of the work is to create a methodology for constructing proactive agents based on
reinforcement learning to solve the problem of optimal planning of cargo transportation. To achieve
the goal, the following tasks were set and solved:
• formalize the functioning of proactive agents;
• propose models for the functioning of proactive agents with a utility function based on reinforcement learning;
• propose a multi-agent reinforcement learning method with temporal difference and dynamic parameters;
• propose a multi-agent reinforcement learning method based on Monte Carlo and dynamic parameters;
• propose a multi-agent reinforcement learning method based on adaptive dynamic programming and dynamic parameters.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Formulation of the research problem</title>
      <p>The problem of increasing the efficiency of optimal cargo transportation planning comes down to the
problem of finding a set of plans {π_p} that delivers the minimum of the mean squared error (the
difference between the cost of the received plan and the cost of the optimal plan):</p>
      <p>F = (1/P) Σ_{p=1}^{P} (f(π_p) − f(π_p*))² → min,
where P is the power of the set of plans, π_p is the pth received plan, π_p* is the pth optimal plan, and f(·) is the cost function of a plan (for example, the length of the route in the case of the traveling salesman problem).</p>
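      <p>As a small illustration (not part of the original problem statement), the criterion F can be computed in Python as follows; the function name and the representation of plan costs as lists are assumptions introduced for the example.</p>
      <preformat>
def planning_mse(received_costs, optimal_costs):
    """Mean squared error F between the costs f(pi_p) of the received plans
    and the costs f(pi_p*) of the corresponding optimal plans (equal-length lists)."""
    P = len(received_costs)   # power of the set of plans
    return sum((f - f_opt) ** 2 for f, f_opt in zip(received_costs, optimal_costs)) / P
      </preformat>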
    </sec>
    <sec id="sec-3">
      <title>3. Literature review</title>
      <p>A proactive agent makes a decision about choosing a goal from a set of possible goals and about how to achieve it by forming an action
plan based on logical inference. A proactive agent may also be based on a utility function.</p>
      <p>The advantages and disadvantages of proactive agents and reactive agents with an internal state are
practically the same [5-6].</p>
      <sec id="sec-3-1">
        <title>Thus, the current problem is the low efficiency of the considered software agents.</title>
        <p>At present, instead of expert systems with logical inference used in decision-making agents,
reinforcement learning is actively used [7-8]. The main areas of single-agent reinforcement learning
are:
• dynamic programming [9-10];
• adaptive dynamic programming [11-12];
• Monte Carlo [13-14];
• temporal-difference learning [15-16];
• policy-based methods [17-18];
• actor-critic methods [19-20].</p>
      </sec>
      <sec id="sec-3-2">
        <title>Today, multi-agent methods are actively developed [21–22].</title>
      </sec>
      <sec id="sec-3-3">
        <title>The advantages of reinforcement learning over inference are:</title>
        <p>• no labeled data sets are required, which is especially relevant for large amounts of data [23-24];
• there is no imitation of a teacher; instead, a new solution that people have not even considered can be proposed [25-26];
• a quality criterion / utility function is used [27-28].</p>
      </sec>
      <sec id="sec-3-4">
        <title>Disadvantages of reinforcement learning based on dynamic programming [9-10]:</title>
        <p>• a priori knowledge about the probabilities of transitions between states is required;
• action is not selected (for a fixed policy).</p>
      </sec>
      <sec id="sec-3-5">
        <title>Disadvantages of reinforcement learning based on adaptive dynamic programming [11-12]:</title>
        <p>• action is not selected (for a fixed policy);
• cannot directly optimize the policy;
• a large number of interactions between the agent and the environment;
• converges to the global optimum only in the case of a finite number of actions and states;
• prone to overfitting.</p>
        <p>Disadvantages of Monte Carlo based reinforcement learning [13-14]:
• action is not selected (for a fixed policy);
• cannot directly optimize the policy;
• a large number of long trajectories is required;
• updating the value of the cost function only after receiving the entire trajectory;
• does not always converge to the global optimum;
• prone to overfitting.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Disadvantages of reinforcement learning based on temporal-difference learning [15-16]:</title>
        <p>• the policy is fixed, so no action is selected (if TD-learning);
• cannot directly optimize the policy;
• a large number of interactions between the agent and the environment;
• converges to the global optimum only in the case of a finite number of actions and states;
• prone to underfitting (in the case of one-step TD learning).</p>
      </sec>
      <sec id="sec-3-7">
        <title>Disadvantages of policy-based reinforcement learning [17-18]:</title>
        <p>• requires a large number of long trajectories;
• does not always converge to the global optimum;
• prone to overfitting.</p>
        <p>Disadvantages of actor-critic reinforcement learning [19-20]:
• a large number of long trajectories (if MC learning) or a large number of interactions between
the agent and the environment (if TD learning);
• does not always converge to the global optimum (if MC learning) or converges to the global
optimum only in the case of a finite number of actions and states (if TD learning);
• prone to overfitting (in the case of MC learning);
• prone to underfitting (in the case of one-step TD learning).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Formalization of models of proactive agents functioning</title>
      <p>For proactive agents, the internal state is called belief, a possible goal is called desire, and the best goal is
called intention.</p>
      <sec id="sec-4-1">
        <title>Formalization of the functioning of a proactive agent.</title>
      </sec>
      <sec id="sec-4-2">
        <title>Perception function (1)</title>
        <p>
          see: E → Per (1)
maps the current state of the environment E into a new perception Per.
        </p>
        <sec id="sec-4-2-1">
          <title>The state change function next is called the belief change function brf (2)</title>
          <p>
            brf: Bel × Per → Bel (2)
and maps a belief (internal state) Bel and a perception Per into a belief (internal state) Bel.
          </p>
          <p>Changing the intention (the best goal) is the sequential execution of the function options, which selects the
set of desires (possible goals), and the filtering function filter, which ensures the choice of the
intention (the best goal) from the set of desires (possible goals).</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Function options for generating possible variants (3)</title>
        <p>
          options: Bel × Int → Des (3)
maps a belief (internal state) Bel and an intention (best goal) Int into a set of desires (possible goals) Des.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Filter function filter (4)</title>
        <p>
          filter: Bel × Des × Int → Int (4)
maps a belief (internal state) Bel, a subset of desires (possible goals) Des and an intention (best goal) Int into an
intention (best goal) Int.
        </p>
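        <p>A minimal Python sketch of one deliberation cycle built from the functions (1)-(4) is given below; it is an illustration rather than the authors' implementation, and the function objects see, brf, options and filter_ are assumed to be supplied by a concrete agent.</p>
        <preformat>
def agent_cycle(env, bel, intention, see, brf, options, filter_):
    """One deliberation cycle of a proactive agent, following functions (1)-(4)."""
    per = see(env)                                 # (1) perceive the current state of the environment
    bel = brf(bel, per)                            # (2) revise the belief (internal state)
    desires = options(bel, intention)              # (3) generate the set of possible goals
    intention = filter_(bel, desires, intention)   # (4) choose the best goal
    return bel, intention
        </preformat>
        <p>The planning function plan introduced below is then applied to the revised belief and the chosen intention to produce an action plan.</p>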
        <p>Plan π is a sequence of actions π = (a1, ..., an), where each ai is an element of the set Ac.</p>
        <p>Plan = {π0, π1, ...} is the set of all plans.</p>
        <sec id="sec-4-4-1">
          <title>Instead of an action selection function action, a new planning function plan is used (5)</title>
          <p>plan: Bel × Int × Ac → Plan (5)
maps a belief (internal state) Bel, an intention (best goal) Int, and a subset of actions Ac into a plan from the set Plan.</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>5. Modelling the functioning of proactive agents with a utility function through reinforcement learning</title>
        <p>Let a utility function u assign a utility to a state and be represented as (6)</p>
        <p>u(s(n)) = max_{a ∈ A(s(n))} Q(s(n), a), (6)
where Q(s(n), a) is the state-action cost function (the profit in the case of state s(n) and action a) and A(s(n)) is the set of actions available in state s(n).</p>
      </sec>
      <sec id="sec-4-6">
        <title>Let there be an experience replay memory (7)</title>
        <p>
          M = {(s, a, R(s, a, s′), s′)}, (7)
where R(s, a, s′) is the reward for the transition from state s to state s′ as a result of action a.
        </p>
        <p>Then, for a proactive agent with a utility function, the procedure for generating an action plan 
for the transition from the internal state (belief) s0 to the target state (intention) s* models the
planning function plan and is presented in the following form.</p>
        <p>1. Initialization</p>
        <p>s(0)= s0, iteration number n = 1.</p>
      </sec>
      <sec id="sec-4-7">
        <title>2. Choice of action (8) and observation of the internal state (9)</title>
        <p>y(n) = argmax_{a ∈ A(s(n−1))} Q(s(n−1), a), (8)</p>
        <p>(s, a, R(s, a, s′), s′) ∈ M : s(n−1) = s ∧ y(n) = a → s(n) = s′. (9)</p>
      </sec>
      <sec id="sec-4-8">
        <title>3. Termination condition</title>
        <p>If n &lt; N, then n = n + 1 and go to step 2; otherwise π = (y(1), ..., y(n)).</p>
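        <p>The following minimal Python sketch illustrates this plan-generation procedure; it is not the authors' implementation, and the representation of the state-action cost function Q as a dictionary, the experience replay memory as a list of (s, a, r, s′) tuples, and the helper actions_of are assumptions introduced for the example.</p>
        <preformat>
def generate_plan(s0, Q, memory, N, actions_of):
    """Quasi-optimal plan generation from the replay memory, following steps 1-3 and (8)-(9)."""
    s, plan = s0, []
    for _ in range(N):
        # (8) choose the action with the highest state-action cost in the current state
        y = max(actions_of(s), key=lambda a: Q[(s, a)])
        plan.append(y)
        # (9) look up an experience (s, a, r, s') in the memory to observe the next internal state
        successors = [s2 for (s1, a, r, s2) in memory if s1 == s and a == y]
        if not successors:
            break   # no recorded transition for (s, y): stop the rollout early
        s = successors[0]
    return plan
        </preformat>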
        <p>The paper proposes reinforcement learning methods based on temporal-difference learning, based on Monte Carlo, and based on adaptive dynamic programming.</p>
      </sec>
      <sec id="sec-4-9">
        <title>6. Multi-agent reinforcement learning with temporal-difference and dynamic parameters</title>
        <p>The method consists of the following steps.</p>
      </sec>
      <sec id="sec-4-10">
        <title>1. Initialization.</title>
        <p>1.1. The maximum number of iterations N, the number of agents K, the maximum length of the states' sequence T, the discrete set of states S, the discrete set of actions A(s), s ∈ S, the reward R(s, a), a ∈ A(s), s ∈ S, the parameters α1min, α1max, α2min, α2max (control the learning rate), 0 &lt; α1min &lt; α1max &lt; 1, 0 &lt; α2min &lt; α2max &lt; 1, the parameters εmin, εmax (control the ε-greedy policy), 0 &lt; εmin &lt; εmax &lt; 1, and the parameters γmin, γmax (control discounting), 0 &lt; γmin &lt; γmax &lt; 1, are set.</p>
        <p>1.2. Reward tables are initialized for the kth agent: Q_k = [Q_k(s, a)], Q_k(s, a) = 0, a ∈ A(s), s ∈ S, k = 1, ..., K.</p>
      </sec>
      <sec id="sec-4-11">
        <title>1.3. The reward table is initialized for a swarm of agents</title>
        <p>2. Iteration number n=1.</p>
      </sec>
      <sec id="sec-4-12">
        <title>3. The parameters are calculated (10)-(13)</title>
        <p>Q swarm = [Q swarm(s,a)]</p>
        <p>,</p>
        <p>Q swarm(s,a)= 0, a A (s), sS.
1(n)= max − (1max − min)</p>
        <p>1 1</p>
        <p>
          N − 1,
(n)=  min + ( max −  min) n − 1
n − 1
N − 1,
n − 1
6. For each kth agent, action akt is chosen using the ε-greedy policy π. If U (
          <xref ref-type="bibr" rid="ref1">0,1</xref>
          ) (n), then
choose action akt randomly from the set of allowed actions A (skt), otherwise choose action akt
in the form (
          <xref ref-type="bibr" rid="ref14">14</xref>
          )
i.e.
        </p>
        <p>
          , b A (skt)),
akt = (skt), k  1,K .
7. For each kth agent, a reward R(skt,akt), k 1,K is calculated.
8. For each kth agent a new state skt = akt, k 1,K is observed.
9. For each kth agent, the value of the combinations of the state-action cost functions of the swarm
and the kth agent is calculated, i.e.
in the form (
          <xref ref-type="bibr" rid="ref15">15</xref>
          )
~
Q k (skt,akt)= (1− 2(n))Q swarm(skt,akt)+ 2(n)Q k (skt,akt), k 1,K .
(
          <xref ref-type="bibr" rid="ref10">10</xref>
          )
(
          <xref ref-type="bibr" rid="ref11">11</xref>
          )
(
          <xref ref-type="bibr" rid="ref12">12</xref>
          )
(
          <xref ref-type="bibr" rid="ref13">13</xref>
          )
(
          <xref ref-type="bibr" rid="ref14">14</xref>
          )
10. For each kth agent, the value of the cost function of the state-action Q k (skt,akt) is calculated
as (
          <xref ref-type="bibr" rid="ref16">16</xref>
          )
        </p>
        <p>
          ~
(1− 1(n))Q k (skt,akt)+
Q k (skt,akt)= + 1(n) R(~skt,akt)+ (n)bmAa(sxkt)Q~k (skt,b), t T , k 1,K . (
          <xref ref-type="bibr" rid="ref16">16</xref>
          )
(1− 1(n))Q k (skt,akt)+ 1(n)R(skt,akt), t= T
11. Calculate the value of the cost function of the state-action of the swarm of agents
Q swarm(skt,akt) for each kth agent in the form (17)

        </p>
        <p>maxQ z(skt,akt), maxQ z(skt,akt)  minQ z(skt,akt)
Q swarm(skt,akt)=  z1,K z1,K z1,K
minQ z(skt,akt), maxQ z(skt,akt)  minQ z(skt,akt)
z1,K z1,K z1,K
, k 1,K .</p>
        <p>(17)
12. For each kth agent, the current state skt = skt, k 1,K is set.</p>
        <p>13. If the current time is not the last, i.e. t T , then increase the iteration number, i.e. t= t+ 1,
go to step 6.</p>
        <p>14. If the current iteration is not the last one, i.e. n  N , then increase the iteration number, i.e.
n = n + 1, go to step 3.</p>
        <p>Note. Upon completion of the method, plan k = (yk1,..y.kt,,..y.kT,) is formed for each kth agent
(18)
ykt = argamAa(sxkt)Q k (skt,a)</p>
        <p>, sS, k 1,K .</p>
      </sec>
      <sec id="sec-4-13">
        <title>The plan of the agent that satisfies the quality criterion better than others is selected.</title>
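        <p>As an illustration of steps 1-14, the following minimal Python sketch puts the update rules (10)-(18) together. It is not the authors' implementation: the environment interface (states, actions_of, transition, reward), all identifiers and the default parameter values are assumptions introduced for the example, and the swarm aggregation (17) is taken as the extreme value of larger magnitude.</p>
        <preformat>
import random

def td_swarm_learning(states, actions_of, transition, reward, N=200, K=20, T=10,
                      a1=(0.1, 0.9), a2=(0.1, 0.9), eps=(0.1, 0.9), gam=(0.1, 0.9)):
    """Multi-agent temporal-difference learning with dynamic parameters, steps 1-14."""
    # Step 1: per-agent tables Q_k and the shared swarm table Q_swarm, all initialized to zero.
    Q = [{(s, a): 0.0 for s in states for a in actions_of(s)} for _ in range(K)]
    Q_swarm = {(s, a): 0.0 for s in states for a in actions_of(s)}

    for n in range(1, N + 1):                            # steps 2 and 14: iteration loop
        frac = (n - 1) / (N - 1)
        alpha1 = a1[1] - (a1[1] - a1[0]) * frac          # (10) learning rate, decreasing
        alpha2 = a2[1] - (a2[1] - a2[0]) * frac          # (11) weight of the agent's own table, decreasing
        epsilon = eps[1] - (eps[1] - eps[0]) * frac      # (12) exploration, decreasing
        gamma = gam[0] + (gam[1] - gam[0]) * frac        # (13) discount, increasing

        state = [random.choice(states) for _ in range(K)]    # steps 4-5: initial states
        for t in range(1, T + 1):                        # steps 6-13: time loop
            for k in range(K):
                def q_mix(s, a, k=k):                    # (15) combination of swarm and agent tables
                    return (1 - alpha2) * Q_swarm[(s, a)] + alpha2 * Q[k][(s, a)]

                acts = actions_of(state[k])
                if epsilon > random.random():            # step 6: epsilon-greedy choice (14)
                    act = random.choice(acts)
                else:
                    act = max(acts, key=lambda b: q_mix(state[k], b))
                r = reward(state[k], act)                # step 7: reward
                nxt = transition(state[k], act)          # step 8: new state (in the paper it equals the action)
                if t != T:                               # step 10: temporal-difference update (16)
                    target = r + gamma * max(q_mix(nxt, b) for b in actions_of(nxt))
                else:
                    target = r
                Q[k][(state[k], act)] = (1 - alpha1) * q_mix(state[k], act) + alpha1 * target
                # Step 11: swarm aggregation (17) - keep the extreme value of larger magnitude.
                vals = [Q[z][(state[k], act)] for z in range(K)]
                hi, lo = max(vals), min(vals)
                Q_swarm[(state[k], act)] = hi if abs(hi) >= abs(lo) else lo
                state[k] = nxt                           # step 12: move to the new state
    return Q, Q_swarm
        </preformat>
        <p>For a routing task such as Berlin52, states would be the cities, actions_of(s) the admissible next cities, and the plan (18) of each agent would be read off after the last iteration as y_kt = argmax_{a ∈ A(s_kt)} Q_k(s_kt, a).</p>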
      </sec>
      <sec id="sec-4-13a">
        <title>7. Multi-agent Monte Carlo reinforcement learning with dynamic parameters</title>
        <p>This method is presented in the following form.</p>
        <p>1. Initialization.</p>
        <p>1.1. The maximum number of iterations N, the number of agents K, the discrete set of states S, the discrete set of actions A(s), s ∈ S, the reward R(s, a), a ∈ A(s), s ∈ S, the parameters α1min, α1max, α2min, α2max (control the learning rate), 0 &lt; α1min &lt; α1max &lt; 1, 0 &lt; α2min &lt; α2max &lt; 1, the parameters εmin, εmax (control the ε-greedy policy), 0 &lt; εmin &lt; εmax &lt; 1, and the parameters γmin, γmax (control discounting), 0 &lt; γmin &lt; γmax &lt; 1, are set.</p>
        <p>1.2. Reward tables are initialized for the kth agent: Q_k = [Q_k(s, a)], Q_k(s, a) = 0, a ∈ A(s), s ∈ S, k = 1, ..., K.</p>
      </sec>
      <sec id="sec-4-14">
        <title>1.3. The reward table is initialized for a swarm of agents</title>
        <p>Q^swarm = [Q^swarm(s, a)], Q^swarm(s, a) = 0, a ∈ A(s), s ∈ S.</p>
        <p>1.4. Tables of the number of transitions are initialized for the kth agent: D_k = [D_k(s, a)], D_k(s, a) = 0, a ∈ A(s), s ∈ S, k = 1, ..., K.</p>
        <p>2. Iteration number n = 1.</p>
        <p>3. The parameters are calculated (19)-(22):
α1(n) = α1max − (α1max − α1min)(n − 1)/(N − 1), (19)
α2(n) = α2max − (α2max − α2min)(n − 1)/(N − 1), (20)
ε(n) = εmax − (εmax − εmin)(n − 1)/(N − 1), (21)
γ(n) = γmin + (γmax − γmin)(n − 1)/(N − 1). (22)</p>
        <p>4. A trajectory τ_k = (s_k0, a_k0, r_k0, ..., s_kT, a_kT, r_kT) is generated for each kth agent, where a_kt = π(s_kt) and r_kt = R(s_kt, a_kt); as a result of action a_kt a new state s_k,t+1 and reward r_kt are observed; state s_k0 can change at each iteration; the policy π of choosing an action is ε-greedy; k = 1, ..., K.</p>
        <p>5. Number of the moment in time t = T.</p>
        <p>6. For each kth agent, the profit is calculated in the form of a discounted sum of rewards from time t to time T (23):
R_kt(τ_k) = Σ_{t′=t}^{T} (γ(n))^{t′−t} r_{kt′}, k = 1, ..., K. (23)</p>
        <p>7. For each kth agent, the combination of the state-action cost functions of the swarm and the kth agent is calculated in the form (24):
Q̃_k(s_kt, a_kt) = (1 − α2(n)) Q^swarm(s_kt, a_kt) + α2(n) Q_k(s_kt, a_kt), k = 1, ..., K. (24)</p>
        <p>8. For each kth agent, the transition counter D_k(s_kt, a_kt) is increased, i.e. D_k(s_kt, a_kt) = D_k(s_kt, a_kt) + 1, k = 1, ..., K.</p>
        <p>9. For each kth agent, the value of the state-action cost function Q_k(s_kt, a_kt) is calculated as (25):
Q_k(s_kt, a_kt) = (1 − α1(n)/D_k(s_kt, a_kt)) Q̃_k(s_kt, a_kt) + (α1(n)/D_k(s_kt, a_kt)) R_kt(τ_k), k = 1, ..., K. (25)</p>
        <p>10. The value of the state-action cost function of the swarm of agents Q^swarm(s_kt, a_kt) is calculated for each kth agent in the form (26):
Q^swarm(s_kt, a_kt) = max_{z=1,...,K} Q_z(s_kt, a_kt), if |max_{z=1,...,K} Q_z(s_kt, a_kt)| ≥ |min_{z=1,...,K} Q_z(s_kt, a_kt)|;
Q^swarm(s_kt, a_kt) = min_{z=1,...,K} Q_z(s_kt, a_kt), otherwise; k = 1, ..., K. (26)</p>
        <p>11. If t &gt; 0, then t = t − 1, go to step 6.</p>
        <p>12. If the current iteration is not the last one, i.e. n &lt; N, then increase the iteration number, i.e. n = n + 1, and go to step 3; otherwise stop.</p>
        <p>Note. Upon completion of the method, plan π_k = (y_k1, ..., y_kt, ..., y_kT) is formed for each kth agent in the form (27):
y_kt = argmax_{a ∈ A(s_kt)} Q_k(s_kt, a), s ∈ S, k = 1, ..., K. (27)</p>
      </sec>
      <sec id="sec-4-15">
        <title>The plan of the agent that satisfies the quality criterion better than others is selected.</title>
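        <p>A minimal Python sketch of the Monte Carlo update for a single visited state-action pair is given below; it is an illustration, not the authors' implementation, and the dictionary-based tables, the helper names and the count-normalized step in (25) follow the reconstruction above.</p>
        <preformat>
def mc_profit(rewards, gamma, t):
    """Discounted profit (23): sum of gamma**(i - t) * r_i for i from t to the end of the trajectory."""
    return sum(gamma ** (i - t) * rewards[i] for i in range(t, len(rewards)))

def mc_update(Q_k, Q_swarm, D_k, s, a, profit, alpha1, alpha2):
    """Every-visit Monte Carlo update with a transition counter, following (24)-(25)."""
    D_k[(s, a)] = D_k.get((s, a), 0) + 1                             # step 8: transition counter
    q_mix = (1 - alpha2) * Q_swarm[(s, a)] + alpha2 * Q_k[(s, a)]    # (24) swarm/agent combination
    step = alpha1 / D_k[(s, a)]                                      # dynamic, count-normalized learning rate
    Q_k[(s, a)] = (1 - step) * q_mix + step * profit                 # (25)
        </preformat>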
      </sec>
      <sec id="sec-4-15a">
        <title>8. Multi-agent reinforcement learning method based on adaptive dynamic programming and dynamic parameters</title>
        <p>The method consists of the following steps.</p>
        <p>1. Initialization.</p>
        <p>1.1. The maximum number of iterations N, the number of agents K, the maximum length of the states' sequence T, the discrete set of states S, the discrete set of actions A(s), s ∈ S, the parameters α1min, α1max, α2min, α2max (control the learning rate), 0 &lt; α1min &lt; α1max &lt; 1, 0 &lt; α2min &lt; α2max &lt; 1, the parameters εmin, εmax (control the ε-greedy policy), 0 &lt; εmin &lt; εmax &lt; 1, and the parameters γmin, γmax (control discounting), 0 &lt; γmin &lt; γmax &lt; 1, are set.</p>
        <p>1.2. Reward tables are initialized for the kth agent: Q_k(s, a), Q_k(s, a) = 0, a ∈ A(s), s ∈ S, k = 1, ..., K.</p>
      </sec>
      <sec id="sec-4-16">
        <title>1.3. The reward table is initialized for a swarm of agents</title>
        <p>Q^swarm(s, a), Q^swarm(s, a) = 0, a ∈ A(s), s ∈ S.</p>
        <p>1.4. The tables of the number of transitions are initialized for the kth agent: D_k(s, a), D_k(s, a) = 0, a ∈ A(s), s ∈ S, k = 1, ..., K.</p>
        <p>1.5. The state observation quantity tables are initialized for the kth agent: D_k(s), D_k(s) = 0, s ∈ S, k = 1, ..., K.</p>
        <p>2. Iteration number n = 1.</p>
      </sec>
      <sec id="sec-4-17">
        <title>3. The parameters are calculated (28)-(31)</title>
        <p>α1(n) = α1max − (α1max − α1min)(n − 1)/(N − 1), (28)
α2(n) = α2max − (α2max − α2min)(n − 1)/(N − 1), (29)
ε(n) = εmax − (εmax − εmin)(n − 1)/(N − 1), (30)
γ(n) = γmin + (γmax − γmin)(n − 1)/(N − 1). (31)</p>
        <p>4. The number of the moment in time is set t = 1.</p>
        <p>5. The initial state s_kt, k = 1, ..., K is observed for each kth agent.</p>
        <p>6. For each kth agent, action a_kt is chosen using the ε-greedy policy π. If U(0,1) &lt; ε(n), then action a_kt is chosen randomly from the set of allowed actions A(s_kt); otherwise action a_kt is chosen in the form (32), i.e.
a_kt = argmax_{b ∈ A(s_kt)} Q̃_k(s_kt, b), a_kt = π(s_kt), k = 1, ..., K. (32)</p>
        <p>7. For each kth agent, a reward R(s_kt, a_kt), k = 1, ..., K is calculated.</p>
        <p>8. For each kth agent a new state s′_kt = a_kt, k = 1, ..., K is observed.</p>
      </sec>
      <sec id="sec-4-18">
        <title>9. The transition function is calculated (33)</title>
        <p>Pk (skt |skt,akt)=
in the form (34)
~
Qk (skt , akt ) = (1 − 2 (n))Qswarm (skt , akt ) + 2 (n)Qk (skt , akt ) , k 1, K .
(28)
(29)
(30)
(31)
(32)
(33)
11. For each kth agent, the value of the cost function of the state-action Q k (s,a) is calculated as
(35)</p>
        <p>~
(1− 1(n))Q k (skt,akt)+

  ~
 ~ bA (skt) k (skt,b),
Q k (skt,akt)= + 1(n)Pk (skt |skt,akt) R(skt,akt)+ (n) max Q
(1− 1(n))Q k (skt,akt)+ 1(n)Pk (skt |skt,akt)R(skt,akt),
t T
t= T</p>
        <p>,
k 1,K
12. Calculate the value of the cost function of the state-action of the swarm of agents
Q swarm(skt,akt) for each kth agent in the form (36)
(35)
(36)
</p>
        <p>maxQ z(skt,akt), maxQ z(skt,akt)  minQ z(skt,akt)
Q swarm(skt,akt)=  z1,K z1,K z1,K
minQ z(skt,akt), maxQ z(skt,akt)  minQ z(skt,akt)
 z1,K z1,K z1,K
13. For each kth agent, the current state skt = skt, k 1,K is set.
14. If the current time is not the last, i.e. t T , then increase the iteration number, i.e. t= t+ 1,
go to step 6.
15. If the current iteration is not the last one, i.e. n  N , then increase the iteration number, i.e.
n = n + 1, go to step 3.</p>
        <p>Note. Upon completion of the method, plan k = (yk1,..y.kt,,..y.kT,) is formed for each kth agent
(37)
ykt = argamAa(sxkt)Q k (skt,a), sS, k 1,K .
(37)</p>
      </sec>
      <sec id="sec-4-19">
        <title>The plan of the agent that satisfies the quality criterion better than others is selected.</title>
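        <p>The sketch below illustrates one ADP-style backup for a single agent; it is an assumption-laden reconstruction rather than the authors' implementation. In particular, estimating the transition function (33) as the ratio of the transition counter to the state observation counter is one plausible reading of steps 1.4, 1.5 and 9, and all identifiers are introduced for the example.</p>
        <preformat>
def adp_update(Q_k, Q_swarm, D_sa, D_s, s, a, s_next, r,
               alpha1, alpha2, gamma, actions_of, last_step):
    """One backup for agent k: empirical transition model (33), mix (34) and expected update (35)."""
    D_sa[(s, a)] = D_sa.get((s, a), 0) + 1       # number of times action a was chosen in state s
    D_s[s] = D_s.get(s, 0) + 1                   # number of times state s was observed
    p_hat = D_sa[(s, a)] / D_s[s]                # (33) transition frequency estimated from the counters

    def q_mix(st, act):                          # (34) combination of swarm and agent tables
        return (1 - alpha2) * Q_swarm[(st, act)] + alpha2 * Q_k[(st, act)]

    if last_step:                                # (35), case t = T
        target = p_hat * r
    else:                                        # (35), earlier time steps
        target = p_hat * (r + gamma * max(q_mix(s_next, b) for b in actions_of(s_next)))
    Q_k[(s, a)] = (1 - alpha1) * q_mix(s, a) + alpha1 * target
        </preformat>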
      </sec>
    </sec>
    <sec id="sec-5">
      <title>9. Experiments and results</title>
      <sec id="sec-5-1">
        <title>The numerical study of the proposed methods was carried out using Python.</title>
        <p>For the multi-agent reinforcement learning methods, the parameter values α1min = 0.1, α1max = 0.9, α2min = 0.1, α2max = 0.9 (control the learning rate), εmin = 0.1, εmax = 0.9 (control the ε-greedy policy), and γmin = 0.1, γmax = 0.9 (control discounting) were used; the number of agents is K = 20.</p>
        <sec id="sec-5-1-1">
          <title>The dependence of parameter γ(n) is defined as</title>
          <p>γ(n) = γmin + (γmax − γmin)(n − 1)/(N − 1).</p>
          <p>The dependence of parameter γ(n) on the iteration number n is linear and shows that its share increases with the iteration number.</p>
          <p>The dependences of parameters α1(n), α2(n) and ε(n) are defined as</p>
          <p>α1(n) = α1max − (α1max − α1min)(n − 1)/(N − 1),</p>
          <p>α2(n) = α2max − (α2max − α2min)(n − 1)/(N − 1),</p>
          <p>ε(n) = εmax − (εmax − εmin)(n − 1)/(N − 1).</p>
          <p>The dependence of parameters α1(n), α2(n) and ε(n) on the iteration number n is linear; it shows that their share decreases with increasing iteration number.</p>
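          <p>A minimal Python sketch of these linear schedules, with the parameter bounds used in the experiments as defaults, is shown below; the function name is an assumption introduced for the example.</p>
          <preformat>
def dynamic_parameters(n, N, a1=(0.1, 0.9), a2=(0.1, 0.9), eps=(0.1, 0.9), gam=(0.1, 0.9)):
    """Linear schedules: alpha1, alpha2 and epsilon decrease, gamma increases with iteration n = 1..N."""
    frac = (n - 1) / (N - 1)
    alpha1 = a1[1] - (a1[1] - a1[0]) * frac
    alpha2 = a2[1] - (a2[1] - a2[0]) * frac
    epsilon = eps[1] - (eps[1] - eps[0]) * frac
    gamma = gam[0] + (gam[1] - gam[0]) * frac
    return alpha1, alpha2, epsilon, gamma
          </preformat>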
          <p>The results of comparing the proposed temporal-difference reinforcement learning method with dynamic parameters and the traditional Q-learning method, based on the mean squared error criterion and the number of iterations, for solving the travelling salesman problem (Berlin52 standard dataset), which is used for planning cargo transportation, are presented in Table 1.</p>
          <p>The results of comparing the proposed Monte Carlo based reinforcement learning method with dynamic parameters and the traditional every-visit method, based on the mean squared error criterion and the number of iterations, for solving the travelling salesman problem (Berlin52 standard dataset), which is used for planning cargo transportation, are presented in Table 2.</p>
          <p>The results of comparing the proposed reinforcement learning method based on adaptive dynamic programming with dynamic parameters and the traditional passive adaptive dynamic programming method, based on the mean squared error criterion and the number of iterations, for solving the travelling salesman problem (Berlin52 standard dataset), which is used for planning cargo transportation, are presented in Table 3.
Advantages of the proposed methods:
1. The modification of the reinforcement learning methods due to dynamic parameters allows for an increase in the learning rate while maintaining the mean squared error of the method (Tables 1-3).
2. The use of the multi-agent approach makes distributed computing possible and increases the learning rate while maintaining the mean squared error of the method (Tables 1-3).
3. The reinforcement learning methods with dynamic parameters use the ε-greedy approach, which is close to random search at the initial iterations and close to directed search at the final iterations. This is ensured by the use of dynamic parameters and allows for an increase in the learning rate while maintaining the mean squared error of the method (Tables 1-3).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>To solve the problem of insufficient efficiency of computer agents, the existing methods of
statistical and machine learning were investigated. These studies have shown that, to date, the
most effective approaches to creating proactive agents are reinforcement learning approaches.</p>
      <sec id="sec-6-1">
        <title>The formalization of the functioning of proactive agents has been conducted.</title>
        <p>As part of creating a model for the functioning of proactive agents based on reinforcement
learning, a procedure for generating a quasi-optimal action plan is proposed that models the
planning function of a proactive agent, which speeds up the decision-making process.
Reinforcement learning methods are proposed which are close to random search at the initial
iterations and close to directed search at the final iterations. This is ensured by
the use of dynamic parameters and the multi-agent approach and allows for an increase in the
learning rate while maintaining the mean squared error of the method.</p>
      </sec>
      <sec id="sec-6-2">
        <title>The proposed multi-agent methods will be used for freight planning in supply chain management and auditing, and were investigated on a standard data set.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Shvachych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. V.</given-names>
            <surname>Ivaschenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Busygin</surname>
          </string-name>
          , Ye. Ye.
          <article-title>Fedorov, Parallel computational algorithms in thermal processes in metallurgy and mining</article-title>
          ,
          <source>Naukovyi Visnyk Natsionalnoho Hirnychoho Universytetu</source>
          ,
          <volume>4</volume>
          (
          <year>2018</year>
          )
          <fpage>129</fpage>
          -
          <lpage>137</lpage>
          . doi:
          <volume>10</volume>
          .29202/nvngu/2018-4/19.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Grygor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fedorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nechyporenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grygorian</surname>
          </string-name>
          ,
          <article-title>Neural network forecasting method for inventory management in the supply chain</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          ,
          <year>2022</year>
          , volume
          <volume>3137</volume>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Neskorodieva</surname>
          </string-name>
          , E. Fedorov,
          <article-title>Method for automatic analysis of compliance of expenses data and the enterprise income by neural network model of forecast</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          , volume
          <volume>2631</volume>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>158</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Neskorodieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fedorov</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Izonin</surname>
          </string-name>
          ,
          <article-title>Forecast method for audit data analysis by modified liquid state machine</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          , volume
          <volume>2623</volume>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jezic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen-Burger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kusek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sperka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Howlett</surname>
          </string-name>
          , L. C. Jain (Eds.),
          <article-title>Agents and multi-agent systems: technologies and applications</article-title>
          , volume
          <volume>186</volume>
          of Smart innovation,
          <source>systems and technologies</source>
          ,
          <year>2020</year>
          . doi:
          <volume>10</volume>
          .1007/
          <fpage>978</fpage>
          -981-15-5764-4.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Norvig</surname>
          </string-name>
          ,
          <source>Artificial Intelligence: a Modem Approach</source>
          , Englewood Cliffs, NJ: Prentice
          <string-name>
            <surname>Hall</surname>
            <given-names>PTR</given-names>
          </string-name>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Ottoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Nepomuceno</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. S. de Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. C. R. de Oliveira</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for the traveling salesman problem with refueling</article-title>
          ,
          <source>Complex &amp; Intelligent Systems</source>
          ,
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>2001</fpage>
          -
          <lpage>2015</lpage>
          . doi:
          <volume>10</volume>
          .1007/s40747-021-00444-4.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Oroojlooy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hajinezhad</surname>
          </string-name>
          ,
          <article-title>A review of cooperative multi-agent deep reinforcement learning</article-title>
          ,
          <source>Applied Intelligence</source>
          ,
          <volume>53</volume>
          (
          <year>2023</year>
          )
          <fpage>13677</fpage>
          -
          <lpage>13722</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10489-022-04105-y.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <article-title>Supplementary heuristic dynamic programming for wastewater treatment process control</article-title>
          ,
          <source>Expert Systems with Applications</source>
          ,
          <volume>247</volume>
          (
          <year>2024</year>
          )
          <article-title>123280</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.eswa.
          <year>2024</year>
          .
          <volume>123280</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>U.</given-names>
            <surname>Satic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jacko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kirkbride</surname>
          </string-name>
          ,
          <article-title>A simulation-based approximate dynamic programming approach to dynamic and stochastic resource-constrained multi-project scheduling problem</article-title>
          ,
          <source>European Journal of Operational Research</source>
          ,
          <volume>315</volume>
          (
          <year>2024</year>
          )
          <fpage>454</fpage>
          -
          <lpage>469</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.ejor.
          <year>2023</year>
          .
          <volume>10</volume>
          .046.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>Nonzero-sum games using actor-critic neural networks: A dynamic event-triggered adaptive dynamic programming</article-title>
          ,
          <source>Information Sciences</source>
          ,
          <volume>662</volume>
          (
          <year>2024</year>
          )
          <article-title>120236</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.ins.
          <year>2024</year>
          .
          <volume>120236</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Optimal dynamic output feedback control of unknown linear continuous-time systems by adaptive dynamic programming</article-title>
          ,
          <source>Automatica</source>
          ,
          <volume>163</volume>
          (
          <year>2024</year>
          )
          <article-title>111601</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.automatica.
          <year>2024</year>
          .
          <volume>111601</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pascal</surname>
          </string-name>
          ,
          <article-title>Artificial neural networks to solve dynamic programming problems: A bias-corrected Monte Carlo operator</article-title>
          ,
          <source>Journal of Economic Dynamics and Control</source>
          ,
          <volume>162</volume>
          (
          <year>2024</year>
          )
          <article-title>104853</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.jedc.
          <year>2024</year>
          .
          <volume>104853</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Albrecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Christianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          , Multi-Agent
          <source>Reinforcement Learning: Foundations and Modern Approaches</source>
          , MIT Press, Cambridge, MA, USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Sh.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Sh. Dong</surname>
          </string-name>
          , Ya. Gao,
          <article-title>Online attentive kernel-based temporal difference learning</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          ,
          <volume>278</volume>
          (
          <year>2023</year>
          )
          <article-title>110902</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.knosys.
          <year>2023</year>
          .
          <volume>110902</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Stanković</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Stanković</surname>
          </string-name>
          ,
          <article-title>Distributed consensus-based multi-agent temporaldifference learning</article-title>
          ,
          <source>Automatica</source>
          ,
          <volume>151</volume>
          (
          <year>2023</year>
          )
          <article-title>110922</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.automatica.
          <year>2023</year>
          .
          <volume>110922</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. S. Stebenkov, N. O. Nikitin, Automated generation of ensemble pipelines using policy-based reinforcement learning method, Procedia Computer Science, 229 (2023) 70-79. doi:10.1016/j.procs.2023.12.009.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] F. Huang, X. Deng, Y. He, W. Jiang, A novel policy based on action confidence limit to improve exploration efficiency in reinforcement learning, Information Sciences, 640 (2023) 119011. doi:10.1016/j.ins.2023.119011.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] J. Zhang, Sh. Han, X. Xiong, Sh. Zhu, Sh. Lu, Explorer-Actor-Critic: Better actors for deep reinforcement learning, Information Sciences, 662 (2024) 120255. doi:10.1016/j.ins.2024.120255.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Zh. Zhang, X. Liang, C. Chen, D. Liu, Ch. Yu, W. Li, Defense penetration strategy for unmanned surface vehicle based on modified soft actor-critic, Ocean Engineering, 304 (2024) 117840. doi:10.1016/j.oceaneng.2024.117840.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] T. Li, K. Zhu, N. C. Luong, D. Niyato, Q. Wu, Y. Zhang, B. Chen, Applications of multi-agent reinforcement learning in future Internet: A comprehensive survey, IEEE Communications Surveys &amp; Tutorials, 24 (2) (2022) 1240-1279. doi:10.1109/COMST.2022.3160697.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] L. M. Schmidt, J. Brosig, A. Plinge, B. M. Eskofier, C. Mutschler, An introduction to multi-agent reinforcement learning and review of its application to autonomous mobility, in: IEEE 25th International Conference on Intelligent Transportation Systems, 2022, pp. 1342-1349.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] P. Yadav, A. Mishra, S. Kim, A comprehensive survey on multi-agent reinforcement learning for connected and automated vehicles, Sensors, 23 (10) (2023) 4710. doi:10.3390/s23104710.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] J. Orr, A. Dutta, Multi-agent deep reinforcement learning for multi-robot applications: A survey, Sensors, 23 (7) (2023) 3625. doi:10.3390/s23073625.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] L. Canese, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Re, S. Spanò, Multi-agent reinforcement learning: A review of challenges and applications, Applied Sciences, 11 (11) (2021) 4948. doi:10.3390/app11114948.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] Z. Xu, H. van Hasselt, M. Hessel, J. Oh, S. Singh, D. Silver, Meta-gradient reinforcement learning with an objective discovered online, arXiv:2007.08433, 2020. doi:10.48550/arXiv.2007.08433.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] H. Wang, E. Miahi, M. White, M. C. Machado, Z. Abbas, R. Kumaraswamy, V. Liu, A. White, Investigating the properties of neural network representations in reinforcement learning, Artificial Intelligence, 330 (2024) 1-24. doi:10.1016/j.artint.2024.104100.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] F. Robertazzi, M. Vissani, G. Schillaci, E. Falotico, Brain-inspired meta-reinforcement learning cognitive control in conflictual inhibition decision-making task for artificial agents, Neural Networks, 154 (2022) 283-302. doi:10.1016/j.neunet.2022.06.020.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>