<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Seventh International Workshop on Computer Modeling and Intelligent Systems, May</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multi-Agent Reinforcement Learning Methods with Dynamic Parameters for Logistic Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eugene Fedorov</string-name>
          <email>fedorovee75@ukr.net</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Nechyporenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaroslav Korpan</string-name>
          <email>y.korpan@chdtu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tetiana Neskorodieva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cherkasy State Technological University</institution>
          ,
          <addr-line>Shevchenko blvd., 460, Cherkasy, 18006</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Uman National University of Horticulture, Uman</institution>
          ,
          <addr-line>Instytutska str., 1, Uman, 20305</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>3</volume>
      <issue>2024</issue>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Part of Industry 4.0 is building computer systems by combining artificial intelligence with robotics. Such computer systems play an important role in the planning of cargo transportation in supply chain management. One of the approaches to building such computer systems is the use of multi-agent systems. The aim of the work is to create a methodology for constructing proactive agents based on reinforcement learning to solve the problem of optimal planning of cargo transportation. To solve the problem of insufficient efficiency of computer agents, the existing methods of statistical and machine learning were investigated. To date, the most efficient approaches to creating proactive agents are reinforcement learning approaches. The formalization of the functioning of proactive agents is performed. As a part of creating a model for the functioning of proactive agents based on reinforcement learning, a procedure for generating a quasi-optimal action plan is proposed that models the planning function of a proactive agent, which speeds up the decision-making process. Multi-agent reinforcement learning methods are proposed, which are close to random search at the initial iterations, and close to directed search at the final iterations. This is ensured by the use of dynamic parameters and allows the increase in the learning rate by approximately 10 times while maintaining the mean squared error of the method.</p>
      </abstract>
      <kwd-group>
        <kwd>supply chain management</kwd>
        <kwd>multi-agent system</kwd>
        <kwd>proactive agent</kwd>
        <kwd>reinforcement learning</kwd>
        <kwd>dynamic programming</kwd>
        <kwd>Monte Carlo</kwd>
        <kwd>Temporal-Difference Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The fourth industrial revolution or Industry 4.0 has brought about rapid changes in technology,
manufacturing and social processes in the 21st century due to increasing interconnection and
intelligent automation. Part of this phase of industrial change is the integration of artificial intelligence
with robotics, which blurs the boundaries between the physical, digital and biological worlds and is
based on parallel and distributed computing [1].</p>
      <p>Such computer systems play an important role in the planning of cargo transportation in supply
chain management (SCM) [2] and audit [3-4]. One of the approaches to building such computer
systems is the use of multi-agent systems.</p>
      <p>Despite a large number of studies on the problem of improving the efficiency of supply chains and
reducing logistics costs, some questions remain open. The complexity of supply chains is constantly
increasing due to globalization and other factors. Whereas goods used to be purchased in centralized
hypermarkets, online trading is now developing, with its own distinct SCM stages.</p>
      <p>The aim of the work is to create a methodology for constructing proactive agents based on
reinforcement learning to solve the problem of optimal planning of cargo transportation. To achieve
the goal, the following tasks were set and solved:
• formalize the functioning of proactive agents;
• propose models for the functioning of proactive agents with a utility function based on reinforcement learning;
• propose a multi-agent reinforcement learning method with temporal difference and dynamic parameters;
• propose a multi-agent reinforcement learning method based on Monte Carlo and dynamic parameters;
• propose a multi-agent reinforcement learning method based on adaptive dynamic programming and dynamic parameters.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Formulation of the research problem</title>
      <p>The problem of increasing the efficiency of optimal cargo transportation planning comes down to the
problem of finding a set of plans {π_p} that delivers the minimum of the mean squared error (the
difference between the cost of the received plan and the cost of the optimal plan):</p>
      <p>F = (1/P) Σ_{p=1}^{P} (f(π_p) − f(π_p*))² → min,
where P is the power of the set of plans, π_p is the pth received plan, π_p* is the pth optimal plan, and f(·) is the cost function of a plan (for example, the length of the route in the case of the traveling salesman problem).</p>
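      <p>As a small illustration (not part of the original problem statement), the criterion F can be computed in Python as follows; the function name and the representation of plan costs as lists are assumptions introduced for the example.</p>
      <preformat>
def planning_mse(received_costs, optimal_costs):
    """Mean squared error F between the costs f(pi_p) of the received plans
    and the costs f(pi_p*) of the corresponding optimal plans (equal-length lists)."""
    P = len(received_costs)   # power of the set of plans
    return sum((f - f_opt) ** 2 for f, f_opt in zip(received_costs, optimal_costs)) / P
      </preformat>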
    </sec>
    <sec id="sec-3">
      <title>3. Literature review</title>
      <p>A proactive agent makes a decision about choosing a goal from a set of possible goals and about how to achieve it by forming an action
plan based on logical inference. A proactive agent may also be based on a utility function.</p>
      <p>The advantages and disadvantages of proactive agents and reactive agents with an internal state are
practically the same [5-6].</p>
      <sec id="sec-3-1">
        <title>Thus, the current problem is the low efficiency of the considered software agents.</title>
        <p>At present, instead of expert systems with logical inference used in decision-making agents,
reinforcement learning is actively used [7-8]. The main areas of single-agent reinforcement learning
are:
• dynamic programming [9-10];
• adaptive dynamic programming [11-12];
• Monte Carlo [13-14];
• temporal-difference learning [15-16];
• policy-based methods [17-18];
• actor-critic methods [19-20].</p>
      </sec>
      <sec id="sec-3-2">
        <title>Today, multi-agent methods are actively developed [21–22].</title>
      </sec>
      <sec id="sec-3-3">
        <title>The advantages of reinforcement learning over inference are:</title>
        <p>• no labeled data sets are required, which is especially relevant for large amounts of data [23-24];
• there is no imitation of a teacher; instead, a new solution that people have not even considered can be proposed [25-26];
• a quality criterion / utility function is used [27-28].</p>
      </sec>
      <sec id="sec-3-4">
        <title>Disadvantages of reinforcement learning based on dynamic programming [9-10]:</title>
        <p>• a priori knowledge about the probabilities of transitions between states is required;
• action is not selected (for a fixed policy).</p>
      </sec>
      <sec id="sec-3-5">
        <title>Disadvantages of reinforcement learning based on adaptive dynamic programming [11-12]:</title>
        <p>• action is not selected (for a fixed policy);
• cannot directly optimize the policy;
• a large number of interactions between the agent and the environment;
• converges to the global optimum only in the case of a finite number of actions and states;
• prone to overfitting.</p>
        <p>Disadvantages of Monte Carlo based reinforcement learning [13-14]:
• action is not selected (for a fixed policy);
• cannot directly optimize the policy;
• a large number of long trajectories is required;
• updating the value of the cost function only after receiving the entire trajectory;
• does not always converge to the global optimum;
• prone to overfitting.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Disadvantages of reinforcement learning based on temporal-difference learning [15-16]:</title>
        <p>• the policy is fixed, so no action is selected (if TD-learning);
• cannot directly optimize the policy;
• a large number of interactions between the agent and the environment;
• converges to the global optimum only in the case of a finite number of actions and states;
• prone to underfitting (in the case of one-step TD learning).</p>
      </sec>
      <sec id="sec-3-7">
        <title>Disadvantages of policy-based reinforcement learning [17-18]:</title>
        <p>• requires a large number of long trajectories;
• does not always converge to the global optimum;
• prone to overfitting.</p>
        <p>Disadvantages of actor-critic reinforcement learning [19-20]:
• a large number of long trajectories (if MC learning) or a large number of interactions between
the agent and the environment (if TD learning);
• does not always converge to the global optimum (if MC learning) or converges to the global
optimum only in the case of a finite number of actions and states (if TD learning);
• prone to overfitting (in the case of MC learning);
• prone to underfitting (in the case of one-step TD learning).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Formalization of models of proactive agents functioning</title>
      <p>For proactive agents, the internal state is called belief, a possible goal is called desire, and the best goal is
called intention.</p>
      <sec id="sec-4-1">
        <title>Formalization of the functioning of a proactive agent.</title>
      </sec>
      <sec id="sec-4-2">
        <title>Perception function (1)</title>
        <p>
          see: E → Per (1)
maps the current state of the environment E into a new perception Per.
        </p>
        <sec id="sec-4-2-1">
          <title>The state change function next is called the belief change function brf (2)</title>
          <p>
            brf: Bel × Per → Bel (2)
and maps a belief (internal state) Bel and a perception Per into a belief (internal state) Bel.
          </p>
          <p>Changing the intention (the best goal) is the sequential execution of the function options, which selects the
set of desires (possible goals), and the filtering function filter, which ensures the choice of the
intention (the best goal) from the set of desires (possible goals).</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Function options for generating possible variants (3)</title>
        <p>
          options: Bel × Int → Des (3)
maps a belief (internal state) Bel and an intention (best goal) Int into a set of desires (possible goals) Des.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Filter function filter (4)</title>
        <p>
          filter: Bel × Des × Int → Int (4)
maps a belief (internal state) Bel, a subset of desires (possible goals) Des and an intention (best goal) Int into an
intention (best goal) Int.
        </p>
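        <p>A minimal Python sketch of one deliberation cycle built from the functions (1)-(4) is given below; it is an illustration rather than the authors' implementation, and the function objects see, brf, options and filter_ are assumed to be supplied by a concrete agent.</p>
        <preformat>
def agent_cycle(env, bel, intention, see, brf, options, filter_):
    """One deliberation cycle of a proactive agent, following functions (1)-(4)."""
    per = see(env)                                 # (1) perceive the current state of the environment
    bel = brf(bel, per)                            # (2) revise the belief (internal state)
    desires = options(bel, intention)              # (3) generate the set of possible goals
    intention = filter_(bel, desires, intention)   # (4) choose the best goal
    return bel, intention
        </preformat>
        <p>The planning function plan introduced below is then applied to the revised belief and the chosen intention to produce an action plan.</p>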
        <p>Plan π is a sequence of actions π = (a1, ..., an), where each ai is an element of the set Ac.</p>
        <p>Plan = {π0, π1, ...} is the set of all plans.</p>
        <sec id="sec-4-4-1">
          <title>Instead of an action selection function action, a new planning function plan is used (5)</title>
          <p>plan: Bel × Int × Ac → Plan (5)
maps a belief (internal state) Bel, an intention (best goal) Int, and a subset of actions Ac into a plan from the set Plan.</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>5. Modelling the functioning of proactive agents with a utility function through reinforcement learning</title>
        <p>Let a utility function u assign a utility to a state and be represented as (6)</p>
        <p>u(s(n)) = max_{a ∈ A(s(n))} Q(s(n), a), (6)
where Q(s(n), a) is the state-action cost function (the profit in the case of state s(n) and action a) and A(s(n)) is the set of actions available in state s(n).</p>
      </sec>
      <sec id="sec-4-6">
        <title>Let there be an experience replay memory (7)</title>
        <p>
          M = {(s, a, R(s, a, s′), s′)}, (7)
where R(s, a, s′) is the reward for the transition from state s to state s′ as a result of action a.
        </p>
        <p>Then, for a proactive agent with a utility function, the procedure for generating an action plan 
for the transition from the internal state (belief) s0 to the target state (intention) s* models the
planning function plan and is presented in the following form.</p>
        <p>1. Initialization</p>
        <p>s(0)= s0, iteration number n = 1.</p>
      </sec>
      <sec id="sec-4-7">
        <title>2. Choice of action (8) and observation of the internal state (9)</title>
        <p>y(n) = argmax_{a ∈ A(s(n−1))} Q(s(n−1), a), (8)</p>
        <p>(s, a, R(s, a, s′), s′) ∈ M : s(n−1) = s ∧ y(n) = a → s(n) = s′. (9)</p>
      </sec>
      <sec id="sec-4-8">
        <title>3. Termination condition</title>
        <p>If n &lt; N, then n = n + 1 and go to step 2; otherwise π = (y(1), ..., y(n)).</p>
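        <p>The following minimal Python sketch illustrates this plan-generation procedure; it is not the authors' implementation, and the representation of the state-action cost function Q as a dictionary, the experience replay memory as a list of (s, a, r, s′) tuples, and the helper actions_of are assumptions introduced for the example.</p>
        <preformat>
def generate_plan(s0, Q, memory, N, actions_of):
    """Quasi-optimal plan generation from the replay memory, following steps 1-3 and (8)-(9)."""
    s, plan = s0, []
    for _ in range(N):
        # (8) choose the action with the highest state-action cost in the current state
        y = max(actions_of(s), key=lambda a: Q[(s, a)])
        plan.append(y)
        # (9) look up an experience (s, a, r, s') in the memory to observe the next internal state
        successors = [s2 for (s1, a, r, s2) in memory if s1 == s and a == y]
        if not successors:
            break   # no recorded transition for (s, y): stop the rollout early
        s = successors[0]
    return plan
        </preformat>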
        <p>The paper proposes reinforcement learning methods based on temporal-difference learning, based on Monte Carlo, and based on adaptive dynamic programming.</p>
      </sec>
      <sec id="sec-4-9">
        <title>6. Multi-agent reinforcement learning with temporal-difference and dynamic parameters</title>
        <p>The method consists of the following steps.</p>
      </sec>
      <sec id="sec-4-10">
        <title>1. Initialization.</title>
        <p>1.1. The maximum number of iterations N, the number of agents K, the maximum length of the states' sequence T, the discrete set of states S, the discrete set of actions A(s), s ∈ S, the reward R(s, a), a ∈ A(s), s ∈ S, the parameters α1min, α1max, α2min, α2max (control the learning rate), 0 &lt; α1min &lt; α1max &lt; 1, 0 &lt; α2min &lt; α2max &lt; 1, the parameters εmin, εmax (control the ε-greedy policy), 0 &lt; εmin &lt; εmax &lt; 1, and the parameters γmin, γmax (control discounting), 0 &lt; γmin &lt; γmax &lt; 1, are set.</p>
        <p>1.2. Reward tables are initialized for the kth agent: Q_k = [Q_k(s, a)], Q_k(s, a) = 0, a ∈ A(s), s ∈ S, k = 1, ..., K.</p>
      </sec>
      <sec id="sec-4-11">
        <title>1.3. The reward table is initialized for a swarm of agents</title>
        <p>2. Iteration number n=1.</p>
      </sec>
      <sec id="sec-4-12">
        <title>3. The parameters are calculated (10)-(13)</title>
        <p>Q swarm = [Q swarm(s,a)]</p>
        <p>,</p>
        <p>Q swarm(s,a)= 0, a A (s), sS.
1(n)= max − (1max − min)</p>
        <p>1 1</p>
        <p>
          N − 1,
(n)=  min + ( max −  min) n − 1
n − 1
N − 1,
n − 1
6. For each kth agent, action akt is chosen using the ε-greedy policy π. If U (
          <xref ref-type="bibr" rid="ref1">0,1</xref>
          ) (n), then
choose action akt randomly from the set of allowed actions A (skt), otherwise choose action akt
in the form (
          <xref ref-type="bibr" rid="ref14">14</xref>
          )
i.e.
        </p>
        <p>
          , b A (skt)),
akt = (skt), k  1,K .
7. For each kth agent, a reward R(skt,akt), k 1,K is calculated.
8. For each kth agent a new state skt = akt, k 1,K is observed.
9. For each kth agent, the value of the combinations of the state-action cost functions of the swarm
and the kth agent is calculated, i.e.
in the form (
          <xref ref-type="bibr" rid="ref15">15</xref>
          )
~
Q k (skt,akt)= (1− 2(n))Q swarm(skt,akt)+ 2(n)Q k (skt,akt), k 1,K .
(
          <xref ref-type="bibr" rid="ref10">10</xref>
          )
(
          <xref ref-type="bibr" rid="ref11">11</xref>
          )
(
          <xref ref-type="bibr" rid="ref12">12</xref>
          )
(
          <xref ref-type="bibr" rid="ref13">13</xref>
          )
(
          <xref ref-type="bibr" rid="ref14">14</xref>
          )
10. For each kth agent, the value of the cost function of the state-action Q k (skt,akt) is calculated
as (
          <xref ref-type="bibr" rid="ref16">16</xref>
          )
        </p>
        <p>
          ~
(1− 1(n))Q k (skt,akt)+
Q k (skt,akt)= + 1(n) R(~skt,akt)+ (n)bmAa(sxkt)Q~k (skt,b), t T , k 1,K . (
          <xref ref-type="bibr" rid="ref16">16</xref>
          )
(1− 1(n))Q k (skt,akt)+ 1(n)R(skt,akt), t= T
11. Calculate the value of the cost function of the state-action of the swarm of agents
Q swarm(skt,akt) for each kth agent in the form (17)

        </p>
        <p>maxQ z(skt,akt), maxQ z(skt,akt)  minQ z(skt,akt)
Q swarm(skt,akt)=  z1,K z1,K z1,K
minQ z(skt,akt), maxQ z(skt,akt)  minQ z(skt,akt)
z1,K z1,K z1,K
, k 1,K .</p>
        <p>(17)
12. For each kth agent, the current state skt = skt, k 1,K is set.</p>
        <p>13. If the current time is not the last, i.e. t T , then increase the iteration number, i.e. t= t+ 1,
go to step 6.</p>
        <p>14. If the current iteration is not the last one, i.e. n  N , then increase the iteration number, i.e.
n = n + 1, go to step 3.</p>
        <p>Note. Upon completion of the method, plan k = (yk1,..y.kt,,..y.kT,) is formed for each kth agent
(18)
ykt = argamAa(sxkt)Q k (skt,a)</p>
        <p>, sS, k 1,K .</p>
      </sec>
      <sec id="sec-4-13">
        <title>The plan of the agent that satisfies the quality criterion better than others is selected.</title>
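        <p>As an illustration of steps 1-14, the following minimal Python sketch puts the update rules (10)-(18) together. It is not the authors' implementation: the environment interface (states, actions_of, transition, reward), all identifiers and the default parameter values are assumptions introduced for the example, and the swarm aggregation (17) is taken as the extreme value of larger magnitude.</p>
        <preformat>
import random

def td_swarm_learning(states, actions_of, transition, reward, N=200, K=20, T=10,
                      a1=(0.1, 0.9), a2=(0.1, 0.9), eps=(0.1, 0.9), gam=(0.1, 0.9)):
    """Multi-agent temporal-difference learning with dynamic parameters, steps 1-14."""
    # Step 1: per-agent tables Q_k and the shared swarm table Q_swarm, all initialized to zero.
    Q = [{(s, a): 0.0 for s in states for a in actions_of(s)} for _ in range(K)]
    Q_swarm = {(s, a): 0.0 for s in states for a in actions_of(s)}

    for n in range(1, N + 1):                            # steps 2 and 14: iteration loop
        frac = (n - 1) / (N - 1)
        alpha1 = a1[1] - (a1[1] - a1[0]) * frac          # (10) learning rate, decreasing
        alpha2 = a2[1] - (a2[1] - a2[0]) * frac          # (11) weight of the agent's own table, decreasing
        epsilon = eps[1] - (eps[1] - eps[0]) * frac      # (12) exploration, decreasing
        gamma = gam[0] + (gam[1] - gam[0]) * frac        # (13) discount, increasing

        state = [random.choice(states) for _ in range(K)]    # steps 4-5: initial states
        for t in range(1, T + 1):                        # steps 6-13: time loop
            for k in range(K):
                def q_mix(s, a, k=k):                    # (15) combination of swarm and agent tables
                    return (1 - alpha2) * Q_swarm[(s, a)] + alpha2 * Q[k][(s, a)]

                acts = actions_of(state[k])
                if epsilon > random.random():            # step 6: epsilon-greedy choice (14)
                    act = random.choice(acts)
                else:
                    act = max(acts, key=lambda b: q_mix(state[k], b))
                r = reward(state[k], act)                # step 7: reward
                nxt = transition(state[k], act)          # step 8: new state (in the paper it equals the action)
                if t != T:                               # step 10: temporal-difference update (16)
                    target = r + gamma * max(q_mix(nxt, b) for b in actions_of(nxt))
                else:
                    target = r
                Q[k][(state[k], act)] = (1 - alpha1) * q_mix(state[k], act) + alpha1 * target
                # Step 11: swarm aggregation (17) - keep the extreme value of larger magnitude.
                vals = [Q[z][(state[k], act)] for z in range(K)]
                hi, lo = max(vals), min(vals)
                Q_swarm[(state[k], act)] = hi if abs(hi) >= abs(lo) else lo
                state[k] = nxt                           # step 12: move to the new state
    return Q, Q_swarm
        </preformat>
        <p>For a routing task such as Berlin52, states would be the cities, actions_of(s) the admissible next cities, and the plan (18) of each agent would be read off after the last iteration as y_kt = argmax_{a ∈ A(s_kt)} Q_k(s_kt, a).</p>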
      </sec>
      <sec id="sec-4-13a">
        <title>7. Multi-agent Monte Carlo reinforcement learning with dynamic parameters</title>
        <p>This method is presented in the following form.</p>
        <p>1. Initialization.</p>
        <p>1.1. The maximum number of iterations N, the number of agents K, the discrete set of states S, the discrete set of actions A(s), s ∈ S, the reward R(s, a), a ∈ A(s), s ∈ S, the parameters α1min, α1max, α2min, α2max (control the learning rate), 0 &lt; α1min &lt; α1max &lt; 1, 0 &lt; α2min &lt; α2max &lt; 1, the parameters εmin, εmax (control the ε-greedy policy), 0 &lt; εmin &lt; εmax &lt; 1, and the parameters γmin, γmax (control discounting), 0 &lt; γmin &lt; γmax &lt; 1, are set.</p>
        <p>1.2. Reward tables are initialized for the kth agent: Q_k = [Q_k(s, a)], Q_k(s, a) = 0, a ∈ A(s), s ∈ S, k = 1, ..., K.</p>
      </sec>
      <sec id="sec-4-14">
        <title>1.3. The reward table is initialized for a swarm of agents</title>
        <p>Q^swarm = [Q^swarm(s, a)], Q^swarm(s, a) = 0, a ∈ A(s), s ∈ S.</p>
        <p>1.4. Tables of the number of transitions are initialized for the kth agent: D_k = [D_k(s, a)], D_k(s, a) = 0, a ∈ A(s), s ∈ S, k = 1, ..., K.</p>
        <p>2. Iteration number n = 1.</p>
        <p>3. The parameters are calculated (19)-(22):
α1(n) = α1max − (α1max − α1min)(n − 1)/(N − 1), (19)
α2(n) = α2max − (α2max − α2min)(n − 1)/(N − 1), (20)
ε(n) = εmax − (εmax − εmin)(n − 1)/(N − 1), (21)
γ(n) = γmin + (γmax − γmin)(n − 1)/(N − 1). (22)</p>
        <p>4. A trajectory τ_k = (s_k0, a_k0, r_k0, ..., s_kT, a_kT, r_kT) is generated for each kth agent, where a_kt = π(s_kt) and r_kt = R(s_kt, a_kt); as a result of action a_kt a new state s_k,t+1 and reward r_kt are observed; state s_k0 can change at each iteration; the policy π of choosing an action is ε-greedy; k = 1, ..., K.</p>
        <p>5. Number of the moment in time t = T.</p>
        <p>6. For each kth agent, the profit is calculated in the form of a discounted sum of rewards from time t to time T (23):
R_kt(τ_k) = Σ_{t′=t}^{T} (γ(n))^{t′−t} r_{kt′}, k = 1, ..., K. (23)</p>
        <p>7. For each kth agent, the combination of the state-action cost functions of the swarm and the kth agent is calculated in the form (24):
Q̃_k(s_kt, a_kt) = (1 − α2(n)) Q^swarm(s_kt, a_kt) + α2(n) Q_k(s_kt, a_kt), k = 1, ..., K. (24)</p>
        <p>8. For each kth agent, the transition counter D_k(s_kt, a_kt) is increased, i.e. D_k(s_kt, a_kt) = D_k(s_kt, a_kt) + 1, k = 1, ..., K.</p>
        <p>9. For each kth agent, the value of the state-action cost function Q_k(s_kt, a_kt) is calculated as (25):
Q_k(s_kt, a_kt) = (1 − α1(n)/D_k(s_kt, a_kt)) Q̃_k(s_kt, a_kt) + (α1(n)/D_k(s_kt, a_kt)) R_kt(τ_k), k = 1, ..., K. (25)</p>
        <p>10. The value of the state-action cost function of the swarm of agents Q^swarm(s_kt, a_kt) is calculated for each kth agent in the form (26):
Q^swarm(s_kt, a_kt) = max_{z=1,...,K} Q_z(s_kt, a_kt), if |max_{z=1,...,K} Q_z(s_kt, a_kt)| ≥ |min_{z=1,...,K} Q_z(s_kt, a_kt)|;
Q^swarm(s_kt, a_kt) = min_{z=1,...,K} Q_z(s_kt, a_kt), otherwise; k = 1, ..., K. (26)</p>
        <p>11. If t &gt; 0, then t = t − 1, go to step 6.</p>
        <p>12. If the current iteration is not the last one, i.e. n &lt; N, then increase the iteration number, i.e. n = n + 1, and go to step 3; otherwise stop.</p>
        <p>Note. Upon completion of the method, plan π_k = (y_k1, ..., y_kt, ..., y_kT) is formed for each kth agent in the form (27):
y_kt = argmax_{a ∈ A(s_kt)} Q_k(s_kt, a), s ∈ S, k = 1, ..., K. (27)</p>
      </sec>
      <sec id="sec-4-15">
        <title>The plan of the agent that satisfies the quality criterion better than others is selected.</title>
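        <p>A minimal Python sketch of the Monte Carlo update for a single visited state-action pair is given below; it is an illustration, not the authors' implementation, and the dictionary-based tables, the helper names and the count-normalized step in (25) follow the reconstruction above.</p>
        <preformat>
def mc_profit(rewards, gamma, t):
    """Discounted profit (23): sum of gamma**(i - t) * r_i for i from t to the end of the trajectory."""
    return sum(gamma ** (i - t) * rewards[i] for i in range(t, len(rewards)))

def mc_update(Q_k, Q_swarm, D_k, s, a, profit, alpha1, alpha2):
    """Every-visit Monte Carlo update with a transition counter, following (24)-(25)."""
    D_k[(s, a)] = D_k.get((s, a), 0) + 1                             # step 8: transition counter
    q_mix = (1 - alpha2) * Q_swarm[(s, a)] + alpha2 * Q_k[(s, a)]    # (24) swarm/agent combination
    step = alpha1 / D_k[(s, a)]                                      # dynamic, count-normalized learning rate
    Q_k[(s, a)] = (1 - step) * q_mix + step * profit                 # (25)
        </preformat>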
      </sec>
      <sec id="sec-4-15a">
        <title>8. Multi-agent reinforcement learning method based on adaptive dynamic programming and dynamic parameters</title>
        <p>The method consists of the following steps.</p>
        <p>1. Initialization.</p>
        <p>1.1. The maximum number of iterations N, the number of agents K, the maximum length of the states' sequence T, the discrete set of states S, the discrete set of actions A(s), s ∈ S, the parameters α1min, α1max, α2min, α2max (control the learning rate), 0 &lt; α1min &lt; α1max &lt; 1, 0 &lt; α2min &lt; α2max &lt; 1, the parameters εmin, εmax (control the ε-greedy policy), 0 &lt; εmin &lt; εmax &lt; 1, and the parameters γmin, γmax (control discounting), 0 &lt; γmin &lt; γmax &lt; 1, are set.</p>
        <p>1.2. Reward tables are initialized for the kth agent: Q_k(s, a), Q_k(s, a) = 0, a ∈ A(s), s ∈ S, k = 1, ..., K.</p>
      </sec>
      <sec id="sec-4-16">
        <title>1.3. The reward table is initialized for a swarm of agents</title>
        <p>Q^swarm(s, a), Q^swarm(s, a) = 0, a ∈ A(s), s ∈ S.</p>
        <p>1.4. The tables of the number of transitions are initialized for the kth agent: D_k(s, a), D_k(s, a) = 0, a ∈ A(s), s ∈ S, k = 1, ..., K.</p>
        <p>1.5. The state observation quantity tables are initialized for the kth agent: D_k(s), D_k(s) = 0, s ∈ S, k = 1, ..., K.</p>
        <p>2. Iteration number n = 1.</p>
      </sec>
      <sec id="sec-4-17">
        <title>3. The parameters are calculated (28)-(31)</title>
        <p>α1(n) = α1max − (α1max − α1min)(n − 1)/(N − 1), (28)
α2(n) = α2max − (α2max − α2min)(n − 1)/(N − 1), (29)
ε(n) = εmax − (εmax − εmin)(n − 1)/(N − 1), (30)
γ(n) = γmin + (γmax − γmin)(n − 1)/(N − 1). (31)</p>
        <p>4. The number of the moment in time is set t = 1.</p>
        <p>5. The initial state s_kt, k = 1, ..., K is observed for each kth agent.</p>
        <p>6. For each kth agent, action a_kt is chosen using the ε-greedy policy π. If U(0,1) &lt; ε(n), then action a_kt is chosen randomly from the set of allowed actions A(s_kt); otherwise action a_kt is chosen in the form (32), i.e.
a_kt = argmax_{b ∈ A(s_kt)} Q̃_k(s_kt, b), a_kt = π(s_kt), k = 1, ..., K. (32)</p>
        <p>7. For each kth agent, a reward R(s_kt, a_kt), k = 1, ..., K is calculated.</p>
        <p>8. For each kth agent a new state s′_kt = a_kt, k = 1, ..., K is observed.</p>
      </sec>
      <sec id="sec-4-18">
        <title>9. The transition function is calculated (33)</title>
        <p>Pk (skt |skt,akt)=
in the form (34)
~
Qk (skt , akt ) = (1 − 2 (n))Qswarm (skt , akt ) + 2 (n)Qk (skt , akt ) , k 1, K .
(28)
(29)
(30)
(31)
(32)
(33)
11. For each kth agent, the value of the cost function of the state-action Q k (s,a) is calculated as
(35)</p>
        <p>~
(1− 1(n))Q k (skt,akt)+

  ~
 ~ bA (skt) k (skt,b),
Q k (skt,akt)= + 1(n)Pk (skt |skt,akt) R(skt,akt)+ (n) max Q
(1− 1(n))Q k (skt,akt)+ 1(n)Pk (skt |skt,akt)R(skt,akt),
t T
t= T</p>
        <p>,
k 1,K
12. Calculate the value of the cost function of the state-action of the swarm of agents
Q swarm(skt,akt) for each kth agent in the form (36)
(35)
(36)
</p>
        <p>maxQ z(skt,akt), maxQ z(skt,akt)  minQ z(skt,akt)
Q swarm(skt,akt)=  z1,K z1,K z1,K
minQ z(skt,akt), maxQ z(skt,akt)  minQ z(skt,akt)
 z1,K z1,K z1,K
13. For each kth agent, the current state skt = skt, k 1,K is set.
14. If the current time is not the last, i.e. t T , then increase the iteration number, i.e. t= t+ 1,
go to step 6.
15. If the current iteration is not the last one, i.e. n  N , then increase the iteration number, i.e.
n = n + 1, go to step 3.</p>
        <p>Note. Upon completion of the method, plan k = (yk1,..y.kt,,..y.kT,) is formed for each kth agent
(37)
ykt = argamAa(sxkt)Q k (skt,a), sS, k 1,K .
(37)</p>
      </sec>
      <sec id="sec-4-19">
        <title>The plan of the agent that satisfies the quality criterion better than others is selected.</title>
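        <p>The sketch below illustrates one ADP-style backup for a single agent; it is an assumption-laden reconstruction rather than the authors' implementation. In particular, estimating the transition function (33) as the ratio of the transition counter to the state observation counter is one plausible reading of steps 1.4, 1.5 and 9, and all identifiers are introduced for the example.</p>
        <preformat>
def adp_update(Q_k, Q_swarm, D_sa, D_s, s, a, s_next, r,
               alpha1, alpha2, gamma, actions_of, last_step):
    """One backup for agent k: empirical transition model (33), mix (34) and expected update (35)."""
    D_sa[(s, a)] = D_sa.get((s, a), 0) + 1       # number of times action a was chosen in state s
    D_s[s] = D_s.get(s, 0) + 1                   # number of times state s was observed
    p_hat = D_sa[(s, a)] / D_s[s]                # (33) transition frequency estimated from the counters

    def q_mix(st, act):                          # (34) combination of swarm and agent tables
        return (1 - alpha2) * Q_swarm[(st, act)] + alpha2 * Q_k[(st, act)]

    if last_step:                                # (35), case t = T
        target = p_hat * r
    else:                                        # (35), earlier time steps
        target = p_hat * (r + gamma * max(q_mix(s_next, b) for b in actions_of(s_next)))
    Q_k[(s, a)] = (1 - alpha1) * q_mix(s, a) + alpha1 * target
        </preformat>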
      </sec>
    </sec>
    <sec id="sec-5">
      <title>9. Experiments and results</title>
      <sec id="sec-5-1">
        <title>The numerical study of the proposed methods was carried out using Python.</title>
        <p>For the multi-agent reinforcement learning methods, the parameter values α1min = 0.1, α1max = 0.9, α2min = 0.1, α2max = 0.9 (control the learning rate), εmin = 0.1, εmax = 0.9 (control the ε-greedy policy), and γmin = 0.1, γmax = 0.9 (control discounting) were used; the number of agents is K = 20.</p>
        <sec id="sec-5-1-1">
          <title>The dependence of parameter γ(n) is defined as</title>
          <p>γ(n) = γmin + (γmax − γmin)(n − 1)/(N − 1).</p>
          <p>The dependence of parameter γ(n) on the iteration number n is linear and shows that its share increases with the iteration number.</p>
          <p>The dependences of parameters α1(n), α2(n) and ε(n) are defined as</p>
          <p>α1(n) = α1max − (α1max − α1min)(n − 1)/(N − 1),</p>
          <p>α2(n) = α2max − (α2max − α2min)(n − 1)/(N − 1),</p>
          <p>ε(n) = εmax − (εmax − εmin)(n − 1)/(N − 1).</p>
          <p>The dependence of parameters α1(n), α2(n) and ε(n) on the iteration number n is linear; it shows that their share decreases with increasing iteration number.</p>
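          <p>A minimal Python sketch of these linear schedules, with the parameter bounds used in the experiments as defaults, is shown below; the function name is an assumption introduced for the example.</p>
          <preformat>
def dynamic_parameters(n, N, a1=(0.1, 0.9), a2=(0.1, 0.9), eps=(0.1, 0.9), gam=(0.1, 0.9)):
    """Linear schedules: alpha1, alpha2 and epsilon decrease, gamma increases with iteration n = 1..N."""
    frac = (n - 1) / (N - 1)
    alpha1 = a1[1] - (a1[1] - a1[0]) * frac
    alpha2 = a2[1] - (a2[1] - a2[0]) * frac
    epsilon = eps[1] - (eps[1] - eps[0]) * frac
    gamma = gam[0] + (gam[1] - gam[0]) * frac
    return alpha1, alpha2, epsilon, gamma
          </preformat>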
          <p>The results of comparing the proposed temporal-difference reinforcement learning method with dynamic parameters and the traditional Q-learning method, based on the mean squared error criterion and the number of iterations, for solving the travelling salesman problem (Berlin52 standard dataset), which is used for planning cargo transportation, are presented in Table 1.</p>
          <p>The results of comparing the proposed Monte Carlo based reinforcement learning method with dynamic parameters and the traditional every-visit method, based on the mean squared error criterion and the number of iterations, for solving the travelling salesman problem (Berlin52 standard dataset), which is used for planning cargo transportation, are presented in Table 2.</p>
          <p>The results of comparing the proposed reinforcement learning method based on adaptive dynamic programming with dynamic parameters and the traditional passive adaptive dynamic programming method, based on the mean squared error criterion and the number of iterations, for solving the travelling salesman problem (Berlin52 standard dataset), which is used for planning cargo transportation, are presented in Table 3.
Advantages of the proposed methods:
1. The modification of the reinforcement learning methods due to dynamic parameters allows for an increase in the learning rate while maintaining the mean squared error of the method (Tables 1-3).
2. The use of the multi-agent approach makes distributed computing possible and increases the learning rate while maintaining the mean squared error of the method (Tables 1-3).
3. The reinforcement learning methods with dynamic parameters use the ε-greedy approach, which is close to random search at the initial iterations and close to directed search at the final iterations. This is ensured by the use of dynamic parameters and allows for an increase in the learning rate while maintaining the mean squared error of the method (Tables 1-3).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>To solve the problem of insufficient efficiency of computer agents, the existing methods of
statistical and machine learning were investigated. These studies have shown that, to date, the
most effective approaches to creating proactive agents are reinforcement learning approaches.</p>
      <sec id="sec-6-1">
        <title>The formalization of the functioning of proactive agents has been conducted.</title>
        <p>As part of creating a model for the functioning of proactive agents based on reinforcement
learning, a procedure for generating a quasi-optimal action plan is proposed that models the
planning function of a proactive agent, which speeds up the decision-making process.
Reinforcement learning methods are proposed which are close to random search at the initial
iterations and close to directed search at the final iterations. This is ensured by
the use of dynamic parameters and the multi-agent approach and allows for an increase in the
learning rate while maintaining the mean squared error of the method.</p>
      </sec>
      <sec id="sec-6-2">
        <title>The proposed multi-agent methods will be used for freight planning in supply chain management and auditing, and were investigated on a standard data set.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Shvachych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. V.</given-names>
            <surname>Ivaschenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Busygin</surname>
          </string-name>
          , Ye. Ye.
          <article-title>Fedorov, Parallel computational algorithms in thermal processes in metallurgy and mining</article-title>
          ,
          <source>Naukovyi Visnyk Natsionalnoho Hirnychoho Universytetu</source>
          ,
          <volume>4</volume>
          (
          <year>2018</year>
          )
          <fpage>129</fpage>
          -
          <lpage>137</lpage>
          . doi:
          <volume>10</volume>
          .29202/nvngu/2018-4/19.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Grygor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fedorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nechyporenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grygorian</surname>
          </string-name>
          ,
          <article-title>Neural network forecasting method for inventory management in the supply chain</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          ,
          <year>2022</year>
          , volume
          <volume>3137</volume>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Neskorodieva</surname>
          </string-name>
          , E. Fedorov,
          <article-title>Method for automatic analysis of compliance of expenses data and the enterprise income by neural network model of forecast</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          , volume
          <volume>2631</volume>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>158</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Neskorodieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fedorov</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Izonin</surname>
          </string-name>
          ,
          <article-title>Forecast method for audit data analysis by modified liquid state machine</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          , volume
          <volume>2623</volume>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jezic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen-Burger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kusek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sperka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Howlett</surname>
          </string-name>
          , L. C. Jain (Eds.),
          <article-title>Agents and multi-agent systems: technologies and applications</article-title>
          , volume
          <volume>186</volume>
          of Smart innovation,
          <source>systems and technologies</source>
          ,
          <year>2020</year>
          . doi:
          <volume>10</volume>
          .1007/
          <fpage>978</fpage>
          -981-15-5764-4.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Norvig</surname>
          </string-name>
          ,
          <source>Artificial Intelligence: a Modem Approach</source>
          , Englewood Cliffs, NJ: Prentice
          <string-name>
            <surname>Hall</surname>
            <given-names>PTR</given-names>
          </string-name>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Ottoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Nepomuceno</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. S. de Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. C. R. de Oliveira</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for the traveling salesman problem with refueling</article-title>
          ,
          <source>Complex &amp; Intelligent Systems</source>
          ,
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>2001</fpage>
          -
          <lpage>2015</lpage>
          . doi:
          <volume>10</volume>
          .1007/s40747-021-00444-4.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Oroojlooy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hajinezhad</surname>
          </string-name>
          ,
          <article-title>A review of cooperative multi-agent deep reinforcement learning</article-title>
          ,
          <source>Applied Intelligence</source>
          ,
          <volume>53</volume>
          (
          <year>2023</year>
          )
          <fpage>13677</fpage>
          -
          <lpage>13722</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10489-022-04105-y.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <article-title>Supplementary heuristic dynamic programming for wastewater treatment process control</article-title>
          ,
          <source>Expert Systems with Applications</source>
          ,
          <volume>247</volume>
          (
          <year>2024</year>
          )
          <article-title>123280</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.eswa.
          <year>2024</year>
          .
          <volume>123280</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>U.</given-names>
            <surname>Satic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jacko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kirkbride</surname>
          </string-name>
          ,
          <article-title>A simulation-based approximate dynamic programming approach to dynamic and stochastic resource-constrained multi-project scheduling problem</article-title>
          ,
          <source>European Journal of Operational Research</source>
          ,
          <volume>315</volume>
          (
          <year>2024</year>
          )
          <fpage>454</fpage>
          -
          <lpage>469</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.ejor.
          <year>2023</year>
          .
          <volume>10</volume>
          .046.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>Nonzero-sum games using actor-critic neural networks: A dynamic event-triggered adaptive dynamic programming</article-title>
          ,
          <source>Information Sciences</source>
          ,
          <volume>662</volume>
          (
          <year>2024</year>
          )
          <article-title>120236</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.ins.
          <year>2024</year>
          .
          <volume>120236</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Optimal dynamic output feedback control of unknown linear continuous-time systems by adaptive dynamic programming</article-title>
          ,
          <source>Automatica</source>
          ,
          <volume>163</volume>
          (
          <year>2024</year>
          )
          <article-title>111601</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.automatica.
          <year>2024</year>
          .
          <volume>111601</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pascal</surname>
          </string-name>
          ,
          <article-title>Artificial neural networks to solve dynamic programming problems: A bias-corrected Monte Carlo operator</article-title>
          ,
          <source>Journal of Economic Dynamics and Control</source>
          ,
          <volume>162</volume>
          (
          <year>2024</year>
          )
          <article-title>104853</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.jedc.
          <year>2024</year>
          .
          <volume>104853</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Albrecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Christianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          , Multi-Agent
          <source>Reinforcement Learning: Foundations and Modern Approaches</source>
          , MIT Press, Cambridge, MA, USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Sh.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Sh. Dong</surname>
          </string-name>
          , Ya. Gao,
          <article-title>Online attentive kernel-based temporal difference learning</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          ,
          <volume>278</volume>
          (
          <year>2023</year>
          )
          <article-title>110902</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.knosys.
          <year>2023</year>
          .
          <volume>110902</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Stanković</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Stanković</surname>
          </string-name>
          ,
          <article-title>Distributed consensus-based multi-agent temporaldifference learning</article-title>
          ,
          <source>Automatica</source>
          ,
          <volume>151</volume>
          (
          <year>2023</year>
          )
          <article-title>110922</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.automatica.
          <year>2023</year>
          .
          <volume>110922</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. S. Stebenkov, N. O. Nikitin, Automated generation of ensemble pipelines using policy-based reinforcement learning method, Procedia Computer Science, 229 (2023) 70-79. doi:10.1016/j.procs.2023.12.009.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] F. Huang, X. Deng, Y. He, W. Jiang, A novel policy based on action confidence limit to improve exploration efficiency in reinforcement learning, Information Sciences, 640 (2023) 119011. doi:10.1016/j.ins.2023.119011.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] J. Zhang, Sh. Han, X. Xiong, Sh. Zhu, Sh. Lu, Explorer-Actor-Critic: Better actors for deep reinforcement learning, Information Sciences, 662 (2024) 120255. doi:10.1016/j.ins.2024.120255.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Zh. Zhang, X. Liang, C. Chen, D. Liu, Ch. Yu, W. Li, Defense penetration strategy for unmanned surface vehicle based on modified soft actor-critic, Ocean Engineering, 304 (2024) 117840. doi:10.1016/j.oceaneng.2024.117840.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] T. Li, K. Zhu, N. C. Luong, D. Niyato, Q. Wu, Y. Zhang, B. Chen, Applications of multi-agent reinforcement learning in future Internet: A comprehensive survey, IEEE Communications Surveys &amp; Tutorials, 24 (2) (2022) 1240-1279. doi:10.1109/COMST.2022.3160697.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] L. M. Schmidt, J. Brosig, A. Plinge, B. M. Eskofier, C. Mutschler, An introduction to multi-agent reinforcement learning and review of its application to autonomous mobility, in: IEEE 25th International Conference on Intelligent Transportation Systems, 2022, pp. 1342-1349.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] P. Yadav, A. Mishra, S. Kim, A comprehensive survey on multi-agent reinforcement learning for connected and automated vehicles, Sensors, 23 (10) (2023) 4710. doi:10.3390/s23104710.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] J. Orr, A. Dutta, Multi-agent deep reinforcement learning for multi-robot applications: A survey, Sensors, 23 (7) (2023) 3625. doi:10.3390/s23073625.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] L. Canese, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Re, S. Spanò, Multi-agent reinforcement learning: A review of challenges and applications, Applied Sciences, 11 (11) (2021) 4948. doi:10.3390/app11114948.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] Z. Xu, H. van Hasselt, M. Hessel, J. Oh, S. Singh, D. Silver, Meta-gradient reinforcement learning with an objective discovered online, arXiv:2007.08433, 2020. doi:10.48550/arXiv.2007.08433.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] H. Wang, E. Miahi, M. White, M. C. Machado, Z. Abbas, R. Kumaraswamy, V. Liu, A. White, Investigating the properties of neural network representations in reinforcement learning, Artificial Intelligence, 330 (2024) 1-24. doi:10.1016/j.artint.2024.104100.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] F. Robertazzi, M. Vissani, G. Schillaci, E. Falotico, Brain-inspired meta-reinforcement learning cognitive control in conflictual inhibition decision-making task for artificial agents, Neural Networks, 154 (2022) 283-302. doi:10.1016/j.neunet.2022.06.020.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>