Promoting Training of Multi-Agent Systems

Petro Kravets [0000-0001-8569-423X], Vasyl Lytvyn [0000-0002-9676-0180], Victoria Vysotska [0000-0001-6417-3689], Yevhen Burov [0000-0001-8653-1520]

Lviv Polytechnic National University, Lviv, Ukraine
Petro.O.Kravets@lpnu.ua, Vasyl.V.Lytvyn@lpnu.ua, Victoria.A.Vysotska@lpnu.ua, Yevhen.V.Burov@lpnu.ua

Abstract. The problem of incentive training of multi-agent systems in a game formulation for collective decision making under uncertainty is considered. Incentive training methods do not require a mathematical model of the environment and enable decision making directly in the training process. A Markov model of the stochastic game is constructed and the criteria for its solution are formulated. An iterative Q-method for solving the stochastic game, based on numerical identification of a characteristic function of a dynamic system in the state-action space, is described. Players' current gains are determined by randomization of the elements of the payoff Q-matrices. Mixed player strategies are calculated using the Boltzmann method. Pure strategies are determined on the basis of discrete random distributions given by the mixed player strategies. An algorithm for stochastic game solving is developed and the results of a computer implementation of the game Q-method are analyzed.

Keywords: Multi-Agent System, Stochastic Game, Promotional Training, Q-method

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The functioning of most modern information systems (IS) is based on rigidly programmed algorithms. Unforeseen environmental influences in such systems may impair the stability of operating modes, which can lead to various types of emergency situations. To prevent critical states, distributed IS software must consist of interoperable standalone modules, be intelligent and flexible, and be capable of independently monitoring environmental changes and making timely and appropriate decisions. In other words, such systems should be built on the principles of the agent-oriented methodology [1 – 10]. An IS agent is a standalone software module with elements of artificial intelligence, capable of making decisions on its own and of interacting with the environment, other agents, and people while accomplishing its task. IS agents interact within a computer network. A population of computer network agents that solve a common problem is called a multi-agent system (MAS).

The operation of a MAS is usually carried out under a priori uncertainty about the state of the decision-making environment and the actions of other agents. Therefore, the agents' behavior strategies must be adaptive, which is achieved through the agents' ability to learn [11]. Among the methods of learning under uncertainty, incentive-based (reinforcement) methods have gained practical appeal [12, 13] because they do not require a mathematical model of the environment and enable decision making directly in the learning process. Incentive training is modelled on the mechanisms of reflexive behavior of living organisms with a developed nervous system. An effective method of incentive learning is Markov Q-learning [14], which performs numerical identification of the characteristic function of a dynamic system in the state-action space. The characteristic function is usually the total expected reward of the agent.
Compared to single-agent systems, the structure, operation, and study of multi-agent Q-learning methods are considerably more complicated [15]. Due to the collective interaction of agents, a stationary environment is transformed into a non-stationary one. The change of the state of the environment and the value of each agent's payoff depend on the actions of the other agents. In general, an agent in a MAS cannot achieve the maximum gain attainable in a single-agent system. The optimal payoffs of the agents must be balanced and meet criteria of benefit, fairness, and balance. Thus, instead of the criterion of scalar maximization of the payoff of a single-agent system, criteria of vector maximization of the MAS payoffs are introduced, for example Nash equilibrium, Pareto optimality, or others [16].

When Q-learning is applied to a MAS, a system of characteristic Q-functions is built iteratively in the state-action space, and the elements of these functions are incremented in the direction of achieving collective equilibrium. To build a MAS, it is necessary to carry out preliminary studies on the basis of adequate mathematical models that allow studying the dynamics of the system under uncertainty and constructing agent behavior strategies that provide optimal technical and economic parameters of the system functioning. Given the peculiarities of the subject area, namely multi-agency, uncertainty of the decision-making environment, antagonism or competitiveness of goals, communication, coordination of actions, and adaptability of agent behavior strategies, we use the mathematical apparatus of stochastic game theory [17 – 29]. Solving a stochastic game means finding agent strategies that maximize their payoffs while ensuring a certain collective balance of interests of all players. The search for optimal player strategies under uncertainty will be performed on the basis of the promotional training method.

The purpose of this work is to construct an iterative method of incentive learning for solving the stochastic game of a MAS under uncertainty. To achieve this goal, it is necessary to develop a model of the multi-agent stochastic game, to determine the criteria of collective equilibrium, and to devise a method and an algorithm for solving the game problem.

2 The Mathematical Model of Stochastic Game

The stochastic game is determined by the tuple $(S, p, A^i, r^i \mid i \in I)$, $I = \{1, 2, ..., L\}$, where $S = \{s_1, ..., s_M\}$ is the set of all states of the environment; $p: S \times A \to \Delta(S)$ is the system state-change function defined in the space of probability distributions $\Delta(S)$ on the set $S$; $A^i = \{a^i(1), ..., a^i(N_i)\}$ is the set of actions (pure strategies) of agent $i$; $A = \prod_{i \in I} A^i$ is the set of combined agent actions; $r^i: S \times A \to R$ is the reward function of agent $i$; $I$ is the set of agents; $L$ is the number of agents; $M$ is the number of states; $N_i$ is the number of strategies of agent $i$. In the general case the action sets $A^i = A^i(s)$ $\forall i \in I$ and the combined actions $A = A(s)$ may depend on the state of the environment $s \in S$.

We adopt a Markov model [30, 31] of the dynamics of the system states, in which the probability of a state change $p$ depends only on the current state of the environment and the current actions of the agents:

$p\{s_{t+1} = s' \mid (s_\tau, b_\tau),\ \tau = 0, 1, 2, ..., t\} = p\{s_{t+1} = s' \mid s_t, b_t\}$,

where $b_t \in A$ is the combined action at time $t$.
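For illustration only, the game tuple defined above could be collected into a small container; the field names and the array layout in the following sketch are assumptions of this example rather than notation of the model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
import numpy as np

@dataclass
class StochasticGame:
    """Illustrative container for the tuple (S, p, {A^i}, {r^i} | i in I)."""
    n_states: int                 # M = |S|, the number of environment states
    n_actions: List[int]          # N_i, the number of pure strategies of each of the L agents
    transition: np.ndarray        # Markov state-change probabilities p(s' | s, a), indexed by (s, combined action, s')
    reward: Callable[[int, Tuple[int, ...]], np.ndarray]  # r(s, a) -> vector of L current payoffs
```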
At each point in time the environment is in one of the states $s \in S$ and the agents independently choose actions $a^i \in A^i$. After the combined variant $a = (a^1, ..., a^L) \in A$ is implemented, the agents receive random payoffs $r^i$ (also called incentives or reinforcements), and the environment changes its state according to the probability distribution $p(s, a)$ with values in the segment $[0, 1]$:

$\sum_{s' \in S} p(s' \mid s, a) = 1$.

Each agent implements its actions on the basis of a mixed strategy $\pi^i : S \to A^i$, which determines the probability of choosing the action $a^i \in A^i$ in every state of the environment $s \in S$. The distribution $\pi^i \in \Pi^i$ takes values on the unit simplex [11]:

$\Pi^i = \Big\{ \pi^i \ \Big|\ \sum_{a^i \in A^i} \pi^i(s, a^i) = 1,\ \pi^i(s, a^i) \ge 0 \Big\}$.

If $\pi^i(s, a^i) \in \{0, 1\}$, then the agent makes deterministic (pure) choices of its actions.

Let the total payoff of each agent be determined by the discounted sum of payoffs:

$R^i = \sum_{t=0}^{\infty} \gamma^t r^i_t$,   (1)

where $\gamma \in (0, 1]$ is the discount parameter. The goal of agent $i$ is to maximize the functional (1) by constructing an effective strategy $\pi^i$:

$V^i_\pi(s) = E\{R^i \mid s_0 = s\} \to \max_{\pi^i}, \quad i \in I$,   (2)

where $\pi = (\pi^1, ..., \pi^L)$ and $E$ is the symbol of mathematical expectation.

Solving the stochastic game consists in determining agent behavior strategies $\pi^i_*$ ($i \in I$) that ensure one of the conditions of collective optimality, for example:

1) Nash equilibrium: $V^i(s, \pi^1_*, \pi^2_*, ..., \pi^L_*) \ge V^i(s, \pi^1_*, ..., \pi^{i-1}_*, \pi^i, \pi^{i+1}_*, ..., \pi^L_*)$;

2) Pareto optimality: $V^i(s, \pi^*) \ge V^i(s, \pi)$.

3 Learning of Stochastic Game

The computation of $V^i_\pi(s)$ can be performed in the recursive form known in the literature as the Bellman equation [12 – 14]. Taking (1) into account, after simple transformations we obtain:

$V_\pi(s \mid s_t = s) = E(r_t) + \gamma E\Big(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\Big) = E(r_t) + \gamma V_\pi(s_{t+1}) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s))\, V_\pi(s')$,   (3)

where $s'$ denotes the possible future states of the system.

The agent's goal is to find a strategy $\pi^*$ that maximizes the functional (2) in all states of the environment: $\forall s \in S:\ V_{\pi^*}(s) \ge V_\pi(s)$.

Since the choice of actions is made at random, in order to compare the effectiveness of actions when the system is in a state $s \in S$ it is useful to derive the current payoffs from (2). For this purpose, the average payoff function $Q$ is constructed; it determines the value of an action, that is, the total payoff of the agent if in the state $s$ it chooses the action $a$:

$Q_\pi(s, a) = E\{R \mid s_0 = s,\ a_0 = a\}$.   (4)

Expression (4) defines a tabular function of the values of the action variants $a$ in the states $s$. Similarly to (3) we obtain:

$Q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V_\pi(s')$.   (5)

Adherence to the Bellman principle of optimality in (5) ensures the optimal gain of the agent, starting from the current state $s \in S$, at all future times. Applying this principle to all states yields a globally optimal solution. For the optimal strategy selection function $\pi^*$ we obtain for each state $s \in S$:

$V^*(s) = \max_{a \in A} \Big[ r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^*(s') \Big]$.   (6)

From (6) one can obtain the optimal strategy selection function:

$\pi^*(s) = \arg\max_{a \in A} Q^*(s, a)$.   (7)

Optimization (7) can be performed by dynamic programming methods [30].
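As a minimal sketch of such a dynamic-programming computation, assuming the transition probabilities and expected rewards are known and stored as arrays (the array shapes and the function name below are assumptions of this example), Q-value iteration over (5) – (7) for a single agent can be written as follows.

```python
import numpy as np

def q_value_iteration(p, r, gamma=0.9, tol=1e-8):
    """Single-agent Q-value iteration following (5)-(7).

    p : array (M, N, M), p[s, a, s_next] -- known transition probabilities
    r : array (M, N),    r[s, a]         -- known expected rewards
    Returns the optimal Q-table and the greedy pure strategy argmax_a Q*(s, a), cf. (7).
    """
    M, N = r.shape
    Q = np.zeros((M, N))
    while True:
        V = Q.max(axis=1)                        # V(s) = max_a Q(s, a), cf. (6)
        Q_new = r + gamma * (p @ V)              # Q(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) V(s'), cf. (5)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new, Q_new.argmax(axis=1)
        Q = Q_new
```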
By analogy with single-agent training, we define the payoff matrix of player $i$ with current and future payoffs in the direction of movement towards the optimal collective state in the space $S \times A$:

$Q^i_*(s, a^1, ..., a^L) = r^i(s, a^1, ..., a^L) + \gamma \sum_{s' \in S} p(s' \mid s, a^1, ..., a^L)\, V^i_*(s', \pi^1_*, ..., \pi^L_*)$,

where $Q^i_*(s, a^1, ..., a^L)$ is the total discounted payoff of player $i$ provided that the players select the action $(a^1, ..., a^L)$ in the state $s$ and then follow the optimal strategy of the game $\pi^* = (\pi^1_*, ..., \pi^L_*)$.

Under a priori uncertainty about the transition probabilities between the states of the system $p(s, a^1, ..., a^L)$ and about the payoff functions $r^i(s, a^1, ..., a^L)$, an iterative Q-learning method is used to compute the elements of the payoff matrices [31]:

$Q^i_{t+1}(s, a^1, ..., a^L) = (1 - \alpha_t)\, Q^i_t(s, a^1, ..., a^L) + \alpha_t [r^i_t + \gamma V^i_t(s_{t+1})]$,   (8)

where $\alpha_t \in (0, 1)$ is the learning parameter and $V^i_t(s_{t+1})$ is the operator of the value of the system state in the direction of the optimal collective solution.

The type of the operator $V^i_t(s_{t+1})$ is determined by the chosen condition of collective equilibrium, for example:

- $V^i_t(s_{t+1}) = MM(Q^i_t(s_{t+1}))$ is the maximin equilibrium;
- $V^i_t(s_{t+1}) = NE(Q^i_t(s_{t+1}))$ is the Nash equilibrium;
- $V^i_t(s_{t+1}) = BR(Q^i_t(s_{t+1}))$ is the best response of the agent;
- $V^i_t(s_{t+1}) = CE(Q^i_t(s_{t+1}))$ is the correlated equilibrium;
- $V^i_t(s_{t+1}) = PE(Q^i_t(s_{t+1}))$ is Pareto optimality.

This list may be supplemented by other already known or new equilibrium conditions that determine the target aspect of the functioning of a distributed dynamic system.

Method (8) can be applied to solve a single-agent game with nature as a special case of the $L$-agent stochastic game when $I = \{i\}$, $|I| = 1$, $A = A^i$ and $V^i_t(s_{t+1}) = \max_{b \in A^i} Q^i_t(s', b)$, where $s' = s_{t+1}$.

The maximin equilibrium (MM) arises in a game of two agents with zero sum of their payoff functions:

$V^1_t(s_{t+1}) = \max_{\pi^1 \in \Pi^1} \min_{a^2 \in A^2} \sum_{a^1 \in A^1} \pi^1(s', a^1)\, Q^1_t(s', a^1, a^2) = -V^2_t(s_{t+1})$.

The Nash equilibrium (NE) is determined by the independent distributions of the strategies of players who choose their own strategies independently of the choices of the other agents. In a Nash equilibrium situation in mixed strategies $\pi^{NE}(s) = (\pi^1_{NE}(s), ..., \pi^L_{NE}(s))$ it is not profitable for any agent to deviate from its own optimal strategy $\pi^i_{NE}(s)$ if the other agents stick to the equilibrium point [31]:

$\sum_{a \in A} Q^i_t(s, a)\, \pi^i_{NE}(s, a^i) \prod_{j \ne i} \pi^j_{NE}(s, a^j) \ge \sum_{a \in A} Q^i_t(s, a)\, \tilde\pi^i(s, a^i) \prod_{j \ne i} \pi^j_{NE}(s, a^j)$,   (9)

where $a = (a^1, ..., a^L)$; $\pi^i_{NE}, \tilde\pi^i \in \Pi^i$.

Method (8) ensures that condition (9) is satisfied when the current value of the system state value operator is computed at the Nash equilibrium point $\pi^{NE}$:

$NE(Q^i_t(s_{t+1})) = \sum_{a \in A} Q^i_t(s', a) \prod_{j=1}^{L} \pi^j_{NE}(s', a^j)$.

The set of NE equilibrium points in mixed strategies is a convex compact set and can be computed by linear programming methods (for bimatrix games) or by solving a system of polylinear relations that express the complementary slackness condition:

$\sum_{a^{-i} \in A^{-i}} Q^i_t(s, a^i, a^{-i}) \prod_{j \ne i} \pi^j(s, a^j) \le \sum_{a \in A} Q^i_t(s, a) \prod_{j=1}^{L} \pi^j(s, a^j)$, $\forall i \in I$, $a^i \in A^i$, $s \in S$,

$\pi^i(s, a^i) \ge 0$, $\sum_{a^i \in A^i} \pi^i(s, a^i) = 1$.
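To make the update (8) concrete, the sketch below performs one step for a single agent of a two-agent game; the array layout, the assumption that the opponent's mixed strategy has already been estimated, and the choice of the best-response operator from the list above as $V^i_t(s_{t+1})$ are conventions of this illustration, not a prescribed implementation.

```python
import numpy as np

def q_update_step(Q, s, a_own, a_opp, r, s_next, pi_opp, alpha=0.1, gamma=0.9):
    """One step of the multi-agent update (8) for one agent of a two-agent game.

    Q         : array (M, N_own, N_opp) -- the agent's Q-matrix over (state, own action, opponent action)
    s, s_next : current and next state indices
    a_own, a_opp : pure actions chosen at time t
    r         : the agent's current payoff r_t^i
    pi_opp    : array (M, N_opp) -- estimate of the opponent's mixed strategy in every state
    The value operator V_t^i(s_{t+1}) is taken as the best response to pi_opp.
    """
    # Expected Q-values against the opponent's strategy, maximized over own actions
    v_next = (Q[s_next] @ pi_opp[s_next]).max()
    Q[s, a_own, a_opp] = (1.0 - alpha) * Q[s, a_own, a_opp] + alpha * (r + gamma * v_next)
    return Q
```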
Unlike the Nash equilibrium, the Best Response (BR) method generates an optimal agent strategy in response to the actions of all other agents. The corresponding system state value operator in method (8) has the form [32]:

$BR(Q^i_t(s_{t+1})) = \max_{\pi^i \in \Pi^i} \sum_{a \in A} Q^i_t(s', a) \prod_{j=1}^{L} \pi^j(s', a^j)$.

Correlated equilibrium (CE) generalizes the Nash equilibrium by allowing the players' strategies to be dependent. For this purpose, the collective decision-making system contains an arbiter which, according to a joint distribution $\pi \in \Delta(A)$ ($\sum_{a \in A} \pi(a) = 1$, $A = \prod_{i=1}^{L} A^i$), recommends that the players take the actions that form a combined variant $a = (a^1, ..., a^L)$. The player with number $i$ receives only the information about its own component of the combined variant $a \in A$. This signal is perceived by player $i$ as an optional proposal to take the action $a^i$. Each player secretly and independently chooses at time $t$ an action $a^i$, possibly different from the proposed one, and receives the current payoff $r^i(s_t, a_t)$, which is a function of the current state of the system $s_t$ and of the combined variant $a_t \in A$. The environment then moves to a new state $s_{t+1}$ according to the probability distribution $p(s_{t+1} \mid s_t, a_t)$, and the process is repeated at time $t + 1$.

A correlated equilibrium is determined by the joint distribution of the players' strategies $\pi \in \Delta(A)$ for which no agent has a motivation to deviate unilaterally [33]:

$\sum_{a^{-i} \in A^{-i}} \pi_{CE}(a^{-i} \mid a^i)\, Q^i[s, (a^{-i}, a^i)] \ge \sum_{a^{-i} \in A^{-i}} \pi_{CE}(a^{-i} \mid a^i)\, Q^i[s, (a^{-i}, \tilde a^i)]$,

where $A^{-i} = \prod_{j=1, j \ne i}^{L} A^j$, $A = A^{-i} \times A^i$, $a^{-i} \in A^{-i}$, $a = (a^{-i}, a^i) \in A$, $a^i, \tilde a^i \in A^i$, $\pi_{CE}(a^i) = \sum_{a^{-i} \in A^{-i}} \pi_{CE}(a^{-i}, a^i)$, $\pi_{CE}(a^{-i} \mid a^i) = \pi_{CE}(a^{-i}, a^i) / \pi_{CE}(a^i)$, $\pi_{CE}(a^i) > 0$.

All Nash equilibrium points are correlated equilibrium points. If $\forall i \in I$: $\pi_{CE}(a^{-i} \mid a^i) = \pi_{CE}(a^{-i} \mid \tilde a^i)$ for all $a^i, \tilde a^i \in A^i$, $a^{-i} \in A^{-i}$ with $\pi_{CE}(a^i), \pi_{CE}(\tilde a^i) > 0$, then the correlated equilibrium is also a Nash equilibrium. To solve the game by method (8), the system state value operator $V^i_t(s_{t+1})$ is computed at the correlated equilibrium point $\pi_{CE}$:

$CE(Q^i_t(s_{t+1})) = \sum_{a \in A} \pi_{CE}(a)\, Q^i_t(s', a)$.

The set of CE equilibrium points is non-empty, convex and compact and can be efficiently computed using linear programming methods. In the case of maximizing the total payoff of the players, the linear programming problem for finding $\pi_{CE}$ can be formulated as follows:

$\sum_{a \in A} \pi(a) \sum_{i=1}^{L} Q^i(s, a) \to \max_{\pi}$,

$\sum_{a^{-i} \in A^{-i}} \pi(a^{-i}, a^i) \big( Q^i[s, (a^{-i}, a^i)] - Q^i[s, (a^{-i}, \tilde a^i)] \big) \ge 0$, $\forall i \in I$, $a^i \in A^i$, $\tilde a^i \in A^i$, $s \in S$,

$\pi(a) \ge 0$ $\forall a \in A$, $\sum_{a \in A} \pi(a) = 1$.

Based on the distribution $\pi(a)$, $a \in A$, the players' own strategies $\pi^i$, $i \in I$, are determined. There are various options for switching from $\pi$ to $\pi^i$, depending on the type of strategies and the players' level of awareness. For example, pure strategies can be determined by the maximum value of the operator $CE(Q^i(s))$: $a^i = \arg\max CE(Q^i(s))$. Taking into account that $\sum_{a^i \in A^i} \pi^i(a^i) = 1$, one can take the mixed strategies as $\pi^i(a^i) = \pi(a^i) = \sum_{a^{-i} \in A^{-i}} \pi(a^{-i}, a^i)$, or

$\pi^i(a^i) = \sum_{a^{-i} \in A^{-i}} \pi(a)\, Q^i(s, a^{-i}, a^i) \Big/ CE(Q^i(s))$.
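To illustrate how the distribution $\pi_{CE}$ could be obtained in practice, the sketch below sets up the above linear program for one state of a two-player game with scipy.optimize.linprog; the utilitarian (total-payoff) objective follows the formulation in the text, while the function name and array layout are assumptions of this example.

```python
import numpy as np
from scipy.optimize import linprog

def correlated_equilibrium(Q1, Q2):
    """Utilitarian correlated equilibrium for one state of a two-player game.

    Q1, Q2 : arrays (N1, N2) with Q^1(s, a1, a2) and Q^2(s, a1, a2).
    Returns the joint distribution pi(a1, a2) maximizing the total payoff.
    """
    N1, N2 = Q1.shape
    n = N1 * N2
    idx = lambda a1, a2: a1 * N2 + a2            # position of the joint action in the flattened vector

    rows = []
    # Player 1: sum_{a2} pi(a1, a2) * (Q1[a1, a2] - Q1[b1, a2]) >= 0 for all a1, b1
    for a1 in range(N1):
        for b1 in range(N1):
            if a1 == b1:
                continue
            row = np.zeros(n)
            for a2 in range(N2):
                row[idx(a1, a2)] = Q1[a1, a2] - Q1[b1, a2]
            rows.append(row)
    # Player 2: sum_{a1} pi(a1, a2) * (Q2[a1, a2] - Q2[a1, b2]) >= 0 for all a2, b2
    for a2 in range(N2):
        for b2 in range(N2):
            if a2 == b2:
                continue
            row = np.zeros(n)
            for a1 in range(N1):
                row[idx(a1, a2)] = Q2[a1, a2] - Q2[a1, b2]
            rows.append(row)

    # linprog minimizes c @ x subject to A_ub @ x <= b_ub, so negate the ">=" rows and the objective
    c = -(Q1 + Q2).ravel()
    A_ub = -np.array(rows)
    b_ub = np.zeros(len(rows))
    A_eq = np.ones((1, n))
    b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(N1, N2)
```

The players' own strategies $\pi^i$ can then be obtained from the returned joint distribution by the marginalization described above.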
Pareto optimality (PE) occurs in the common-interest Markov game, where the payoff matrices are the same for all players: $Q^i_t(s, a) = Q^j_t(s, a)$ $\forall i, j \in I$, $s \in S$, $a \in A$ [33]. A game with different payoff matrices can be turned into a common-interest game by the convolution

$PE(Q_t(s_{t+1})) = \sum_{k=1}^{L} \lambda_k \sum_{a \in A} Q^k_t(s', a) \prod_{j=1}^{L} \pi^j_{PE}(s', a^j)$,

where $\lambda_j > 0$ ($j = 1..L$).

The search for the PE solution of the game is carried out on the basis of independent agent strategies, similarly to the search for an NE solution. A multi-agent game is Pareto optimal if there is no joint player strategy that improves the payoffs of all players: $Q^i_t(s, \pi^{PE}) \ge Q^i_t(s, \pi)$. Pareto-optimal mixed strategies $\pi^{PE}(s) = (\pi^1_{PE}(s), ..., \pi^L_{PE}(s))$ can be obtained by maximizing the convolution of the concave (convex upward) payoff functions:

$\sum_{k=1}^{L} \lambda_k \sum_{a \in A} Q^k_t(s, a) \prod_{j=1}^{L} \pi^j(s, a^j) \to \max_{\pi}$.

To calculate the optimal collective strategies $\pi^*(s) = (\pi^1_*(s), ..., \pi^L_*(s))$ (NE, CE, PE), the agent with number $i$ needs to know the Q-functions of all agents: $Q(s) = (Q^1_t(s), ..., Q^L_t(s))$. In the absence of such information, each agent must estimate the values of the Q-functions in the learning process. To do so, agent $i$ observes the current payoffs of the other agents and modifies the estimates of their Q-functions according to (8).

To ensure that method (8) converges to one of the points of collective equilibrium, it is necessary to impose restrictions on the rate of change of its adjustable parameters. The general restrictions are as follows [11, 14, 34]:

$\sum_{t=0}^{\infty} \alpha_t = \infty$, $\quad \sum_{t=0}^{\infty} \alpha_t^2 < \infty$,   (10)

where $\alpha_t = t^{-\kappa}$ ($\kappa > 0$) is a monotonically decreasing positive sequence of real values.

4 Stochastic Game Solving Algorithm

Step 1. Set the start time $t = 0$; the initial values of the payoff matrices $Q^i_t(s, a^1, ..., a^L) = \varepsilon$ $\forall s \in S$, $a^i \in A^i$, $i \in I$, where $0 < \varepsilon \ll 1$ is a small positive value; the value of the payoff discount parameter $\gamma \in (0, 1]$; the initial state of the system $s_0$.

Step 2. Perform a random selection of the agents' actions $a = (a^1, ..., a^L)$ on the basis of the strategies $\pi = (\pi^1, ..., \pi^L)$. The values of the strategies can be calculated from the current estimates of the payoff matrices, $\forall i \in I$:

$\pi^i(s, a^i) = \sum_{a^{-i} \in A^{-i}} Q^i_t(s, a^i, a^{-i}) \Big/ \sum_{a \in A} Q^i_t(s, a)$, $a^i \in A^i$.

Step 3. Obtain the current agent payoffs $r_t = (r^1_t, ..., r^L_t)$.

Step 4. Determine the new state of the system $s_{t+1} = s_t(a^1, ..., a^L)$.

Step 5. Calculate the function $V^i_t(s_{t+1})$ according to the chosen condition of collective equilibrium (NE, CE, PE).

Step 6. Modify the payoff matrices $Q_{t+1} = \{Q^i_{t+1}(s_t, a_t) \mid i = 1..L\}$ according to (8).

Step 7. If $\|Q^i_{t+1} - Q^i_t\| > \varepsilon$ $\forall i = 1..L$, then set $t := t + 1$ and go to Step 2.

Step 8. Output the calculated values of the payoff matrices $Q(s) = (Q^1(s), ..., Q^L(s))$ and of the strategies $\pi(s) = (\pi^1(s), ..., \pi^L(s))$ $\forall s \in S$. End of algorithm.
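A compact sketch of this loop for two agents is given below; the environment interface (callables for the current payoffs and the state transition), the strategy rule of Step 2, and the best-response value operator used in Step 5 are assumptions of this illustration rather than a definitive implementation of the algorithm.

```python
import numpy as np

def solve_game(n_states, n_actions, reward_fn, next_state_fn,
               alpha=0.1, gamma=0.9, eps=1e-3, max_iter=100000, rng=None):
    """Sketch of the game-solving algorithm (Steps 1-8) for two agents.

    reward_fn(s, a1, a2)     -> (r1, r2), current payoffs of both agents
    next_state_fn(s, a1, a2) -> next state index
    Returns the Q-matrices of both agents (Step 8); the mixed strategies
    can be recomputed from them as in Step 2.
    """
    if rng is None:
        rng = np.random.default_rng()
    N1, N2 = n_actions
    # Step 1: initialise the Q-matrices with a small positive value eps
    Q = [np.full((n_states, N1, N2), eps), np.full((n_states, N1, N2), eps)]
    s = 0
    for t in range(max_iter):
        # Step 2: mixed strategies proportional to the marginal sums of the Q-matrices
        pi1 = Q[0][s].sum(axis=1) / Q[0][s].sum()
        pi2 = Q[1][s].sum(axis=0) / Q[1][s].sum()
        a1 = rng.choice(N1, p=pi1)
        a2 = rng.choice(N2, p=pi2)
        r = reward_fn(s, a1, a2)                  # Step 3: current payoffs
        s_next = next_state_fn(s, a1, a2)         # Step 4: new state of the system
        Q_old = [Q[0].copy(), Q[1].copy()]
        for i in range(2):                        # Steps 5-6: value operator and update (8)
            v_next = Q[i][s_next].max()           # best-response value operator (illustrative choice)
            Q[i][s, a1, a2] = (1 - alpha) * Q[i][s, a1, a2] + alpha * (r[i] + gamma * v_next)
        # Step 7: stop once the Q-matrices change by less than eps
        if max(np.abs(Q[i] - Q_old[i]).max() for i in range(2)) < eps:
            break
        s = s_next
    return Q                                      # Step 8
```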
5 The Results of Computer Simulation

Let us solve the stochastic game of two agents with two pure strategies each in a two-state environment. The matrices of the average payoffs of such a game are given in Table 1.

Table 1. Player payoff matrices

State s1
  First player v^1(s1, a1, a2):          Second player v^2(s1, a1, a2):
            a2[1]   a2[2]                          a2[1]   a2[2]
  a1[1]     0.4     0.1                  a1[1]     0.9     0.2
  a1[2]     0.1     0.9                  a1[2]     0.2     0.9

State s2
  First player v^1(s2, a1, a2):          Second player v^2(s2, a1, a2):
            a2[1]   a2[2]                          a2[1]   a2[2]
  a1[1]     0.4     0.6                  a1[1]     0.5     0.2
  a1[2]     0.6     0.8                  a1[2]     0.6     0.7

Under uncertainty, the elements of the average payoff matrices $v^i(s, a)$ $\forall s \in S$, $a \in A$ are a priori unknown and are available for observation only in the form of random current values $r^i(s, a) \sim Normal(v^i(s, a), d^i(s, a))$, distributed according to the normal law with mathematical expectation $v^i(s, a)$ and variance $d^i(s, a)$.

Normally distributed random variables are obtained by summing twelve uniformly distributed random numbers $\xi \in [0, 1]$:

$r^i(s, a) = v^i(s, a) + d^i(s, a) \Big( \sum_{j=1}^{12} \xi_j - 6 \Big)$,   (11)

where $d^i(s, a) = d \ge 0$ $\forall s \in S$, $a \in A$. If at time $t$ the system is in the state $s \in S$, then after the implementation of the pure strategies $a = (a^1, ..., a^L)$, where $a \in A = \prod_{i=1}^{L} A^i$, the agents receive current payoffs $r^i_t(s, a)$ calculated according to (11).

After receiving the current payoffs, each agent recalculates the corresponding element of its Q-matrix according to the BR algorithm modified for the conditions of uncertainty:

$Q^i_{t+1}(s, a^1, ..., a^L) = (1 - \alpha_t)\, Q^i_t(s, a^1, ..., a^L) + \alpha_t \big( r^i_t + \gamma \max_{a^i} Q^i_t(s, a^1, ..., a^L) \big)$.   (12)

Based on the Q-matrices, the current values of the mixed strategies are calculated using the Boltzmann method:

$\pi^i(a^i(k) \mid s) = e^{Q^*_i(s, a^i(k))/T} \Big/ \sum_{j=1}^{N_i} e^{Q^*_i(s, a^i(j))/T}$, $k = 1..N_i$,   (13)

where $Q^*_i(s, a^i(k)) = \max_{a^{-i}} r^i(a^{-i}, a^i(k))$, $a^{-i} \in A^{-i}$, $A^{-i} = \prod_{j \ne i} A^j$, and $T > 0$ is the temperature coefficient.

The elements of the mixed strategy vector $\pi^i$ define a discrete distribution by which the value of the random pure strategy of agent $i$ at the next moment of time is determined:

$a^i(s) = A^i(s, k)$, $k = \arg\min_k \Big\{ k \ \Big|\ \sum_{j=1}^{k} \pi^i(s, a^i(j)) \ge \theta \Big\}$, $k = 1..N_i$, $s \in S$, $i \in I$,   (14)

where $\theta \in [0, 1]$ is a random variable with uniform distribution.

The change of the states of the dynamic system is determined by the discrete distribution $p(s' \mid s, a) = p$ $\forall s \in S$, $a \in A$:

$s' = S(k)$, $k = \arg\min_k \Big\{ k \ \Big|\ \sum_{j=1}^{k} p(j) \ge \theta \Big\}$, $k = 1..M$.   (15)

Let the change of the states $s \in S$ of the system be implemented with equal probabilities, $p(s' \mid s, a) = (|S|^{-1}, k = 1..M)$ $\forall s \in S$, $a \in A$, that is $p(s' \mid s, a) = (0.5;\ 0.5)$ for $|S| = 2$.
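The random quantities used in (11), (13) and (14) can be sketched as follows; the function names are illustrative, and the temperature $T$ and the noise scale $d$ are passed as parameters.

```python
import numpy as np

rng = np.random.default_rng()

def noisy_payoff(v, d):
    """Current payoff (11): mean payoff v plus noise scaled by d, built from twelve uniform numbers."""
    return v + d * (rng.random(12).sum() - 6.0)

def boltzmann_strategy(q_row, T=0.05):
    """Mixed strategy (13): softmax of the action values q_row[k] with temperature T."""
    z = np.exp((q_row - q_row.max()) / T)   # subtracting the maximum only improves numerical stability
    return z / z.sum()

def sample_action(pi):
    """Pure strategy (14): smallest k whose cumulative probability reaches a uniform theta."""
    theta = rng.random()
    return int(np.searchsorted(np.cumsum(pi), theta))
```

The same cumulative-sum rule implements the state transition (15), with the distribution $p(s' \mid s, a)$ used in place of $\pi^i$.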
The agents' average payoffs are calculated taking into account the transition probabilities of the environment from one state to another:

$V^i = \sum_{s \in S} p(s)\, V^i(s)$, $i = 1..L$,

where $V^i(s) = \sum_{a \in A} v^i(s, a) \prod_{j=1}^{L} \pi^j(s, a^j)$ is the average payoff of the agent in the state $s \in S$.

The trajectories of the changes of the agents' strategies within the unit simplex and the form of the average payoff functions $V^i(s)$ corresponding to the data of Table 1 are shown in Fig. 1 and Fig. 2.

Fig. 1. First agent average payoff functions

Fig. 2. Second agent average payoff functions

The BR method (12) – (15) ensures that the stochastic game is solved at the vertices of the unit simplex with the maximum value of the average payoff function. The percentage of runs that reach the optimal solution of the game depends on the absolute difference between the two largest consecutive values of the average payoff functions.

The convergence of the method is estimated by the error of fulfilment of the complementary slackness condition [35], weighted by the mixed strategies:

$\delta = L^{-1} \sum_{i \in I} \| \pi^i - \tilde\delta^i \|$, $\quad \tilde\delta^i = diag(\pi^i)\, V^i / \bar V^i$,

where $diag(\pi^i)$ is the diagonal square matrix of order $N_i$ formed from the elements of the vector $\pi^i$; $V^i = (V^i[j] \mid j = 1..N_i)$ is the vector of average payoff functions of player $i$ for its fixed pure strategies; $\bar V^i = \sum_{j=1}^{N_i} \pi^i[j]\, V^i[j]$ is the average payoff function of player $i$; $\| \cdot \|$ is the Euclidean vector norm. The complementary slackness condition characterizes the solution of the game in Nash mixed strategies. The weighted condition additionally takes into account the solutions of the game in pure strategies. The plots of the average payoff function $\Delta$ and of the norm $\delta$ of the deviation of the mixed strategies from their target values are shown in Fig. 3.

Fig. 3. Characteristics of the convergence of the game Q-method

The payoffs in Fig. 3 are averaged over the number of players: $\Delta = L^{-1} \sum_{i=1}^{L} \Delta^i$, where $\Delta^i \ge 0$ is the discounted current payoff of player $i$. The decrease of the plot of the deviation $\delta$ of the mixed strategies from their target values indicates the convergence of the game Q-method.

The value of the temperature coefficient $T$ has a significant impact on the convergence of the game method. The rate of convergence is determined by how quickly the plot of the function $\delta$ declines, which can be estimated by the acute angle between the linear approximation of the plot of $\delta$ and the time axis. As $T$ grows, the rate of convergence of the game Q-method decreases.

6 Conclusions

The promotional training method (8), considered here in the deterministic setting, requires that each agent know the Q-functions of all other agents. These functions are used by the agents to determine strategies that move the method toward the points of collective equilibrium. The values of the Q-functions can be obtained by exchanging information between agents. If integrated information about the Q-functions is not available to an agent, it must determine their values independently in the learning process by observing the current payoffs of the other agents and evaluating their Q-functions according to (8). If such observations are not possible, the agents may perform reflexive estimates of the Q-functions of the other agents.

Another way of constructing incentive training algorithms for agents under uncertainty is to apply the stochastic approximation method to the corresponding collective equilibrium condition.

The practical use of game methods of promotional training requires their prior analysis to determine the conditions of convergence to a state of collective equilibrium. Such studies are based on the evaluation of sequences of random variables that characterize the current deviations of the players' strategies from their optimal values.

The rate of convergence of the game Q-learning method is determined by the parameters $\alpha_t$ and $T$. The parameter $\alpha_t$ must satisfy the general conditions of stochastic approximation (10). The value of the parameter $T$ depends on the absolute values of the elements of the Q-matrices. It has been established experimentally that for the given matrices of average payoffs the convergence of the game Q-method is ensured for $T \in (0, 0.2]$ with parameter values $\alpha_t = t^{-\kappa}$, $\kappa \in (0, 1]$. The highest convergence rate of the game method of promotional learning is achieved at $T = 10^{-2}$.

References

1. Weiss, G.: Multiagent Systems. Second Edition. The MIT Press. (2013)
2. Lee, Y., Chong, Q.: Multi-agent Systems Support for Community-Based Learning. Interacting with Computers, 15(1), 33 – 55. (2003)
3. Kravets, P.: The Methodology of Multi-Agent Systems: a Modern State and Future Trends. In: Proceedings of the International Conference on Computer Science and Information Technologies – CSIT'2006, 125 – 127. (2006)
4. Dignum, F., Bradshaw, J., Silverman, B.G., Doesburg, W.: Agents for Games and Simulations: Trends in Techniques, Concepts and Design. Springer. (2009)
5. Kravets, P.: The Control Agent with Fuzzy Logic. In: Perspective Technologies and Methods in MEMS Design, MEMSTECH, 40 – 41. (2010)
6. Scerri, P., Vincent, R., Mailler, R.T.: Coordination of Large-Scale Multiagent Systems. Springer. (2010)
7. Byrski, A., Kisiel-Dorohinicki, M.: Evolutionary Multi-Agent Systems: From Inspirations to Applications. Springer. (2017)
8. Radley, N.: Multi-Agent Systems – Modeling, Control, Programming, Simulations and Applications. Scitus Academics LLC. (2017)
9. Yang, S., Xu, J.-X., Li, X., Shen, D.: Iterative Learning Control for Multi-Agent Systems Coordination. Wiley-IEEE Press. (2017)
10. Sun, Z.: Cooperative Coordination and Formation Control for Multi-Agent Systems. Springer. (2018)
11. Nazin, A.V., Poznyak, A.S.: Adaptive Choice of Variants: Recurrence Algorithms (in Russian). Moscow: Science. (1986)
12. Kaelbling, L., Littman, M.L., Moore, A.W.: Reinforcement Learning: A Survey. In: Journal of Artificial Intelligence Research, 4, 237 – 285. (1996)
13. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press. (1998)
14. Watkins, C.J.C.H., Dayan, P.: Q-Learning. In: Machine Learning, Kluwer Academic Publishers, Boston, 8, 279 – 292. (1992)
15. Ummels, M.: Stochastic Multiplayer Games: Theory and Algorithms. Amsterdam University Press. (2014)
16. Ungureanu, V.: Pareto-Nash-Stackelberg Game and Control Theory: Intelligent Paradigms and Applications. Springer. (2018)
17. Zheng, J., Cai, Y., Wu, Y., Shen, X.: Dynamic Computation Offloading for Mobile Cloud Computing: A Stochastic Game-Theoretic Approach. In: IEEE Transactions on Mobile Computing, 18(4), 771 – 786. (2018)
18. Chen, B.-S.: Stochastic Game Strategies and their Applications. CRC Press. (2019)
19. Kravets, P., Pasichnyk, V., Kunanets, N., Veretennikova, N.: Game Method of Event Synchronization in Multiagent Systems. In: Advances in Intelligent Systems and Computing (AISC), ICCSEEA 2019 – Proceedings, 938, 378 – 387. (2019)
20. Kravets, P., Burov, Y., Lytvyn, V., Vysotska, V.: Gaming Method of Ontology Clusterization. In: Webology, 16(1), 55 – 76. (2019)
21. Gao, J., You, F.: A Stochastic Game Theoretic Framework for Decentralized Optimization of Multi-Stakeholder Supply Chain Under Uncertainty. In: Computers & Chemical Engineering, 122, 31 – 46. (2019)
22. Lalropuia, K.C., Gupta, V.: Modeling Cyber-Physical Attacks Based on Stochastic Game and Markov Processes. In: Reliability Engineering & System Safety, 181, 28 – 37. (2019)
23. Lozovanu, D.: Pure and Mixed Stationary Nash Equilibria for Average Stochastic Positional Games. In: Frontiers of Dynamic Games, 131 – 155. (2019)
24. Garrec, T.: Communicating Zero-Sum Product Stochastic Games. In: Journal of Mathematical Analysis and Applications, 477(1), 60 – 84. (2019)
25. Kloosterman, A.: Cooperation in Stochastic Games: a Prisoner's Dilemma Experiment. In: Experimental Economics, 1 – 21. (2019)
26. Lin, X., Adams, S.C., Beling, P.A.: Multi-Agent Inverse Reinforcement Learning for Certain General-Sum Stochastic Games. In: Journal of Artificial Intelligence Research, 66, 473 – 502. (2019)
27. Saldi, N., Basar, T., Raginsky, M.: Approximate Nash Equilibria in Partially Observed Stochastic Games with Mean-Field Interactions. In: Mathematics of Operations Research, 44(3), 1006 – 1033. (2019)
28. Yamamoto, Y.: Stochastic Games with Hidden States. In: Theoretical Economics, 14, 1115 – 1167. (2019)
29. Fudenberg, D., Levine, D.K.: The Theory of Learning in Games. Cambridge. (1998)
30. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York. (2005)
31. Hu, J., Wellman, M.P.: Nash Q-learning for General-Sum Stochastic Games. In: Journal of Machine Learning Research, 4, 1039 – 1069. (2003)
32. Weinberg, M., Rosenschein, J.S.: Best-Response Multiagent Learning in Non-Stationary Environments. In: AAMAS'04, New York, USA. (2004)
33. Greenwald, A., Hall, K.: Correlated Q-learning. In: Proceedings of the Twentieth International Conference on Machine Learning, 242 – 249. (2003)
34. Kushner, H., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications. Springer Science & Business Media. (2013)
35. Neogy, S.K., Bapat, R.B., Dubey, D.: Mathematical Programming and Game Theory. Springer. (2018)