Promoting Training of Multi-Agent Systems

Petro Kravets [0000-0001-8569-423X], Vasyl Lytvyn [0000-0002-9676-0180], Victoria Vysotska [0000-0001-6417-3689], Yevhen Burov [0000-0001-8653-1520]

Lviv Polytechnic National University, Lviv, Ukraine
Petro.O.Kravets@lpnu.ua, Vasyl.V.Lytvyn@lpnu.ua, Victoria.A.Vysotska@lpnu.ua, Yevhen.V.Burov@lpnu.ua

Abstract. The problem of incentive training of multi-agent systems in a game formulation for collective decision making under uncertainty is considered. Incentive training methods do not require a mathematical model of the environment and enable decision making directly in the training process. A Markov model of the stochastic game is constructed and the criteria for its solution are formulated. An iterative Q-method for solving the stochastic game, based on numerical identification of a characteristic function of a dynamic system in the state-action space, is described. Players' current gains are determined by randomization of the elements of the payoff Q-matrices. Mixed player strategies are calculated using the Boltzmann method. Pure strategies are determined on the basis of discrete random distributions given by the mixed player strategies. An algorithm for stochastic game solving is developed and the results of a computer implementation of the game Q-method are analyzed.

Keywords: Multi-Agent System, Stochastic Game, Promotional Training, Q-method

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The functioning of most modern information systems (IS) is based on rigidly programmed algorithms. Unforeseen environmental influences in such systems may impair the stability of operating modes, which can lead to various types of emergency situations. To prevent critical states, distributed IS software must consist of interoperable standalone modules, be intelligent and flexible, and be capable of independently monitoring environmental changes and making timely and appropriate decisions. In other words, such systems should be built on the principles of the agent-oriented methodology [1 – 10]. An IS agent is a standalone software module with elements of artificial intelligence, capable of making decisions on its own and of interacting with the environment, other agents, and people while accomplishing its task. IS agents interact within a computer network. A population of computer network agents that solve a common problem is called a multi-agent system (MAS).

The operation of a MAS is usually carried out under a priori uncertainty about the state of the decision-making environment and the actions of other agents. Therefore, the agents' behavior strategies must be adaptive, which is achieved through the agents' ability to learn [11]. Among the methods of learning under uncertainty, incentive-based (reinforcement) methods have gained practical appeal [12, 13] because they do not require a mathematical model of the environment and enable decision making directly in the learning process. Incentive training is modelled on the mechanisms of reflexive behavior of living organisms with a developed nervous system. An effective method of incentive learning is Markov Q-learning [14], which performs numerical identification of the characteristic function of a dynamic system in the state-action space. The characteristic function is usually the total expected reward of the agent.
Compared to single-agent systems, the structure, operation, and study of multi-agent Q-learning methods are considerably more complicated [15]. Due to the collective interaction of agents, a stationary environment is transformed into a non-stationary one. The change of the state of the environment and the value of each agent's payoff depend on the actions of the other agents. In general, an agent in a MAS cannot achieve the maximum gain attainable in a single-agent system. The optimal payoffs of the agents must be balanced and meet criteria of benefit, fairness, and balance. Thus, instead of the criterion of scalar maximization of the payoff of a single-agent system, criteria of vector maximization of the MAS payoffs are introduced, for example Nash equilibrium, Pareto optimality, or others [16].

When Q-learning is applied to a MAS, a system of characteristic Q-functions is built iteratively in the state-action space, and the elements of these functions are incremented in the direction of achieving collective equilibrium. To build a MAS, it is necessary to carry out preliminary studies on the basis of adequate mathematical models that allow studying the dynamics of the system under uncertainty and constructing agent behavior strategies that provide optimal technical and economic parameters of the system functioning. Given the peculiarities of the subject area, namely multi-agency, uncertainty of the decision-making environment, antagonism or competitiveness of goals, communication, coordination of actions, and adaptability of agent behavior strategies, we use the mathematical apparatus of stochastic game theory [17 – 29]. Solving a stochastic game means finding agent strategies that maximize their payoffs while ensuring a certain collective balance of interests of all players. The search for optimal player strategies under uncertainty will be performed on the basis of the promotional training method.

The purpose of this work is to construct an iterative method of incentive learning for solving the stochastic game of a MAS under uncertainty. To achieve this goal, it is necessary to develop a model of the multi-agent stochastic game, to determine the criteria of collective equilibrium, and to devise a method and an algorithm for solving the game problem.

2 The Mathematical Model of Stochastic Game

The stochastic game is determined by the tuple $(S, p, A^i, r^i \mid i \in I)$, $I = \{1, 2, ..., L\}$, where $S = \{s_1, ..., s_M\}$ is the set of all states of the environment; $p: S \times A \to \Delta(S)$ is the system state-change function defined in the space of probability distributions $\Delta(S)$ on the set $S$; $A^i = \{a^i(1), ..., a^i(N_i)\}$ is the set of actions (pure strategies) of agent $i$; $A = \prod_{i \in I} A^i$ is the set of combined agent actions; $r^i: S \times A \to R$ is the reward function of agent $i$; $I$ is the set of agents; $L$ is the number of agents; $M$ is the number of states; $N_i$ is the number of strategies of agent $i$. In the general case the action sets $A^i = A^i(s)$ $\forall i \in I$ and the combined actions $A = A(s)$ may depend on the state of the environment $s \in S$.

We adopt a Markov model [30, 31] of the dynamics of the system states, in which the probability of a state change $p$ depends only on the current state of the environment and the current actions of the agents:

$p\{s_{t+1} = s' \mid (s_\tau, b_\tau),\ \tau = 0, 1, 2, ..., t\} = p\{s_{t+1} = s' \mid s_t, b_t\}$,

where $b_t \in A$ is the combined action at time $t$.
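For illustration only, the game tuple defined above could be collected into a small container; the field names and the array layout in the following sketch are assumptions of this example rather than notation of the model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
import numpy as np

@dataclass
class StochasticGame:
    """Illustrative container for the tuple (S, p, {A^i}, {r^i} | i in I)."""
    n_states: int                 # M = |S|, the number of environment states
    n_actions: List[int]          # N_i, the number of pure strategies of each of the L agents
    transition: np.ndarray        # Markov state-change probabilities p(s' | s, a), indexed by (s, combined action, s')
    reward: Callable[[int, Tuple[int, ...]], np.ndarray]  # r(s, a) -> vector of L current payoffs
```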
At each point in time the environment is in one of the states $s \in S$ and the agents independently choose actions $a^i \in A^i$. After the combined variant $a = (a^1, ..., a^L) \in A$ is implemented, the agents receive random payoffs $r^i$ (also called incentives or reinforcements), and the environment changes its state according to the probability distribution $p(s, a)$ with values in the segment $[0, 1]$:

$\sum_{s' \in S} p(s' \mid s, a) = 1$.

Each agent implements its actions on the basis of a mixed strategy $\pi^i : S \to A^i$, which determines the probability of choosing the action $a^i \in A^i$ in every state of the environment $s \in S$. The distribution $\pi^i \in \Pi^i$ takes values on the unit simplex [11]:

$\Pi^i = \Big\{ \pi^i \ \Big|\ \sum_{a^i \in A^i} \pi^i(s, a^i) = 1,\ \pi^i(s, a^i) \ge 0 \Big\}$.

If $\pi^i(s, a^i) \in \{0, 1\}$, then the agent makes deterministic (pure) choices of its actions.

Let the total payoff of each agent be determined by the discounted sum of payoffs:

$R^i = \sum_{t=0}^{\infty} \gamma^t r^i_t$,   (1)

where $\gamma \in (0, 1]$ is the discount parameter. The goal of agent $i$ is to maximize the functional (1) by constructing an effective strategy $\pi^i$:

$V^i_\pi(s) = E\{R^i \mid s_0 = s\} \to \max_{\pi^i}, \quad i \in I$,   (2)

where $\pi = (\pi^1, ..., \pi^L)$ and $E$ is the symbol of mathematical expectation.

Solving the stochastic game consists in determining agent behavior strategies $\pi^i_*$ ($i \in I$) that ensure one of the conditions of collective optimality, for example:

1) Nash equilibrium: $V^i(s, \pi^1_*, \pi^2_*, ..., \pi^L_*) \ge V^i(s, \pi^1_*, ..., \pi^{i-1}_*, \pi^i, \pi^{i+1}_*, ..., \pi^L_*)$;

2) Pareto optimality: $V^i(s, \pi^*) \ge V^i(s, \pi)$.

3 Learning of Stochastic Game

The computation of $V^i_\pi(s)$ can be performed in the recursive form known in the literature as the Bellman equation [12 – 14]. Taking (1) into account, after simple transformations we obtain:

$V_\pi(s \mid s_t = s) = E(r_t) + \gamma E\Big(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\Big) = E(r_t) + \gamma V_\pi(s_{t+1}) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s))\, V_\pi(s')$,   (3)

where $s'$ denotes the possible future states of the system.

The agent's goal is to find a strategy $\pi^*$ that maximizes the functional (2) in all states of the environment: $\forall s \in S:\ V_{\pi^*}(s) \ge V_\pi(s)$.

Since the choice of actions is made at random, in order to compare the effectiveness of actions when the system is in a state $s \in S$ it is useful to derive the current payoffs from (2). For this purpose, the average payoff function $Q$ is constructed; it determines the value of an action, that is, the total payoff of the agent if in the state $s$ it chooses the action $a$:

$Q_\pi(s, a) = E\{R \mid s_0 = s,\ a_0 = a\}$.   (4)

Expression (4) defines a tabular function of the values of the action variants $a$ in the states $s$. Similarly to (3) we obtain:

$Q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V_\pi(s')$.   (5)

Adherence to the Bellman principle of optimality in (5) ensures the optimal gain of the agent, starting from the current state $s \in S$, at all future times. Applying this principle to all states yields a globally optimal solution. For the optimal strategy selection function $\pi^*$ we obtain for each state $s \in S$:

$V^*(s) = \max_{a \in A} \Big[ r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^*(s') \Big]$.   (6)

From (6) one can obtain the optimal strategy selection function:

$\pi^*(s) = \arg\max_{a \in A} Q^*(s, a)$.   (7)

Optimization (7) can be performed by dynamic programming methods [30].
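As a minimal sketch of such a dynamic-programming computation, assuming the transition probabilities and expected rewards are known and stored as arrays (the array shapes and the function name below are assumptions of this example), Q-value iteration over (5) – (7) for a single agent can be written as follows.

```python
import numpy as np

def q_value_iteration(p, r, gamma=0.9, tol=1e-8):
    """Single-agent Q-value iteration following (5)-(7).

    p : array (M, N, M), p[s, a, s_next] -- known transition probabilities
    r : array (M, N),    r[s, a]         -- known expected rewards
    Returns the optimal Q-table and the greedy pure strategy argmax_a Q*(s, a), cf. (7).
    """
    M, N = r.shape
    Q = np.zeros((M, N))
    while True:
        V = Q.max(axis=1)                        # V(s) = max_a Q(s, a), cf. (6)
        Q_new = r + gamma * (p @ V)              # Q(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) V(s'), cf. (5)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new, Q_new.argmax(axis=1)
        Q = Q_new
```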
By analogy with single-agent training, we define the payoff matrix of player $i$ with current and future payoffs in the direction of movement towards the optimal collective state in the space $S \times A$:

$Q^i_*(s, a^1, ..., a^L) = r^i(s, a^1, ..., a^L) + \gamma \sum_{s' \in S} p(s' \mid s, a^1, ..., a^L)\, V^i_*(s', \pi^1_*, ..., \pi^L_*)$,

where $Q^i_*(s, a^1, ..., a^L)$ is the total discounted payoff of player $i$ provided that the players select the action $(a^1, ..., a^L)$ in the state $s$ and then follow the optimal strategy of the game $\pi^* = (\pi^1_*, ..., \pi^L_*)$.

Under a priori uncertainty about the transition probabilities between the states of the system $p(s, a^1, ..., a^L)$ and about the payoff functions $r^i(s, a^1, ..., a^L)$, an iterative Q-learning method is used to compute the elements of the payoff matrices [31]:

$Q^i_{t+1}(s, a^1, ..., a^L) = (1 - \alpha_t)\, Q^i_t(s, a^1, ..., a^L) + \alpha_t [r^i_t + \gamma V^i_t(s_{t+1})]$,   (8)

where $\alpha_t \in (0, 1)$ is the learning parameter and $V^i_t(s_{t+1})$ is the operator of the value of the system state in the direction of the optimal collective solution.

The type of the operator $V^i_t(s_{t+1})$ is determined by the chosen condition of collective equilibrium, for example:

- $V^i_t(s_{t+1}) = MM(Q^i_t(s_{t+1}))$ is the maximin equilibrium;
- $V^i_t(s_{t+1}) = NE(Q^i_t(s_{t+1}))$ is the Nash equilibrium;
- $V^i_t(s_{t+1}) = BR(Q^i_t(s_{t+1}))$ is the best response of the agent;
- $V^i_t(s_{t+1}) = CE(Q^i_t(s_{t+1}))$ is the correlated equilibrium;
- $V^i_t(s_{t+1}) = PE(Q^i_t(s_{t+1}))$ is Pareto optimality.

This list may be supplemented by other already known or new equilibrium conditions that determine the target aspect of the functioning of a distributed dynamic system.

Method (8) can be applied to solve a single-agent game with nature as a special case of the $L$-agent stochastic game when $I = \{i\}$, $|I| = 1$, $A = A^i$ and $V^i_t(s_{t+1}) = \max_{b \in A^i} Q^i_t(s', b)$, where $s' = s_{t+1}$.

The maximin equilibrium (MM) arises in a game of two agents with zero sum of their payoff functions:

$V^1_t(s_{t+1}) = \max_{\pi^1 \in \Pi^1} \min_{a^2 \in A^2} \sum_{a^1 \in A^1} \pi^1(s', a^1)\, Q^1_t(s', a^1, a^2) = -V^2_t(s_{t+1})$.

The Nash equilibrium (NE) is determined by the independent distributions of the strategies of players who choose their own strategies independently of the choices of the other agents. In a Nash equilibrium situation in mixed strategies $\pi^{NE}(s) = (\pi^1_{NE}(s), ..., \pi^L_{NE}(s))$ it is not profitable for any agent to deviate from its own optimal strategy $\pi^i_{NE}(s)$ if the other agents stick to the equilibrium point [31]:

$\sum_{a \in A} Q^i_t(s, a)\, \pi^i_{NE}(s, a^i) \prod_{j \ne i} \pi^j_{NE}(s, a^j) \ge \sum_{a \in A} Q^i_t(s, a)\, \tilde\pi^i(s, a^i) \prod_{j \ne i} \pi^j_{NE}(s, a^j)$,   (9)

where $a = (a^1, ..., a^L)$; $\pi^i_{NE}, \tilde\pi^i \in \Pi^i$.

Method (8) ensures that condition (9) is satisfied when the current value of the system state value operator is computed at the Nash equilibrium point $\pi^{NE}$:

$NE(Q^i_t(s_{t+1})) = \sum_{a \in A} Q^i_t(s', a) \prod_{j=1}^{L} \pi^j_{NE}(s', a^j)$.

The set of NE equilibrium points in mixed strategies is a convex compact set and can be computed by linear programming methods (for bimatrix games) or by solving a system of polylinear relations that express the complementary slackness condition:

$\sum_{a^{-i} \in A^{-i}} Q^i_t(s, a^i, a^{-i}) \prod_{j \ne i} \pi^j(s, a^j) \le \sum_{a \in A} Q^i_t(s, a) \prod_{j=1}^{L} \pi^j(s, a^j)$, $\forall i \in I$, $a^i \in A^i$, $s \in S$,

$\pi^i(s, a^i) \ge 0$, $\sum_{a^i \in A^i} \pi^i(s, a^i) = 1$.
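To make the update (8) concrete, the sketch below performs one step for a single agent of a two-agent game; the array layout, the assumption that the opponent's mixed strategy has already been estimated, and the choice of the best-response operator from the list above as $V^i_t(s_{t+1})$ are conventions of this illustration, not a prescribed implementation.

```python
import numpy as np

def q_update_step(Q, s, a_own, a_opp, r, s_next, pi_opp, alpha=0.1, gamma=0.9):
    """One step of the multi-agent update (8) for one agent of a two-agent game.

    Q         : array (M, N_own, N_opp) -- the agent's Q-matrix over (state, own action, opponent action)
    s, s_next : current and next state indices
    a_own, a_opp : pure actions chosen at time t
    r         : the agent's current payoff r_t^i
    pi_opp    : array (M, N_opp) -- estimate of the opponent's mixed strategy in every state
    The value operator V_t^i(s_{t+1}) is taken as the best response to pi_opp.
    """
    # Expected Q-values against the opponent's strategy, maximized over own actions
    v_next = (Q[s_next] @ pi_opp[s_next]).max()
    Q[s, a_own, a_opp] = (1.0 - alpha) * Q[s, a_own, a_opp] + alpha * (r + gamma * v_next)
    return Q
```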
Unlike the Nash equilibrium, the Best Response (BR) method generates an optimal agent strategy in response to the actions of all other agents. The corresponding system state value operator in method (8) has the form [32]:

$BR(Q^i_t(s_{t+1})) = \max_{\pi^i \in \Pi^i} \sum_{a \in A} Q^i_t(s', a) \prod_{j=1}^{L} \pi^j(s', a^j)$.

Correlated equilibrium (CE) generalizes the Nash equilibrium by allowing the players' strategies to be dependent. For this purpose, the collective decision-making system contains an arbiter which, according to a joint distribution $\pi \in \Delta(A)$ ($\sum_{a \in A} \pi(a) = 1$, $A = \prod_{i=1}^{L} A^i$), recommends that the players take the actions that form a combined variant $a = (a^1, ..., a^L)$. The player with number $i$ receives only the information about its own component of the combined variant $a \in A$. This signal is perceived by player $i$ as an optional proposal to take the action $a^i$. Each player secretly and independently chooses at time $t$ an action $a^i$, possibly different from the proposed one, and receives the current payoff $r^i(s_t, a_t)$, which is a function of the current state of the system $s_t$ and of the combined variant $a_t \in A$. The environment then moves to a new state $s_{t+1}$ according to the probability distribution $p(s_{t+1} \mid s_t, a_t)$, and the process is repeated at time $t + 1$.

A correlated equilibrium is determined by the joint distribution of the players' strategies $\pi \in \Delta(A)$ for which no agent has a motivation to deviate unilaterally [33]:

$\sum_{a^{-i} \in A^{-i}} \pi_{CE}(a^{-i} \mid a^i)\, Q^i[s, (a^{-i}, a^i)] \ge \sum_{a^{-i} \in A^{-i}} \pi_{CE}(a^{-i} \mid a^i)\, Q^i[s, (a^{-i}, \tilde a^i)]$,

where $A^{-i} = \prod_{j=1, j \ne i}^{L} A^j$, $A = A^{-i} \times A^i$, $a^{-i} \in A^{-i}$, $a = (a^{-i}, a^i) \in A$, $a^i, \tilde a^i \in A^i$, $\pi_{CE}(a^i) = \sum_{a^{-i} \in A^{-i}} \pi_{CE}(a^{-i}, a^i)$, $\pi_{CE}(a^{-i} \mid a^i) = \pi_{CE}(a^{-i}, a^i) / \pi_{CE}(a^i)$, $\pi_{CE}(a^i) > 0$.

All Nash equilibrium points are correlated equilibrium points. If $\forall i \in I$: $\pi_{CE}(a^{-i} \mid a^i) = \pi_{CE}(a^{-i} \mid \tilde a^i)$ for all $a^i, \tilde a^i \in A^i$, $a^{-i} \in A^{-i}$ with $\pi_{CE}(a^i), \pi_{CE}(\tilde a^i) > 0$, then the correlated equilibrium is also a Nash equilibrium. To solve the game by method (8), the system state value operator $V^i_t(s_{t+1})$ is computed at the correlated equilibrium point $\pi_{CE}$:

$CE(Q^i_t(s_{t+1})) = \sum_{a \in A} \pi_{CE}(a)\, Q^i_t(s', a)$.

The set of CE equilibrium points is non-empty, convex and compact and can be efficiently computed using linear programming methods. In the case of maximizing the total payoff of the players, the linear programming problem for finding $\pi_{CE}$ can be formulated as follows:

$\sum_{a \in A} \pi(a) \sum_{i=1}^{L} Q^i(s, a) \to \max_{\pi}$,

$\sum_{a^{-i} \in A^{-i}} \pi(a^{-i}, a^i) \big( Q^i[s, (a^{-i}, a^i)] - Q^i[s, (a^{-i}, \tilde a^i)] \big) \ge 0$, $\forall i \in I$, $a^i \in A^i$, $\tilde a^i \in A^i$, $s \in S$,

$\pi(a) \ge 0$ $\forall a \in A$, $\sum_{a \in A} \pi(a) = 1$.

Based on the distribution $\pi(a)$, $a \in A$, the players' own strategies $\pi^i$, $i \in I$, are determined. There are various options for switching from $\pi$ to $\pi^i$, depending on the type of strategies and the players' level of awareness. For example, pure strategies can be determined by the maximum value of the operator $CE(Q^i(s))$: $a^i = \arg\max CE(Q^i(s))$. Taking into account that $\sum_{a^i \in A^i} \pi^i(a^i) = 1$, one can take the mixed strategies as $\pi^i(a^i) = \pi(a^i) = \sum_{a^{-i} \in A^{-i}} \pi(a^{-i}, a^i)$, or

$\pi^i(a^i) = \sum_{a^{-i} \in A^{-i}} \pi(a)\, Q^i(s, a^{-i}, a^i) \Big/ CE(Q^i(s))$.
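To illustrate how the distribution $\pi_{CE}$ could be obtained in practice, the sketch below sets up the above linear program for one state of a two-player game with scipy.optimize.linprog; the utilitarian (total-payoff) objective follows the formulation in the text, while the function name and array layout are assumptions of this example.

```python
import numpy as np
from scipy.optimize import linprog

def correlated_equilibrium(Q1, Q2):
    """Utilitarian correlated equilibrium for one state of a two-player game.

    Q1, Q2 : arrays (N1, N2) with Q^1(s, a1, a2) and Q^2(s, a1, a2).
    Returns the joint distribution pi(a1, a2) maximizing the total payoff.
    """
    N1, N2 = Q1.shape
    n = N1 * N2
    idx = lambda a1, a2: a1 * N2 + a2            # position of the joint action in the flattened vector

    rows = []
    # Player 1: sum_{a2} pi(a1, a2) * (Q1[a1, a2] - Q1[b1, a2]) >= 0 for all a1, b1
    for a1 in range(N1):
        for b1 in range(N1):
            if a1 == b1:
                continue
            row = np.zeros(n)
            for a2 in range(N2):
                row[idx(a1, a2)] = Q1[a1, a2] - Q1[b1, a2]
            rows.append(row)
    # Player 2: sum_{a1} pi(a1, a2) * (Q2[a1, a2] - Q2[a1, b2]) >= 0 for all a2, b2
    for a2 in range(N2):
        for b2 in range(N2):
            if a2 == b2:
                continue
            row = np.zeros(n)
            for a1 in range(N1):
                row[idx(a1, a2)] = Q2[a1, a2] - Q2[a1, b2]
            rows.append(row)

    # linprog minimizes c @ x subject to A_ub @ x <= b_ub, so negate the ">=" rows and the objective
    c = -(Q1 + Q2).ravel()
    A_ub = -np.array(rows)
    b_ub = np.zeros(len(rows))
    A_eq = np.ones((1, n))
    b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(N1, N2)
```

The players' own strategies $\pi^i$ can then be obtained from the returned joint distribution by the marginalization described above.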
Pareto optimality (PE) occurs in the common-interest Markov game, where the payoff matrices are the same for all players: $Q^i_t(s, a) = Q^j_t(s, a)$ $\forall i, j \in I$, $s \in S$, $a \in A$ [33]. A game with different payoff matrices can be turned into a common-interest game by the convolution

$PE(Q_t(s_{t+1})) = \sum_{k=1}^{L} \lambda_k \sum_{a \in A} Q^k_t(s', a) \prod_{j=1}^{L} \pi^j_{PE}(s', a^j)$,

where $\lambda_j > 0$ ($j = 1..L$).

The search for the PE solution of the game is carried out on the basis of independent agent strategies, similarly to the search for an NE solution. A multi-agent game is Pareto optimal if there is no joint player strategy that improves the payoffs of all players: $Q^i_t(s, \pi^{PE}) \ge Q^i_t(s, \pi)$. Pareto-optimal mixed strategies $\pi^{PE}(s) = (\pi^1_{PE}(s), ..., \pi^L_{PE}(s))$ can be obtained by maximizing the convolution of the concave (convex upward) payoff functions:

$\sum_{k=1}^{L} \lambda_k \sum_{a \in A} Q^k_t(s, a) \prod_{j=1}^{L} \pi^j(s, a^j) \to \max_{\pi}$.

To calculate the optimal collective strategies $\pi^*(s) = (\pi^1_*(s), ..., \pi^L_*(s))$ (NE, CE, PE), the agent with number $i$ needs to know the Q-functions of all agents: $Q(s) = (Q^1_t(s), ..., Q^L_t(s))$. In the absence of such information, each agent must estimate the values of the Q-functions in the learning process. To do so, agent $i$ observes the current payoffs of the other agents and modifies the estimates of their Q-functions according to (8).

To ensure that method (8) converges to one of the points of collective equilibrium, it is necessary to impose restrictions on the rate of change of its adjustable parameters. The general restrictions are as follows [11, 14, 34]:

$\sum_{t=0}^{\infty} \alpha_t = \infty$, $\quad \sum_{t=0}^{\infty} \alpha_t^2 < \infty$,   (10)

where $\alpha_t = t^{-\kappa}$ ($\kappa > 0$) is a monotonically decreasing positive sequence of real values.

4 Stochastic Game Solving Algorithm

Step 1. Set the start time $t = 0$; the initial values of the payoff matrices $Q^i_t(s, a^1, ..., a^L) = \varepsilon$ $\forall s \in S$, $a^i \in A^i$, $i \in I$, where $0 < \varepsilon \ll 1$ is a small positive value; the value of the payoff discount parameter $\gamma \in (0, 1]$; the initial state of the system $s_0$.

Step 2. Perform a random selection of the agents' actions $a = (a^1, ..., a^L)$ on the basis of the strategies $\pi = (\pi^1, ..., \pi^L)$. The values of the strategies can be calculated from the current estimates of the payoff matrices, $\forall i \in I$:

$\pi^i(s, a^i) = \sum_{a^{-i} \in A^{-i}} Q^i_t(s, a^i, a^{-i}) \Big/ \sum_{a \in A} Q^i_t(s, a)$, $a^i \in A^i$.

Step 3. Obtain the current agent payoffs $r_t = (r^1_t, ..., r^L_t)$.

Step 4. Determine the new state of the system $s_{t+1} = s_t(a^1, ..., a^L)$.

Step 5. Calculate the function $V^i_t(s_{t+1})$ according to the chosen condition of collective equilibrium (NE, CE, PE).

Step 6. Modify the payoff matrices $Q_{t+1} = \{Q^i_{t+1}(s_t, a_t) \mid i = 1..L\}$ according to (8).

Step 7. If $\|Q^i_{t+1} - Q^i_t\| > \varepsilon$ $\forall i = 1..L$, then set $t := t + 1$ and go to Step 2.

Step 8. Output the calculated values of the payoff matrices $Q(s) = (Q^1(s), ..., Q^L(s))$ and of the strategies $\pi(s) = (\pi^1(s), ..., \pi^L(s))$ $\forall s \in S$. End of algorithm.
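A compact sketch of this loop for two agents is given below; the environment interface (callables for the current payoffs and the state transition), the strategy rule of Step 2, and the best-response value operator used in Step 5 are assumptions of this illustration rather than a definitive implementation of the algorithm.

```python
import numpy as np

def solve_game(n_states, n_actions, reward_fn, next_state_fn,
               alpha=0.1, gamma=0.9, eps=1e-3, max_iter=100000, rng=None):
    """Sketch of the game-solving algorithm (Steps 1-8) for two agents.

    reward_fn(s, a1, a2)     -> (r1, r2), current payoffs of both agents
    next_state_fn(s, a1, a2) -> next state index
    Returns the Q-matrices of both agents (Step 8); the mixed strategies
    can be recomputed from them as in Step 2.
    """
    if rng is None:
        rng = np.random.default_rng()
    N1, N2 = n_actions
    # Step 1: initialise the Q-matrices with a small positive value eps
    Q = [np.full((n_states, N1, N2), eps), np.full((n_states, N1, N2), eps)]
    s = 0
    for t in range(max_iter):
        # Step 2: mixed strategies proportional to the marginal sums of the Q-matrices
        pi1 = Q[0][s].sum(axis=1) / Q[0][s].sum()
        pi2 = Q[1][s].sum(axis=0) / Q[1][s].sum()
        a1 = rng.choice(N1, p=pi1)
        a2 = rng.choice(N2, p=pi2)
        r = reward_fn(s, a1, a2)                  # Step 3: current payoffs
        s_next = next_state_fn(s, a1, a2)         # Step 4: new state of the system
        Q_old = [Q[0].copy(), Q[1].copy()]
        for i in range(2):                        # Steps 5-6: value operator and update (8)
            v_next = Q[i][s_next].max()           # best-response value operator (illustrative choice)
            Q[i][s, a1, a2] = (1 - alpha) * Q[i][s, a1, a2] + alpha * (r[i] + gamma * v_next)
        # Step 7: stop once the Q-matrices change by less than eps
        if max(np.abs(Q[i] - Q_old[i]).max() for i in range(2)) < eps:
            break
        s = s_next
    return Q                                      # Step 8
```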
5 The Results of Computer Simulation

Let us solve the stochastic game of two agents with two pure strategies each in a two-state environment. The matrices of the average payoffs of such a game are given in Table 1.

Table 1. Player payoff matrices

State s1
  First player v^1(s1, a1, a2):          Second player v^2(s1, a1, a2):
            a2[1]   a2[2]                          a2[1]   a2[2]
  a1[1]     0.4     0.1                  a1[1]     0.9     0.2
  a1[2]     0.1     0.9                  a1[2]     0.2     0.9

State s2
  First player v^1(s2, a1, a2):          Second player v^2(s2, a1, a2):
            a2[1]   a2[2]                          a2[1]   a2[2]
  a1[1]     0.4     0.6                  a1[1]     0.5     0.2
  a1[2]     0.6     0.8                  a1[2]     0.6     0.7

Under uncertainty, the elements of the average payoff matrices $v^i(s, a)$ $\forall s \in S$, $a \in A$ are a priori unknown and are available for observation only in the form of random current values $r^i(s, a) \sim Normal(v^i(s, a), d^i(s, a))$, distributed according to the normal law with mathematical expectation $v^i(s, a)$ and variance $d^i(s, a)$.

Normally distributed random variables are obtained by summing twelve uniformly distributed random numbers $\xi \in [0, 1]$:

$r^i(s, a) = v^i(s, a) + d^i(s, a) \Big( \sum_{j=1}^{12} \xi_j - 6 \Big)$,   (11)

where $d^i(s, a) = d \ge 0$ $\forall s \in S$, $a \in A$. If at time $t$ the system is in the state $s \in S$, then after the implementation of the pure strategies $a = (a^1, ..., a^L)$, where $a \in A = \prod_{i=1}^{L} A^i$, the agents receive current payoffs $r^i_t(s, a)$ calculated according to (11).

After receiving the current payoffs, each agent recalculates the corresponding element of its Q-matrix according to the BR algorithm modified for the conditions of uncertainty:

$Q^i_{t+1}(s, a^1, ..., a^L) = (1 - \alpha_t)\, Q^i_t(s, a^1, ..., a^L) + \alpha_t \big( r^i_t + \gamma \max_{a^i} Q^i_t(s, a^1, ..., a^L) \big)$.   (12)

Based on the Q-matrices, the current values of the mixed strategies are calculated using the Boltzmann method:

$\pi^i(a^i(k) \mid s) = e^{Q^*_i(s, a^i(k))/T} \Big/ \sum_{j=1}^{N_i} e^{Q^*_i(s, a^i(j))/T}$, $k = 1..N_i$,   (13)

where $Q^*_i(s, a^i(k)) = \max_{a^{-i}} r^i(a^{-i}, a^i(k))$, $a^{-i} \in A^{-i}$, $A^{-i} = \prod_{j \ne i} A^j$, and $T > 0$ is the temperature coefficient.

The elements of the mixed strategy vector $\pi^i$ define a discrete distribution by which the value of the random pure strategy of agent $i$ at the next moment of time is determined:

$a^i(s) = A^i(s, k)$, $k = \arg\min_k \Big\{ k \ \Big|\ \sum_{j=1}^{k} \pi^i(s, a^i(j)) \ge \theta \Big\}$, $k = 1..N_i$, $s \in S$, $i \in I$,   (14)

where $\theta \in [0, 1]$ is a random variable with uniform distribution.

The change of the states of the dynamic system is determined by the discrete distribution $p(s' \mid s, a) = p$ $\forall s \in S$, $a \in A$:

$s' = S(k)$, $k = \arg\min_k \Big\{ k \ \Big|\ \sum_{j=1}^{k} p(j) \ge \theta \Big\}$, $k = 1..M$.   (15)

Let the change of the states $s \in S$ of the system be implemented with equal probabilities, $p(s' \mid s, a) = (|S|^{-1}, k = 1..M)$ $\forall s \in S$, $a \in A$, that is $p(s' \mid s, a) = (0.5;\ 0.5)$ for $|S| = 2$.
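The random quantities used in (11), (13) and (14) can be sketched as follows; the function names are illustrative, and the temperature $T$ and the noise scale $d$ are passed as parameters.

```python
import numpy as np

rng = np.random.default_rng()

def noisy_payoff(v, d):
    """Current payoff (11): mean payoff v plus noise scaled by d, built from twelve uniform numbers."""
    return v + d * (rng.random(12).sum() - 6.0)

def boltzmann_strategy(q_row, T=0.05):
    """Mixed strategy (13): softmax of the action values q_row[k] with temperature T."""
    z = np.exp((q_row - q_row.max()) / T)   # subtracting the maximum only improves numerical stability
    return z / z.sum()

def sample_action(pi):
    """Pure strategy (14): smallest k whose cumulative probability reaches a uniform theta."""
    theta = rng.random()
    return int(np.searchsorted(np.cumsum(pi), theta))
```

The same cumulative-sum rule implements the state transition (15), with the distribution $p(s' \mid s, a)$ used in place of $\pi^i$.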
The agents' average payoffs are calculated taking into account the transition probabilities of the environment from one state to another:

$V^i = \sum_{s \in S} p(s)\, V^i(s)$, $i = 1..L$,

where $V^i(s) = \sum_{a \in A} v^i(s, a) \prod_{j=1}^{L} \pi^j(s, a^j)$ is the average payoff of the agent in the state $s \in S$.

The trajectories of the changes of the agents' strategies within the unit simplex and the form of the average payoff functions $V^i(s)$ corresponding to the data of Table 1 are shown in Fig. 1 and Fig. 2.

Fig. 1. First agent average payoff functions

Fig. 2. Second agent average payoff functions

The BR method (12) – (15) ensures that the stochastic game is solved at the vertices of the unit simplex with the maximum value of the average payoff function. The percentage of runs that reach the optimal solution of the game depends on the absolute difference between the two largest consecutive values of the average payoff functions.

The convergence of the method is estimated by the error of fulfilment of the complementary slackness condition [35], weighted by the mixed strategies:

$\delta = L^{-1} \sum_{i \in I} \| \pi^i - \tilde\delta^i \|$, $\quad \tilde\delta^i = diag(\pi^i)\, V^i / \bar V^i$,

where $diag(\pi^i)$ is the diagonal square matrix of order $N_i$ formed from the elements of the vector $\pi^i$; $V^i = (V^i[j] \mid j = 1..N_i)$ is the vector of average payoff functions of player $i$ for its fixed pure strategies; $\bar V^i = \sum_{j=1}^{N_i} \pi^i[j]\, V^i[j]$ is the average payoff function of player $i$; $\| \cdot \|$ is the Euclidean vector norm. The complementary slackness condition characterizes the solution of the game in Nash mixed strategies. The weighted condition additionally takes into account the solutions of the game in pure strategies. The plots of the average payoff function $\Delta$ and of the norm $\delta$ of the deviation of the mixed strategies from their target values are shown in Fig. 3.

Fig. 3. Characteristics of the convergence of the game Q-method

The payoffs in Fig. 3 are averaged over the number of players: $\Delta = L^{-1} \sum_{i=1}^{L} \Delta^i$, where $\Delta^i \ge 0$ is the discounted current payoff of player $i$. The decrease of the plot of the deviation $\delta$ of the mixed strategies from their target values indicates the convergence of the game Q-method.

The value of the temperature coefficient $T$ has a significant impact on the convergence of the game method. The rate of convergence is determined by how quickly the plot of the function $\delta$ declines, which can be estimated by the acute angle between the linear approximation of the plot of $\delta$ and the time axis. As $T$ grows, the rate of convergence of the game Q-method decreases.

6 Conclusions

The promotional training method (8), considered here in the deterministic setting, requires that each agent know the Q-functions of all other agents. These functions are used by the agents to determine strategies that move the method toward the points of collective equilibrium. The values of the Q-functions can be obtained by exchanging information between agents. If integrated information about the Q-functions is not available to an agent, it must determine their values independently in the learning process by observing the current payoffs of the other agents and evaluating their Q-functions according to (8). If such observations are not possible, the agents may perform reflexive estimates of the Q-functions of the other agents.

Another way of constructing incentive training algorithms for agents under uncertainty is to apply the stochastic approximation method to the corresponding collective equilibrium condition.

The practical use of game methods of promotional training requires their prior analysis to determine the conditions of convergence to a state of collective equilibrium. Such studies are based on the evaluation of sequences of random variables that characterize the current deviations of the players' strategies from their optimal values.

The rate of convergence of the game Q-learning method is determined by the parameters $\alpha_t$ and $T$. The parameter $\alpha_t$ must satisfy the general conditions of stochastic approximation (10). The value of the parameter $T$ depends on the absolute values of the elements of the Q-matrices. It has been established experimentally that for the given matrices of average payoffs the convergence of the game Q-method is ensured for $T \in (0, 0.2]$ with parameter values $\alpha_t = t^{-\kappa}$, $\kappa \in (0, 1]$. The highest convergence rate of the game method of promotional learning is achieved at $T = 10^{-2}$.

References

1. Weiss, G.: Multiagent Systems. Second Edition. The MIT Press. (2013)
2. Lee, Y., Chong, Q.: Multi-agent Systems Support for Community-Based Learning. Interacting with Computers, 15(1), 33 – 55. (2003)
3. Kravets, P.: The Methodology of Multi-Agent Systems: a Modern State and Future Trends. In: Proceedings of the International Conference on Computer Science and Information Technologies – CSIT'2006, 125 – 127. (2006)
4. Dignum, F., Bradshaw, J., Silverman, B.G., Doesburg, W.: Agents for Games and Simulations: Trends in Techniques, Concepts and Design. Springer. (2009)
5. Kravets, P.: The Control Agent with Fuzzy Logic. In: Perspective Technologies and Methods in MEMS Design, MEMSTECH, 40 – 41. (2010)
6. Scerri, P., Vincent, R., Mailler, R.T.: Coordination of Large-Scale Multiagent Systems. Springer. (2010)
7. Byrski, A., Kisiel-Dorohinicki, M.: Evolutionary Multi-Agent Systems: From Inspirations to Applications. Springer. (2017)
8. Radley, N.: Multi-Agent Systems – Modeling, Control, Programming, Simulations and Applications. Scitus Academics LLC. (2017)
9. Yang, S., Xu, J.-X., Li, X., Shen, D.: Iterative Learning Control for Multi-Agent Systems Coordination. Wiley-IEEE Press. (2017)
10. Sun, Z.: Cooperative Coordination and Formation Control for Multi-Agent Systems. Springer. (2018)
11. Nazin, A.V., Poznyak, A.S.: Adaptive Choice of Variants: Recurrence Algorithms (in Russian). Moscow: Science. (1986)
12. Kaelbling, L., Littman, M.L., Moore, A.W.: Reinforcement Learning: A Survey. In: Journal of Artificial Intelligence Research, 4, 237 – 285. (1996)
13. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press. (1998)
14. Watkins, C.J.C.H., Dayan, P.: Q-Learning. In: Machine Learning, Kluwer Academic Publishers, Boston, 8, 279 – 292. (1992)
15. Ummels, M.: Stochastic Multiplayer Games: Theory and Algorithms. Amsterdam University Press. (2014)
16. Ungureanu, V.: Pareto-Nash-Stackelberg Game and Control Theory: Intelligent Paradigms and Applications. Springer. (2018)
17. Zheng, J., Cai, Y., Wu, Y., Shen, X.: Dynamic Computation Offloading for Mobile Cloud Computing: A Stochastic Game-Theoretic Approach. In: IEEE Transactions on Mobile Computing, 18(4), 771 – 786. (2018)
18. Chen, B.-S.: Stochastic Game Strategies and their Applications. CRC Press. (2019)
19. Kravets, P., Pasichnyk, V., Kunanets, N., Veretennikova, N.: Game Method of Event Synchronization in Multiagent Systems. In: Advances in Intelligent Systems and Computing (AISC), ICCSEEA 2019 – Proceedings, 938, 378 – 387. (2019)
20. Kravets, P., Burov, Y., Lytvyn, V., Vysotska, V.: Gaming Method of Ontology Clusterization. In: Webology, 16(1), 55 – 76. (2019)
21. Gao, J., You, F.: A Stochastic Game Theoretic Framework for Decentralized Optimization of Multi-Stakeholder Supply Chain Under Uncertainty. In: Computers & Chemical Engineering, 122, 31 – 46. (2019)
22. Lalropuia, K.C., Gupta, V.: Modeling Cyber-Physical Attacks Based on Stochastic Game and Markov Processes. In: Reliability Engineering & System Safety, 181, 28 – 37. (2019)
23. Lozovanu, D.: Pure and Mixed Stationary Nash Equilibria for Average Stochastic Positional Games. In: Frontiers of Dynamic Games, 131 – 155. (2019)
24. Garrec, T.: Communicating Zero-Sum Product Stochastic Games. In: Journal of Mathematical Analysis and Applications, 477(1), 60 – 84. (2019)
25. Kloosterman, A.: Cooperation in Stochastic Games: a Prisoner's Dilemma Experiment. In: Experimental Economics, 1 – 21. (2019)
26. Lin, X., Adams, S.C., Beling, P.A.: Multi-Agent Inverse Reinforcement Learning for Certain General-Sum Stochastic Games. In: Journal of Artificial Intelligence Research, 66, 473 – 502. (2019)
27. Saldi, N., Basar, T., Raginsky, M.: Approximate Nash Equilibria in Partially Observed Stochastic Games with Mean-Field Interactions. In: Mathematics of Operations Research, 44(3), 1006 – 1033. (2019)
28. Yamamoto, Y.: Stochastic Games with Hidden States. In: Theoretical Economics, 14, 1115 – 1167. (2019)
29. Fudenberg, D., Levine, D.K.: The Theory of Learning in Games. Cambridge. (1998)
30. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York. (2005)
31. Hu, J., Wellman, M.P.: Nash Q-learning for General-Sum Stochastic Games. In: Journal of Machine Learning Research, 4, 1039 – 1069. (2003)
32. Weinberg, M., Rosenschein, J.S.: Best-Response Multiagent Learning in Non-Stationary Environments. In: AAMAS'04, New York, USA. (2004)
33. Greenwald, A., Hall, K.: Correlated Q-learning. In: Proceedings of the Twentieth International Conference on Machine Learning, 242 – 249. (2003)
34. Kushner, H., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications. Springer Science & Business Media. (2013)
35. Neogy, S.K., Bapat, R.B., Dubey, D.: Mathematical Programming and Game Theory. Springer. (2018)