CEUR-WS Vol-2819, session2paper2 — https://ceur-ws.org/Vol-2819/session2paper2.pdf
         Parallel Algorithm for Approximating Nash Equilibrium in Multiplayer
            Stochastic Games with Application to Naval Strategic Planning*

                               Sam Ganzfried¹, Conner Laughlin², Charles Morefield²
                                        ¹Ganzfried Research   ²Arctan, Inc.




                        Abstract

Many real-world domains contain multiple agents behaving strategically with probabilistic transitions and uncertain (potentially infinite) duration. Such settings can be modeled as stochastic games. While algorithms have been developed for solving (i.e., computing a game-theoretic solution concept such as Nash equilibrium) two-player zero-sum stochastic games, research on algorithms for non-zero-sum and multiplayer stochastic games is limited. We present a new algorithm for these settings, which constitutes the first parallel algorithm for multiplayer stochastic games. We present experimental results on a 4-player stochastic game motivated by a naval strategic planning scenario, showing that our algorithm is able to quickly compute strategies constituting a Nash equilibrium up to a very small degree of approximation error.

                        Introduction

Nash equilibrium has emerged as the most compelling solution concept in multiagent strategic interactions. For two-player zero-sum (adversarial) games, a Nash equilibrium can be computed in polynomial time (e.g., by linear programming). This result holds both for simultaneous-move games (often represented as a matrix) and for sequential games of both perfect and imperfect information (often represented as an extensive-form game tree). However, for non-zero-sum games and games with 3 or more agents, it is PPAD-hard to compute a Nash equilibrium (even in the simultaneous-move case), and it is widely believed that no efficient algorithms exist (Chen and Deng 2005; 2006; Daskalakis, Goldberg, and Papadimitriou 2009). For simultaneous (strategic-form) games, several approaches have been developed with varying degrees of success (Berg and Sandholm 2017; Porter, Nudelman, and Shoham 2008; Govindan and Wilson 2003; Lemke and Howson 1964).

    * This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of AAAI Symposium on the 2nd Workshop on Deep Models and Artificial Intelligence for Defense Applications: Potentials, Theories, Practices, Tools, and Risks, November 11-12, 2020, Virtual, published at http://ceur-ws.org.
“A” (Approved for Public Release, Distribution Unlimited)

While extensive-form game trees can be used to model sequential actions of a known duration (e.g., repeating a simultaneous-move game for a specified number of iterations), they cannot model games of unknown duration, which can potentially contain infinite cycles between states. Such games must be modeled as stochastic games.

Definition 1. A stochastic game is a tuple (Q, N, A, P, r), where:
• Q is a finite set of (stage) games (aka game states)
• N is a finite set of n players
• A = A1 × . . . × An, where Ai is a finite set of actions available to player i
• P : Q × A × Q → [0, 1] is the transition probability function; P(q, a, q̂) is the probability of transitioning from state q to state q̂ after action profile a
• r = r1, . . . , rn, where ri : Q × A → R is a real-valued payoff function for player i

There are two commonly used methods for aggregating the stage game payoffs into an overall payoff: average (undiscounted) reward, and future discounted reward using a discount factor δ < 1. Stochastic games generalize several commonly-studied settings, including games with finite interactions, strategic-form games, repeated games, stopping games, and Markov decision problems.

The main solution concept for stochastic games, as for other game classes, is Nash equilibrium (i.e., a strategy profile for all players such that no player can profit by unilaterally deviating), though some works have considered alternative solution concepts such as correlated equilibrium and Stackelberg equilibrium. Before discussing algorithms, we point out that, unlike for other classes of games such as strategic-form and extensive-form games, a Nash equilibrium is not guaranteed to exist in general in stochastic games.

One theorem states that if the number of players, the action sets, and the set of states are all finite, then a stochastic game with a finite number of stages always has a Nash equilibrium (under both average and discounted reward). Another result shows that this is true for a game with infinitely many stages if total payoff is the discounted sum.

Often a subset of the full set of strategies is singled out, called stationary strategies. A strategy is stationary if it depends only on the current state (and not on the time step). Note that in general a strategy could play different strategies
at the same game state at different time steps, and a restriction to stationary strategies results in a massive reduction in the size of the strategy spaces to consider. It has been shown that in two-player discounted stochastic games there exists an equilibrium in stationary policies. For the undiscounted (average-reward) setting, it has recently been proven that each player has a strategy that is ε-optimal in the limit as ε → 0, technically called a uniform equilibrium, first for two-player zero-sum games (Mertens and Neyman 1981) and more recently for general-sum games (Vieille 2000).

Thus, overall, the prior results show that for two-player (zero-sum and non-zero-sum) games there exists an equilibrium in stationary strategies for the discounted reward model, and a uniform equilibrium for the average reward model. However, for more than two players, only the first of these is guaranteed, and it remains an open problem whether a (uniform) equilibrium exists in the undiscounted average-reward model. Perhaps this partially explains the scarcity of research on algorithms for multiplayer stochastic games.

Several stochastic game models have been proposed for national security settings. For example, two-player discounted models of adversarial patrolling have been considered, for which mixed-integer program formulations are solved to compute a Markov stationary Stackelberg equilibrium (Vorobeychik and Singh 2012; Vorobeychik et al. 2014). One work has applied an approach to approximate a correlated equilibrium in a three-player threat prediction game model (Chen et al. 2006). However, we are not aware of other prior research on settings with more than two players with guarantees on solution quality (or for computing Nash as opposed to Stackelberg or correlated equilibrium).

The only prior research we are aware of for computing Nash equilibria in multiplayer stochastic games consists of approaches developed for poker tournaments (Ganzfried and Sandholm 2008; 2009). Our algorithms are based on the approaches developed in that work. The model was a 3-player poker tournament, where each game state corresponded to a vector of stack sizes. The game had potentially infinite duration (e.g., if all players continue to fold, the game could continue arbitrarily long) and was modeled assuming no discount factor. Several algorithms were provided, with the best performer based on integrating fictitious play (FP) with a variant of policy iteration. While the algorithm is not guaranteed to converge, a technique was developed that computes the maximum a player could gain by deviating from the computed strategies, and it was verified that this value was low, demonstrating that the algorithm successfully computed a close approximation of Nash equilibrium. In addition to being multiplayer, this model also differed from previous models in that the stage games had imperfect information.

The main approaches from prior work on multiplayer stochastic game solving integrate algorithms for solving stage games (of imperfect information), assuming specified values for the payoffs of all players at transitions into other stage games, with techniques for updating the values for all players at all states in light of these newly computed strategies. For the stage game equilibrium computation these algorithms used fictitious play, which has been proven to converge to Nash equilibrium in certain classes of games (two-player zero-sum and certain non-zero-sum games). For multiplayer and non-zero-sum games it does not guarantee convergence to equilibrium, and all that can be proven is that if it does happen to converge, then the sequence of strategies determined by the iterations constitutes an equilibrium. It did happen to converge consistently in the 3-player application despite the fact that it is not guaranteed to do so, suggesting that it likely performs better in practice than the worst-case theory would dictate. For the value updating step, variants of value iteration and policy iteration (which are approaches for solving Markov decision processes) were used.

Note that there has been significant recent attention on an alternative iterative self-play algorithm known as counterfactual regret minimization (CFR). Like FP, CFR is proven to converge to a Nash equilibrium in the limit for two-player zero-sum games. For multiplayer and non-zero-sum games the algorithm can also be run, though the strategies computed are not guaranteed to form a Nash equilibrium. It was demonstrated that it does in fact converge to an ε-Nash equilibrium (a strategy profile in which no agent can gain more than ε by deviating) in the small game of three-player Kuhn poker, while it does not converge to equilibrium in Leduc hold ’em (Abou Risk and Szafron 2010). It was subsequently proven that it guarantees converging to a strategy that is not dominated and does not put any weight on iteratively weakly-dominated actions (Gibson 2014). While for some small games this guarantee can be very useful (e.g., for two-player Kuhn poker a high fraction of the actions are iteratively weakly-dominated), in many large games (such as full Texas hold ’em) only a very small fraction of actions are dominated, and the guarantee is not useful (Ganzfried 2019). Very recently an agent based on CFR has defeated strong human players in a multiplayer poker cash game¹ (Brown and Sandholm 2019). However, much of the strength of the agent came from real-time solving of smaller portions of the game, which typically contained just two players, using “endgame”/“subgame” solving (Ganzfried and Sandholm 2015) and more recently depth-limited “midgame” solving (Hu and Ganzfried 2017; Brown, Sandholm, and Amos 2018). Recently it has been shown that when integrated with deep learning a version of CFR outperforms FP in two-player zero-sum poker variants (Brown et al. 2019), though the core version of FP outperforms CFR in multiplayer and non-zero-sum settings (Ganzfried 2020).

In this work we build on the prior algorithms for multiplayer stochastic games to solve a 4-player model of naval strategic planning that we refer to as a Hostility Game. This is a novel model of national security that has been devised by a domain expert. The game is motivated by the Freedom of Navigation Scenario in the South China Sea, though we think it is likely also applicable to other situations, and in general that multiplayer stochastic games are fundamental for modeling national security scenarios.

    ¹ Note that a poker cash game is modeled as a standard extensive-form game, while the poker tournament described above is modeled as a stochastic game. In a cash game chips represent actual money, while in a tournament chips have no monetary value and are only a proxy, as players receive money only after they run out of chips (depending on their position of elimination).


                      Hostility Game

In the South China Sea a set of blue players attempts to navigate freely, while a set of red players attempts to obstruct this from occurring (Figure 1). In our model there is a single blue player and several red players of different “types” which may have different capabilities (we will specifically focus on the setting where there are three different types of red players). If a blue player and a subset of the red players happen to navigate to the same location, then a confrontation will ensue, which we refer to as a Hostility Game.

Figure 1: General figure for South China Sea scenario.

In a Hostility Game, each player can initially select from a number of available actions (between 7 and 10 for each player). Certain actions of the blue player are countered by certain actions of each of the red players, while others are not (Figure 2). Depending on whether the selected actions constitute a counter, there is some probability that the blue player wins the confrontation, some probability that the red players win, and some probability that the game repeats. Furthermore, each action of each player has an associated hostility level. Initially the game starts in a state of zero hostility, and if it is repeated then the overall hostility level increases by the sum of the hostilities of the selected actions. If the overall hostility level reaches a certain threshold (300), then the game goes into kinetic mode and all players receive a very low payoff (negative 200). If the game ends in a win for the blue player, then the blue player receives a payoff of 100 and the red players receive negative 100 (and vice versa for a red win). Note that the game repeats until either the blue/red players win or the game enters kinetic mode. A subset of the game’s actions and parameters is given in Figure 3. Note that in our model we assume that all red players act independently and do not coordinate their actions. Our game model and parameters were constructed from discussions with a domain expert.

Definition 2. A hostility game (HG) is a tuple G = (N, M, c, bD, bU, rD, rU, π, h, K, πK), where:
• N is the set of players. For our initial model we will assume player 1 is a blue player and players 2–4 are red players (P2 is a Warship, P3 is a Security ship, and P4 is an Auxiliary vessel).
• M = {Mi} is the set of actions, or moves, where Mi is the set of moves available to player i
• For mi ∈ Mi, c(mi) gives the set of blue moves that are counter moves of mi
• For each blue player move and red player, bD is the probability of blue success/red failure given that the move is defended against (i.e., countered)
• bU is the probability of blue success/red failure given that the move is undefended against
• rD is the probability of red success/blue failure given that the move is defended against
• rU is the probability of red success/blue failure given that the move is undefended against
• πi is the real-valued payoff for success for each player i
• h(mi) is the real-valued hostility level for each move mi
• K is the positive real-valued kinetic hostility threshold
• πiK is the real-valued payoff for each player i when the game goes into Kinetic mode

Figure 2: List of blue moves that counter each red move.

We model a hostility game G as a (4-player) stochastic game with a collection of stage games {Gn}, where n corresponds to the cumulative sum of hostility levels of actions played so far. The game has K + 3 states: G0, . . . , GK, with two additional terminal states B and R for blue and red victories. Depending on whether the blue move is countered, there is a probabilistic outcome for whether the blue player or red player (or neither) will outright win. The game transitions into terminal state B or R with these probabilities, and is then over with final payoffs. Otherwise, the game transitions into Gn′, where n′ is the new sum of the hostility levels. If the game reaches GK, the players obtain the kinetic payoff πiK. Thus, the game starts at initial state G0 and after a finite number of time steps will eventually reach one of the three terminal states B, R, or GK.

Note that in our formulation there is a finite number of players (4) as well as a finite number of states (K + 3). Furthermore, with the assumption that hostility levels for all actions are positive, the game must complete within a finite number of stages (because the combined hostility level will ultimately reach K if one of the terminal states B or R is not reached before then). So a Nash equilibrium is guaranteed to exist in stationary strategies, for both the average and discounted reward models. Note that the payoffs are only obtained in the final stage, when a terminal state is reached, and so the difference between using average and discounted reward is likely less significant than for games where rewards are frequently accumulated at different time steps.
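To make the transition structure above concrete, the following sketch simulates one stage transition. The code and all names in it (`step`, `p_blue_win`, `p_red_win`, etc.) are ours and purely illustrative, not the paper's implementation; in practice the outright-win probabilities would be derived from bD, bU, rD, rU and the counter relation c.

```python
import random

# Illustrative constants taken from the text: threshold K = 300, win payoff
# 100 (negated for the losing side), kinetic payoff -200 for all players.
K = 300
WIN_PAYOFF, KINETIC_PAYOFF = 100, -200

def step(n, blue_move, red_moves, hostility, p_blue_win, p_red_win, rng=random):
    """One stage-game transition from state G_n (hypothetical sketch).

    n          -- current cumulative hostility level (state index)
    hostility  -- dict mapping each move m to its hostility level h(m)
    p_blue_win -- probability blue wins the confrontation outright
    p_red_win  -- probability red wins the confrontation outright
    Returns ('B'|'R'|'K', blue payoff) for a terminal outcome, or
    ('G', n_prime) if the game repeats at the new hostility level n_prime.
    """
    x = rng.random()
    if x < p_blue_win:
        return ('B', WIN_PAYOFF)       # terminal state B: blue victory
    if x < p_blue_win + p_red_win:
        return ('R', -WIN_PAYOFF)      # terminal state R: red victory
    # Neither side wins: hostility accumulates over all selected actions.
    n_prime = n + hostility[blue_move] + sum(hostility[m] for m in red_moves)
    if n_prime >= K:
        return ('K', KINETIC_PAYOFF)   # kinetic mode G_K
    return ('G', n_prime)
```

Because every hostility level is positive, repeated calls to `step` must eventually reach `'B'`, `'R'`, or `'K'`, which is exactly the finiteness argument made above.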


                              Figure 3: Sample of typical actions and parameters for Hostility Game.


                         Algorithm

While research on algorithms for stochastic games with more than two players is limited, several prior algorithms have been devised and applied in the context of a poker tournament (Ganzfried and Sandholm 2008; 2009). At a high level these algorithms consist of two different components: the first is a game-solving algorithm that computes an (approximate) Nash equilibrium at each stage game assuming given values for all players at the other states, and the second is a value update procedure that updates the values for all players at all states in light of the newly-computed stage-game strategies. For the poker application the stage games were themselves games of imperfect information (the players must select a strategy for every possible set of private cards that they could hold at the given vector of chip stack sizes). The fictitious play algorithm was used for the game-solving step, which applies to games of both perfect and imperfect information. Fictitious play is an iterative self-play algorithm that has been proven to converge to Nash equilibrium in certain classes of games (two-player zero-sum and certain non-zero-sum games). For multiplayer and non-zero-sum games it does not guarantee convergence to equilibrium, and all that can be proven is that if it does happen to converge, then the sequence of strategies determined by the iterations constitutes an equilibrium (Theorem 1). It did happen to converge consistently in the 3-player application despite the fact that it is not guaranteed to do so, suggesting that it likely performs better in practice than the worst-case theory would dictate.

In (smoothed) fictitious play, each player i plays a best response to the average of the opponents’ strategies thus far, using the following rule at time t to obtain the current strategy:

    s_i^t = (1 − 1/t) · s_i^{t−1} + (1/t) · s_i'^t,

where s_i'^t is a best response of player i to the profile s_{−i}^{t−1} of the other players played at time t − 1 (strategies can be initialized arbitrarily at t = 0, and for our experiments we will initialize them to be uniformly random). This algorithm was originally developed as a simple learning model for repeated games, and was proven to converge to a Nash equilibrium in two-player zero-sum games (Fudenberg and Levine 1998). However, it is not guaranteed to converge in two-player general-sum games or games with more than two players. All that is known is that if it does converge, then the strategies constitute a Nash equilibrium (Theorem 1).

Theorem 1. (Fudenberg and Levine 1998) Under fictitious play, if the empirical distributions over each player’s choices converge, the strategy profile corresponding to the product of these distributions is a Nash equilibrium.

A meta-algorithm that integrates these two components—stage game solving and value updating—is depicted in Algorithm 1. We initialize the values at all states according to V0, and alternate between the phase of solving each nonterminal stage game using algorithm A (note that for certain applications it may even make sense to use a different stage game algorithm Ai for different states) and the value update phase using algorithm V. Following prior work we will use fictitious play for A and variants of value and policy iteration for V, though the meta-algorithm is general enough to allow for alternative choices depending on the setting.

Algorithm 1 Meta-algorithm for multiplayer stochastic game equilibrium computation
Inputs: Stochastic game G with set of terminal states {Tn} and set of U nonterminal states {Un}, algorithm for stage game equilibrium computation A, algorithm for updating values of all nonterminal states for all players V, number of iterations N, initial assignment of state values V0.

    Initialize values for all players for all nonterminal states according to V0.
    for n = 1 to N do
        for i = 1 to U do
            Solve stage game defined at Ui using algorithm A assuming values given by Vn−1.
            Let Si,n denote the equilibrium for state i.
        Update the values for all nonterminal states Ui according to algorithm V assuming that strategies Si,n are used at game state Ui.
    Output strategies {Si,N}
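As a concrete illustration of the smoothed fictitious play update rule above, the sketch below runs it on a two-player zero-sum matrix game, the setting in which convergence is guaranteed. This is our own illustrative code, not the paper's implementation; the function name and the iteration count are assumptions.

```python
import numpy as np

def fictitious_play(payoff_row, iters=5000):
    """Smoothed fictitious play on a two-player zero-sum matrix game.

    payoff_row -- payoff matrix for the row player (the column player
                  receives the negation).
    Each player best-responds to the opponent's current average strategy
    and mixes the best response in with weight 1/t:
        s^t = (1 - 1/t) * s^{t-1} + (1/t) * br^t.
    Returns the pair of average strategies.
    """
    m, n = payoff_row.shape
    s_row = np.full(m, 1.0 / m)   # uniform initialization at t = 0
    s_col = np.full(n, 1.0 / n)
    for t in range(1, iters + 1):
        # Pure best responses against the opponent's average from time t-1.
        br_row = np.zeros(m)
        br_row[np.argmax(payoff_row @ s_col)] = 1.0
        br_col = np.zeros(n)
        br_col[np.argmin(s_row @ payoff_row)] = 1.0
        # The smoothed update rule from the text.
        s_row = (1 - 1.0 / t) * s_row + (1.0 / t) * br_row
        s_col = (1 - 1.0 / t) * s_col + (1.0 / t) * br_col
    return s_row, s_col
```

On matching pennies, for example, both average strategies drift toward the equilibrium (0.5, 0.5); for the multiplayer stage games in this paper no such guarantee exists, which is exactly why the ex-post approximation check discussed later is needed.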


stage games and a multiplayer analogue of value itera-                gies that produced lowest approximation error  out of all
tion for updating values (Ganzfried and Sandholm 2008;                iterations (not just the final strategies after the last iteration).
2009). As originally implemented (Algorithm 2), the algo-                The next approach considered by prior work also used
rithm takes two inputs, which determine the stopping crite-           fictitious play for the stage-game solving phase but substi-
rion for the two phases. The fictitious play phase halts on           tuted in a variant of the policy-iteration algorithm (Algo-
a given state when no player can gain more than γ by de-              rithm 4) for value iteration in the value update phase. This
viating from the strategies (i.e., the strategies constitute a        algorithm called PI-FP is depicted in Algorithm 3. The new
γ-equilibrium), and the value iteration phase halts when all          values are computed by solving a system of equations de-
game state values for all players change by less than δ.              fined by a transition matrix. In effect this corresponds to up-
                                                                      dating all game state values globally to be consistent with the
Algorithm 2 VI-FP (Ganzfried and Sandholm 2009)                       recently-computed stage game strategies, while the value it-
                                                                      eration procedure updates the values locally given the prior
Inputs: Degree of desired stage game solution approxima-
                                                                      values of the adjacent states. Thus, at least intuitively we
tion γ, desired max difference between value updates δ
                                                                      would likely expect PI-FP to outperform VI-FP for this rea-
   V 0 = initializeValues()                                           son. Unlike VI-FP, for PI-FP it can be proven (Proposition 1)
   diff = ∞                                                           that if the algorithm converges then the resulting strategies
   i=0                                                                constitute a Nash equilibrium (regardless of the initializa-
   while diff > δ do                                                  tion). The experimental results of prior work agreed with
       i=i+1                                                          this intuition, as PI-FP converged to near-equilibrium faster
       regret = ∞                                                     than VI-FP (Ganzfried and Sandholm 2009). This was de-
       S = initializeStrategies()                                     termined by an ex-post checking procedure to compute the
       while regret > γ do                                            degree of approximation  given by Algorithm 5, with cor-
           S = fictPlay()                                             rectness following from Theorem 3 for Algorithm 4. The
           regret = maxRegret(S)                                                  π ∗ ,s∗
                                                                      quantity vi i −i (G0 ) denotes the value to player i at the ini-
       V i = getNewValues(V i−1 ,S)                                   tial game state when player i plays πi∗ and his opponents
       diff = maxDev(V i , V i−1 )                                                      s∗ ,s∗
   return S                                                           play s∗−i , and vi i   −i
                                                                                                  (G0 ) is analogous.

   Prior work used a domain-specific initialization for the           Algorithm 3 PI-FP (Ganzfried and Sandholm 2009)
values V^0 called the Independent Chip Model for poker tournaments (Ganzfried and Sandholm 2008). A counterexample was provided showing that VI-FP may actually converge to non-equilibrium strategies if a poor initialization is used (Ganzfried and Sandholm 2009), and it was suggested, based on a prior theorem for value iteration in single-agent Markov decision processes (MDPs), that this phenomenon can only occur if not all values are initialized pessimistically (Theorem 2). We note that there is not a well-defined notion of v^∗ in our setting, as multiplayer games can contain multiple Nash equilibria yielding different payoffs to the players.

Theorem 2. (Puterman 2005) In our setting, if v^0 is initialized pessimistically (i.e., ∀s, v^0(s) ≤ v^∗(s)), value iteration converges (pointwise and monotonically) to v^∗.

We also note that the prior work proposed just one option for a set of halting criteria for fictitious play and value iteration. Since fictitious play is not guaranteed to converge in multiplayer games, there is no guarantee that the approximation threshold γ will be reached for sufficiently small values (and similarly there is no guarantee that a value difference threshold of δ will be obtained for the outer loop). There are several other sensible choices of halting criteria, for example running the algorithms for a specified number of iterations, as we have done in our meta-algorithm, Algorithm 1. As we will see when we describe our parallel algorithm, this approach also allows for more consistency between the runtimes of computations on different cores. Another halting criterion for fictitious play is to run it for a specified number of iterations but output the average strategy.

Algorithm 3 PI-FP
Inputs: Degree of desired stage game solution approximation γ, desired max difference between value updates δ
   V^0 = initializeValues()
   diff = ∞
   i = 0
   while diff > δ do
      i = i + 1
      regret = ∞
      S^0 = initializeStrategies()
      while regret > γ do
         S^i = fictPlay()
         regret = maxRegret(S^i)
      M^i = createTransitionMatrix(S^i)
      V^i = evaluatePolicy(M^i)
      diff = maxDev(V^i, V^{i−1})
   return S^i

Proposition 1. If the sequence of strategies {s^n} determined by iterations of the outer loop of Algorithm 3 converges, then the final strategy profile s^∗ is an equilibrium.

Theorem 3. (Puterman 2005) Let S be the set of states in M. Suppose S and A(s) are finite. Let {v^n} denote the sequence of iterates of Algorithm 4. Then, for some finite N, v^N = v^∗ and π^N = π^∗.

Proposition 2. Algorithm 5 correctly computes the largest amount any agent can improve its expected utility by deviating from s^∗.
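The inner loop above (repeated fictPlay updates until maxRegret falls below γ) can be made concrete. The sketch below runs fictitious play on a toy zero-sum stage game; the rock-paper-scissors payoff matrix, parameter values, and function names are illustrative assumptions, not the Hostility Game itself:

```python
import numpy as np

# Toy stage game (illustrative assumption): rock-paper-scissors,
# zero-sum, A holds the row player's payoffs.
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])

def fictitious_play(A, gamma=0.05, max_iters=20_000):
    # initializeStrategies: each player starts on a pure strategy
    counts = [np.array([1., 0., 0.]), np.array([1., 0., 0.])]
    for _ in range(max_iters):
        s = [c / c.sum() for c in counts]     # empirical average strategies
        br0 = int(np.argmax(A @ s[1]))        # row player's best response
        br1 = int(np.argmax(-A.T @ s[0]))     # column player's best response (zero-sum)
        # maxRegret: the most any player could gain by deviating from s
        regret = max((A @ s[1])[br0] - s[0] @ A @ s[1],
                     (-A.T @ s[0])[br1] - s[1] @ (-A.T) @ s[0])
        if regret <= gamma:                   # approximation threshold γ reached
            break
        counts[0][br0] += 1                   # play best responses, update averages
        counts[1][br1] += 1
    return s, regret

s, regret = fictitious_play(A)
```

Since fictitious play is not guaranteed to converge in multiplayer games, the `max_iters` cap here plays the same role as the fixed-iteration halting criterion discussed above.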


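The outer-loop value update V^i = evaluatePolicy(M^i) amounts to policy evaluation on the Markov chain induced by the current strategy profile, i.e., solving v = r + P v. A minimal NumPy sketch (the 2-state transition matrix and reward vector are toy assumptions for illustration):

```python
import numpy as np

# evaluatePolicy sketch: transitions among non-terminal states under the
# current strategy profile (sub-stochastic; remaining probability mass
# reaches terminal states) and expected immediate rewards. Toy numbers.
P = np.array([[0.2, 0.3],
              [0.1, 0.4]])
r = np.array([10.0, -5.0])

# Solve v = r + P v, i.e. (I - P) v = r.
v = np.linalg.solve(np.eye(len(r)) - P, r)   # v = [10.0, -20/3]
```

A direct solve is appropriate when I − P is nonsingular; otherwise the minimal nonnegative solution is required, as in the positive bounded model of Algorithm 4.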
                                     “A” (Approved for Public Release, Distribution Unlimited)
Algorithm 4 Policy iteration for positive bounded models with expected total-reward criterion
1. Set n = 0 and initialize the policy π^0 so it has nonnegative expected reward.
2. Let v^n be the solution to the system of equations
      v(i) = r(i) + Σ_j p_{ij}^{π^n} v(j),
   where p_{ij}^{π^n} is the probability of moving from state i to state j under policy π^n. If there are multiple solutions, let v^n be the minimal nonnegative solution.
3. For each state s with action space A(s), set
      π^{n+1}(s) ∈ argmax_{a∈A(s)} Σ_j p_{sj}^a v^n(j),
   breaking ties so π^{n+1}(s) = π^n(s) whenever possible.
4. If π^{n+1}(s) = π^n(s) for all s, stop and set π^∗ = π^n. Otherwise increment n by 1 and return to Step 2.

Algorithm 5 Ex post check procedure
   Create MDP M from the strategy profile s^∗
   Run Algorithm 4 on M (using initial policy π^0 = s^∗) to get π^∗
   return max_{i∈N} [v_i^{(π_i^∗, s_{−i}^∗)}(G0) − v_i^{(s_i^∗, s_{−i}^∗)}(G0)]

Algorithm 6 Parallel PI-FP
Inputs: Stopping condition C_S for stage game solving, stopping condition C_V for value updating, number of cores d
   V^0 = initializeValues()
   i = 0
   while C_V not met do
      i = i + 1
      while C_S not met for each stage game do
         Run fictitious play on each stage game on d cores (solving d stage games simultaneously) to obtain S^i
      M^i = createTransitionMatrix(S^i)
      V^i = evaluatePolicy(M^i)
   return S^i

   The implementations of VI-FP and PI-FP in prior work both used a single core, and involved running fictitious play sequentially at every game state within the stage game update phase. We observe that both of these approaches can be parallelized. Assuming there are |S| states and d cores (and, for presentation simplicity, that |S| is a multiple of d), we can assign |S|/d of the stage games to each core and run fictitious play independently on d states simultaneously. This will compute equilibrium strategies at all stage games, which can be integrated with the value update phase of both VI-FP and PI-FP. Since the stage game solving phase is the bottleneck step of both algorithms, this parallel algorithm will achieve an approximately linear improvement in runtime by a factor of d. In addition to incorporating parallelization, our Algorithm 6 differs from the prior approach by allowing for custom stopping conditions for the two phases.
   We note that neither VI-FP nor PI-FP is guaranteed to converge in this setting (though it has been proven that if PI-FP converges then the resulting strategies constitute a Nash equilibrium (Ganzfried and Sandholm 2009)). Note that our Hostility Game does not technically fall into the positive bounded model (Puterman 2005), as certain actions can obtain negative payoff. However, the main difference between policy iteration for this model (Algorithm 4) and for the discounted reward model is the use of the minimal nonnegative solution in Step 2 (Puterman 2005); for all our experiments the transition matrix had full rank and there was a unique solution. Furthermore, in a Hostility Game the rewards are only obtained at a terminal state, and the total expected reward is clearly bounded (in both the positive and negative directions). So we can still apply these versions of value and policy iteration to (hopefully) obtain optimal solutions. Note also that for the case where all hostility levels are positive we can guarantee the game will complete within a finite duration and can apply backwards induction; but this will not work in general for the case of zero or negative hostilities, where the game has potentially infinite duration and the stochastic game-solving algorithms will be needed.

                        Experiments
Results for the first 25 iterations of several algorithm variations are given in Figure 4. All experiments ran the parallel versions of the algorithms with 6 cores on a laptop. The variations include VI-FP and PI-FP with varying numbers of iterations of fictitious play, as well as PI-FP using the version of fictitious play where the strategy with lowest exploitability over all iterations was output (as opposed to the final strategy). We first observe that VI-FP did not converge to equilibrium while all versions of PI-FP did, making PI-FP the clearly preferable choice. We also observe that using minimum-exploitability FP led to nearly identical performance to the standard version; since this version also takes longer, due to the overhead of computing the value of ε at every iteration instead of just at the end, we conclude that the standard version of fictitious play is preferable to the version that selects the iteration with minimal exploitability.
   For Parallel PI-FP using standard fictitious play, we compared results using 1,000, 5,000, 10,000, 20,000, 25,000, and 50,000 iterations of fictitious play for solving each game state within the inner loop of the algorithm. Each of these versions eventually converged to strategies with relatively low exploitability, with the convergence value of ε smaller as more iterations of FP are used. Note that initially we set values for all players at all non-terminal states to be zero, and that the terminal payoffs for a victory/loss are 100/−100, and the kinetic payoffs are −200 (with K = 300); so convergence to ε = 0.01 is quite good (this represents 0.01% of the minimum possible payoff of the game). Even just using 1,000 iterations of FP converged to ε of around 0.25, which is still relatively small. Note that while the final convergence values were quite low, there was quite a bit of variance in ε for the first several iterations, even for the versions with


                                      Figure 4: Performance of several algorithm variants.


large number of FP iterations (e.g., using 10,000–50,000 iterations spiked up to ε exceeding 20 at iteration 6, and using 20,000 and 25,000 spiked up again to ε exceeding 25 at iteration 13). So it is very important to ensure that the algorithm can be run long enough to obtain convergence.

                        Conclusion
We have presented a new parallel algorithm for solving multiplayer stochastic games, and presented experimental results showing that it is able to successfully compute an ε-equilibrium for very small ε for a naval strategic planning scenario that was devised by a domain expert.
   There are several immediate avenues for future study. First, we note that while the stage games in the game model we have experimented on have perfect information, our algorithm also applies to games where the stage games have imperfect information (related prior work has shown successful convergence in the imperfect-information setting for poker tournaments). There are several natural ways in which imperfect information can be integrated into the model. Currently we are exploring a model in which there is an unknown number of red “sub-players” of each of the three types; this value is known to a single “meta-player” of that type, but the other players only know a publicly-available distribution from which this value is drawn (much as in poker, where players receive private cards known only to themselves and a distribution over the cards of the opponents).
   We would also like to explore alternative approaches for the stage game equilibrium-computation portion of our algorithm. Currently we have used fictitious play, which has previously been demonstrated to obtain high performance. However, it may be outperformed by more recently-devised approaches such as counterfactual regret minimization. While the core version of FP has been shown to outperform CFR in multiplayer games (Ganzfried 2020), for larger domains with complex information structures CFR may outperform fictitious play by better capitalizing on integration with forms of Monte Carlo sampling and deep learning.
   While we considered a single value for the main game parameters (set of actions, payoffs, hostility levels, etc.), selected by a domain expert, in practice we may not be sure of such values, and we would like to compute strategies that are robust in case our game model is inaccurate. One approach to achieve this would be to use a Bayesian setting, where the game parameters are selected according to a specified probability distribution (typically over a small number of possible options). This would require us to extend our algorithm to solve multiplayer stochastic games where the stage games are themselves Bayesian games.
   While our model has assumed that the red players act independently and do not coordinate amongst themselves, this may not be the case in all realistic situations. In the extreme case when the red players are all controlled by one single meta-player, the game could simply be modeled as a two-player game (which would be zero-sum for the parameters we have been using); this would be significantly easier to solve, as two-player zero-sum games can be solved in polynomial time while solving multiplayer games is PPAD-hard. We see no reason that our algorithm cannot be applied to solve alternative modifications of the model that integrate more subtle forms of coordination between players.
   Our game model assumed that all hostility levels are positive, from which we are able to conclude the existence of a Nash equilibrium in stationary strategies (because the game would be guaranteed to have a finite number of stages); however, we could not make the same deduction if some hostility


levels are non-positive for the undiscounted setting (though we still could if we were using discounted reward). In the future we would like to explore convergence of our algorithm for different selections of the hostility levels, including zero and negative values, as well as consider potential differences between the average and discounted reward settings.
   By now we have observed fictitious play converge consistently for stage games in several domains (previously for poker tournaments and now for naval planning), and the general PI-FP algorithm converge for multiplayer stochastic games. Theoretically we have seen that these approaches are not guaranteed to converge in general for these game classes, and all that has been proven is that if they do converge then the computed strategies constitute a Nash equilibrium (though for VI-FP even this is not the case, and a counterexample was shown where it can converge to non-equilibrium strategies (Ganzfried and Sandholm 2009)). It would be interesting from a theoretical perspective to prove more general conditions under which these algorithms are guaranteed to converge in multiplayer settings, including generalizations of the settings that have been studied.
   Many important real-world settings contain multiple players interacting over an unknown duration with probabilistic transitions, and we feel that the multiplayer stochastic game model is fundamental for many national security domains, particularly with the ability of approaches to be integrated with imperfect information and Bayesian parameter uncertainty. We plan to explore the application of our algorithm to other similarly complicated domains in the near future.

                        References
Abou Risk, N., and Szafron, D. 2010. Using counterfactual regret minimization to create competitive multiplayer poker agents. In International Conference on Autonomous Agents and Multi-Agent Systems, 159–166.
Berg, K., and Sandholm, T. 2017. Exclusion method for finding Nash equilibrium in multiplayer games. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 383–389.
Brown, N., and Sandholm, T. 2019. Superhuman AI for multiplayer poker. Science 365:885–890.
Brown, N.; Lerer, A.; Gross, S.; and Sandholm, T. 2019. Deep counterfactual regret minimization. In Proceedings of the International Conference on Machine Learning (ICML).
Brown, N.; Sandholm, T.; and Amos, B. 2018. Depth-limited solving for imperfect-information games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
Chen, X., and Deng, X. 2005. 3-Nash is PPAD-complete. Electronic Colloquium on Computational Complexity Report No. 134:1–12.
Chen, X., and Deng, X. 2006. Settling the complexity of 2-player Nash equilibrium. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS).
Chen, G.; Shen, D.; Kwan, C.; Cruz, J.; Kruger, M.; and Blasch, E. 2006. Game theoretic approach to threat prediction and situation awareness. Journal of Advances in Information Fusion 2(1):1–14.
Daskalakis, C.; Goldberg, P.; and Papadimitriou, C. 2009. The complexity of computing a Nash equilibrium. SIAM Journal on Computing 1(39):195–259.
Fudenberg, D., and Levine, D. 1998. The Theory of Learning in Games. MIT Press.
Ganzfried, S., and Sandholm, T. 2008. Computing an approximate jam/fold equilibrium for 3-player no-limit Texas hold 'em tournaments. In International Conference on Autonomous Agents and Multi-Agent Systems.
Ganzfried, S., and Sandholm, T. 2009. Computing equilibria in multiplayer stochastic games of imperfect information. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI).
Ganzfried, S., and Sandholm, T. 2015. Endgame solving in large imperfect-information games. In International Conference on Autonomous Agents and Multi-Agent Systems.
Ganzfried, S. 2019. Mistakes in games. In Proceedings of the International Conference on Distributed Artificial Intelligence (DAI).
Ganzfried, S. 2020. Fictitious play outperforms counterfactual regret minimization. arXiv:2001.11165.
Gibson, R. 2014. Regret Minimization in Games and the Development of Champion Multiplayer Computer Poker-Playing Agents. Ph.D. Dissertation, University of Alberta.
Govindan, S., and Wilson, R. 2003. A global Newton method to compute Nash equilibria. Journal of Economic Theory 110:65–86.
Hu, K., and Ganzfried, S. 2017. Midgame solving: A new weapon for efficient large-scale equilibrium approximation. In IEEE International Conference on Tools with Artificial Intelligence.
Lemke, C., and Howson, J. 1964. Equilibrium points of bimatrix games. Journal of the Society of Industrial and Applied Mathematics 12:413–423.
Mertens, J.-F., and Neyman, A. 1981. Stochastic games. International Journal of Game Theory 10(2):53–66.
Porter, R.; Nudelman, E.; and Shoham, Y. 2008. Simple search methods for finding a Nash equilibrium. Games and Economic Behavior 63(2):642–662.
Puterman, M. L. 2005. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
Vieille, N. 2000. Two-player stochastic games I: A reduction. Israel Journal of Mathematics 119(1):55–91.
Vorobeychik, Y., and Singh, S. 2012. Computing Stackelberg equilibria in discounted stochastic games. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
Vorobeychik, Y.; An, B.; Tambe, M.; and Singh, S. 2014. Computing solutions in infinite-horizon discounted adversarial patrolling games. In International Conference on Automated Planning and Scheduling (ICAPS).

