<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Parallel Algorithm for Approximating Nash Equilibrium in Multiplayer Stochastic Games with Application to Naval Strategic Planning*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sam Ganzfried</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Conner Laughlin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles Morefield</string-name>
        </contrib>
        <aff>Ganzfried Research</aff>
        <aff>Arctan</aff>
      </contrib-group>
      <abstract>
        <p>Many real-world domains contain multiple agents behaving strategically with probabilistic transitions and uncertain (potentially infinite) duration. Such settings can be modeled as stochastic games. While algorithms have been developed for solving (i.e., computing a game-theoretic solution concept such as Nash equilibrium) two-player zero-sum stochastic games, research on algorithms for non-zero-sum and multiplayer stochastic games is limited. We present a new algorithm for these settings, which constitutes the first parallel algorithm for multiplayer stochastic games. We present experimental results on a 4-player stochastic game motivated by a naval strategic planning scenario, showing that our algorithm is able to quickly compute strategies constituting Nash equilibrium up to a very small degree of approximation error.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Nash equilibrium has emerged as the most compelling
solution concept in multiagent strategic interactions. For
two-player zero-sum (adversarial) games, a Nash equilibrium
can be computed in polynomial time (e.g., by linear
programming). This result holds both for simultaneous-move
games (often represented as a matrix), and for sequential
games of both perfect and imperfect information (often
represented as an extensive-form game tree). However, for
non-zero-sum games and games with 3 or more agents it is PPAD-hard
to compute a Nash equilibrium (even for the
simultaneous-move case), and it is widely believed that no efficient
algorithms exist
        <xref ref-type="bibr" rid="ref10 ref14 ref6">(Chen and Deng 2005; 2006; Daskalakis, Goldberg,
and Papadimitriou 2009)</xref>
        . For simultaneous (strategic-form)
games several approaches have been developed with
varying degrees of success
        <xref ref-type="bibr" rid="ref13 ref19 ref2 ref20 ref21 ref24">(Berg and Sandholm 2017; Porter,
Nudelman, and Shoham 2008; Govindan and Wilson 2003;
Lemke and Howson 1964)</xref>
        .
      </p>
      <p>*This research was developed with funding from the Defense
Advanced Research Projects Agency (DARPA). The views,
opinions and/or findings expressed are those of the author and should
not be interpreted as representing the official views or policies of
the Department of Defense or the U.S. Government.</p>
      <p>Copyright 2020 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY
4.0). In: Proceedings of AAAI Symposium on the 2nd Workshop
on Deep Models and Artificial Intelligence for Defense
Applications: Potentials, Theories, Practices, Tools, and Risks, November
11-12, 2020, Virtual, published at http://ceur-ws.org.</p>
      <p>While extensive-form game trees can be used to model
sequential actions of a known duration (e.g., repeating a
simultaneous-move game for a specified number of
iterations), they cannot model games of unknown duration,
which can potentially contain infinite cycles between states.
Such games must be modeled as stochastic games.
Definition 1. A stochastic game is a tuple (Q, N, A, P, r):
• Q is a finite set of (stage) games (aka game states)
• N is a finite set of n players
• A = A_1 × · · · × A_n, where A_i is a finite set of actions available to player i
• P : Q × A × Q → [0, 1] is the transition probability function; P(q, a, q̂) is the probability of transitioning from state q to state q̂ after action profile a
• R = r_1, . . . , r_n, where r_i : Q × A → ℝ is a real-valued payoff function for player i</p>
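      <p>To make the definition concrete, the tuple can be represented directly in code. The following Python sketch is our own illustration (the field names and the toy two-player game are not from the paper):</p>
      <preformat><![CDATA[
```python
# Minimal container for the stochastic-game tuple (Q, N, A, P, r) of
# Definition 1. Field names are illustrative, not from the paper's code.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class StochasticGame:
    states: List[str]                  # Q: finite set of stage games
    n_players: int                     # N: number of players
    actions: List[List[str]]           # A_i: finite action set per player
    # P(q, a, q'): probability of moving from q to q' under action profile a
    transition: Dict[Tuple[str, Tuple[str, ...], str], float]
    # r_i(q, a): real-valued payoff to player i
    reward: Callable[[int, str, Tuple[str, ...]], float]

# toy two-player example with one nonterminal state
game = StochasticGame(
    states=["q0", "win", "lose"],
    n_players=2,
    actions=[["a", "b"], ["x", "y"]],
    transition={("q0", ("a", "x"), "win"): 1.0},
    reward=lambda i, q, a: 1.0 if q == "win" else 0.0,
)
```
]]></preformat>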
      <p>There are two commonly used methods for
aggregating the stage game payoffs into an overall payoff: average
(undiscounted) reward and future discounted reward using a
discount factor γ &lt; 1. Stochastic games generalize several
commonly-studied settings, including games with finitely many
interactions, strategic-form games, repeated games, stopping
games, and Markov decision problems.</p>
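      <p>As a small illustration of the two aggregation methods (the payoff stream and discount factor below are made-up values):</p>
      <preformat><![CDATA[
```python
# Two common aggregations of a stream of stage-game payoffs:
# average (undiscounted) reward, and discounted reward with factor gamma.
def average_reward(payoffs):
    return sum(payoffs) / len(payoffs)

def discounted_reward(payoffs, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(payoffs))

payoffs = [1.0, 1.0, 1.0, 1.0]
avg = average_reward(payoffs)           # 1.0
disc = discounted_reward(payoffs, 0.5)  # 1 + 0.5 + 0.25 + 0.125 = 1.875
```
]]></preformat>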
      <p>The main solution concept for stochastic games, as for
other game classes, is Nash equilibrium (i.e., a strategy
profile for all players such that no player can profit by
unilaterally deviating), though some works have considered
alternative solution concepts such as correlated equilibrium and
Stackelberg equilibrium. Before discussing algorithms, we
point out that, unlike other classes of games such as
strategic and extensive-form, it is not guaranteed that Nash
equilibrium exists in general in stochastic games.</p>
      <p>One theorem states that if there is a finite number of
players and the action sets and the set of states are finite, then
a stochastic game with a finite number of stages always has
a Nash equilibrium (using both average and discounted
reward). Another result shows that this is true for a game with
infinitely many stages if total payoff is the discounted sum.</p>
      <p>
        Often a subset of the full set of strategies is singled out,
called stationary strategies. A strategy is stationary if it
depends only on the current state (and not on the time step).
Note that in general a player could play differently
at the same game state at different time steps, so a
restriction to stationary strategies results in a massive reduction in
the size of the strategy spaces to consider. It has been shown
that in two-player discounted stochastic games there exists
an equilibrium in stationary policies. For the undiscounted
(average-reward) setting, it has been proven that
each player has a strategy that is ε-optimal in the limit as
ε → 0, technically called a uniform equilibrium, first for
two-player zero-sum games
        <xref ref-type="bibr" rid="ref22">(Mertens and Neyman 1981)</xref>
        and more recently for general-sum games
        <xref ref-type="bibr" rid="ref26">(Vieille 2000)</xref>
        .
      </p>
      <p>Thus, overall, the prior results show that for two-player
(zero-sum and non-zero-sum) games there exists an
equilibrium in stationary strategies for the discounted reward
model, and a uniform equilibrium for the average reward
model. However, for more than two players, only the first of
these is guaranteed, and it remains an open problem whether
a (uniform) equilibrium exists in the undiscounted
average-reward model. Perhaps this partially explains the scarcity of
research on algorithms for multiplayer stochastic games.</p>
      <p>
        Several stochastic game models have been proposed for
national security settings. For example, two-player
discounted models of adversarial patrolling have been
considered, for which mixed-integer program formulations are
solved to compute a Markov stationary Stackelberg
equilibrium
        <xref ref-type="bibr" rid="ref27 ref28">(Vorobeychik and Singh 2012; Vorobeychik et al.
2014)</xref>
        . One work has applied an approach to approximate
a correlated equilibrium in a three-player threat prediction
game model
        <xref ref-type="bibr" rid="ref8 ref9">(Chen et al. 2006)</xref>
        . However, we are not aware
of other prior research on settings with more than two
players with guarantees on solution quality (or for computing
Nash as opposed to Stackelberg or correlated equilibrium).
      </p>
      <p>
        The only prior research we are aware of for computing
Nash equilibria in multiplayer stochastic games has been
approaches developed for poker tournaments
        <xref ref-type="bibr" rid="ref13">(Ganzfried and
Sandholm 2008; 2009)</xref>
        . Our algorithms are based on the
approaches developed in that work. The model was a 3-player
poker tournament, where each game state corresponded to a
vector of stack sizes. The game had potentially infinite
duration (e.g., if all players continue to fold the game could
continue arbitrarily long), and was modeled assuming no
discount factor. Several algorithms were provided, with the
best-performer based on integrating fictitious play (FP) with
a variant of policy iteration. While the algorithm is not
guaranteed to converge, a technique was developed that
computes the maximum a player could gain by deviating from
the computed strategies, and it was verified that this value
was low, demonstrating that the algorithm successfully
computed a close approximation of Nash equilibrium. In
addition to being multiplayer, this model also differed from
previous models in that stage games had imperfect information.
      </p>
      <p>The main approaches from prior work on multiplayer
stochastic game solving integrate algorithms for solving
stage games (of imperfect information) assuming specified
values for the payoffs of all players at transitions into other
stage games, and techniques for updating the values for all
players at all states in light of these newly computed
strategies. For the stage game equilibrium computation these
algorithms used fictitious play, which has been proven to
converge to Nash equilibrium in certain classes of games
(two-player zero-sum and certain non-zero-sum games). For
multiplayer and non-zero-sum games it does not guarantee
convergence to equilibrium, and all that can be proven is that if it
does happen to converge then the sequence of strategies
determined by the iterations constitutes an equilibrium. It did
happen to converge consistently in the 3-player application
despite the fact that it is not guaranteed to do so, suggesting
that it likely performs better in practice than the worst-case
theory would dictate. For the value updating step, variants
of value iteration and policy iteration (which are approaches
for solving Markov decision processes) were used.</p>
      <p>
        Note that there has been significant recent attention on an
alternative iterative self-play algorithm known as
counterfactual regret minimization (CFR). Like FP, CFR is proven
to converge to a Nash equilibrium in the limit for
two-player zero-sum games. For multiplayer and non-zero-sum
games the algorithm can also be run, though the
strategies computed are not guaranteed to form a Nash
equilibrium. It was demonstrated that it does in fact converge
to an ε-Nash equilibrium (a strategy profile in which no
agent can gain more than ε by deviating) in the small game
of three-player Kuhn poker, while it does not converge to
equilibrium in Leduc hold ’em
        <xref ref-type="bibr" rid="ref1">(Abou Risk and Szafron
2010)</xref>
        . It was subsequently proven that it guarantees
converging to a strategy that is not dominated and does not put
any weight on iteratively weakly-dominated actions
        <xref ref-type="bibr" rid="ref18">(Gibson
2014)</xref>
        . While for some small games this guarantee can be
very useful (e.g., for two-player Kuhn poker a high fraction
of the actions are iteratively-weakly-dominated), in many
large games (such as full Texas hold ’em) only a very small
fraction of actions are dominated, and the guarantee is not
useful
        <xref ref-type="bibr" rid="ref16">(Ganzfried 2019)</xref>
        . Very recently an agent based on
CFR has defeated strong human players in a multiplayer
poker cash game1
        <xref ref-type="bibr" rid="ref3 ref4">(Brown and Sandholm 2019)</xref>
        . However,
much of the strength of the agent came from real-time
solving of smaller portions of the game which typically
contained just two players using “endgame”/“subgame”
solving
        <xref ref-type="bibr" rid="ref15">(Ganzfried and Sandholm 2015)</xref>
        and more recently depth
limited “midgame” solving
        <xref ref-type="bibr" rid="ref2 ref20 ref5">(Hu and Ganzfried 2017; Brown,
Sandholm, and Amos 2018)</xref>
        . Recently it has been shown that
when integrated with deep learning a version of CFR
outperforms FP in two-player zero-sum poker variants
        <xref ref-type="bibr" rid="ref3 ref4">(Brown et
al. 2019)</xref>
        , though the core version of FP outperforms CFR in
multiplayer and non-zero-sum settings
        <xref ref-type="bibr" rid="ref17">(Ganzfried 2020)</xref>
        .
      </p>
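      <p>The regret-matching update underlying CFR can be sketched for a one-shot game. The following Python example runs regret matching with exact counterfactual values on matching pennies; the game, the seed regrets, and the iteration count are our own illustrative choices, not the paper's implementation:</p>
      <preformat><![CDATA[
```python
# Regret matching, the per-state update at the heart of CFR, run in
# self-play on matching pennies (a textbook zero-sum game, not from the
# paper). The average strategies approach the uniform Nash equilibrium.
M = [[1.0, -1.0], [-1.0, 1.0]]   # payoff matrix for player 0

def regret_matching_strategy(regrets):
    # play actions with probability proportional to positive regret
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    n = len(regrets)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / n] * n

def action_values(p, opp):
    # expected payoff of each pure action vs the opponent's mixed strategy
    if p == 0:
        return [sum(opp[b] * M[a][b] for b in range(2)) for a in range(2)]
    return [sum(opp[b] * (-M[b][a]) for b in range(2)) for a in range(2)]

regrets = [[1.0, 0.0], [0.0, 1.0]]   # small asymmetric seed starts the dynamics
strategy_sums = [[0.0, 0.0], [0.0, 0.0]]
T = 10000
for _ in range(T):
    strats = [regret_matching_strategy(r) for r in regrets]
    for p in range(2):
        vals = action_values(p, strats[1 - p])
        ev = sum(strats[p][a] * vals[a] for a in range(2))
        for a in range(2):
            regrets[p][a] += vals[a] - ev          # accumulate regret
            strategy_sums[p][a] += strats[p][a]    # accumulate average strategy
avg_strategy = [[s / T for s in row] for row in strategy_sums]
```
]]></preformat>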
      <p>In this work we build on the prior algorithms for
multiplayer stochastic games to solve a 4-player model of naval
strategic planning that we refer to as a Hostility Game. This
is a novel model of national security that has been devised
by a domain expert. The game is motivated by the Freedom
of Navigation Scenario in the South China Sea, though we
think it is likely also applicable to other situations, and in
general that multiplayer stochastic games are fundamental
for modeling national security scenarios.</p>
      <p>1Note that a poker cash game is modeled as a standard
extensive-form game, while the poker tournament described above
is modeled as a stochastic game. In a cash game chips represent
actual money, while in a tournament chips have no monetary value
and are only a proxy, as players receive money only after they run
out of chips (depending on their position of elimination).</p>
    </sec>
    <sec id="sec-2">
      <title>Hostility game</title>
      <p>In the South China Sea a set of blue players attempts to
navigate freely, while a set of red players attempt to obstruct
this from occurring (Figure 1). In our model there is a
single blue player and several red players of different “types”
which may have different capabilities (we will specifically
focus on the setting where there are three different types of
red players). If a blue player and a subset of the red players
happen to navigate to the same location, then a confrontation
will ensue, which we refer to as a Hostility Game.</p>
      <p>In a Hostility Game, each player can initially select from a
number of available actions (which is between 7 and 10 for
each player). Certain actions for the blue player are
countered by certain actions of each of the red players, while
others are not (Figure 2). Depending on whether the selected
actions constitute a counter, there is some probability that
the blue player wins the confrontation, some probability that
the red players win, and some probability that the game
repeats. Furthermore, each action of each player has an
associated hostility level. Initially the game starts in a state of zero
hostility, and if it is repeated then the overall hostility level
increases by the sum of the hostilities of the selected actions.
If the overall hostility level reaches a certain threshold (300),
then the game goes into kinetic mode and all players achieve
a very low payoff (negative 200). If the game ends in a win
for the blue player, then the blue player receives a payoff
of 100 and the red players receive negative 100 (and vice
versa for a red win). Note that the game repeats until either
the blue/red players win or the game enters kinetic mode. A
subset of the game’s actions and parameters are given in
Figure 3. Note that in our model we assume that all red players
act independently and do not coordinate their actions. Our
game model and parameters were constructed from
discussions with a domain expert.</p>
      <p>Definition 2. A hostility game (HG) is a tuple
G = (N, M, c, b_D, b_U, r_D, r_U, π, h, K, π^K), where
• N is the set of players. For our initial model we will assume player 1 is a blue player and players 2–4 are red players (P2 is a Warship, P3 is a Security ship, and P4 is an Auxiliary vessel).
• M = {M_i} is the set of actions, or moves, where M_i is the set of moves available to player i
• For m_i ∈ M_i, c(m_i) gives the set of blue moves that are counter moves of m_i
• b_D is the probability of a blue success/red failure, for each blue player move and red player, given that the move is defended against (i.e., countered)
• b_U is the probability of a blue success/red failure given that the move is undefended against
• r_D is the probability of a red success/blue failure given that the move is defended against
• r_U is the probability of a red success/blue failure given that the move is undefended against
• π_i is the real-valued payoff for success for each player i
• h(m_i) is the real-valued hostility level of each move m_i
• K is the positive real-valued kinetic hostility threshold
• π_i^K is the real-valued payoff for each player i when the game goes into kinetic mode</p>
      <p>We model hostility game G as a (4-player) stochastic
game with a collection of stage games {G_n}, where n
corresponds to the cumulative sum of hostility levels of actions
played so far. The game has K + 3 states: G_0, . . . , G_K, with
two additional terminal states B and R for blue and red
victories. Depending on whether the blue move is countered,
there is a probabilistic outcome for whether the blue player
or red player (or neither) will outright win. The game will
then transition into terminal states B or R with these
probabilities, and will then be over with final payoffs. Otherwise,
the game transitions into G_{n'}, where n' is the new sum of the
hostility levels. If the game reaches G_K, the players obtain
the kinetic payoff π_i^K. Thus, the game starts at initial state
G_0 and after a finite number of time steps will eventually
reach one of the three terminal states B, R, or G_K.</p>
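      <p>The state dynamics described above can be illustrated with a simple simulation. In the sketch below only the threshold K = 300 and the payoffs (100, −100, −200) come from the text; the per-round win probabilities and the per-round hostility increment are hypothetical placeholders:</p>
      <preformat><![CDATA[
```python
import random

# One simulated playthrough of the Hostility Game dynamics: hostility
# accumulates with each repetition until a blue win, a red win, or the
# kinetic threshold K = 300 is reached.
K = 300
BLUE_WIN, RED_WIN, KINETIC = 100.0, -100.0, -200.0

def play_round(rng):
    # hypothetical outcome distribution for one confrontation round
    p_blue_win, p_red_win = 0.1, 0.1   # remaining 0.8: the game repeats
    roll = rng.random()
    if roll < p_blue_win:
        return "B"
    if roll < p_blue_win + p_red_win:
        return "R"
    return None

def simulate(rng):
    hostility = 0
    while hostility < K:
        outcome = play_round(rng)
        if outcome == "B":
            return BLUE_WIN            # blue payoff; red players get -100
        if outcome == "R":
            return RED_WIN
        hostility += 40                # hypothetical sum of action hostilities
    return KINETIC                     # state G_K reached

rng = random.Random(0)
payoffs = [simulate(rng) for _ in range(1000)]
```
]]></preformat>
      <p>Because each repetition adds positive hostility, every playthrough terminates after finitely many rounds, matching the finiteness argument above.</p>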
      <p>
        Note that in our formulation there is a finite number of
players (4) as well as a finite number of states (K + 3).
Furthermore, with the assumption that hostility levels for all
actions are positive, the game must complete within a finite
number of stages (because the combined hostility level will
ultimately reach K if one of the terminal states B or R is
not reached before then). So a Nash equilibrium is
guaranteed to exist in stationary strategies, for both the average and
discounted reward models. Note that the payoffs are only
obtained in the final stage when a terminal state is reached, and
so the difference between using average and discounted
reward is likely less significant than for games where rewards
are frequently accumulated within different time steps.
While research on algorithms for stochastic games with
more than two players is limited, several prior algorithms
have been devised and applied in the context of a poker
tournament
        <xref ref-type="bibr" rid="ref13">(Ganzfried and Sandholm 2008; 2009)</xref>
        . At a high
level these algorithms consist of two different components:
first is a game-solving algorithm that computes an
(approximate) Nash equilibrium at each stage game assuming given
values for all players at the other states, and the second is a
value update procedure that updates values for all players at
all states in light of the newly-computed stage-game
strategies. For the poker application the stage games were
themselves games of imperfect information (the players must
select a strategy for every possible set of private cards that they
could hold at the given vector of chip stack sizes). The
fictitious play algorithm was used for the game-solving step,
which applies both to games of perfect and imperfect
information. Fictitious play is an iterative self-play algorithm
that has been proven to converge to Nash equilibrium in
certain classes of games (two-player zero-sum and certain
non-zero-sum games). For multiplayer and non-zero-sum games it does
not guarantee convergence to equilibrium, and all that can
be proven is that if it does happen to converge, then the
sequence of strategies determined by the iterations constitutes
an equilibrium (Theorem 1). It did happen to converge
consistently in the 3-player application despite the fact that it is
not guaranteed to do so, suggesting that it likely performs
better in practice than the worst-case theory would dictate.
      </p>
      <p>
        In (smoothed) fictitious play each player i plays a best
response to the average opponents' strategies thus far, using
the following rule at time t to obtain the current strategy:
s_i^t = (1 − 1/t) s_i^{t−1} + (1/t) s_i'^t,
where s_i'^t is a best response of player i to the profile s_{−i}^{t−1}
of the other players played at time t − 1 (strategies can
be initialized arbitrarily at t = 0, and for our experiments
we will initialize them to be uniformly random). This
algorithm was originally developed as a simple learning model
for repeated games, and was proven to converge to a Nash
equilibrium in two-player zero-sum games
        <xref ref-type="bibr" rid="ref12">(Fudenberg and
Levine 1998)</xref>
        . However, it is not guaranteed to converge in
two-player general-sum games or games with more than two
players. All that is known is that if it does converge, then the
strategies constitute a Nash equilibrium (Theorem 1).
Theorem 1.
        <xref ref-type="bibr" rid="ref12">(Fudenberg and Levine 1998)</xref>
        Under
fictitious play, if the empirical distributions over each player’s
choices converge, the strategy profile corresponding to the
product of these distributions is a Nash equilibrium.
      </p>
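      <p>As an illustration of this update rule, the following Python sketch runs smoothed fictitious play on matching pennies, a standard two-player zero-sum example; the game and the iteration count are our own choices, not from the paper:</p>
      <preformat><![CDATA[
```python
# Smoothed fictitious play: s_i^t = (1 - 1/t) s_i^{t-1} + (1/t) s_i'^t,
# where s_i'^t is a pure best response to the opponent's strategy at t-1.
# Illustrated on matching pennies (not a game from the paper).
M = [[1.0, -1.0], [-1.0, 1.0]]   # payoff matrix for player 0

def best_response(p, opp):
    # pure best response of player p to the opponent's mixed strategy
    if p == 0:
        vals = [sum(opp[b] * M[a][b] for b in range(2)) for a in range(2)]
    else:
        vals = [sum(opp[b] * (-M[b][a]) for b in range(2)) for a in range(2)]
    best = max(range(2), key=lambda a: vals[a])
    return [1.0 if a == best else 0.0 for a in range(2)]

strategies = [[0.5, 0.5], [0.5, 0.5]]   # uniform initialization at t = 0
for t in range(1, 5001):
    brs = [best_response(p, strategies[1 - p]) for p in range(2)]
    for p in range(2):
        strategies[p] = [(1 - 1 / t) * strategies[p][a] + (1 / t) * brs[p][a]
                         for a in range(2)]
```
]]></preformat>
      <p>Since each s_i^t in this formulation is itself the running average of best responses, the strategies approach the unique equilibrium in which both players mix uniformly.</p>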
      <p>A meta-algorithm that integrates these two components—
stage game solving and value updating—is depicted in
Algorithm 1. We initialize the values at all states according to
V0, and alternate between the phase of solving each
nonterminal stage game using algorithm A (note that for certain
applications it may even make sense to use a different stage
game algorithm Ai for different states), and the value update
phase using algorithm V . Following prior work we will be
using fictitious play for A and variants of value and policy
iteration for V , though the meta-algorithm is general enough
to allow for alternative choices depending on the setting.
Algorithm 1 Meta-algorithm for multiplayer stochastic
game equilibrium computation
Inputs: Stochastic game G with set of terminal states {T_n}
and set of U nonterminal states {U_n}, algorithm for stage
game equilibrium computation A, algorithm for updating
values of all nonterminal states for all players V, number
of iterations N, initial assignment of state values V_0.</p>
      <p>Initialize values for all players for all nonterminal states according to V_0.
for n = 1 to N do
    for i = 1 to U do
        Solve stage game defined at U_i using algorithm A assuming values given by V_{n−1}.
        Let S_{i,n} denote the equilibrium for state i.
    Update the values for all nonterminal states U_i according to algorithm V assuming that strategies S_{i,n} are used at game state U_i.
Output strategies {S_{i,N}}</p>
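      <p>The structure of Algorithm 1 can be sketched as follows; solve_stage and update_values are trivial placeholders standing in for the algorithms A and V (the paper uses fictitious play and value/policy iteration), and the toy instantiation is our own:</p>
      <preformat><![CDATA[
```python
# Skeleton of the meta-algorithm: alternate a stage-game solving phase
# over all nonterminal states with a value-update phase.
def meta_algorithm(nonterminal_states, solve_stage, update_values, v0, num_iters):
    values = dict(v0)
    strategies = {}
    for n in range(num_iters):
        for state in nonterminal_states:        # parallelizable loop
            strategies[state] = solve_stage(state, values)
        values = update_values(nonterminal_states, strategies, values)
    return strategies

# toy instantiation: one state, placeholder "solver" and value update
states = ["G0"]
solve = lambda s, v: {"act": v[s]}              # records the value it saw
update = lambda ss, strat, v: {s: v[s] * 0.5 for s in ss}
result = meta_algorithm(states, solve, update, {"G0": 1.0}, 3)
```
]]></preformat>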
      <p>
        The first algorithm previously considered, called VI-FP,
instantiates Algorithm 1 using fictitious play for solving
stage games and a multiplayer analogue of value
iteration for updating values
        <xref ref-type="bibr" rid="ref13">(Ganzfried and Sandholm 2008;
2009)</xref>
        . As originally implemented (Algorithm 2), the
algorithm takes two inputs, which determine the stopping
criteria for the two phases. The fictitious play phase halts on
a given state when no player can gain more than ε by
deviating from the strategies (i.e., the strategies constitute an
ε-equilibrium), and the value iteration phase halts when all
game state values for all players change by less than δ.
Algorithm 2 VI-FP
        <xref ref-type="bibr" rid="ref14">(Ganzfried and Sandholm 2009)</xref>
Inputs: Degree of desired stage game solution approximation ε, desired max difference between value updates δ
V_0 = initializeValues()
diff = ∞
i = 0
while diff &gt; δ do
    i = i + 1
    regret = ∞
    S = initializeStrategies()
    while regret &gt; ε do
        S = fictPlay()
        regret = maxRegret(S)
    V_i = getNewValues(V_{i−1}, S)
    diff = maxDev(V_i, V_{i−1})
return S
      </p>
      <p>
        Prior work used a domain-specific initialization for the
values V 0 called the Independent Chip Model for poker
tournaments
        <xref ref-type="bibr" rid="ref13">(Ganzfried and Sandholm 2008)</xref>
        . A
counterexample was provided showing that VI-FP may actually
converge to non-equilibrium strategies if a poor initialization is
used
        <xref ref-type="bibr" rid="ref14">(Ganzfried and Sandholm 2009)</xref>
        , and it was suggested
based on a prior theorem for value iteration in single-agent
Markov decision processes (MDPs) that this phenomenon
can only occur if not all values are initialized pessimistically
(Theorem 2). We note that there is not a well-defined notion
of v* in our setting, as multiplayer games can contain
multiple Nash equilibria yielding different payoffs to the players.
Theorem 2.
        <xref ref-type="bibr" rid="ref25">(Puterman 2005)</xref>
        In our setting, if v_0 is
initialized pessimistically (i.e., ∀s, v_0(s) ≤ v*(s)), value iteration
converges (pointwise and monotonically) to v*.
      </p>
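      <p>Theorem 2 can be illustrated on a single-agent example. The tiny MDP below is a made-up illustration, not the Hostility Game: initialized pessimistically at 0, the iterates increase monotonically toward the true value of 10:</p>
      <preformat><![CDATA[
```python
# Value iteration on a tiny single-agent MDP with pessimistic
# initialization (v0 below the true values), showing the monotone
# convergence of Theorem 2.
# States: "s" (nonterminal), "T" (terminal, reward 10 on entry).
# From "s", the only action reaches T w.p. 0.5 and stays in s w.p. 0.5,
# so the true value satisfies v*(s) = 0.5 * 10 + 0.5 * v*(s) = 10.
def value_iteration(v0, iters):
    v = dict(v0)
    for _ in range(iters):
        v["s"] = 0.5 * 10.0 + 0.5 * v["s"]   # Bellman backup at state s
        # terminal value stays 0: the reward is collected on entry
    return v

v = value_iteration({"s": 0.0, "T": 0.0}, 50)
# v["s"] increases monotonically toward the true value 10
```
]]></preformat>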
      <p>We also note that the prior work proposed just one option
for a set of halting criteria for fictitious play and value
iteration. Since fictitious play is not guaranteed to converge
in multiplayer games there is no guarantee that the
approximation threshold of will be reached for sufficiently small
values (and similarly there is no guarantee that a value
difference threshold of will be obtained for the outer loop).
There are several other sensible choices of halting criteria,
for example running the algorithms for a specified number
of iterations as we have done in our meta-algorithm,
Algorithm 1. As we will see when we describe our parallel
algorithm, this approach would also allow for more consistency
between the runtimes of computations on different cores.
Another halting criterion for fictitious play is to run it for a
specified number of iterations but output the average
strategies that produced lowest approximation error out of all
iterations (not just the final strategies after the last iteration).</p>
      <p>
        The next approach considered by prior work also used
fictitious play for the stage-game solving phase but
substituted in a variant of the policy-iteration algorithm
(Algorithm 4) for value iteration in the value update phase. This
algorithm called PI-FP is depicted in Algorithm 3. The new
values are computed by solving a system of equations
defined by a transition matrix. In effect this corresponds to
updating all game state values globally to be consistent with the
recently-computed stage game strategies, while the value
iteration procedure updates the values locally given the prior
values of the adjacent states. Thus, at least intuitively we
would likely expect PI-FP to outperform VI-FP for this
reason. Unlike VI-FP, for PI-FP it can be proven (Proposition 1)
that if the algorithm converges then the resulting strategies
constitute a Nash equilibrium (regardless of the
initialization). The experimental results of prior work agreed with
this intuition, as PI-FP converged to near-equilibrium faster
than VI-FP
        <xref ref-type="bibr" rid="ref14">(Ganzfried and Sandholm 2009)</xref>
        . This was
determined by an ex-post checking procedure to compute the
degree of approximation given by Algorithm 5, with
correctness following from Theorem 3 for Algorithm 4. The
quantity v_i^{π*_i, s*_{−i}}(G_0) denotes the value to player i at the
initial game state when player i plays π*_i and his opponents
play s*_{−i}, and v_i^{s*_i, s*_{−i}}(G_0) is analogous.
      </p>
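      <p>The value-update step of PI-FP amounts to solving the linear system V = r + MV, i.e., (I − M)V = r, where M is the transition matrix among nonterminal states induced by the computed strategies and r holds the expected one-step payoffs from transitions to terminal states. The sketch below solves a hypothetical two-state instance in closed form; a real implementation would use a general linear solver, and all numbers are made up:</p>
      <preformat><![CDATA[
```python
# PI-FP's evaluatePolicy step: solve (I - M) V = r for the values of the
# nonterminal states. A 2x2 system is solved by Cramer's rule; larger
# games would use a linear-algebra routine.
def evaluate_policy_2x2(M, r):
    a, b = 1.0 - M[0][0], -M[0][1]
    c, d = -M[1][0], 1.0 - M[1][1]
    det = a * d - b * c                      # assumes the system is nonsingular
    return [(r[0] * d - b * r[1]) / det, (a * r[1] - c * r[0]) / det]

# e.g. from state 0 the game stays at state 0 w.p. 0.2, moves to state 1
# w.p. 0.5, and otherwise ends with expected terminal payoff 9.0
M = [[0.2, 0.5], [0.1, 0.3]]
r = [9.0, 6.0]
V = evaluate_policy_2x2(M, r)
```
]]></preformat>
      <p>The solution satisfies the fixed-point property V = r + MV, which is exactly the global consistency condition the value update enforces.</p>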
      <sec id="sec-2-1">
        <title>Algorithm 3 PI-FP (Ganzfried and Sandholm 2009)</title>
        <p>Inputs: Degree of desired stage game solution approximation ε, desired max difference between value updates δ
V_0 = initializeValues()
diff = ∞
i = 0
while diff &gt; δ do
    i = i + 1
    regret = ∞
    S_0 = initializeStrategies()
    while regret &gt; ε do
        S_i = fictPlay()
        regret = maxRegret(S_i)
    M_i = createTransitionMatrix(S_i)
    V_i = evaluatePolicy(M_i)
    diff = maxDev(V_i, V_{i−1})
return S_i</p>
        <p>Proposition 1. If the sequence of strategies {s_n}
determined by iterations of the outer loop of Algorithm 3
converges, then the final strategy profile s* is an equilibrium.
Theorem 3.
          <xref ref-type="bibr" rid="ref25">(Puterman 2005)</xref>
          Let S be the set of states
in M. Suppose S and A(s) are finite. Let {v_n} denote the
sequence of iterates of Algorithm 4. Then, for some finite N,
v_N = v* and π_N = π*.
        </p>
        <p>Proposition 2. Algorithm 5 correctly computes the largest
amount any agent can improve its expected utility by
deviating from s*.</p>
        <p>Algorithm 4 Policy iteration for positive bounded models
with expected total-reward criterion
1. Set n = 0 and initialize the policy π_0 so it has nonnegative expected reward.
2. Let v_n be the solution to the system of equations
    v(i) = r(i) + Σ_j p_{ij}^{π_n} v(j)
where p_{ij}^{π_n} is the probability of moving from state i to state j under policy π_n. If there are multiple solutions, let v_n be the minimal nonnegative solution.
3. For each state s with action space A(s), set
    π_{n+1}(s) ∈ argmax_{a∈A(s)} Σ_j p_{sj}^a v_n(j),
breaking ties so π_{n+1}(s) = π_n(s) whenever possible.
4. If π_{n+1}(s) = π_n(s) for all s, stop and set π* = π_n. Otherwise increment n by 1 and return to Step 2.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Algorithm 5 Ex post check procedure</title>
        <p>Create MDP M from the strategy profile s
Run Algorithm 4 on M (using initial policy 0 = s ) to get
return maxi2N hvi i ;s i (G0) visi ;s i (G0)i</p>
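        <p>The idea behind the ex post check can be sketched as follows: fixing the other players' strategies turns the stochastic game into an MDP for player i, and the approximation error is the gap between the best-response value and the value of the computed strategy. The one-state MDP and the candidate strategy below are hypothetical:</p>
        <preformat><![CDATA[
```python
# Ex post check sketch on a hypothetical MDP for one player.
# From state G0, action "a" ends the game with payoff 4;
# action "b" ends with payoff 10 w.p. 0.5 and repeats G0 w.p. 0.5.
def action_value(action, v):
    if action == "a":
        return 4.0
    return 0.5 * 10.0 + 0.5 * v          # action "b"

def best_response_value(iters=100):
    # value iteration with a max over actions: the best-response value
    v = 0.0
    for _ in range(iters):
        v = max(action_value("a", v), action_value("b", v))
    return v

def strategy_value(prob_a, iters=100):
    # value of following the fixed (mixed) candidate strategy
    v = 0.0
    for _ in range(iters):
        v = prob_a * action_value("a", v) + (1 - prob_a) * action_value("b", v)
    return v

s_value = strategy_value(prob_a=0.5)     # candidate strategy mixes a, b equally
gain = best_response_value() - s_value   # deviation gain at G0
```
]]></preformat>
        <p>A gain near zero would certify that the candidate strategy is a close approximation of an equilibrium strategy for this player; here the candidate is deliberately exploitable.</p>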
        <p>The implementations of VI-FP and PI-FP in prior work
both used a single core, and involved running fictitious play
sequentially at every game state within the stage game
update phase. We observe that both of these approaches can be
parallelized. Assuming there are |S| states and d cores (and
for presentation simplicity assuming that |S| is a multiple of
d), we can assign |S|/d of the stage games to each core and
run fictitious play independently on d states simultaneously.
This will compute equilibrium strategies at all stage games,
which can be integrated with the value update phase of both
VI-FP and PI-FP. Since the stage game solving phase is the
bottleneck step of both algorithms, this parallel algorithm
will achieve an approximately linear improvement in
runtime by a factor of d. In addition to incorporating
parallelization, our Algorithm 6 differs from the prior approach by
allowing for custom stopping conditions for the two phases.</p>
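        <p>A sketch of the parallel stage-game phase: since the stage games are independent given the current values, they can be mapped across d workers. A thread pool keeps the example self-contained; an actual multi-core implementation would use separate processes, and solve_stage is a trivial stand-in for fictitious play:</p>
        <preformat><![CDATA[
```python
from concurrent.futures import ThreadPoolExecutor

# Split the stage games across d workers and solve them independently,
# as in the parallel stage-game solving phase described above.
def solve_stage(state, values):
    # placeholder for running fictitious play at one stage game
    return {"state": state, "value_seen": values[state]}

def solve_all_stages_parallel(states, values, d):
    with ThreadPoolExecutor(max_workers=d) as pool:
        results = list(pool.map(lambda s: solve_stage(s, values), states))
    return {res["state"]: res for res in results}

states = ["G{}".format(i) for i in range(6)]
values = {s: float(i) for i, s in enumerate(states)}
strategies = solve_all_stages_parallel(states, values, d=3)
```
]]></preformat>
        <p>Because the stage-game solving phase is the bottleneck, this division of work is what yields the approximately linear speedup in d noted above.</p>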
        <p>
          We note that neither VI-FP nor PI-FP is guaranteed to
converge in this setting (though it has been proven that if
PI-FP converges then the resulting strategies constitute a Nash
equilibrium
          <xref ref-type="bibr" rid="ref14">(Ganzfried and Sandholm 2009)</xref>
          ). Note that our
Hostility Game does not technically fall into the positive
bounded model
          <xref ref-type="bibr" rid="ref25">(Puterman 2005)</xref>
          , as certain actions can
obtain negative payoff. However, the main difference between
policy iteration for this model (Algorithm 4) as opposed to
the discounted reward model is using the minimal
nonnegative solution for Step 2
          <xref ref-type="bibr" rid="ref25">(Puterman 2005)</xref>
          ; however, for all
our experiments the transition matrix had full rank and there
was a unique solution. Furthermore, in a Hostility Game the
rewards are only obtained at a terminal state, and the total
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Algorithm 6 Parallel PI-FP</title>
        <p>Inputs: Stopping condition CS for stage game solving,
stopping condition CV for value updating, number of cores d
V 0 = initializeValues()
i = 0
while CV not met do
i = i + 1
while CS not met for each stage game do</p>
        <p>Run fictitious play on each stage game on d cores
(solving d stage games simultaneously) to obtain Si
M i = createTransitionMatrix(Si)</p>
        <p>V i = evaluatePolicy(M i)
return Si
expected reward is clearly bounded (both in the positive and
negative directions). So we can still apply these versions of
value and policy iteration to (hopefully) obtain optimal
solutions. Note also that for the case where all hostility levels
are positive we can guarantee the game will complete within
a finite duration and can apply backwards induction; but this
will not work in general for the case of zero or negative
hostilities where the game has potentially infinite duration, and
the stochastic game-solving algorithms will be needed.</p>
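        <p>The structure of the Parallel PI-FP outer loop can be sketched on a toy stand-in game; all names and the game itself are hypothetical illustrations of the alternating stage-game and value-update phases, not the Hostility Game, and the per-core parallelism of the stage-game phase is omitted for brevity.</p>

```python
def solve_stage_game(state, values):
    # Stand-in for the fictitious play stage-game phase at one state.
    return {"continue": 0.5, "stop": 0.5}

def evaluate_policy(strategies, n_states, reward, p_terminate):
    # Iterative policy evaluation on a toy chain of states: play
    # terminates with probability p_terminate for payoff `reward`,
    # otherwise moves to the next state, so V(s) = p*r + (1-p)*V(next(s)).
    V = [0.0] * n_states
    for _ in range(1000):
        V = [p_terminate * reward + (1 - p_terminate) * V[(s + 1) % n_states]
             for s in range(n_states)]
    return V

def parallel_pi_fp(n_states=4, reward=100.0, p_terminate=0.5, outer_iters=10):
    V = [0.0] * n_states                          # initializeValues()
    S = None
    for _ in range(outer_iters):                  # until C_V is met (fixed count here)
        S = {s: solve_stage_game(s, V) for s in range(n_states)}  # stage-game phase
        V = evaluate_policy(S, n_states, reward, p_terminate)     # value-update phase
    return S, V
```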
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>Results for the first 25 iterations of several algorithm
variations are given in Figure 4. All experiments ran the
parallel versions of the algorithms with 6 cores on a laptop.
The variations include VI-FP and PI-FP with varying
numbers of iterations of fictitious play, as well as PI-FP using
the version of fictitious play where the strategy with lowest
exploitability over all iterations was output (as opposed to
the final strategy). We first observe that VI-FP did not
converge to equilibrium while all versions of PI-FP did, making
PI-FP the clearly preferable choice. We also observe that
using minimum-exploitability FP led to nearly identical
performance to the standard version; since this version also takes
longer, due to the overhead of having to compute the value
of ε at every iteration instead of just at the end, we conclude
that the standard version of fictitious play is preferable to the
version that selects the iteration with minimal exploitability.</p>
      <p>For Parallel PI-FP using standard fictitious play, we
compared results using 1,000, 5,000, 10,000, 20,000, 25,000,
and 50,000 iterations of fictitious play for solving each game
state within the inner loop of the algorithm. Each of these
versions eventually converged to strategies with relatively
low exploitability, with the convergence value of ε smaller
as more iterations of FP are used. Note that initially we set
values for all players at all non-terminal states to be zero,
and that the terminal payoffs for a victory/loss are 100/−100,
and the kinetic payoffs are −200 (with K = 300); so
convergence to ε = 0.01 is quite good (this represents 0.01% of
the minimum possible payoff of the game). Even just using
1,000 iterations of FP converged to an ε of around 0.25, which
is still relatively small. Note that while the final convergence
values were quite low, there was quite a bit of variance in ε
for the first several iterations, even for the versions with a
large number of FP iterations (e.g., with 10,000–50,000
iterations ε spiked up to exceed 20 at iteration 6, and with
20,000 and 25,000 iterations it spiked up again to exceed 25
at iteration 13). So it is very important to ensure that the
algorithm can be run long enough to obtain convergence.</p>
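      <p>The role of the FP iteration count can be seen in a toy version of fictitious play on matching pennies: with more iterations the empirical strategies approach the uniform equilibrium, mirroring the smaller convergence values of ε observed above (illustrative sketch only, not the experimental game).</p>

```python
def fictitious_play(A0, A1, iters):
    # A0/A1: 2x2 payoff matrices for the row and column players.
    counts = [[1, 1], [1, 1]]  # initial action counts for each player
    for _ in range(iters):
        p = [c / sum(counts[0]) for c in counts[0]]  # row empirical mix
        q = [c / sum(counts[1]) for c in counts[1]]  # column empirical mix
        # each player best-responds to the opponent's empirical mixture
        br0 = max(range(2), key=lambda a: sum(A0[a][j] * q[j] for j in range(2)))
        br1 = max(range(2), key=lambda a: sum(A1[i][a] * p[i] for i in range(2)))
        counts[0][br0] += 1
        counts[1][br1] += 1
    total0, total1 = sum(counts[0]), sum(counts[1])
    return ([c / total0 for c in counts[0]], [c / total1 for c in counts[1]])
```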
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have presented a new parallel algorithm for solving
multiplayer stochastic games, and we have presented experimental
results showing that it is able to successfully compute an
ε-equilibrium with very small ε for a naval strategic planning
scenario that has been devised by a domain expert.</p>
      <p>There are several immediate avenues for future study.
First, we note that while for the game model we have
experimented on the stage games have perfect information, our
algorithm also applies to games where the stage games have
imperfect information (related prior work has shown
successful convergence in the imperfect-information setting for
poker tournaments). There are several different natural ways
in which imperfect information can be integrated into the
model. Currently we are exploring a model in which there is
an unknown number of red “sub-players” of each of the three
types; this value is known to a single “meta-player” of that
type, but the other players only know a publicly-available
distribution from which this value is drawn (much as in
poker, where players receive private cards known only to themselves
and a distribution over the cards of the opponents).</p>
      <p>
        We would also like to explore alternative approaches for
the stage game equilibrium-computation portion of our
algorithm. Currently we have used fictitious play, which has been
demonstrated to obtain high performance previously.
However, it may be outperformed by more recently-devised
approaches such as counterfactual regret minimization. While
the core version of FP has been shown to outperform CFR
in multiplayer games
        <xref ref-type="bibr" rid="ref17">(Ganzfried 2020)</xref>
        , for larger domains
with complex information structures CFR may outperform
fictitious play by better capitalizing on integration with
forms of Monte Carlo sampling and deep learning.
      </p>
      <p>While we considered single values for the main game
parameters (set of actions, payoffs, hostility levels, etc.),
selected by a domain expert, in practice we may not be
sure of such values, and we would like to compute strategies
that are robust in case our game model is inaccurate. One
approach to achieve this would be to use a Bayesian setting,
where the game parameters are selected according to a
specified probability distribution (typically over a small number
of possible options). This would require us to extend our
algorithm to solve multiplayer stochastic games where the
stage games are themselves Bayesian games.</p>
      <p>While our model has assumed that the red players act
independently and do not coordinate amongst themselves, this
may not be the case in all realistic situations. In the extreme
case when the red players are all controlled by one single
meta-player, the game could simply be modeled as a
two-player game (which would be zero-sum for the parameters
we have been using), which would be significantly easier to
solve as two-player zero-sum games can be solved in
polynomial time while solving multiplayer games is PPAD-hard.
We see no reason that our algorithm cannot be applied to
solve alternative modifications of the model that integrate
more subtle forms of coordination between players.</p>
      <p>Our game model assumed that all hostility levels are
positive, from which we are able to conclude the existence of a
Nash equilibrium in stationary strategies (because the game
would be guaranteed to have a finite number of stages);
however, we could not make the same deduction if some hostility
levels are non-positive for the undiscounted setting (though
we still could if we were using discounted reward). In the
future we would like to explore convergence of our algorithm
for different selections of the hostility levels including zero
and negative values, as well as consider potential differences
between the average and discounted reward settings.</p>
      <p>
        By now we have observed fictitious play to converge
consistently for stage games in several domains (previously for
poker tournaments and now for naval planning), as well
as the general PI-FP algorithm converging for multiplayer
stochastic games. Theoretically we have seen that these
approaches are not guaranteed to converge in general for these
game classes, and all that has been proven is that if they
do converge then the computed strategies constitute a Nash
equilibrium (though for VI-FP this is not the case and a
counterexample was shown where it can converge to
non-equilibrium strategies
        <xref ref-type="bibr" rid="ref14">(Ganzfried and Sandholm 2009)</xref>
        ). It
would be interesting from a theoretical perspective to prove
more general conditions for which these algorithms are
guaranteed to converge in multiplayer settings that can include
generalizations of these settings that have been studied.
      </p>
      <p>Many important real-world settings contain multiple
players interacting over an unknown duration with probabilistic
transitions, and we feel that the multiplayer stochastic game
model is fundamental for many national security domains,
particularly with the ability of approaches to be integrated
with imperfect information and Bayesian parameter
uncertainty. We plan to explore the application of our algorithm
to other similarly complicated domains in the near future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abou Risk</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Szafron</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Using counterfactual regret minimization to create competitive multiplayer poker agents</article-title>
          .
          <source>In International Conference on Autonomous Agents and Multi-Agent Systems</source>
          ,
          <fpage>159</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sandholm</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Exclusion method for finding Nash equilibrium in multiplayer games</article-title>
          .
          <source>In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)</source>
          ,
          <fpage>383</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sandholm</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Superhuman AI for multiplayer poker</article-title>
          .
          <source>Science</source>
          <volume>365</volume>
          :
          <fpage>885</fpage>
          -
          <lpage>890</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lerer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gross</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sandholm</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sandholm</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Amos</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Depthlimited solving for imperfect-information games</article-title>
          .
          <source>In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>3-Nash is PPAD-complete.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>Electronic Colloquium on Computational Complexity Report No. 134</source>
          :
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Settling the complexity of 2-player Nash equilibrium</article-title>
          .
          <source>In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS).</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kwan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cruz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kruger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Blasch</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Game theoretic approach to threat prediction and situation awareness</article-title>
          .
          <source>Journal of Advances in Information Fusion</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Daskalakis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Papadimitriou</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>The complexity of computing a Nash equilibrium</article-title>
          .
          <source>SIAM Journal on Computing</source>
          <volume>39</volume>
          (
          <issue>1</issue>
          ):
          <fpage>195</fpage>
          -
          <lpage>259</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Fudenberg</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>The Theory of Learning in Games</article-title>
          . MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Ganzfried</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sandholm</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Computing an approximate jam/fold equilibrium for 3-player no-limit Texas hold 'em tournaments</article-title>
          .
          <source>In International Conference on Autonomous Agents and Multi-Agent Systems</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Ganzfried</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sandholm</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Computing equilibria in multiplayer stochastic games of imperfect information</article-title>
          .
          <source>In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI).</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Ganzfried</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sandholm</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Endgame solving in large imperfect-information games</article-title>
          .
          <source>In International Conference on Autonomous Agents and Multi-Agent Systems</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Ganzfried</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Mistakes in games</article-title>
          .
          <source>In Proceedings of the International Conference on Distributed Artificial Intelligence (DAI).</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Ganzfried</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Fictitious Play Outperforms Counterfactual Regret Minimization</article-title>
          . arXiv:2001.11165.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Regret Minimization in Games and the Development of Champion Multiplayer Computer PokerPlaying Agents</article-title>
          .
          <source>Ph.D. Dissertation</source>
          , University of Alberta.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Govindan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>A global Newton method to compute Nash equilibria</article-title>
          .
          <source>Journal of Economic Theory</source>
          <volume>110</volume>
          :
          <fpage>65</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ganzfried</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Midgame solving: A new weapon for efficient large-scale equilibrium approximation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Lemke</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Howson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1964</year>
          .
          <article-title>Equilibrium points of bimatrix games</article-title>
          .
          <source>Journal of the Society of Industrial and Applied Mathematics</source>
          <volume>12</volume>
          :
          <fpage>413</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Mertens</surname>
            ,
            <given-names>J.-F.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Neyman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>1981</year>
          .
          <article-title>Stochastic games</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>International Journal of Game Theory</source>
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <fpage>53</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nudelman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Shoham</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Simple search methods for finding a Nash equilibrium</article-title>
          .
          <source>Games and Economic Behavior</source>
          <volume>63</volume>
          (
          <issue>2</issue>
          ):
          <fpage>642</fpage>
          -
          <lpage>662</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Puterman</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Markov Decision Processes: Discrete Stochastic Dynamic Programming</article-title>
          . John Wiley &amp; Sons.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Vieille</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Two-player stochastic games I: A reduction</article-title>
          .
          <source>Israel Journal of Mathematics</source>
          <volume>119</volume>
          (
          <issue>1</issue>
          ):
          <fpage>55</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Vorobeychik</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Computing Stackelberg equilibria in discounted stochastic games</article-title>
          .
          <source>In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Vorobeychik</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>An</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tambe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>