    Evaluating Reinforcement Learning Algorithms For Evolving Military Games

                           James Chao*, Jonathan Sato*, Crisrael Lucero, Doug S. Lange
                                                  Naval Information Warfare Center Pacific
                                                            *Equal Contribution
                                                            {first.last}@navy.mil




                          Abstract

In this paper, we evaluate reinforcement learning algorithms for military board games. Currently, machine learning approaches to most games assume that certain aspects of the game remain static. This methodology results in a lack of algorithm robustness and a drastic drop in performance when in-game mechanics change. To this end, we evaluate general game playing (Perez-Liebana et al. 2018) AI algorithms on evolving military games.


                        Introduction

AlphaZero (Silver et al. 2017a) described an approach that trained an AI agent through self-play to achieve super-human performance. While the results are impressive, we want to test whether the same algorithms used in games are robust enough to translate into more complex environments that more closely resemble the real world. To our knowledge, papers such as (Hsueh et al. 2018) examine AlphaZero on non-deterministic games, but little research has been performed on progressively complicating and evolving the game environment, mechanics, and goals. Therefore, we tested these different aspects of robustness on AlphaZero models. We intend to continue this work by evaluating different algorithms.


              Background and Related Work

Recent breakthroughs in game AI have generated a large amount of excitement in the AI community. Game AI can not only provide advancement in the gaming industry, but can also be applied to help solve many real-world problems. After Deep-Q Networks (DQNs) were used to beat Atari games in 2013 (Mnih et al. 2013), Google DeepMind developed AlphaGo (Silver et al. 2016), which defeated world champion Lee Sedol in the game of Go using supervised learning and reinforcement learning. One year later, AlphaGo Zero (Silver et al. 2017b) was able to defeat AlphaGo with no human knowledge and pure reinforcement learning. Soon after, AlphaZero (Silver et al. 2017a) generalized AlphaGo Zero to play more games, including Chess, Shogi, and Go, creating a more generalized AI to apply to different problems. In 2018, OpenAI Five used five Long Short-Term Memory (Hochreiter and Schmidhuber 1997) neural networks and a Proximal Policy Optimization (Schulman et al. 2017) method to defeat a professional DotA team, each LSTM acting as a player in a team collaborating to achieve a common goal. AlphaStar used a transformer (Vaswani et al. 2017), an LSTM (Hochreiter and Schmidhuber 1997), and an auto-regressive policy head (Vinyals et al. 2017) with a pointer network (Vinyals, Fortunato, and Jaitly 2015) and a centralized value baseline (Foerster et al. 2017) to beat top professional StarCraft II players. Pluribus (Brown and Sandholm 2019) used Monte Carlo counterfactual regret minimization to beat professional poker players.

We chose AlphaZero due to its proven ability to play at super-human levels without any doubt that it wins merely through fast machine reaction and domain knowledge; however, we are not limited to AlphaZero as an algorithm. Since the original AlphaZero is generally applied to well-known games with well-defined rules, we built our base case game and applied a general AlphaZero implementation (Nair, Thakoor, and Jhunjhunwala 2017) so that we could modify both the game code and the algorithm code in order to experiment with evolving game environments, in the spirit of surprise-based learning (Ranasinghe and Shen 2008).

        Game Description: Checkers Modern Warfare

The basic game that we developed to test our approach consists of two players and a fixed-size, symmetrical square board. Each player has the same number of pieces, placed symmetrically on the board. Players take turns according to the following rules: the turn player chooses a single piece and either moves the piece one space or attacks an adjacent piece in the up, down, right, or left direction. The turn is then passed to the next player. This continues until pieces of only one team remain or the stalemate turn count is reached. A simple sequence of two turns is shown in Figure 1. The game state is fully observable, symmetrical, zero-sum, turn-based, discrete, deterministic, static, and sequential.

Figure 1: Sample two turns: The first of the three boards shows the state of the board before the turn starts. The player of the dark star piece chooses to move one space down, resulting in the second board. The third board is the result of the player of the light star piece attacking the dark star piece.
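To make the rules concrete, the sketch below shows one possible encoding of a game state and its legal-move generation. This is a minimal illustration under our own assumptions (the array encoding, class name, and cell markers are not from the paper's implementation):

```python
import numpy as np

# Assumed cell encoding: 0 = empty, +1 = player 1 piece,
# -1 = player 2 piece, 2 = off-limits land.
DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


class CheckersModernWarfare:
    def __init__(self, n=5, stalemate_turns=50):
        self.n = n
        self.stalemate_turns = stalemate_turns
        self.board = np.zeros((n, n), dtype=int)

    def legal_actions(self, player):
        """Each action is (src, dst, kind): move one space to an empty
        square, or attack an adjacent enemy piece, in the four
        cardinal directions."""
        actions = []
        for x, y in zip(*np.where(self.board == player)):
            for dx, dy in DIRS:
                nx, ny = x + dx, y + dy
                if not (0 <= nx < self.n and 0 <= ny < self.n):
                    continue
                if self.board[nx, ny] == 0:
                    actions.append(((x, y), (nx, ny), "move"))
                elif self.board[nx, ny] == -player:
                    actions.append(((x, y), (nx, ny), "attack"))
        return actions

    def terminal(self, turn):
        """The game ends when one side has no pieces left or the
        stalemate turn count is reached."""
        if not (self.board == 1).any() or not (self.board == -1).any():
            return True
        return turn >= self.stalemate_turns
```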


                        Methodology

The methodology we propose starts from a base case and incrementally builds to more complicated versions of the game. This involves training on less complicated variations of the base case and testing on never-before-seen aspects from the list below. These never-before-seen mechanics can come into play at the beginning of a new game or part-way through the game. We measure the successful adaptation of the agent by comparing the win/loss/draw ratios before and after the increase in difficulty, as sketched below. The different variations used to increase game complexity are described in the sections that follow.
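As a minimal illustration of this adaptation metric (our own sketch, not code from the paper), the functions below compare win/loss/draw ratios from evaluation games played against a fixed opponent before and after a difficulty change:

```python
from collections import Counter

def wld_ratios(results):
    """results: iterable of 'win' / 'loss' / 'draw' strings from
    evaluation games against a fixed opponent."""
    counts = Counter(results)
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("win", "loss", "draw")}

def adaptation_drop(before, after):
    """Drop in win rate after the game is made harder; a small drop
    suggests the agent adapted to the never-before-seen mechanic."""
    return wld_ratios(before)["win"] - wld_ratios(after)["win"]
```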
Disrupting Board Symmetry

We propose two methods for disrupting board symmetry. The first is introducing off-limits spaces that pieces cannot move to, so that the board no longer maps onto itself when rotated about a symmetry axis. The second is disrupting piece symmetry by using non-symmetrical starting positions.
Changing the Game Objective

This variation changes the game-winning mechanisms for the players, suddenly shifting the way the agent needs to play the game. For example, instead of capturing enemy pieces, the objective becomes capture the flag. Another example is giving the players different goals, such as one player focusing on survival while the other focuses on wiping the opponent's pieces out as fast as possible. A sketch of such a pluggable win condition follows.
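The sketch below illustrates one way a swappable win condition could be wired into the game: the function names and the capture-the-flag encoding are our own illustrative assumptions, not the paper's code.

```python
# Hypothetical pluggable win conditions (illustrative only).

def eliminate_all(board, player, flags=None):
    """Original objective: win when the opponent has no pieces left."""
    return not (board == -player).any()

def capture_the_flag(board, player, flags):
    """Alternative objective: win when one of the player's pieces
    reaches the opposing flag location."""
    return any(board[x, y] == player for (x, y) in flags[-player])

# The game switches objectives by rebinding a single callable:
win_condition = eliminate_all
# ... later, or even mid-game:
win_condition = capture_the_flag
```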
Mid-Game Changes

Many of the above changes can be made part-way through a game, so that the timing of the change becomes part of the difficulty. In addition to the existing changes, other mid-game changes can include a sudden "catastrophe" in which the enemy gains a number of units or you lose a number of units, and the introduction of a new player as an ally, an enemy ally, or a neutral third party. A sketch of a mid-game change hook appears below.
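One simple way to implement such changes is to register turn-triggered callbacks on the game loop. This is our own sketch of the idea; the function names, locations, and structure are assumptions:

```python
# Illustrative mid-game change hook. Each entry maps a turn number
# to a function that mutates the game state when that turn begins.

def catastrophe(game):
    """Example change: one side loses pieces at fixed (hypothetical)
    locations."""
    for loc in [(2, 1), (3, 1)]:
        game.board[loc] = 0

mid_game_changes = {25: catastrophe}

def step(game, turn):
    if turn in mid_game_changes:
        mid_game_changes[turn](game)
    # ... normal move selection and application follows
```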
                 Case Study and Results

The base case game consists of a five-by-five board and six game pieces, three for each side. The three pieces of each team are set in opposing corners of the board, as seen in Figure 2. The top-right box of the board is a dedicated piece of land that pieces are not allowed to move to. During each player's turn, the player has the option of moving a single piece or attacking another game piece with one of their game pieces. This continues until no pieces from one team are left, or until 50 turns have elapsed, signaling a stalemate. This base case can be incrementally changed according to one or multiple aspects described in the methodology section.

Figure 2: Base case board setup used for initial training and testing.

We trained an AlphaZero-based agent using an Nvidia DGX with 4 Tesla GPUs for 200 iterations, with 100 episodes per iteration, 20 Monte Carlo Tree Search (MCTS) simulations per episode, 40 games to determine model improvement, a Cpuct of 1, a learning rate of 0.001, a dropout rate of 0.3, 10 epochs, a batch size of 16, and 128 channels. The table and figures below show our results after pitting the model at certain iterations against a random player.

    Iteration    Wins    Losses    Draws
        0         18       22       60
       10         41        8       51
       20         45        1       54
       30         40        3       57
       70         23        4       73
      140         41        3       56
      200         44        1       55

Figure 3: The trained agent starts winning more consistently after 10 iterations.

Figure 4: Draws remain roughly constant throughout the training process.

Convergence occurs around 10 iterations, which is earlier than initially expected, possibly due to the lack of game complexity in the base case game. More studies will be conducted once game complexity is increased. When we dialed the Cpuct hyper-parameter up to 4 to encourage exploration, the model simply converged at a slower rate to the same winning model as with Cpuct equal to 1.
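These hyper-parameters map naturally onto the argument structure of the general AlphaZero implementation the paper builds on (Nair, Thakoor, and Jhunjhunwala 2017). The snippet below restates the configuration in that style; the exact key names follow common conventions in that code base and are assumptions here:

```python
# Training configuration from the case study, expressed in the style
# of a general AlphaZero code base (key names are assumptions).
args = {
    'numIters': 200,      # training iterations
    'numEps': 100,        # self-play episodes per iteration
    'numMCTSSims': 20,    # MCTS simulations per move
    'arenaCompare': 40,   # games played to accept/reject a new model
    'cpuct': 1,           # exploration constant in the PUCT formula
}

nn_args = {
    'lr': 0.001,          # learning rate
    'dropout': 0.3,       # dropout rate
    'epochs': 10,         # training epochs per iteration
    'batch_size': 16,
    'num_channels': 128,  # convolutional channels
}
```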
              Observations on AlphaZero

Game design is important: since AlphaZero is a Monte Carlo method, we need to make sure the game ends in a timely manner, so that an episode completes before training becomes unstable due to an agent running from the enemy forever.

Furthermore, we cannot punish draws, but instead give a near-zero reward, since AlphaZero generally uses the same agent to play both players and simply flips the board to play against itself. This could cause issues down the road if we were to make the two players' goals differ from one another, for example, player one wanting to destroy a building while player two wants to defend the building at all costs. A sketch of the near-zero draw reward follows.

Board symmetry does not affect agent learning in AlphaZero.
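In implementations modeled on the general AlphaZero code base, this observation shows up in the terminal-value function: a win returns +1, a loss -1, and a draw a small non-zero value rather than a punishment. A minimal sketch under those assumptions (the signature and board encoding are ours):

```python
def get_game_ended(board, player, turn, stalemate_turns=50):
    """Terminal value from `player`'s perspective: +1 win, -1 loss,
    and a near-zero (not negative) value for a draw, since the same
    network plays both sides on a flipped board."""
    if not (board == -player).any():
        return 1      # opponent eliminated: win
    if not (board == player).any():
        return -1     # own pieces eliminated: loss
    if turn >= stalemate_turns:
        return 1e-4   # draw: near-zero reward, not a punishment
    return 0          # game still in progress
```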
Non Symmetrical Board

The trained agent is used to play different Checkers Modern Warfare variants, starting with a one-degree variant that makes the board non-symmetrical through random land fills that players cannot move their pieces to. To do this, we disabled locations [0,0], [1,0], [2,0], [3,0], [4,0], [0,1], [1,1], [0,3], and [2,3], put player 1 pieces at [2,1] and [3,1], and put player 2 pieces at [2,4] and [3,4] at the beginning of the game, as shown in Figure 5 (a setup sketch appears below).

Figure 5: Non-symmetrical board setup used for incremental case testing.

The agent trained with 200 iterations from the above section was pitted against a random player, winning 70 games, losing 1 game, and drawing 29 games. This shows that the trained agent can deal with disrupted board symmetry and a game board with a different terrain setup.
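Under the array encoding assumed earlier, this non-symmetrical setup could be constructed as follows (again a sketch, not the authors' code; the OFF_LIMITS marker is our assumption):

```python
import numpy as np

OFF_LIMITS = 2  # assumed marker for land that pieces cannot enter

def non_symmetrical_setup():
    board = np.zeros((5, 5), dtype=int)
    # Disabled land-fill locations from the variant description.
    for loc in [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
                (0, 1), (1, 1), (0, 3), (2, 3)]:
        board[loc] = OFF_LIMITS
    # Non-symmetrical starting positions for the two players.
    for loc in [(2, 1), (3, 1)]:
        board[loc] = 1    # player 1
    for loc in [(2, 4), (3, 4)]:
        board[loc] = -1   # player 2
    return board
```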
Non Symmetrical Board Change Mid Game

The board starts as the non-symmetrical board shown in Figure 5, then turns into the board shown in Figure 2, without obstacles, after 25 turns; 50 turns without a winner results in a draw. The trained agent won 68 games, lost 4 games, and drew 28 games out of 100 games, showing that the agent can perform relatively well when board symmetry changes halfway through the game.

Non Symmetrical Board with Random Added New Pieces Mid Game

Starting with the non-symmetrical board shown in Figure 5, at turn 25 of a 100-turn game we add 3 reinforcement pieces for each team at random locations that are empty during that turn. The trained agent won 80 games, lost 6, and drew 14, performing relatively well with the new randomness introduced.

Non Symmetrical Board with Random Deleted Pieces Mid Game

Starting with the non-symmetrical board shown in Figure 5, at turn 25 of a 100-turn game we blow up 3 random spots with bombs, destroying any pieces at those locations. The trained agent won 84 games, lost 11, and drew 5, performing relatively well with the new randomness introduced.

Non Symmetrical Board with Statically Added New Pieces Mid Game

Starting with the non-symmetrical board shown in Figure 5, at turn 25 of a 100-turn game we add 3 reinforcement pieces for each team at specific locations if the space is empty during that turn. Team 1 is reinforced at locations [2,1], [3,1], and [4,1]; team 2 is reinforced at locations [2,4], [3,4], and [4,4]. The trained agent won 77 games, lost 2, and drew 21, performing relatively well with mid-game state-space changes.

Non Symmetrical Board with Statically Deleted Pieces Mid Game

Starting with the non-symmetrical board shown in Figure 5, at turn 25 of a 100-turn game we blow up every piece at locations [2,1], [3,1], [4,1], [2,4], [3,4], and [4,4]. The trained agent won 83 games, lost 8, and drew 9, performing relatively well with mid-game state-space changes. A sketch of these add/delete changes follows.
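The random and static piece changes above can all be phrased as turn-25 hooks in the style sketched earlier. The following is our own illustrative version of the two random variants:

```python
import random

def add_random_reinforcements(board, per_team=3):
    """At the trigger turn, add pieces for each team at random empty
    squares (illustrative version of the random-add variant)."""
    empties = [(x, y) for x in range(5) for y in range(5)
               if board[x, y] == 0]
    for team in (1, -1):
        for loc in random.sample(empties, per_team):
            board[loc] = team
            empties.remove(loc)  # keep squares unique across teams

def bomb_random_spots(board, n_spots=3):
    """Destroy pieces at randomly chosen board locations
    (illustrative version of the random-delete variant)."""
    for _ in range(n_spots):
        x, y = random.randrange(5), random.randrange(5)
        if board[x, y] in (1, -1):
            board[x, y] = 0
```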
Non Symmetrical Board with Non Deterministic Moves

Movements and attacks are now non-deterministic: 20% of moves or attacks are nullified, resulting in a no-op. Testing on a 50-turn game, the trained agent won 55 games, lost 10, and drew 35. We then tested the same rules with 50% of movements and attacks nullified; the trained agent won 34 games, lost 10, and drew 56. Finally, we changed it to 80% of movements and attacks nullified; the trained agent won 8 games, lost 3, and drew 89 games. The results indicate the agent performed relatively well, with the observation that more randomly assigned no-ops result in more drawn games. A sketch of this nullification appears below.
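This variant amounts to one probabilistic gate in front of action execution. A minimal sketch, assuming a game object with an action-applying method (the `execute` name is hypothetical):

```python
import random

def apply_action(game, action, noop_prob=0.2):
    """Non-deterministic variant: with probability `noop_prob`, the
    chosen move or attack is nullified and the turn is consumed as a
    no-op (illustrative sketch, not the paper's code)."""
    if random.random() < noop_prob:
        return            # action nullified; board unchanged
    game.execute(action)  # hypothetical method applying the move/attack
```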
Changing Game Objective

We changed the game objective to capture the flag and used the agent trained on eliminating the enemy team. The agent won 10 games, lost 4 games, and drew 6 games over 20 games. We then changed the game objective after 25 turns in a 50-turn game; the agent won 9 games, lost 5 games, and drew 6 games over 20 games. The agent performed relatively well with changing game objectives even though it was not trained on this objective. We suspect this is because the trained agent learned generic game-playing techniques, such as movement patterns on a square board.

Non Symmetrical Game Objective

Finally, we changed the game objective to be non-symmetrical, meaning the two players have different game-winning conditions: player 1 has the goal of protecting a flag, while player 2 has the goal of destroying the flag. AlphaZero could not train this agent with good results, since it uses one neural network to train both players. Therefore, future work will be to change the AlphaZero algorithm into a multi-agent learning system with two agents trained on two different objectives, as sketched below.
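A minimal skeleton of that future direction might look like the following, where each player keeps its own network and replay data instead of sharing one flipped-board network. This is entirely our own sketch of the proposal; every method name here is hypothetical:

```python
# Hypothetical two-agent training skeleton for non-symmetrical
# objectives: each side keeps its own network instead of sharing one.

def train_two_agents(game, net_defender, net_attacker, num_iters=200):
    for _ in range(num_iters):
        examples = {1: [], -1: []}
        for _ in range(100):  # self-play episodes per iteration
            # Both networks act in the same episode, each optimizing
            # its own objective (protect vs. destroy the flag).
            episode = game.self_play(player1=net_defender,
                                     player2=net_attacker)
            examples[1] += episode.examples_for(1)
            examples[-1] += episode.examples_for(-1)
        net_defender.train(examples[1])
        net_attacker.train(examples[-1])
```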
                        Conclusion

As we incrementally increase the complexity of the game, we discover how robust the algorithms are to more complex environments, and we can then apply different strategies to improve the AI's flexibility in accommodating more complex and stochastic environments. We learned that AlphaZero is robust to board changes, but less flexible in dealing with other aspects of game change.


                        References

Brown, N., and Sandholm, T. 2019. Superhuman AI for multiplayer poker. Science 365(6456):885–890.

Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2017. Counterfactual multi-agent policy gradients. arXiv:1705.08926.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9:1735–1780.

Hsueh, C.-H.; Wu, I.-C.; Chen, J.-C.; and Hsu, T.-S. 2018. AlphaZero for a non-deterministic game. 2018 Conference on Technologies and Applications of Artificial Intelligence.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv:1312.5602.

Nair, S.; Thakoor, S.; and Jhunjhunwala, M. 2017. Learning to play Othello without human knowledge.

Perez-Liebana, D.; Liu, J.; Khalifa, A.; Gaina, R. D.; Togelius, J.; and Lucas, S. M. 2018. General video game AI: A multi-track framework for evaluating agents, games and content generation algorithms. arXiv:1802.10363.

Ranasinghe, N., and Shen, W.-M. 2008. Surprise-based learning for developmental robotics. 2008 ECSIS Symposium on Learning and Adaptive Behaviors for Robotic Systems (LAB-RS).

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv:1707.06347.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T.; Simonyan, K.; and Hassabis, D. 2017a. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv:1712.01815.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; Chen, Y.; Lillicrap, T.; Hui, F.; Sifre, L.; van den Driessche, G.; Graepel, T.; and Hassabis, D. 2017b. Mastering the game of Go without human knowledge. Nature 550(7676):354–359.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc. 5998–6008.

Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; Quan, J.; Gaffney, S.; Petersen, S.; Simonyan, K.; Schaul, T.; van Hasselt, H.; Silver, D.; Lillicrap, T.; Calderone, K.; Keet, P.; Brunasso, A.; Lawrence, D.; Ekermo, A.; Repp, J.; and Tsing, R. 2017. StarCraft II: A new challenge for reinforcement learning.

Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. arXiv:1506.03134.