=Paper=
{{Paper
|id=Vol-2819/session3paper2
|storemode=property
|title=Evaluating Reinforcement Learning Algorithms for Evolving Military Games (https://youtu.be/VWIJifA5EXY)
|pdfUrl=https://ceur-ws.org/Vol-2819/session3paper2.pdf
|volume=Vol-2819
|authors=James Chao,Jonathan Sato,Crisrael Lucero,Doug Lange
}}
==Evaluating Reinforcement Learning Algorithms for Evolving Military Games (https://youtu.be/VWIJifA5EXY)==
Evaluating Reinforcement Learning Algorithms For Evolving Military Games
James Chao*, Jonathan Sato*, Crisrael Lucero, Doug S. Lange
Naval Information Warfare Center Pacific
*Equal Contribution
{first.last}@navy.mil
Abstract

In this paper, we evaluate reinforcement learning algorithms for military board games. Currently, machine learning approaches to most games assume certain aspects of the game remain static. This methodology results in a lack of algorithm robustness and a drastic drop in performance upon changing in-game mechanics. To this end, we evaluate general game playing (Diego Perez-Liebana 2018) AI algorithms on evolving military games.

Introduction

AlphaZero (Silver et al. 2017a) described an approach that trained an AI agent through self-play to achieve super-human performance. While the results are impressive, we want to test whether the same algorithms used in games are robust enough to translate to more complex environments that more closely resemble the real world. To our knowledge, papers such as (Hsueh et al. 2018) examine AlphaZero on non-deterministic games, but little research has been performed on progressively complicating and evolving the game environment, mechanics, and goals. We therefore tested these different aspects of robustness on AlphaZero models, and we intend to continue future work evaluating different algorithms.

Background and Related Work

Recent breakthroughs in game AI have generated a large amount of excitement in the AI community. Game AI can not only provide advancement in the gaming industry but can also be applied to help solve many real-world problems. After Deep-Q Networks (DQNs) were used to beat Atari games in 2013 (Mnih et al. 2013), Google DeepMind developed AlphaGo (Silver et al. 2016), which defeated world champion Lee Sedol in the game of Go using supervised learning and reinforcement learning. One year later, AlphaGo Zero (Silver et al. 2017b) was able to defeat AlphaGo with no human knowledge and pure reinforcement learning. Soon after, AlphaZero (Silver et al. 2017a) generalized AlphaGo Zero to play more games, including Chess, Shogi, and Go, creating a more generalized AI to apply to different problems. In 2018, OpenAI Five used five Long Short-Term Memory (Hochreiter and Schmidhuber 1997) neural networks and a Proximal Policy Optimization (Schulman et al. 2017) method to defeat a professional DotA team, each LSTM acting as a player in a team collaborating to achieve a common goal. AlphaStar used a transformer (Vaswani et al. 2017), an LSTM (Hochreiter and Schmidhuber 1997), an auto-regressive policy head (Vinyals et al. 2017) with a pointer network (Vinyals, Fortunato, and Jaitly 2015), and a centralized value baseline (Foerster et al. 2017) to beat top professional StarCraft II players. Pluribus (Brown and Sandholm 2019) used Monte Carlo counterfactual regret minimization to beat professional poker players.

AlphaZero was chosen due to its proven ability to play at super-human levels without doubt of merely winning through fast machine reaction and domain knowledge; however, we are not limited to AlphaZero as an algorithm. Since the original AlphaZero is generally applied to well-known games with well-defined rules, we built our base case game and applied a general AlphaZero implementation (Nair, Thakoor, and Jhunjhunwala 2017), which gives us the ability to modify both the game code and the algorithm code in order to experiment with evolving game environments and with techniques such as surprise-based learning (Ranasinghe and Shen 2008).
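The practical meaning of modifying "the game code as well as the algorithm code" is easiest to see in the small interface that general AlphaZero frameworks expose: the training loop only ever calls a handful of game methods, so an evolving game variant is just a new implementation of that interface. The sketch below is a hypothetical minimal version in Python; the method names follow the convention of the open-source framework cited above but are our assumptions, not the authors' actual code.

```python
from abc import ABC, abstractmethod


class Game(ABC):
    """Minimal game interface in the style of a general AlphaZero
    framework: the training loop only talks to these methods, so
    evolving the game (new board, new rules, new objective) means
    editing this class, never the learning code."""

    @abstractmethod
    def get_init_board(self):
        """Return the starting board."""

    @abstractmethod
    def get_action_size(self):
        """Return the total number of discrete actions."""

    @abstractmethod
    def get_next_state(self, board, player, action):
        """Apply `action` for `player`; return (next_board, next_player)."""

    @abstractmethod
    def get_valid_moves(self, board, player):
        """Return a 0/1 mask of length get_action_size()."""

    @abstractmethod
    def get_game_ended(self, board, player):
        """Return 0 while running, +1/-1 for a win/loss from
        `player`'s point of view, or a small value for a draw."""


class CoinFlipGame(Game):
    """Toy one-cell game showing the interface is complete: either
    player may flip the single cell to 1, which ends the game."""

    def get_init_board(self):
        return [0]

    def get_action_size(self):
        return 2  # action 0 = pass, action 1 = flip

    def get_next_state(self, board, player, action):
        nxt = list(board)
        if action == 1:
            nxt[0] = 1
        return nxt, -player

    def get_valid_moves(self, board, player):
        return [1, 1]

    def get_game_ended(self, board, player):
        return player if board[0] == 1 else 0
```

A new mid-game mechanic then only requires overriding `get_next_state` (and possibly `get_valid_moves`); the self-play and MCTS code is untouched.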
This will certify that all author(s) of the above article/paper are employees of the U.S. Government and performed this work as part of their employment, and that the article/paper is therefore not subject to U.S. copyright protection. No copyright. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of AAAI Symposium on the 2nd Workshop on Deep Models and Artificial Intelligence for Defense Applications: Potentials, Theories, Practices, Tools, and Risks, November 11-12, 2020, Virtual, published at http://ceur-ws.org

Game Description: Checkers Modern Warfare

The basic game that has been developed to test our approach consists of two players and a fixed-size, symmetrical square board. Each player has the same number of pieces, placed symmetrically on the board. Players take turns according to the following rules: the turn player chooses a single piece and either moves it one space or attacks an adjacent piece in the up, down, right, or left direction. The turn then passes to the next player. This continues until only one team's pieces remain or the stalemate turn count is reached. A simple sequence of two turns is shown in Figure 1. The game state is fully observable, symmetrical, zero-sum, turn-based, discrete, deterministic, static, and sequential.

Figure 1: Sample Two Turns: The first of the three boards shows the state of the board before the turn starts. The player of the dark star piece chooses to move one space down, resulting in the second board. The third board is the result of the player of the light star piece attacking the dark star piece.

Methodology

The methodology we propose starts from a base case and incrementally builds to more complicated versions of the game. This involves training on less complicated variations of the base case and testing on never-before-seen aspects from the list below. These never-before-seen mechanics can come into play at the beginning of a new game or part-way through it. We measure the successful adaptation of the agent by comparing the win/loss/draw ratios before and after the increase in difficulty. The different variations used to increase game complexity are described in the sections below.

Disrupting Board Symmetry

We propose two methods for disrupting board symmetry. The first is introducing off-limits spaces that pieces cannot move to, so that rotating the board about a symmetry axis no longer yields the same board. The second is disrupting piece symmetry by using non-symmetrical starting positions.

Changing the Game Objective

Changing the game-winning mechanisms suddenly shifts the way the agent needs to play the game. For example, instead of capturing enemy pieces, the objective becomes capture the flag. Another example of changing objectives is having the players pursue different goals, such as one player focusing on survival while the other focuses on wiping out the opponent's pieces as fast as possible.

Mid-Game Changes

Many of the above changes can be made part-way through a game, making the timing of changes part of the difficulty. In addition to the existing changes, other mid-game changes can include a sudden "catastrophe" in which the enemy gains a number of units or you lose a number of units, and the introduction of a new player as an ally, an enemy ally, or a neutral third party.

Case Study and Results

The base case game consists of a five-by-five board and six game pieces, three for each side. The three pieces of each team are set in opposing corners of the board, as seen in Figure 2. The top right box of the board is a dedicated piece of land that pieces are not allowed to move to. During each player's turn, the player has the option of moving a single piece or attacking another game piece with one of their game pieces. This continues until no pieces from one team are left or until 50 turns have elapsed, signaling a stalemate. This base case can be incrementally changed according to one or multiple aspects described in the methodology section.

Figure 2: Base case board setup used for initial training and testing.

We trained an AlphaZero-based agent on an Nvidia DGX with 4 Tesla GPUs for 200 iterations, with 100 episodes per iteration, 20 Monte Carlo Tree Search (MCTS) simulations per episode, 40 games to determine model improvement, a Cpuct of 1, a learning rate of 0.001, a dropout rate of 0.3, 10 epochs, a batch size of 16, and 128 channels. The table and figures below show our results after pitting the model at certain iterations against a random player.

Iteration  Wins  Losses  Draws
0          18    22      60
10         41    8       51
20         45    1       54
30         40    3       57
70         23    4       73
140        41    3       56
200        44    1       55

Convergence occurs around 10 iterations; this is earlier than initially expected, possibly due to the lack of game complexity in the base case game. More studies will be conducted once game complexity is increased. When we dialed the Cpuct hyper-parameter up to 4 to encourage exploration, the model simply converged at a slower rate to the same winning model as with Cpuct equal to 1.

Observations on AlphaZero

Game design is important: since AlphaZero is a Monte Carlo method, we need to make sure the game ends in a timely manner in order to complete an episode, before the training becomes unstable due to an agent running from the enemy forever.
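The 50-turn stalemate cap that guarantees timely termination, together with the near-zero draw value discussed below, can be folded into a single terminal-state check. A minimal sketch, with function and constant names of our own choosing (the 1e-4 draw value follows a common open-source AlphaZero convention, not necessarily the authors' exact choice):

```python
STALEMATE_TURNS = 50   # game is declared a draw after 50 turns
DRAW_VALUE = 1e-4      # near-zero: a draw is not punished like a loss


def game_ended(pieces_left, turn, player):
    """Terminal check from `player`'s perspective.
    `pieces_left` maps each player (+1 / -1) to their remaining piece
    count. Returns 0 while the game is running, +1 or -1 for a win or
    loss, and a near-zero value for a stalemate draw."""
    if pieces_left[-player] == 0:
        return 1             # opponent eliminated: win
    if pieces_left[player] == 0:
        return -1            # own team eliminated: loss
    if turn >= STALEMATE_TURNS:
        return DRAW_VALUE    # hard cap so self-play episodes always end
    return 0
```

Without the turn cap, an agent that learns to flee indefinitely would produce unbounded self-play episodes, which is the instability described above.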
Figure 3: The trained agent starts winning more consistently after 10 iterations.

Figure 4: Draws are constant throughout the training process.

Furthermore, we cannot punish draws, but instead give a near-0 reward, since AlphaZero generally uses the same agent to play both players and simply flips the board to play against itself. This could potentially cause issues down the road if we were to change the two players' goals to differ from one another, for example, if player one wants to destroy a building while player two wants to defend the building at all costs.

Board symmetry does not affect agent learning in AlphaZero.

Non Symmetrical Board

The trained agent will be used to play different Checkers Modern Warfare variants, starting with a one-degree variant such as making the board non-symmetrical, with random land fills that players cannot move their pieces to. To do this, at the beginning of the game we disabled locations [0,0], [1,0], [2,0], [3,0], [4,0], [0,1], [1,1], [0,3], and [2,3], put player 1 pieces at [2,1] and [3,1], and put player 2 pieces at [2,4] and [3,4], as shown in figure 5.

Figure 5: Non-symmetrical board setup used for incremental case testing.

The agent trained with 200 iterations from the above section was pitted against a random player, winning 70 games, losing 1 game, and drawing 29 games. This shows the trained agent can deal with disrupted board symmetry and a game board with a different terrain setup.

Non Symmetrical Board Change Mid Game

The board starts as the non-symmetrical board shown in figure 5, then turns into the board shown in figure 2, without obstacles, after 25 turns. 50 turns without a winner results in a draw. The trained agent won 68 games, lost 4 games, and drew 28 games out of 100 games, showing the agent can perform relatively well with a board symmetry change halfway through the game.

Non Symmetrical Board with Random Added New Pieces Mid Game

Starting with the non-symmetrical board shown in figure 5, at turn 25 in a 100-turn game we add 3 reinforcement pieces for each team at random locations, if the space is empty during the turn. The trained agent won 80 games, lost 6, and drew 14, performing relatively well with the new randomness introduced.

Non Symmetrical Board with Random Deleted Pieces Mid Game

Starting with the non-symmetrical board shown in figure 5, at turn 25 in a 100-turn game we blow up 3 random spots with bombs, destroying any pieces at those locations. The trained agent won 84 games, lost 11, and drew 5, performing relatively well with the new randomness introduced.

Non Symmetrical Board with Statically Added New Pieces Mid Game

Starting with the non-symmetrical board shown in figure 5, at turn 25 in a 100-turn game we add 3 reinforcement pieces for each team at specific locations, if the space is empty during the turn. Team 1 is reinforced at locations [2,1], [3,1], and [4,1]; team 2 is reinforced at locations [2,4], [3,4], and [4,4]. The trained agent won 77 games, lost 2, and drew 21, performing relatively well with mid-game state space changes.
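The mid-game variants above share one mechanism: a hook that mutates the board state when a trigger turn is reached. Below is a sketch of the random-deletion ("bomb") variant, with hypothetical names (`apply_midgame_change`, `CHANGE_TURN`) of our own choosing rather than the authors' code:

```python
import random

BOARD_SIZE = 5
CHANGE_TURN = 25  # trigger turn used by the mid-game variants above


def apply_midgame_change(board, turn, rng=random):
    """Hook called once per turn by the game loop. At CHANGE_TURN it
    drops 3 bombs on random squares, destroying any piece there
    (board[x][y] is 0 for empty, +1/-1 for a piece of either team).
    On every other turn the board is returned unchanged."""
    if turn != CHANGE_TURN:
        return board
    for _ in range(3):
        x = rng.randrange(BOARD_SIZE)
        y = rng.randrange(BOARD_SIZE)
        board[x][y] = 0  # clear the square regardless of occupant
    return board
```

The reinforcement variants are the same hook writing +1/-1 pieces into empty squares instead of clearing them.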
Non Symmetrical Board with Statically Deleted Pieces Mid Game

Starting with the non-symmetrical board shown in figure 5, at turn 25 in a 100-turn game we blow up every piece at locations [2,1], [3,1], [4,1], [2,4], [3,4], and [4,4]. The trained agent won 83 games, lost 8, and drew 9, performing relatively well with mid-game state space changes.

Non Symmetrical Board with Non Deterministic Moves

Movements and attacks are now non-deterministic: 20% of the moves or attacks are nullified, resulting in a no-op. Testing on a 50-turn game, the trained agent won 55 games, lost 10, and drew 35. We then tested the same rules with 50% of the movements and attacks nullified; the trained agent won 34 games, lost 10, and drew 56. Finally, we changed it to 80% of the movements and attacks nullified; the trained agent won 8 games, lost 3, and drew 89 games. The results indicate the agent performed relatively well, with the observation that more randomly assigned no-ops result in more drawn games.

Changing Game Objective

We changed the game objective to capture the flag and used the agent trained on eliminating the enemy team. The agent won 10 games, lost 4 games, and drew 6 games over 20 games. We then changed the game objective after 25 turns in a 50-turn game; the agent won 9 games, lost 5 games, and drew 6 games over 20 games. The agent performed relatively well with changing game objectives even though it was not trained on this objective. We suspect this is due to the trained agent having learned generic game playing techniques, such as movement patterns on a square-type board.

Non Symmetrical Game Objective

Finally, we changed the game objective to be non-symmetrical, meaning the 2 players have different game-winning conditions: player 1 has the goal of protecting a flag, while player 2 has the goal of destroying the flag. AlphaZero could not train this agent with good results, since it uses one neural network to train both players. Therefore, future work will be to change the AlphaZero algorithm into a multi-agent learning system where 2 agents are trained on 2 different objectives.

Conclusion

As we incrementally increase the complexity of the game, we discover how robust the algorithms are to more complex environments and can then apply different strategies to improve the AI's flexibility in accommodating more complex and stochastic environments. We learned that AlphaZero is robust to board change, but less flexible in dealing with other aspects of game change.

References

Brown, N., and Sandholm, T. 2019. Superhuman AI for multiplayer poker. Science, 11 July 2019.

Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2017. Counterfactual multi-agent policy gradients. arXiv:1705.08926.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.

Hsueh, C.-H.; Wu, I.-C.; Chen, J.-C.; and Hsu, T.-S. 2018. AlphaZero for a non-deterministic game. 2018 Conference on Technologies and Applications of Artificial Intelligence.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv:1312.5602.

Nair, S.; Thakoor, S.; and Jhunjhunwala, M. 2017. Learning to play Othello without human knowledge.

Perez-Liebana, D.; Liu, J.; Khalifa, A.; Gaina, R. D.; Togelius, J.; and Lucas, S. M. 2018. General video game AI: a multi-track framework for evaluating agents, games and content generation algorithms. arXiv:1802.10363.

Ranasinghe, N., and Shen, W.-M. 2008. Surprise-based learning for developmental robotics. 2008 ECSIS Symposium on Learning and Adaptive Behaviors for Robotic Systems (LAB-RS).

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv:1707.06347.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484.

Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T.; Simonyan, K.; and Hassabis, D. 2017a. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv:1712.01815.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; Chen, Y.; Lillicrap, T.; Hui, F.; Sifre, L.; van den Driessche, G.; Graepel, T.; and Hassabis, D. 2017b. Mastering the game of Go without human knowledge. Nature 550(7676):354.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc. 5998-6008.

Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; Quan, J.; Gaffney, S.; Petersen, S.; Simonyan, K.; Schaul, T.; van Hasselt, H.; Silver, D.; Lillicrap, T.; Calderone, K.; Keet, P.; Brunasso, A.; Lawrence, D.; Ekermo, A.; Repp, J.; and Tsing, R. 2017. StarCraft II: A new challenge for reinforcement learning. arXiv:1708.04782.

Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. arXiv:1506.03134.