<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Reinforcement Learning Algorithms For Evolving Military Games</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>James Chao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonathan Sato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Crisrael Lucero</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Doug S. Lange</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Equal Contribution</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Naval Information Warfare Center Pacific</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we evaluate reinforcement learning algorithms for military board games. Current machine learning approaches to most games assume that certain aspects of the game remain static. This assumption leaves the resulting algorithms brittle, and their performance drops drastically when in-game mechanics change. To this end, we evaluate general game playing (Perez-Liebana et al. 2018) AI algorithms on evolving military games.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        AlphaZero
        <xref ref-type="bibr" rid="ref16">(Silver et al. 2017a)</xref>
        described an approach that
trained an AI agent through self-play to achieve
superhuman performance. While the results are impressive, we
want to test whether the same algorithms used in games are
robust enough to translate into more complex environments
that more closely resemble the real world. To our knowledge,
papers such as
        <xref ref-type="bibr" rid="ref6">(Hsueh et al. 2018)</xref>
        examine AlphaZero on
non-deterministic games, but little research has been
performed on progressively complicating and evolving the
game environment, mechanics, and goals. We therefore
tested these different aspects of robustness on AlphaZero
models, and we intend to evaluate further algorithms in
future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background and Related Work</title>
      <p>
        Recent breakthroughs in game AI have generated a large
amount of excitement in the AI community. Game AI can
not only advance the gaming industry but also be applied
to help solve many real-world problems.
After Deep-Q Networks (DQNs) were used to beat Atari
games in
        <xref ref-type="bibr" rid="ref8">2013 (Mnih et al. 2013)</xref>
        , Google DeepMind
developed AlphaGo
        <xref ref-type="bibr" rid="ref14">(Silver et al. 2016)</xref>
        that defeated world
champion Lee Sedol in the game of Go using supervised learning
and reinforcement learning. One year later, AlphaGo Zero
        <xref ref-type="bibr" rid="ref18">(Silver et al. 2017b)</xref>
        was able to defeat AlphaGo with no
human knowledge and pure reinforcement learning. Soon
after, AlphaZero
        <xref ref-type="bibr" rid="ref16">(Silver et al. 2017a)</xref>
        generalized AlphaGo
Zero to be able to play more games including Chess, Shogi,
and Go, creating a more generalized AI to apply to
different problems. In 2018, OpenAI Five used five Long
Short-Term Memory (LSTM)
        <xref ref-type="bibr" rid="ref5">(Hochreiter and Schmidhuber 1997)</xref>
        neural
networks and a Proximal Policy Optimization
        <xref ref-type="bibr" rid="ref12">(Schulman et
al. 2017)</xref>
        method to defeat a professional DotA team, each
LSTM acting as a player in a team to collaborate and achieve
a common goal. AlphaStar used a transformer
        <xref ref-type="bibr" rid="ref19">(Vaswani et
al. 2017)</xref>
        , LSTM
        <xref ref-type="bibr" rid="ref5">(Hochreiter and Schmidhuber 1997)</xref>
        ,
autoregressive policy head
        <xref ref-type="bibr" rid="ref20">(Vinyals et al. 2017)</xref>
        with a pointer
(Vinyals, Fortunato, and Jaitly 2015), and a centralized value
baseline
        <xref ref-type="bibr" rid="ref4">(Foerster et al. 2017)</xref>
        to beat top professional
StarCraft II players. Pluribus
        <xref ref-type="bibr" rid="ref1">(Brown and Sandholm 2019)</xref>
        used
Monte Carlo counterfactual regret minimization to beat
professional poker players.
      </p>
      <p>
        AlphaZero was chosen for its proven ability to play at
superhuman levels without the suspicion that it wins merely
through fast machine reaction or built-in domain knowledge;
however, we are not limited to AlphaZero as an algorithm.
Since the original AlphaZero is generally applied to
well-known games with well-defined rules, we built our base
case game and applied a general AlphaZero implementation
        <xref ref-type="bibr" rid="ref10">(Nair, Thakoor, and
Jhunjhunwala 2017)</xref>
        so that we could modify both the game code and
the algorithm code to experiment with evolving game
environments, using techniques such as surprise-based learning
        <xref ref-type="bibr" rid="ref11">(Ranasinghe and Shen 2008)</xref>
        .
      </p>
      <sec id="sec-2-1">
        <title>Game Description: Checkers Modern Warfare</title>
        <p>The basic game developed to test our approach
consists of two players on a fixed-size, symmetrical square
board. Each player has the same number of pieces placed
symmetrically on the board. Players take turns according to
the following rules: the turn player chooses a single piece
and either moves it one space or attacks an adjacent
piece in the up, down, right, or left direction. The turn is
then passed to the other player. This continues until pieces of
only one team remain or the stalemate turn count is reached.
Two example turns are shown in Figure 1. The game state
is fully observable, symmetrical, zero-sum, turn-based,
discrete, deterministic, static, and sequential.</p>
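        <p>The turn structure above can be sketched in a few lines of Python (an illustrative sketch; the board encoding and the names step and winner are ours, not the authors' implementation):</p>

```python
# Minimal sketch of one action in the base game (illustrative, not the
# authors' code): a piece either moves one space or attacks an adjacent
# enemy piece in one of four directions.

DIRECTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # up, down, right, left

def step(board, pos, direction, player):
    """Apply a move-or-attack for `player`'s piece at `pos`; return a new board."""
    x, y = pos
    dx, dy = direction
    tx, ty = x + dx, y + dy
    n = len(board)
    if not (0 <= tx < n and 0 <= ty < n):
        return board  # off the board: treat as a no-op
    new = [row[:] for row in board]
    if new[ty][tx] == 0:
        new[ty][tx], new[y][x] = player, 0  # move into the empty space
    elif new[ty][tx] == -player:
        new[ty][tx] = 0  # attack: the adjacent enemy piece is removed
    return new

def winner(board):
    """Return +1 or -1 once only one team's pieces remain, else None."""
    flat = [c for row in board for c in row]
    if 1 in flat and -1 not in flat:
        return 1
    if -1 in flat and 1 not in flat:
        return -1
    return None
```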
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
        <p>The methodology we propose starts from a base case and
incrementally builds to more complicated versions of the
game. This involves training on less complicated variations
of the base case and testing on never-before-seen aspects
from the list below. These never-before-seen mechanics can
come into play at the beginning of a new game or part-way
through it. We measure the agent's successful adaptation by
comparing the win/loss/draw ratios before and after the
increase in difficulty. The different variations used to
increase game complexity are described in the sections
below.</p>
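        <p>As a concrete illustration of this measure, win/loss/draw counts can be turned into ratios and compared across a change (a minimal sketch; the function names are ours):</p>

```python
# Sketch of the adaptation measure described above: compare win/loss/draw
# ratios before and after a never-before-seen change (names are ours).

def ratios(wins, losses, draws):
    """Normalize raw counts into win/loss/draw ratios."""
    total = wins + losses + draws
    return {"win": wins / total, "loss": losses / total, "draw": draws / total}

def win_rate_drop(before, after):
    """Drop in win rate after the change; near zero means good adaptation."""
    return ratios(*before)["win"] - ratios(*after)["win"]
```

        <p>For example, a 70/1/29 record before a change and a 68/4/28 record after it give a win-rate drop of 0.02.</p>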
      <sec id="sec-3-1">
        <title>Disrupting Board Symmetry</title>
        <p>We propose two methods for disrupting board symmetry.
The first is introducing off-limits spaces that pieces cannot
move to, so that the board no longer maps onto itself under
rotation about a symmetry axis. The second is disrupting piece
symmetry by using non-symmetrical starting positions.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Changing the Game Objective</title>
        <p>This variation changes the win conditions for the players,
suddenly shifting the way the agent needs to play
the game. For example, instead of capturing enemy pieces,
the objective becomes capturing the flag. Another example is
giving the two players different goals, such as one player
focusing on survival while the other focuses on wiping out the
opponent's pieces as fast as possible.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Mid-Game Changes</title>
        <p>Many of the above changes can be made part-way through a
game, making the timing of changes part of the
difficulty. Other mid-game changes include a sudden
“catastrophe” in which the enemy gains a number of units
or you lose a number of units, and the introduction of a
new player as an ally, an enemy ally, or a neutral third
party.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Case Study and Results</title>
        <p>The base case game consists of a five-by-five board and
six game pieces, three per side. The three pieces of
each team are set in opposing corners of the board, as seen in
Figure 2. The top-right square of the board is a dedicated piece
of land that pieces are not allowed to move to. During each
player's turn, the player has the option of moving a single
piece or attacking another game piece with one of their own.
This continues until no pieces from one team are left or
until 50 turns have elapsed, signaling a stalemate. This
base case can be incrementally changed according to one or
more of the aspects described in the methodology section.</p>
        <p>We trained an AlphaZero-based agent on an Nvidia
DGX with 4 Tesla GPUs for 200 iterations, with 100 episodes per
iteration, 20 Monte Carlo Tree Search (MCTS) simulations
per episode, 40 games to determine model improvement, a
Cpuct of 1, a learning rate of 0.001, a dropout rate of 0.3,
10 epochs, a batch size of 16, and 128 channels. The table and
graphs below show our results after pitting the model at certain
iterations against a random player.</p>
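        <p>For reference, the training settings reported above can be collected into a single configuration (a sketch; the key names are ours, not necessarily those of the underlying AlphaZero codebase):</p>

```python
# Training configuration as reported in the text (key names are
# illustrative, not the exact ones used in the general AlphaZero codebase).
config = {
    "num_iterations": 200,          # training iterations
    "episodes_per_iteration": 100,
    "mcts_simulations": 20,         # MCTS simulations per episode (as reported)
    "arena_games": 40,              # games used to decide model improvement
    "cpuct": 1,                     # exploration constant in MCTS
    "learning_rate": 0.001,
    "dropout": 0.3,
    "epochs": 10,
    "batch_size": 16,
    "num_channels": 128,            # convolutional channels in the network
}
```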
        <p>Convergence occurs around 10 iterations, earlier
than initially expected, possibly due to the lack of game
complexity in the base case. More studies will be
conducted once game complexity is increased. When we dialed the
Cpuct hyper-parameter up to 4 to encourage exploration, the
model simply converged at a slower rate to the same
winning model as with Cpuct equal to 1.</p>
      <sec id="sec-4-1">
        <title>Observations on AlphaZero</title>
        <p>Game design is important: since AlphaZero is a Monte Carlo
method, we need to make sure the game ends in a timely
manner, so that an episode completes before training
becomes unstable due to an agent running from the enemy
forever.</p>
        <p>Furthermore, we cannot punish draws; instead we give
a near-zero reward, since AlphaZero generally uses the same
agent for both players and simply flips the board to play
against itself. This could cause issues down the
road if we were to give the two players different goals, for
example, player one trying to destroy a building while
player two defends the building at all costs.</p>
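        <p>A minimal sketch of this reward scheme, assuming +1/-1 for win/loss and a small positive constant of our choosing for the draw (the paper only says near zero):</p>

```python
# Sketch of the reward scheme discussed above: +1 for a win, -1 for a
# loss, and a small positive (near-zero) reward for a draw rather than a
# punishment. The constant 1e-4 is our assumption; the text only says
# "near 0".

DRAW_REWARD = 1e-4

def terminal_reward(winner, player):
    """Reward from `player`'s perspective at game end; winner 0 means draw."""
    if winner == 0:
        return DRAW_REWARD
    return 1.0 if winner == player else -1.0
```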
        <p>Board symmetry does not affect agent learning in
AlphaZero.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Non-Symmetrical Board</title>
        <p>The trained agent was used to play different
Checkers Modern Warfare variants, starting with a
one-degree variant that makes the board non-symmetrical
with random land fills that players cannot move
their pieces to. To do this, at the beginning of the game we
disabled locations [0,0], [1,0], [2,0], [3,0], [4,0], [0,1],
[1,1], [0,3], and [2,3], placed player 1's pieces at [2,1] and
[3,1], and player 2's pieces at [2,4] and [3,4], as shown in Figure 5.</p>
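        <p>The setup above can be written out directly as coordinates (a sketch; the encoding, with 9 marking off-limits terrain, is our choice):</p>

```python
# The non-symmetrical setup described above, written out as (x, y)
# coordinates on the 5x5 board (the encoding, with 9 marking off-limits
# terrain, is our choice).

DISABLED = {(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (0, 1), (1, 1), (0, 3), (2, 3)}
PLAYER_1 = {(2, 1), (3, 1)}
PLAYER_2 = {(2, 4), (3, 4)}

def build_board(n=5):
    """Return the starting board: 0 empty, 9 off-limits, 1/-1 player pieces."""
    board = [[0] * n for _ in range(n)]
    for x, y in DISABLED:
        board[y][x] = 9
    for x, y in PLAYER_1:
        board[y][x] = 1
    for x, y in PLAYER_2:
        board[y][x] = -1
    return board
```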
        <p>The agent trained for 200 iterations in the above
section was pitted against a random player, winning 70 games,
losing 1 game, and drawing 29 games. This demonstrates that the
trained agent can deal with disrupted board symmetry and
a game board with a different terrain setup.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Non-Symmetrical Board Changed Mid-Game</title>
        <p>The board starts as the non-symmetrical board shown in
Figure 5, then turns into the board shown in Figure 2, without
obstacles, after 25 turns. 50 turns without a winner results in
a draw. The trained agent won 68 games, lost 4 games, and
drew 28 games out of 100, showing that the agent can
perform relatively well when board symmetry changes half-way
through the game.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Non-Symmetrical Board with Randomly Added New Pieces Mid-Game</title>
        <p>Starting with the non-symmetrical board shown in Figure 5,
at turn 25 in a 100-turn game, we add 3 reinforcement pieces for
each team at random locations, provided the chosen space is empty
during that turn. The trained agent won 80 games, lost 6, and drew
14, performing relatively well with the new randomness
introduced.</p>
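        <p>This reinforcement rule can be sketched as follows (our illustration, not the authors' code; the trigger turn and piece count follow the text, everything else is assumed):</p>

```python
import random

# Sketch of the mid-game reinforcement rule above: at the trigger turn,
# try to add 3 pieces per team at random locations, placing a piece only
# if the chosen space is empty (our illustration, not the authors' code).

def add_reinforcements(board, turn, rng, pieces_per_team=3, trigger_turn=25):
    """Return a new board with random reinforcements added at the trigger turn."""
    if turn != trigger_turn:
        return board
    n = len(board)
    new = [row[:] for row in board]
    for player in (1, -1):
        for _ in range(pieces_per_team):
            x, y = rng.randrange(n), rng.randrange(n)
            if new[y][x] == 0:  # only place on an empty space
                new[y][x] = player
    return new
```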
      </sec>
      <sec id="sec-4-7">
        <title>Non-Symmetrical Board with Randomly Deleted Pieces Mid-Game</title>
        <p>Starting with the non-symmetrical board shown in Figure
5, at turn 25 in a 100-turn game, we bomb 3 random spots,
destroying any pieces at those locations. The trained
agent won 84 games, lost 11, and drew 5, performing
relatively well with the new randomness introduced.</p>
      </sec>
      <sec id="sec-4-9">
        <title>Non-Symmetrical Board with Statically Added New Pieces Mid-Game</title>
        <p>Starting with the non-symmetrical board shown in Figure 5,
at turn 25 in a 100-turn game, we add 3 reinforcement pieces for
each team at specific locations, provided the space is empty during
that turn. Team 1 is reinforced at locations [2,1], [3,1], and [4,1];
team 2 is reinforced at locations [2,4], [3,4], and [4,4]. The trained
agent won 77 games, lost 2, and drew 21, performing
relatively well with mid-game state space changes.</p>
      </sec>
      <sec id="sec-4-11">
        <title>Non-Symmetrical Board with Statically Deleted Pieces Mid-Game</title>
        <p>Starting with the non-symmetrical board shown in Figure 5,
at turn 25 in a 100-turn game, we destroy every piece at
locations [2,1], [3,1], [4,1], [2,4], [3,4], and [4,4]. The trained agent
won 83 games, lost 8, and drew 9, performing relatively well
with mid-game state space changes.</p>
      </sec>
      <sec id="sec-4-13">
        <title>Non-Symmetrical Board with Non-Deterministic Moves</title>
        <p>Movements and attacks are now non-deterministic: 20%
of moves or attacks are nullified, resulting in a no-op,
tested on a 50-turn game. The trained agent won
55 games, lost 10, and drew 35. We then tested the same
rules with 50% of movements and attacks nullified; the
trained agent won 34 games, lost 10, and drew 56. Finally,
with 80% of movements and attacks nullified, the trained
agent won 8 games, lost 3, and drew 89
games. The results indicate the agent performed relatively
well, with the observation that more randomly assigned
no-ops result in more drawn games.</p>
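        <p>The nullification mechanic can be sketched as a wrapper around any move-application function (illustrative; apply_move is a stand-in for the game's actual transition function):</p>

```python
import random

# Sketch of the non-deterministic variant above: every move or attack is
# nullified (a no-op) with probability p. `apply_move` stands in for the
# game's actual transition function (illustrative names).

def maybe_nullify(apply_move, board, move, p, rng):
    """Apply `move` to `board`, except with probability p do nothing."""
    if rng.random() < p:
        return board  # nullified: the board is unchanged
    return apply_move(board, move)
```

        <p>Raising p from 0.2 to 0.8 makes fewer actions land, consistent with the observed shift toward drawn games.</p>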
      </sec>
      <sec id="sec-4-14">
        <title>Changing Game Objective</title>
        <p>We changed the game objective to capture the flag and used
the agent trained on eliminating the enemy team. The agent
won 10 games, lost 4, and drew 6 over 20
games. We then changed the game objective after 25 turns
in a 50-turn game; the agent won 9 games, lost 5, and
drew 6 over 20 games. The agent performed relatively
well with changing game objectives even though it was not
trained on them. We suspect this is because the trained
agent has learned generic game-playing techniques, such
as movement patterns on a square-grid board.</p>
      </sec>
      <sec id="sec-4-15">
        <title>Non-Symmetrical Game Objective</title>
        <p>Finally, we made the game objective non-symmetrical,
meaning the two players have different winning
conditions: player 1 aims to protect a flag, while player
2 aims to destroy it. AlphaZero could not train
this agent with good results, since it uses one neural network
to train both players. Future work will therefore change
the AlphaZero algorithm into a multi-agent learning system
in which two agents are trained on two different objectives.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>As we incrementally increase the complexity of the game,
we discover how robust the algorithms are in more
complex environments, and we can then apply different
strategies to improve the AI's flexibility in accommodating
more complex and stochastic environments. We learned that
AlphaZero is robust to board changes but less flexible in
dealing with other aspects of game change.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Brown, N., and Sandholm, T. <year>2019</year>. <article-title>Superhuman AI for multiplayer poker</article-title>. <source>Science</source>, 11 July 2019.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Perez-Liebana, D.; et al. <year>2018</year>. <article-title>General video game AI: a multi-track framework for evaluating agents, games and content generation algorithms</article-title>. <source>arXiv:1802.10363</source>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. <year>2017</year>. <article-title>Counterfactual multi-agent policy gradients</article-title>. <source>arXiv:1705.08926</source>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>Hochreiter, S., and Schmidhuber, J. <year>1997</year>. <article-title>Long short-term memory</article-title>. <source>Neural Computation</source> <volume>9</volume>:<fpage>1735</fpage>-<lpage>1780</lpage>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>Hsueh, C.-H.; Wu, I.-C.; Chen, J.-C.; and Hsu, T.-S. <year>2018</year>. <article-title>AlphaZero for a non-deterministic game</article-title>. <source>2018 Conference on Technologies and Applications of Artificial Intelligence</source>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>Mnih, V.; et al. <year>2013</year>. <article-title>Playing Atari with deep reinforcement learning</article-title>. <source>arXiv:1312.5602</source>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>Nair, S.; Thakoor, S.; and Jhunjhunwala, M. <year>2017</year>. <article-title>Learning to play Othello without human knowledge</article-title>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>Ranasinghe, N., and Shen, W.-M. <year>2008</year>. <article-title>Surprise-based learning for developmental robotics</article-title>. <source>2008 ECSIS Symposium on Learning and Adaptive Behaviors for Robotic Systems (LAB-RS)</source>.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. <year>2017</year>. <article-title>Proximal policy optimization algorithms</article-title>. <source>arXiv:1707.06347</source>.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. <year>2016</year>. <article-title>Mastering the game of Go with deep neural networks and tree search</article-title>. <source>Nature</source> <volume>529</volume>(<issue>7587</issue>):<fpage>484</fpage>.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T.; Simonyan, K.; and Hassabis, D. <year>2017a</year>. <article-title>Mastering chess and shogi by self-play with a general reinforcement learning algorithm</article-title>. <source>arXiv:1712.01815</source>.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; Chen, Y.; Lillicrap, T.; Hui, F.; Sifre, L.; van den Driessche, G.; Graepel, T.; and Hassabis, D. <year>2017b</year>. <article-title>Mastering the game of Go without human knowledge</article-title>. <source>Nature</source> <volume>550</volume>(<issue>7676</issue>):<fpage>354</fpage>.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. <year>2017</year>. <article-title>Attention is all you need</article-title>. In <source>Advances in Neural Information Processing Systems</source> <volume>30</volume>, <fpage>5998</fpage>-<lpage>6008</lpage>. Curran Associates, Inc.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; Quan, J.; Gaffney, S.; Petersen, S.; Simonyan, K.; Schaul, T.; van Hasselt, H.; Silver, D.; Lillicrap, T.; Calderone, K.; Keet, P.; Brunasso, A.; Lawrence, D.; Ekermo, A.; Repp, J.; and Tsing, R. <year>2017</year>. <article-title>StarCraft II: A new challenge for reinforcement learning</article-title>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>