<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Evaluating Reinforcement Learning Algorithms For Evolving Military Games</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">James</forename><surname>Chao</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Equal Contribution</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jonathan</forename><surname>Sato</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Equal Contribution</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Crisrael</forename><surname>Lucero</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Doug</forename><forename type="middle">S</forename><surname>Lange</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Naval Information Warfare Center Pacific</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Evaluating Reinforcement Learning Algorithms For Evolving Military Games</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">13AF5728876136D333F48C733C65030A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T20:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we evaluate reinforcement learning algorithms for military board games. Currently, machine learning approaches to most games assume certain aspects of the game remain static. This methodology results in a lack of algorithm robustness and a drastic drop in performance when in-game mechanics change. To this end, we will evaluate general game playing AI algorithms (Perez-Liebana et al. 2018) on evolving military games.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>AlphaZero <ref type="bibr" target="#b7">(Silver et al. 2017a</ref>) described an approach that trained an AI agent through self-play to achieve superhuman performance. While the results are impressive, we want to test whether the same algorithms used in games are robust enough to translate into more complex environments that more closely resemble the real world. To our knowledge, papers such as <ref type="bibr" target="#b3">(Hsueh et al. 2018</ref>) examine AlphaZero on non-deterministic games, but not much research has been performed on progressively complicating and evolving the game environment, mechanics, and goals. Therefore, we tested these different aspects of robustness on AlphaZero models. We intend to continue future work evaluating different algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Background and Related Work</head><p>Recent breakthroughs in game AI have generated a large amount of excitement in the AI community. Game AI not only can provide advancement in the gaming industry, but can also be applied to help solve many real-world problems. After Deep-Q Networks (DQNs) were used to beat Atari games in 2013 (Mnih et al. 2013), Google DeepMind developed AlphaGo <ref type="bibr" target="#b6">(Silver et al. 2016</ref>), which defeated world champion Lee Sedol in the game of Go using supervised learning and reinforcement learning. One year later, AlphaGo Zero <ref type="bibr" target="#b7">(Silver et al. 2017b</ref>) was able to defeat AlphaGo with no human knowledge and pure reinforcement learning. Soon after, AlphaZero <ref type="bibr" target="#b7">(Silver et al. 2017a)</ref> generalized AlphaGo Zero to play more games, including Chess, Shogi, and Go, creating a more general AI to apply to different problems. In 2018, OpenAI Five used five Long Short-Term Memory <ref type="bibr" target="#b2">(Hochreiter and Schmidhuber 1997)</ref> neural networks and a Proximal Policy Optimization <ref type="bibr" target="#b5">(Schulman et al. 2017</ref>) method to defeat a professional DotA team, each LSTM acting as a player in a team collaborating to achieve a common goal. AlphaStar used a transformer <ref type="bibr" target="#b7">(Vaswani et al. 2017)</ref>, LSTM <ref type="bibr" target="#b2">(Hochreiter and Schmidhuber 1997)</ref>, autoregressive policy head <ref type="bibr" target="#b8">(Vinyals et al. 2017</ref>) with a pointer <ref type="bibr" target="#b9">(Vinyals, Fortunato, and Jaitly 2015)</ref>, and a centralized value baseline <ref type="bibr" target="#b1">(Foerster et al. 2017)</ref> to beat top professional StarCraft II players. Pluribus <ref type="bibr" target="#b0">(Brown and Sandholm 2019)</ref> used Monte Carlo counterfactual regret minimization to beat professional poker players.</p><p>AlphaZero was chosen due to its proven ability to play at superhuman levels without doubt of merely winning due to fast machine reaction and domain knowledge; however, we are not limited to AlphaZero as an algorithm. Since the original AlphaZero is generally applied to well-known games with well-defined rules, we built our base case game and applied a general AlphaZero algorithm <ref type="bibr" target="#b4">(Nair, Thakoor, and Jhunjhunwala 2017)</ref> in order to have the ability to modify both the game code and the algorithm code to experiment with evolving game environments, such as Surprise-based learning (Ranasinghe and Shen 2008).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Game Description: Checkers Modern Warfare</head><p>The basic game that has been developed to test our approach consists of two players with a fixed-size, symmetrical square board. Each player has the same number of pieces placed symmetrically on the board. Players take turns according to the following rules: the turn player chooses a single piece and either moves the piece one space or attacks an adjacent piece in the up, down, right, or left directions. The turn is then passed to the next player. This continues until pieces of only one team remain or the stalemate turn count is reached. Two sample turns are shown in Figure <ref type="figure" target="#fig_1">1</ref>. The game state is fully observable, symmetrical, zero-sum, turn-based, discrete, deterministic, static, and sequential.</p></div>
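The rules above can be captured as a minimal state/transition model. The following is an illustrative sketch, not the authors' implementation; the board encoding, function names, and single-hit capture semantics are all assumptions:

```python
# Sketch of the Checkers Modern Warfare state and turn logic.
# Board encoding (an assumption): 0 = empty, +1 = player 1 piece, -1 = player 2 piece.

DIRS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def make_board(n, p1, p2):
    """Create an n x n board with the given piece coordinates for each player."""
    board = [[0] * n for _ in range(n)]
    for r, c in p1:
        board[r][c] = 1
    for r, c in p2:
        board[r][c] = -1
    return board

def step(board, player, src, direction, attack=False):
    """Move a piece one space, or attack an adjacent piece, in place.
    Returns True if the action was legal and applied (single-hit capture
    is an assumption; the paper does not specify attack resolution)."""
    n = len(board)
    r, c = src
    dr, dc = DIRS[direction]
    tr, tc = r + dr, c + dc
    if board[r][c] != player or not (0 <= tr < n and 0 <= tc < n):
        return False
    if attack:
        if board[tr][tc] != -player:      # must target an adjacent enemy piece
            return False
        board[tr][tc] = 0                 # attacked piece is removed
    else:
        if board[tr][tc] != 0:            # destination must be empty
            return False
        board[tr][tc], board[r][c] = player, 0
    return True

def winner(board):
    """+1 / -1 once only one team's pieces remain, else 0 (game continues)."""
    p1 = any(v == 1 for row in board for v in row)
    p2 = any(v == -1 for row in board for v in row)
    return 0 if (p1 and p2) else (1 if p1 else -1)
```

A driver loop would alternate players, calling `step` once per turn, until `winner` is nonzero or the stalemate turn count is reached.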
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Methodology</head><p>The methodology we propose starts from a base case and incrementally builds to more complicated versions of the game. This involves training on less complicated variations of the base case and testing on never-before-seen aspects from the list below. These never-before-seen mechanics can come into play at the beginning of a new game or part-way through the new game. The way we measure the successful adaptation of the agent is based on comparing the win/loss/draw ratios before the increase in difficulty and after. The different variations to increase game complexity are described in the sections below.</p></div>
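The before/after comparison of win/loss/draw ratios can be sketched as a small metric. This is an illustrative formulation (the paper does not define an exact adaptation criterion); the function names are assumptions:

```python
from collections import Counter

def outcome_rates(results):
    """results: iterable of 'win' / 'loss' / 'draw' strings -> dict of rates."""
    counts = Counter(results)
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("win", "loss", "draw")}

def adaptation_delta(before, after):
    """Change in win rate after a never-before-seen mechanic is introduced;
    a small drop (or a gain) suggests the agent adapted. Illustrative only."""
    return outcome_rates(after)["win"] - outcome_rates(before)["win"]
```

For example, comparing the base-case result against a variant's result reduces to `adaptation_delta(base_results, variant_results)`.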
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Disrupting Board Symmetry</head><p>We propose two methods for disrupting board symmetry. The first introduces off-limits spaces that pieces cannot move to, so that the board no longer maps onto itself when rotated about a symmetry axis. The second disrupts piece symmetry by using non-symmetrical starting positions.</p></div>
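Whether a given setup still has the base case's symmetry can be checked directly. A minimal sketch, assuming boards are lists of lists with player-signed values (+1/-1) and that the relevant symmetry is a 180-degree rotation with the two sides swapped:

```python
def rotate180(board):
    """Rotate a square board 180 degrees."""
    return [row[::-1] for row in board[::-1]]

def negate(board):
    """Swap the two players' pieces (sign-flip every cell)."""
    return [[-v for v in row] for row in board]

def is_symmetric(board):
    """True if rotating the board 180 degrees and swapping the players'
    pieces reproduces the same position, i.e. the base-case symmetry holds."""
    return rotate180(negate(board)) == board
```

Off-limits spaces or non-symmetrical starting positions make `is_symmetric` return False, which is exactly the disruption this section describes.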
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Changing the Game Objective</head><p>This variation changes the game-winning mechanisms for the players, suddenly shifting the way the agent needs to play the game. For example, instead of capturing enemy pieces, the objective becomes capturing the flag. Another example of changing objectives is having the players pursue different goals, such as one player focusing on survival while the other focuses on wiping out the opponent's pieces as fast as possible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mid-Game Changes</head><p>Many of the above changes can be made part-way through a game, making the timing of changes part of the difficulty. In addition to the existing changes, other mid-game changes can include a sudden "catastrophe", where the enemy gains a number of units or you lose a number of units, and the introduction of a new player as an ally, an enemy's ally, or a neutral third party.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Case Study and Results</head><p>The base case game consists of a five by five board and six game pieces, three for each side. The three pieces of each team are set in opposing corners of the board, as seen in Figure <ref type="figure" target="#fig_2">2</ref>. The top right box of the board is a dedicated piece of land that pieces are not allowed to move to. During each player's turn, the player has the option of moving a single piece or attacking another game piece with one of their game pieces. This continues until no pieces from one team are left or until 50 turns have elapsed, signaling a stalemate. This base case can be incrementally changed according to one or multiple aspects described in the methodology section. Convergence occurs around 10 iterations; this is earlier than initially expected, possibly due to the lack of game complexity in the base case game. More studies will be conducted once game complexity is increased. We dialed up the Cpuct hyper-parameter to 4 to encourage exploration; the model simply converges at a slower rate to the same winning model as when Cpuct equals 1.</p></div>
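For reference, the training setup reported in the Figure 2 caption can be collected into a single configuration mapping, as a generic AlphaZero trainer (such as the Nair, Thakoor, and Jhunjhunwala codebase the authors built on) might consume it. The key names here are illustrative assumptions, not the actual codebase's parameter names:

```python
# Training configuration reported in the paper (see the Figure 2 caption).
config = {
    "num_iterations": 200,            # training iterations
    "episodes_per_iteration": 100,    # self-play episodes per iteration
    "mcts_simulations": 20,           # MCTS simulations per episode
    "arena_games": 40,                # games played to decide model improvement
    "cpuct": 1,                       # raised to 4 in the exploration experiment
    "learning_rate": 0.001,
    "dropout": 0.3,
    "epochs": 10,
    "batch_size": 16,
    "num_channels": 128,
}
```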
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Observations on AlphaZero</head><p>Game design is important: since AlphaZero is a Monte Carlo method, we need to make sure the game ends in a timely manner. Furthermore, we cannot punish draws but instead give a near-zero reward, since AlphaZero generally uses the same agent to play both players and simply flips the board to play against itself. This could potentially cause issues down the road if we were to change the two players' goals to differ from one another; for example, player one wants to destroy a building, while player two wants to defend the building at all costs.</p><p>Board symmetry does not affect agent learning in AlphaZero.</p></div>
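The "flip the board" self-play trick mentioned above is commonly implemented as a canonical-form transformation. A sketch under the assumption of player-signed cell values; the names and the exact flip (sign change plus 180-degree rotation) are illustrative of the AlphaZero-style convention, not the authors' exact code:

```python
def canonical_form(board, player):
    """Present the board from the current player's perspective, so one
    network can play both sides: player 2 sees its own pieces as the
    positive values."""
    if player == 1:
        return [row[:] for row in board]
    # flip signs and rotate 180 degrees so player 2's position looks like player 1's
    return [[-v for v in row[::-1]] for row in board[::-1]]

def reward(result, draw_value=1e-4):
    """Terminal reward from the current player's perspective. Draws get a
    small near-zero value rather than a punishment, as noted above; the
    exact draw_value is an assumption."""
    return {"win": 1.0, "loss": -1.0, "draw": draw_value}[result]
```

Because one network evaluates both perspectives, asymmetric objectives (one player defending, one attacking) break this scheme, which is the issue the section anticipates.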
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Non Symmetrical Board</head><p>The trained agent is used to play different Checkers Modern Warfare variants, starting with a one-degree variant: making the board non-symmetrical with random land fills that players cannot move their pieces to. To do this, at the beginning of the game we disabled locations</p><formula xml:id="formula_0">[0,0],[1,0],[2,0],[3,0],[4,0],[0,1],[1,1],[0,3],[2,3], put player 1 pieces at [2,1],[3,1],</formula><p>and player 2 pieces at [2,4] and [3,4], as shown in Figure <ref type="figure" target="#fig_5">5</ref>.</p><p>The agent trained with 200 iterations from the above section was pitted against a random player, winning 70 games, losing 1, and drawing 29. This shows the trained agent can deal with disrupted board symmetry and a game board with a different terrain setup.</p></div>
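The test board described above can be reconstructed directly from the listed coordinates. A sketch; whether `[x,y]` is column-major or row-major is an assumption, since the paper does not say:

```python
# Sketch of the non-symmetrical test board from Figure 5.
# '.' = open space, 'X' = off-limits land, '1' / '2' = team pieces.
DISABLED = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (0, 1), (1, 1), (0, 3), (2, 3)]
P1 = [(2, 1), (3, 1)]
P2 = [(2, 4), (3, 4)]

def build_board(n=5):
    board = [["." for _ in range(n)] for _ in range(n)]
    for x, y in DISABLED:       # interpreting [x,y] as (column, row) is an
        board[y][x] = "X"       # assumption about the paper's coordinates
    for x, y in P1:
        board[y][x] = "1"
    for x, y in P2:
        board[y][x] = "2"
    return board
```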
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Non Symmetrical Board with Random Added New Pieces Mid Game</head><p>Starting with the non-symmetrical board shown in Figure <ref type="figure" target="#fig_5">5</ref>, at turn 25 in a 100 turn game we add 3 reinforcement pieces for each team at a random location if the space is empty during the turn. The trained agent won 80 games, lost 6, and drew 14, performing relatively well with the new randomness introduced.</p></div>
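The mid-game reinforcement event can be sketched as follows, assuming the signed-cell board encoding from earlier. Note the simplification: this version places each reinforcement on a randomly chosen empty cell, whereas the paper chooses a random location and only places the piece if that space happens to be empty:

```python
import random

def add_reinforcements(board, turn, trigger_turn=25, per_team=3, rng=random):
    """At the trigger turn, drop reinforcement pieces for each team
    (+1 and -1) onto randomly chosen empty cells. Illustrative sketch."""
    if turn != trigger_turn:
        return board
    n = len(board)
    for team in (1, -1):
        empties = [(r, c) for r in range(n) for c in range(n) if board[r][c] == 0]
        rng.shuffle(empties)
        for r, c in empties[:per_team]:
            board[r][c] = team
    return board
```

The "deleted pieces" variants in the following sections are the mirror image: pick cells at the trigger turn and set them to 0 regardless of occupancy.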
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Non Symmetrical Board with Random Deleted Pieces Mid Game</head><p>Starting with the non-symmetrical board shown in Figure <ref type="figure" target="#fig_5">5</ref>, at turn 25 in a 100 turn game we blow up 3 random spots with bombs, destroying any pieces at those locations. The trained agent won 84 games, lost 11, and drew 5, performing relatively well with the new randomness introduced.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Non Symmetrical Board with Statically Added New Pieces Mid Game</head><p>Starting with the non-symmetrical board shown in Figure <ref type="figure" target="#fig_5">5</ref>, at turn 25 in a 100 turn game we add 3 reinforcement pieces for each team at specific locations if the space is empty during the turn. Team 1 is reinforced at locations [2,1], [3,1], [4,1]; team 2 is reinforced at locations [2,4], [3,4], [4,4]. The trained agent won 77 games, lost 2, and drew 21, performing relatively well with mid-game state space changes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Non Symmetrical Board with Statically Deleted Pieces Mid Game</head><p>Starting with the non-symmetrical board shown in Figure <ref type="figure" target="#fig_5">5</ref>, at turn 25 in a 100 turn game we blow up every piece at locations [2,1], [3,1], [4,1], [2,4], [3,4], [4,4]. The trained agent won 83 games, lost 8, and drew 9, performing relatively well with mid-game state space changes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Non Symmetrical Board with Non Deterministic Moves</head><p>Movements and attacks are now non-deterministic: 20% of moves or attacks are nullified, resulting in a no-op. Testing on a 50 turn game, the trained agent won 55 games, lost 10, and drew 35. We then tested the same rules with 50% of movements and attacks nullified; the trained agent won 34 games, lost 10, and drew 56. Finally, with 80% of movements and attacks nullified, the trained agent won 8 games, lost 3, and drew 89. The results indicate the agent performed relatively well, with the observation that more randomly assigned no-ops result in more draws.</p></div>
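The nullification mechanic amounts to wrapping the deterministic transition function in a coin flip. A sketch with illustrative names (the paper does not describe its implementation):

```python
import random

def maybe_nullify(apply_action, noop_prob, rng=random):
    """Wrap an action-application function so that with probability
    noop_prob the chosen move or attack becomes a no-op, leaving the
    state unchanged. Sketch of the non-deterministic variant."""
    def noisy_apply(state, action):
        if rng.random() < noop_prob:
            return state              # move/attack nullified
        return apply_action(state, action)
    return noisy_apply
```

Running the same game with `noop_prob` set to 0.2, 0.5, and 0.8 reproduces the three experimental conditions above; more nullified turns mean fewer captures before the stalemate turn count, consistent with the observed rise in draws.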
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Changing Game Objective</head><p>We changed the game objective to capture the flag and used the agent trained on eliminating the enemy team. The agent won 10 games, lost 4, and drew 6 over 20 games. We then changed the game objective after 25 turns in a 50 turn game; the agent won 9 games, lost 5, and drew 6 over 20 games. The agent performed relatively well with changing game objectives even though it was not trained on this objective. We suspect this is due to the trained agent having learned generic game-playing techniques, such as movement patterns on a square-type board.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Non Symmetrical Game Objective</head><p>Finally, we changed the game objective to be non-symmetrical, meaning the two players have different game-winning conditions: player 1 has the goal of protecting a flag, while player 2 has the goal of destroying the flag. AlphaZero could not train this agent with good results, since it uses one neural network to train both players. Therefore, future work will be to change the AlphaZero algorithm to a multi-agent learning system where two agents are trained on two different objectives.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>As we incrementally increase the complexity of the game, we discover the robustness of the algorithms in more complex environments and can then apply different strategies to improve the AI's flexibility to accommodate more complex and stochastic environments. We learned that AlphaZero is robust to board changes, but less flexible in dealing with other aspects of game change.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>This will certify that all author(s) of the above article/paper are employees of the U.S. Government and performed this work as part of their employment, and that the article/paper is therefore not subject to U.S. copyright protection. No copyright. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of AAAI Symposium on the 2nd Workshop on Deep Models and Artificial Intelligence for Defense Applications: Potentials, Theories, Practices, Tools, and Risks, November 11-12, 2020, Virtual, published at http://ceur-ws.org</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Sample Two Turns: The first of the three boards shows the state of the board before the turn starts. The player of the dark star piece chooses to move one space down resulting in the second board. The third board is a result of the player of the light star piece attacking the dark star piece.</figDesc><graphic coords="2,54.00,122.37,238.49,59.62" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Base case board setup used for initial training and testing. We trained an AlphaZero-based agent using an Nvidia DGX with 4 Tesla GPUs for 200 iterations, 100 episodes per iteration, 20 Monte Carlo Tree Search (MCTS) simulations per episode, 40 games to determine model improvement, a Cpuct of 1, 0.001 learning rate, 0.3 dropout rate, 10 epochs, batch size of 16, and 128 channels. The table below shows our results after pitting the model at certain iterations against a random player. Iteration (wins/losses/draws): 0: 18/22/60; 10: 41/8/51; 20: 45/1/54; 30: 40/3/57; 70: 23/4/73; 140: 41/3/56; 200: 44/1/55.</figDesc><graphic coords="2,368.25,198.80,141.00,141.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The trained agent starts winning more consistently after 10 iterations.</figDesc><graphic coords="3,77.00,54.00,192.50,116.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Draws are constant throughout the training process</figDesc><graphic coords="3,77.00,217.52,192.50,116.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Non-symmetrical board setup used for incremental case testing.</figDesc><graphic coords="3,368.25,54.00,141.00,141.40" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Superhuman AI for multiplayer poker</title>
		<author>
			<persName><forename type="first">N</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sandholm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note>See also: D. Perez-Liebana et al. General video game AI: a multi-track framework for evaluating agents, games and content generation algorithms. arXiv:1802.10363, 2018</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Counterfactual multi-agent policy gradients</title>
		<author>
			<persName><forename type="first">J</forename><surname>Foerster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Farquhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Afouras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nardelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Whiteson</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1705.08926</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">AlphaZero for a non-deterministic game</title>
		<author>
			<persName><forename type="first">C.-H</forename><surname>Hsueh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I.-C</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-S</forename><surname>Hsu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2018 Conference on Technologies and Applications of Artificial Intelligence</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>See also: V. Mnih et al. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Learning to play Othello without human knowledge</title>
		<author>
			<persName><forename type="first">S</forename><surname>Nair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Thakoor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jhunjhunwala</surname></persName>
		</author>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note>See also: N. Ranasinghe and W.-M. Shen. Surprise-based learning for developmental robotics. In: ECSIS Symposium on Learning and Adaptive Behaviors for Robotic Systems (LAB-RS), 2008</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wolski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Klimov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1707.06347</idno>
		<title level="m">Proximal policy optimization algorithms</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Mastering the game of go with deep neural networks and tree search</title>
		<author>
			<persName><forename type="first">D</forename><surname>Silver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Maddison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sifre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Van Den Driessche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schrittwieser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Antonoglou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Panneershelvam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lanctot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dieleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Grewe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Nham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kalchbrenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Timothy</forename><surname>Lillicrap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Madeleine</forename><surname>Leach</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">529</biblScope>
			<biblScope unit="page">484</biblScope>
			<date type="published" when="2016">2016. 7587</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Mastering chess and shogi by self-play with a general reinforcement learning algorithm</title>
		<author>
			<persName><forename type="first">D</forename><surname>Silver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schrittwieser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Antonoglou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lanctot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sifre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kumaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Graepel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lillicrap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hassabis</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1712.01815</idno>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note>See also: D. Silver et al. Mastering the game of Go without human knowledge. Nature 550(7676), 2017; A. Vaswani et al. Attention is all you need. In: Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 5998-6008</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Starcraft ii: A new challenge for reinforcement learning</title>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ewalds</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bartunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Georgiev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Vezhnevets</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yeo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Makhzani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Küttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Agapiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schrittwieser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Quan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gaffney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Petersen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schaul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Van Hasselt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Silver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lillicrap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Calderone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Keet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Brunasso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ekermo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Repp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tsing</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Pointer networks</title>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fortunato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Jaitly</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1506.03134</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
