<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep Reinforcement Learning with VizDoom First-Person Shooter</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we study deep reinforcement learning algorithms for partially observable Markov decision processes (POMDP) combined with Deep Q-Networks. To our knowledge, we are the first to apply standard Markov decision process architectures to POMDP scenarios. We propose an extension of DQN with Dueling Networks and several other model-free policies for training agents with deep reinforcement learning in the VizDoom environment, which is a replication of the Doom first-person shooter. We develop several agents for the following scenarios in the VizDoom first-person shooter (FPS): Basic, Defend The Center, and Health Gathering. We compare our agent with a Recurrent DQN with Prioritized Experience Replay and Snapshot Ensembling agent and obtain approximately a threefold increase in per-episode reward. It is important to note that the POMDP setting closes the gap between the human and computer player settings, thus providing a more meaningful justification of Deep RL agent performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Reinforcement Learning</kwd>
        <kwd>VizDoom</kwd>
        <kwd>First-Person Shooter</kwd>
        <kwd>DQN</kwd>
        <kwd>Double Q-learning</kwd>
        <kwd>Dueling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>First-person shooter (FPS) is a type of video game in which a computer player avatar under human control competes with other agents and human players, with a visual representation as if a human sees through the avatar's eyes. The simplest elements of playing an FPS are traversing the maze, collecting bonuses, and cooperating with and fighting other players with a ranged weapon. FPS games are demanding of players' skills and are hard to train supervised AI models for with passive learning.</p>
      <p>Recently, models of supervised active learning, such as Reinforcement Learning (RL), have achieved state-of-the-art results in many computer-vision-related tasks, such as robotics, path planning, and playing computer games at a human level. In the RL framework, an FPS is presented as a complex-state environment with finite control actions and a certain goal, for example, to maximize the kill-death ratio during one game session or episode, which may then be divided into subgoals such as map navigation, health pack collection, weapon use, or enemy detection. Learning an agent policy in 3D FPSs using RL is computationally hard: rewards are sparse and usually highly delayed (a player is not always killed with one shot). Moreover, an agent does not have complete information on the environment with which to model human decision making: enemies' positions are unknown, and the angle of view is limited to 90-110 degrees, matching human perception. The only information a Deep RL agent can use for action choice is a game screenshot (an image captured from the rendered scene), so even navigation in a 3D maze is challenging and involves storing known routes, planning, and proper feature extraction to learn the map without knowing its navigation mesh.</p>
      <p>
        We choose the VizDoom [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as the simulation environment with three core scenarios with different goals and action sets: Basic, to learn navigation and monster detection; Defend The Center, to learn accurate aiming and thereby improve enemy detection and ammo preservation; and Health Gathering, to learn to detect and collect health packs while navigating a hazardous environment with acid on the floor.
      </p>
      <p>We combine several existing MDP models for RL agent learning in POMDP and present a new model-free Deep RL agent, which showed its efficiency compared to other Deep RL models applied to VizDoom agents.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Here, we give an overview of existing models following our previous study on
one scenario presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Rainbow</title>
        <p>
          In `Rainbow' paper [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the authors combined DQN with such models as Double Q-Learning, Dueling Architecture, Multi-step learning, Prioritized Replay, C51, and Noisy Networks for exploration. They achieve 8 times faster learning than DQN alone in the Arcade Learning Environment [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] in the MDP setting, but performance in the POMDP setting is unknown.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Arnold</title>
        <p>
          In `Arnold' paper [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the authors develop a Deep RL agent to play the Deathmatch scenario in the VizDoom environment, augmenting the agent with in-game features during training: enemy detection, reward shaping for subgoals, separate networks for action and navigation, and dropout to reduce overfitting of the convolutional feature extractor. Their agent substantially outperforms deterministic computer-player agents and even humans.
        </p>
        <p>However, the `Arnold' agent did not exploit Q-learning extensions that could significantly improve its performance, and the way they train the agent goes against the overall framework of Deep RL agent training in POMDP.</p>
      </sec>
      <sec id="sec-2-3">
        <title>DRQN with Prioritized Experience Replay, Double Q-learning and Snapshot Ensembling</title>
        <p>
          Prioritized Experience Replay (PER) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is a way to speed up training. Samples from experience replay are drawn with non-uniform probabilities: tuples ⟨o, a, r, o′⟩ with higher loss values are preferred because they carry more information than those with low loss values, which was shown to work well in Atari games.
        </p>
        <p>
          Authors of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] combined PER with Double Q-learning and Snapshot Ensembling and tested their agent in the VizDoom Defend The Center scenario. The authors train the enemy detector and the Q-function jointly; this model was chosen as the baseline in our case. Deep Reinforcement Learning improvements for MDP achieved super-human performance in many games [
          <xref ref-type="bibr" rid="ref10 ref9">10,9</xref>
          ]. However, these improvements had not been considered in the POMDP setting before [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. We combine such frameworks from MDP with POMDP to improve state-of-the-art methods, considering several scenarios for FPS according to the VizDoom DRL competition.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Deep Reinforcement Learning Overview</title>
      <p>The general reinforcement learning goal is to learn an optimal policy for an agent that maximizes scalar reward by interacting with the environment.</p>
      <p>At each time step t, an agent observes the state s_t of the environment, makes a decision about the best action a_t according to its policy, and receives the reward r_t. This process is known as a Markov Decision Process (MDP), denoted as a tuple (S, A, R, T), where S is a finite state space, A is a finite action space, R is a reward function mapping a pair (s, a) ∈ (S, A) into a stochastic reward r, and, last but not least, T is a transition kernel: T(s, a, s′) = P(S_{t+1} = s′ | S_t = s, A_t = a).</p>
      <p>The discounted return from state s_t is expressed by the following formula: G_t = Σ_{k≥0} γ^k r_{t+k}, where γ ∈ (0, 1) is a discount factor reducing the impact of rewards at later steps. The choice of γ depends on game duration: training with larger values of γ yields advanced strategies for long game sessions, while a small γ favors short-term rewards.</p>
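      <p>As a quick illustration, the discounted return can be computed recursively from the end of an episode. This is a minimal sketch (the function name is ours), not the training code used in the paper.</p>

```python
def discounted_return(rewards, gamma):
    """Discounted return G_t = sum_k gamma^k * r_{t+k},
    computed backwards so each step is a single multiply-add."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

      <p>For example, with γ = 0.5 and rewards [1, 1, 1], the return is 1 + 0.5 + 0.25 = 1.75.</p>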
      <p>In order to choose actions, an agent uses a policy π(a|s) = P(a|s). We call a policy optimal if it maximizes the expected discounted return: π* = argmax_π E[G_t].</p>
      <p>We consider Q-learning as the core method for training RL agents.</p>
      <sec id="sec-3-1">
        <title>Q-learning</title>
        <p>To measure the quality of a given policy π, one can use the action-value function Q^π, defined as:</p>
        <p>Q^π(s, a) = E_π[G | s_0 = s, a_0 = a] (1)</p>
        <p>The expectation of the Q-value is computed over all possible actions and state rewards, with the agent started from state s, performing action a, and then following policy π. If the true Q-function Q* is given, we can derive the optimal policy by taking the action a* that maximizes Q* in each state s: a* = argmax_{a′} Q*(s, a′). To learn Q* for the optimal policy, we use the Bellman equation (2):</p>
        <p>Q*(s, a) = r(s, a) + γ max_{a′} Q*(s′, a′) (2)</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], the authors proved that sequential assignment converges from any Q to the optimal Q* if there are only a finite number of actions and states, and all possible combinations are visited repeatedly. The problem lies in the following: if we start learning from some initialization of the Q-function, we will learn nothing due to the max operator, leaving no space for exploration. To overcome this issue, sampling actions with probabilities p(a_i) = exp Q(s, a_i) / Σ_j exp Q(s, a_j) = softmax(a_i) (Boltzmann approach) or epsilon-greedy sampling were proposed.
        </p>
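        <p>The two exploration schemes above can be sketched as follows; this is an illustrative snippet (function names are ours, not from the paper).</p>

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy action argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values):
    """Boltzmann policy: p(a_i) = exp Q(s, a_i) / sum_j exp Q(s, a_j)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp(q - m) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]
```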
        <p>In what follows, we consider the problem of approximating the Q-function with deep neural networks when we have an infinite number of states.</p>
        <p>
          Deep Q-Network In [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], the authors presented an agent achieving super-human performance in Atari games, starting the era of Deep RL achievements. The authors use a two-layer Convolutional Neural Network (CNN) as a feature extractor and add two fully-connected layers to estimate the Q-function. The input of this network is a screenshot from the game, and the output is a Q-value, one per action. The proposed model cannot directly apply rule (2) and instead computes the Temporal Difference (TD) error:
        </p>
        <p>TD = Q(s_i, a_i) − (r_i + γ max_{a′} Q(s′_i, a′)) (3)
and then minimizes its square. The authors used the online network to estimate the Q(s_i, a_i) term and the target network to estimate the max_{a′} Q(s′_i, a′) term. The online network was trained via backpropagation, while values produced by the target network were fixed and updated periodically. The training of online network parameters θ and target network parameters θ̃ was done via the following loss:</p>
        <p>
          L = Σ_i (Q(s_i, a_i; θ) − (r_i + γ max_{a′} Q(s′_i, a′; θ̃)))² (4)
where the sum is computed over a batch of samples through which the network propagates the error. To form a batch, the authors in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] propose to use experience replay, which contains tuples ⟨s, a, r, s′⟩ of state, performed action, reward, and next state. The main idea is to estimate Q(s, a) only for the performed actions, rather than for all possible actions.
        </p>
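        <p>The loss (4) and the uniform experience replay it is computed over can be sketched as follows. This is a simplified illustration: q_online and q_target stand in for the online and target networks, and the gradient update itself is omitted.</p>

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity uniform experience replay of (s, a, r, s') tuples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def dqn_loss(batch, q_online, q_target, gamma):
    """Sum of squared TD errors (4) over a batch; q_online / q_target
    map a state to a list of Q-values, one per action."""
    loss = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * max(q_target(s_next))  # target network, frozen
        td = q_online(s)[a] - target                # online network, trained
        loss += td ** 2
    return loss
```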
        <p>
          Although DQN showed human-like (or even better) performance in Atari games, it has certain drawbacks in games requiring complicated strategies, and also shows slow and unstable training [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], while also overestimating the expected reward [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Below, we consider extensions of DQN which can stabilize and improve the training process, overcoming the issues mentioned above.
        </p>
        <p>
          Double Q-learning Conventional Q-learning suffers from the max operator in (2): the agent always overestimates the obtained reward. There is a simple solution, called Double Q-learning [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], that is to replace max_{a′} Q(s′, a′) with
        </p>
        <p>Q(s′, argmax_{a′} Q(s′, a′; θ); θ̃) (5)
leading to a faster and more stable training process.</p>
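        <p>The Double Q-learning target (5) differs from the standard one only in where the argmax is taken; a minimal sketch (helper names are ours):</p>

```python
def double_q_target(r, s_next, q_online, q_target, gamma):
    """Target (5): the action is selected by the online network,
    but its value is read from the target network, reducing the
    overestimation caused by a single max."""
    qs = q_online(s_next)
    a_star = max(range(len(qs)), key=lambda a: qs[a])
    return r + gamma * q_target(s_next)[a_star]
```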
        <p>
          Dueling Networks The Dueling network [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] decomposes the Q-function into Q(s, a) = V(s) + A(s, a), where V(s) is the value and A(s, a) is the advantage. Estimating each of them in its own stream significantly boosts training performance; the Q-function is computed by the following rule:
        </p>
        <p>Q(s, a) = V(s) + A(s, a) − (1/N_a) Σ_j A(s, a_j) (6)</p>
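        <p>The aggregation rule (6) subtracts the mean advantage so that the value and advantage streams are identifiable; a one-function sketch:</p>

```python
def dueling_q(value, advantages):
    """Combine streams per (6):
    Q(s, a) = V(s) + A(s, a) - (1 / N_a) * sum_j A(s, a_j)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]
```

        <p>For V(s) = 1 and advantages [1, 3], the resulting Q-values are [0, 2].</p>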
      </sec>
      <sec id="sec-3-2">
        <title>Recurrent Q-learning</title>
        <p>Q-learning is trained under a fully observable environment state at each step, which is usually not the case in practical applications. In many cases, we can observe or trust only part of the environment state, leading to the Partially Observable Markov Decision Process (POMDP) setting. A POMDP is a tuple (S, A, R, T, Ω, O), where the first four items come from the MDP, Ω is the observation set, and O is the observation function: O(s_{t+1}, a_t) = p(o_{t+1} | s_{t+1}, a_t), ∀o ∈ Ω. An agent makes decisions by acting in the environment and receiving new observations; its belief distribution over the next state is based on the distribution of the current state.</p>
        <p>
          Deep Recurrent Q-Network Without complete information on state s_t in POMDP, it is impossible to use a DQN-like approach to estimate Q(s_t, a_t). However, there is a simple way to adapt the previous solutions to POMDP: the agent is equipped with a memory h_t and approximates Q(s_t, a_t) using the observation and the memory from the previous step, Q(o_t, h_{t−1}, a_t). One suitable solution is a recurrent neural network consuming observation o_t with a Long Short-Term Memory block [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], h_t = LSTM(o_t, h_{t−1}). Such networks are called Deep Recurrent Q-Networks (DRQN) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Experience replay in the POMDP setting contains tuples ⟨o, a, r, o′⟩ denoting observation, performed action, reward, and next observation. It is necessary to sample consecutive observations from experience replay in order to use the memory information, which is usually based on the last observations.</p>
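        <p>Sampling consecutive windows from a POMDP replay can be sketched as below. The dictionary layout of a transition is our assumption; the rule that only the final transition of a window may be terminal matches the constraint we use in training.</p>

```python
import random

def sample_sequence(replay, seq_len):
    """Draw seq_len consecutive transitions <o, a, r, o'> from the replay
    list, rejecting windows with a terminal anywhere but the last slot."""
    while True:
        start = random.randrange(len(replay) - seq_len + 1)
        window = replay[start:start + seq_len]
        if not any(t["done"] for t in window[:-1]):
            return window
```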
        <p>
          Multi-step learning DQN is trained on a single time step, while DRQN should be trained on a sequence, which immediately leads to the idea of extending the temporal difference (3) in the loss function to the n-step temporal difference [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]:
TD(n) = Q(s_i, a_i) − (r_i + γ r_{i+1} + … + γ^{n−1} r_{i+n−1} + γ^n max_{a′} Q(s_{i+n}, a′)) (7)
This method also trains faster, especially for games with sparse and delayed rewards, but the hyper-parameter n (the number of steps) should be tuned [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
Distributional RL The Q-function, as an expectation of the discounted return, may be estimated statistically by learning the distribution of the discounted return [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] with probability masses placed on a discrete support z with N atoms, z_i = V_min + iΔz, i = 0, …, N − 1, where Δz = (V_max − V_min)/(N − 1) and (V_min, V_max) is the admissible value interval. The authors denote the value distribution as Z; then the distributional version of (2) holds: Z(x, a) =_D R(x, a) + γ Z(X′, A′). The authors proposed a loss function and an exact algorithm called Categorical 51 (C51), where 51 is the number of atoms z_i. An agent trained using C51 is more expressive and converges to a better policy.
        </p>
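        <p>The n-step target inside (7) can be sketched directly from the formula; here q_next_max stands for max_{a′} Q(s_{i+n}, a′), and the function name is ours.</p>

```python
def n_step_target(rewards, q_next_max, gamma):
    """n-step target from (7):
    r_i + gamma*r_{i+1} + ... + gamma^(n-1)*r_{i+n-1}
    + gamma^n * max_a Q(s_{i+n}, a)."""
    n = len(rewards)
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return g + (gamma ** n) * q_next_max
```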
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Scenarios</title>
      <p>In this section, we describe the scenarios on which we have tested our agents. For each scenario, we provide general information, such as reward design, the agent's goal, and the ways the agent can reach the goal. We also provide screen resolution and frameskip settings and show a screenshot for each scenario.</p>
      <sec id="sec-4-1">
        <title>Basic</title>
        <p>The player spawns in a wide room with its back to one of the long walls. On the opposite side, a monster with 1 health point spawns at a random position along the wall (see Figure 1). Only 3 buttons (and 2^3 = 8 actions) are available to the player: attack, move left, move right. The goal is to navigate and hit the monster. The agent receives 100 reward for hitting, -5 for missing, and -1 for living; this forces the player to kill the monster as fast as possible. The Basic scenario is the simplest in the VizDoom environment and is mainly used to test the correctness of algorithms, so even an agent with a large frameskip can converge to a good policy.</p>
        <p>We decide to use frameskip 10 and a simplified version of the neural network, as well as a smaller number of training epochs and steps per epoch, for this scenario. We also resize the game screenshot from the base 640×480 resolution to 45×30 and convert it to grayscale.
In the second scenario, an agent with a pistol and a limited amount of ammo spawns in the center of a circular room and is only able to turn left, turn right, and fire (see Figure 2). Also, there are five melee enemies that spawn at random locations against the wall and make their way towards the agent in the center. When the agent kills one of them, another spawns at a random location. The agent's goal is to kill as many enemies as possible; it receives +1 reward for killing an enemy and -1 for death. For the agent, death is inevitable, because it has a limited amount of ammo. Although reward shaping may have a positive effect on agent performance, we decide not to use it in this scenario.</p>
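        <p>The screen preprocessing used across scenarios (downscaling plus grayscale conversion) can be sketched without any imaging library; nearest-neighbour sampling and the plain channel average are our simplifications of whatever resampling the environment wrapper actually performs.</p>

```python
def preprocess(frame, out_w=45, out_h=30):
    """Downsample an RGB frame (rows of (r, g, b) tuples) to out_w x out_h
    grayscale by nearest-neighbour sampling; 45x30 matches the Basic
    scenario, 108x60 the other scenarios."""
    in_h, in_w = len(frame), len(frame[0])
    out = []
    for y in range(out_h):
        src_row = frame[y * in_h // out_h]
        row = []
        for x in range(out_w):
            r, g, b = src_row[x * in_w // out_w]
            row.append((r + g + b) / 3.0)  # simple grayscale average
        out.append(row)
    return out
```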
        <p>
          This scenario is a good example of a POMDP: the agent can observe only 110 degrees and can only guess where enemies are located. For the agent, it is important to aim accurately, so we change the screen resolution to 400×225 (16:9 scale), resize it down to 108×60, and reduce frameskip to 4.
In the third scenario, the agent spawns in a square room with a dangerous substance on the floor dealing constant damage at each time step. The agent has no access to its health, supporting the POMDP setting. The possible actions are turn left, turn right, and move forward. There are health kits placed on the floor which, when collected, increase the agent's health, helping it to survive (see Figure 3). The following reward shaping was used [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]: the agent receives +1 for every step, -100 if it dies, and + the amount of healing, which is equal to health(o_t) − health(o_{t−1}) if this value is greater than zero, where health(o) represents the agent's health in the game. An episode ends if the agent dies or after 2100 tics. The agent is trained with frameskip 4 and a screen resolution of 108×60.
We use two agents as baselines: DQN and DRQN with an LSTM unit. Next, we consider two modifications. The first is DRQN with Dueling Networks, Double Q-Learning, and dropout with rate 0.5 (D4RQN). The second is a combination of C51 with Multi-step learning (C51M).
        </p>
        <p>Since the Basic scenario is very simple for an agent to learn, we decide to use the same feature extractor for every scenario, with its architecture presented in Table 1.</p>
        <p>After the convolutional layers, the feature map is reshaped into a vector and fed into dense layers. The DQN network has a dense layer with 128 units in the Basic scenario and 512 in the others, both with ReLU activation. Both DQN and DRQN have one more dense layer at the end, with as many units as actions and linear activation.</p>
        <p>D4RQN and C51M have a dropout layer after the convolutions with rate 0.5, followed by an LSTM layer with 128 units for Basic and 512 units for the other scenarios. D4RQN splits computation into two streams, a value stream and an advantage stream, implemented by dense layers with 1 unit and with as many units as actions, respectively, both with linear activation. Outputs from the streams are combined by formula (6), and targets during optimization are picked according to (5), thus combining Double Q-learning and Dueling Networks.</p>
        <p>In the C51M algorithm, there are atoms supporting the discounted reward distribution. Each atom has a probability, calculated as a softmax over atoms. We split the LSTM output into two streams: a value stream with as many units as atoms, and an advantage stream with as many linear layers as actions, each with as many units as atoms. For each atom, these streams are combined by formula (6), and then the softmax function is applied. To compute the Q-value from this distribution we use the formula Q(s, a) = Σ_i z_i p_i(s, a; θ), where z_i and p_i are the i-th atom's support and probability, and θ represents the network parameters. An illustration of the network architecture for the C51M algorithm is presented in Figure 4.</p>
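        <p>A minimal sketch of the C51 support construction and the Q-value readout Q(s, a) = Σ_i z_i p_i just described (function names are ours):</p>

```python
def c51_support(v_min, v_max, n_atoms):
    """Support z_i = V_min + i * dz, dz = (V_max - V_min) / (N - 1)."""
    dz = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * dz for i in range(n_atoms)]

def q_from_distribution(probs, support):
    """Q(s, a) = sum_i z_i * p_i(s, a), the mean of the learned
    return distribution."""
    return sum(z * p for z, p in zip(support, probs))
```

        <p>For the Basic setting below (V_min = -5, V_max = +15, 21 atoms), the atoms are spaced exactly 1 apart.</p>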
        <p>
          We choose 21 atoms for C51M in Basic (so it could be called C21M instead) and 51 atoms, as in the original algorithm, for the other scenarios. Next, we set the values of V_min and V_max in the atom support. For the Basic scenario, we set these values to -5 and +15, because it is relatively simple and the reward can take only 3 values: -15, -10, and 95. So it is reasonable to clip the maximum reward to 15 to balance it against the negative rewards. For Defend The Center, we set V_min = -5 and V_max = +15. In the Health Gathering scenario [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], we set these values to 5 and 195. The maximum possible reward equals the maximum episode length, which is 2100, and cannot be precisely calculated for the shaped reward.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiment design</title>
      <p>For the Basic scenario, we organize training as follows: first the agent makes one step in the environment, and then it samples an observation history from experience replay. We set the batch size to 32 and the sequence length to three, with the first frame used for hidden state estimation and the last two frames to train on. Each agent trains for 21 iterations of 2000 gradient descent steps; we call such an iteration an epoch. Before each epoch starts, we update the target network's weights. We set epsilon to 1.0 at the start of training and linearly decay it at each epoch's end, down to 0.01 in the final epoch. After each epoch, we test our agent with epsilon set to 0.01 and plot the mean episode reward over 100 episodes.</p>
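      <p>The linear epsilon schedule described above can be sketched as follows (the function name is ours):</p>

```python
def epsilon_for_epoch(epoch, n_epochs=21, eps_start=1.0, eps_end=0.01):
    """Linearly decay epsilon from eps_start at epoch 0
    down to eps_end at the final epoch."""
    frac = epoch / (n_epochs - 1)
    return eps_start + frac * (eps_end - eps_start)
```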
      <p>We set the experience replay size to 10^4 for the Basic scenario and to 10^5 for the others. When sampling sequences of consecutive observations from experience replay, we choose only those in which, at most, the last observation is a terminal one. It was important to reduce the number of terminal states for the Health Gathering scenario, because the agent receives a large negative reward if it dies on the last observation. For all the other scenarios we used different parameters: we increase the batch size and sequence length to 128 and 10, respectively. We use the first four observations to update the agent's memory and the last six to train on. We increase the number of gradient steps per epoch to 8000 and the number of steps before sampling from experience replay to 15. We also reduce the learning rate to 0.0002.</p>
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>For each scenario, we measure the TD loss at each gradient descent step (stability), the per-episode reward changes during training (learning speed), and the obtained reward after each training epoch (agent performance). In each of Figures 5 and 6 we visualize the C51M (light blue), DQN (dark blue), DRQN (orange), and D4RQN (red) agent training results. For Figure 7 the visualization differs slightly due to a color shift made by mistake: C51M (dark blue), DQN (red), DRQN (light blue), and D4RQN (orange). For the TD loss plots, we visualize C51M separately, so that we can see the difference between this model and the other agent architectures.</p>
      <sec id="sec-6-1">
        <title>Basic</title>
        <p>The TD loss of each agent, except C51M, is shown in Figure 5a. All values are smoothed by an exponential moving average with parameter 0.99. Normal behavior of the TD loss includes slow increase or decrease. The TD loss may jump down when the target network is updated. Towards the end of training, the loss function should decrease, signaling that the agent has learned something meaningful. From Figure 5a we can see that DRQN and D4RQN train stably and converge, but DQN has not converged. So, even without looking at the reward plots, we may assume that DQN has poor performance.</p>
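        <p>The smoothing applied to the loss curves is a standard exponential moving average; a minimal sketch, seeded with the first value and using parameter 0.99 as in the plots:</p>

```python
def ema_smooth(values, alpha=0.99):
    """Exponential moving average: s_t = alpha * s_(t-1) + (1 - alpha) * x_t,
    seeded with the first value."""
    smoothed, s = [], None
    for x in values:
        s = x if s is None else alpha * s + (1 - alpha) * x
        smoothed.append(s)
    return smoothed
```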
        <p>The TD loss for C51M is presented in Figure 5b. This loss is the cross-entropy between the agent's predicted distribution and the actual one. The loss drops after the first epoch, when the target network's weights are updated. After this, the loss starts to fluctuate and slowly increase. We can observe a stable training process in this plot.</p>
        <p>Rewards obtained by the agents during training can be seen in Figure 5c. We expect these rewards to slowly but surely increase during training. It can also be seen that each agent has played a different number of episodes, because in our experiments we only set the number of epochs and the number of training steps per epoch. It can be seen that each agent, except DQN, has a similar reward growth rate, and the best one in terms of growth speed is DRQN. This is also a good indicator of the learning process for the Basic scenario.
(Figure 5: (a) TD loss for all agents except C51M; (b) TD loss for the C51M agent; (c) rewards per training episodes; (d) rewards per testing episodes.)</p>
        <p>We test all agents at the end of each epoch. In the test setting, we turn dropout off and set epsilon to its minimal value of 0.01. We play 100 episodes with each agent and calculate the mean non-shaped reward per episode. The results can be seen in Figure 5d.</p>
        <p>From these figures, it is clear that C51M has much faster learning speed, and better stability as well, compared to the other agents. But the best policy belongs to the DRQN agent. The reason for this may be the simplicity of the scenario and the dropout in the combined agents. D4RQN converged to approximately the same policy as C51M, but trained much slower, even slower than DQN, and is the slowest agent among them all. It converged to a rather bad policy and is not capable of playing this scenario well given this number of epochs.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Defend The Center</title>
        <p>For the Defend The Center scenario, the TD loss is presented in Figures 6a and 6b. We can see similar behaviour for the DRQN and D4RQN agents: they both have peaks when the target network's weights are updated, small variance, and a stable mean TD loss. But for the DQN agent, the loss has high variance and behaves unstably. It rapidly increases in the first half of training and starts slowly decreasing in the second half. We can assume that DQN performance is not as good as that of the other models. The loss for C51M has huge variance, but decreases steadily. We may assume that the model has not converged yet, but it should perform well when trained longer.</p>
        <p>Rewards obtained by the agents during training may be seen in Figure 6c. We can see that the DQN and DRQN models trained similarly, but DQN ended up with a slightly lower score. D4RQN and C51M trained much faster and converged to significantly better policies, with C51M doing so faster and scoring more. It is also noticeable that both agents had not stopped improving their score by the end of training and could potentially be trained further.
(Figure 6: (a) TD loss for all agents except C51M; (b) TD loss for the C51M agent; (c) rewards per training episodes; (d) rewards per testing episodes.)</p>
        <p>Rewards obtained during testing after each epoch are shown in Figure 6d. These plots confirm our hypothesis about agent performance from the previous paragraph. The reward on this plot is the mean total reward without shaping over 100 episodes. For each episode, the reward is equal to (number of kills − number of deaths), which is just (number of kills − 1). So, the mean number of kills can be inferred by adding one to each plot point.</p>
        <p>
          C51M came out ahead of D4RQN by 5 kills on average and scores roughly 16 kills. D4RQN obtains 11 kills, which is greater by 2 than DRQN and by 4 than DQN at the end of training. So, our worst model scores 6 kills on average, whereas the best model in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] scores 5, which seems quite strange and may be explained by their model converging to a bad policy, or by the alternate image descriptors used for the convolutional layer representations.
        </p>
        <p>We showed that in the Defend The Center scenario C51M was far superior. It trains faster and converges to a 50% better policy in reward terms. We noticed that the agents (except C51M) do not hesitate to shoot rapidly (wasting ammo). The DQN agent learns just to turn in one direction and fire simultaneously; DRQN has better aiming skill, but its behaviour is pretty much the same. D4RQN aims accurately but still misses a lot.</p>
        <p>Only C51M learns that saving ammunition is vital to survival and that missing is bad. It starts the game by turning around, looking for enemies and shooting them down, but after some time it lets enemies approach closer, because it is easier to aim and shoot at a near enemy than at one far away. When the ammo count drops to small values, around 5, the agent starts to wait to be hit. After the agent is hit, the screen flashes red, and the agent starts to search for the attacker; when it finds him, it eliminates the attacker with one shot.</p>
        <p>This behavior is remarkable because it is similar to a human's: players usually start playing aggressively, but when the ammo count becomes low, they start to think about safety and try to survive as long as possible. Another reason for our admiration is that we did not use reward shaping: the only signal the agent receives is +1 for a kill and -1 for being killed. No additional information about ammo or health is used. It is clearly a challenging task for an RL agent to learn such behaviour without additional information.
(Figure 7: (a) TD loss for all agents except C51M; (b) TD loss for the C51M agent; (c) rewards per training episodes; (d) rewards per testing episodes.)</p>
        <p>
The results for Health Gathering are used from our paper [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>The TD loss for all the agents can be seen in Figures 7a and 7b. Again, all agents except DQN show a stable learning process. Rewards obtained during training can be observed in Figure 7c (the plot is constructed on a relative time scale). D4RQN obtains the maximum score in more than half of the training episodes, but the other agents do not show stable performance. Test rewards are presented in Figure 7d. C51M has lower performance, while D4RQN has the highest. Unexpectedly, D4RQN obtained a much higher reward during training than during testing with dropout turned off. We further study the agents' performance with and without dropout at testing time. DQN does not detect health packs as well as D4RQN and may just move into a wall and die. DRQN plays pretty well, but the agent's behaviour was not interpretable in terms of any reasonable human behaviour. C51M is good at detecting health packs, but it scores lower than D4RQN and DRQN do, because it tries to wait several frames before picking up a health pack.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Summary results</title>
        <p>
          Since the Health Gathering scenario is forced to end after 2100 steps
during training, we modified the test for this scenario by setting the episode
length to 10000, and we call it Health Gathering Expanded. We add this version
of the scenario to Table 2, following our previous paper [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
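As a sketch, this change amounts to overriding the episode timeout in the scenario configuration; the fragment below follows the VizDoom `.cfg` format, and the surrounding layout is an assumption on our part:

```ini
# Fragment of a VizDoom scenario config (assumed layout).
# Training (default Health Gathering) ends each episode after:
episode_timeout = 2100

# The Health Gathering Expanded test variant instead sets:
# episode_timeout = 10000
```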
        <p>For the Defend The Center scenario, the reward equals the number of kills
minus one as a death penalty. In Table 2 we report the raw number of kills
obtained by each agent.</p>
        <p>From Table 2 we can observe that dropout at testing time may increase an
agent's performance (Health Gathering scenario for both D4RQN and C51M),
decrease it (Defend The Center for C51M, Basic for both D4RQN and C51M),
or have no effect (Defend The Center, D4RQN).</p>
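A minimal sketch of what "dropout at testing time" means, using inverted dropout on a plain Python list; the function and variable names are ours for illustration, not the agents' actual implementation:

```python
import random

def dropout(values, p, rng, train=True):
    """Inverted dropout: zero each value with probability p, rescale the rest.

    With train=False the pass is deterministic; with train=True the network's
    activations (and hence its greedy action) become stochastic, which is why
    keeping dropout on at test time can change an agent's score either way.
    """
    if not train or p == 0.0:
        return list(values)          # dropout disabled: deterministic pass
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
deterministic = dropout(activations, 0.5, rng, train=False)  # values unchanged
stochastic = dropout(activations, 0.5, rng, train=True)      # randomly masked
```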
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>
        In this work, we extend results of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] on two additional scenarios while
studying new model-free deep reinforcement learning agents in POMDP settings for
3D rst-person shooter. The presented agents drastically outperformed
baseline methods, such as DQN and DRQN. Our agent successfully learned how to
play several scenarios in VizDoom environment and show human-like behaviour.
This agent can be used like backbone architecture for more challenging task,
like Deathmatch scenario, which is exactly our plan for future work. Moreover,
our agent could be easily combined with Action-speci c DRQN [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], Boltzmann
exploration [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Prioritized Experience Replay [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and can be modi ed to use
in-game features as well as separate networks for action and navigation to
improve further.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>
        This work extends the results of our previous study presented at the
MMEDIA international conference in 2019 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Akimov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makarov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Deep reinforcement learning in vizdoom rst-person shooter for health gathering scenario</article-title>
          .
          <source>In: MMEDIA</source>
          . pp.
          <volume>1</volume>
          -
          <issue>6</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bellemare</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dabney</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Munos</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>A distributional perspective on reinforcement learning</article-title>
          .
          <source>arXiv:1707.06887</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bellemare</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naddaf</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowling</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The arcade learning environment: An evaluation platform for general agents</article-title>
          .
          <source>In: Proceedings of the 24th International Conference on Artificial Intelligence</source>
          . pp.
          <volume>4148</volume>
          -
          <fpage>4152</fpage>
          . IJCAI'15, AAAI Press (
          <year>2015</year>
          ), http://dl.acm.org/citation.cfm?id=2832747.2832830
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hausknecht</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stone</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>Deep recurrent q-learning for partially observable mdps</article-title>
          .
          <source>CoRR abs/1507.06527</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hessel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Modayil</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Hasselt</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaul</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostrovski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dabney</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horgan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piot</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Rainbow: Combining improvements in deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1710.02298</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <volume>1735</volume>
          -
          <fpage>1780</fpage>
          (
          <year>1997</year>
          ). https://doi.org/10.1162/neco.1997.9.8.1735
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kempka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wydmuch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Runc</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toczek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaskowski</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Vizdoom: A doom-based ai research platform for visual reinforcement learning</article-title>
          .
          <source>In: CIG'16</source>
          . pp.
          <volume>1</volume>
          -
          <issue>8</issue>
          .
          IEEE
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaplot</surname>
            ,
            <given-names>D.S.:</given-names>
          </string-name>
          <article-title>Playing fps games with deep reinforcement learning</article-title>
          .
          <source>In: AAAI</source>
          . pp.
          <volume>2140</volume>
          -
          <issue>2146</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Makarov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kashin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Korinevskaya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Learning to play pong video game via deep reinforcement learning</article-title>
          .
          <source>CEUR WP</source>
          pp.
          <volume>1</volume>
          -
          <issue>6</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellemare</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fidjeland</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostrovski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>518</volume>
          (
          <issue>7540</issue>
          ),
          <volume>529</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Schaul</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antonoglou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Prioritized experience replay</article-title>
          .
          <source>arXiv preprint arXiv:1511.05952</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Schulze</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schulze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Vizdoom: Drqn with prioritized experience replay, double-q learning, &amp; snapshot ensembling</article-title>
          .
          <source>arXiv preprint arXiv:1801.01000</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          :
          <article-title>Learning to predict by the methods of temporal differences</article-title>
          .
          <source>Machine learning 3(1)</source>
          ,
          <volume>9</volume>
          -
          <fpage>44</fpage>
          (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          :
          <article-title>Reinforcement learning: An introduction</article-title>
          ,
          <source>vol. 1</source>
          . MIT press Cambridge (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Van Hasselt</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Deep reinforcement learning with double q-learning</article-title>
          .
          <source>In: AAAI</source>
          . vol.
          <volume>16</volume>
          , pp. 2094-2100
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaul</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hessel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Hasselt</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lanctot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Freitas</surname>
          </string-name>
          , N.:
          <article-title>Dueling network architectures for deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1511.06581</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Watkins</surname>
            ,
            <given-names>C.J.C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dayan</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>Q-learning</article-title>
          .
          <source>Machine Learning</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ),
          279-292 (May
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poupart</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miao</surname>
          </string-name>
          , G.:
          <article-title>On improving deep reinforcement learning for pomdps</article-title>
          .
          <source>arXiv preprint arXiv:1804.06309</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>