<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reward Shaping for Deep Reinforcement Learning in VizDoom</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vitalii Sopov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilya Makarov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence Research Institute (AIRI)</institution>
          ,
<addr-line>Nizhny Susalny lane 5 p. 19, 105064 Moscow, Russian Federation</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HSE University</institution>
          ,
<addr-line>Pokrovsky boulevard 11, 109028 Moscow, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <fpage>157</fpage>
      <lpage>169</lpage>
      <abstract>
<p>Reward shaping helps reinforcement learning agents succeed in challenging tasks when environmental rewards are sparse or delayed. In this work we propose an approach which combines information from the game screen with additional information about in-game events to produce an estimate of the novelty of the visited states and used behaviors. We use this estimate to motivate the agent to seek novel experiences and show that our method accelerates learning and leads to better and more robust strategies in complex VizDoom scenarios.</p>
      </abstract>
      <kwd-group>
<kwd>Deep reinforcement learning</kwd>
        <kwd>VizDoom</kwd>
        <kwd>Reward shaping</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Reinforcement learning (RL) in environments with sparse or delayed rewards has traditionally
been considered a difficult task. Traditional algorithms such as DQN, DDQN, A2C and
others can rarely learn fast enough or converge to satisfying strategies without
any reward shaping, i.e., modification of the reward function to make it less sparse. Such
modifications are usually made manually by the creators of the algorithms and require
substantial prior knowledge about the given task. In this work we propose an algorithm
which learns a deep representation of the environment, its current state and the occurred
events to automatically shape the reward function based on the agent’s curiosity and current
performance.</p>
<p>The method we introduce in this paper, besides the regular reward from the environment,
additionally rewards the RL agent for the novelty (or rarity) of the experienced events (or
transitions, which are defined by the occurred in-game events and general observations about
the environment acquired via the game screen). The idea is to observe the events,
behaviors and states the agent uses and visits, and incite it to explore new states and behaviors.
Thus, we make the agent balance between the rewards it receives from the environment
and curiosity about yet unknown and risky, but potentially rewarding, strategies.</p>
<p>While the method is general enough to be used in any environment and applied to any
reward-based RL algorithm, in this paper we study the application of our method to the A2C
algorithm in the VizDoom environment.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Deep Reinforcement Learning</title>
<p>
          Reinforcement learning, while studied and analyzed since the 20th century [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], has recently gained much more traction due to rapid advancements in hardware and the
development of neural networks. Incorporating the capabilities of neural networks into existing
RL algorithms such as Q-learning [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] allowed new algorithms to estimate and model considerably more complex
environments such as Atari games [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], shooters such as Doom, and even modern games such
as Dota 2 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and Starcraft 2 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The main goal of these algorithms is to estimate the game state
and conduct agent strategies based only on the observable space, which is the game screen seen
by human players. Convolutional neural networks are used by most agents to build powerful
embeddings of the screen space.
        </p>
      </sec>
      <sec id="sec-2-2">
<title>2.2. Actor-Critic Methods</title>
<p>
          One of the largest families of algorithms in reinforcement learning is actor-critic
algorithms [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Such agents have two parts: one predicts the action that
the agent must take, and the other estimates the value of the action the agent took. Such
methods combine the simplicity of direct policy optimization methods such as Policy Gradient with the robustness
of value-estimating methods such as Q-learning.
        </p>
        <p>
          A way to greatly enhance the performance of such methods is to train multiple agents
simultaneously and asynchronously, collecting agents’ transitions and periodically updating
the network weights with them. It leads to much faster training and considerably better final
performance. This method is called A3C (Asynchronous Advantage Actor-Critic) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
<p>
          At the same time, some researchers consider that asynchrony does not in
fact contribute to the performance of the algorithm. The OpenAI implementation [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] gets rid of
it by applying synchronous network updates and stepping all agents synchronously. Their
evaluation shows that this method is both more cost-effective and results in better agent
performance. The resulting algorithm is called A2C. It will be used as the base RL algorithm in
this work and also as a baseline during evaluation.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. VizDoom Environment</title>
<p>
          It is important to have proper simulation for complex first-person shooters [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref8 ref9">8, 9, 10, 11, 12, 13</xref>
          ]
or 2D interaction games [
          <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
          ]. VizDoom [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] is an AI research platform based on an
open-source implementation of the original Doom game. It provides an API for running games,
acquiring both visual frame information and internal game variables, and controlling the in-game
agent in a flexible and functional way. It also defines the means to create custom scenarios and
provides a varied set of default scenarios. They all differ in difficulty, with some of them
featuring sparse and delayed rewards.
        </p>
<p>
          The code of this work is based on an open-source implementation of OpenAI’s A2C algorithm
made by p-kar [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The implementation was already adapted for the VizDoom environment
and is thus used as the baseline and a starting point for our own work.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Curiosity, Intrinsic Motivation and Novelty Search</title>
<p>Recent advancements in the field were made partially due to the use of intrinsic motivation,
rewarding the agent for exhibiting new behaviors and reaching new states. Making agents
curious and more keen to explore new behaviors and states considerably improved performance
in many environments, in particular ones with sparse and delayed rewards [<xref ref-type="bibr" rid="ref19">19</xref>, <xref ref-type="bibr" rid="ref20">20</xref>, <xref ref-type="bibr" rid="ref21">21</xref>, <xref ref-type="bibr" rid="ref22">22</xref>].</p>
<p>One approach to modelling curiosity is to approximate the frequency of visiting
specific states. Since for complex environments such counts are close to zero in most cases
and are also computationally hard to store and compute, attempts were made to
approximate the state density function with a pseudocount [<xref ref-type="bibr" rid="ref23">23</xref>] and then to simplify and
approximate it with neural networks [<xref ref-type="bibr" rid="ref24">24</xref>].</p>
<p>Another approach is to acquire additional information from the game through
game variables such as the player’s position, number of kills, number of shots, acquired weapons,
etc. The method manually defines a set of events based on these variables, stores a history of
recent events and compares it with the current moment to approximate the novelty of the
state the agent is in, additionally rewarding the agent and thus motivating it to seek new states
and try new behaviours. This approach is called Rarity of Events (RoE) [<xref ref-type="bibr" rid="ref25">25</xref>]. That
work relies heavily on the VizDoom environment for its experiments and thus will be one of the main
methods we compare our approach with.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
<p>The tasks considered in this work are several well-known scenarios for VizDoom: Health
Gathering Supreme, My Way Home, Deadly Corridor and Deathmatch. All of them are characterized
by sparse rewards.</p>
<p>Health Gathering Supreme is a scenario where the player finds itself in a maze filled with
healthkits. The agent must collect these kits, otherwise it will die because of the poisonous
floor, which deals damage every second. The goal is to survive as long as possible.</p>
<p>My Way Home is a scenario where the agent must find its way to the armor set located
somewhere in the maze. Each second the agent loses some points, so the faster it finds the
armor, the more points it saves. The agent only gains a positive reward when it reaches the
armor set.</p>
<p>Deadly Corridor is a scenario where the agent must pass through a straight corridor with
enemies spawning on the sides. To pass the corridor the agent must first defeat all enemies,
otherwise it is impossible to reach the target.</p>
<p>Deathmatch is a complex map which consists of a square main area and four rooms filled
with healthkits, armor, ammo and various weapons. Various types of enemies periodically
spawn at random points of the main area. The agent gets points for defeating enemies. The goal
is to get as many kills as possible before the agent is killed.</p>
<p>As a representation of each state, the greyscale 84×84 image of the screen is used. The user
interface and the crosshair are not rendered. The display of the player’s weapon is turned on
for the Deathmatch scenario and turned off for all the others. For each game frame we also have
access to some basic game variables, such as the player’s coordinates, the selected weapon and the kill
count for each weapon, which are then used to compute the reward function. These
variables are not directly used by the RL agent.</p>
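      <p>For illustration, a minimal sketch of this setup via the VizDoom Python API follows; the exact configuration values here are our assumptions rather than the precise settings used in the experiments.</p>
      <preformat>
import numpy as np
import vizdoom as vzd
from skimage.transform import resize

game = vzd.DoomGame()
game.load_config("deathmatch.cfg")                 # one of the standard scenario configs
game.set_screen_format(vzd.ScreenFormat.GRAY8)     # greyscale frames
game.set_render_hud(False)                         # no user interface
game.set_render_crosshair(False)                   # no crosshair
game.set_render_weapon(True)                       # on for Deathmatch, off elsewhere
game.add_available_game_variable(vzd.GameVariable.POSITION_X)
game.add_available_game_variable(vzd.GameVariable.KILLCOUNT)
game.init()

state = game.get_state()
frame = resize(state.screen_buffer, (84, 84))      # 84x84 network input
variables = np.array(state.game_variables)         # used only for reward shaping
      </preformat>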
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Approach</title>
<p>In this work we do not aim to enhance, tune or in any way modify the A2C algorithm which
is used in our experiments. Instead, our goal is to enhance the environment’s reward function
in a robust and general way, so that with minimal human supervision (basically consisting
of hyperparameter tuning) this enhancement alone would allow the agent to train faster and
converge to better strategies which were initially unreachable.</p>
<p>In this work we try to predict the novelty of the occurred transition. By a transition
we here mean the occurrence of manually-defined events (such as player movement, kills
by any weapon, etc.) combined with either the current frame (its visual representation) or the
difference between the current and the previous frames (both alternatives are considered and
later tested). If we define some game-screen state as s, the agent’s action as a, the new state
reached after the action as s′, and the set of in-game events that occurred as a result of the action
as e = {movement, medkit pickup, kill} (as an example), then the previously mentioned transition is
stored as (s, s′, e). To predict the novelty, we maintain a history buffer of some length holding all
recent transitions (sampled with some probability).</p>
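      <p>To make the bookkeeping concrete, a minimal Python sketch of the transition record and the probabilistic insertion into the history buffers is given below; the names and the buffer layout are illustrative assumptions, not the exact implementation.</p>
      <preformat>
# Transition record and split history buffers: a minimal, hypothetical sketch.
import random
from collections import namedtuple

# (s, s', e): screen state, next screen state, in-game event vector
Transition = namedtuple("Transition", ["state", "next_state", "events"])

NUM_BUFFERS = 5                  # one buffer per classifier head (see Section 5)
buffers = [[] for _ in range(NUM_BUFFERS)]

def record(state, next_state, events, p_store=0.1):
    """Store the transition in one randomly chosen buffer, with probability p_store."""
    if random.random() &gt; p_store:
        return
    random.choice(buffers).append(Transition(state, next_state, events))
      </preformat>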
<p>As previously mentioned, existing ways to estimate transition novelty include
estimating state density functions with neural networks, implementing pseudocount
and deep pseudocount functions to count the number of visits of each state, and counting occurred
events. In this work we instead estimate the transition novelty by simply training a binary
classifier which predicts the probability that a given transition occurred in the recent past.
We intentionally limit the maximal age of the recorded and stored transitions, so that transitions
which occurred only in the distant past are again treated as novel.</p>
<p>The main challenge in this approach is data acquisition. It is easy to get positive samples
(when a transition has occurred), but it is not quite clear how to get not-yet-visited transitions.
Treating a part of the acquired data as unvisited means that the classifier will be trained to
classify many actually seen transitions as unseen, which is intentionally misleading. Getting
negative samples outside of the training process, or generating them from some random
distribution with some degree of reliability and similarity to real data, does not seem
feasible either.</p>
<p>So, we split the acquired transition data into several buffers and train a separate classifier
using its corresponding buffer as positive samples and all other buffers as negative samples. Thus,
each classifier is trained to distinguish its own transitions from the transitions in other buffers.
The main assumption here is that if a transition has occurred recently, then it is contained in
at least one buffer, and at least one classifier (ideally exactly one) classifies this transition as
visited. But if the transition has never occurred, then all classifiers ideally view it as unvisited.
More formally, if we train N classifiers and for some transition tr estimate the probabilities
that each classifier has seen the transition as p_1(tr), p_2(tr), . . . , p_N(tr), then we estimate the
transition novelty as novelty(tr) = 1 − max_i p_i(tr).</p>
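      <p>Assuming PyTorch as the framework and the multi-head novelty network described in Section 4.2, the novelty estimate reduces to a few lines; the sketch below is illustrative rather than the exact implementation.</p>
      <preformat>
import torch

def curiosity_bonus(novelty_net, screen, events, beta):
    """novelty(tr) = 1 - max_i p_i(tr), scaled by the bonus coefficient beta."""
    with torch.no_grad():
        p_seen = novelty_net(screen, events)      # shape (1, N): one probability per head
    novelty = 1.0 - p_seen.max(dim=1).values      # high when no head recognizes tr
    return beta * novelty.item()
      </preformat>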
<p>The initial idea was to train N identical novelty networks, where we maintain N separate
non-intersecting history buffers and train each network to discriminate between its own transitions
and the transitions of all the other networks using a binary cross-entropy loss. The ”curiosity
bonus” is then defined as b(tr) = β · novelty(tr), where β is a parameter tuned to
make the bonus comparable to the environment reward. The new reward function is then the
sum of the reward provided by the environment and our bonus.</p>
<p>However, the general performance and training speed in this case were unsatisfying. So,
instead of N networks, a single network with N heads is trained. Where the last layer of each
network originally output a single value, the network now outputs an N-dimensional vector, where N
equals the number of history buffers we maintain. This both better incorporates the
information from all buffers into a single network, generating a better embedding of the game
state, and also lowers training and inference time, since computations for most layers
are made only once instead of N times.</p>
      <sec id="sec-4-1">
        <title>4.1. A2C Agent</title>
        <p>In this work we use A2C (the synchronous modification of A3C, Asynchronous Advantage
Actor-Critic) as the reinforcement learning algorithm to which we add our reward function, and also
as a baseline for comparison purposes.</p>
<p>The agent’s neural network consists of 3 convolutional layers followed by two dense layers.
The convolutional layers have kernels of size 8 × 8, 4 × 4 and 3 × 3, strides of 4, 2
and 1, and produce 32, 64 and 32 feature maps respectively. The composition of the
convolutional layers produces 32 feature maps of size 7 × 7. After that comes the common
fully-connected layer, whose flattened input size is 1568 units and whose output size is 512
units. This common layer has two dense descendants: one corresponds to the Actor part and
outputs a softmax-adjusted vector whose size is the number of possible actions for the agent
(this number differs for each scenario); the second descendant corresponds to the Critic part
and outputs a one-dimensional value.</p>
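        <p>A sketch of this network in PyTorch is given below; the framework choice and exact wiring are our assumptions, while the layer sizes follow the description above.</p>
        <preformat>
import torch
import torch.nn as nn

class A2CNet(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),   # 32 feature maps, 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 64 feature maps, 9x9
            nn.Conv2d(64, 32, kernel_size=3, stride=1), nn.ReLU(),  # 32 feature maps, 7x7
            nn.Flatten(),                                           # 32 * 7 * 7 = 1568 units
        )
        self.fc = nn.Sequential(nn.Linear(1568, 512), nn.ReLU())    # common dense layer
        self.actor = nn.Linear(512, num_actions)                    # per-action logits
        self.critic = nn.Linear(512, 1)                             # state-value estimate

    def forward(self, screen):                                      # screen: (B, 1, 84, 84)
        h = self.fc(self.conv(screen))
        return torch.softmax(self.actor(h), dim=-1), self.critic(h)
        </preformat>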
      <sec id="sec-4-1">
        <title>4.2. Novelty Network</title>
<p>The network for estimating the reward function consists of a convolutional part and a
fully-connected part. The convolutional part is the same as in the A2C agent. The output of this part
is then flattened and concatenated with the events vector.</p>
<p>The events vector corresponds to the events which occurred either in the
current frame or in the current episode up to this frame (we describe this later when we list the
modifications of the proposed algorithm). Its elements are repeated several times (currently 15
times) to have more influence on the resulting vector.</p>
<p>The resulting vector is then fed into a sequence of two fully-connected layers that output
512- and 128-dimensional vectors respectively. The final layer is also fully-connected and outputs an
N-dimensional vector, where N equals the number of ”heads”, or ”discriminators”. A sigmoid
activation function is then applied to each head independently. A visual representation of the
novelty network can be found in Figure 1.</p>
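        <p>A corresponding PyTorch sketch of the novelty network is shown below; again, this is an illustration under the same assumptions, with the convolutional trunk mirroring the A2C agent and the events vector tiled 15 times as described.</p>
        <preformat>
import torch
import torch.nn as nn

class NoveltyNet(nn.Module):
    def __init__(self, num_events, num_heads=5, repeats=15):
        super().__init__()
        self.repeats = repeats
        self.conv = nn.Sequential(                       # same trunk as the A2C agent
            nn.Conv2d(1, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.Conv2d(64, 32, 3, 1), nn.ReLU(),
            nn.Flatten(),                                # 1568 units
        )
        self.fc = nn.Sequential(
            nn.Linear(1568 + num_events * repeats, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, num_heads),                   # one output per head
        )

    def forward(self, screen, events):
        tiled = events.repeat(1, self.repeats)           # repeat events for more influence
        h = torch.cat([self.conv(screen), tiled], dim=1)
        return torch.sigmoid(self.fc(h))                 # per-head probability of "seen"
        </preformat>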
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Modifications</title>
<p>Our approach comes in several modifications, most of which can be applied
independently of each other. While the basic algorithm uses both screen info and events data, these
information channels can be disabled separately to evaluate their importance, so that the novelty
network is trained only on the screen buffer or only on the game events.</p>
<p>The number of heads, while customizable, is set to 5, which turned out to show the
best performance. The history buffer size is another numerical parameter which can be tuned.
In most cases it is set to 5000 samples per each of the 5 heads and is generally bounded by the size
of GPU memory and training speed considerations.</p>
<p>A more qualitative modification is the way the game screen is fed into the novelty network.
While originally the unmodified image is used, we can alternatively store the element-wise
difference between the current state’s matrix and the matrix of the previously observed state.
While the former alternative leads to estimating the novelty of the current state, the latter
focuses more on the dynamics and represents an actual transition, which is tightly
bound to the occurred event. We will call this modification statediff.</p>
<p>Another important modification is the cumulativity of the events. Originally, each event is
taken only at the exact moment and does not include observations from the previous frames of
the episode. As an alternative, another approach is implemented, where the event vector fed into the
novelty network is in fact a normalized cumulative sum over all events which happened up to the
current moment in the given episode: ē_t = (1/t) ∑_{k≤t} e_k, where e_k is the event vector observed at frame k.</p>
<p>The idea behind this is that while the current events are ideally a better indicator of the
current state, their accumulation shows how the current state was reached and thus contributes
to a better state representation. We will call this modification episodic.</p>
<p>Other modifications mainly involve the events, their episodic normalization and acquisition.
For instance, the accumulated episodic events can be normalized either by the number of frames t
or by the number of actual events which contributed to the accumulated sum,
N_t = |{k : k ≤ t, ‖e_k‖_1 &gt; 0}|, so that the episodic vector becomes ē_t = (1/N_t) ∑_{k≤t} e_k.</p>
<p>While the former alternative better represents the density of events during
the game episode, the latter allows us to focus on the important contributing events and learn
to differentiate between them. Because this is, in a way, a fix of the averaging, we will call it
fixedaverage.</p>
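      <p>The two normalizations can be summarized in a short sketch (a hypothetical NumPy helper, where events holds one event vector per frame of the episode so far):</p>
      <preformat>
import numpy as np

def episodic_events(events, fixed_average=False):
    """events: array of shape (t, num_events); returns the normalized episodic vector."""
    total = events.sum(axis=0)
    if fixed_average:
        # fixedaverage: divide by the number of frames with at least one event
        denom = max(1, int((np.abs(events).sum(axis=1) &gt; 0).sum()))
    else:
        # episodic: divide by the number of frames so far
        denom = len(events)
    return total / denom
      </preformat>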
<p>Also, some events can be considered too frequent to contribute to the episodic sum (such as
the player’s movement, which is empirically the most frequent event and thus the least meaningful),
so in another modification of the algorithm they can be ignored and not stored in the history
buffers.</p>
<p>Finally, in another modification we test whether a smaller number of non-contributing
(empty, zero) events in the history buffer leads to better performance. While in the basic algorithm
all events are acquired with a constant probability p = 0.1, for empty events an additional
penalty factor γ is introduced, so for such events the probability to be stored in the buffer
becomes p_0 = p · γ. Such modifications will be denoted with zeroproba.</p>
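      <p>A minimal sketch of this storage rule follows; the value of the penalty factor here is an illustrative assumption.</p>
      <preformat>
import random

P_STORE = 0.1      # base storage probability
GAMMA_ZERO = 0.5   # hypothetical penalty factor for empty events

def storage_probability(events):
    """zeroproba: empty events are stored with a penalized probability."""
    if any(events):
        return P_STORE
    return P_STORE * GAMMA_ZERO

def maybe_store(buffer, transition, events):
    if random.random() &gt; storage_probability(events):
        return
    buffer.append(transition)
      </preformat>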
    </sec>
    <sec id="sec-6">
      <title>6. Training</title>
<p>Conceptually, the main training loop of the original A2C algorithm is not changed. The addition
we make is that after each A2C model update we also feed all history buffers to the novelty
network and compute a binary cross-entropy loss for each head. The losses are then summed
and minimized using the same optimizer as for the original A2C network, which is RMSProp.</p>
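      <p>A hedged sketch of this extra update step is shown below, assuming PyTorch and the buffer layout from Section 4: the i-th buffer provides positive samples for head i, while all other buffers provide its negatives.</p>
      <preformat>
import torch
import torch.nn as nn

bce = nn.BCELoss()

def novelty_update(novelty_net, buffers, optimizer):
    loss = 0.0
    for i, buf in enumerate(buffers):
        screens = torch.stack([tr.state for tr in buf])
        events = torch.stack([tr.events for tr in buf])
        preds = novelty_net(screens, events)          # shape (len(buf), N)
        targets = torch.zeros_like(preds)
        targets[:, i] = 1.0                           # head i should claim its own buffer
        loss = loss + bce(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # same RMSProp as the A2C update
      </preformat>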
<p>
        The duration of training is measured in game frames played by the agent. Since we
use A2C and actually run 16 agents simultaneously, we sum up the frames of all agents. The
number of training frames varies per scenario. For ”My Way Home” we train agents for 4
million frames; for the ”Health Gathering” scenarios the agents are trained for 3 million frames.
For Deathmatch we use 14 million frames for the comparison of different modifications of our
algorithm and 25 million frames for the comparison of our best modification with the baseline
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and the RoE algorithm [<xref ref-type="bibr" rid="ref25">25</xref>].
      </p>
<p>One important note here is that for some scenarios these numbers are actually not enough
to fully converge to the best possible strategy. For example, in the previously
mentioned RoE method [<xref ref-type="bibr" rid="ref25">25</xref>] the agent was trained for 70 million
frames on the ”Deathmatch” scenario and for 10 million frames on the other scenarios. Unfortunately,
such durations would be unaffordable for our budget both in terms of time and money, so we had to
lower the training duration for the results to remain both affordable and still useful. Another concession
we had to make was the number of experiments. Most experiments were run once or twice,
so no recorded information on the variance can be given. On the other hand, the
conceptual results generally tended to coincide across all runs, so our results are good enough to compare
different algorithms and their modifications with a good degree of reliability.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Metrics</title>
<p>The main metrics for evaluation are the mean reward and the maximal reward. Periodically (after every
Nth network update), for every worker (out of 16) the reward of the last finished episode is
taken. After that, either the maximal or the mean value among all workers is computed.
Since the values are very noisy, exponential smoothing with weight 0.95 is applied for
evaluation and plotting.</p>
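      <p>For reference, the smoothing can be written as the standard exponential recurrence; the short sketch below is a straightforward Python rendering.</p>
      <preformat>
def smooth(values, weight=0.95):
    """Standard exponential smoothing used only for evaluation plots."""
    smoothed, prev = [], values[0]
    for v in values:
        prev = weight * prev + (1.0 - weight) * v
        smoothed.append(prev)
    return smoothed
      </preformat>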
    </sec>
    <sec id="sec-8">
      <title>8. Experiment Evaluation</title>
<p>Since we are evaluating a reinforcement learning algorithm, no specific evaluation separate
from the training process was applied, and the rewards acquired during training were evaluated.
Mainly the maximal values and their speed of growth are considered.</p>
<p>Also, most experiments are run on the ”Deathmatch” scenario, and the algorithm with the best
parameters is later evaluated on the other scenarios: ”My Way Home”, ”Health Gathering
Supreme” and ”Deadly Corridor”. The only modified parameter for these scenarios is the
bonus coefficient, which determines the weight with which the novelty bonus is added to the
environmental reward and which is manually tuned to be approximately equal to the reward
at the beginning of the agent’s training.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Results</title>
<p>In the Deathmatch scenario, in the short term some modifications manage to beat the baseline A2C
algorithm but do not reach the performance of the Rarity of Events method. Among
all modifications, the best one uses episodically accumulated events normalized
over non-empty elements and uses the difference between the current screen state and the previous
observation. This modification is used for the longer training session on Deathmatch and
for all other scenarios. The results of the Deathmatch experiments are shown in Table 1 and Figure
2. The result of the longer training session on Deathmatch is shown in Figure 3.</p>
<p>While the experiments on the Deathmatch scenario are not very satisfying, the results on other scenarios
show much more promise. In the ”My Way Home” scenario, while the baseline A2C agent learns
well and quickly enough to make additional intrinsic rewards unnecessary, our
method still leads to faster learning, even if it is not strictly needed in this case. At the same
time, the RoE algorithm shows a considerable decline in performance, most likely
connected to the absence of extrinsic reward. The results are shown in Figure 4.</p>
<p>For the ”Health Gathering Supreme” scenario, while at the beginning being only marginally
better than the baseline and considerably worse than the RoE method, our algorithm becomes the
best by the end of the training session. And while the advantage appears only at the end and is
not large, the dynamics of the mean reward for the RoE method suggest that its further rise is
unlikely, so the gap will probably only grow. The results are shown in Figure 5.</p>
<p>In ”Deadly Corridor” our method shows the best results. In only 1 million frames it converges
to the maximal reward for this scenario and continues to steadily raise its mean reward, while
A2C consistently shows poor scores. The maximal reward of the RoE method here lies roughly in the
middle between the two other methods, while its mean reward is only marginally better than the
A2C baseline and shows no positive trend. The results are shown in Figure 6.</p>
    </sec>
    <sec id="sec-10">
      <title>10. Discussion</title>
      <p>Our approach performs reasonably well in several challenging VizDoom scenarios. Still, it
should be tested in other scenarios and in conjunction with other reinforcement learning
algorithms to assess its generality and usefulness. Besides, more thorough testing should be
done, with longer training sessions and more runs per setup, to better understand the
dispersion and general stability of the results as well as our algorithm’s behavior in the later
stages of training (this, however, requires significantly larger funding).</p>
<p>Another point to explore is the normalization factor which we used to equalize
the environment’s extrinsic reward and our intrinsic bonus. In this work this coefficient was
manually tuned for each scenario to make the bonus roughly comparable to the reward. However,
no thorough optimization took place to determine the best possible values. This, as well as an
automatic search for the best value of this coefficient, could be explored in later work.</p>
<p>In parallel, a related topic could be researched. In the RoE algorithm [<xref ref-type="bibr" rid="ref25">25</xref>] the extrinsic reward was
completely ignored, which resulted in some interesting new behavior patterns and overall
significant improvements over the baseline. Since our work operates in the same field and utilizes
the same concepts, this would be an interesting area for future research.</p>
<p>Speaking of general performance, our method showed some marginal improvements in
simple scenarios such as ”My Way Home”, which were not very challenging for the chosen
baseline. The improvements in more complex scenarios such as ”Health Gathering Supreme” and
”Deadly Corridor” were much more notable. However, for arguably the most complex scenario,
”Deathmatch”, our approach, while better than the baseline, was significantly worse than the
RoE method. An interesting hypothesis to test is whether this is caused by insufficient quality
of our novelty estimation network. While the general concept showed its potential, further
research is needed to determine the best neural architecture and further enhance its learning
process.</p>
    </sec>
    <sec id="sec-11">
      <title>11. Conclusion</title>
      <p>In this work we introduced a novelty-based reinforcement learning approach which shapes
the reward function and incites the RL agent to explore new behaviors. In conjunction with the
A2C algorithm, our method achieved significantly better performance than the baseline A2C
algorithm and performance comparable (and sometimes superior) to RoE, another novelty-based
algorithm. In the future, the proposed method could be incorporated into other reward-based
reinforcement learning algorithms and reused in other challenging environments with sparse or
delayed rewards, for which converging to satisfying strategies without additional intrinsic
motivation is unfeasible.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Watkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dayan</surname>
          </string-name>
          ,
          <article-title>Q-learning</article-title>
          ,
          <source>Machine learning 8</source>
          (
          <year>1992</year>
          )
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          , et al.,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          , nature
          <volume>518</volume>
          (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Brockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dębiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dennison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Farhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hashme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          , et al.,
          <article-title>Dota 2 with large scale deep reinforcement learning</article-title>
          , arXiv preprint arXiv:
          <year>1912</year>
          .
          <volume>06680</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , I. Babuschkin,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Czarnecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dudzik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ewalds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Georgiev</surname>
          </string-name>
          , et al.,
          <article-title>Grandmaster level in starcraft ii using multi-agent reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>575</volume>
          (
          <year>2019</year>
          )
          <fpage>350</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Konda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Tsitsiklis</surname>
          </string-name>
          ,
<article-title>On actor-critic algorithms</article-title>
          ,
          <source>SIAM journal on Control and Optimization</source>
          <volume>42</volume>
          (
          <year>2003</year>
          )
          <fpage>1143</fpage>
          -
          <lpage>1166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Badia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <article-title>Asynchronous methods for deep reinforcement learning</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1928</fpage>
          -
          <lpage>1937</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mansimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
, Openai baselines: Acktr &amp; a2c, url: https://openai.com/blog/baselines-acktr-a2c (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>I. Makarov</surname>
          </string-name>
          et al.,
          <article-title>First-person shooter game for virtual reality headset with advanced multi-agent intelligent system</article-title>
          ,
          <source>in: Proceedings of the 24th ACM international conference on Multimedia</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>735</fpage>
          -
          <lpage>736</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Makarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tokmakov</surname>
          </string-name>
          , L. Tokmakova,
          <article-title>Imitation of human behavior in 3d-shooter game</article-title>
          ,
          <source>AIST'2015 Analysis of Images, Social Networks and Texts</source>
          (
          <year>2015</year>
          )
          <fpage>64</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>I. Makarov</surname>
          </string-name>
          et al.,
<article-title>Modelling human-like behavior through reward-based approach in a first-person shooter game</article-title>
          ,
          <source>in: EEML Proceedings</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I.</given-names>
            <surname>Makarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Polyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Karpichev</surname>
          </string-name>
          ,
          <article-title>Voronoi-based path planning based on visibility and kill/death ratio tactical component</article-title>
          ,
          <source>in: Proceedings of AIST</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Makarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Polyakov</surname>
          </string-name>
          ,
          <article-title>Smoothing voronoi-based path with minimized length and visibility using composite bezier curves</article-title>
          ,
          <source>in: AIST (Supplement)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Makarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Konoplia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Polyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martynov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zyuzin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gerasimova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bodishtianu</surname>
          </string-name>
          ,
          <article-title>Adapting first-person shooter video game for playing with virtual reality headsets</article-title>
          .,
          <source>in: FLAIRS Conference</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>412</fpage>
          -
          <lpage>416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Makarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Savostyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Litvyakov</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. I. Ignatov</surname>
          </string-name>
          ,
<article-title>Predicting winning team and probabilistic ratings in “dota 2” and “counter-strike: Global offensive” video games</article-title>
          ,
          <source>in: Proceedings of AIST</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>183</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kamaldinov</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Makarov</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning in match-3 game</article-title>
          , in: Proceedings of CoG, IEEE,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kamaldinov</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Makarov</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning methods in match-3 game</article-title>
          , in
          <source>: Proceedings of AIST</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kempka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wydmuch</surname>
          </string-name>
          , G. Runc,
          <string-name>
            <given-names>J.</given-names>
            <surname>Toczek</surname>
          </string-name>
          , W. Jaśkowski,
          <article-title>Vizdoom: A doom-based ai research platform for visual reinforcement learning</article-title>
          ,
          <source>in: 2016 IEEE Conference on Computational Intelligence and Games (CIG)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <article-title>A2c, acktr and a2t implementations for vizdoom</article-title>
          ,
          <year>2017</year>
. URL: https://github.com/p-kar/a2c-acktr-vizdoom.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] I. Makarov, A. Kashin, A. Korinevskaya, Learning to play pong video game via deep reinforcement learning, in: AIST (Supplement), 2017, pp. 236–241.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] D. Akimov, I. Makarov, Deep reinforcement learning with vizdoom first-person shooter, in: CEUR Workshop Proceedings, volume 2479, 2019, pp. 3–17.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] M. Bakhanova, I. Makarov, Deep reinforcement learning in vizdoom via dqn and actor-critic agents, in: Proceedings of IWANN, Springer, 2021, pp. 138–150.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] A. Zakharenkov, I. Makarov, Deep reinforcement learning with dqn vs. ppo in vizdoom, in: Proceedings of CINTI’21, IEEE, 2021, pp. 1–6.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, R. Munos, Unifying count-based exploration and intrinsic motivation, Advances in neural information processing systems 29 (2016) 1471–1479.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] G. Ostrovski, M. G. Bellemare, A. Oord, R. Munos, Count-based exploration with neural density models, in: International conference on machine learning, PMLR, 2017, pp. 2721–2730.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] N. Justesen, S. Risi, Automated curriculum learning by rewarding temporally rare events, in: 2018 IEEE Conference on Computational Intelligence and Games (CIG), IEEE, 2018, pp. 1–8.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>