<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using artificial neural networks in reinforcement learning algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Glova</string-name>
          <email>martin.glova@upjs.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriela Andrejková</string-name>
          <email>gabriela.andrejkova@upjs.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, Faculty of Science, P. J. Šafárik University in Košice, Jesenná 5</institution>
          ,
          <addr-line>04001 Košice</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Reinforcement learning algorithms represent a specific category in the machine learning field, precisely because of their unique approach based on trial and error. We introduce a new approach into the Q-learning algorithm, namely the maximizing k-future rewards policy, which decreases learning time and significantly increases the maximal and average score values of an optimized function. Our modified Deep NNQ-learning, using feed-forward networks instead of the originally proposed convolutional neural networks, gives the best results on the tested problem. We implemented the developed algorithm for the Flappy Bird game, where significant improvements are achieved by an appropriate setting of the state space and act policy, and by pre-training the neural networks.</p>
      </abstract>
      <kwd-group>
        <kwd>Artificial intelligence</kwd>
        <kwd>Reinforcement learning</kwd>
        <kwd>Q-learning</kwd>
        <kwd>Deep Q-learning</kwd>
        <kwd>Flappy Bird game</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Reinforcement learning is a learning method that determines which action will maximize the agent's reward in a state derived from an environment. An introduction to the theory of reinforcement learning can be found in the book by Sutton &amp; Barto [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The goal of reinforcement learning algorithms is to build a policy by learning from past good action sequences [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Two basic properties are significant for this class of algorithms: the trial-and-error method and the possibly delayed reward.
      </p>
      <p>
        In the developed algorithm, we focus mainly on the use of neural networks in a reinforcement learning approach, since it appears to be a method with the potential to achieve better results. The main reason that neural networks have been introduced into reinforcement learning is that "representation learning with deep learning enables automatic feature engineering and end-to-end learning through updating weights of neural networks so that reliance on domain knowledge is significantly reduced or even removed" [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. According to [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], deep learning and reinforcement learning will play an important role in achieving artificial intelligence, and the cited overview describes the main principles of both kinds of learning.
      </p>
      <p>
        In this paper, we describe an algorithm based on a learning algorithm that can learn to play computer games and solve similarly defined problems. Games provide excellent test beds for such algorithms [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Our introduced algorithms could later be used in more real-world applications such as natural language processing [
        <xref ref-type="bibr" rid="ref14 ref23 ref30 ref6">6, 14, 23, 30</xref>
        ], computer vision [
        <xref ref-type="bibr" rid="ref19 ref29 ref31">19, 29, 31</xref>
        ], business management (recommendation, customer management, marketing, etc.) [
        <xref ref-type="bibr" rid="ref13 ref25 ref28 ref5 ref9">5, 9, 13, 25, 28</xref>
        ], robot navigation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] in an environment, drone navigation, etc. An autonomous drone faces problems similar to those described in this paper, especially when coping with indoor navigation with many obstacles, navigation in caves, etc. More real-world examples of reinforcement learning applications can be found in the overview [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>The paper is organized as follows: Section 2 contains a description and comparison of related algorithms. Section 3 describes our new algorithms. Section 4 presents the results and observations of the tested algorithms on the Flappy Bird game. Section 5 contains a summary of our contribution to the area.</p>
    </sec>
    <sec id="sec-2">
      <title>Reinforcement Learning Algorithms</title>
      <p>
        An overview of deep reinforcement learning algorithms
can be found for example in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In the following subsections, we describe the reinforcement learning algorithms that we modified, tested, and compared with our results.
      </p>
      <sec id="sec-2-1">
        <title>Q-learning</title>
        <p>
          Q-learning is an algorithm that uses Q-values. These values are represented as a function Q : S × A → ℝ. The input of the function Q is a pair ⟨s, a⟩, where s ∈ S (S is a state space), a ∈ A (A is an action space), and ℝ is the set of real numbers. As the output of the function Q we expect a possible future reward, a Q-value. Since the sets S and A are finite, the function Q is a discrete function. The pseudocode of the Q-learning algorithm is given in Algorithm 1, prepared according to [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
        <p>
          The algorithm requires a state space S, an action space A, and parameters such as a learning rate η and a discount factor γ. The learning rate η ∈ [0, 1] determines how much of the old Q-value we take into account. The discount factor γ ∈ [0, 1] determines how much of the next possible reward we take into account. The key part of the algorithm is equation (1), which shows how the Q-function is updated. Q is initialized with random values [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. If all actions from A are chosen in all states infinitely many times, i.e. if the algorithm is executed infinitely many times and the parameters η and γ are set properly, then the Q-values converge to the optimal values with probability 1 [
          <xref ref-type="bibr" rid="ref17 ref27">17, 27</xref>
          ].
        </p>
        <p>Algorithm 1: The pseudocode of the Q-learning algorithm.
Initialize all Q(s, a) arbitrarily.
for all episodes do
    Initialize s_t ∈ S, where t = 0 ∈ T.
    while s_t is not the terminal state do
        Choose a ∈ A using a policy derived from Q for the state s_t (e.g. the ε-greedy policy).
        Take action a, observe r and a new state s_{t+1}.
        Update Q(s_t, a) using the equations
            Q(s_t, a) = Q(s_t, a) + δ,    (1)
            δ = η(r + γ max_{a′ ∈ A} Q(s_{t+1}, a′) − Q(s_t, a)).
        Update t = t + 1.
    end
end</p>
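        <p>Algorithm 1 can be rendered in a few lines of Python; the following is an illustrative tabular sketch, not the paper's implementation, and the environment interface (reset, is_terminal, step) is an assumption.</p>

```python
# An illustrative tabular implementation of Algorithm 1 (not the paper's code).
# The environment interface (reset/is_terminal/step) is an assumption.
import random
from collections import defaultdict

def q_learning(env, actions, episodes, eta=0.7, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)              # Q(s, a), arbitrary initial values
    for _ in range(episodes):
        s = env.reset()                 # initial state s_t, t = 0
        while not env.is_terminal(s):
            # epsilon-greedy policy derived from Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r = env.step(s, a)  # take action, observe r and s_{t+1}
            # equation (1): Q(s,a) += eta * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```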
        <p>
          The SARSA algorithm (State-Action-Reward-State-Action) is very similar to the Q-learning Algorithm 1. In fact, in the beginning it was called modified Q-learning by its inventors, Rummery and Niranjan [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. The only difference in the pseudocode is in equation (1), which is changed to the following equation (2):
        </p>
        <p>
          Q(s_t, a) = Q(s_t, a) + η(r + γ Q(s_{t+1}, a′) − Q(s_t, a)),    (2)
where a′ is chosen by the same policy as a is chosen in the state s_t (for example, by the ε-greedy policy). So the equation does not maximize the possible future reward, but always uses the same policy to choose the next action [
          <xref ref-type="bibr" rid="ref22 ref24">22, 24</xref>
          ]. That is why we say that SARSA is an on-policy algorithm, while Q-learning is an off-policy algorithm.
        </p>
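        <p>The difference between the two update rules can be isolated in a single function each; this is a sketch assuming a dictionary-based Q-table, not the paper's implementation.</p>

```python
# The Q-learning update (1) next to the SARSA update (2): the only change
# is the bootstrap term. Q is a dict-like Q-table (a sketch for illustration).
def q_learning_update(Q, s, a, r, s_next, actions, eta=0.7, gamma=1.0):
    # off-policy: bootstrap from the maximizing action in s_next
    best = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += eta * (r + gamma * best - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, eta=0.7, gamma=1.0):
    # on-policy: bootstrap from the action a' actually chosen by the policy
    Q[(s, a)] += eta * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```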
        <p>
          For a given policy, the value of the Q-function can be computed by the Bellman operator, which is a contraction with a unique fixed point [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] under some conditions on the Q-function.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Deep Q-learning</title>
        <p>
          The Deep Q-learning (DQN) algorithm, introduced by the company DeepMind Technologies, is described in Algorithm 2. It is largely based on the algorithm of Riedmiller [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], which uses multi-layer perceptrons.
        </p>
        <p>
          At first, it initializes a replay memory (a principle similar to the "experience replay" in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]). This memory is used to store training data. The first training data can be created in some reasonable way, for example by random play of the game. The agent should balance failure data against success data if it is to learn faster. Picking an action in the algorithm is done the same way as in Q-learning or SARSA; only the updates differ, being performed as a gradient step that changes the weights of the network. In the algorithm, γ ∈ [0, 1] is a discount factor with the same meaning as in the SARSA and Q-learning algorithms. The original algorithm was introduced in combination with convolutional neural networks [
          <xref ref-type="bibr" rid="ref18 ref20">18, 20</xref>
          ].
        </p>
        <p>Algorithm 2: The pseudocode of the Deep Q-learning algorithm.
Initialize replay memory D to capacity n.
Initialize the function Q with random weights.
Remark: the subscript ∗ indexes minibatch transitions.
for all episodes do
    Initialize s_t ∈ S, where t = 0 ∈ T.
    while s_t is not the terminal state do
        Choose a ∈ A using a policy derived from Q for the state s_t (e.g. the ε-greedy policy).
        Take action a, observe r and a new state s_{t+1}.
        Store ⟨s_t, a, r, s_{t+1}⟩ in D.
        Sample a minibatch of transitions ⟨s_∗, a_∗, r_∗, s′_∗⟩ from D.
        if s′_∗ is the terminal state then
            Set y = r_∗.
        else
            Set y = r_∗ + γ max_{a′ ∈ A} Q(s′_∗, a′; θ).
        end
        Perform a gradient step on (y − Q(s_∗, a_∗; θ))².
        Update t = t + 1.
    end
end</p>
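        <p>The minibatch target computation inside Algorithm 2 can be sketched as follows; the function q stands in for the network Q(·, ·; θ), and the transition tuple format is an assumption for illustration.</p>

```python
# A sketch of the minibatch target computation in Algorithm 2. The function q
# stands in for the network Q(s, a; theta); the transition format
# (s, a, r, s_next, terminal) is an assumption for illustration.
def dqn_targets(minibatch, q, actions, gamma=0.99):
    targets = []
    for s, a, r, s_next, terminal in minibatch:
        if terminal:
            y = r                                         # y = r for terminal s'
        else:
            y = r + gamma * max(q(s_next, a2) for a2 in actions)
        targets.append((s, a, y))  # a gradient step is then taken on (y - q(s, a))^2
    return targets
```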
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Modifications of the Algorithms</title>
      <p>The following subsections describe a newly developed policy for choosing a new action in Q-learning, and also a modified version of the Deep Q-learning algorithm.</p>
      <sec id="sec-3-1">
        <title>Maximizing k-Future Rewards Policy</title>
        <p>The policy for selecting actions is one of the key factors of
reinforcement learning algorithms.</p>
        <p>
          It is known [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] that a reinforcement learning algorithm can be unstable or even diverge when a nonlinear function approximator is applied as the Q-function. Nevertheless, such approximations give quite interesting results in many experimental cases.
        </p>
        <p>The classical description of the Q-learning algorithm, and also of SARSA, takes only the next step into account. The main idea of our new approach is not only to maximize the Q-values of the actual state over all actions, but to sum up the Q-values of all possibilities to a depth k (k &gt; 0) and pick the action with the greatest summed value (in many games, a player analyzes several steps ahead). The complexity of picking a new action grows exponentially with the number of future steps taken into account: for k future steps it is O(|A|^k), where A is the set of all possible actions in the action space.</p>
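        <p>One plausible reading of this policy is the following sketch: recursively sum Q-values along each action sequence of depth k and take the first action of the best branch, at cost O(|A|^k). The deterministic next_state model is a hypothetical helper, not part of the paper's implementation.</p>

```python
# A sketch of the maximizing k-future rewards policy: sum Q-values along each
# action branch to depth k and pick the first action of the best branch.
# The deterministic next_state transition model is an assumption.
def k_future_value(Q, next_state, s, a, k, actions):
    v = Q[(s, a)]
    if k > 1:
        s2 = next_state(s, a)
        v += max(k_future_value(Q, next_state, s2, a2, k - 1, actions)
                 for a2 in actions)
    return v

def pick_action(Q, next_state, s, k, actions):
    # evaluates |A|^k leaves, hence the exponential cost noted in the text
    return max(actions,
               key=lambda a: k_future_value(Q, next_state, s, a, k, actions))
```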
      </sec>
      <sec id="sec-3-2">
        <title>Modified Deep Q-learning</title>
        <p>
          In the first step, we need to decide what type of neural network will be used in the implementation of the Deep Q-learning Algorithm 2. The authors of [
          <xref ref-type="bibr" rid="ref20 ref3 ref4">3, 4, 20</xref>
          ] usually use convolutional neural networks with raw pixel information from the game screen to define the state space. We decided to use feed-forward neural networks (FNN), and we use an ensemble of FNNs in the pre-training of the developed algorithm. Pre-training was done on real data, and we then use the best FNN for training. The pseudocode of the modified Deep NNQ-learning algorithm is given in Algorithm 3.
        </p>
        <p>The input to the FNN is defined as a pair ⟨s, a⟩, where s ∈ S and a ∈ A, and the output is defined as a real value (a possible future reward). An advantage of the approach is that such a network is trained as one system. For example, let there be a frequently occurring state s_i = ⟨s_i1, …, s_in⟩ ∈ S, so our algorithm is well trained for the state s_i. Then let there also be a very rarely occurring state s_j = ⟨s_j1, …, s_jn⟩ ∈ S, s_j ≠ s_i, but s_j = ⟨s_i1 + e_1, …, s_in + e_n⟩, where e_k is very close to zero for every k, 1 ≤ k ≤ n. In other words, the states s_i and s_j are very similar but not the same. In the previous approach, using Algorithm 1 with equation (1) or (2), the agent would act very well for the state s_i but not for the almost identical state s_j, since training is realized through the discrete Q-function; the FNN, in contrast, can generalize to s_j.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Flappy Bird Game</title>
        <p>In our research, we compare the results of the algorithms applied in the Flappy Bird game. The game is a side-scroller in which the player controls a bird, attempting to fly between columns of pipes without hitting them. The score is the number of columns passed without a hit. Fig. 1 shows an example of the game configuration. The player can see up to two of the following pipes. The bird moves forward (along the horizontal axis) with a constant velocity, but its movement along the vertical axis is more complicated, and the player's control can influence it. If the bird is not controlled by the player, then its vertical velocity (hereinafter referred to only as velocity) decreases in each step by 1 downwards until its velocity is 10 downwards.</p>
        <p>Algorithm 3: The pseudocode of the modified Deep NNQ-learning algorithm. The function Q in the algorithm represents a neural network.
Obtain Q-values using Algorithm 1 with the policy described in Subsection 3.1.
Initialize the function Q with random weights.
Pre-train the function Q using the Q-values obtained in the first step.
for all episodes do
    Initialize s_t ∈ S, where t = 0 ∈ T.
    while s_t is not the terminal state do
        Choose a ∈ A using a policy derived from Q for the state s_t (e.g. the ε-greedy policy).
        Take action a, observe r and a new state s_{t+1}.
        if s_{t+1} is the terminal state then
            Set y = r.
        else
            Set y = r + γ max_{a′ ∈ A} Q(s_{t+1}, a′; θ).
        end
        Perform a gradient step on (y − Q(s_t, a; θ))².
        Update t = t + 1.
    end
end</p>
        <p>Definition 1. Let X ⊆ ℝ be the set of all different distances from the leftmost pixels of the bird to the rightmost pixels of the pipe nearest to the bird, let Y ⊆ ℝ be the set of all different distances from the bottommost pixels of the bird to the topmost pixels of the next bottom pipe, and let V ⊆ ℤ be the set of the bird's velocities, where negative values mean the upward direction and positive values mean the downward direction. Then the set S = {⟨x, y, v⟩ : x ∈ X, y ∈ Y, v ∈ V} is called the state space.</p>
        <p>We define the action space as the set of two actions A = {FLAP, NOT_FLAP} and the state space by Definition 1, to use them in the reinforcement learning algorithms. In our implementation, we work with the rounding function rnd : ℝ × ℕ⁺ → ℤ, x ∈ ℝ, r ∈ ℕ⁺, defined by (3):</p>
        <p>rnd(x, r) = r ⌊x/r⌉.    (3)
The cardinality of S could be very high, so we reduce the number of states using the states conversion function defined in Definition 2, formula (4).</p>
        <p>Definition 2. Let X ⊆ ℝ and Y ⊆ ℝ be the sets of horizontal and vertical distances as in Definition 1, let R_X, R_Y ⊆ ℕ⁺ be the sets of all rounding values for the horizontal/vertical distances, let V ⊆ ℤ be the set of the bird's velocities, where negative values mean the upward direction and positive values mean the downward direction, and let S = {⟨x, y, v⟩ : x ∈ X_S, y ∈ Y_S, v ∈ V} be the state space, where X_S = (X ∩ M_{r_X}) ∪ {rnd(min(X), r_X), rnd(max(X), r_X)} and Y_S = (Y ∩ M_{r_Y}) ∪ {rnd(min(Y), r_Y), rnd(max(Y), r_Y)}, where M_{r_Z} is the set of all multiples of r_Z ∈ R_Z for Z ∈ {X, Y} and rnd is the rounding function defined by (3). The function scf : X × Y × R_X × R_Y × V → S is called the states conversion function, and if x ∈ X, y ∈ Y, r_X ∈ R_X, r_Y ∈ R_Y, v ∈ V and rnd is the rounding function (3), then
scf(x, y, r_X, r_Y, v) = ⟨rnd(x, r_X), rnd(y, r_Y), v⟩.    (4)</p>
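        <p>Reading the bracket in (3) as rounding to the nearest multiple of r, the rounding and states conversion functions can be sketched as follows (an illustration, not the paper's code):</p>

```python
import math

# The rounding function (3), interpreted as snapping x to the nearest
# multiple of r, and the states conversion function (4) built on top of it.
def rnd(x, r):
    return r * math.floor(x / r + 0.5)

def scf(x, y, r_x, r_y, v):
    # coarsen the two distances, keep the exact velocity
    return (rnd(x, r_x), rnd(y, r_y), v)
```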
      </sec>
      <sec id="sec-4-2">
        <title>Simple and Advanced Greedy Algorithms</title>
        <p>We tested the simple greedy algorithm described in Algorithm 4. This algorithm does not use any learning; it only uses the following simple rule: flap anytime the vertical distance of the bird to the next bottom pipe could become less than zero in the next state (so that the agent could hit the pipe).</p>
        <p>The average score of the algorithm over 5,000 games is 143.43; the individual game scores (the red dots) are shown in Fig. 2. The algorithm is implemented mostly for comparison with our developed algorithm, and in the future it can be used as a part of other algorithms.</p>
        <p>So the advanced greedy policy, described below, does not change the average score significantly, but it is still better than the simple greedy algorithm. The pipe positions in the played games were the same as for the simple greedy algorithm.</p>
        <p>Algorithm 4: The pseudocode of the simple greedy algorithm, where y is the vertical distance of the bird to the next bottom pipe.
if y &lt; 10 then
    return FLAP
else
    return NOT_FLAP
end</p>
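        <p>Algorithm 4 amounts to a one-line rule; a minimal Python rendering (the string action names are an assumption for illustration):</p>

```python
# Algorithm 4 as a one-line rule; action names as strings are an assumption.
def simple_greedy(y):
    # flap whenever the vertical distance to the next bottom pipe could drop
    # below zero in the next state (10 is the maximal downward velocity)
    return "FLAP" if y < 10 else "NOT_FLAP"
```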
        <p>The advanced greedy algorithm does not use any learning either and flaps only if it is necessary, i.e. only if in the next 20 states the bird would hit the ground or the bottom pipe. The number 20 is chosen because it is the minimal time until the bird is back in the same vertical position it was in before it last flapped; in other words, 20 is the number of states affected by a flap. The last restriction is that the bird flaps only if doing so does not cause it to hit the pipe in any of the twenty next steps. The average score of the algorithm over 5,000 games is 144.36, as shown in Fig. 3.</p>
        <p>We then focused on searching for the optimal parameters for the Q-learning algorithm described in Algorithm 1.</p>
        <p>The policy depends on the ε value, or rather on whether the ε-greedy policy is used or not. If ε is not used, there is no significant difference between SARSA and Q-learning, since during the Q-value update SARSA uses the ε-greedy policy while Q-learning maximizes the values (compare the update equations (1) and (2)). Setting ε to some non-zero value is a mechanism that forces the algorithms to be able to converge to the optimal Q-values; otherwise, it would be impossible to prove such a statement.</p>
        <p>Algorithm 5: The pseudocode of the advanced greedy algorithm.
have_to_flap = FALSE
for state ∈ next 20 states do
    x = getX(state)
    y = getYWithoutFlap(state)
    if y ≤ GROUND_BASE or
       (x ≤ PIPE_WIDTH + BIRD_WIDTH and y ≤ 0) then
        <p>        have_to_flap = TRUE
    end
end
if have_to_flap then
    for state ∈ next 20 states do
        x = getX(state)
        y = getYWithoutFlap(state)
        if x ≤ PIPE_WIDTH + BIRD_WIDTH and
           y ≥ VERTICAL_GAP_SIZE − BIRD_HEIGHT then
            return NOT_FLAP
        end
    end
    return FLAP
end
return NOT_FLAP</p>
        <p>
          Based on an analysis of all the parameters, we set them to achieve results as good as possible. The ε parameter ensures that we are also likely to choose previously unselected actions, to test whether the future reward is not higher when choosing such an action. Fig. 4 shows that there is not a big difference between using and not using the ε-greedy policy, but using it generates slightly better results. The value of ε was changed by the formula ε_k = 1/k, where k is the number of the iteration.
        </p>
        <p>
          In all of the simulations in Fig. 5, the learning rate is set to 0.7, the discount factor to 1.0, the reward to r = 1 for alive states and r = −1000 for the three last states, and the rounding values to r_X = r_Y = 5 with n = 1; the ε-greedy policy is used. The colored trend lines represent the average scores of the last 1,000 games for the parameter k taking the values in the legend on the left side of the figure.
        </p>
        <p>
          In the experiments discussed in this subsection, we use a feed-forward neural network with 2 hidden layers of 600 and 200 neurons. These values appeared to be the best in our case; we also tested more and fewer layers, but the best results were achieved with only two layers (we also tried 3 to 5 layers with more neurons, e.g. 1000, or fewer, e.g. 100). The input to the network is a triple of three real values: the horizontal and vertical distances of the bird to the next pipe and the velocity. The output of the network is an ordered pair of two real values from the interval [−1, 1], meaning the future reward of choosing the flap or not-flap action. Three different activation functions were tested: sigmoid, hyperbolic tangent, and ReLU; in the given evaluation we use tanh. The network is initialized with random weights, and these are updated using the Adam algorithm [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The learning rate is set to 0.01.
        </p>
        <p>We initialize the network using Q-values generated by Q-learning, as described in Algorithm 3. For each state s, we map the decision to 1 if the action would be taken by the Q-learning algorithm, or to −1 if it would not. So for each state s ∈ S and both actions a1 and a2 we have training data in the format Q(s, a1) = 1 and Q(s, a2) = −1, or Q(s, a1) = −1 and Q(s, a2) = 1. When training the model during initialization, we use several optimizers (gradient descent, AdaGrad, RMSProp, and Adam) and train several separate networks. Since Adam trains the best models, we use those; decisions are made by all of them together, by summing the predictions of all the models and then picking the action that maximizes the summed values.</p>
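        <p>The pre-training targets described above can be sketched as follows; the (state, action, target) tuple format is an assumption for illustration, not the paper's data layout.</p>

```python
# Building pre-training targets from tabular Q-values: the action the
# Q-learning policy would take gets target 1, the other action -1. The
# (state, action, target) tuple format is an assumption for illustration.
def pretraining_targets(Q, states, actions=("FLAP", "NOT_FLAP")):
    data = []
    for s in states:
        best = max(actions, key=lambda a: Q[(s, a)])
        for a in actions:
            data.append((s, a, 1.0 if a == best else -1.0))
    return data
```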
        <p>Rewards are then generated in a similar way as during initialization: in state s the reward is 1 if the action was taken, or −1 if it was not. Only the last 30 states are ignored, because we presume that some of them caused the death of the bird. Training is then done only on the new data, and no discount factor is used in this case. Fig. 6 shows a comparison of the two best-implemented instances of Q-learning and Deep NNQ-learning, tested on the same 100 games, where each game generates at most 50,000 pipes (then the game stops). The average score is approximately 36,429.26 for Deep NNQ-learning and 2,771.59 for Q-learning. So the modified Deep NNQ-learning plays significantly better than Q-learning using the maximizing k-future rewards policy.</p>
        <p>
          An open-source Python implementation of the game is used in the paper [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. The neural networks are implemented using the TensorFlow library [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. All the source codes used to create the results of this paper are available in the public GitHub repository https://github.com/martinglova/FlappyBirdBotsAI.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>The results described in this paper show some properties of Q-learning and the role of neural networks in reinforcement learning through the Deep Q-learning algorithm. The experiments confirm some hypotheses about the parameter settings of the algorithms, such as the learning rate, discount factor, rewards, the number of penalized states, etc. The introduced algorithms, tested on the Flappy Bird game, show that extracting features from images and using a feed-forward neural network instead of a convolutional one can also lead to significant results. The same approach has potential in more real-world applications where raw pixels with lots of noise are used as input.</p>
      <p>Fig. 5 shows how introducing the maximizing k-future rewards policy improves the original Q-learning algorithm, and Fig. 6 shows that the improvement can be even greater using the Deep NNQ-learning algorithm.</p>
      <p>Acknowledgements. The research is supported by
the Slovak Scientific Grant Agency VEGA, Grant No.
1/0056/18.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          TensorFlow Library, https://www.tensorflow.org/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Alpaydin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Introduction to machine learning</article-title>
          . MIT Press, Cambridge (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Appiah</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vare</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Playing flappybird with deep reinforcement learning (</article-title>
          <year>2016</year>
          ), http://cs231n.stanford. edu/reports/2016/pdfs/111_Report.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Deep reinforcement learning for flappy bird (</article-title>
          <year>2015</year>
          ), http://cs229.stanford.edu/proj2015/ 362_report.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bao</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Deep direct reinforcement learning for financial signal representation and trading</article-title>
          .
          <source>IEEE transactions on neural networks and learning systems 28(3)</source>
          ,
          <fpage>653</fpage>
          -
          <lpage>664</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Dhingra</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , et al.:
          <article-title>End-to-end reinforcement learning of dialogue agents for information access</article-title>
          .
          <source>arXiv preprint arXiv:1609.00777</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Fenjiro</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benbrahim</surname>
          </string-name>
          , H.:
          <article-title>Deep reinforcement learning overview of the state of the art</article-title>
          .
          <source>Journal of automation, mobile robotics &amp; Intelligent Systems</source>
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <fpage>20</fpage>
          -
          <lpage>38</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Fujimoto</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Precup</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Off-policy deep reinforcement learning without exploration</article-title>
          .
          <source>In: 36th International Conference on Machine Learning, PMLR 97</source>
          . pp.
          <fpage>2052</fpage>
          -
          <lpage>2062</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosorok</surname>
            ,
            <given-names>M.R.:</given-names>
          </string-name>
          <article-title>Q-learning with censored data</article-title>
          .
          <source>Annals of Statistics</source>
          <volume>40</volume>
          (
          <issue>1</issue>
          ),
          <fpage>529</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Hammond</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Deep reinforcement learning in the enterprise: Bridging the gap from games to industry</article-title>
          (
          <year>2017</year>
          ), https://conferences.oreilly.com/artificial-intelligence/ai-ca-2017/public/schedule/detail/60500, AI Conference, San Francisco
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Kaelbling</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Littman</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>A.W.</given-names>
          </string-name>
          :
          <article-title>Reinforcement learning: A survey</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>4</volume>
          ,
          <fpage>237</fpage>
          -
          <lpage>285</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langford</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schapire</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          :
          <article-title>A contextual-bandit approach to personalized news article recommendation</article-title>
          .
          <source>In: Proceedings of the 19th international conference on World wide web</source>
          . pp.
          <fpage>661</fpage>
          -
          <lpage>670</lpage>
          . ACM (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>End-to-end task-completion neural dialogue systems</article-title>
          .
          <source>arXiv preprint arXiv:1703.01008</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Deep reinforcement learning: An overview</article-title>
          .
          <source>arXiv preprint arXiv:1701.07274</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          :
          <article-title>Self-improving reactive agents based on reinforcement learning, planning and teaching</article-title>
          .
          <source>Machine Learning</source>
          <volume>8</volume>
          (
          <issue>3-4</issue>
          ),
          <fpage>293</fpage>
          -
          <lpage>321</lpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Melo</surname>
            ,
            <given-names>F.S.</given-names>
          </string-name>
          :
          <article-title>Convergence of Q-learning: A simple proof</article-title>
          .
          <source>Institute of Systems and Robotics, Tech. Rep.</source>
          pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>518</volume>
          ,
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heess</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Recurrent models of visual attention</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>2204</fpage>
          -
          <lpage>2212</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>Playing Atari with deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1312.5602</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method</article-title>
          .
          <source>In: European Conference on Machine Learning, LNAI 3720</source>
          . pp.
          <fpage>317</fpage>
          -
          <lpage>328</lpage>
          . Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Rummery</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niranjan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>On-line Q-learning using connectionist systems</article-title>
          , vol.
          <volume>37</volume>
          . University of Cambridge, Department of Engineering (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>P.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gasic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mrksic</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rojas-Barahona</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ultes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandyke</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>T.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>On-line active reward learning for policy optimisation in spoken dialogue systems</article-title>
          .
          <source>arXiv preprint arXiv:1605.07669</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          :
          <article-title>Reinforcement Learning: An Introduction</article-title>
          . MIT Press, Cambridge, London (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Theocharous</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghavamzadeh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Personalized ad recommendation systems for life-time value optimization with guarantees</article-title>
          .
          <source>In: IJCAI</source>
          . pp.
          <fpage>1806</fpage>
          -
          <lpage>1812</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Verma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Flappy bird</article-title>
          . https://github.com/sourabhv/FlapPyBird (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Watkins</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dayan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Q-learning</article-title>
          .
          <source>Machine Learning</source>
          <volume>8</volume>
          ,
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Neill</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maei</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Optimal demand response using device-based reinforcement learning</article-title>
          .
          <source>IEEE Transactions on Smart Grid</source>
          <volume>6</volume>
          (
          <issue>5</issue>
          ),
          <fpage>2312</fpage>
          -
          <lpage>2324</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>
          .
          <source>In: International Conference on Machine Learning</source>
          . pp.
          <fpage>2048</fpage>
          -
          <lpage>2057</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gašić</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>POMDP-based statistical spoken dialog systems: A review</article-title>
          .
          <source>Proceedings of the IEEE</source>
          <volume>101</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1160</fpage>
          -
          <lpage>1179</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Zoph</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>Neural architecture search with reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1611.01578</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>