<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Batch Reinforcement Learning on a RoboCup Small Size League keepaway strategy learning problem</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Franco Ollino</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel A. Solis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Héctor Allende</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Innovación y Robótica</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Católica del Norte</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Técnica Federico Santa María</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Robotic soccer provides an adversarial scenario where collaborative agents have to execute actions following a hand-coded or a learned strategy, which in the case of the Small Size League is given by a centralized decision maker. This work takes advantage of that centralized approach to model the keepaway strategy learning problem, which is inherently multi-agent, as a single-agent problem where each robot forms part of the state of the model. A classical reinforcement learning method is compared with its batch version in terms of learning time, drawing conclusions about update efficiency based on the reuse of experiences.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Batch Reinforcement Learning (BRL) is one of the current lines of
research in the field of Reinforcement Learning (RL), likewise concerned with solving sequential
decision problems modelled by a Markov Decision Process (MDP). Given the nature of these
problems, as intuition may suggest, the scope of this type of learning has extended to areas such as
Robotics applications
        <xref ref-type="bibr" rid="ref7">(Kober et al., 2013)</xref>
        .
      </p>
      <p>
        As in the classical approach, with online algorithms, the focus is still on teaching an agent how
to behave under certain conditions based on punishments or rewards (reinforcement signals)
that depend on the results of applying a certain action
        <xref ref-type="bibr" rid="ref14">(Sutton et al., 1998)</xref>
        . Q-learning
        <xref ref-type="bibr" rid="ref15">(Watkins and
Dayan, 1992)</xref>
        is one of the most popular online algorithms, whose updates are computed in an
incremental manner.
      </p>
      <p>
        The BRL approach aims to collect a batch of experiences and then use them to update the
action value function all at once, instead of updating it incrementally. In this batch
framework, algorithms like Experience Replay (ER) or Fitted Q Iteration (FQI)
        <xref ref-type="bibr" rid="ref4">(Ernst et al., 2005)</xref>
        can
be found.
      </p>
      <p>
        The Robot Soccer World Cup (RoboCup,
        <xref ref-type="bibr" rid="ref6">(Kitano et al., 1997)</xref>
        ) is an annual competition whose
main objective goes far beyond just playing a robot soccer game: it presents a natural scenario
where RL problems can be found, in addition to several multidisciplinary challenges across its different
leagues, such as the small size league, standard platform league, humanoid league and others. In this
problem, a team of cooperative agents has to play a soccer match against another team composed
of autonomous agents; a possible objective for a given team could be to keep the ball as far
as possible from its own goal area. Many works addressing this objective can be
found in the literature, from keepaway strategies using a multi-agent approach
        <xref ref-type="bibr" rid="ref13">(Stone et al., 2005)</xref>
        to algorithms focused on training just the goalkeeper, as in
        <xref ref-type="bibr" rid="ref1">(Ahumada et al., 2013)</xref>
        or
        <xref ref-type="bibr" rid="ref3">(Celiberto et al.,
2007)</xref>
        .
      </p>
      <p>
        This work, like
        <xref ref-type="bibr" rid="ref1">(Ahumada et al., 2013)</xref>
        and
        <xref ref-type="bibr" rid="ref3">(Celiberto et al., 2007)</xref>
        , uses a grid to
discretize the state space of the agent, thereby avoiding a continuous state space
representation, where tabular methods become impractical
        <xref ref-type="bibr" rid="ref2">(Baird and Klopf, 1993)</xref>
        .
      </p>
      </p>
      <p>
        Unlike the above references, most of the works found in the literature
        <xref ref-type="bibr" rid="ref12 ref13 ref5 ref9">(Pietro et al., 2002;
Kalyanakrishnan and Stone, 2007; Sawa and Watanabe, 2011; Stone et al., 2005)</xref>
        generate a state space
representation based on angles and distances from the keeper (the learning robot currently in possession of
the ball) to every robot in the confined space of interest. Since large (or continuous) state spaces
require function approximation,
        <xref ref-type="bibr" rid="ref13">(Stone et al., 2005)</xref>
        uses tile coding to approximate Q-values
when implementing and comparing online RL algorithms (Q-learning vs. Sarsa($\lambda$)).
      </p>
      <p>
        Getting closer to our case of interest, the assumptions in
        <xref ref-type="bibr" rid="ref5">(Kalyanakrishnan and Stone, 2007)</xref>
        allow
the agents to communicate with each other in order to share their experiences. They also compare
BRL with online RL algorithms, finding that Fitted Q Iteration and Experience Replay perform
comparably to each other, but both outperform the online learning algorithm used in
        <xref ref-type="bibr" rid="ref13">(Stone et al., 2005)</xref>
        .
      </p>
      <p>This document compares Q-learning and its batch version using Experience Replay
on a simulation of the RoboCup Small Size League, noting that, given the setup of the league, this
involves a centralized decision problem, reducing the learning problem to a single-agent case where
each robot plays a fundamental role in the state space representation.</p>
      <p>The remainder of this document is organized as follows: Section 2 describes BRL in more detail,
presenting the algorithms that will be used later. Section 3 gives a brief description of the RoboCup Small
Size League and explains how this setup can be used to introduce variations on the approaches
found in the literature for learning a keepaway strategy, while Section 4 shows the implementation of
BRL algorithms in a simulated environment. Finally, Section 5 draws some conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Batch Reinforcement Learning</title>
      <p>
        Reinforcement learning
        <xref ref-type="bibr" rid="ref14">(Sutton et al., 1998)</xref>
        (RL) tackles the problem of an agent that learns while
interacting with the environment, deciding which action $a$ to execute in the current state $s$ of its
environment, which transfers the agent to another state $s'$ while receiving a reward (reinforcement signal)
that quantifies how desirable that choice was. This problem can
be formulated as an MDP
        <xref ref-type="bibr" rid="ref14">(Sutton et al., 1998)</xref>
        , composed of a tuple $(S, A, T, R)$ where
• $S$ denotes the set of all possible states.
• $A$ is the set of all actions the agent can execute.
• $T: S \times A \times S \to [0, 1]$ is a state transition function, which gives the probability that, when the
agent is in state $s$ and executes action $a$, it will be transferred to another state $s'$.
• $R: S \times A \to \mathbb{R}$ is a scalar (real-valued) reward function.
• $\pi: S \to A$ denotes the mapping from states to actions, describing the policy the agent should
follow in a given state.
      </p>
      <p>As mentioned before, the task of the agent is to learn the sequence of actions (hence the optimal
policy, $\pi^*$) that maximizes the expected sum of all the rewards received in the long term.
This is tackled by maximizing the return $R_t$, i.e. the discounted sum of rewards that the agent will
obtain from time $t$, given by</p>
      <p>$$R_t = \sum_{k=0}^{n-1} \gamma^k r_{t+1+k}, \qquad (1)$$</p>
      <p>where $\gamma$ stands for the discount factor, with $0 \leq \gamma &lt; 1$, and $r_{t+1}$ stands for the expected (scalar)
reward obtained for executing action $a_t$ in state $s_t$. Then, two quantifications for the expected return
are defined: the value function $V$ and the action value function $Q$.</p>
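      <p>The return in (1) is a plain discounted sum; a minimal Python sketch (the function name is ours):</p>

```python
def discounted_return(rewards, gamma):
    """Return R_t from Eq. (1): the discounted sum of the rewards
    r_{t+1}, r_{t+2}, ... received after time t."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With gamma = 0.5 and three unit rewards: 1 + 0.5 + 0.25 = 1.75
```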
      <p>The value function is defined as the expected return when the agent is in state $s_t$ at time $t$,
$$V^\pi(s) = E_\pi\left[ R_t \mid s_t = s \right], \qquad (2)$$
while the action value function is defined as the expected return when the agent executes $a_t$ in
state $s_t$ at time $t$, following policy $\pi$,</p>
      <p>$$Q^\pi(s, a) = E_\pi\left[ R_t \mid s_t = s, a_t = a \right]. \qquad (3)$$
Both functions are clearly related, as</p>
      <p>
        $$V^\pi(s) = E_{a \mid s}\left[ Q^\pi(s, a) \right]. \qquad (4)$$
A representative method in model-free RL is Q-learning
        <xref ref-type="bibr" rid="ref15">(Watkins and Dayan, 1992)</xref>
        , which
approximates the optimal action-value function based on the optimal policy by making
successive updates to the estimate of $Q$; this update is given by
      </p>
      <p>$$Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) \right], \qquad (5)$$
where this approximation, $Q$, corresponds to the learned action-value function, and $\alpha$ stands for
the learning rate.</p>
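      <p>The update rule (5) translates directly into a tabular implementation; a sketch (the defaultdict table and parameter names are our assumptions):</p>

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One incremental Q-learning update, Eq. (5):
    Q(s,a) ← (1 - alpha) * Q(s,a) + alpha * (r + gamma * max over a' of Q(s',a'))."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# Starting from Q = 0 everywhere, a reward of 1.0 moves Q(s, a) to alpha * 1.0:
Q = defaultdict(float)
q_update(Q, s=0, a=0, r=1.0, s_next=1, actions=[0, 1], alpha=0.5)
```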
      <p>In order to understand the difference between the incremental update of online algorithms
and the simultaneous update of batch algorithms, consider two consecutive transitions $(s, a, r, s')$,
$(s', a', r', s'')$ and the classical online Q-learning algorithm. When $Q(s', a')$ is computed using the
update rule in (5), this change will not be backpropagated to $Q(s, a)$ nor to any of the state-action
pairs preceding $s'$; those are updated only when the corresponding states are visited again.</p>
      <p>In the pure batch reinforcement learning approach, the agent does not interact with the
environment while the learning phase is taking place. In growing batch reinforcement learning, on which
most modern batch algorithms are based, the tasks of collecting transitions and learning
from them are alternated to improve the exploration policy.</p>
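      <p>The alternation above can be sketched as a short loop; a minimal Python skeleton, where collect_episode() and learn_from() are hypothetical callables standing in for the exploration and learning phases:</p>

```python
def growing_batch_rl(collect_episode, learn_from, n_iterations, m_forget=0):
    """Growing-batch skeleton: alternate collecting transitions with a
    learning phase over the stored batch D. collect_episode() is assumed
    to return a list of (s, a, r, s') tuples; learn_from(batch) is assumed
    to update the Q estimates in place."""
    batch = []
    for _ in range(n_iterations):
        batch.extend(collect_episode())   # exploration phase: grow D
        learn_from(batch)                 # learning phase over all of D
        if m_forget:
            del batch[:m_forget]          # forget the m oldest experiences
    return batch
```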
      <p>
        Algorithm 1 describes the procedural form of a (growing) BRL approach, independently of the
algorithm used for updating Q-values, as shown in
        <xref ref-type="bibr" rid="ref5">(Kalyanakrishnan and Stone, 2007)</xref>
        . Note that
when the number of forgotten experiences, $m$, is the same as the size of the batch, i.e.
$m = |D|$, all experiences are forgotten, so growing BRL is reduced to pure BRL, which is not the case in
this proposal.
      </p>
      <p>
        Moreover,
        <xref ref-type="bibr" rid="ref5">(Kalyanakrishnan and Stone, 2007)</xref>
        state that it is better (for their task) to use all
the experiences gathered so far. This means that if every batch consists of experiences from
20 episodes, then the first update of the Q estimates uses experiences from those 20
episodes, the second time the updates are computed they use experiences from
40 episodes, and so on, which makes the process extremely memory-consuming.
      </p>
      <p>One of the basic BRL algorithms, Experience Replay (ER), aims to improve the speed of
convergence of the action value function by replaying observed transitions repeatedly, just as if they were
new observations. Algorithm 2 shows the procedural form of this algorithm.</p>
      <sec id="sec-2-1">
        <title>ALGORITHM 2</title>
        <p>Experience Replay procedure
1: for each training iteration do
2:   for each transition $(s_i, a_i, r_i, s_{i+1})$ in $D$ do
3:     Update $Q(s_i, a_i)$ by using
       $$Q(s_i, a_i) \leftarrow (1 - \alpha) Q(s_i, a_i) + \alpha \left[ r_{i+1} + \gamma \max_a Q(s_{i+1}, a) \right]$$
4:   end for
5: end for</p>
        <p>Note that what this algorithm does is to compute the Q-learning updates on the collected
transitions several times, as an offline algorithm would, thus speeding up the
propagation of Q-values to preceding states; the system is then allowed to collect new
transitions to improve the previously computed estimates.</p>
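        <p>A tabular version of Algorithm 2 is straightforward; the sketch below (parameter names are ours) replays a batch D and shows how, on a chain of two transitions, the reward earned in the second transition reaches the first state-action pair within the same learning phase, without revisiting it:</p>

```python
from collections import defaultdict

def experience_replay(Q, batch, actions, alpha=0.1, gamma=0.9, n_iterations=10):
    """Algorithm 2: repeatedly sweep the stored transitions, applying the
    Q-learning update of Eq. (5) to each (s_i, a_i, r_i, s_{i+1}) in D."""
    for _ in range(n_iterations):                   # each training iteration
        for (s, a, r, s_next) in batch:             # each transition in D
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# Two chained transitions: replaying propagates the reward of the second
# transition back to Q(state 0) from the second sweep onwards.
Q = defaultdict(float)
experience_replay(Q, [(0, 0, 0.0, 1), (1, 0, 1.0, 2)], actions=[0])
```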
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Test domain: RoboCup SSL</title>
      <p>
        RoboCup presents a challenging domain where a team of robots has to play a soccer match against
another robotic team, where the particular assumptions of the game vary across the different
leagues. This application focuses on the Small Size League (SSL), inspired by the development and
research work of Sysmic Robotics USM (previously known as AIS Soccer)
        <xref ref-type="bibr" rid="ref11">(Rodenas et al.,
2018)</xref>
        , a group of students whose main objective is to compete in this annual event and also test
state-of-the-art computational intelligence techniques in this particular setup.
      </p>
      <p>Figure 1 depicts the scheme of this league, where the current position of each robot of both
teams is given as a result of the image processing performed by SSL-Vision, whose images are acquired
through video cameras provided by the organization committee, located above the soccer field.
Both teams then receive exactly the same data in their own decision-maker programs, which, once
an action is chosen, inform each robot of the team of the actions to take via a wireless channel.</p>
      <p>
        Although we tackle the problem of finding a keepaway strategy, several other challenges arise in
the Small Size League in addition to the already mentioned problems, like goalkeeper training in
        <xref ref-type="bibr" rid="ref1">(Ahumada et al., 2013)</xref>
        , learning the opponent strategy as in
        <xref ref-type="bibr" rid="ref16">(Yasui et al., 2013)</xref>
        , or learning to
control the dribbler
        <xref ref-type="bibr" rid="ref10">(Riedmiller et al., 2008)</xref>
        , noting that the latter work focuses on the Middle
Size League. This problem also applies to the Small Size League, where it is especially difficult to
keep possession of the ball with the dribbler while changing the orientation of the robot.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Implementation and results</title>
      <sec id="sec-4-1">
        <title>Modelling the learning problem</title>
        <p>
          We use a slightly different state space representation compared with
          <xref ref-type="bibr" rid="ref13">(Stone et al., 2005)</xref>
          ,
exploiting a centralized decision-making problem, given that we have a global vision of the field, unlike
some other RoboCup leagues. We set the keepaway learning problem to be composed of 3
keepers, robots in charge of keeping the ball away from their goal area as long as possible, and 2
takers, robots from the opponent team whose objective is to take the ball and shoot
at the (center of the) goal area.
        </p>
        <p>Since offense strategy learning, which would allow the takers to learn better strategies to
score effectively, is out of the scope of this work, we fixed their policy so that they always chase
the ball and, once they get it, shoot at the goal area.</p>
        <p>
          Figure 2 shows the setup of this problem in the simulated environment where the algorithms will
be tested, grSim
          <xref ref-type="bibr" rid="ref8">(Monajjemi et al., 2011)</xref>
          , which has been very helpful for testing computational
intelligence methods before implementing them on the real robots.
        </p>
        <p>The state is composed of the distances from every keeper to all the other robots, including takers
and other keepers, as shown in Figure 2. The distance from the ball to every robot is also considered,
as well as the angle between a keeper and each taker (with respect to an imaginary horizontal line across
the soccer field). In other words, the state $s_t$ at a given time $t$ is composed of
• dist(Ki, ball),
• dist(Tj, ball),
• dist(Ki, Kj), i ≠ j,
• dist(Ki, Tj),
• angle(Ki, Tj),
where Ki stands for the i-th keeper and Tj for the j-th taker. The reader should also note that,
although different state representations could work for a given problem, the angle is necessary for
modelling this one, even when assuming that all the robots are always facing the ball: if
just the distance d from the i-th keeper to the j-th taker were used, there would be theoretically infinite
points on a circle of radius d centered at the position of the keeper where the taker could
possibly be.</p>
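        <p>With 3 keepers and 2 takers, this representation yields 3 + 2 + 3 + 6 + 6 = 20 raw features before grid discretization. A sketch of the feature assembly, assuming positions are given as (x, y) tuples (function and variable names are ours):</p>

```python
import math

def state_features(keepers, takers, ball):
    """State s_t of Sec. 4.1: keeper-ball and taker-ball distances,
    pairwise keeper distances, keeper-taker distances, and keeper-taker
    angles measured against the horizontal axis of the field."""
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    angle = lambda p, q: math.atan2(q[1] - p[1], q[0] - p[0])
    f = [dist(k, ball) for k in keepers]                      # dist(Ki, ball)
    f += [dist(t, ball) for t in takers]                      # dist(Tj, ball)
    f += [dist(keepers[i], keepers[j])                        # dist(Ki, Kj), i ≠ j
          for i in range(len(keepers)) for j in range(i + 1, len(keepers))]
    f += [dist(k, t) for k in keepers for t in takers]        # dist(Ki, Tj)
    f += [angle(k, t) for k in keepers for t in takers]       # angle(Ki, Tj)
    return f
```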
        <p>Then, the possible actions a keeper can execute are
• hold(): all keepers remain in their current positions without making any pass or trying to
intercept the ball.
• pass(Ki, Kj): the i-th keeper performs a pass to the j-th keeper, where obviously i ≠ j, since
i = j would be equivalent to the hold() action.
• intercept(K1, K2, ..., Kn): send each keeper whose binary
argument is set to 1 to intercept the ball.</p>
        <p>In this case, since there are 3 keepers, intercept(0, 1, 0) would send the 2nd keeper to intercept
the ball, while intercept(1, 0, 1) would send the 1st and 3rd keepers. Note that
intercept(0, 0, 0) is not allowed, since it would be equivalent to the hold() action.</p>
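        <p>With 3 keepers this gives 1 hold action, 6 ordered passes and 2³ − 1 = 7 intercept masks, 14 discrete actions in total; a sketch of the enumeration (the string encoding is our assumption):</p>

```python
from itertools import product

def action_set(n_keepers=3):
    """Discrete action set of Sec. 4.1: hold(), every pass(Ki, Kj) with
    i ≠ j, and every non-zero binary intercept mask."""
    actions = ["hold()"]
    actions += [f"pass(K{i},K{j})"
                for i in range(1, n_keepers + 1)
                for j in range(1, n_keepers + 1) if i != j]
    actions += ["intercept(%s)" % ",".join(map(str, mask))
                for mask in product([0, 1], repeat=n_keepers) if any(mask)]
    return actions
```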
        <p>
          Note that, unlike the work in
          <xref ref-type="bibr" rid="ref13">(Stone et al., 2005)</xref>
          , since we have a global vision of the field and
thus focus on a centralized decision-making problem, we learn a Q-value function for the whole
system and not one for each keeper. We have identifiers for each keeper, so the 1st keeper will
always be K1; it does not refer to the keeper who is closest to the ball.
        </p>
        <p>Since the final objective of the keepaway learning problem is to keep the ball away from the
goal area as long as possible, we reward actions that favor ball possession,
punish actions that lead to losing possession, and punish harder when they lead to a goal scored
against the team.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Simulation results</title>
        <p>When implementing the algorithms described in Section 2, we used a growing BRL approach where
the batch of experiences D contains transitions from 20 episodes, each lasting 2 minutes
of gameplay (not counting the reset time when a goal is scored and the robots are relocating).
After updating the Q-value estimates, all those transitions are discarded, so the batch
always holds the data of 20 episodes when entering the learning phase.</p>
        <p>The rewards obtained through the learning episodes are set to 5 for
keeping possession of the ball, -5 for losing possession and -50 when the opposing team
scores a goal. According to these reinforcement values, Figure 4 shows the evolution of ball
possession time.</p>
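        <p>The reinforcement signal reduces to a small lookup over game events; a sketch with the values above (the event labels are our assumption):</p>

```python
def reward(event):
    """Reinforcement values used in the experiments: +5 for keeping
    possession, -5 for losing it, -50 for a goal scored against the team."""
    return {"keep": 5, "lose": -5, "goal_against": -50}[event]
```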
        <p>It can be seen from Figure 3, where the line represents the mean over 10 repetitions of the
learning task, that the batch version of Q-learning using Experience Replay achieves better performance
than its classical online version in a smaller amount of time. However, it is expected that
after several more learning episodes both would reach the same results, with the batch version
merely learning faster.</p>
        <p>Beyond how efficiently the learning agent uses the collected transitions, the speed of
convergence of both algorithms is directly affected by the number of possible states induced by
the chosen state space representation. As the discretization grid becomes finer, the state
space becomes larger and tabular methods become slower, and even impractical for a continuous
state space representation, so function approximation methods are needed.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>As expected from the reuse of the experiences gathered so far, Experience Replay learns
faster in terms of defending the goal area, mainly due to its synchronous nature and a better
use of the experience collected during the interaction between the agent and its environment.
The obtained results show the benefits of reusing data efficiently in an inherently
multi-agent problem tackled as a single-agent learning task, given the centralized setup of this league.
Future work may include a more in-depth analysis covering other update rules and strategies in
Batch Reinforcement Learning, as well as field testing in other leagues, and considering
a continuous state space representation using function approximators such as artificial neural
networks or a fuzzy representation of states.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Ahumada</surname>
            <given-names>GA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nettle</surname>
            <given-names>CJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solis</surname>
            <given-names>MA</given-names>
          </string-name>
          .
          <article-title>Accelerating Q-Learning through Kalman Filter Estimations Applied in a RoboCup SSL Simulation</article-title>
          .
          <source>In: Robotics Symposium and Competition (LARS/LARC)</source>
          ,
          <year>2013</year>
          Latin American IEEE;
          <year>2013</year>
          . p.
          <fpage>112</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Baird</surname>
            <given-names>LC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klopf</surname>
            <given-names>AH</given-names>
          </string-name>
          .
          <article-title>Reinforcement learning with high-dimensional, continuous actions</article-title>
          . Wright Laboratory,
          <string-name>
            <surname>Wright-Patterson Air Force Base</surname>
          </string-name>
          ,
          <source>Tech Rep WL-TR-93-1147</source>
          .
          <year>1993</year>
          ; .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Celiberto</surname>
            <given-names>LA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro</surname>
            <given-names>CH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Costa</surname>
            <given-names>AH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bianchi</surname>
            <given-names>RA</given-names>
          </string-name>
          .
          <article-title>Heuristic reinforcement learning applied to robocup simulation agents</article-title>
          . In: Robot Soccer World Cup Springer;
          <year>2007</year>
          . p.
          <fpage>220</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Ernst</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geurts</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wehenkel</surname>
            <given-names>L</given-names>
          </string-name>
          .
          <article-title>Tree-based batch mode reinforcement learning</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          .
          <year>2005</year>
          ;
          <volume>6</volume>
          (Apr):
          <fpage>503</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Kalyanakrishnan</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stone</surname>
            <given-names>P</given-names>
          </string-name>
          .
          <article-title>Batch reinforcement learning in a complex domain</article-title>
          .
          <source>In: Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems ACM; 2007</source>
          . p.
          <fpage>94</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Kitano</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asada</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuniyoshi</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noda</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osawa</surname>
            <given-names>E</given-names>
          </string-name>
          .
          <article-title>RoboCup: The robot world cup initiative</article-title>
          .
          <source>In: Proceedings of the first international conference on Autonomous agents ACM; 1997</source>
          . p.
          <fpage>340</fpage>
          -
          <lpage>347</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Kober</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bagnell</surname>
            <given-names>JA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Reinforcement learning in robotics: A survey</article-title>
          .
          <source>The International Journal of Robotics Research</source>
          .
          <year>2013</year>
          ;
          <volume>32</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1238</fpage>
          -
          <lpage>1274</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Monajjemi</surname>
            <given-names>V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koochakzadeh</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghidary</surname>
            <given-names>SS</given-names>
          </string-name>
          .
          <article-title>grsim-robocup small size robot soccer simulator</article-title>
          . In: Robot Soccer World Cup Springer;
          <year>2011</year>
          . p.
          <fpage>450</fpage>
          -
          <lpage>460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Pietro</surname>
            <given-names>AD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>While</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barone</surname>
            <given-names>L</given-names>
          </string-name>
          .
          <article-title>Learning in RoboCup keepaway using evolutionary algorithms</article-title>
          .
          <source>In: Proceedings of the 4th Annual Conference on Genetic and Evolutionary</source>
          Computation Morgan Kaufmann Publishers Inc.;
          <year>2002</year>
          . p.
          <fpage>1065</fpage>
          -
          <lpage>1072</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Riedmiller</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hafner</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lange</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lauer</surname>
            <given-names>M</given-names>
          </string-name>
          .
          <article-title>Learning to dribble on a real robot by success and failure</article-title>
          .
          <source>In: Robotics and Automation</source>
          ,
          <year>2008</year>
          .
          <article-title>ICRA 2008</article-title>
          . IEEE International Conference on IEEE;
          <year>2008</year>
          . p.
          <fpage>2207</fpage>
          -
          <lpage>2208</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Rodenas</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alfaro</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reyes</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pandolfa</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aubel</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yanes</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrera</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            <given-names>SH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castillo</surname>
            <given-names>S</given-names>
          </string-name>
          .
          <article-title>AIS Team Description Paper</article-title>
          . .
          <year>2018</year>
          ; .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Sawa</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watanabe</surname>
            <given-names>T</given-names>
          </string-name>
          .
          <article-title>Learning of keepaway task for RoboCup soccer agent based on Fuzzy Q-Learning</article-title>
          .
          <source>In: Systems, Man, and Cybernetics (SMC)</source>
          ,
          <year>2011</year>
          IEEE International Conference on IEEE;
          <year>2011</year>
          . p.
          <fpage>250</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Stone</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutton</surname>
            <given-names>RS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuhlmann</surname>
            <given-names>G</given-names>
          </string-name>
          .
          <article-title>Reinforcement learning for robocup soccer keepaway</article-title>
          .
          <source>Adaptive Behavior</source>
          .
          <year>2005</year>
          ;
          <volume>13</volume>
          (
          <issue>3</issue>
          ):
          <fpage>165</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Sutton</surname>
            <given-names>RS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barto</surname>
            <given-names>AG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bach</surname>
            <given-names>F</given-names>
          </string-name>
          , et al.
          <article-title>Reinforcement learning: An introduction</article-title>
          . MIT press;
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Watkins</surname>
            <given-names>CJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dayan</surname>
            <given-names>P.</given-names>
          </string-name>
          <article-title>Q-learning</article-title>
          .
          <source>Machine learning</source>
          .
          <year>1992</year>
          ;
          <volume>8</volume>
          (
          <issue>3</issue>
          -4):
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Yasui</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobayashi</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murakami</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naruse</surname>
            <given-names>T.</given-names>
          </string-name>
          <article-title>Analyzing and learning an opponent's strategies in the RoboCup small size league</article-title>
          . In: Robot Soccer World Cup Springer;
          <year>2013</year>
          . p.
          <fpage>159</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>