<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Experience Sharing in a Traffic Scenario</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ana L. C. Bazzan</string-name>
          <email>bazzan@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franziska Klügl</string-name>
          <email>franziska.klugl@oru.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
        </aff>
      </contrib-group>
      <abstract>
        <p>Travel apps become more and more popular giving information about the current traffic state to drivers who then adapt their route choice. In commuting scenarios, where people repeatedly travel between a particular origin and destination, learning effects add to this information. In this paper, we analyse the effects on the overall network, if adaptive driver agents share their aggregated experience about route choice in a reinforcement learning (Q-learning) setup. Drivers share what they have learnt about the system, not just information about their current travel times. We can show in a standard scenario that experience sharing can improve convergence times for adaptive driver agents.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Which route to choose for travelling from origin to destination is a
central question that a traffic participant – in our case, a driver – faces.
A driver’s route choice, however, is not independent of the
choices of others: the load on a link determines the travel time
on it and, as a consequence, the travel time on a route. This is the
well-known problem of traffic assignment. Drivers adapt to their
experience and choose the routes they expect to be best,
mostly with respect to shortest travel time. In traffic analysis, this idea
is captured by the concept of user equilibrium, a situation
in which no user can reduce its travel time by changing its route.</p>
      <p>There are many approaches for driving the overall
system towards the system optimum by influencing individual drivers with incentives or
information. Information about recent travel times alone may be
misleading, as it may give only a one-shot impression. One
consequence is ineffective oscillations caused by overly fast and strong
reactions. A way to address this problem is to introduce a kind of
inertia into the system, assuming that people are hesitant to follow
new information.</p>
      <p>More and more drivers use travel apps that also show the traffic
state collected from real-time speed observations, such as Google
Maps or Waze. Drivers fuse the information that they receive from
those apps with their individual experience to actually make their
routing decision.</p>
      <p>In this paper, we want to analyse whether sharing of experience
among drivers rather than using recent travel time information
enables the overall system to develop into the user (or Nash)
equilibrium (UE).</p>
      <p>Our analysis is inspired by a hypothetical app with which drivers
share what they have learnt related to their repeated route choice
between some origin and destination in a traffic system – like friends
chatting about which route they prefer or avoid.</p>
      <p>As discussed later in Section 5, there are some approaches that are
based on sharing or transferring knowledge to improve or accelerate
the learning process. However, the majority of them deal with
cooperative environments. In non-cooperative tasks, such as the route
choice, we observed that the transfer of a good solution from one
agent to others may produce sub-optimal outcomes. Therefore, there
is a need for approaches that address transfer of knowledge in other
contexts. With the present paper, we provide the first steps in this
direction.</p>
      <p>The remainder of this paper is organised as follows. Section 2
introduces background on traffic assignment, route choice and the
use of reinforcement learning. Section 3 then
describes the proposed approach, while Section 4 discusses
experiments performed with a standard scenario and their results.
Section 6 presents concluding remarks and points to
future research.</p>
      <p>
        The traffic assignment problem (TAP) aims at assigning a route to
each driver that wants to travel from its origin to its destination. In
traffic analysis and simulation, traffic assignment is the prominent
step of connecting the demand (how many drivers want to travel
between two nodes in a traffic network?) and the supply (roads in a traffic
network with particular capacities, speeds, etc.). The TAP can be solved in
different ways. Agent-based approaches aim at finding the UE [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ],
in which every driver perceives that all possible routes between its
origin and destination (its OD pair) have similar, minimal costs, so
that the agent has no incentive to change routes. This means
that the UE corresponds to each driver selfishly computing a route
individually.
      </p>
      <p>A traditional algorithm to solve the TAP (i.e. to find the UE) is the
method of successive averages (MSA). Yet, this method is performed
in a centralised way<sup>3</sup>.</p>
      <p>A learning-based approach, on the other hand, can be carried out in a
decentralised way, with each agent learning individually by means
of RL (see next section). We remark, however, that this is not a trivial task, since the agents’ actions
are highly coupled with the actions of other agents.
Moreover, it is a non-cooperative learning task.</p>
    </sec>
    <sec id="sec-2">
      <title>Reinforcement Learning</title>
      <p>
        In reinforcement learning (RL), an agent’s goal is to learn a mapping
from states to actions by means of a value function.
A popular algorithm to compute such value functions is Q-learning
(QL) [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. Value functions take the instantaneous reward received (a
numerical signal) and compute the expected, discounted value of the
      </p>
      <sec id="sec-2-1">
        <p><sup>3</sup> For details see, e.g., Chapter 10 in [<xref ref-type="bibr" rid="ref19">19</xref>].</p>
        <p>corresponding action. Once such mapping is learned, an agent can
decide which action to select in order to maximize its rewards.</p>
        <p>We can model RL as a Markov decision process (MDP) given
by a tuple (S, A, T, R), where S is a set of states; A is a set of
actions; T is the transition function that models the probability of the
system moving from a state s ∈ S to a state s′ ∈ S upon performing
action a ∈ A; and R is the reward function that yields a real
number associated with performing an action a ∈ A in state
s ∈ S.</p>
        <p>Q-learning (QL) computes a Q-value Q(s, a) for each state-action
pair. This value represents a reward estimate for executing action
a in state s. Q(s, a) is updated through Eq. 1, where
α ∈ [0, 1] is the learning rate and γ ∈ [0, 1] is the discount factor.
          <disp-formula>
            <label>(1)</label>
            <tex-math>Q(s, a) = Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)</tex-math>
          </disp-formula>
        </p>
        <p>Each agent is a learner that maintains a Q-table that is updated
at each subsequent episode. In order to address the exploration–exploitation
dilemma, the ε-greedy exploration strategy can be used
to choose actions: the action with the highest Q-value is selected with
a probability of 1 − ε and a random action is selected with probability
ε.</p>
        <p>
          Finally, even if agents may be learning independently (as, e.g., in
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]), this is a non-trivial instance of multiagent RL (MARL).
        </p>
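        <p>As a minimal, illustrative sketch (not the authors’ implementation), the update of Eq. 1 and the ε-greedy action selection described in this section could be written in Python as follows. The function names, the nested-dictionary Q-table, the toy states and the discount factor of 0.9 are assumptions made for this example only.</p>
        <preformat>
```python
import random


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply the update of Eq. 1 to a Q-table stored as {state: {action: value}}."""
    best_next = max(q[next_state].values())
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]


def epsilon_greedy(q, state, epsilon=0.05, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    actions = list(q[state])
    if epsilon > rng.random():
        return rng.choice(actions)
    return max(actions, key=lambda a: q[state][a])


# Toy Q-table with two hypothetical states and two hypothetical actions.
q_table = {"s0": {"left": 0.0, "right": 1.0}, "s1": {"left": 2.0, "right": 0.0}}
new_value = q_update(q_table, "s0", "left", reward=1.0, next_state="s1")
greedy_action = epsilon_greedy(q_table, "s0", epsilon=0.0)
```
        </preformat>
        <p>With these toy values, the updated entry for (s0, left) is 0.5 · (1.0 + 0.9 · 2.0) = 1.4, which makes “left” the greedy action in s0.</p>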
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Sharing Experiences</title>
    </sec>
    <sec id="sec-4">
      <title>Q-Learning in a Route Choice Scenario</title>
      <p>We are given a road network G in which nodes represent points of
interest, which can be junctions, neighborhoods, or areas; links represent
road segments with a free-flow travel time that depends on distance
and capacity, plus a cost function that relates the number of
vehicles on a link to its cost (travel time). Some of the nodes serve as
origins, some as destinations. Between those origin and destination
nodes, we consider sequences of links as possible routes. Different
routes connecting origin and destination are not independent of one
another, as they may contain shared links. Such dependencies exist
between routes of the same origin-destination combination as well as
with routes of other combinations.</p>
      <p>
        Drivers use Q-learning to individually determine the best
route for their individual origin-destination pair.
Here, the state set S consists only of the origin node of the individual driver. Each
driver learns independently. Since each agent’s state is determined by its
origin, this is a stateless formulation of QL, as in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The set of actions A contains the k shortest routes between origin
and destination. The number k therefore needs to be large enough
to give sufficient flexibility in route choice. The transition function
maps the origin to the destination for all actions. The reward is a function
of the experienced travel time on the route (we use the negative of
the travel time, for maximization purposes).</p>
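      <p>A minimal sketch of such a stateless learner is given below. It assumes that, with a single state, the successor term of Eq. 1 is dropped, so that the update becomes Q(a) = (1 − α) Q(a) + α r; this simplification, the function name and all numbers are assumptions made for illustration and are not taken from the paper.</p>
      <preformat>
```python
def stateless_update(q_values, route, travel_time, alpha=0.5):
    """Update the Q-value of one route from the travel time just experienced."""
    reward = -travel_time  # shorter trips yield larger (less negative) rewards
    q_values[route] = (1 - alpha) * q_values[route] + alpha * reward
    return q_values[route]


# Hypothetical driver with k = 3 candidate routes, identified by index.
q_values = [-90.0, -90.0, -90.0]
updated = stateless_update(q_values, route=1, travel_time=70.0)
best_route = max(range(len(q_values)), key=lambda r: q_values[r])
```
      </preformat>
      <p>In this toy run, a trip of 70 minutes on route 1 moves its Q-value from −90 to −80, so route 1 becomes the greedy choice.</p>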
      <p>It is important to understand the multi-agent
characteristics of this learning task. Because links have travel times that
depend on the number of agents using them, learning is challenging.
The desirable route, i.e. the shortest path under free flow, may end up producing
a high travel time if too many agents want to use it. Agents need to
learn how to distribute themselves so that an optimal number
of agents uses each edge, minimizing their individual travel time on
the selected routes. Coordination is desirable not just between agents
with the same origin-destination pair, but needs to happen between
all agents, even if their routes share only a single link.</p>
    </sec>
    <sec id="sec-5">
      <title>Information Sharing via App</title>
      <p>With the dissemination of various traffic apps, it is no longer
realistic to assume that drivers only learn from their own experience when
repeatedly travelling from origin to destination just observing their
own travel time. Carrying the idea of traffic apps beyond current
traffic state, more information produced by others can be used for
decision making. While there have been a number of works in which
the effect of information about current travel times on route or on
parts of the route was tested (see Section 5), the question emerged
whether explicitly sharing accumulated information about
particular routes could improve the learning process. We envision a travel
app in which participating drivers can share experience about their
favourite or most disappointing routes. The setup is visualised in
Figure 1.</p>
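      <p>[Figure 1: visualisation of the sharing setup. Example of published information shown in the figure: “Best: R2 with 50.32”.]</p>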
      <p>The overall decision making consists of three steps:</p>
      <p>1. Agents select an action from their set of possible actions. Each
agent experiences the time that it took to travel along
the route that corresponds to the selected action. The agent updates
its Q-table according to the QL rules. At the end of the day, agents
share an evaluation of a route with the app. The shared
information does not need to be about the route they took most recently;
it may be about their best route, their worst route or a randomly chosen route. The
app can be seen as a platform for chatting “neighbours” who talk
about their experiences when travelling to a particular
destination. Drivers may also decide to share no information at all.
More precisely, sharing their accumulated experience means that
they share the Q-value assigned to a particular action.
The agents do not reason about the potential effect of sharing
information on the reward they can receive. They do not assume
that the publication of the information influences others’ decision
making to an extent that the evaluation of the published route
changes dramatically.</p>
      <p>2. The information server behind the app collects all information
from the information-sharing drivers and determines the shared
information that the server then publishes. This information
contains, per origin-destination pair, an action and the Q-value that
belongs to the action, aggregated from all handed-in information.</p>
      <p>3. Driver agents decide whether to access the published information.
Accessing has the consequence that they incorporate the
published information into their Q-table: they replace the Q-value of
the concerned action with the published Q-value for the given
action. There is no reasoning about the credibility of the published
value in relation to the existing Q-value. The decision to access
the shared information is based on a budget acquired by sharing:
the more one shares, the more often one may access the shared value.
In the current version, the agents do not reason about whether the
investment in acquiring published information may pay off; they
rather access the information with a given frequency/probability,
provided they have a budget.</p>
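      <p>The driver side of steps 1 and 3 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors’ code: the class and method names are hypothetical, and the accounting of one budget unit earned per share and one unit spent per access is an assumption chosen only to make the example concrete.</p>
      <preformat>
```python
import random


class Driver:
    """Hypothetical driver agent holding Q-values per route and a sharing budget."""

    def __init__(self, q_values, access_probability=0.2, seed=None):
        self.q_values = dict(q_values)
        self.budget = 0
        self.access_probability = access_probability
        self.rng = random.Random(seed)

    def share_best(self):
        """Step 1: report the best (route, Q-value) pair and earn one budget unit."""
        route = max(self.q_values, key=self.q_values.get)
        self.budget += 1
        return route, self.q_values[route]

    def maybe_access(self, published):
        """Step 3: with some probability, spend budget and overwrite the Q-value."""
        if self.budget == 0 or self.rng.random() >= self.access_probability:
            return False
        route, value = published
        self.q_values[route] = value  # replaced without any credibility check
        self.budget -= 1
        return True


# Toy run: the driver always accesses when it has budget (probability 1.0).
driver = Driver({"R1": -60.0, "R2": -55.0}, access_probability=1.0, seed=0)
shared = driver.share_best()
accessed = driver.maybe_access(("R1", -52.0))
```
      </preformat>
      <p>In this toy run the driver first reports R2 as its best route, then overwrites its own estimate for R1 with the published value, after which R1 is its greedy choice.</p>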
      <p>This can be seen as a rudimentary form of transfer learning, with
transfer happening during the learning process rather than from one task to
another. Due to the way we modelled actions, transfer between
different problems (here, different origin-destination pairs) makes no
sense.</p>
      <p>This is a multiagent reinforcement learning setup in which agents
have to learn anti-coordination. Sharing happens in groups: only
agents within the same group, i.e. with the same origin-destination
pair, share partial information from their Q-tables.</p>
    </sec>
    <sec id="sec-6">
      <title>Sharing Issues in Detail</title>
      <p>More details need to be considered to provide full information about
the analysed scenarios:</p>
      <p>The overall dynamics are such that agents select their route before
the actual travelling happens. Then, all agents travel through the
network. By using cost functions to determine the travel time on a link from
its load, we abstract away from detailed temporal dynamics in which
departure time, etc. matter. As we want to focus on the effect of
sharing experiences and different aspects related to that, we did not
use a microscopic traffic simulation to determine how an individual
driver moves through the network. As a consequence, all agents
taking the same route encounter the same travel times. The reward from
travelling is sent to drivers after all have finished. The agents update
their Q-tables and send information to the app. In the next episode, the
drivers who want to access the published information look into the
service and again update their Q-tables based on the published value.
So, there are a number of decisions involved:</p>
      <sec id="sec-6-1">
        <title>When does a driver access the shared information?</title>
        <p>Assuming that the driver needs to pay for the service of getting such
information, the decision about whether and when it is best to
incorporate information from other agents into one’s Q-table may be a
strategic one.</p>
        <p>1. The simplest strategy could be to access information in
every round.</p>
        <p>2. The agent could collect some budget, e.g. from sharing
information, that it could invest in accessing information. So whenever
the agent has collected enough budget, it accesses. The ratio
between the gain per sharing and the cost of accessing is the
relevant parameter here.</p>
        <p>3. The agent could access the information frequently in the beginning, with
decreasing frequency the longer the learning proceeds.</p>
        <p>4. One can imagine more advanced strategies based on dynamics that
the agent observes. For example, if the action with the best
Q-value was “disappointing” a number of times, the agent may
trigger a look-up on the app.</p>
        <p>5. An alternative could be to have a look if the Q-values are not
sufficiently distinct to yield a clear favourite route.</p>
        <p>We decided to focus on the simpler alternatives and leave the more
sophisticated approaches to future work, as they involve many
individual parameters to fully specify such strategies. Hence, in this
work we follow alternative 2, varying the budget an agent has, starting
with an unlimited budget and then decreasing it so that agents only
access the information in some episodes. Although we assume that all
agents have the same budget, the access is not synchronous, i.e., not
all agents spend their budgets in the same episode (except when the budget
is unlimited and thus every agent accesses the information at every episode).</p>
      </sec>
      <sec id="sec-6-2">
        <title>Whether, when and with whom to share information?</title>
        <p>The agents learn about good actions/routes for their
origin-destination combination individually. In principle, they do not need
to share; yet without a sufficiently large subset of the driver
population sharing, the information that the app publishes may not possess
any value. So, it is in the interest of the app providers that as many
agents as possible hand in their experience.</p>
        <p>There might be some required consent for basic service usage to
send information about the agents’ experience to the app. This has the
consequence that basically everybody would share experiences much
like GPS traces are collected without full awareness of the (careless)
traveller.</p>
        <p>As an alternative, one can imagine that there is a small compensation
for actively sharing experiences. The budget system indicated above
would need a corresponding side for earning what will be spent later.
Each time it sends information, the agent could increase its
budget for retrieving information.</p>
        <p>The compensation for sharing could be also independent from the
information usage. It could be financial or give social credits
turning the individual agent to somebody more compliant with societal
values.</p>
        <p>For this first analysis, we just analysed situations in which every
agent is basically forced to contribute with its experience, that means
agents share</p>
        <p>A variant of this question concerns with whom an agent may
share. The idea of an app that publishes a value aggregated from all
handed-in experiences is one extreme: the agent shares with
everybody who travels between the same origin and destination. Whether
the information that the agent contributes is actually valuable is
decided by the aggregation mechanism used by the app. As an
alternative, one can imagine restricting the information dissemination to only
a group of agents, denoting a group of friends or neighbours. This
could bring more heterogeneity into the information transfer. As an
extreme case, agents could pair up and share information with just one
other agent, but this thwarts the app idea. Additionally, we did not
expect a large impact and therefore only tested without restrictions. In
future work, we need to test this assumption.</p>
      </sec>
      <sec id="sec-6-3">
        <title>What information shall be shared and how the information shall be aggregated?</title>
        <p>This is the central question for the setup of the app. What information
shall be shared and how does the app process all the information
collected?</p>
        <p>The first question relates to what information the agents send to
the app. An obvious choice is information about
the route with the highest Q-value, basically telling others that this
route is the best for the agent’s origin-destination pair according to
the agent’s experience. Yet other information may also be interesting
to know: there are situations in which it is better to avoid certain choices.
So, we foresee the following alternatives to be relevant:</p>
        <sec id="sec-6-3-1">
          <p>1. Share the last chosen action including its Q-value.</p>
          <p>2. Share a randomly selected action with its Q-value.</p>
          <p>3. Share the action with the best Q-value.</p>
          <p>4. Share the action with the worst Q-value.</p>
          <p>As the first alternative is in most cases the same as the third one
and otherwise a random selection, we deemed this alternative
not interesting. We tested all others. The agents hereby share the
corresponding action and its Q-value with the app. If they access the aggregated information, they integrate
it into their own individual Q-table, replacing the Q-value of the
concerned route/action. So, they do not automatically select the route
that they were told about; they select it only if there is no better one, and
continue with the usual learning and decision-making process.</p>
          <p>As written above, the app collects and aggregates information
from all sharing drivers. For this step of the overall process, there are also
different alternatives:</p>
          <p>1. The app publishes a random value.</p>
          <p>2. The app publishes the overall best Q-value and the corresponding
action.</p>
          <p>3. The app publishes the overall lowest Q-value with the corresponding
action.</p>
          <p>4. The app averages the Q-values for each possible action and then
publishes the best or worst of these averages.</p>
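          <p>The aggregation alternatives listed above can be sketched for a single origin-destination pair as follows. The function name, the mode labels and the report format (a list of route and Q-value pairs) are assumptions introduced for this illustration; the paper does not prescribe a data format.</p>
          <preformat>
```python
import random
from collections import defaultdict


def aggregate(reports, mode="best", rng=random):
    """Reduce the (route, q_value) reports of one OD pair to one published pair."""
    if mode == "random":
        return rng.choice(reports)
    if mode == "best":
        return max(reports, key=lambda report: report[1])
    if mode == "worst":
        return min(reports, key=lambda report: report[1])
    # Remaining modes average per route first, then pick among the averages.
    per_route = defaultdict(list)
    for route, q_value in reports:
        per_route[route].append(q_value)
    averages = [(route, sum(vals) / len(vals)) for route, vals in per_route.items()]
    if mode == "best_average":
        return max(averages, key=lambda pair: pair[1])
    if mode == "worst_average":
        return min(averages, key=lambda pair: pair[1])
    raise ValueError("unknown aggregation mode: " + mode)


# Hypothetical reports handed in by four drivers of the same OD pair.
reports = [("R1", -60.0), ("R2", -50.0), ("R1", -70.0), ("R3", -80.0)]
published_best = aggregate(reports, "best")
published_worst = aggregate(reports, "worst")
published_avg = aggregate(reports, "best_average")
```
          </preformat>
          <p>For these toy reports, the “best” and “best_average” modes both publish R2 with −50.0, while the “worst” mode publishes R3 with −80.0.</p>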
          <p>Together with the alternatives for which values to share, this
produces a number of interesting combinations: it might be
interesting to publish randomly selected information from all best or
all worst Q-values, yet we tested the “pure” combinations: the app
provides a fully randomly selected value, the best of all best
Q-values, or the worst of all worst Q-values.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Experiments and Results</title>
    </sec>
    <sec id="sec-8">
      <title>Scenario</title>
      <p>We performed a number of experiments to understand how the
sharing actually impacts the adaptation and eventually the performance
of the drivers using the app.</p>
      <p>
        The OW network (proposed in Chapter 10 in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]) seen in
Figure 2 represents two residential areas (nodes A and B) and two major
shopping areas (nodes L and M). So, there are four OD pairs,
connecting residential areas with shopping areas. The numbers on the
edges are the travel times between two nodes under free flow (in
both directions). We assume no additional delays when passing nodes,
e.g. due to traffic lights. We computed the k = 8 shortest paths per OD
pair, following the algorithm by [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].
      </p>
      <p>The volume-delay (or cost) function is t<sub>e</sub> = t<sub>e</sub><sup>f</sup> + 0.02 q<sub>e</sub>, where
t<sub>e</sub> is the travel time on edge e, t<sub>e</sub><sup>f</sup> is the edge’s travel time per unit of
time under free-flow conditions, and q<sub>e</sub> is the flow using edge e. This
means that the travel time on each edge increases by 0.02 of a minute
for each vehicle/hour of flow.</p>
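      <p>As a small worked example of this cost function, the following sketch computes edge and route travel times. The function names, the edge labels and the free-flow times and flows are made up for illustration; only the linear form with slope 0.02 comes from the text above.</p>
      <preformat>
```python
def edge_travel_time(free_flow_time, flow, slope=0.02):
    """Linear volume-delay function: free-flow time plus 0.02 per unit of flow."""
    return free_flow_time + slope * flow


def route_travel_time(route_edges, free_flow_times, flows):
    """Sum the edge travel times along a route given as a list of edge names."""
    return sum(edge_travel_time(free_flow_times[e], flows[e]) for e in route_edges)


# Hypothetical two-edge route; the edge names, times and flows are made up.
free_flow_times = {"e1": 5.0, "e2": 7.0}
flows = {"e1": 100, "e2": 250}
total = route_travel_time(["e1", "e2"], free_flow_times, flows)
```
      </preformat>
      <p>Here edge e1 takes 5 + 0.02 · 100 = 7 minutes and edge e2 takes 7 + 0.02 · 250 = 12 minutes, so the route takes 19 minutes in total.</p>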
      <p>To give a sense of what would be optimal, the
free-flow travel times (FFTT) for each OD pair are shown in Table 1. We
note, however, that these times cannot be achieved because the
free-flow condition is not realistic: when each agent tries to use its
route with the lowest travel time, jams occur and the times increase.
In addition, the table shows the number of agents that travel between
the four origin-destination combinations and the travel
time in UE. The latter is an average, as drivers take different routes between
their OD pairs. Also, the fact that 1700 agents learn simultaneously
makes the overall learning complex.</p>
      <p>
        For QL, Q-tables are initialised with a random value around
90. All experiments were conducted with α = 0.5 (learning rate) and
ε = 0.05 (for ε-greedy action selection), following previous works
(e.g., [
        <xref ref-type="bibr" rid="ref1 ref2">2, 1</xref>
        ]) that have extensively tried other values. As mentioned
above, k = 8 is the number of shortest paths, i.e. the number of
possible actions.
      </p>
      <p>We measure overall performance of the different approaches with
the average travel time over all 1700 agents.</p>
      <p>Each experiment was repeated 30 times. Standard deviations are
not shown to keep the plots cleaner. We remark, however, that they are on
the order of 1% between the results of different runs in terms of
average travel times. Between episodes, there are in part dramatic
oscillations.</p>
      <p>As for the cases in which there is sharing of experiences, the
following observations can be made. First, no matter how frequently
the agents access the shared experience, we see an acceleration in
the learning process in the initial episodes. This is clear in Figure 3,
and is due to the fact that agents do not have to try out some actions
given that they share better options. Unfortunately, depending on the
frequency of access to this shared experience, this could lead to
suboptimal learning causing the collective of agents to not converge to
the UE.</p>
      <p>Let us now discuss individual cases. When agents have unlimited
budget to access the shared experience (and fully use this budget),
then the performance is bad (see green line in Figure 3). The reason
is that too many agents are informed about the best performance in
the previous episode and hence try to do the same action. Thus, the
supposedly best action is followed by virtually all agents and they
then end up selecting the same route (for each OD pair). This can
lead to bad performance, not to mention a lot of oscillations (that are
seen in the plot). This is a well known phenomenon and could be
confirmed in our experiments.</p>
      <p>When the budget is limited so that it allows the access to the shared
experiences in 6 or 8 out of 10 episodes, then there is an increase in
performance, if compared to the case in which agents have
unlimited access. We recall that such accessing is not synchronous, i.e.,
some agents may access while others are not doing so. Still, the
performance is sub-optimal (brown and magenta curves in the figure),
with the collective of agents missing the UE in the end. It must be
said, though, that the performance in the initial episodes is not bad,
i.e., there is an acceleration in the learning due to the fact that some
actions need not to be tried, as aforementioned.</p>
      <p>When the agents have a budget that allows them to access the
shared experiences only from time to time (e.g., 2 or 4 in 10 episodes, see
blue and orange curves in Figure 3), then the overall performance is
much improved, especially if the driver-vehicle units cannot afford
to access the shared experiences in more than 2 out of 10 episodes.
Not only is there an acceleration in the learning process already in
the initial episodes, but there is also convergence towards the UE.
We assume that the best setting of the budget limits depends
on the scenario details, especially on the particular OD pair. It will
be interesting to analyse this in more depth. A budget strategy in
which agents share in the beginning, but then gradually give up
sharing over time, may lead to the results that we want to achieve, namely
fast convergence to the UE.</p>
      <sec id="sec-8-1">
        <title>Sharing Worst or Random Experience</title>
        <p>We also tested what happens if agents share their worst
experience, which the app aggregates to the worst action/route for
each particular OD pair. The results are almost identical to vanilla
QL. So, sharing apparently has no effect when the worst experience is shared.
The reason is that when exploiting (which, according to
our settings, happens in 95% of the episodes) the agents select the best route.
Which route has a bad Q-value does not matter here. So, the only
situation in which such an update of the Q-table would change the
decision of an agent is when the route with the previously best Q-value
is the one that the app informs about. One could imagine that this
happens if the previously best route is heavily overloaded, but this was
not the case here. Maybe the number of considered possible routes is
too high for such a setup. Nevertheless, reducing the set of possible
routes makes no sense in such a learning setting.</p>
        <p>Also, randomly selecting a route from the Q-table and sharing it
with the app, which then publishes a randomly selected experience
from all handed-in values, generates an outcome similar to
sharing the worst experience. There is no difference to vanilla QL, i.e.
QL without sharing experiences. The explanation is similar.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Related Work</title>
      <p>Besides the classical methods to address the TAP, there have been
some works – based on other AI paradigms – with the goal of finding
the UE. The main motivation is to solve the TAP from the perspective
of the individual agent (driver, driver-vehicle unit), thus relaxing the
assumption that a central entity is in charge of assigning routes for
those agents. In such decentralized approaches, the agent itself has
to collect experiences in order to reach the UE condition. Thus, these
approaches are suitable for commuting scenarios, where each agent
can collect experiences regarding the same kind of task (in this case,
driving from a given origin to a given destination).</p>
      <p>One popular approach to tackle such decentralized
decision-making is via RL (more specifically MARL). However, other approaches are also
mentioned in the literature. We start with these and later discuss those
based on RL.</p>
      <p>
        In the work of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a neural network is used to predict drivers’
route choices, as well as compliance to such predictions, under the
influence of messages containing traffic information. However, the
authors focus on the impact of the message on the agents rather than
the impact on system-level traffic distribution and travel time.
      </p>
      <p>
        The work of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] uses the Inverted Ant Colony Optimization
(IACO). Ants (vehicles) deposit their pheromones in the routes they
are using, and the pheromone is used to repel them. Consequently,
they avoid congested routes. Also [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] applied ant colony
optimisation to the TAP. However, in both cases, the pheromone needs to be
centrally stored; thus, this approach is not fully suitable for
decentralised modelling.
      </p>
      <p>
        One game theoretic approach to the route choice problem
appears in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This approach uses only past experiences for the route
choice. The choice itself is made at each intersection of the network.
However, it assumes that historical information is available to all
drivers.
      </p>
      <p>
        The work of [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] uses adaptive tolls to optimise drivers’ routes as
tolls change. In contrast to our purpose, they are concerned with the
alignment of choices towards the system optimum, which can only
be achieved by imposing costs on drivers (in their case, tolls). In the
same line, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] deal with the problem from a centralised perspective to
find an assignment that aligns user and system utilities by imposing
tolls on some links.
      </p>
      <p>
        At the frontier of aligning the system optimum with the UE,
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] introduced an approach that seeks a balance that benefits not only
the central authority but also the individual agents.
      </p>
      <p>RL-based approaches to compute the UE are becoming
increasingly popular. Here, each agent seeks to learn to select routes (these
are the actions) based on the rewards obtained in each daily
commute, where the reward is normally based on the experienced travel
time.</p>
      <p>
        Two main lines of approaches can be distinguished: One, less
popular due to its complexity, follows a traditional RL recipe, where
besides a set of actions, there is also a set of states, these being the
vertices in which the agent finds itself. Works in this category
include [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, the majority of the papers in the literature follow
a so-called stateless approach in which there is a single state (the
OD pair the agent belongs to), while the agent merely has to select
actions, which are normally associated with a set of pre-computed
routes that can be recalculated en-route, or not. This literature
includes approaches based both on Learning Automata ([
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]) and
QL.
      </p>
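      <p>As a minimal sketch of the stateless formulation described above (not the implementation of any of the cited works), the following Python snippet keeps one Q-value per pre-computed route and updates it with the negative experienced travel time as reward. The class name, the route labels, the learning rate alpha and the exploration rate epsilon are illustrative assumptions.</p>
      <preformat>
```python
import random


class StatelessRouteLearner:
    """Single-state Q-learner: one Q-value per pre-computed route (hypothetical helper)."""

    def __init__(self, routes, alpha=0.5, epsilon=0.1, seed=None):
        self.q = {route: 0.0 for route in routes}  # Q-table with a single implicit state
        self.alpha = alpha      # learning rate (assumed value)
        self.epsilon = epsilon  # exploration probability (assumed value)
        self.rng = random.Random(seed)

    def choose_route(self):
        """Epsilon-greedy selection over the available routes."""
        if self.epsilon > self.rng.random():
            return self.rng.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, route, travel_time):
        """Stateless update: Q(a) = (1 - alpha) * Q(a) + alpha * reward."""
        reward = -travel_time  # shorter trips give a higher reward
        self.q[route] = (1 - self.alpha) * self.q[route] + self.alpha * reward
        return self.q[route]


learner = StatelessRouteLearner(["r1", "r2"], alpha=0.5, epsilon=0.0, seed=0)
learner.update("r1", 10.0)     # one commute on route r1 taking 10 time units
learner.update("r2", 4.0)      # one commute on route r2 taking 4 time units
best = learner.choose_route()  # purely greedy here, since epsilon is 0
```
      </preformat>
      <p>With a single state there is no successor-state term in the update, so each Q-value is simply an exponentially weighted average of the rewards experienced on that route.</p>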
      <p>
        QL is increasingly being used for the task of route choice. Of
particular interest are works that compute the regret associated with the
greedy nature of selecting routes in a selfish way ([
        <xref ref-type="bibr" rid="ref20 ref21">21, 20</xref>
        ]), as well
as those that combine selfish route choice with some sort of biasing
from a centralized entity ([
        <xref ref-type="bibr" rid="ref1 ref22 ref6">1, 6, 22</xref>
        ]).
      </p>
      <p>
        There are works that, as we propose in the current paper, also deal
with some form of communication between agents [
        <xref ref-type="bibr" rid="ref13 ref14 ref16 ref3">14, 3, 13, 16</xref>
        ].
In these works, information is not always truthful, but can be manipulated to
drive agents towards an overall intended outcome.
      </p>
      <p>
        The idea of sharing either learned policies or Q-values is not new.
In fact, as early as 1993, Tan [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] suggested that communication of
some kind of knowledge could help, especially in cooperative
environments. In particular, sharing Q-values may reduce the time needed
to explore the state-action space.
      </p>
      <p>
        Some researchers have dealt with aspects such as what and when
to share. In an abstract view of sharing, [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] deal with agents that
keep a list of states in which they need to coordinate. This idea also
appears in the context of traffic management in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], where traffic
signal agents keep joint Q-tables based on coupled states and actions.
      </p>
      <p>
        Some works, for instance [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], take more fine-grained
views of knowledge sharing, as in the transfer learning
community, where the question of what and when to
transfer is posed more precisely. Transfer of reward values or policies is also
explored in the literature. For instance, [
        <xref ref-type="bibr" rid="ref11 ref27 ref31">27, 11, 31</xref>
        ] show that
learning can be accelerated if a teacher shares experiences with a
student. We note that virtually all these works deal with cooperative
environments, where it makes sense to transfer knowledge. In
non-cooperative tasks, such as route choice, we have seen that transferring
a good solution from one agent to others may produce highly
suboptimal outcomes. Coordination here is actually anti-coordination, that is,
appropriately distributing agents among the different alternatives.
      </p>
    </sec>
    <sec id="sec-10">
      <title>Conclusion and Ideas for the Future</title>
      <p>The idea of this paper is to discuss the effects of a possible next type
of travel app, in which users intentionally share their experiences,
as opposed to conventional travel apps, which are based on
collecting drivers’ positions in order to display the current average
speed on a particular segment. The app that we are considering here
can be thought of as a device that replaces direct interaction between
colleagues or neighbours chatting about habits and experiences
regarding route choice.</p>
      <p>Assuming that humans continuously adapt to what they
experience when performing (commuting) tasks, we analysed the potential
effect of such an app. Agents can thus learn not only from their own
experience, but also use others’ experience on the same task for
decision making. As we have shown, integrating others’ experiences
from time to time speeds up the learning process.</p>
      <p>In previous works, drivers were informed about travel times that
were collected by a central authority (e.g., Waze). This is
effective only if some inertia is introduced. We can observe a similar effect
here: agents that do not integrate others’ experience excessively
learn the best choice for their route, whereas agents that update their
Q-table with additional information from the app in every round
perform worst.</p>
      <p>These findings are quite preliminary. More investigation is
necessary before general conclusions can be drawn. We will test other
interesting setups that combine different sharing and aggregation
strategies. For example, the agents may share their best experience, but
instead of aggregating the information, the app publishes a randomly
selected value from those sent. Another interesting idea is the
organisation of access into groups: instead of aggregating all values
received and publishing to all who want access, agents could be
organised into smaller groups of friends who share/aggregate/publish
only within the respective group.</p>
      <p>As we have indicated in the previous sections, we also plan to test
more dynamic and adaptive strategies. Agents may have a high initial
budget and decide to spend it according to different strategies. A first such
strategy could be that the agent uses its budget at the beginning of the
learning process, when it is exploring more. A second alternative is
that agents could save their budgets and use them when they notice
that the Q-value of their best route has been decreasing for a given number
of rounds.</p>
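      <p>The second strategy could be sketched as follows. This is only an illustration of the trigger logic under stated assumptions, not a strategy evaluated in this paper: the class name, the unit of the budget (one app query per unit) and the parameter patience (the required number of consecutive decreasing rounds) are hypothetical.</p>
      <preformat>
```python
class DecreaseTrigger:
    """Spend one unit of budget after `patience` consecutive drops of the best Q-value."""

    def __init__(self, budget, patience):
        self.budget = budget        # remaining number of app queries (assumed unit)
        self.patience = patience    # consecutive decreasing rounds required
        self.previous_best = None   # best Q-value observed in the previous round
        self.decreasing_rounds = 0  # current streak of decreases

    def should_query(self, q_values):
        """Return True if the agent should consult the app in this round."""
        best = max(q_values.values())
        if self.previous_best is not None and self.previous_best > best:
            self.decreasing_rounds += 1
        else:
            self.decreasing_rounds = 0
        self.previous_best = best
        if self.decreasing_rounds >= self.patience and self.budget > 0:
            self.budget -= 1
            self.decreasing_rounds = 0
            return True
        return False


trigger = DecreaseTrigger(budget=1, patience=2)
history = [
    {"r1": -4.0, "r2": -6.0},
    {"r1": -5.0, "r2": -6.0},
    {"r1": -7.0, "r2": -6.5},
    {"r1": -8.0, "r2": -9.0},
    {"r1": -9.0, "r2": -9.5},
    {"r1": -10.0, "r2": -11.0},
]
decisions = [trigger.should_query(q) for q in history]
```
      </preformat>
      <p>In this toy trace the agent consults the app once, in the third round, after two consecutive decreases of its best Q-value; later decreases trigger nothing because the budget is exhausted.</p>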
      <p>More innovatively, the app could become an agent that actively adjusts
the parameters of the budget strategies to improve the overall
learning process. The app collects a lot of information from the driver
agents, which can be used to drive the system towards the desired state,
e.g. by telling the agents how to acquire more or fewer credits.</p>
    </sec>
    <sec id="sec-11">
      <title>Acknowledgements</title>
      <p>Ana Bazzan was partially supported by CNPq under grant no.
307215/2017-2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Ana</surname>
            <given-names>L. C.</given-names>
          </string-name>
          <string-name>
            <surname>Bazzan</surname>
          </string-name>
          , '
          <article-title>Aligning individual and collective welfare in complex socio-technical systems by combining metaheuristics and reinforcement learning'</article-title>
          ,
          <source>Eng. Appl. of AI</source>
          ,
          <volume>79</volume>
          ,
          <fpage>23</fpage>
          -
          <lpage>33</lpage>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Ana</surname>
            <given-names>L. C.</given-names>
          </string-name>
          <string-name>
            <surname>Bazzan</surname>
          </string-name>
          and Camelia Chira, '
          <article-title>Hybrid evolutionary and reinforcement learning approach to accelerate traffic assignment (extended abstract)'</article-title>
          ,
          <source>in Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems (AAMAS</source>
          <year>2015</year>
          ), eds.,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bordini</surname>
          </string-name>
          , E. Elkind, G. Weiss, and P. Yolum, pp.
          <fpage>1723</fpage>
          -
          <lpage>1724</lpage>
          . IFAAMAS, (May
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Ana</surname>
            <given-names>L. C.</given-names>
          </string-name>
          <string-name>
            <surname>Bazzan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Fehler</surname>
            , and
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Klügl</surname>
          </string-name>
          , '
          <article-title>Learning to coordinate in a network of social drivers: The role of information'</article-title>
          ,
          <source>in Proceedings of the International Workshop on Learning and Adaptation in MAS (LAMAS</source>
          <year>2005</year>
          ), eds.,
          <string-name>
            <surname>Karl</surname>
            <given-names>Tuyls</given-names>
          </string-name>
          , Pieter
          <string-name>
            <surname>Jan't Hoen</surname>
          </string-name>
          ,
          <source>Katja Verbeeck, and Sandip Sen, number 3898 in Lecture Notes in Artificial Intelligence</source>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>128</lpage>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Ana</surname>
            <given-names>L. C.</given-names>
          </string-name>
          <string-name>
            <surname>Bazzan</surname>
            and
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Grunitzki</surname>
          </string-name>
          , '
          <article-title>A multiagent reinforcement learning approach to en-route trip building'</article-title>
          ,
          <source>in 2016 International Joint Conference on Neural Networks (IJCNN)</source>
          , pp.
          <fpage>5288</fpage>
          -
          <lpage>5295</lpage>
          , (
          <year>July 2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Luciana</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Buriol</surname>
            ,
            <given-names>Michael J</given-names>
          </string-name>
          . Hirsh,
          <string-name>
            <surname>Panos M. Pardalos</surname>
          </string-name>
          , Tania Querido,
          <string-name>
            <surname>Mauricio G.C. Resende</surname>
          </string-name>
          , and Marcus Ritt, '
          <article-title>A biased random-key genetic algorithm for road congestion minimization'</article-title>
          ,
          <source>Optimization Letters</source>
          ,
          <volume>4</volume>
          ,
          <fpage>619</fpage>
          -
          <lpage>633</lpage>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Cagara</surname>
          </string-name>
          , Bjorn Scheuermann, and
          <string-name>
            <surname>Ana L.C. Bazzan</surname>
          </string-name>
          , '
          <article-title>Traffic optimization on Islands'</article-title>
          ,
          <source>in 7th IEEE Vehicular Networking Conference (VNC</source>
          <year>2015</year>
          ), pp.
          <fpage>175</fpage>
          -
          <lpage>182</lpage>
          , Kyoto, Japan, (
          <year>December 2015</year>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Caroline</given-names>
            <surname>Claus</surname>
          </string-name>
          and Craig Boutilier, '
          <article-title>The dynamics of reinforcement learning in cooperative multiagent systems'</article-title>
          ,
          <source>in Proceedings of the Fifteenth National Conference on Artificial Intelligence</source>
          ,
          <source>AAAI '98/IAAI '98</source>
          , pp.
          <fpage>746</fpage>
          -
          <lpage>752</lpage>
          , Menlo Park, CA, USA, (
          <year>1998</year>
          ).
          <article-title>American Association for Artificial Intelligence</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Luca D'Acierno</surname>
            ,
            <given-names>Bruno</given-names>
          </string-name>
          <string-name>
            <surname>Montella</surname>
          </string-name>
          , and Fortuna De Lucia, '
          <article-title>A stochastic traffic assignment algorithm based on ant colony optimisation'</article-title>
          ,
          <source>in Ant Colony Optimization and Swarm Intelligence</source>
          , 5th International Workshop, ANTS 2006, eds.,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dorigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.M.</given-names>
            <surname>Gambardella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Birattari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martinoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Poli</surname>
          </string-name>
          , and T. Stützle, volume
          <volume>4150</volume>
          of Lecture Notes in Computer Science, pp.
          <fpage>25</fpage>
          -
          <lpage>36</lpage>
          , Berlin, (
          <year>2006</year>
          ). Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dia</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Panwai</surname>
          </string-name>
          ,
          <source>Intelligent Transport Systems: Neural Agent (Neugent) Models of Driver Behaviour</source>
          , LAP Lambert Academic Publishing,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] José Capela Dias, Penousal Machado, Daniel Castro Silva, and Pedro Henriques Abreu, '
          <article-title>An inverted ant colony optimization approach to traffic'</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          ,
          <volume>36</volume>
          (
          <issue>0</issue>
          ),
          <fpage>122</fpage>
          -
          <lpage>133</lpage>
          , (
          <year>July 2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Anestis</surname>
            <given-names>Fachantidis</given-names>
          </string-name>
          , Matthew E. Taylor, and Ioannis P. Vlahavas, '
          <article-title>Learning to teach reinforcement learning agents'</article-title>
          ,
          <source>Machine Learning and Knowledge Extraction</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>21</fpage>
          -
          <lpage>42</lpage>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Syed</given-names>
            <surname>Md</surname>
          </string-name>
          . Galib and Irene Moser, '
          <article-title>Road traffic optimisation using an evolutionary game'</article-title>
          ,
          <source>in Proceedings of the 13th annual conference companion on Genetic and evolutionary computation</source>
          ,
          <source>GECCO '11</source>
          , pp.
          <fpage>519</fpage>
          -
          <lpage>526</lpage>
          , New York, NY, USA, (
          <year>2011</year>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Grunitzki and Ana L. C. Bazzan</surname>
          </string-name>
          , '
          <article-title>Combining car-to-infrastructure communication and multi-agent reinforcement learning in route choice'</article-title>
          ,
          <source>in Proceedings of the Ninth Workshop on Agents in Traffic and Transportation</source>
          (ATT-
          <year>2016</year>
          ), eds.,
          <string-name>
            <surname>Ana L. C. Bazzan</surname>
          </string-name>
          , Franziska Klügl, Sascha Ossowski, and Giuseppe Vizzari, New York, (
          <year>July 2016</year>
          ).
          <article-title>CEUR-WS.org</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Klügl</surname>
          </string-name>
          and
          <string-name>
            <surname>Ana L. C. Bazzan</surname>
          </string-name>
          , '
          <article-title>Simulation studies on adaptative route decision and the influence of information on commuter scenarios'</article-title>
          ,
          <source>Journal of Intelligent Transportation Systems: Technology, Planning, and Operations</source>
          ,
          <volume>8</volume>
          (
          <issue>4</issue>
          ),
          <fpage>223</fpage>
          -
          <lpage>232</lpage>
          , (October/
          <year>December 2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Jelle</surname>
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Kok</surname>
          </string-name>
          and Nikos Vlassis, '
          <article-title>Sparse cooperative q-learning'</article-title>
          ,
          <source>in Proceedings of the 21st. International Conference on Machine Learning (ICML)</source>
          , pp.
          <fpage>481</fpage>
          -
          <lpage>488</lpage>
          , New York, USA, (
          <year>July 2004</year>
          ). ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Koster</surname>
          </string-name>
          , Andrea Tettamanzi,
          <string-name>
            <surname>Ana L. C. Bazzan</surname>
          </string-name>
          , and
          Célia da Costa Pereira, '
          <article-title>Using trust and possibilistic reasoning to deal with untrustworthy communication in VANETs'</article-title>
          ,
          <source>in Proceedings of the 16th IEEE Annual Conference on Intelligent Transport Systems (IEEEITSC)</source>
          , pp.
          <fpage>2355</fpage>
          -
          <lpage>2360</lpage>
          ,
          <string-name>
            <surname>The</surname>
            <given-names>Hague</given-names>
          </string-name>
          , The Netherlands, (
          <year>2013</year>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Kumpati</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Narendra and Mandayam A. L. Thathachar</surname>
          </string-name>
          ,
          <source>Learning Automata: An Introduction</source>
          , Prentice-Hall, Upper Saddle River, NJ, USA,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Denise de Oliveira and Ana L. C. Bazzan</surname>
          </string-name>
          , '
          <article-title>Multiagent learning on traffic lights control: effects of using shared information'</article-title>
          ,
          <source>in Multi-Agent Systems for Traffic and Transportation</source>
          , eds.,
          <string-name>
            <surname>Ana</surname>
            <given-names>L. C.</given-names>
          </string-name>
          <string-name>
            <surname>Bazzan</surname>
          </string-name>
          and Franziska Klügl,
          <fpage>307</fpage>
          -
          <lpage>321</lpage>
          ,
          <string-name>
            <given-names>IGI</given-names>
            <surname>Global</surname>
          </string-name>
          , Hershey,
          <string-name>
            <surname>PA</surname>
          </string-name>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Juan de Dios Ortúzar</surname>
          </string-name>
          and Luis G. Willumsen, Modelling Transport, John Wiley &amp; Sons, 3rd edn.,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Gabriel de O. Ramos</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ana L. C. Bazzan</surname>
          </string-name>
          , and Bruno C. da Silva, '
          <article-title>Analysing the impact of travel information for minimising the regret of route choice'</article-title>
          , Transportation Research Part C: Emerging Technologies,
          <volume>88</volume>
          ,
          <fpage>257</fpage>
          -
          <lpage>271</lpage>
          , (
          <year>Mar 2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Gabriel de O. Ramos</surname>
          </string-name>
          , Bruno C. da
          <string-name>
            <surname>Silva</surname>
          </string-name>
          , and
          <string-name>
            <surname>Ana L. C. Bazzan</surname>
          </string-name>
          , '
          <article-title>Learning to minimise regret in route choice'</article-title>
          ,
          <source>in Proc. of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS</source>
          <year>2017</year>
          ), eds.,
          <string-name>
            <surname>S. Das</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Durfee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Larson</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Winikoff</surname>
          </string-name>
          , pp.
          <fpage>846</fpage>
          -
          <lpage>855</lpage>
          , São Paulo, (May
          <year>2017</year>
          ). IFAAMAS.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Gabriel de O. Ramos</surname>
          </string-name>
          , Bruno C. da
          <string-name>
            <surname>Silva</surname>
          </string-name>
          , Roxana Rădulescu, and
          <string-name>
            <surname>Ana L. C. Bazzan</surname>
          </string-name>
          , '
          <article-title>Learning system-efficient equilibria in route choice using tolls'</article-title>
          ,
          <source>in Proceedings of the Adaptive Learning Agents Workshop 2018 (ALA-18)</source>
          , Stockholm, (
          <year>Jul 2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Gabriel de O. Ramos</surname>
          </string-name>
          and Ricardo Grunitzki, '
          <article-title>An improved learning automata approach for the route choice problem'</article-title>
          ,
          <source>in Agent Technology for Intelligent Mobile Services and Smart Societies</source>
          , eds.,
          <string-name>
            <surname>Fernando</surname>
            <given-names>Koch</given-names>
          </string-name>
          ,
          <source>Felipe Meneguzzi, and Kiran Lakkaraju</source>
          , volume
          <volume>498</volume>
          of Communications in Computer and Information Science,
          <fpage>56</fpage>
          -
          <lpage>67</lpage>
          , Springer Berlin Heidelberg, (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Guni</surname>
            <given-names>Sharon</given-names>
          </string-name>
          , Josiah P Hanna, Tarun Rambha, Michael W Levin,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <surname>Stephen D Boyles</surname>
          </string-name>
          , and Peter Stone, '
          <article-title>Real-time adaptive tolling scheme for optimized social welfare in traffic networks'</article-title>
          ,
          <source>in Proc. of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS</source>
          <year>2017</year>
          ), eds.,
          <string-name>
            <surname>S. Das</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Durfee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Larson</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Winikoff</surname>
          </string-name>
          , pp.
          <fpage>828</fpage>
          -
          <lpage>836</lpage>
          , São Paulo, (May
          <year>2017</year>
          ). IFAAMAS.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Ming</surname>
            <given-names>Tan</given-names>
          </string-name>
          ,
          <article-title>'Multi-agent reinforcement learning: Independent vs. cooperative agents'</article-title>
          ,
          <source>in Proceedings of the Tenth International Conference on Machine Learning (ICML</source>
          <year>1993</year>
          ), pp.
          <fpage>330</fpage>
          -
          <lpage>337</lpage>
          . Morgan Kaufmann, (
          <year>June 1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Adam</surname>
            <given-names>Taylor</given-names>
          </string-name>
          , Ivana Dusparic, Edgar Galván López, Siobhán Clarke, and Vinny Cahill, '
          <article-title>Accelerating learning in multi-objective systems through transfer learning'</article-title>
          ,
          <source>in 2014 International Joint Conference on Neural Networks (IJCNN)</source>
          , pp.
          <fpage>2298</fpage>
          -
          <lpage>2305</lpage>
          , Beijing, China, (
          <year>2014</year>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Lisa</given-names>
            <surname>Torrey</surname>
          </string-name>
          and Matthew E. Taylor, '
          <article-title>Teaching on a budget: Agents advising agents in reinforcement learning'</article-title>
          ,
          <source>in Proceedings of the 2013 International Conference on Autonomous Agents</source>
          and
          <string-name>
            <surname>Multi-Agent</surname>
            <given-names>Systems</given-names>
          </string-name>
          , St. Paul, MN, USA, (May
          <year>2013</year>
          ). IFAAMAS.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28] John Glen Wardrop, '
          <article-title>Some theoretical aspects of road traffic research'</article-title>
          ,
          <source>Proceedings of the Institution of Civil Engineers</source>
          ,
          <string-name>
            <surname>Part</surname>
            <given-names>II</given-names>
          </string-name>
          ,
          <volume>1</volume>
          (
          <issue>36</issue>
          ),
          <fpage>325</fpage>
          -
          <lpage>362</lpage>
          , (
          <year>1952</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Christopher J. C. H.</given-names>
            <surname>Watkins</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Dayan</surname>
          </string-name>
          , '
          <article-title>Q-learning</article-title>
          ',
          <source>Machine Learning</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ),
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          , (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Jin Y.</given-names>
            <surname>Yen</surname>
          </string-name>
          , '
          <article-title>Finding the k shortest loopless paths in a network</article-title>
          ',
          <source>Management Science</source>
          ,
          <volume>17</volume>
          (
          <issue>11</issue>
          ),
          <fpage>712</fpage>
          -
          <lpage>716</lpage>
          , (
          <year>1971</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Matthieu</given-names>
            <surname>Zimmer</surname>
          </string-name>
          , Paolo Viappiani, and Paul Weng, '
          <article-title>Teacher-student framework: a reinforcement learning approach</article-title>
          ', in
          <source>AAMAS Workshop Autonomous Robots and Multirobot Systems</source>
          , Paris, France, (May
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>