                      Experience Sharing in a Traffic Scenario
                                                Ana L. C. Bazzan1 and Franziska Klügl2


1 UFRGS, Porto Alegre, Brazil, email: bazzan@inf.ufrgs.br
2 Örebro University, Sweden, email: franziska.klugl@oru.se

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Abstract. Travel apps are becoming more and more popular, giving information about the current traffic state to drivers, who then adapt their route choice. In commuting scenarios, where people repeatedly travel between a particular origin and destination, learning effects add to this information. In this paper, we analyse the effects on the overall network if adaptive driver agents share their aggregated experience about route choice in a reinforcement learning (Q-learning) setup. Drivers share what they have learnt about the system, not just information about their current travel times. We show in a standard scenario that experience sharing can improve convergence times for adaptive driver agents.


1 Introduction

Which route to choose for travelling from origin to destination is a central question a traffic participant – in our case, a driver – faces. The route choice of a driver, however, is not independent of the choices of others, as the load on a link determines the travel time on it and, as a consequence, the travel time on a route. This is the well-known problem of traffic assignment. Drivers adapt to their experience and choose the routes which they expect to be the best choice – mostly with respect to shortest travel time. In traffic analysis this idea is manifested in the concept of user equilibrium, a situation in which no user can reduce its travel time by changing its route.

There are many approaches for driving the overall system towards the system optimum by providing individual drivers with incentives or information. Information about recent travel times alone may be misleading, as it only gives a one-shot impression. A consequence is ineffective oscillations caused by overly fast and strong reactions. A way to address this problem is to introduce a kind of inertia into the system, assuming that people are hesitant to follow new information.

More and more drivers use travel apps that also show the traffic state collected from real-time speed observations, such as Google Maps or Waze. Drivers fuse the information that they receive from those apps with their individual experience to actually make their routing decision.

In this paper, we want to analyse whether sharing of experience among drivers, rather than using recent travel time information, enables the overall system to develop into the user (or Nash) equilibrium (UE).

Our analysis is inspired by a hypothetical app with which drivers share what they have learnt from their repeated route choice between some origin and destination in a traffic system – like friends chatting about which routes they prefer or avoid.

As discussed later in Section 5, there are some approaches that are based on sharing or transferring knowledge to improve or accelerate the learning process. However, the majority of them deal with cooperative environments. In non-cooperative tasks, such as route choice, we observed that the transfer of a good solution from one agent to others may produce sub-optimal outcomes. Therefore, there is a need for approaches that address transfer of knowledge in other contexts. With the present paper, we provide the first steps in this direction.

The remaining sections of this paper are organised as follows. First, we introduce some background on route choice, traffic assignment concepts and the usage of reinforcement learning. Section 3 then describes the proposed approach, while Section 4 discusses experiments performed with a standard scenario and the results gained from them. Section 5 reviews related work, and Section 6 presents the concluding remarks and points to future research.


2 Background: Traffic Assignment, Route Choice and Reinforcement Learning

2.1 The Traffic Assignment Problem

The traffic assignment problem (TAP) aims at assigning a route to each driver that wants to travel from its origin to its destination. In traffic analysis and simulation, traffic assignment is the prominent step connecting the demand (how many drivers want to travel between two nodes in a traffic network?) and the supply (roads in a traffic network with particular capacities, speeds, etc.). The TAP can be solved in different ways. Agent-based approaches aim at finding the UE [28], in which every driver perceives that all possible routes between its origin and destination (its OD pair) have similar, minimal costs, so that the agent has no incentive to change routes. This means that the UE corresponds to each driver selfishly computing a route individually.

A traditional algorithm to solve the TAP (i.e. to find the UE) is the method of successive averages (MSA). Yet, this method is performed in a centralised way (for details see, e.g., Chapter 10 in [19]).

A learning-based approach, on the other hand, can be realised in a decentralised way, with each agent learning individually by means of RL (see the next section). We remark, however, that since the agents' actions are highly coupled with the actions of other agents, this is not a trivial task. Moreover, it is a non-cooperative learning task.

2.2 Reinforcement Learning

In reinforcement learning (RL), an agent's goal is to learn a mapping from a given state to a given action by means of a value function. A popular algorithm to compute such value functions is Q-learning (QL) [29]. Value functions take the instantaneous reward received (a numerical signal) and compute the expected, discounted value of the corresponding action. Once such a mapping is learned, an agent can decide which action to select in order to maximize its rewards.

We can model RL as a Markov decision process (MDP) composed of a tuple (S, A, T, R), where S is a set of states; A is a set of actions; T is the transition function that models the probability of the system moving from a state s ∈ S to a state s' ∈ S upon performing action a ∈ A; and R is the reward function that yields a real number associated with performing an action a ∈ A when one is in state s ∈ S.

Q-learning (QL) computes a Q-value Q(s, a) for each state-action pair. This value represents a reward estimate for executing action a in state s. The updating of Q(s, a) is done through Eq. 1, where α ∈ [0, 1] is the learning rate and γ ∈ [0, 1] is the discount factor.

    Q(s, a) = Q(s, a) + α (r + γ max_a' Q(s', a') − Q(s, a))    (1)

Each agent is a learner that maintains a Q-table that is updated at each subsequent episode. In order to address the exploration–exploitation dilemma, the ε-greedy exploration strategy can be used to choose actions: the action with the highest Q-value is selected with probability 1 − ε and a random action is selected with probability ε.

Finally, even if agents may be learning independently (as, e.g., in [7]), this is a non-trivial instance of multiagent RL (MARL).
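To make the use of QL concrete, the following Python sketch shows the update of Eq. 1 together with ε-greedy action selection for one driver agent, in the stateless formulation used later in Section 3.1 (one Q-value per route). It is an illustrative sketch only; the class name, parameter defaults and the choice of γ = 0 are our own assumptions and do not come from the authors' implementation.

    import random

    class DriverAgent:
        """Stateless Q-learning driver: a single state (its OD pair) and k
           routes as actions. Illustrative sketch, not the authors' code."""

        def __init__(self, n_routes, alpha=0.5, gamma=0.0, epsilon=0.05):
            self.q = [0.0] * n_routes   # one Q-value per route (action)
            self.alpha = alpha          # learning rate
            self.gamma = gamma          # discount; 0 in a stateless, episodic task
            self.epsilon = epsilon      # exploration probability

        def choose_route(self):
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < self.epsilon:
                return random.randrange(len(self.q))
            return max(range(len(self.q)), key=lambda a: self.q[a])

        def update(self, action, travel_time):
            # reward is the negative travel time; with a single state the
            # bootstrap term of Eq. 1 ranges over the same Q-table
            reward = -travel_time
            target = reward + self.gamma * max(self.q)
            self.q[action] += self.alpha * (target - self.q[action])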
3 Sharing Experiences

3.1 Q-Learning in a Route Choice Scenario

Given is a road network G in which nodes represent points of interest, which can be junctions, neighbourhoods, or areas; links represent road segments with a free-flow travel time depending on distance and capacity, and an additional cost function that relates the number of vehicles on a link to its cost (travel time). Some of the nodes serve as origins, some as destinations. Between those origin and destination nodes, we consider sequences of links as possible routes. Different routes connecting an origin and a destination are not independent from one another, as they may contain shared links. Such dependencies exist between routes of the same origin-destination combination as well as with routes of others.

Drivers use Q-learning to individually determine the best route they can take for their individual origin and destination pair. Hereby, S is the particular origin node of the individual driver. Each driver learns independently. Each agent's state is determined by its origin, i.e., this is a stateless formulation of QL, as in [7].

The set of actions A contains the k shortest routes between origin and destination. Therefore the number k needs to be sufficiently large to give sufficient flexibility in route choice. The transition function maps the origin to the destination for all actions. The reward is a function of the experienced travel time on the route (we use the negative of the travel time, for maximization purposes).

It is important to actually understand the multi-agent characteristics of this learning task. Because links have travel times that depend on the number of agents using them, learning is challenging. The desirable shortest path under free flow may end up producing a high travel time if too many agents want to use it. Agents need to learn how to distribute themselves in a way that an optimal number of agents uses each edge, minimizing their individual travel times on the selected routes. Coordination is desirable not just between agents with the same origin-destination pair, but needs to happen between all agents, even if their routes share only a single link.

3.2 Information Sharing via App

With the dissemination of various traffic apps, it is no longer realistic to assume that drivers only learn from their own experience when repeatedly travelling from origin to destination, just observing their own travel time. Carrying the idea of traffic apps beyond the current traffic state, more information produced by others can be used for decision making. While there have been a number of works in which the effect of information about current travel times on a route or on parts of a route was tested (see Section 5), the question emerged whether explicitly sharing accumulated information about particular routes could improve the learning process. We envision a travel app in which participating drivers can share experience about their favourite or most disappointing routes. The setup is visualised in Figure 1.

Figure 1. Setup of the system: Drivers decide about sharing their experience via an app with a server. The server processes the collected information and publishes the aggregated experience. Drivers may access this information depending on their information strategy.

The overall decision making proceeds in three steps (a sketch of the full cycle follows the list):

1. Agents select an action from their set of possible actions. Each agent experiences the time that it took to perform the travel using the route that corresponds to the selected action. The agent updates its Q-table according to the QL rules. At the end of the day, the agents share an evaluation of a route with the app. The shared information does not need to be about the route they took most recently; it may be about their best, their worst, or a randomly selected route. The app can be seen as a platform for chatting "neighbours" who talk about their experiences when travelling to a particular destination. Drivers may also decide to share no information at all. More precisely: sharing their accumulated experience means that they share the Q-value assigned to a particular action. The agents do not reason about the potential effect of sharing information on the reward they can receive. They do not assume that the publication of the information influences others' decision making to an extent that the evaluation of the published route changes dramatically.

2. The information server behind the app collects all information from the information-sharing drivers and determines the shared information that the server then publishes. This information contains, per origin-destination pair, an action and the Q-value that belongs to the action, aggregated from all handed-in information.

3. Driver agents decide whether to access the published information. Accessing has the consequence that they incorporate the published information into their Q-table: they replace the Q-value of the concerned action with the published Q-value for the given action. There is no reasoning about the credibility of the published value in relation to the existing Q-value. The decision to access the shared information is based on a budget acquired by sharing – the more one shares, the more one may access the shared value. In the current version, the agents do not reason about whether the investment into acquiring published information may pay off; they rather access the information with a given frequency/probability, provided they have a budget.
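The following Python sketch illustrates this three-step cycle for one episode. It is a minimal, hypothetical sketch under our own assumptions: it reuses the DriverAgent from the earlier sketch, extended with an od_pair attribute and a wants_to_access() budget check (one possible budget rule is sketched at the end of Section 3.3.2), and the aggregation shown is the "best of all submitted values" variant discussed in Section 3.3.3.

    from collections import defaultdict

    class ExperienceApp:
        """Collects (route, Q-value) submissions per OD pair and publishes, per
           OD pair, one route together with an aggregated Q-value (here: the
           best of all submitted values)."""

        def __init__(self):
            self.inbox = defaultdict(list)
            self.board = {}

        def submit(self, od_pair, route, q_value):
            self.inbox[od_pair].append((route, q_value))

        def publish(self):
            for od_pair, entries in self.inbox.items():
                self.board[od_pair] = max(entries, key=lambda e: e[1])
            self.inbox.clear()

        def published(self, od_pair):
            return self.board.get(od_pair)


    def run_episode(agents, travel_time_of, app):
        """One 'day': choose, travel, learn, share, and optionally access.
           travel_time_of(choices) is assumed to map the joint route choice to
           one travel time per agent (flow-dependent, see Section 4.1)."""
        choices = {agent: agent.choose_route() for agent in agents}       # step 1
        times = travel_time_of(choices)

        for agent in agents:
            agent.update(choices[agent], times[agent])                    # QL update
            best = max(range(len(agent.q)), key=lambda a: agent.q[a])
            app.submit(agent.od_pair, best, agent.q[best])                # share best

        app.publish()                                                     # step 2

        for agent in agents:                                              # step 3
            if agent.wants_to_access():
                published = app.published(agent.od_pair)
                if published is not None:
                    route, q_value = published
                    agent.q[route] = q_value                              # overwrite
        return times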
This can be seen as a rudimentary form of transfer learning, with transfer happening during the learning process rather than from one task to another. Due to the way we modelled actions, transfer between different problems – here, different origin-destination pairs – makes no sense.

This is a multiagent reinforcement learning setup in which agents have to learn anti-coordination. Sharing happens in groups: only agents within the same group, that means with the same origin and destination pair, share partial information from their Q-tables.

3.3 Sharing Issues in Detail

More details need to be considered to provide full information about the analysed scenarios.

The overall dynamics are such that agents select their route before the actual travelling happens. Then, all agents travel through the network. Using cost functions to determine the travel time on a link from its load, we abstract away from detailed temporal dynamics in which departure time, etc. matter. As we want to focus on the effect of sharing experiences and different aspects related to that, we did not use a microscopic traffic simulation to determine how an individual driver moves through the network. As a consequence, all agents taking the same route encounter the same travel times. The reward from travelling is sent to the drivers after all have finished. The agents update their Q-tables and send information to the app. In the next episode, the drivers who want to access the published information look into the service and again update their Q-tables based on the published value. So, there are a number of decisions involved:

3.3.1 When does a driver access the shared information?

Assuming that the driver needs to pay for the service of getting such information, the decision about whether and when it is best to incorporate information from other agents into one's Q-table may be a strategic one.

1. The simplest strategy could be to simply access the information in every round.
2. The agent could collect some budget – e.g. from sharing information – that it could invest into accessing information. So whenever the agent has collected enough budget, it accesses. The relation between the gain per sharing and the cost of accessing is the relevant parameter here.
3. The agent could access the information in the beginning, with decreasing frequency the longer the learning proceeds.
4. One can imagine more advanced strategies based on dynamics that the agent observes. For example, if the action with the best Q-value was "disappointing" a number of times, the agent may trigger a look-up on the app.
5. An alternative could be to have a look if the Q-values are not sufficiently distinct to have a clear favourite route.

We decided to focus on the simpler alternatives and leave the more sophisticated approaches to future work, as they involve a lot of individual parameters to fully specify such strategies. Hence, in this work we follow alternative 2, varying the budget an agent has, starting with an unlimited budget and then decreasing it so that agents only access the information in some episodes (a sketch of one such budget rule is given below, following Section 3.3.2). Although we consider that all agents have the same budget, the access is not synchronous, i.e., not everybody spends their budget in the same episode (except in the case of an unlimited budget, where every agent accesses in every episode).

3.3.2 Whether, when and with whom to share information?

The agents learn about good actions/routes for their origin-destination combination individually. In principle, they do not need to share – yet without a sufficiently large subset of the driver population sharing, the information that the app publishes may not possess any value. So, it is in the interest of the app providers that as many agents as possible hand in their experience.

There might be some required consent for basic service usage to send information about the agents' experience to the app. This has the consequence that basically everybody would share experiences, much like GPS traces are collected without full awareness of the (careless) traveller.

As an alternative, one can imagine that there is a small compensation for actively sharing experiences. The budget system indicated above would need a corresponding side for earning what will be spent later. For every sending of information, the agent could increase the budget for retrieving information.

The compensation for sharing could also be independent from the information usage. It could be financial, or it could give social credits, turning the individual agent into somebody more compliant with societal values.

For this first analysis, we just analysed situations in which every agent is basically forced to contribute its experience, that means all agents share.

A variant of this question is related to with whom an agent may share. The idea of an app that publishes a value aggregated from all handed-in experiences is one extreme: the agent shares with everybody that travels between the same origin and destination. Whether the information that the agent contributes is actually valuable is a decision of the aggregation mechanism used by the app. As an alternative, one can imagine restricting the information dissemination to only a group of agents, denoting a group of friends or neighbours. This could bring more heterogeneity into the information transfer. As an extreme case, agents could pair up and just share information with one other agent; this, however, thwarts the app idea. Additionally, we did not expect a large impact and therefore just tested without restrictions. In future work, we need to test this assumption.
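As an illustration of alternative 2, the sketch below shows one possible way to implement such a budget rule in Python. The paper does not spell out the exact accounting, so the earning rate and the probabilistic spending below are our own assumptions, chosen only to reproduce the "x accesses in 10 episodes" ratios used in Section 4 while keeping access asynchronous across agents.

    import random

    class BudgetAccessPolicy:
        """Access strategy 2: earn credit by sharing, spend it to access.
           Hypothetical accounting; targets a given number of accesses per
           10 episodes on average, without synchronising all agents."""

        def __init__(self, accesses_per_10_episodes, access_cost=1.0):
            self.earn_per_share = access_cost * accesses_per_10_episodes / 10.0
            self.access_cost = access_cost
            self.balance = 0.0

        def on_share(self):
            # called once per episode, when the agent hands in its experience
            self.balance += self.earn_per_share

        def wants_to_access(self):
            # spend only with some probability so that agents do not all
            # access in the same episode (access is not synchronous)
            if self.balance >= self.access_cost and random.random() < 0.5:
                self.balance -= self.access_cost
                return True
            return False

A DriverAgent could then delegate its wants_to_access() call to such a policy object and call on_share() whenever it submits a value to the app.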
3.3.3 What information shall be shared, and how shall the information be aggregated?

This is the central question for the setup of the app. What information shall be shared, and how does the app process all the information collected?

The first question relates to what information the agents send to the app. It is obvious that this can be information about the route with the highest Q-value – basically telling others that this route is the best for the agent's origin-destination pair according to the agent's experience. Yet, other information may also be interesting to know: there are situations in which it is better to avoid choices. So, we foresee the following alternatives to be relevant:

1. Share the last chosen action including its Q-value
2. Share a randomly selected action with its Q-value
3. Share the action with the best Q-value
4. Share the action with the worst Q-value

As the first alternative is in most cases the same as the third one, and otherwise a random selection, we deemed this alternative not interesting. We tested all others. The agents hereby share the information. If they access the aggregated information, they integrate it into their own individual Q-table, replacing the Q-value of the concerned route/action. So, they do not automatically select the route that they were told about, but only if there is no better one, and they continue with the usual learning and decision making process.

As written above, the app collects and aggregates information from all sharing drivers. Also for this overall process step, there are different alternatives:

1. The app publishes a random value
2. The app publishes the overall best Q-value and corresponding action
3. The app publishes the overall lowest Q-value with corresponding action
4. The app averages the Q-values for each possible action and then gives the best or worst of these averages

Together with the alternatives for which values to share, this produces a number of interesting combinations: it might be interesting to share randomly selected information from all best or all worst Q-values, yet we tested the "pure" combinations: the app provides a fully randomly selected piece of information, the best of all best, or the worst of all worst Q-values.

4 Experiments and Results

4.1 Scenario

We performed a number of experiments to understand how the sharing actually impacts the adaptation and eventually the performance of the drivers using the app.

The OW network (proposed in Chapter 10 in [19]), seen in Figure 2, represents two residential areas (nodes A and B) and two major shopping areas (nodes L and M). So, there are four OD pairs, connecting the residential areas with the shopping areas. The numbers on the edges are the travel times between two nodes under free flow (in both directions). We assume no additional delays when passing nodes, e.g. due to traffic lights. We computed k = 8 shortest paths per OD pair, following the algorithm by [30].

Figure 2. OW Road Network (as proposed by Ortuzar and Willumsen) [19]

The volume-delay (or cost) function is t_e = t_e^f + 0.02 × q_e, where t_e is the travel time on edge e, t_e^f is the edge's travel time under free-flow conditions, and q_e is the flow using edge e. This means that the travel time on each edge increases by 0.02 of a minute for each vehicle/hour of flow.

For the sake of getting a sense of what would be optimal, the free-flow travel times (FFTT) for each OD pair are shown in Table 1. We note however that these times cannot be achieved, because the free-flow condition is not realistic: when each agent tries to use its route with the lowest travel time, jams occur and the times increase. In addition, the table shows the number of agents that travel between the four particular origin and destination combinations and the travel time in the UE. The latter is averaged, as drivers take different routes between their OD pairs. Also, the fact that 1700 agents learn simultaneously makes the overall learning complex.

Table 1. OW network: some characteristics

  OD pair   Agents   Shortest path   FFTT   Avg. travel time (in UE)
  AL           600   ACGJIL            28                      71.0
  AM           400   ACDHKM            26                     64.94
  BL           300   BDGJIL            32                     68.78
  BM           400   BEHKM             23                      62.3
  Total       1700                                            67.13

For the QL, Q-tables are initialised with a random value around -90. All experiments were conducted with α = 0.5 (learning rate) and ε = 0.05 (for ε-greedy action selection), following previous works (e.g., [2, 1]) that have extensively tried other values. As mentioned above, k = 8 is the number of shortest paths, that means the number of possible actions.

We measure the overall performance of the different approaches with the average travel time over all 1700 agents.

Each experiment was repeated 30 times. Standard deviations are not shown to keep the plots cleaner. We remark that they are of the order of 1% between the results of different runs in terms of average travel times. There are partially dramatic oscillations between episodes.
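The following Python sketch shows how, under the cost function above, the joint route choice of all agents translates into per-agent travel times; this is our own illustrative rendering of the flow-dependent coupling, not the authors' code. The k = 8 routes per OD pair could, for instance, be generated with Yen's algorithm [30] (available, e.g., as networkx.shortest_simple_paths).

    def edge_travel_time(free_flow_time, flow):
        """Volume-delay function of Section 4.1: t_e = t_e^f + 0.02 * q_e."""
        return free_flow_time + 0.02 * flow

    def route_travel_times(route_choices, free_flow):
        """route_choices maps each agent to its chosen route, given as a tuple
           of edges such as (('A', 'C'), ('C', 'G'), ...); free_flow maps an
           edge to its free-flow travel time. Returns the travel time per agent."""
        # count how many agents use each edge (the flow q_e)
        flow = {}
        for route in route_choices.values():
            for edge in route:
                flow[edge] = flow.get(edge, 0) + 1
        # flow-dependent edge times, then sum them along each agent's route
        edge_time = {e: edge_travel_time(free_flow[e], q) for e, q in flow.items()}
        return {agent: sum(edge_time[e] for e in route)
                for agent, route in route_choices.items()}

All agents taking the same route thus encounter the same travel time, which is exactly the coupling that makes the learning task non-trivial (Section 3.3).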
4.2 Experimental Results

4.2.1 Sharing Best Experience

We start by looking into cases in which the agents share their best experience.

We tested different settings of the budget strategy. Agents need different numbers of share-actions to be able to afford looking into what the app publishes. Figure 3 shows the results of testing scenarios in which an agent can access the information 2, 4, 6 or 8 times within 10 rounds. Agents decide individually; not all access at the same time, but on average over the episodes with the given ratio. For comparison, we also display the learning curve if pure Q-learning is applied, that means the case in which agents learn without sharing Q-values via the app.

Figure 3. OW network: Development of the overall averaged travel times over the episodes for different sharing setups. Unlimited budget means that agents access the shared experience every round; no budget means no sharing.

Figure 3 shows the average travel time over all agents as well as over the 30 experiments. It allows comparing the learning curves for different budget limitations for accessing the information that the app published. We can see that after 500 episodes pure QL shows convergence towards the UE, which is approximately 67 in the case of the OW network (see Table 1).

As for the cases in which there is sharing of experiences, the following observations can be made. First, no matter how frequently the agents access the shared experience, we see an acceleration of the learning process in the initial episodes. This is clear in Figure 3, and is due to the fact that agents do not have to try out some actions, given that better options are shared. Unfortunately, depending on the frequency of access to this shared experience, this could lead to sub-optimal learning, causing the collective of agents not to converge to the UE.

Let us now discuss the individual cases. When agents have an unlimited budget to access the shared experience (and fully use this budget), then the performance is bad (see the green line in Figure 3). The reason is that too many agents are informed about the best performance in the previous episode and hence try to take the same action. Thus, the supposedly best action is followed by virtually all agents and they end up selecting the same route (for each OD pair). This can lead to bad performance, not to mention a lot of oscillations (which are seen in the plot). This is a well-known phenomenon and could be confirmed in our experiments.

When the budget is limited so that it allows access to the shared experiences in 6 or 8 out of 10 episodes, then there is an increase in performance compared to the case in which agents have unlimited access. We recall that such accessing is not synchronous, i.e., some agents may access while others are not doing so. Still, the performance is sub-optimal (brown and magenta curves in the figure), with the collective of agents missing the UE in the end. It must be said, though, that the performance in the initial episodes is not bad, i.e., there is an acceleration of the learning due to the fact that some actions need not be tried, as aforementioned.

When the agents have a budget that allows them to only access the shared experiences from time to time (e.g., 2 or 4 in 10 episodes, see the blue and orange curves in Figure 3), then the overall performance is much improved, especially if the driver-vehicle units cannot afford to access the shared experiences in more than 2 out of 10 episodes. Not only is there an acceleration of the learning process already in the initial episodes, but there is also convergence towards the UE. We assume that the best setting of the budget limits depends on the scenario details, especially on the particular OD pair. It will be interesting to analyse this situation more deeply. A budget strategy in which agents share in the beginning, but then gradually give up sharing over time, may lead to the results that we want to achieve, with fast convergence to the UE.
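For completeness, a possible structure of the experiment loop is sketched below: it sweeps over the access ratios compared in Figure 3 and averages the per-episode travel times over repeated runs. The factory function build_scenario, the episode horizon and the reuse of run_episode from the sketch in Section 3.2 are our own placeholders; the paper reports averages over 30 runs of the setup described in Section 4.1.

    import statistics

    def run_experiments(build_scenario, episodes=500, repetitions=30):
        """Illustrative harness: for each access budget, record the travel time
           averaged over all agents per episode, then average over repetitions.
           Reuses run_episode from the earlier sketch; episodes=500 is only a
           placeholder horizon."""
        results = {}
        for accesses_per_10 in [0, 2, 4, 6, 8, 10]:   # 0 = pure QL, 10 = unlimited
            curves = []
            for _ in range(repetitions):
                agents, travel_time_of, app = build_scenario(accesses_per_10)
                curve = []
                for _ in range(episodes):
                    times = run_episode(agents, travel_time_of, app)
                    curve.append(statistics.mean(times.values()))
                curves.append(curve)
            # episode-wise mean over the repetitions gives one learning curve
            results[accesses_per_10] = [statistics.mean(c) for c in zip(*curves)]
        return results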
4.2.2 Sharing Worst or Random Experience

We also tested the effect of agents sharing their worst experience, which the app aggregates to the worst action/route for each particular OD pair. The results are almost identical to vanilla QL. So, sharing apparently has no effect when the worst experience is shared. The reason is that when exploiting – which, according to our settings, happens in 95% of the episodes – the agents select the best route. Which route has a bad Q-value does not matter in this case. So, the only situation in which such an update of the Q-table would change the decision of an agent is if the route with the previously best Q-value is the one that the app informs about. One could imagine that this happens if the previously best route is heavily overloaded, but this was not the case here. Maybe the number of considered possible routes is too high for such a setup. Nevertheless, reducing the set of possible routes makes no sense in such a learning setting.

Also, randomly selecting a route from the Q-table and sharing it with the app, which then publishes a randomly selected experience from all handed-in values, generates a similar outcome as sharing the worst experience. There is no difference to vanilla QL, that means a QL without sharing experiences. The explanation is similar.

5 Related Work

Besides the classical methods to address the TAP, there have been some works – based on other AI paradigms – with the goal of finding the UE. The main motivation is to solve the TAP from the perspective of the individual agent (driver, driver-vehicle unit), thus relaxing the assumption that a central entity is in charge of assigning routes to those agents. In such decentralized approaches, the agent itself has to collect experiences in order to reach the UE condition. Thus, these approaches are suitable for commuting scenarios, where each agent can collect experiences regarding the same kind of task (in this case, driving from a given origin to a given destination).

One popular approach to tackle such decentralized decision-making is via RL (more specifically MARL). However, others are also mentioned in the literature. We start with these and later discuss those based on RL.

In the work of [9], a neural network is used to predict drivers' route choices, as well as compliance with such predictions, under the influence of messages containing traffic information. However, the authors focus on the impact of the message on the agents rather than the impact on system-level traffic distribution and travel time.

The work of [10] uses Inverted Ant Colony Optimization (IACO). Ants (vehicles) deposit their pheromones on the routes they are using, and the pheromone is used to repel them. Consequently, they avoid congested routes. Also [8] applied ant colony optimisation to the TAP. However, in both cases, the pheromone needs to be centrally stored, thus this approach is not fully suitable for a decentralised modelling.

One game-theoretic approach to the route choice problem appears in [12]. This approach uses only past experiences for the route choice. The choice itself is made at each intersection of the network. However, it assumes that historical information is available to all drivers.

The work of [24] uses adaptive tolls to optimise drivers' routes as tolls change. Differently from our purpose, they are concerned with the alignment of choices towards the system optimum, which can only be achieved by imposing costs on drivers (in their case, tolls). In the same line, [5] deal with the problem from a centralised perspective to find an assignment that aligns user and system utilities by imposing tolls on some links.

On the frontier of aligning the optimum of the system with the UE, [1] has introduced an approach that seeks a balance in which not only the central authority benefits but also the individual agents.

RL-based approaches to compute the UE are becoming increasingly popular. Here, each agent seeks to learn to select routes (these are the actions) based on the rewards obtained in each daily commute, where the reward is normally based on the experienced travel time.

Two main lines of approaches can be distinguished. One, less popular due to its complexity, follows a traditional RL recipe, where besides a set of actions there is also a set of states, these being the vertices in which the agent finds itself. Works in this category include [4]. However, the majority of the papers in the literature follow a so-called stateless approach in which there is a single state (the OD pair the agent belongs to), while the agent merely has to select actions, which are normally associated with a set of pre-computed routes that may or may not be recalculated en route. This literature includes approaches based both on Learning Automata ([17], [23]) and on QL.

QL is increasingly being used for the task of route choice. Of particular interest are approaches that compute the regret associated with the greedy nature of selecting routes in a selfish way ([21, 20]), as well as those that combine selfish route choice with some sort of biasing from a centralized entity ([1, 6, 22]).

There are works that, as we propose in the current paper, also deal with some forms of communication between agents [14, 3, 13, 16]. Information there is not always truthful, but can be manipulated to drive agents towards an overall intended outcome.

The idea of sharing either learned policies or Q-values is not new. In fact, as early as 1993, Tan [25] suggested that communication of some kind of knowledge could help, especially in cooperative environments. In particular, sharing Q-values may reduce the time needed to explore the state-action space.

Some researchers have dealt with aspects such as what and when to share. In an abstract view of sharing, [15] deal with agents that keep a list of states in which they need to coordinate. This idea also appears in the context of traffic management in [18], where traffic signal agents keep joint Q-tables based on coupled states and actions.

Some works – for instance, [26] – deal with more fine-grained views of sharing knowledge, as for instance the transfer learning community, in which the question regarding what and when to transfer gets more precise. Transfer of reward values or policies is also explored in the literature. For instance, [27, 11, 31] show that learning can be accelerated if a teacher shares experiences with a student. We remind the reader that virtually all these works deal with cooperative environments, where it makes sense to transfer knowledge. In non-cooperative tasks, such as route choice, we have seen that the transfer of a good solution from one agent to others may produce highly sub-optimal outcomes. Coordination here is actually anti-coordination: appropriately distributing agents between different alternatives.

6 Conclusion and Ideas for the Future

The idea of this paper is to discuss the effects of a possible next type of travel app, in which users intentionally share their experiences, as opposed to conventional travel apps, which are based on collecting drivers' positions in order to be able to display the current average speed on a particular segment. The app that we are considering here can be thought of as a device that replaces direct interaction between colleagues or neighbours chatting about habits and experiences regarding route choice.

Assuming that humans continuously adapt to what they experience when performing (commuting) tasks, we analysed the potential effect of such an app. So, agents do not just learn based on their own experience, but also use others' experience on the same task for decision making. As we have shown, integrating others' experiences from time to time speeds up the learning progress.
In previous works, drivers were informed about travel times that were collected by a central authority (as, e.g., Waze does). This is effective only if some inertia is introduced. We can observe a similar effect here: the agents who do not excessively integrate others' experience learn the best choice for their route. Agents that update their Q-table with additional information from the app in every round perform worst.

These findings are quite preliminary. More investigation is necessary to be able to claim general conclusions. We will test other interesting setups, combining different sharing and aggregation strategies. For example, the agents may share their best experience, but instead of aggregating the information, the app publishes a randomly selected value from those sent. Another interesting idea could be the organisation of access into groups: instead of aggregating from all received values and publishing to all who want to access, agents could be organised into smaller groups of friends who share/aggregate/publish only within the respective group.

As we have indicated in the previous sections, we also plan to test more dynamic and adaptive strategies. Agents may have a high initial budget and decide to use the budget following different strategies: a first such strategy could be that the agent uses its budget in the beginning of the learning process, when it is exploring more. A second alternative is that agents could spare their budgets and use them when they notice that the Q-value of their best route has been decreasing for a given number of rounds.

More innovatively, the app could become an agent itself and actively adjust the parameters of the budget strategies to improve the overall learning process. The app collects a lot of information from the driver agents that can be used to drive the system towards the desired state, e.g. by telling the agents how to acquire more or fewer credits.

ACKNOWLEDGEMENTS

Ana Bazzan was partially supported by CNPq under grant no. 307215/2017-2.

REFERENCES

[1] Ana L. C. Bazzan, 'Aligning individual and collective welfare in complex socio-technical systems by combining metaheuristics and reinforcement learning', Eng. Appl. of AI, 79, 23–33, (2019).
[2] Ana L. C. Bazzan and Camelia Chira, 'Hybrid evolutionary and reinforcement learning approach to accelerate traffic assignment (extended abstract)', in Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2015), eds., R. Bordini, E. Elkind, G. Weiss, and P. Yolum, pp. 1723–1724. IFAAMAS, (May 2015).
[3] Ana L. C. Bazzan, M. Fehler, and F. Klügl, 'Learning to coordinate in a network of social drivers: The role of information', in Proceedings of the International Workshop on Learning and Adaptation in MAS (LAMAS 2005), eds., Karl Tuyls, Pieter Jan't Hoen, Katja Verbeeck, and Sandip Sen, number 3898 in Lecture Notes in Artificial Intelligence, pp. 115–128, (2006).
[4] Ana L. C. Bazzan and R. Grunitzki, 'A multiagent reinforcement learning approach to en-route trip building', in 2016 International Joint Conference on Neural Networks (IJCNN), pp. 5288–5295, (July 2016).
[5] Luciana S. Buriol, Michael J. Hirsh, Panos M. Pardalos, Tania Querido, Mauricio G.C. Resende, and Marcus Ritt, 'A biased random-key genetic algorithm for road congestion minimization', Optimization Letters, 4, 619–633, (2010).
[6] Daniel Cagara, Bjorn Scheuermann, and Ana L.C. Bazzan, 'Traffic optimization on Islands', in 7th IEEE Vehicular Networking Conference (VNC 2015), pp. 175–182, Kyoto, Japan, (December 2015). IEEE.
[7] Caroline Claus and Craig Boutilier, 'The dynamics of reinforcement learning in cooperative multiagent systems', in Proceedings of the Fifteenth National Conference on Artificial Intelligence, AAAI '98/IAAI '98, pp. 746–752, Menlo Park, CA, USA, (1998). American Association for Artificial Intelligence.
[8] Luca D'Acierno, Bruno Montella, and Fortuna De Lucia, 'A stochastic traffic assignment algorithm based on ant colony optimisation', in Ant Colony Optimization and Swarm Intelligence, 5th International Workshop, ANTS 2006, eds., M. Dorigo, L.M. Gambardella, M. Birattari, A. Martinoli, R. Poli, and T. Stützle, volume 4150 of Lecture Notes in Computer Science, pp. 25–36, Berlin, (2006). Springer-Verlag.
[9] H. Dia and S. Panwai, Intelligent Transport Systems: Neural Agent (Neugent) Models of Driver Behaviour, LAP Lambert Academic Publishing, 2014.
[10] José Capela Dias, Penousal Machado, Daniel Castro Silva, and Pedro Henriques Abreu, 'An inverted ant colony optimization approach to traffic', Engineering Applications of Artificial Intelligence, 36(0), 122–133, (July 2014).
[11] Anestis Fachantidis, Matthew E. Taylor, and Ioannis P. Vlahavas, 'Learning to teach reinforcement learning agents', Machine Learning and Knowledge Extraction, 1(1), 21–42, (2019).
[12] Syed Md. Galib and Irene Moser, 'Road traffic optimisation using an evolutionary game', in Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO '11, pp. 519–526, New York, NY, USA, (2011). ACM.
[13] Ricardo Grunitzki and Ana L. C. Bazzan, 'Combining car-to-infrastructure communication and multi-agent reinforcement learning in route choice', in Proceedings of the Ninth Workshop on Agents in Traffic and Transportation (ATT-2016), eds., Ana L. C. Bazzan, Franziska Klügl, Sascha Ossowski, and Giuseppe Vizzari, New York, (July 2016). CEUR-WS.org.
[14] F. Klügl and Ana L. C. Bazzan, 'Simulation studies on adaptative route decision and the influence of information on commuter scenarios', Journal of Intelligent Transportation Systems: Technology, Planning, and Operations, 8(4), 223–232, (October/December 2004).
[15] Jelle R. Kok and Nikos Vlassis, 'Sparse cooperative q-learning', in Proceedings of the 21st International Conference on Machine Learning (ICML), pp. 481–488, New York, USA, (July 2004). ACM Press.
[16] Andrew Koster, Andrea Tettamanzi, Ana L. C. Bazzan, and Célia da Costa Pereira, 'Using trust and possibilistic reasoning to deal with untrustworthy communication in VANETs', in Proceedings of the 16th IEEE Annual Conference on Intelligent Transport Systems (IEEE-ITSC), pp. 2355–2360, The Hague, The Netherlands, (2013). IEEE.
[17] Kumpati S. Narendra and Mandayam A. L. Thathachar, Learning Automata: An Introduction, Prentice-Hall, Upper Saddle River, NJ, USA, 1989.
[18] Denise de Oliveira and Ana L. C. Bazzan, 'Multiagent learning on traffic lights control: effects of using shared information', in Multi-Agent Systems for Traffic and Transportation, eds., Ana L. C. Bazzan and Franziska Klügl, 307–321, IGI Global, Hershey, PA, (2009).
[19] Juan de Dios Ortúzar and Luis G. Willumsen, Modelling Transport, John Wiley & Sons, 3rd edn., 2001.
[20] Gabriel de O. Ramos, Ana L. C. Bazzan, and Bruno C. da Silva, 'Analysing the impact of travel information for minimising the regret of route choice', Transportation Research Part C: Emerging Technologies, 88, 257–271, (Mar 2018).
[21] Gabriel de O. Ramos, Bruno C. da Silva, and Ana L. C. Bazzan, 'Learning to minimise regret in route choice', in Proc. of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2017), eds., S. Das, E. Durfee, K. Larson, and M. Winikoff, pp. 846–855, São Paulo, (May 2017). IFAAMAS.
[22] Gabriel de O. Ramos, Bruno C. da Silva, Roxana Rădulescu, and Ana L. C. Bazzan, 'Learning system-efficient equilibria in route choice using tolls', in Proceedings of the Adaptive Learning Agents Workshop 2018 (ALA-18), Stockholm, (Jul 2018).
[23] Gabriel de O. Ramos and Ricardo Grunitzki, 'An improved learning automata approach for the route choice problem', in Agent Technology for Intelligent Mobile Services and Smart Societies, eds., Fernando Koch, Felipe Meneguzzi, and Kiran Lakkaraju, volume 498 of Communications in Computer and Information Science, 56–67, Springer Berlin Heidelberg, (2015).
[24] Guni Sharon, Josiah P. Hanna, Tarun Rambha, Michael W. Levin, Michael Albert, Stephen D. Boyles, and Peter Stone, 'Real-time adaptive tolling scheme for optimized social welfare in traffic networks', in Proc. of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2017), eds., S. Das, E. Durfee, K. Larson, and M. Winikoff, pp. 828–836, São Paulo, (May 2017). IFAAMAS.
[25] Ming Tan, ‘Multi-agent reinforcement learning: Independent vs. coop-
     erative agents’, in Proceedings of the Tenth International Conference
     on Machine Learning (ICML 1993), pp. 330–337. Morgan Kaufmann,
     (June 1993).
[26] Adam Taylor, Ivana Dusparic, Edgar Galván López, Siobhán Clarke,
     and Vinny Cahill, ‘Accelerating learning in multi-objective systems
     through transfer learning’, in 2014 International Joint Conference on
     Neural Networks (IJCNN), pp. 2298–2305, Beijing, China, (2014).
     IEEE.
[27] Lisa Torrey and Matthew E. Taylor, ‘Teaching on a budget: Agents ad-
     vising agents in reinforcement learning’, in Proceedings of the 2013
     International Conference on Autonomous Agents and Multi-Agent Sys-
     tems, St. Paul, MN, USA, (May 2013). IFAAMAS.
[28] John Glen Wardrop, ‘Some theoretical aspects of road traffic research’,
     Proceedings of the Institution of Civil Engineers, Part II, 1(36), 325–
     362, (1952).
[29] Christopher J. C. H. Watkins and Peter Dayan, ‘Q-learning’, Machine
     Learning, 8(3), 279–292, (1992).
[30] Jin Y. Yen, ‘Finding the k shortest loopless paths in a network’, Man-
     agement Science, 17(11), 712–716, (1971).
[31] Matthieu Zimmer, Paolo Viappiani, and Paul Weng, ‘Teacher-student
     framework: a reinforcement learning approach’, in AAMAS Work-
     shop Autonomous Robots and Multirobot Systems, Paris, France, (May
     2014).



