=Paper=
{{Paper
|id=Vol-2701/paper_6
|storemode=property
|title=Experience Sharing in a Traffic Scenario
|pdfUrl=https://ceur-ws.org/Vol-2701/paper_6.pdf
|volume=Vol-2701
|authors=Ana L. C. Bazzan,Franziska Klügl
|dblpUrl=https://dblp.org/rec/conf/ecai/BazzanK20
}}
==Experience Sharing in a Traffic Scenario==
Ana L. C. Bazzan (UFRGS, Porto Alegre, Brazil, bazzan@inf.ufrgs.br) and Franziska Klügl (Örebro University, Sweden, franziska.klugl@oru.se)

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Travel apps become more and more popular, giving information about the current traffic state to drivers, who then adapt their route choice. In commuting scenarios, where people repeatedly travel between a particular origin and destination, learning effects add to this information. In this paper, we analyse the effects on the overall network if adaptive driver agents share their aggregated experience about route choice in a reinforcement learning (Q-learning) setup. Drivers share what they have learnt about the system, not just information about their current travel times. We can show in a standard scenario that experience sharing can improve convergence times for adaptive driver agents.

===1 Introduction===

Which route to choose for travelling from origin to destination is a central question a traffic participant – in our case, a driver – faces. The route choice of a driver, however, is not independent of the choices of others, as the load on a link determines the travel time on it and, as a consequence, the travel time on a route. This is the well-known problem of traffic assignment. Drivers adapt to their experience and choose routes which they expect to be the best choice, mostly with respect to shortest travel time. In traffic analysis this idea is manifested in the concept of user equilibrium, a situation in which no user can reduce travel time by changing its route.

There are many approaches for driving the overall system towards the system optimum by providing individual drivers with incentives or information. Information about recent travel times alone may be misleading, as it may give just a one-shot impression. A consequence is ineffective oscillation caused by overly fast and excessive reactions. A way to address this problem is to introduce a kind of inertia into the system, assuming that people are hesitant in following new information.

More and more drivers use travel apps that also show the traffic state collected from real-time speed observations, such as Google Maps or Waze. Drivers fuse the information that they receive from those apps with their individual experience to actually make their routing decision.

In this paper, we want to analyse whether sharing of experience among drivers, rather than using recent travel time information, enables the overall system to develop into the user (or Nash) equilibrium (UE). Our analysis is inspired by a hypothetical app with which drivers share what they have learnt related to their repeated route choice between some origin and destination in a traffic system – like friends chatting about which route they prefer or avoid.

As discussed later in Section 5, there are some approaches that are based on sharing or transferring knowledge to improve or accelerate the learning process. However, the majority of them deal with cooperative environments. In non-cooperative tasks, such as route choice, we observed that the transfer of a good solution from one agent to others may produce sub-optimal outcomes. Therefore, there is a need for approaches that address transfer of knowledge in other contexts. With the present paper, we provide the first steps in this direction.

The other sections of this paper are organised as follows. First, we introduce some background on route choice, traffic assignment analysis concepts and the usage of reinforcement learning. Section 3 then describes the proposed approach, while Section 4 discusses experiments performed with a standard scenario and the results gained from them. Section 5 reviews related work, and Section 6 presents the concluding remarks and points to future research.
===2 Background: Traffic Assignment, Route Choice and Reinforcement Learning===

====2.1 The Traffic Assignment Problem====

The traffic assignment problem (TAP) aims at assigning a route to each driver that wants to travel from its origin to its destination. In traffic analysis and simulation, traffic assignment is the prominent step of connecting the demand (how many drivers want to travel between two nodes in a traffic net?) and the supply (roads in a traffic net with particular capacities, speeds, etc.). The TAP can be solved in different ways. Agent-based approaches aim at finding the UE [28], in which every driver perceives that all possible routes between its origin and destination (its OD pair) have similar, minimal costs, so that the agent has no incentive to change routes. This means that the UE corresponds to each driver selfishly computing a route individually.

A traditional algorithm to solve the TAP (i.e. to find the UE) is the method of successive averages (MSA); for details see, e.g., Chapter 10 in [19]. Yet, this method is performed in a centralised way.

A learning-based approach, on the other hand, can be done in a decentralised way, with each agent learning individually by means of RL (see next section). We remark, however, that since their actions are highly coupled with other agents' actions, this is not a trivial task. Moreover, it is a non-cooperative learning task.

====2.2 Reinforcement Learning====

In reinforcement learning (RL), an agent's goal is to learn a mapping from a given state to a given action, by means of a value function. A popular algorithm to compute such value functions is Q-learning (QL) [29]. Value functions take the instantaneous reward received (a numerical signal) and compute the expected, discounted value of the corresponding action. Once such a mapping is learned, an agent can decide which action to select in order to maximize its rewards.

We can model RL as a Markov decision process (MDP) composed of a tuple (S, A, T, R), where S is a set of states; A is a set of actions; T is the transition function that models the probability of the system moving from a state s ∈ S to a state s′ ∈ S upon performing action a ∈ A; and R is the reward function that yields a real number associated with performing an action a ∈ A when one is in state s ∈ S.

Q-learning computes a Q-value Q(s, a) for each state-action pair. This value represents a reward estimate for executing the action a in state s. The updating of Q(s, a) is done through Eq. 1, where α ∈ [0, 1] is the learning rate and γ ∈ [0, 1] is the discount factor.

Q(s, a) ← Q(s, a) + α (r + γ max_{a′} Q(s′, a′) − Q(s, a))    (1)

In order to address the exploration–exploitation dilemma, the ε-greedy exploration strategy can be used to choose actions: the action with the highest Q-value is selected with a probability of 1 − ε, and a random action is selected with probability ε.

Finally, even if agents may be learning independently (as, e.g., in [7]), this is a non-trivial instance of multiagent RL (MARL).
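To make this concrete, the following minimal sketch (not from the paper) implements the tabular update of Eq. 1 together with ε-greedy action selection in Python. Class and parameter names are illustrative, and the discount factor is a free parameter here, as the paper does not report its value.

<syntaxhighlight lang="python">
import random

class QLearningAgent:
    """Minimal tabular Q-learning agent with epsilon-greedy action selection.

    Illustrative sketch of Eq. 1. In the stateless route-choice formulation
    of Section 3.1 there is a single state (the agent's origin), so the
    Q-table reduces to one value per action.
    """

    def __init__(self, actions, alpha=0.5, gamma=0.9, epsilon=0.05, q_init=0.0):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate (alpha in Eq. 1)
        self.gamma = gamma      # discount factor (gamma in Eq. 1); value assumed
        self.epsilon = epsilon  # exploration probability for epsilon-greedy
        self.q = {a: q_init for a in self.actions}

    def choose_action(self):
        # With probability epsilon explore, otherwise exploit the best Q-value.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[a])

    def update(self, action, reward):
        # Eq. 1 with s' = s (single state): max_a' Q(s', a') is the best current Q-value.
        best_next = max(self.q.values())
        self.q[action] += self.alpha * (reward + self.gamma * best_next - self.q[action])
</syntaxhighlight>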
===3 Sharing Experiences===

====3.1 Q-Learning in a Route Choice Scenario====

Given is a road network G in which nodes represent points of interest, which can be junctions, neighbourhoods, or areas; links represent road segments with a free-flow travel time depending on distance and capacity, and an additional cost function that relates the number of vehicles on a link to its cost (travel time). Some of the nodes serve as origins, some as destinations. Between those origin and destination nodes, we consider sequences of links as possible routes. Different routes connecting origin and destination are not independent from one another; they may contain shared links. Such dependencies exist between routes of the same origin-destination combination as well as with routes of others.

Drivers use Q-learning to individually determine which is the best route they can take for their individual origin and destination pair. Hereby, S is the particular origin node of the individual driver. Each driver learns independently. Each agent's state is determined by its origin, i.e., this is a stateless formulation of QL, as in [7].

The set of actions A contains the k shortest routes between origin and destination. Therefore the number k needs to be sufficiently large to give sufficient flexibility in route choice. The transition function maps the origin to the destination for all actions. The reward is a function of the experienced travel time on the route (we use the negative of the travel time, for maximization purposes).

It is important to actually understand the multi-agent characteristics of this learning task. Because links have travel times that depend on the number of agents using them, learning is challenging. The desirable shortest path under free flow may end up producing a high travel time if too many agents want to use it. Agents need to learn how to distribute themselves in a way that an optimal number of agents uses each edge, minimizing their individual travel time on the selected routes. Coordination is desirable not just between agents with the same origin-destination pair, but needs to happen between all agents, even if their routes share only a single link.
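As an illustration of this coupling, the following sketch (our own, not part of the paper) simulates one episode: all agents choose a route, link flows are accumulated, travel times follow the linear volume-delay function introduced later in Section 4.1, and each agent receives the negative route travel time as reward. It reuses the QLearningAgent sketch above; all names and data structures are illustrative assumptions.

<syntaxhighlight lang="python">
from collections import defaultdict

def run_episode(agents, routes, free_flow_time, delay_per_vehicle=0.02):
    """One episode ('day') of the route-choice game (illustrative sketch).

    agents: dict agent_id -> QLearningAgent whose actions index into routes[agent_id]
    routes: dict agent_id -> list of routes, each route being a list of edge ids
    free_flow_time: dict edge_id -> free-flow travel time of that edge
    """
    # 1. Every agent selects a route (action) according to its current policy.
    choices = {aid: agent.choose_action() for aid, agent in agents.items()}

    # 2. Link loads result from the joint choices: this is what couples the agents.
    flow = defaultdict(int)
    for aid, action in choices.items():
        for edge in routes[aid][action]:
            flow[edge] += 1

    # 3. Travel time per edge via the volume-delay function of Section 4.1:
    #    t_e = tf_e + 0.02 * q_e (0.02 is the factor reported in the paper).
    def edge_time(edge):
        return free_flow_time[edge] + delay_per_vehicle * flow[edge]

    # 4. Reward is the negative travel time of the chosen route.
    travel_times = {}
    for aid, action in choices.items():
        travel_times[aid] = sum(edge_time(e) for e in routes[aid][action])
        agents[aid].update(action, -travel_times[aid])
    return choices, travel_times
</syntaxhighlight>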
====3.2 Information Sharing via App====

With the dissemination of various traffic apps, it is no longer realistic to assume that drivers only learn from their own experience when repeatedly travelling from origin to destination, just observing their own travel time. Carrying the idea of traffic apps beyond the current traffic state, more information produced by others can be used for decision making. While there have been a number of works in which the effect of information about current travel times on a route or on parts of a route was tested (see Section 5), the question emerged whether explicitly sharing accumulated information about particular routes could improve the learning process. We envision a travel app in which participating drivers can share experience about their favourite or most disappointing routes. The setup is visualised in Figure 1. Each agent is a learner that maintains a Q-table that is updated at each subsequent episode.

Figure 1. Setup of the system: Drivers decide about sharing their experience via an app with a server. The server processes the collected information and publishes the aggregated experience. Drivers may access this information depending on their information strategy.

The overall decision making consists of three steps:

1. Agents select an action from their set of possible actions. Each agent experiences the time that it took to perform the travel using the route that corresponds to the selected action. The agent updates its Q-table according to the QL rules. At the end of the day, they share an evaluation of a route with the app. The shared information does not need to be about the route they took most recently; it may be about their best or their worst or a randomly selected route. The app can be seen as a platform for chatting "neighbours" who talk about their experiences when travelling to a particular destination. Drivers may also decide to share no information at all. More precisely, sharing their accumulated experience means that they share the Q-value assigned to a particular action. The agents do not reason about the potential effect of sharing information on the reward they can receive. They do not assume that the publication of the information influences others' decision making to an extent that the evaluation of the published route changes dramatically.
2. The information server behind the app collects all information from the information-sharing drivers and determines the shared information that the server then publishes. This information contains, per origin-destination pair, an action and the Q-value that belongs to that action, aggregated from all handed-in information.
3. Driver agents decide whether to access the published information. Accessing has the consequence that they incorporate the published information in their Q-table: they replace the Q-value of the concerned action with the published Q-value for the given action. There is no reasoning about the credibility of the published value in relation to the existing Q-value. The decision to access the shared information is based on a budget acquired by sharing – the more one shares, the more one may access the shared value. In the current version, the agents do not reason about whether the investment into acquiring published information may pay off; they rather access the information with a given frequency/probability, provided they have a budget.

This can be seen as a rudimentary form of transfer learning, with transfer happening during the learning process, not from one task to another. Due to the way we modelled actions, transfer between different problems – here, different origin-destination pairs – makes no sense.

This is a multiagent reinforcement learning setup in which agents have to learn anti-coordination. Sharing happens in groups: only agents within the same group, that means with the same origin and destination pair, share partial information from their Q-tables.
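The three steps above can be condensed into the following sketch of the app-side logic and the driver-side access step (again illustrative, with assumed names). The aggregation shown here is the "best of all best" variant; other variants are discussed in Section 3.3.3.

<syntaxhighlight lang="python">
from collections import defaultdict

class ExperienceApp:
    """Sketch of the information server: it collects (action, Q-value) pairs per
    OD pair and publishes one aggregated entry per OD pair after each episode."""

    def __init__(self):
        self.inbox = defaultdict(list)   # od_pair -> list of (action, q_value)
        self.published = {}              # od_pair -> (action, q_value)

    def share(self, od_pair, action, q_value):
        # Step 1 (driver side): hand in the Q-value assigned to one action.
        self.inbox[od_pair].append((action, q_value))

    def aggregate_and_publish(self):
        # Step 2 (server side): here, publish the best handed-in Q-value per OD pair.
        for od_pair, entries in self.inbox.items():
            self.published[od_pair] = max(entries, key=lambda e: e[1])
        self.inbox.clear()


def access_published(agent, app, od_pair):
    """Step 3 (driver side): overwrite the own Q-value of the published action."""
    if od_pair in app.published:
        action, q_value = app.published[od_pair]
        agent.q[action] = q_value
</syntaxhighlight>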
====3.3 Sharing Issues in Detail====

More details need to be considered to provide full information about the analysed scenarios.

The overall dynamics are as follows: agents select their route before the actual travelling happens. Then, all agents travel through the network. Using cost functions for determining the travel time on a link from its load, we abstract away from detailed temporal dynamics in which departure time, etc., matters. As we want to focus on the effect of sharing experiences and different aspects related to that, we did not use a microscopic traffic simulation to determine how an individual driver moves through the network. As a consequence, all agents taking the same route encounter the same travel times. The reward from travelling is sent to the drivers after all have finished. The agents update their Q-tables and send information to the app. In the next episode, the drivers who want to access the published information look into the service and again update their Q-tables based on the published value. So, there are a number of decisions involved:

=====3.3.1 When does a driver access the shared information?=====

Assuming that the driver needs to pay for the service of getting such information, the decision about whether and when it is best to incorporate information from other agents into one's Q-table may be a strategic one.

1. The simplest strategy could be to simply access information in every round.
2. The agent could collect some budget – e.g. from sharing information – that it could invest into accessing information. So whenever the agent has collected enough budget, it accesses. The relation between the gain per sharing and the cost of accessing is the relevant parameter here.
3. The agent could access the information in the beginning, with decreasing frequency the longer the learning proceeds.
4. One can imagine more advanced strategies based on dynamics that the agent observes. For example, if the action with the best Q-value was "disappointing" a number of times, the agent may trigger a look-up on the app.
5. An alternative could be to have a look if the Q-values are not sufficiently distinct to have a clear favourite route.

We decided to focus on the simpler alternatives and leave the more sophisticated approaches to future work, as they involve a lot of individual parameters to fully specify such strategies. Hence, in this work we follow alternative 2, varying the budget an agent has, starting with unlimited budget and then decreasing it so that agents only access the information in some episodes. Although we consider that all have the same budget, the access is not synchronous, i.e., not everybody spends their budget in the same episode (except in case the budget is unlimited and thus every agent accesses in every episode).
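One possible reading of alternative 2 (the one followed in this work) is sketched below: sharing earns budget, accessing spends it, and an agent with sufficient budget accesses with a fixed frequency, so that accesses are not synchronised across agents. The gain, cost and ratio values are placeholders, not values from the paper.

<syntaxhighlight lang="python">
import random

class BudgetPolicy:
    """Sketch of access strategy 2: budget is earned by sharing and spent by accessing.

    access_ratio approximates 'x accesses in 10 episodes' (e.g. 2 in 10 -> 0.2);
    since agents draw independently, accesses are not synchronised.
    """

    def __init__(self, access_ratio=0.2, gain_per_share=1.0, cost_per_access=1.0):
        self.access_ratio = access_ratio
        self.gain_per_share = gain_per_share
        self.cost_per_access = cost_per_access
        self.budget = 0.0

    def on_share(self):
        # Earn budget every time the agent hands in an experience.
        self.budget += self.gain_per_share

    def wants_access(self):
        # Access with the given frequency, provided enough budget has been earned.
        if self.budget >= self.cost_per_access and random.random() < self.access_ratio:
            self.budget -= self.cost_per_access
            return True
        return False
</syntaxhighlight>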
=====3.3.2 Whether, when and with whom to share information?=====

The agents learn about good actions/routes for their origin-destination combination individually. In principle, they do not need to share – yet without a sufficiently large subset of the driver population sharing, the information that the app publishes may not possess any value. So, it is in the interest of the app providers that as many agents as possible hand in their experience.

There might be some required consent for basic service usage to send information about the agents' experience to the app. This has the consequence that basically everybody would share experiences, much like GPS traces are collected without full awareness of the (careless) traveller. As an alternative, one can imagine that there is a small compensation for actively sharing experiences. The budget system indicated above would need a corresponding side for earning what will be spent later. For every time sending information, the agent could increase its budget for retrieving information. The compensation for sharing could also be independent from the information usage. It could be financial, or give social credits turning the individual agent into somebody more compliant with societal values. For this first analysis, we just analysed situations in which every agent is basically forced to contribute with its experience, that means all agents share.

A variant of this question is related to with whom an agent may share. The idea of an app that publishes a value aggregated from all handed-in experiences is one extreme: the agent shares with everybody that travels between the same origin and destination. Whether the information that the agent contributes is actually valuable is a decision of the aggregation mechanism used by the app. As an alternative, one can imagine restricting the information dissemination to only a group of agents, denoting a group of friends or neighbours. This could bring more heterogeneity into the information transfer. As an extreme case, agents could pair up and just share information with one other agent; this, however, thwarts the app idea. Additionally, we did not expect a large impact and therefore just tested without restrictions. In future work, we need to test this assumption.

=====3.3.3 What information shall be shared and how shall the information be aggregated?=====

This is the central question for the setup of the app. What information shall be shared and how does the app process all the information collected?

The first question relates to what information the agents send to the app. It is obvious that this can be information about the route with the highest Q-value, basically telling others that this route is the best for the agent's origin-destination pair according to the agent's experience. Yet, also other information may be interesting to know: there are situations in which it is better to avoid choices. So, we foresee the following alternatives to be relevant:

1. Share the last chosen action including its Q-value
2. Share a randomly selected action with its Q-value
3. Share the action with the best Q-value
4. Share the action with the worst Q-value

As the first alternative is in most cases the same as the third one, and otherwise a random selection, we deemed this alternative not interesting. We tested all others. The agents hereby share the information. If they access the aggregated information, they integrate it into their own individual Q-table, replacing the Q-value of the concerned route/action. So, they do not automatically select the route that they were told about, but only if there is no better one, and they continue with the usual learning and decision-making process.

As written above, the app collects and aggregates information from all sharing drivers. Also for this overall process step, there are different alternatives:

1. The app publishes a random value
2. The app publishes the overall best Q-value and corresponding action
3. The app publishes the overall lowest Q-value with corresponding action
4. The app averages the Q-values for each possible action and then gives the best or worst of these averages

Together with the alternatives for which values to share, this produces a number of interesting alternative combinations: it might be interesting to share randomly selected information from all best or all worst Q-values, yet we tested the "pure" combinations: the app provides a fully randomly selected piece of information, the best of all best, and the worst of all worst Q-values.
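The sharing and aggregation alternatives listed above can be written down compactly. The sketch below (illustrative, with assumed function names) covers the variants that were tested (best, worst, random) plus the per-action averaging option.

<syntaxhighlight lang="python">
import random

def select_share(agent, mode="best"):
    """Which (action, Q-value) pair a driver hands in (alternatives 2-4 of Section 3.3.3)."""
    if mode == "random":
        action = random.choice(list(agent.q))
    elif mode == "worst":
        action = min(agent.q, key=agent.q.get)
    else:  # "best"
        action = max(agent.q, key=agent.q.get)
    return action, agent.q[action]


def aggregate(entries, mode="best"):
    """How the app condenses all handed-in (action, Q-value) pairs of one OD pair."""
    if mode == "random":
        return random.choice(entries)
    if mode == "worst":
        return min(entries, key=lambda e: e[1])
    if mode == "mean_per_action":
        # Average the Q-values per action, then publish the action with the best average.
        sums, counts = {}, {}
        for action, q in entries:
            sums[action] = sums.get(action, 0.0) + q
            counts[action] = counts.get(action, 0) + 1
        return max(((a, sums[a] / counts[a]) for a in sums), key=lambda e: e[1])
    return max(entries, key=lambda e: e[1])  # default: "best"
</syntaxhighlight>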
===4 Experiments and Results===

====4.1 Scenario====

We performed a number of experiments to understand how the sharing actually impacts the adaptation and eventually the performance of the drivers using the app.

The OW network (proposed in Chapter 10 in [19]), seen in Figure 2, represents two residential areas (nodes A and B) and two major shopping areas (nodes L and M). So, there are four OD pairs, connecting residential areas with shopping areas. The numbers at the edges are their travel times between two nodes under free flow (in both directions). We assume no additional delays when passing nodes, e.g. due to traffic lights. We computed k = 8 shortest paths per OD pair, following the algorithm by [30].

Figure 2. OW Road Network (as proposed by Ortúzar and Willumsen) [19]

The volume-delay (or cost) function is t_e = tf_e + 0.02 × q_e, where t_e is the travel time on edge e, tf_e is the edge's travel time under free-flow conditions, and q_e is the flow using edge e. This means that the travel time on each edge increases by 0.02 of a minute for each vehicle/hour of flow.

For the sake of getting a sense of what would be optimal, the free-flow travel times (FFTT) for each OD pair are shown in Table 1. We note, however, that these times cannot be achieved because the free-flow condition is not realistic: when each agent tries to use its route with the lowest travel time, jams occur and the times increase. In addition, the table shows the number of agents that travel between the four particular origin and destination combinations and the travel time in UE. The latter is averaged, as drivers take different routes between their OD pairs. Also, the fact that 1700 agents learn simultaneously makes the overall learning complex.

Table 1. OW network: some characteristics
{| class="wikitable"
! OD pair !! Agents !! Shortest path !! FFTT !! Avg. travel time (in UE)
|-
| AL || 600 || ACGJIL || 28 || 71.0
|-
| AM || 400 || ACDHKM || 26 || 64.94
|-
| BL || 300 || BDGJIL || 32 || 68.78
|-
| BM || 400 || BEHKM || 23 || 62.3
|-
| Total || 1700 || || || 67.13
|}

For the QL, Q-tables are initialised with a random value around -90. All experiments were conducted with α = 0.5 (learning rate) and ε = 0.05 (for ε-greedy action selection), following previous works (e.g., [2, 1]) that have extensively tried other values. As mentioned above, k = 8 is the number of shortest paths, that is, the number of possible actions.

We measure the overall performance of the different approaches with the average travel time over all 1700 agents. Each experiment was repeated 30 times. Standard deviations are not shown to keep the plots cleaner; we remark that they are of the order of 1% between the results of different runs in terms of average travel times. There are partially dramatic oscillations between episodes.
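For reference, the reported configuration can be wired together as follows. This is a sketch under the stated parameters only; the Q-value initialisation interval and the discount factor are our assumptions, as the paper reports just "around -90" and does not state γ. It reuses the QLearningAgent sketch from Section 2.2.

<syntaxhighlight lang="python">
import random

# Demands and parameters as reported in Section 4.1: 1700 agents over four OD
# pairs, k = 8 precomputed routes per OD pair, alpha = 0.5, epsilon = 0.05,
# Q-tables initialised around -90, 30 repetitions per experiment.
DEMAND = {"AL": 600, "AM": 400, "BL": 300, "BM": 400}
K = 8

def build_agents():
    agents, od_of = {}, {}
    agent_id = 0
    for od_pair, n_drivers in DEMAND.items():
        for _ in range(n_drivers):
            agents[agent_id] = QLearningAgent(
                actions=range(K), alpha=0.5, epsilon=0.05,
                q_init=random.uniform(-95.0, -85.0))  # 'around -90' (interval assumed)
            od_of[agent_id] = od_pair
            agent_id += 1
    return agents, od_of

def average_travel_time(travel_times):
    # Performance measure: mean travel time over all 1700 agents in one episode.
    return sum(travel_times.values()) / len(travel_times)
</syntaxhighlight>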
====4.2 Experimental Results====

=====4.2.1 Sharing Best Experience=====

We start by looking into cases in which the agents share their best experience. We tested different settings of the budget strategy: agents need different numbers of share-actions to be able to afford looking into what the app publishes. Figure 3 shows the results of testing scenarios in which an agent can access the information 2, 4, 6 or 8 times within 10 rounds. Agents decide individually; not all access at the same time, but on average over the episodes they access with the given ratio. For comparison, we also display the learning curve if pure Q-learning is applied, that means agents learn without sharing Q-values via the app.

Figure 3 shows the average travel time over all agents as well as over 30 experiments. It allows comparing the learning curves of different budget limitations for accessing the information that the app publishes. We can see that after 500 episodes pure QL shows convergence towards the UE, which is approximately 67 in the case of the OW network (see Table 1).

Figure 3. OW network: Development of overall averaged travel times over episodes for different sharing setups. Unlimited budget means that agents access the shared experience every round; no budget means no sharing.

As for the cases in which there is sharing of experiences, the following observations can be made. First, no matter how frequently the agents access the shared experience, we see an acceleration of the learning process in the initial episodes. This is clear in Figure 3 and is due to the fact that agents do not have to try out some actions, given that they share better options. Unfortunately, depending on the frequency of access to this shared experience, this could lead to sub-optimal learning, causing the collective of agents not to converge to the UE.

Let us now discuss individual cases. When agents have unlimited budget to access the shared experience (and fully use this budget), then the performance is bad (see green line in Figure 3). The reason is that too many agents are informed about the best performance in the previous episode and hence try to do the same action. Thus, the supposedly best action is followed by virtually all agents and they then end up selecting the same route (for each OD pair). This can lead to bad performance, not to mention a lot of oscillations (that are seen in the plot). This is a well-known phenomenon and could be confirmed in our experiments.

When the budget is limited so that it allows access to the shared experiences in 6 or 8 out of 10 episodes, then there is an increase in performance compared to the case in which agents have unlimited access. We recall that such accessing is not synchronous, i.e., some agents may access while others are not doing so. Still, the performance is sub-optimal (brown and magenta curves in the figure), with the collective of agents missing the UE in the end. It must be said, though, that the performance in the initial episodes is not bad, i.e., there is an acceleration of the learning due to the fact that some actions need not be tried, as aforementioned.

When the agents have a budget that allows them to access the shared experiences only from time to time (e.g., 2 or 4 in 10 episodes, see blue and orange curves in Figure 3), then the overall performance is much improved, especially if the driver-vehicle units cannot afford to access the shared experiences in more than 2 out of 10 episodes. Not only is there an acceleration of the learning process already in the initial episodes, but there is also convergence towards the UE.

We assume that the best setting of the budget limits depends on the scenario details, especially on the particular OD pair. It will be interesting to analyse this situation in more depth. A budget strategy in which agents share in the beginning, but then gradually give up sharing over time, may lead to the results that we want to achieve, with fast convergence to the UE.
=====4.2.2 Sharing Worst or Random Experience=====

We also tested the effect of what happens if agents share their worst experience, which the app aggregates to the worst action/route for each particular OD pair. The results are almost identical to vanilla QL. So, sharing has apparently no effect when sharing the worst experience. The reason is that when exploiting – which is, according to our settings, in 95% of the episodes – the agents select the best route. Which route has a bad Q-value does not matter hereby. So, the only situation in which such an update of the Q-table would change the decision of an agent is if the route with the previously best Q-value is the one that the app informs about. One could imagine that this happens if the previously best route is heavily overloaded, but this was not the case here. Maybe the number of considered possible routes is too high for such a setup. Nevertheless, reducing the set of possible routes makes no sense in such a learning setting.

Also, randomly selecting a route from the Q-table and sharing it with the app, which then publishes a randomly selected experience from all handed-in values, generates a similar outcome as sharing the worst experience. There is no difference to vanilla QL, that means a QL without sharing experiences. The explanation is similar.

===5 Related Work===

Besides the classical methods to address the TAP, there have been some works – based on other AI paradigms – with the goal of finding the UE. The main motivation is to solve the TAP from the perspective of the individual agent (driver, driver-vehicle unit), thus relaxing the assumption that a central entity is in charge of assigning routes to those agents. In such decentralized approaches, the agent itself has to collect experiences in order to reach the UE condition. Thus, these approaches are suitable for commuting scenarios, where each agent can collect experiences regarding the same kind of task (in this case, driving from a given origin to a given destination).

One popular approach to tackle such decentralized decision-making is via RL (more specifically MARL). However, others are also mentioned in the literature. We start with these and later discuss those based on RL.

In the work of [9], a neural network is used to predict drivers' route choices, as well as compliance to such predictions, under the influence of messages containing traffic information. However, the authors focus on the impact of the message on the agents rather than the impact on system-level traffic distribution and travel time.

The work of [10] uses the Inverted Ant Colony Optimization (IACO). Ants (vehicles) deposit their pheromones on the routes they are using, and the pheromone is used to repel them. Consequently, they avoid congested routes. Also [8] applied ant colony optimisation to the TAP. However, in both cases, the pheromone needs to be centrally stored, thus this approach is not fully suitable for decentralised modelling.

One game-theoretic approach to the route choice problem appears in [12]. This approach uses only past experiences for the route choice. The choice itself is made at each intersection of the network. However, it assumes that historical information is available to all drivers.
The work of [24] uses adaptive tolls to optimise drivers' routes as tolls change. Differently from our purpose, they are concerned with alignment of choices towards the system optimum, which can only be achieved by imposing costs on drivers (in their case, tolls). In the same line, [5] deal with the problem from a centralised perspective to find an assignment that aligns user and system utilities by imposing tolls on some links. In the frontier of aligning the optimum of the system with the UE, [1] has introduced an approach that seeks a balance in which not only the central authority benefits but also the individual agents.

RL-based approaches to compute the UE are becoming increasingly popular. Here, each agent seeks to learn to select routes (these are the actions) based on the rewards obtained in each daily commute, where the reward is normally based on the experienced travel time. Two main lines of approaches can be distinguished. One, less popular due to its complexity, follows a traditional RL recipe, where besides a set of actions there is also a set of states, these being the vertices in which the agent finds itself. Works in this category include [4]. However, the majority of the papers in the literature follow a so-called stateless approach in which there is a single state (the OD pair the agent belongs to), while the agent merely has to select actions, which are normally associated with a set of pre-computed routes that can be recalculated en route, or not. This literature includes approaches based both on Learning Automata ([17], [23]) and on QL.

QL is increasingly being used for the task of route choice. Of particular interest are those works that compute the regret associated with the greedy nature of selecting routes in a selfish way ([21, 20]), as well as those that combine selfish route choice with some sort of biasing from a centralized entity ([1, 6, 22]).

There are works that, as we propose in the current paper, also deal with some forms of communication between agents [14, 3, 13, 16]. Information is here not always truthful, but can be manipulated for driving agents towards an overall intended outcome.

The idea of sharing either learned policies or Q-values is not new. In fact, as early as 1993, Tan [25] suggested that communication of some kind of knowledge could help, especially in cooperative environments. In particular, sharing Q-values may reduce the time needed to explore the state-action space.

Some researchers have dealt with aspects such as what and when to share. In an abstract view of sharing, [15] deal with agents that keep a list of states in which they need to coordinate. This idea also appears in the context of traffic management in [18], where traffic signal agents keep joint Q-tables based on coupled states and actions.

Some works – for instance, [26] – deal with more fine-grained views of sharing knowledge, as for instance the transfer learning community, in which the quest regarding what and when to transfer gets more precise. Transfer of reward values or policies is also explored in the literature. For instance, [27, 11, 31] show that learning can be accelerated if a teacher shares experiences with a student. We remind that virtually all these works deal with cooperative environments, where it makes sense to transfer knowledge. In non-cooperative tasks, such as route choice, we have seen that transfer of a good solution from one agent to others may produce highly sub-optimal outcomes. Coordination here is actually anti-coordination, appropriately distributing agents between different alternatives.

===6 Conclusion and Ideas for the Future===

The idea of this paper is to discuss the effects of a possible next type of travel app, in which users intentionally share their experiences, as opposed to conventional travel apps, which are based on collecting drivers' positions in order to display the current average speed at a particular segment. The app that we are considering here can be thought of as a device that replaces direct interaction between colleagues or neighbours chatting about habits and experiences regarding route choice.
Assuming that humans continuously adapt to what they experience when performing (commuting) tasks, we analysed the potential effect of such an app. So, agents do not just learn based on their own experience, but also use others' experience on the same task for decision making. As we have shown, integrating others' experiences from time to time speeds up the learning progress.

In previous works, drivers were informed about travel times that were collected by a central authority (as, e.g., Waze). This is effective only if some inertia is introduced. We can observe a similar effect here: the agents who do not excessively integrate others' experience learn the best choice for their route. Agents that update their Q-table with additional information from the app in every round perform worst.

These findings are quite preliminary. More investigation is necessary to be able to claim general conclusions. We will test other interesting setups, combining different sharing and aggregation strategies. For example, the agents may share their best experience, but instead of aggregating the information, the app publishes a randomly selected value from those sent. Another interesting idea could be the organisation of access into groups: instead of aggregating from all received values and publishing to all who want to access, agents could be organised into smaller groups of friends who share/aggregate/publish only within the respective group.

As we have indicated in the previous sections, we also plan to test more dynamic and adaptive strategies: agents may have a high initial budget and decide to use the budget with different strategies. A first such strategy could be that the agent uses its budget in the beginning of the learning process, when it is exploring more. A second alternative is that agents could save their budgets and use them when they notice that the Q-value of their best route has been decreasing for a given number of rounds.

More innovatively, the app could become an agent itself and actively adjust the parameters of the budget strategies to improve the overall learning process. The app collects a lot of information from the driver agents that can be used to drive the system towards the desired state, e.g. by telling the agents how to acquire more or fewer credits.

===Acknowledgements===

Ana Bazzan was partially supported by CNPq under grant no. 307215/2017-2.
===References===

[1] Ana L. C. Bazzan, 'Aligning individual and collective welfare in complex socio-technical systems by combining metaheuristics and reinforcement learning', Engineering Applications of Artificial Intelligence, 79, 23–33, (2019).
[2] Ana L. C. Bazzan and Camelia Chira, 'Hybrid evolutionary and reinforcement learning approach to accelerate traffic assignment (extended abstract)', in Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2015), eds., R. Bordini, E. Elkind, G. Weiss, and P. Yolum, pp. 1723–1724, IFAAMAS, (May 2015).
[3] Ana L. C. Bazzan, M. Fehler, and F. Klügl, 'Learning to coordinate in a network of social drivers: The role of information', in Proceedings of the International Workshop on Learning and Adaptation in MAS (LAMAS 2005), eds., Karl Tuyls, Pieter Jan't Hoen, Katja Verbeeck, and Sandip Sen, number 3898 in Lecture Notes in Artificial Intelligence, pp. 115–128, (2006).
[4] Ana L. C. Bazzan and R. Grunitzki, 'A multiagent reinforcement learning approach to en-route trip building', in 2016 International Joint Conference on Neural Networks (IJCNN), pp. 5288–5295, (July 2016).
[5] Luciana S. Buriol, Michael J. Hirsh, Panos M. Pardalos, Tania Querido, Mauricio G. C. Resende, and Marcus Ritt, 'A biased random-key genetic algorithm for road congestion minimization', Optimization Letters, 4, 619–633, (2010).
[6] Daniel Cagara, Bjorn Scheuermann, and Ana L. C. Bazzan, 'Traffic optimization on islands', in 7th IEEE Vehicular Networking Conference (VNC 2015), pp. 175–182, Kyoto, Japan, (December 2015). IEEE.
[7] Caroline Claus and Craig Boutilier, 'The dynamics of reinforcement learning in cooperative multiagent systems', in Proceedings of the Fifteenth National Conference on Artificial Intelligence, AAAI '98/IAAI '98, pp. 746–752, Menlo Park, CA, USA, (1998). American Association for Artificial Intelligence.
[8] Luca D'Acierno, Bruno Montella, and Fortuna De Lucia, 'A stochastic traffic assignment algorithm based on ant colony optimisation', in Ant Colony Optimization and Swarm Intelligence, 5th International Workshop, ANTS 2006, eds., M. Dorigo, L. M. Gambardella, M. Birattari, A. Martinoli, R. Poli, and T. Stützle, volume 4150 of Lecture Notes in Computer Science, pp. 25–36, Berlin, (2006). Springer-Verlag.
[9] H. Dia and S. Panwai, Intelligent Transport Systems: Neural Agent (Neugent) Models of Driver Behaviour, LAP Lambert Academic Publishing, 2014.
[10] José Capela Dias, Penousal Machado, Daniel Castro Silva, and Pedro Henriques Abreu, 'An inverted ant colony optimization approach to traffic', Engineering Applications of Artificial Intelligence, 36, 122–133, (July 2014).
[11] Anestis Fachantidis, Matthew E. Taylor, and Ioannis P. Vlahavas, 'Learning to teach reinforcement learning agents', Machine Learning and Knowledge Extraction, 1(1), 21–42, (2019).
[12] Syed Md. Galib and Irene Moser, 'Road traffic optimisation using an evolutionary game', in Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO '11, pp. 519–526, New York, NY, USA, (2011). ACM.
[13] Ricardo Grunitzki and Ana L. C. Bazzan, 'Combining car-to-infrastructure communication and multi-agent reinforcement learning in route choice', in Proceedings of the Ninth Workshop on Agents in Traffic and Transportation (ATT-2016), eds., Ana L. C. Bazzan, Franziska Klügl, Sascha Ossowski, and Giuseppe Vizzari, New York, (July 2016). CEUR-WS.org.
[14] F. Klügl and Ana L. C. Bazzan, 'Simulation studies on adaptative route decision and the influence of information on commuter scenarios', Journal of Intelligent Transportation Systems: Technology, Planning, and Operations, 8(4), 223–232, (October/December 2004).
[15] Jelle R. Kok and Nikos Vlassis, 'Sparse cooperative Q-learning', in Proceedings of the 21st International Conference on Machine Learning (ICML), pp. 481–488, New York, USA, (July 2004). ACM Press.
[16] Andrew Koster, Andrea Tettamanzi, Ana L. C. Bazzan, and Célia da Costa Pereira, 'Using trust and possibilistic reasoning to deal with untrustworthy communication in VANETs', in Proceedings of the 16th IEEE Annual Conference on Intelligent Transport Systems (IEEE-ITSC), pp. 2355–2360, The Hague, The Netherlands, (2013). IEEE.
[17] Kumpati S. Narendra and Mandayam A. L. Thathachar, Learning Automata: An Introduction, Prentice-Hall, Upper Saddle River, NJ, USA, 1989.
[18] Denise de Oliveira and Ana L. C. Bazzan, 'Multiagent learning on traffic lights control: effects of using shared information', in Multi-Agent Systems for Traffic and Transportation, eds., Ana L. C. Bazzan and Franziska Klügl, 307–321, IGI Global, Hershey, PA, (2009).
[19] Juan de Dios Ortúzar and Luis G. Willumsen, Modelling Transport, John Wiley & Sons, 3rd edn., 2001.
[20] Gabriel de O. Ramos, Ana L. C. Bazzan, and Bruno C. da Silva, 'Analysing the impact of travel information for minimising the regret of route choice', Transportation Research Part C: Emerging Technologies, 88, 257–271, (March 2018).
[21] Gabriel de O. Ramos, Bruno C. da Silva, and Ana L. C. Bazzan, 'Learning to minimise regret in route choice', in Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2017), eds., S. Das, E. Durfee, K. Larson, and M. Winikoff, pp. 846–855, São Paulo, (May 2017). IFAAMAS.
[22] Gabriel de O. Ramos, Bruno C. da Silva, Roxana Rădulescu, and Ana L. C. Bazzan, 'Learning system-efficient equilibria in route choice using tolls', in Proceedings of the Adaptive Learning Agents Workshop 2018 (ALA-18), Stockholm, (July 2018).
[23] Gabriel de O. Ramos and Ricardo Grunitzki, 'An improved learning automata approach for the route choice problem', in Agent Technology for Intelligent Mobile Services and Smart Societies, eds., Fernando Koch, Felipe Meneguzzi, and Kiran Lakkaraju, volume 498 of Communications in Computer and Information Science, 56–67, Springer Berlin Heidelberg, (2015).
[24] Guni Sharon, Josiah P. Hanna, Tarun Rambha, Michael W. Levin, Michael Albert, Stephen D. Boyles, and Peter Stone, 'Real-time adaptive tolling scheme for optimized social welfare in traffic networks', in Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2017), eds., S. Das, E. Durfee, K. Larson, and M. Winikoff, pp. 828–836, São Paulo, (May 2017). IFAAMAS.
[25] Ming Tan, 'Multi-agent reinforcement learning: Independent vs. cooperative agents', in Proceedings of the Tenth International Conference on Machine Learning (ICML 1993), pp. 330–337, Morgan Kaufmann, (June 1993).
[26] Adam Taylor, Ivana Dusparic, Edgar Galván López, Siobhán Clarke, and Vinny Cahill, 'Accelerating learning in multi-objective systems through transfer learning', in 2014 International Joint Conference on Neural Networks (IJCNN), pp. 2298–2305, Beijing, China, (2014). IEEE.
[27] Lisa Torrey and Matthew E. Taylor, 'Teaching on a budget: Agents advising agents in reinforcement learning', in Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, St. Paul, MN, USA, (May 2013). IFAAMAS.
[28] John Glen Wardrop, 'Some theoretical aspects of road traffic research', Proceedings of the Institution of Civil Engineers, Part II, 1(36), 325–362, (1952).
[29] Christopher J. C. H. Watkins and Peter Dayan, 'Q-learning', Machine Learning, 8(3), 279–292, (1992).
[30] Jin Y. Yen, 'Finding the k shortest loopless paths in a network', Management Science, 17(11), 712–716, (1971).
[31] Matthieu Zimmer, Paolo Viappiani, and Paul Weng, 'Teacher-student framework: a reinforcement learning approach', in AAMAS Workshop on Autonomous Robots and Multirobot Systems, Paris, France, (May 2014).