On Estimating Action Regret and Learning From It in Route Choice

Gabriel de O. Ramos and Ana L. C. Bazzan
Instituto de Informática, Universidade Federal do Rio Grande do Sul
{goramos,bazzan}@inf.ufrgs.br

Abstract

The notion of regret has been extensively employed to measure the performance of reinforcement learning agents. The regret of an agent measures how much worse it performs by following its current policy rather than the best possible policy. As such, measuring regret requires complete knowledge of the environment, an assumption that is not realistic in most multiagent scenarios. In this paper, we address the route choice problem, in which each driver must choose the best route between its origin and its destination. The expected outcome corresponds to an equilibrium point in the space of policies where no driver benefits from deviating from its policy, a concept known as User Equilibrium (UE). Considering the limited observability of such a scenario, we investigate how the agents can estimate their regret based exclusively on their experience. To this end, we introduce the concept of estimated action regret, through which an agent can estimate how much worse it performs by taking a given action rather than the best one in hindsight. Additionally, we show how such estimations can be used as a reinforcement signal to improve the agents' performance. We empirically evaluate our approach in different route choice scenarios, showing that the agents produce reasonable estimations of their regret. Furthermore, we show that using such estimations as the reinforcement signal provides good approximations to the UE.

1 Introduction

Reinforcement learning (RL) in multiagent domains is a challenging task. In RL, an agent must learn by trial and error how to behave within the environment in order to maximise its utility. When multiple agents share a common environment, they must adapt their behaviour to those of others. The problem becomes even harder when the agents are selfish and compete for a common resource. An example is the route choice problem, which concerns how rational drivers (henceforth, we use the terms agent and driver interchangeably) behave when choosing routes between their origins and destinations to minimise their travel costs. Learning is a fundamental aspect of route choice because the agents must adapt their choices to account for the changing traffic conditions. In other words, the agents must adapt to each others' decisions.

An interesting class of multiagent RL techniques comprises the regret minimisation approaches. In this context, regret has typically been employed to measure the performance of reinforcement learning agents. Specifically, regret measures how much worse an agent performs by following its current policy rather than the best possible policy in hindsight. In this sense, regret minimisation can be seen as an inherent definition of how rational agents behave over time. Along these lines, the regret measure naturally fits as a guide for the learning process.

In recent works [Zinkevich et al., 2008; Bowling and Zinkevich, 2012; Waugh et al., 2015], regret has been used to improve the learning process. However, calculating regret requires complete knowledge of the environment (i.e., the utility associated with every possible policy). One may assume that an online service broadcasts the required information through mobile devices. Nevertheless, investigating methods that work in the absence of any global information is more challenging and also relevant, especially in highly competitive scenarios like traffic [Bazzan and Klügl, 2013; Stone and Veloso, 2000].

In this paper, we address the route choice problem by minimising regret. Specifically, we investigate how the agents can estimate their regret locally (i.e., based exclusively on their experience) and how such estimations can be employed to guide the RL process.
To this end, each agent keeps an internal history of experienced rewards, which is used for estimating the regret associated with each of its actions. We refer to such a measure as the estimated action regret and employ it for updating the agents' policies. The expected outcome corresponds to an equilibrium point in the space of policies in which no driver benefits from deviating from its policy. This is the so-called User Equilibrium (UE) [Wardrop, 1952]. To the best of our knowledge, this is the first attempt to improve the learning process by using regret estimations as the reinforcement signal.

Through experiments, we show that our approach provides fairly precise estimations of the agents' regret relying only on the agents' experience. Moreover, we present good evidence that using such regret estimates as the reinforcement signal is beneficial for the learning process. Consequently, in all tested cases, the results are reasonably close to the UE.

We remark that this work represents our very first step towards developing rational agents able to analyse their learning performance and to improve their expected outcome. In the medium term, we aim at investigating formal aspects of the learning process to guarantee the efficiency of RL in multiagent domains.

This paper is organised as follows. The background on route choice, RL and regret algorithms is presented in Section 2. In Sections 3 and 4, we describe how the agents can estimate their regret locally and how they can learn using such estimations, respectively. The experimental evaluation is discussed in Section 5. Finally, Section 6 presents the concluding remarks and future work directions.

2 Background

2.1 Route Choice

The route choice problem concerns how rational drivers behave when choosing routes between their origins and destinations. In this section, we introduce the basic concepts related to route choice. For a more comprehensive overview, we refer the reader to [Ortúzar and Willumsen, 2011].

A road network can be represented as a directed graph G = (N, L), where the set of nodes N represents the intersections and the set of links L represents the roads between intersections. The demand for trips generates a flow of vehicles on the links, with f_l the flow on link l. Each link l ∈ L has a cost c_l : f_l → R+ associated with crossing it. Let T_{ij} be the demand for trips between origin i ∈ N and destination j ∈ N (we refer to an origin-destination pair simply as an OD pair). The set of all such demands is represented by an OD matrix T = {T_{ij} | ∀i, j ∈ N, i ≠ j, T_{ij} ≥ 0}. The total demand is denoted d = Σ_{T_{ij} ∈ T} T_{ij}. A trip is made by means of a route R ⊆ L, which is a sequence of links connecting an origin to a destination. The cost of a route R is given by C_R = Σ_{l ∈ R} c_l.

The cost of a link is typically modelled using the volume-delay function (VDF) abstraction. A VDF takes as input the flow of vehicles on a link and, based on the link's attributes (such as length and capacity), returns its cost (travel time). A simple VDF is presented in Equation (1), with t_l denoting the free-flow travel time (i.e., the minimum travel time, observed when the link is not congested). In this particular VDF, the travel time on link l increases by 0.02 for each vehicle/hour of flow.

    c_l(f_l) = t_l + 0.02 \times f_l    (1)
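These abstractions map directly onto code. The following minimal sketch (not the authors' implementation; class and variable names, as well as the toy network, are illustrative assumptions) shows a link with the VDF of Equation (1) and the route cost C_R computed as the sum of its links' costs.

```python
# Minimal sketch of the road-network abstractions of Section 2.1: a link with
# the volume-delay function of Equation (1) and the route cost C_R as the sum
# of the costs of its links. Names and the toy network are illustrative.

class Link:
    def __init__(self, free_flow_time):
        self.free_flow_time = free_flow_time  # t_l: travel time without congestion
        self.flow = 0                         # f_l: number of vehicles on the link

    def cost(self):
        # Equation (1): travel time grows by 0.02 per unit of flow.
        return self.free_flow_time + 0.02 * self.flow

def route_cost(route, links):
    # C_R: sum of the costs of the links forming route R.
    return sum(links[link_id].cost() for link_id in route)

# Toy example: a route made of a single congested link.
links = {"l1": Link(free_flow_time=10.0), "l2": Link(free_flow_time=12.0)}
links["l1"].flow = 50
print(route_cost(["l1"], links))  # 10.0 + 0.02 * 50 = 11.0
```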
In the route choice process, drivers decide which route to take every day to reach their destinations. Usually, this process is modelled as a commuting scenario, where drivers' daily trips occur under approximately the same conditions. In this sense, each driver i ∈ D, with |D| = d, is modelled as an agent, which repeatedly deals with the problem of choosing the route that takes the least time to its destination. The utility u_i : R → R received by driver i after taking route R is inversely associated with the route's cost, as in Equation (2).

    u_i(R) = -C_R    (2)

The expected outcome of this problem can be described by Wardrop's first principle: "the cost on all the routes actually used are equal, and less than those which would be experienced by a single vehicle on any unused route" [Wardrop, 1952]. Such a solution concept is known as User Equilibrium (UE). Observe, however, that the UE does not describe the system at its best operation. Such a state is only achieved when the average travel cost is minimum, as described by Wardrop's second principle [Wardrop, 1952]. This solution concept is commonly referred to as System Optimum (SO).

2.2 Reinforcement Learning

Reinforcement learning (RL) concerns how an agent learns a behaviour through reward and punishment from interactions with its environment. We can formulate the RL problem as a Markov decision process (MDP). An MDP is a tuple ⟨S, A, T, r⟩, where: S is the set of environment states; A is the set of actions; T : S × A × S → [0, 1] is the state transition function, with T(s, a, s′) = P(s′ | s, a) representing the probability of reaching state s′ after taking action a in state s; and r : S × A → R is the reward function, with r(s, a) denoting the reward received after taking action a in state s [Sutton and Barto, 1998].

In the context of the route choice problem, the actions of an agent represent the choice of routes between its origin and destination. Such a set of actions is known a priori by the agents. In this sense, the set of states and, consequently, the transition function can be ignored. Moreover, we can define the reward received after taking action a as r(a) = u(R), with a = R (we will refer to reward and utility, rather than cost, hereinafter). On this basis, we can model the route choice problem as a stateless MDP. (A stateless MDP is equivalent to having an initial state, whose actions correspond to the routes available to the agent, and an ending state with no actions: when the agent chooses action a in the initial state, it performs the desired action and reaches the ending state with probability 1, receiving reward r(a).)

Solving a stateless MDP involves finding a policy π (i.e., which route to take) that maximises the accumulated reward. When the model of the environment dynamics (i.e., the reward function r) is known by the agent, finding such an optimal policy is trivial. However, this is rarely the case, especially in multiagent settings. Hence, the agent must repeatedly interact with the environment to learn a model of its dynamics. A particularly suitable class of RL algorithms here comprises the so-called temporal-difference algorithms, through which an agent can learn without an explicit model of the environment.

The Q-learning algorithm is a good representative of temporal-difference methods [Watkins and Dayan, 1992]. In the case of a stateless MDP, a Q-learning agent learns the expected return Q(a) for selecting each action a by exploring the environment. The exploration of the environment must balance exploration (gain of knowledge) and exploitation (use of knowledge). A typical strategy here is ε-greedy, in which the agent chooses a random action with probability ε (exploration) or the best action with probability 1 − ε (exploitation). Usually, ε starts at 1.0 and, at the end of each learning episode, it is multiplied by a decay rate λ in order to increase exploitation as the agent converges to its best action. After taking action a and receiving reward r(a), the stateless Q-learning algorithm updates Q(a) using Equation (3), where the learning rate α ∈ [0, 1] weights how much of the previous estimate should be retained. The Q-learning algorithm is guaranteed to converge to an optimal policy in the limit under certain assumptions.

    Q(a) = (1 - \alpha)\, Q(a) + \alpha\, r(a)    (3)
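To make the stateless setting concrete, the sketch below implements a stateless Q-learning agent with ε-greedy exploration and ε-decay, following Equation (3). It is a hedged illustration under our own naming and default values, not the authors' implementation.

```python
# Sketch of a stateless Q-learning agent with epsilon-greedy exploration, as
# described in Section 2.2. The update is Equation (3):
#   Q(a) = (1 - alpha) * Q(a) + alpha * r(a).
# Names and default values are illustrative assumptions.

import random

class StatelessQLearner:
    def __init__(self, actions, alpha=0.5, epsilon=1.0, decay=0.99):
        self.q = {a: 0.0 for a in actions}  # expected return per action (route)
        self.alpha = alpha                  # learning rate
        self.epsilon = epsilon              # exploration probability
        self.decay = decay                  # decay rate (lambda), applied per episode

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best action.
        if random.random() < self.epsilon:
            return random.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, action, reward):
        # Equation (3): blend the previous estimate with the new reward.
        self.q[action] = (1 - self.alpha) * self.q[action] + self.alpha * reward
        self.epsilon *= self.decay  # increase exploitation as learning proceeds
```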
2.3 Regret

The regret concept was introduced in the context of evaluating the performance of learning rules [Hannan, 1957]. Regret measures how much worse an agent i performs, on average, by following its current policy π_i ∈ Π as compared to following the best possible policy in hindsight. Precisely, the regret R^T_i of agent i at time T is given by Equation (4), where r^t(a) represents the reward of action a at time t and, slightly abusing notation, π^t represents the action taken at time t under policy π. Therefore, regret can be seen as the average amount lost for following a policy other than the best one.

    R^T_i = \max_{\pi \in \Pi} \frac{1}{T} \sum_{t=1}^{T} r^t(\pi^t) - \frac{1}{T} \sum_{t=1}^{T} r^t(\pi_i^t)    (4)

In the context of reinforcement learning (RL), regret has typically been used as a measure of convergence [Shoham et al., 2007; Buşoniu et al., 2008; Zinkevich, 2003; Bowling and Veloso, 2002; Powers and Shoham, 2005; Banerjee and Peng, 2005]. An RL algorithm is said to be no-regret if it learns a policy π for which R^T → 0 as T → ∞ [Hannan, 1957]. Along these lines, regret minimisation can be seen as a natural definition of how rational agents behave over time. In this paper, we use the regret measure to guide the learning process.

We remark that, by definition, computing regret assumes complete knowledge of the environment. Specifically, an agent cannot compute its regret without knowing the reward of every other action over time. Consequently, agents cannot (i) calculate their regret and (ii) learn using their regret, except if very strong assumptions are made (e.g., assuming that every agent knows the rewards of all actions over time). Therefore, using the regret of Equation (4) to guide the learning process is not realistic in multiagent scenarios.
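For concreteness, the sketch below computes the regret of Equation (4) under exactly the full-information assumption criticised above: it needs the reward every action would have yielded at every time step. Only deterministic (fixed-route) policies are considered, as is done in the experiments of Section 5; the function name and input format are our own illustrative choices.

```python
# Illustrative computation of the real agent regret of Equation (4). Note the
# unrealistic input: the reward of *every* action at *every* time step. Only
# deterministic (fixed-action) policies are considered, as in Section 5.

def agent_regret(all_rewards, taken_actions):
    """all_rewards[t][a]: reward action a would have yielded at time t.
    taken_actions[t]: action actually taken by the agent at time t."""
    T = len(taken_actions)
    actions = all_rewards[0].keys()
    # Average reward of the best fixed action in hindsight.
    best_avg = max(sum(all_rewards[t][a] for t in range(T)) / T for a in actions)
    # Average reward actually obtained by the agent.
    own_avg = sum(all_rewards[t][taken_actions[t]] for t in range(T)) / T
    return best_avg - own_avg
```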
Zinkevich et al. introduced the concept of counterfactual regret and proposed an algorithm for minimising it [Zinkevich et al., 2008]. The counterfactual regret is used to estimate the regret when the information about states is incomplete (useful in extensive-form games with imperfect information). This is one of the first works to include regret in the learning process. Subsequently, Waugh et al. employed a regression algorithm to provide enhanced estimates of the counterfactual regret [Waugh et al., 2015]. However, these works assume that the regret is known by the agents, which is unrealistic for the problem we are considering.

In [Bowling and Zinkevich, 2012], the authors propose a graph representation to express the relation between actions and the associated regret. Such a representation was employed to mimic the structure of local search methods, thus allowing no-regret algorithms to minimise a broader class of optimisation problems. Nevertheless, their work also assumes that the utility function is available to the agents.

Powers and Shoham proposed a set of performance criteria regarding multiagent learning and an algorithm that meets such criteria [Powers and Shoham, 2005]. Their algorithm, however, makes some strong assumptions regarding the environment's observability (e.g., an agent can observe its opponents' rewards). Banerjee and Peng extended Powers and Shoham's approach to a class of no-regret algorithms and dropped some of the observability assumptions [Banerjee and Peng, 2005]. Notwithstanding, these approaches do not employ regret to guide the learning process.

Along these lines, in this paper we investigate how agents can estimate their regret based on their experience, and we propose the use of such estimations to guide the agents' learning process. Moreover, we disaggregate the regret formulation by measuring the regret of actions rather than of policies. We show that performing such estimates is realistic and improves the learning process. To the best of our knowledge, this is the first effort towards addressing regret estimation and learning through such regret.

3 Estimating Regret Locally

In this section, we discuss how agents can estimate their regret. As discussed in Section 2.3, the agents cannot compute their real regret (using Equation (4)) due to the lack of information regarding the routes' rewards. The point is that regret is measured considering (i) the agent's average reward under its current policy and (ii) the average reward under the best policy in hindsight. Calculating the latter requires knowing the rewards of all routes over time. However, after each trip, an agent can observe the reward of the route taken, but not the rewards of the other routes. Such a full-observability property would only be possible under strong assumptions (e.g., a central authority broadcasting such information), which can be unrealistic in traffic domains. Furthermore, investigating methods that work in the absence of any supporting service is more challenging and also relevant, especially in the highly competitive settings considered here [Stone and Veloso, 2000].

In this paper, we investigate how agents can estimate their regret based exclusively on local information (i.e., the rewards actually experienced by them). To this end, we propose an alternative definition of regret that describes the estimated regret of each action.

Let A_i ⊆ A denote the set of routes of agent i (we slightly abuse notation here to account for drivers with different OD pairs, whose route sets are obviously different). At time t, agent i performs a given action a_i^t ∈ A_i and receives a reward of r^t(a_i^t). We represent the history of experiences of agent i as H_i = {r^t_{i,a} | a ∈ A_i, t ∈ [1, T]}, with r^t_{i,a} the reward experienced by driver i for taking action a at time t. Recall, however, that an agent cannot observe the reward of action a at time t unless it has taken that action at that time, i.e., unless a = a_i^t. To address this, the agents can assume the reward of non-taken actions to be the same as the most up-to-date experience on those routes.
Precisely, let r̃^t_{i,a} represent the newest reward experience of agent i for taking action a at time t (either the current reward or the last actually experienced one), as given by Equation (5). As initial value, one can consider the reward corresponding to the links' free-flow travel times, i.e., the minimum possible cost (as described in Section 2.1); from a practical perspective, such information could easily be obtained using any offline navigation device. The history of experiences of agent i can then be rewritten as H_i = {r̃^t_{i,a} | a ∈ A_i, t ∈ [1, T]}.

    \tilde{r}^t_{i,a} = \begin{cases} r^t(a_i^t) & \text{if } a = a_i^t \\ \tilde{r}^{t-1}_{i,a} & \text{otherwise} \end{cases}    (5)

Given the above definitions, we can now formulate the average estimated regret of agent i for taking action a at time t, as in Equation (6). In general terms, we will refer to this formulation as the estimated action regret. The estimated action regret R̃^t_{i,a} can be seen as the estimated amount lost by agent i for taking action a at time t instead of the best action according to its experience. Additionally, we can reformulate Equation (4) to obtain the estimated agent regret, as in Equation (7). The estimated agent regret R̃^t_i expresses how much worse agent i performed, on average, up to time t for not taking only the best action according to its experience. The main advantage of this formulation over the real regret (Equation (4)) is that it can be computed locally by the agents, eliminating the need for a central authority. Moreover, as the required information is already available to the agents, they can use such regret to guide their learning process.

    \tilde{R}^t_{i,a} = \max_{b \in A_i} \frac{1}{t} \sum_{s=1}^{t} \tilde{r}^s_{i,b} - \frac{1}{t} \sum_{s=1}^{t} \tilde{r}^s_{i,a}    (6)

    \tilde{R}^t_i = \max_{a \in A_i} \frac{1}{t} \sum_{s=1}^{t} \tilde{r}^s_{i,a} - \frac{1}{t} \sum_{s=1}^{t} r^s(a_i^s)    (7)
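The sketch below illustrates how an agent can maintain this history and compute the estimated regrets incrementally: the latest experienced reward per route implements Equation (5), and running sums give Equations (6) and (7). It is a minimal illustration under our own naming, with initial values following the free-flow suggestion above; it is not the authors' code.

```python
# Sketch of the locally estimated regret of Equations (5)-(7). The agent keeps
# the most recent reward it experienced on each route (Equation (5)) and uses
# these values in place of the unobserved ones. Names are illustrative.

class RegretEstimator:
    def __init__(self, initial_rewards):
        # \tilde{r}_{i,a}: latest experienced reward per action, initialised
        # with the free-flow (minimum-cost) rewards, as suggested above.
        self.latest = dict(initial_rewards)
        self.sum_estimated = {a: 0.0 for a in initial_rewards}  # sums of \tilde{r}^s_{i,a}
        self.sum_received = 0.0                                 # sum of r^s(a^s_i)
        self.t = 0

    def observe(self, action, reward):
        # Equation (5): only the taken action is updated; all other actions
        # keep their most recent value.
        self.latest[action] = reward
        self.t += 1
        for a, r in self.latest.items():
            self.sum_estimated[a] += r
        self.sum_received += reward

    def action_regret(self, action):
        # Equation (6): estimated loss for taking `action` instead of the best one.
        best_avg = max(s / self.t for s in self.sum_estimated.values())
        return best_avg - self.sum_estimated[action] / self.t

    def agent_regret(self):
        # Equation (7): estimated average loss over the rewards actually received.
        best_avg = max(s / self.t for s in self.sum_estimated.values())
        return best_avg - self.sum_received / self.t
```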
4 Learning Through Regret

In this section, we present a simple algorithmic solution for the agents to learn using the estimated action regret of Equation (6). To this end, we employ the Q-learning algorithm (as presented in Section 2.2). We highlight, however, that any other reinforcement learning algorithm could be used as well.

Every driver i ∈ D is represented by a Q-learning agent. The route choice problem can then be modelled as a stateless MDP. As such, the states and the transition function can be ignored. Let A = {A_i | i ∈ D} be the set of agents' actions, where A_i is the set of routes of agent i, with A_i = A_j if agents i and j belong to the same OD pair. The reward for taking action a at time t is given by r^t(a).

The learning process works as follows. At each episode t ∈ [1, T], each agent i ∈ D chooses an action a_i^t ∈ A_i using the ε-greedy strategy. After taking the chosen action, the agent receives a reward of r^t(a_i^t). Afterwards, the agent updates its history H_i using Equation (5) and calculates the estimated regret of action a_i^t using Equation (6). Finally, the agent updates Q_i(a_i^t) using Equation (3). This process is repeated for each episode, aiming at minimising the agents' regret.

Recall that the original definition of regret in Equation (4) measures the regret of the agent, not of its actions. However, the agent regret is not useful in the learning process because it does not consider how much the reward of a single action contributes to the regret. In other words, as we consider the average regret, the reward of a good-performing action could be penalised by those of bad-performing actions. Moreover, the learning process works by adjusting the expected value (or regret) of each action of the agent. In this sense, our estimated action regret definition isolates the regret per action, thus allowing the actions to be evaluated singly. The estimated action regret is more suitable for evaluating how promising a given action is as compared to the others.

Finally, it is worth noting that, although each driver minimises its actions' regret, this is equivalent to minimising its total regret. Recall that the estimated action regret measures how much worse an action performs as compared to the best one. By employing such a value in the learning process, the agent puts more weight on the actions with smaller regret. Moreover, using the ε-greedy strategy, the agent tends to choose the action with the smallest regret with a higher probability. Consequently, we can state that minimising the estimated action regret over time is equivalent to minimising the estimated agent regret.
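The learning loop just described can be sketched by combining the pieces illustrated earlier (a stateless Q-learner and a regret estimator): each episode the agent picks a route ε-greedily, observes its reward, updates its history (Equation (5)), computes the estimated action regret (Equation (6)), and feeds it to the Q-learning update (Equation (3)). The paper does not fix a sign convention for the reinforcement signal; using the negative estimated regret as the reward is our assumption for making "smaller regret" correspond to a larger Q-value.

```python
# Sketch of the learning loop of Section 4. The `agent` and `estimator`
# arguments follow the interfaces of the earlier sketches; `reward_of` stands
# for the experienced reward, e.g. the negative travel time returned by a
# traffic simulation of all drivers' joint choices. The negative-regret signal
# is our assumption, not an explicit prescription of the paper.

def run_episodes(agent, estimator, reward_of, episodes=1000):
    """agent: exposes choose() and update(action, reward) (Equation (3));
    estimator: exposes observe(action, reward) and action_regret(action)."""
    for _ in range(episodes):
        action = agent.choose()                    # epsilon-greedy route choice
        reward = reward_of(action)                 # experienced reward r^t(a^t_i)
        estimator.observe(action, reward)          # history update, Equation (5)
        regret = estimator.action_regret(action)   # Equation (6)
        agent.update(action, -regret)              # Equation (3), regret as signal
```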
5 Experimental Evaluation

5.1 Setup

In order to evaluate our approach, we employ five different road networks that are available in the literature.

Pigou: a classical network introduced in [Pigou, 1920]. It comprises only two links l1 and l2, with c_{l1}(f_{l1}) = 1.0 and c_{l2}(f_{l2}) = f_{l2}/d. We set the number of drivers to d = 100, all of them belonging to the same OD pair. In this scenario, there are only two actions (one corresponding to each link), i.e., |A| = 2 (an illustrative encoding of this instance is sketched at the end of this subsection).

OW: a small, synthetic network, with |N| = 13, |L| = 48, and d = 1700 [Ortúzar and Willumsen, 2011, exercise 10.1]. The vehicles are distributed among four OD pairs. The costs of the links are calculated using Equation (1). In this network, the number of possible routes for each OD pair is high. We therefore employ the KSP algorithm [Yen, 1971] to find the k shortest routes for each OD pair, i.e., |A| = k.

Braess graphs: expanded versions of the network introduced in [Braess, 1968]. Specifically, let p ∈ {1, 2, ...} index the pth expansion of such a graph, where the 1st Braess graph is equivalent to the original graph [Roughgarden, 2006]. We generalise this model to consider a discrete set of drivers. The complete description of these graphs is available in Appendix A. We employ the 1st, 2nd and 3rd Braess graphs and define d = 4000, with all drivers belonging to the same OD pair and, by definition, |A| = 2p + 1.

An experiment corresponds to a complete execution, with 1000 episodes, of the Q-learning algorithm on a single network. After an execution is completed, we measure its performance by means of the average travel time (avg-tt hereafter, measured in minutes) and the average estimated agent regret (using Equation (7)), both regarding the last episode. We tested different combinations of the Q-learning parameters. For each such combination, 30 repetitions were performed. The best combination found was α = 0.5, ε = 1.0 and λ = 0.99 (the best value of α varied slightly from one network to another, but was reasonably close to 0.5 in all tested cases; thus, for uniformity, we chose α = 0.5 for all networks). Moreover, in the case of the OW network, we also tested different values of k (the KSP algorithm is run once for each OD pair, at the beginning of the experiment). The best results were achieved for k = 8. The results of these configurations were selected for further analysis in the next subsection.

The main hypotheses of our work are that: (i) the results approach the user equilibrium (UE), and (ii) the regret estimations are reasonably precise. In the next subsection, we analyse how successfully our approach performed regarding these hypotheses.
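As an illustration of the simplest scenario, the snippet below encodes the Pigou instance described above and evaluates its UE assignment, whose average travel time of 1.000 matches Table 1. The encoding is our own sketch, not the authors' experimental code.

```python
# Illustrative encoding of the Pigou network of Section 5.1: two parallel
# links, c_l1(f) = 1.0 and c_l2(f) = f / d, with d = 100 drivers on a single
# OD pair. Under UE all drivers take l2: its cost then equals 1, so no driver
# can do better by switching to l1, and the avg-tt is 1.000 as in Table 1.

d = 100
cost = {
    "l1": lambda f: 1.0,     # constant-cost link
    "l2": lambda f: f / d,   # congestible link
}

flows = {"l1": 0, "l2": d}   # user-equilibrium assignment
avg_tt = sum(cost[l](f) * f for l, f in flows.items()) / d
print(avg_tt)  # 1.0
```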
5.2 Results

The performance of our approach in all tested road networks is presented in Table 1. In the table, we show the two main performance metrics, avg-tt and average estimated agent regret. Additionally, in order to analyse such results, we present the UE (as reported in the literature), the average real agent regret (using Equation (4)), and the relative difference between the estimated and real regrets. In order to compute the real regret of Equation (4), we considered the space of policies to be a simple mapping from routes to deterministic policies; ignoring mixed policies over the set of available actions is a common practice in the literature [Banerjee and Peng, 2005; Zinkevich et al., 2008].

Table 1: Performance of our approach in different road networks. For each tested network, we show: the average travel time (which, ideally, should approximate the UE), the UE (as reported in the literature), the average estimated agent regret (Equation (7)), the average real agent regret (Equation (4)), and the relative difference between the estimated and real agent regrets.

network             avg-tt   UE       estimated regret   real regret   relative difference (%)
Pigou               1.000    1.000    0.0136             0.0135        4.11
OW                  67.156   67.157   0.0031             0.0045        8.02
1st Braess graph    1.988    2.000    0.0245             0.0224        8.25
2nd Braess graph    2.832    3.000    0.0393             0.0221        41.66
3rd Braess graph    3.701    4.000    0.0882             0.0293        64.64

Our first hypothesis states that the avg-tt results are close to the UE. As shown in Table 1, such results are indeed close to the UE values reported in the literature. The results move slightly away from the UE for the Braess graphs, especially the larger ones (p > 1). This phenomenon can be explained by the nature of such graphs. Under UE, only the so-called type-P routes are used (see Appendix A for a detailed description). However, such routes have very similar costs. Consequently, it becomes harder for the agents to choose which route to take. The problem becomes even harder for larger Braess graphs because the number of type-P routes also increases with p. Furthermore, the Braess graphs were designed so that the UE values are as far as possible from the System Optimum (SO). Nonetheless, observe that, for these particular graphs, our avg-tt results are closer to the SO than to the UE. Therefore, such results evidence that our approach provides good approximations to the UE.

Regarding our second hypothesis, its validation involves evaluating how precise the regret estimations are. To analyse such precision, we compare the real and estimated agent regrets by means of their relative difference. The lower this difference, the better. Of course, this difference cannot be computed by the agents, otherwise the regret estimation would not be necessary. As can be seen in Table 1, the estimated regrets are reasonably close to the real ones, especially for the Pigou, OW and 1st Braess graph networks. For the larger Braess graphs, the results were somewhat worse. The point here, again, is that the Braess graphs are more challenging because they were designed to be among the hardest networks. As the agents have more difficulty learning their best routes, the network takes longer to reach a more stable point. Consequently, the agents' estimations need to be updated more frequently to account for the high variations in the routes' costs. However, despite these difficulties, the estimations were reasonable. We highlight that such estimations could be greatly improved by adopting more sophisticated estimation methods (e.g., using a nonlinear regressor). Thus, the present results also evidence that our regret estimation mechanism reaches a reasonable level of precision.

Along these lines, we can conclude, at least experimentally, that the agents can, in fact, estimate their regret locally and use such information to learn their best routes. Obviously, these results are not definitive. As mentioned in the introduction, this work represents a very first step towards a more formal investigation of guarantees for RL algorithms, which is our medium-term objective.

6 Concluding Remarks

In this paper, we addressed the route choice problem by minimising the drivers' regret. This problem concerns how rational drivers learn which route to take so as to minimise their expected travel costs. To this end, we developed methods for learning agents to estimate their regret locally (i.e., based exclusively on their experience) and to learn using such estimations. Specifically, each agent keeps a history of experienced rewards, which is used to compute the so-called estimated action regret.
Based on experiments, we have shown that our approach provides reasonably precise estimations of the agents' regret and that such estimations are useful in the learning process. We recall that this work represents an initial effort towards a more formal investigation of efficiency guarantees for RL algorithms, which is our medium-term objective.

For future work, we would like to investigate formally how precise the regret estimations can be. We expect that more sophisticated methods could be employed to estimate the agents' regret (e.g., using a nonlinear regressor). Moreover, we would like to study how much our approach benefits when the agents can communicate to improve their estimations. We also consider investigating the benefits of employing an online service that provides global information to the agents. Regarding the learning process, a promising direction would be adopting algorithms that learn mixed policies over the actions rather than only the best action. Furthermore, considering that this work is a proof of concept, no comparison was made against other methods in the literature; making such a comparison is an obviously important step towards a more complete analysis of our approach. Last but not least, it would be interesting to explore how the learning process could be shaped towards globally efficient routes.

Acknowledgments

We are very grateful to the anonymous reviewers for their valuable comments. The authors are partially supported by CNPq and CAPES grants.

References

[Banerjee and Peng, 2005] Bikramjit Banerjee and Jing Peng. Efficient no-regret multiagent learning. In Proceedings of the Twentieth National Conference on Artificial Intelligence, pages 41–46. AAAI Press, 2005.

[Bazzan and Klügl, 2013] Ana L. C. Bazzan and Franziska Klügl. Introduction to Intelligent Systems in Traffic and Transportation, volume 7 of Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool, 2013.

[Bowling and Veloso, 2002] Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2):215–250, 2002.

[Bowling and Zinkevich, 2012] Michael Bowling and Martin Zinkevich. On local regret. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1631–1638, New York, USA, 2012. Omnipress.

[Braess, 1968] D. Braess. Über ein Paradoxon aus der Verkehrsplanung. Unternehmensforschung, 12:258, 1968.

[Buşoniu et al., 2008] L. Buşoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2):156–172, 2008.

[Hannan, 1957] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.

[Koutsoupias and Papadimitriou, 1999] Elias Koutsoupias and Christos Papadimitriou. Worst-case equilibria. In Proceedings of the 16th Annual Conference on Theoretical Aspects of Computer Science (STACS), pages 404–413, Berlin, Heidelberg, 1999. Springer-Verlag.

[Ortúzar and Willumsen, 2011] Juan de Dios Ortúzar and Luis G. Willumsen. Modelling Transport. John Wiley & Sons, Chichester, UK, 4th edition, 2011.

[Pigou, 1920] A. Pigou. The Economics of Welfare. Palgrave Classics in Economics. Palgrave Macmillan, 1920.

[Powers and Shoham, 2005] Rob Powers and Yoav Shoham. New criteria and a new algorithm for learning in multi-agent systems. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1089–1096. MIT Press, 2005.
[Roughgarden, 2006] Tim Roughgarden. On the severity of Braess's paradox: Designing networks for selfish users is hard. Journal of Computer and System Sciences, 72(5):922–953, 2006.

[Shoham et al., 2007] Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171(7):365–377, May 2007.

[Stone and Veloso, 2000] Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383, July 2000.

[Sutton and Barto, 1998] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[Wardrop, 1952] John Glen Wardrop. Some theoretical aspects of road traffic research. In Proceedings of the Institution of Civil Engineers, volume 1, pages 325–362, 1952.

[Watkins and Dayan, 1992] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.

[Waugh et al., 2015] Kevin Waugh, Dustin Morrill, J. Andrew Bagnell, and Michael Bowling. Solving games with functional regret estimation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2138–2144. AAAI Press, 2015.

[Yen, 1971] Jin Y. Yen. Finding the k shortest loopless paths in a network. Management Science, 17(11):712–716, 1971.

[Zinkevich et al., 2008] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1729–1736. Curran Associates, Inc., 2008.

[Zinkevich, 2003] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, pages 928–936, Menlo Park, USA, 2003. AAAI Press.

A Expanding the Braess Graphs

The original Braess graph was designed to illustrate the counter-intuitive phenomenon that removing a link from a road network may improve its outcome [Braess, 1968]. This is the so-called Braess paradox. In this paper, we are not interested in the paradox itself. However, we employ the Braess graphs for validating our approach, given that they pose interesting challenges in the drivers' decision process (as seen in Section 5.2). Roughgarden devised a method for generating the pth expansion of the original Braess graph [Roughgarden, 2006]. Nonetheless, his modelling requires the flow (i.e., the number of vehicles) to be normalised to the interval [0, p], with p the order of the Braess graph. In order to overcome this limitation, we extend Roughgarden's modelling by dropping such a requirement, thus delivering a more general model. Moreover, we provide updated theoretical results for the System Optimum (SO) and User Equilibrium (UE) solution concepts, as well as results for the Braess paradox and the price of anarchy.

A.1 Graphs Generation Process

Consider the modelling introduced in Section 2.1. Let B^p be the pth Braess graph, with B^1 being equivalent to the original Braess graph. The set of nodes can be described as N^p = {s, n_1, ..., n_p, o_1, ..., o_p, t}, with |N^p| = 2p + 2, and with s ∈ N and t ∈ N representing the source and target nodes, respectively, for all d drivers (i.e., all drivers share the same OD pair). Let (i, j) represent a link from i ∈ N to j ∈ N. The set of links can be formulated as L^p = {(s, n_i), (n_i, o_i), (o_i, t) : 1 ≤ i ≤ p} ∪ {(n_i, o_{i−1}) : 2 ≤ i ≤ p} ∪ {(n_1, t), (s, o_p)}, with |L^p| = 4p + 1. The links are grouped into three distinct types, each with a corresponding cost function, as follows.

type-A: for all links of the form (n_i, o_i), we use c^p_l(f_l) = 0;

type-B: for all links of the form (n_i, o_{i−1}), (s, o_p), and (n_1, t), we use c^p_l(f_l) = 1;

type-C: for all links of the form (o_i, t) and (s, n_{p−i+1}), with i ∈ {1, ..., p}, we use the piecewise, continuous, non-decreasing function of Equation (8). Using this function, the maximum possible cost of a type-C link (when f_l = d) is ip^2. The shape of the type-C cost function is illustrated in Figure 1.

    c^p_l(f_l) = \begin{cases} 0 & \text{if } f_l \le d/(p+1) \\ ip\,(p f_l + f_l - d)/d & \text{otherwise} \end{cases}    (8)

Figure 1: Shape of the type-C cost functions (cost 0 up to flow d/(p+1), then increasing linearly up to ip^2 at flow d).

The routes are divided into two groups. Let P denote the set of routes without any type-B link, i.e., P = {P_i = (s, n_i, o_i, t) | i ∈ [1, p]}, with |P| = p (we slightly abuse notation here, using (s, ..., t) to represent the sequence of connected links {(s, ·), ..., (·, t)} that form a route). All the other routes, each containing a type-B link, are then represented by Q = {(s, n_1, t)} ∪ {Q_i = (s, n_i, o_{i−1}, t) | i ∈ [2, p]} ∪ {(s, o_p, t)}, with |Q| = p + 1. We will distinguish the routes from these two groups as type-P and type-Q routes.

Figure 2: The first, second, and third Braess graphs. Links are labelled with their types.
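The generation process above can be sketched as follows. The function below is an illustrative implementation under our own naming (not the authors' code): it builds the node set, a map from links to the typed cost functions (Equation (8) for type-C), and the type-P and type-Q route sets of B^p.

```python
# Hedged sketch of the generation of the p-th Braess graph B^p (Section A.1).
# It returns the nodes, a map from links to cost functions (types A, B, C) and
# the type-P and type-Q route sets. Data structures are illustrative.

def braess_graph(p, d):
    nodes = ["s", "t"] + [f"n{i}" for i in range(1, p + 1)] + [f"o{i}" for i in range(1, p + 1)]

    def type_c(i):
        # Equation (8): cost 0 up to flow d/(p+1), then linear up to i*p^2 at flow d.
        return lambda f: 0.0 if f <= d / (p + 1) else i * p * (p * f + f - d) / d

    costs = {}
    for i in range(1, p + 1):
        costs[("s", f"n{p - i + 1}")] = type_c(i)      # type-C
        costs[(f"o{i}", "t")] = type_c(i)              # type-C
        costs[(f"n{i}", f"o{i}")] = lambda f: 0.0      # type-A
    for i in range(2, p + 1):
        costs[(f"n{i}", f"o{i - 1}")] = lambda f: 1.0  # type-B
    costs[("n1", "t")] = lambda f: 1.0                 # type-B
    costs[("s", f"o{p}")] = lambda f: 1.0              # type-B

    # Type-P routes (no type-B link) and type-Q routes (one type-B link each).
    routes_p = [[("s", f"n{i}"), (f"n{i}", f"o{i}"), (f"o{i}", "t")]
                for i in range(1, p + 1)]
    routes_q = ([[("s", "n1"), ("n1", "t")]]
                + [[("s", f"n{i}"), (f"n{i}", f"o{i - 1}"), (f"o{i - 1}", "t")]
                   for i in range(2, p + 1)]
                + [[("s", f"o{p}"), (f"o{p}", "t")]])
    return nodes, costs, routes_p, routes_q
```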
A.2 Theoretical Results

Given the above formulation, we can derive theoretical results for the System Optimum (SO) and the User Equilibrium (UE) as follows. The SO is achieved when the total flow d is equally divided among the p + 1 type-Q routes. In this case, each such route receives a portion d/(p + 1) of the flow and the type-P routes are not used. Under such conditions, each route has a cost of 1 and the avg-tt is also 1 (each type-Q route crosses exactly one type-B link, of cost 1, while its type-C links carry at most d/(p + 1) flow and therefore cost 0).

The UE, on the other hand, is achieved when only the p type-P routes are used. In this case, each such route receives a flow of d/p, thus costing p + 1 (route P_i crosses two type-C links whose costs at flow d/p are p − i + 1 and i, plus a type-A link of cost 0, summing to p + 1). Under such circumstances, the avg-tt is also p + 1. By comparing the SO and UE results, we can define the price of anarchy to be p + 1 [Koutsoupias and Papadimitriou, 1999].

Observe that, in both solution concepts, the avg-tt and the route costs are always the same. This is due to the fact that, in both cases, all routes being used have precisely the same flow and cost: under SO, all used routes have a flow of d/(p + 1) and cost 1 each, and under UE, all used routes have a flow of d/p and cost p + 1. Consequently, as all vehicles experience the same cost, the avg-tt and the route costs coincide.

Regarding the Braess paradox, our modelling does not invalidate it. Observe that, whenever the type-A links (those of the form (n_i, o_i)) are removed, all type-P routes are also eliminated. Moreover, recall that, under UE, the cost of the flow is p + 1. However, after the type-P routes are removed, the UE is achieved when the flow is equally divided among the type-Q routes, which is precisely the SO. Thus, removing the type-P routes leads to an improvement in the overall performance, meaning that the Braess paradox still holds in our model.
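These values can be checked numerically with the generator sketched at the end of Appendix A.1 (again an illustration under our own assumptions): splitting the demand equally over the type-Q routes yields an average travel time of 1 (SO), while splitting it over the type-P routes yields p + 1 (UE), so the price of anarchy is p + 1.

```python
# Numeric check of the SO and UE values above, reusing the braess_graph sketch
# from Appendix A.1. Each of the given routes carries an equal share of d.

def avg_tt(routes, costs, d):
    share = d / len(routes)
    flows = {}
    for route in routes:
        for link in route:
            flows[link] = flows.get(link, 0.0) + share
    total = sum(sum(costs[link](flows[link]) for link in route) * share
                for route in routes)
    return total / d

p, d = 3, 4000
_, costs, routes_p, routes_q = braess_graph(p, d)
print(avg_tt(routes_q, costs, d))  # 1.0  (System Optimum)
print(avg_tt(routes_p, costs, d))  # ~4.0 (User Equilibrium, p + 1)
```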