On Estimating Action Regret and Learning From It in Route Choice

Gabriel de O. Ramos and Ana L. C. Bazzan
Instituto de Informática, Universidade Federal do Rio Grande do Sul
{goramos,bazzan}@inf.ufrgs.br

Abstract

The notion of regret has been extensively employed to measure the performance of reinforcement learning agents. The regret of an agent measures how much worse it performs by following its current policy rather than the best possible policy. As such, measuring regret requires complete knowledge of the environment, an assumption that is not realistic in most multiagent scenarios. In this paper, we address the route choice problem, in which each driver must choose the best route between its origin and its destination. The expected outcome corresponds to an equilibrium point in the space of policies where no driver benefits from deviating from its policy, a concept known as User Equilibrium (UE). Considering the limited observability of such a scenario, we investigate how the agents can estimate their regret based exclusively on their experience. To this end, we introduce the concept of estimated action regret, through which an agent can estimate how much worse it performs by taking a given action rather than the best one in hindsight. Additionally, we show how such estimations can be used as a reinforcement signal to improve the agents' performance. We empirically evaluate our approach in different route choice scenarios, showing that the agents produce reasonable estimations of their regret. Furthermore, we show that using such estimations as the reinforcement signal provides good approximations to the UE.

1 Introduction

Reinforcement learning (RL) in multiagent domains is a challenging task. In RL, an agent must learn by trial and error how to behave within the environment in order to maximise its utility. When multiple agents share a common environment, they must adapt their behaviour to those of others. The problem becomes even harder when the agents are selfish and compete for a common resource. An example is the route choice problem, which concerns how rational drivers (henceforth, we use the terms agent and driver interchangeably) behave when choosing routes between their origins and destinations to minimise their travel costs. Learning is a fundamental aspect of route choice because the agents must adapt their choices to account for the changing traffic conditions. In other words, the agents must adapt to each others' decisions.

An interesting class of multiagent RL techniques comprises the regret minimisation approaches. In this context, regret has typically been employed to measure the performance of reinforcement learning agents. Specifically, regret measures how much worse an agent performs by following its current policy rather than the best possible policy in hindsight. In this sense, regret minimisation can be seen as an inherent definition of how rational agents behave over time. Along these lines, the regret measure naturally fits as a guide for the learning process.

In recent works [Zinkevich et al., 2008; Bowling and Zinkevich, 2012; Waugh et al., 2015], regret has been used to improve the learning process. However, calculating regret requires complete knowledge of the environment (i.e., the utility associated with every possible policy). One may assume that an online service broadcasts the required information through mobile devices. Nevertheless, investigating methods that work in the absence of any global information is more challenging and also relevant, especially in highly competitive scenarios like traffic [Bazzan and Klügl, 2013; Stone and Veloso, 2000].

In this paper, we address the route choice problem by minimising regret. Specifically, we investigate how the agents can estimate their regret locally (i.e., based exclusively on their experience) and how such estimations can be employed to guide the RL process.
To this end, each agent keeps an internal history of experienced rewards, which is used for estimating the regret associated with each of its actions. We refer to such a measure as the estimated action regret and employ it for updating the agents' policies. The expected outcome corresponds to an equilibrium point in the space of policies in which no driver benefits from deviating from its policy. This is the so-called User Equilibrium (UE) [Wardrop, 1952]. To the best of our knowledge, this is the first attempt to improve the learning process by using regret estimations as the reinforcement signal.

Through experiments, we show that our approach provides fairly precise estimations of the agents' regret relying only on the agents' experience. Moreover, we present good evidence that using such regret estimates as the reinforcement signal is beneficial for the learning process. Consequently, in all tested cases, the results are reasonably close to the UE.

We remark that this work represents our very first step towards developing rational agents able to analyse their learning performance and to improve their expected outcome. In the medium term, we aim at investigating formal aspects of the learning process to guarantee the efficiency of RL in multiagent domains.

This paper is organised as follows. The background on route choice, RL and regret algorithms is presented in Section 2. In Sections 3 and 4, we describe how the agents can estimate their regret locally and how they can learn using such estimations, respectively. The experimental evaluation is discussed in Section 5. Finally, Section 6 presents the concluding remarks and future work directions.

2 Background

2.1 Route Choice

The route choice problem concerns how rational drivers behave when choosing routes between their origins and destinations. In this section, we introduce the basic concepts related to route choice. For a more comprehensive overview, we refer the reader to [Ortúzar and Willumsen, 2011].

A road network can be represented as a directed graph G = (N, L), where the set of nodes N represents the intersections and the set of links L represents the roads between intersections. The demand for trips generates a flow of vehicles on the links, with f_l the flow on link l. Each link l ∈ L has a cost c_l : f_l → R+ associated with crossing it. Let T_{ij} be the demand for trips between origin i ∈ N and destination j ∈ N (we refer to an origin-destination pair simply as an OD pair). The set of all such demands is represented by an OD matrix T = {T_{ij} | ∀i, j ∈ N, i ≠ j, T_{ij} ≥ 0}. The total demand is denoted d = Σ_{T_{ij} ∈ T} T_{ij}. A trip is made by means of a route R ⊆ L, which is a sequence of links connecting an origin to a destination. The cost of a route R is given by C_R = Σ_{l ∈ R} c_l.

The cost of a link is typically modelled using the volume-delay function (VDF) abstraction. A VDF takes as input the flow of vehicles on a link and, based on the link's attributes (such as length and capacity), returns its cost (travel time). A simple VDF is presented in Equation (1), with t_l denoting the free-flow travel time (i.e., the minimum travel time, observed when the link is not congested). In this particular VDF, the travel time on link l increases by 0.02 for each vehicle/hour of flow.

    c_l(f_l) = t_l + 0.02 \times f_l    (1)
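These abstractions map directly onto code. The following minimal sketch (not the authors' implementation; class and variable names, as well as the toy network, are illustrative assumptions) shows a link with the VDF of Equation (1) and the route cost C_R computed as the sum of its links' costs.

```python
# Minimal sketch of the road-network abstractions of Section 2.1: a link with
# the volume-delay function of Equation (1) and the route cost C_R as the sum
# of the costs of its links. Names and the toy network are illustrative.

class Link:
    def __init__(self, free_flow_time):
        self.free_flow_time = free_flow_time  # t_l: travel time without congestion
        self.flow = 0                         # f_l: number of vehicles on the link

    def cost(self):
        # Equation (1): travel time grows by 0.02 per unit of flow.
        return self.free_flow_time + 0.02 * self.flow

def route_cost(route, links):
    # C_R: sum of the costs of the links forming route R.
    return sum(links[link_id].cost() for link_id in route)

# Toy example: a route made of a single congested link.
links = {"l1": Link(free_flow_time=10.0), "l2": Link(free_flow_time=12.0)}
links["l1"].flow = 50
print(route_cost(["l1"], links))  # 10.0 + 0.02 * 50 = 11.0
```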
In the route choice process, drivers decide which route to take every day to reach their destinations. Usually, this process is modelled as a commuting scenario, where drivers' daily trips occur under approximately the same conditions. In this sense, each driver i ∈ D, with |D| = d, is modelled as an agent, which repeatedly deals with the problem of choosing the route that takes the least time to its destination. The utility u_i : R → R received by driver i after taking route R is inversely associated with the route's cost, as in Equation (2).

    u_i(R) = -C_R    (2)

The expected outcome of this problem can be described by Wardrop's first principle: "the cost on all the routes actually used are equal, and less than those which would be experienced by a single vehicle on any unused route" [Wardrop, 1952]. Such a solution concept is known as User Equilibrium (UE). Observe, however, that the UE does not describe the system at its best operation. Such a state is only achieved when the average travel cost is minimum, as described by Wardrop's second principle [Wardrop, 1952]. This solution concept is commonly referred to as System Optimum (SO).

2.2 Reinforcement Learning

Reinforcement learning (RL) concerns how an agent learns a behaviour through reward and punishment from interactions with its environment. We can formulate the RL problem as a Markov decision process (MDP). An MDP is a tuple ⟨S, A, T, r⟩, where: S is the set of environment states; A is the set of actions; T : S × A × S → [0, 1] is the state transition function, with T(s, a, s′) = P(s′ | s, a) representing the probability of reaching state s′ after taking action a in state s; and r : S × A → R is the reward function, with r(s, a) denoting the reward received after taking action a in state s [Sutton and Barto, 1998].

In the context of the route choice problem, the actions of an agent represent the choice of routes between its origin and destination. Such a set of actions is known a priori by the agents. In this sense, the set of states and, consequently, the transition function can be ignored. Moreover, we can define the reward received after taking action a as r(a) = u(R), with a = R (we will refer to reward and utility, rather than cost, hereinafter). On this basis, we can model the route choice problem as a stateless MDP. (A stateless MDP is equivalent to having an initial state, whose actions correspond to the routes available to the agent, and an ending state with no actions: when the agent chooses action a in the initial state, it performs the desired action and reaches the ending state with probability 1, receiving reward r(a).)

Solving a stateless MDP involves finding a policy π (i.e., which route to take) that maximises the accumulated reward. When the model of the environment dynamics (i.e., the reward function r) is known by the agent, finding such an optimal policy is trivial. However, this is rarely the case, especially in multiagent settings. Hence, the agent must repeatedly interact with the environment to learn a model of its dynamics. A particularly suitable class of RL algorithms here comprises the so-called temporal-difference algorithms, through which an agent can learn without an explicit model of the environment.

The Q-learning algorithm is a good representative of temporal-difference methods [Watkins and Dayan, 1992]. In the case of a stateless MDP, a Q-learning agent learns the expected return Q(a) for selecting each action a by exploring the environment. The exploration of the environment must balance exploration (gain of knowledge) and exploitation (use of knowledge). A typical strategy here is ε-greedy, in which the agent chooses a random action with probability ε (exploration) or the best action with probability 1 − ε (exploitation). Usually, ε starts at 1.0 and, at the end of each learning episode, it is multiplied by a decay rate λ in order to increase exploitation as the agent converges to its best action. After taking action a and receiving reward r(a), the stateless Q-learning algorithm updates Q(a) using Equation (3), where the learning rate α ∈ [0, 1] weights how much of the previous estimate should be retained. The Q-learning algorithm is guaranteed to converge to an optimal policy in the limit under certain assumptions.

    Q(a) = (1 - \alpha)\, Q(a) + \alpha\, r(a)    (3)
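To make the stateless setting concrete, the sketch below implements a stateless Q-learning agent with ε-greedy exploration and ε-decay, following Equation (3). It is a hedged illustration under our own naming and default values, not the authors' implementation.

```python
# Sketch of a stateless Q-learning agent with epsilon-greedy exploration, as
# described in Section 2.2. The update is Equation (3):
#   Q(a) = (1 - alpha) * Q(a) + alpha * r(a).
# Names and default values are illustrative assumptions.

import random

class StatelessQLearner:
    def __init__(self, actions, alpha=0.5, epsilon=1.0, decay=0.99):
        self.q = {a: 0.0 for a in actions}  # expected return per action (route)
        self.alpha = alpha                  # learning rate
        self.epsilon = epsilon              # exploration probability
        self.decay = decay                  # decay rate (lambda), applied per episode

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best action.
        if random.random() < self.epsilon:
            return random.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, action, reward):
        # Equation (3): blend the previous estimate with the new reward.
        self.q[action] = (1 - self.alpha) * self.q[action] + self.alpha * reward
        self.epsilon *= self.decay  # increase exploitation as learning proceeds
```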
2.3 Regret

The regret concept was introduced in the context of evaluating the performance of learning rules [Hannan, 1957]. Regret measures how much worse an agent i performs, on average, by following its current policy π_i ∈ Π as compared to following the best possible policy in hindsight. Precisely, the regret R^T_i of agent i at time T is given by Equation (4), where r^t(a) represents the reward of action a at time t and, slightly abusing notation, π^t represents the action taken at time t under policy π. Therefore, regret can be seen as the average amount lost for following a policy other than the best one.

    R^T_i = \max_{\pi \in \Pi} \frac{1}{T} \sum_{t=1}^{T} r^t(\pi^t) - \frac{1}{T} \sum_{t=1}^{T} r^t(\pi_i^t)    (4)

In the context of reinforcement learning (RL), regret has typically been used as a measure of convergence [Shoham et al., 2007; Buşoniu et al., 2008; Zinkevich, 2003; Bowling and Veloso, 2002; Powers and Shoham, 2005; Banerjee and Peng, 2005]. An RL algorithm is said to be no-regret if it learns a policy π for which R^T → 0 as T → ∞ [Hannan, 1957]. Along these lines, regret minimisation can be seen as a natural definition of how rational agents behave over time. In this paper, we use the regret measure to guide the learning process.

We remark that, by definition, computing regret assumes complete knowledge of the environment. Specifically, an agent cannot compute its regret without knowing the reward of every other action over time. Consequently, agents cannot (i) calculate their regret and (ii) learn using their regret, except if very strong assumptions are made (e.g., assuming that every agent knows the rewards of all actions over time). Therefore, using the regret of Equation (4) to guide the learning process is not realistic in multiagent scenarios.
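For concreteness, the sketch below computes the regret of Equation (4) under exactly the full-information assumption criticised above: it needs the reward every action would have yielded at every time step. Only deterministic (fixed-route) policies are considered, as is done in the experiments of Section 5; the function name and input format are our own illustrative choices.

```python
# Illustrative computation of the real agent regret of Equation (4). Note the
# unrealistic input: the reward of *every* action at *every* time step. Only
# deterministic (fixed-action) policies are considered, as in Section 5.

def agent_regret(all_rewards, taken_actions):
    """all_rewards[t][a]: reward action a would have yielded at time t.
    taken_actions[t]: action actually taken by the agent at time t."""
    T = len(taken_actions)
    actions = all_rewards[0].keys()
    # Average reward of the best fixed action in hindsight.
    best_avg = max(sum(all_rewards[t][a] for t in range(T)) / T for a in actions)
    # Average reward actually obtained by the agent.
    own_avg = sum(all_rewards[t][taken_actions[t]] for t in range(T)) / T
    return best_avg - own_avg
```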
Zinkevich et al. introduced the concept of counterfactual regret and proposed an algorithm for minimising it [Zinkevich et al., 2008]. The counterfactual regret is used to estimate the regret when the information about states is incomplete (useful in extensive-form games with imperfect information). This is one of the first works to include regret in the learning process. Subsequently, Waugh et al. employed a regression algorithm to provide enhanced estimates of the counterfactual regret [Waugh et al., 2015]. However, these works assume that the regret is known by the agents, which is unrealistic for the problem we are considering.

In [Bowling and Zinkevich, 2012], the authors propose a graph representation to express the relation between actions and the associated regret. Such a representation was employed to mimic the structure of local search methods, thus allowing no-regret algorithms to minimise a broader class of optimisation problems. Nevertheless, their work also assumes that the utility function is available to the agents.

Powers and Shoham proposed a set of performance criteria regarding multiagent learning and an algorithm that meets such criteria [Powers and Shoham, 2005]. Their algorithm, however, makes some strong assumptions regarding the environment's observability (e.g., an agent can observe its opponents' rewards). Banerjee and Peng extended Powers and Shoham's approach to a class of no-regret algorithms and dropped some of the observability assumptions [Banerjee and Peng, 2005]. Notwithstanding, these approaches do not employ regret to guide the learning process.

Along these lines, in this paper we investigate how agents can estimate their regret based on their experience, and we propose the use of such estimations to guide the agents' learning process. Moreover, we disaggregate the regret formulation by measuring the regret of actions rather than of policies. We show that performing such estimates is realistic and improves the learning process. To the best of our knowledge, this is the first effort towards addressing regret estimation and learning through such regret.

3 Estimating Regret Locally

In this section, we discuss how agents can estimate their regret. As discussed in Section 2.3, the agents cannot compute their real regret (using Equation (4)) due to the lack of information regarding the routes' rewards. The point is that regret is measured considering (i) the agent's average reward under its current policy and (ii) the average reward under the best policy in hindsight. Calculating the latter requires knowing the rewards of all routes over time. However, after each trip, an agent can observe the reward of the route taken, but not the rewards of the other routes. Such a full-observability property would only be possible under strong assumptions (e.g., a central authority broadcasting such information), which can be unrealistic in traffic domains. Furthermore, investigating methods that work in the absence of any supporting service is more challenging and also relevant, especially in the highly competitive settings considered here [Stone and Veloso, 2000].

In this paper, we investigate how agents can estimate their regret based exclusively on local information (i.e., the rewards actually experienced by them). To this end, we propose an alternative definition of regret that describes the estimated regret of each action.

Let A_i ⊆ A denote the set of routes of agent i (we slightly abuse notation here to account for drivers with different OD pairs, whose route sets are obviously different). At time t, agent i performs a given action a_i^t ∈ A_i and receives a reward of r^t(a_i^t). We represent the history of experiences of agent i as H_i = {r^t_{i,a} | a ∈ A_i, t ∈ [1, T]}, with r^t_{i,a} the reward experienced by driver i for taking action a at time t. Recall, however, that an agent cannot observe the reward of action a at time t unless it has taken that action at that time, i.e., unless a = a_i^t. To address this, the agents can assume the reward of non-taken actions to be the same as the most up-to-date experience on those routes.
Precisely, let r̃^t_{i,a} represent the newest reward experience of agent i for taking action a at time t (either the current reward or the last actually experienced one), as given by Equation (5). As initial value, one can consider the reward corresponding to the links' free-flow travel times, i.e., the minimum possible cost (as described in Section 2.1); from a practical perspective, such information could easily be obtained using any offline navigation device. The history of experiences of agent i can then be rewritten as H_i = {r̃^t_{i,a} | a ∈ A_i, t ∈ [1, T]}.

    \tilde{r}^t_{i,a} = \begin{cases} r^t(a_i^t) & \text{if } a = a_i^t \\ \tilde{r}^{t-1}_{i,a} & \text{otherwise} \end{cases}    (5)

Given the above definitions, we can now formulate the average estimated regret of agent i for taking action a at time t, as in Equation (6). In general terms, we will refer to this formulation as the estimated action regret. The estimated action regret R̃^t_{i,a} can be seen as the estimated amount lost by agent i for taking action a at time t instead of the best action according to its experience. Additionally, we can reformulate Equation (4) to obtain the estimated agent regret, as in Equation (7). The estimated agent regret R̃^t_i expresses how much worse agent i performed, on average, up to time t for not taking only the best action according to its experience. The main advantage of this formulation over the real regret (Equation (4)) is that it can be computed locally by the agents, eliminating the need for a central authority. Moreover, as the required information is already available to the agents, they can use such regret to guide their learning process.

    \tilde{R}^t_{i,a} = \max_{b \in A_i} \frac{1}{t} \sum_{s=1}^{t} \tilde{r}^s_{i,b} - \frac{1}{t} \sum_{s=1}^{t} \tilde{r}^s_{i,a}    (6)

    \tilde{R}^t_i = \max_{a \in A_i} \frac{1}{t} \sum_{s=1}^{t} \tilde{r}^s_{i,a} - \frac{1}{t} \sum_{s=1}^{t} r^s(a_i^s)    (7)
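The sketch below illustrates how an agent can maintain this history and compute the estimated regrets incrementally: the latest experienced reward per route implements Equation (5), and running sums give Equations (6) and (7). It is a minimal illustration under our own naming, with initial values following the free-flow suggestion above; it is not the authors' code.

```python
# Sketch of the locally estimated regret of Equations (5)-(7). The agent keeps
# the most recent reward it experienced on each route (Equation (5)) and uses
# these values in place of the unobserved ones. Names are illustrative.

class RegretEstimator:
    def __init__(self, initial_rewards):
        # \tilde{r}_{i,a}: latest experienced reward per action, initialised
        # with the free-flow (minimum-cost) rewards, as suggested above.
        self.latest = dict(initial_rewards)
        self.sum_estimated = {a: 0.0 for a in initial_rewards}  # sums of \tilde{r}^s_{i,a}
        self.sum_received = 0.0                                 # sum of r^s(a^s_i)
        self.t = 0

    def observe(self, action, reward):
        # Equation (5): only the taken action is updated; all other actions
        # keep their most recent value.
        self.latest[action] = reward
        self.t += 1
        for a, r in self.latest.items():
            self.sum_estimated[a] += r
        self.sum_received += reward

    def action_regret(self, action):
        # Equation (6): estimated loss for taking `action` instead of the best one.
        best_avg = max(s / self.t for s in self.sum_estimated.values())
        return best_avg - self.sum_estimated[action] / self.t

    def agent_regret(self):
        # Equation (7): estimated average loss over the rewards actually received.
        best_avg = max(s / self.t for s in self.sum_estimated.values())
        return best_avg - self.sum_received / self.t
```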
4 Learning Through Regret

In this section, we present a simple algorithmic solution for the agents to learn using the estimated action regret of Equation (6). To this end, we employ the Q-learning algorithm (as presented in Section 2.2). We highlight, however, that any other reinforcement learning algorithm could be used as well.

Every driver i ∈ D is represented by a Q-learning agent. The route choice problem can then be modelled as a stateless MDP. As such, the states and the transition function can be ignored. Let A = {A_i | i ∈ D} be the set of agents' actions, where A_i is the set of routes of agent i, with A_i = A_j if agents i and j belong to the same OD pair. The reward for taking action a at time t is given by r^t(a).

The learning process works as follows. At each episode t ∈ [1, T], each agent i ∈ D chooses an action a_i^t ∈ A_i using the ε-greedy strategy. After taking the chosen action, the agent receives a reward of r^t(a_i^t). Afterwards, the agent updates its history H_i using Equation (5) and calculates the estimated regret of action a_i^t using Equation (6). Finally, the agent updates Q_i(a_i^t) using Equation (3). This process is repeated for each episode, aiming at minimising the agents' regret.

Recall that the original definition of regret in Equation (4) measures the regret of the agent, not of its actions. However, the agent regret is not useful in the learning process because it does not consider how much the reward of a single action contributes to the regret. In other words, as we consider the average regret, the reward of a good-performing action could be penalised by those of bad-performing actions. Moreover, the learning process works by adjusting the expected value (or regret) of each action of the agent. In this sense, our estimated action regret definition isolates the regret per action, thus allowing the actions to be evaluated singly. The estimated action regret is more suitable for evaluating how promising a given action is as compared to the others.

Finally, it is worth noting that, although each driver minimises its actions' regret, this is equivalent to minimising its total regret. Recall that the estimated action regret measures how much worse an action performs as compared to the best one. By employing such a value in the learning process, the agent puts more weight on the actions with smaller regret. Moreover, using the ε-greedy strategy, the agent tends to choose the action with the smallest regret with a higher probability. Consequently, we can state that minimising the estimated action regret over time is equivalent to minimising the estimated agent regret.
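The learning loop just described can be sketched by combining the pieces illustrated earlier (a stateless Q-learner and a regret estimator): each episode the agent picks a route ε-greedily, observes its reward, updates its history (Equation (5)), computes the estimated action regret (Equation (6)), and feeds it to the Q-learning update (Equation (3)). The paper does not fix a sign convention for the reinforcement signal; using the negative estimated regret as the reward is our assumption for making "smaller regret" correspond to a larger Q-value.

```python
# Sketch of the learning loop of Section 4. The `agent` and `estimator`
# arguments follow the interfaces of the earlier sketches; `reward_of` stands
# for the experienced reward, e.g. the negative travel time returned by a
# traffic simulation of all drivers' joint choices. The negative-regret signal
# is our assumption, not an explicit prescription of the paper.

def run_episodes(agent, estimator, reward_of, episodes=1000):
    """agent: exposes choose() and update(action, reward) (Equation (3));
    estimator: exposes observe(action, reward) and action_regret(action)."""
    for _ in range(episodes):
        action = agent.choose()                    # epsilon-greedy route choice
        reward = reward_of(action)                 # experienced reward r^t(a^t_i)
        estimator.observe(action, reward)          # history update, Equation (5)
        regret = estimator.action_regret(action)   # Equation (6)
        agent.update(action, -regret)              # Equation (3), regret as signal
```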
5 Experimental Evaluation

5.1 Setup

In order to evaluate our approach, we employ five different road networks that are available in the literature.

Pigou: a classical network introduced in [Pigou, 1920]. It comprises only two links l1 and l2, with c_{l1}(f_{l1}) = 1.0 and c_{l2}(f_{l2}) = f_{l2}/d. We set the number of drivers to d = 100, all of them belonging to the same OD pair. In this scenario, there are only two actions (one corresponding to each link), i.e., |A| = 2 (an illustrative encoding of this instance is sketched at the end of this subsection).

OW: a small, synthetic network, with |N| = 13, |L| = 48, and d = 1700 [Ortúzar and Willumsen, 2011, exercise 10.1]. The vehicles are distributed among four OD pairs. The costs of the links are calculated using Equation (1). In this network, the number of possible routes for each OD pair is high. We therefore employ the KSP algorithm [Yen, 1971] to find the k shortest routes for each OD pair, i.e., |A| = k.

Braess graphs: expanded versions of the network introduced in [Braess, 1968]. Specifically, let p ∈ {1, 2, ...} index the pth expansion of such a graph, where the 1st Braess graph is equivalent to the original graph [Roughgarden, 2006]. We generalise this model to consider a discrete set of drivers. The complete description of these graphs is available in Appendix A. We employ the 1st, 2nd and 3rd Braess graphs and define d = 4000, with all drivers belonging to the same OD pair and, by definition, |A| = 2p + 1.

An experiment corresponds to a complete execution, with 1000 episodes, of the Q-learning algorithm on a single network. After an execution is completed, we measure its performance by means of the average travel time (avg-tt hereafter, measured in minutes) and the average estimated agent regret (using Equation (7)), both regarding the last episode. We tested different combinations of the Q-learning parameters. For each such combination, 30 repetitions were performed. The best combination found was α = 0.5, ε = 1.0 and λ = 0.99 (the best value of α varied slightly from one network to another, but was reasonably close to 0.5 in all tested cases; thus, for uniformity, we chose α = 0.5 for all networks). Moreover, in the case of the OW network, we also tested different values of k (the KSP algorithm is run once for each OD pair, at the beginning of the experiment). The best results were achieved for k = 8. The results of these configurations were selected for further analysis in the next subsection.

The main hypotheses of our work are that: (i) the results approach the user equilibrium (UE), and (ii) the regret estimations are reasonably precise. In the next subsection, we analyse how successfully our approach performed regarding these hypotheses.
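As an illustration of the simplest scenario, the snippet below encodes the Pigou instance described above and evaluates its UE assignment, whose average travel time of 1.000 matches Table 1. The encoding is our own sketch, not the authors' experimental code.

```python
# Illustrative encoding of the Pigou network of Section 5.1: two parallel
# links, c_l1(f) = 1.0 and c_l2(f) = f / d, with d = 100 drivers on a single
# OD pair. Under UE all drivers take l2: its cost then equals 1, so no driver
# can do better by switching to l1, and the avg-tt is 1.000 as in Table 1.

d = 100
cost = {
    "l1": lambda f: 1.0,     # constant-cost link
    "l2": lambda f: f / d,   # congestible link
}

flows = {"l1": 0, "l2": d}   # user-equilibrium assignment
avg_tt = sum(cost[l](f) * f for l, f in flows.items()) / d
print(avg_tt)  # 1.0
```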
5.2 Results

The performance of our approach in all tested road networks is presented in Table 1. In the table, we show the two main performance metrics, avg-tt and average estimated agent regret. Additionally, in order to analyse such results, we present the UE (as reported in the literature), the average real agent regret (using Equation (4)), and the relative difference between the estimated and real regrets. In order to compute the real regret of Equation (4), we considered the space of policies to be a simple mapping from routes to deterministic policies; ignoring mixed policies over the set of available actions is a common practice in the literature [Banerjee and Peng, 2005; Zinkevich et al., 2008].

Table 1: Performance of our approach in different road networks. For each tested network, we show: the average travel time (which, ideally, should approximate the UE), the UE (as reported in the literature), the average estimated agent regret (Equation (7)), the average real agent regret (Equation (4)), and the relative difference between the estimated and real agent regrets.

network             avg-tt   UE       estimated regret   real regret   relative difference (%)
Pigou               1.000    1.000    0.0136             0.0135        4.11
OW                  67.156   67.157   0.0031             0.0045        8.02
1st Braess graph    1.988    2.000    0.0245             0.0224        8.25
2nd Braess graph    2.832    3.000    0.0393             0.0221        41.66
3rd Braess graph    3.701    4.000    0.0882             0.0293        64.64

Our first hypothesis states that the avg-tt results are close to the UE. As shown in Table 1, such results are indeed close to the UE values reported in the literature. The results move slightly away from the UE for the Braess graphs, especially the larger ones (p > 1). This phenomenon can be explained by the nature of such graphs. Under UE, only the so-called type-P routes are used (see Appendix A for a detailed description). However, such routes have very similar costs. Consequently, it becomes harder for the agents to choose which route to take. The problem becomes even harder for larger Braess graphs because the number of type-P routes also increases with p. Furthermore, the Braess graphs were designed so that the UE values are as far as possible from the System Optimum (SO). Nonetheless, observe that, for these particular graphs, our avg-tt results are closer to the SO than to the UE. Therefore, such results evidence that our approach provides good approximations to the UE.

Regarding our second hypothesis, its validation involves evaluating how precise the regret estimations are. To analyse such precision, we compare the real and estimated agent regrets by means of their relative difference. The lower this difference, the better. Of course, this difference cannot be computed by the agents, otherwise the regret estimation would not be necessary. As can be seen in Table 1, the estimated regrets are reasonably close to the real ones, especially for the Pigou, OW and 1st Braess graph networks. For the larger Braess graphs, the results were somewhat worse. The point here, again, is that the Braess graphs are more challenging because they were designed to be among the hardest networks. As the agents have more difficulty learning their best routes, the network takes longer to reach a more stable point. Consequently, the agents' estimations need to be updated more frequently to account for the high variations in the routes' costs. However, despite these difficulties, the estimations were reasonable. We highlight that such estimations could be greatly improved by adopting more sophisticated estimation methods (e.g., using a nonlinear regressor). Thus, the present results also evidence that our regret estimation mechanism reaches a reasonable level of precision.

Along these lines, we can conclude, at least experimentally, that the agents can, in fact, estimate their regret locally and use such information to learn their best routes. Obviously, these results are not definitive. As mentioned in the introduction, this work represents a very first step towards a more formal investigation of guarantees for RL algorithms, which is our medium-term objective.

6 Concluding Remarks

In this paper, we addressed the route choice problem by minimising the drivers' regret. This problem concerns how rational drivers learn which route to take so as to minimise their expected travel costs. To this end, we developed methods for learning agents to estimate their regret locally (i.e., based exclusively on their experience) and to learn using such estimations. Specifically, each agent keeps a history of experienced rewards, which is used to compute the so-called estimated action regret.
Based on experiments, we have shown that our approach provides reasonably precise estimations of the agents' regret and that such estimations are useful in the learning process. We recall that this work represents an initial effort towards a more formal investigation of efficiency guarantees for RL algorithms, which is our medium-term objective.

For future work, we would like to investigate formally how precise the regret estimations can be. We expect that more sophisticated methods could be employed to estimate the agents' regret (e.g., using a nonlinear regressor). Moreover, we would like to study how much our approach benefits when the agents can communicate to improve their estimations. We also consider investigating the benefits of employing an online service that provides global information to the agents. Regarding the learning process, a promising direction would be adopting algorithms that learn mixed policies over the actions rather than only the best action. Furthermore, considering that this work is a proof of concept, no comparison was made against other methods in the literature; making such a comparison is an obviously important step towards a more complete analysis of our approach. Last but not least, it would be interesting to explore how the learning process could be shaped towards globally efficient routes.

Acknowledgments

We are very grateful to the anonymous reviewers for their valuable comments. The authors are partially supported by CNPq and CAPES grants.

References

[Banerjee and Peng, 2005] Bikramjit Banerjee and Jing Peng. Efficient no-regret multiagent learning. In Proceedings of the Twentieth National Conference on Artificial Intelligence, pages 41–46. AAAI Press, 2005.

[Bazzan and Klügl, 2013] Ana L. C. Bazzan and Franziska Klügl. Introduction to Intelligent Systems in Traffic and Transportation, volume 7 of Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool, 2013.

[Bowling and Veloso, 2002] Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2):215–250, 2002.

[Bowling and Zinkevich, 2012] Michael Bowling and Martin Zinkevich. On local regret. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1631–1638, New York, USA, 2012. Omnipress.

[Braess, 1968] D. Braess. Über ein Paradoxon aus der Verkehrsplanung. Unternehmensforschung, 12:258, 1968.

[Buşoniu et al., 2008] L. Buşoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2):156–172, 2008.

[Hannan, 1957] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.

[Koutsoupias and Papadimitriou, 1999] Elias Koutsoupias and Christos Papadimitriou. Worst-case equilibria. In Proceedings of the 16th Annual Conference on Theoretical Aspects of Computer Science (STACS), pages 404–413, Berlin, Heidelberg, 1999. Springer-Verlag.

[Ortúzar and Willumsen, 2011] Juan de Dios Ortúzar and Luis G. Willumsen. Modelling Transport. John Wiley & Sons, Chichester, UK, 4th edition, 2011.

[Pigou, 1920] A. Pigou. The Economics of Welfare. Palgrave Classics in Economics. Palgrave Macmillan, 1920.

[Powers and Shoham, 2005] Rob Powers and Yoav Shoham. New criteria and a new algorithm for learning in multi-agent systems. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1089–1096. MIT Press, 2005.
[Roughgarden, 2006] Tim Roughgarden. On the severity of Braess's paradox: Designing networks for selfish users is hard. Journal of Computer and System Sciences, 72(5):922–953, 2006.

[Shoham et al., 2007] Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171(7):365–377, May 2007.

[Stone and Veloso, 2000] Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383, July 2000.

[Sutton and Barto, 1998] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[Wardrop, 1952] John Glen Wardrop. Some theoretical aspects of road traffic research. In Proceedings of the Institution of Civil Engineers, volume 1, pages 325–362, 1952.

[Watkins and Dayan, 1992] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.

[Waugh et al., 2015] Kevin Waugh, Dustin Morrill, J. Andrew Bagnell, and Michael Bowling. Solving games with functional regret estimation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2138–2144. AAAI Press, 2015.

[Yen, 1971] Jin Y. Yen. Finding the k shortest loopless paths in a network. Management Science, 17(11):712–716, 1971.

[Zinkevich et al., 2008] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1729–1736. Curran Associates, Inc., 2008.

[Zinkevich, 2003] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, pages 928–936, Menlo Park, USA, 2003. AAAI Press.

A Expanding the Braess Graphs

The original Braess graph was designed to illustrate the counter-intuitive phenomenon that removing a link from a road network may improve its outcome [Braess, 1968]. This is the so-called Braess paradox. In this paper, we are not interested in the paradox itself. However, we employ the Braess graphs for validating our approach, given that they pose interesting challenges in the drivers' decision process (as seen in Section 5.2). Roughgarden devised a method for generating the pth expansion of the original Braess graph [Roughgarden, 2006]. Nonetheless, his modelling requires the flow (i.e., the number of vehicles) to be normalised to the interval [0, p], with p the order of the Braess graph. In order to overcome this limitation, we extend Roughgarden's modelling by dropping such a requirement, thus delivering a more general model. Moreover, we provide updated theoretical results for the System Optimum (SO) and User Equilibrium (UE) solution concepts, as well as results for the Braess paradox and the price of anarchy.

A.1 Graphs Generation Process

Consider the modelling introduced in Section 2.1. Let B^p be the pth Braess graph, with B^1 being equivalent to the original Braess graph. The set of nodes can be described as N^p = {s, n_1, ..., n_p, o_1, ..., o_p, t}, with |N^p| = 2p + 2, and with s ∈ N and t ∈ N representing the source and target nodes, respectively, for all d drivers (i.e., all drivers share the same OD pair). Let (i, j) represent a link from i ∈ N to j ∈ N. The set of links can be formulated as L^p = {(s, n_i), (n_i, o_i), (o_i, t) : 1 ≤ i ≤ p} ∪ {(n_i, o_{i−1}) : 2 ≤ i ≤ p} ∪ {(n_1, t), (s, o_p)}, with |L^p| = 4p + 1. The links are grouped into three distinct types, each with a corresponding cost function, as follows.

type-A: for all links of the form (n_i, o_i), we use c^p_l(f_l) = 0;

type-B: for all links of the form (n_i, o_{i−1}), (s, o_p), and (n_1, t), we use c^p_l(f_l) = 1;

type-C: for all links of the form (o_i, t) and (s, n_{p−i+1}), with i ∈ {1, ..., p}, we use the piecewise, continuous, non-decreasing function of Equation (8). Using this function, the maximum possible cost of a type-C link (when f_l = d) is ip^2. The shape of the type-C cost function is illustrated in Figure 1.

    c^p_l(f_l) = \begin{cases} 0 & \text{if } f_l \le d/(p+1) \\ ip\,(p f_l + f_l - d)/d & \text{otherwise} \end{cases}    (8)

Figure 1: Shape of the type-C cost functions (cost 0 up to flow d/(p+1), then increasing linearly up to ip^2 at flow d).

The routes are divided into two groups. Let P denote the set of routes without any type-B link, i.e., P = {P_i = (s, n_i, o_i, t) | i ∈ [1, p]}, with |P| = p (we slightly abuse notation here, using (s, ..., t) to represent the sequence of connected links {(s, ·), ..., (·, t)} that form a route). All the other routes, each containing a type-B link, are then represented by Q = {(s, n_1, t)} ∪ {Q_i = (s, n_i, o_{i−1}, t) | i ∈ [2, p]} ∪ {(s, o_p, t)}, with |Q| = p + 1. We will distinguish the routes from these two groups as type-P and type-Q routes.

Figure 2: The first, second, and third Braess graphs. Links are labelled with their types.
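The generation process above can be sketched as follows. The function below is an illustrative implementation under our own naming (not the authors' code): it builds the node set, a map from links to the typed cost functions (Equation (8) for type-C), and the type-P and type-Q route sets of B^p.

```python
# Hedged sketch of the generation of the p-th Braess graph B^p (Section A.1).
# It returns the nodes, a map from links to cost functions (types A, B, C) and
# the type-P and type-Q route sets. Data structures are illustrative.

def braess_graph(p, d):
    nodes = ["s", "t"] + [f"n{i}" for i in range(1, p + 1)] + [f"o{i}" for i in range(1, p + 1)]

    def type_c(i):
        # Equation (8): cost 0 up to flow d/(p+1), then linear up to i*p^2 at flow d.
        return lambda f: 0.0 if f <= d / (p + 1) else i * p * (p * f + f - d) / d

    costs = {}
    for i in range(1, p + 1):
        costs[("s", f"n{p - i + 1}")] = type_c(i)      # type-C
        costs[(f"o{i}", "t")] = type_c(i)              # type-C
        costs[(f"n{i}", f"o{i}")] = lambda f: 0.0      # type-A
    for i in range(2, p + 1):
        costs[(f"n{i}", f"o{i - 1}")] = lambda f: 1.0  # type-B
    costs[("n1", "t")] = lambda f: 1.0                 # type-B
    costs[("s", f"o{p}")] = lambda f: 1.0              # type-B

    # Type-P routes (no type-B link) and type-Q routes (one type-B link each).
    routes_p = [[("s", f"n{i}"), (f"n{i}", f"o{i}"), (f"o{i}", "t")]
                for i in range(1, p + 1)]
    routes_q = ([[("s", "n1"), ("n1", "t")]]
                + [[("s", f"n{i}"), (f"n{i}", f"o{i - 1}"), (f"o{i - 1}", "t")]
                   for i in range(2, p + 1)]
                + [[("s", f"o{p}"), (f"o{p}", "t")]])
    return nodes, costs, routes_p, routes_q
```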
A.2 Theoretical Results

Given the above formulation, we can derive theoretical results for the System Optimum (SO) and the User Equilibrium (UE) as follows. The SO is achieved when the total flow d is equally divided among the p + 1 type-Q routes. In this case, each such route receives a portion d/(p + 1) of the flow and the type-P routes are not used. Under such conditions, each route has a cost of 1 and the avg-tt is also 1 (each type-Q route crosses exactly one type-B link, of cost 1, while its type-C links carry at most d/(p + 1) flow and therefore cost 0).

The UE, on the other hand, is achieved when only the p type-P routes are used. In this case, each such route receives a flow of d/p, thus costing p + 1 (route P_i crosses two type-C links whose costs at flow d/p are p − i + 1 and i, plus a type-A link of cost 0, summing to p + 1). Under such circumstances, the avg-tt is also p + 1. By comparing the SO and UE results, we can define the price of anarchy to be p + 1 [Koutsoupias and Papadimitriou, 1999].

Observe that, in both solution concepts, the avg-tt and the route costs are always the same. This is due to the fact that, in both cases, all routes being used have precisely the same flow and cost: under SO, all used routes have a flow of d/(p + 1) and cost 1 each, and under UE, all used routes have a flow of d/p and cost p + 1. Consequently, as all vehicles experience the same cost, the avg-tt and the route costs coincide.

Regarding the Braess paradox, our modelling does not invalidate it. Observe that, whenever the type-A links (those of the form (n_i, o_i)) are removed, all type-P routes are also eliminated. Moreover, recall that, under UE, the cost of the flow is p + 1. However, after the type-P routes are removed, the UE is achieved when the flow is equally divided among the type-Q routes, which is precisely the SO. Thus, removing the type-P routes leads to an improvement in the overall performance, meaning that the Braess paradox still holds in our model.
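These values can be checked numerically with the generator sketched at the end of Appendix A.1 (again an illustration under our own assumptions): splitting the demand equally over the type-Q routes yields an average travel time of 1 (SO), while splitting it over the type-P routes yields p + 1 (UE), so the price of anarchy is p + 1.

```python
# Numeric check of the SO and UE values above, reusing the braess_graph sketch
# from Appendix A.1. Each of the given routes carries an equal share of d.

def avg_tt(routes, costs, d):
    share = d / len(routes)
    flows = {}
    for route in routes:
        for link in route:
            flows[link] = flows.get(link, 0.0) + share
    total = sum(sum(costs[link](flows[link]) for link in route) * share
                for route in routes)
    return total / d

p, d = 3, 4000
_, costs, routes_p, routes_q = braess_graph(p, d)
print(avg_tt(routes_q, costs, d))  # 1.0  (System Optimum)
print(avg_tt(routes_p, costs, d))  # ~4.0 (User Equilibrium, p + 1)
```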