1. Introduction

Workshop Agents in Trafic and Transportation, July

to Trip Building

Guilherme Dytz dos Santos

gdsantos@inf.ufrgs.br 0

Ana L. C . Bazzan

bazzan@inf.ufrgs.br 0 0 Computer Science, Universidade Federal do Rio Grande do Sul (UFRGS) , Porto Alegre, RS , Brazil

2022

25 2022

Using reinforcement learning to support agents in making decisions that consider more than one objective poses challenges. We formulate the problem of multiple agents learn to travel from A to B in a trafic network as a reinforcement learning task in which we take into account: non-stationarity, more than one objective, and a a stochastic game based model. To solve this task, we propose an adaptation of an existing algorithm (PQL), which deals with more than one objective. We have applied the proposed method to a scenario in which hundreds of agents have to learn how to travel from their origins to their destinations, aiming at minimizing their travel times, as well as the carbon monoxide vehicles emit. Results show that the adapted algorithm is able to find eficient routes for the agents.

reinforcement learning (RL) multiobjective RL routing route choice

1. Introduction

http://www.inf.ufrgs.br/~bazzan (A. L. C. . Bazzan) kind of problem, the work described in [ 4 ] extends a Bandit algorithm like UCB [ 5 ], in order to account for multiple objectives and for multiple learning agents.

In contrast to the aforementioned works, in the present paper, each agent or driver not only performs its optimization process locally (in a decentralized way), by means of multi-objective RL (MORL), but also it builds its trip by making decisions at each intersection (that function as states). Hence, the underlying learning task is state-based. Other characteristics of the problem we deal with is that it involves many agents competing for resources. This also poses challenges to RL methods, even if only one objective is considered. Specifically, the fact that there are many agents learning simultaneously causes the environment to be non-stationary.

In short, a realistic, decentralized RL-based approach to routing involves the following issues: (i) agents learning simultaneously result in non-stationarity; (ii) there are potentially two or more objectives to be optimized; (iii) the underlying learning task is modeled as a stochastic game, where states are the intersections of the networks, where decisions about which link to take are made.

The remainder of this paper is organized as follows. The next section discusses the background concepts and gives an overview on the related work. Section 3 details the proposed method. Its evaluation in a proof-of-concept scenario is discussed in Section 4. Concluding remarks and future directions are given in Section 5.

2. Background and Related Work

In this section, we briefly introduce some relevant concepts on RL, as well as on trafic assignment, route choice, and routing (including multi-objective variants).

2.1. Reinforcement Learning

In RL, an agent learns how to act in an environment, by receiving a feedback (reward) that measures how its action has afected the environment. The agent does not a priori know how its actions afect the environment, hence it has to learn this by trial and error, which characterizes an exploration phase. Such phase may be very noisy in multiagent scenarios, given that all agents are learning simultaneously and hence have to adapt to others’ exploration processes. However, agents should not only explore; in order to maximize the rewards of their actions, they also have to exploit the gained knowledge. Thus, there must be an exploration-exploitation strategy that is to be followed by the agents. One of these strategies is -greedy, where an action is randomly chosen (exploration) with a probability , or , with probability 1- , the best known action is chosen, i.e., the one with the highest value so far (exploitation).

In the exploitation phase, at each interaction, it is assumed that an agent has sensors to determine its current state and can then make an action. The reward perceived or received from the environment is used to update its policy, i.e., a mapping from states to actions. This policy can be generated or computed in several ways. For the sake of the present paper, we concentrate on a model-free, of-policy algorithm called Q-learning or QL [ 6 ] and its extensions. QL estimates the so-called Q-values using a table to store the experienced values of performing a given action when in a given state. The value of a state and action at time is updated based on Eq. 1, where ∈ [ 0, 1 ] is the learning rate, ∈ [ 0, 1 ] is the discount factor, +1 is the next state and is the reward received when the agent moves from to +1 after selecting action in state .

( , ) ← ( , ) + ( + max(( +1 , )) − ( , )) (1)

RL can be employed to compute how drivers learn how to go from A to B in a transportation network. The next section thus introduces this learning task.

2.2. How to Travel from A to B?

In this section, we first discuss the diferences among the various terms that are related to this problem. Then, we introduce the concepts related to the definition of an urban network (topology), as well as the demand side. Finally, we discuss the need for multiobjective approaches.

Given a road network and the demand for its use (number of trips per unit of time), the task of finding out how each unit of such demand realizes its travel needs can be solved in diferent ways, although all seek to compute the so-called user equilibrium, i.e., the condition in which no user is better of by changing its route. We recall that, in the literature, demand can be expressed as trip, traveller, user, agent, or vehicle, depending on the community.

In the transportation community this problem is normally known as trafic assignment problem (TAP), which is solved in a centralized fashion, mostly via a macroscopic approach using a volume-delay function (VDF), which estimates the travel time as a function of the volume of trips; see Chapter 10 of [ 7 ] for details. In the computer science community, that task is frequently solved by means of RL in a decentralized way, in the sense that agents learn how to reach their destinations, by performing experiments (exploration) until they learns their respective ways (which is equivalent to the condition under the user equilibrium). This tasks admits two variants: in the first, an agent knows a set of precomputed routes (or is able to compute them), and its task is to choose a route among them (again by means of experimenting with several selections of actions or routes). Once such choice is done, an agent travels the route, without changing its mind during the trip. The second variant departs from the assumption that agents are given a set of routes. Instead, agents make choices during the trip (e.g., at each intersection, or at each important point of decision), thus building the actual route while traveling. In terms of RL, these two variants have a key diference as well. While the latter is a classical RL task with a set of states (e.g., each decision points), and a set of actions, the former belongs to the stateless RL front, where agents basically need to select an action at the origin node (the sole state). Moreover, this learning task resembles Bandit algorithms such as UCB [ 5 ]. Henceforth, we refer to the former as route choice (as the agent simply chooses a route and remain on it), while the latter is referred here as routing to convey the idea that the trip is build during the actual routing task.

Regarding approaches that were already proposed to tackle this problem, in the literature, there are various approaches to solve the TAP; here we refer only to those that address more than one objective such as [ 8, 3, 2, 1 ]. Others can be found in [ 9 ]. Regarding route choice, the reader is referred to [ 10, 11 ]; in particular, multiobjective route choice is tackled by [ 4 ]. All these works use a macroscopic approach in the sense that VDF’s are used to provide a reward for each agent based on the usage of a given route. Routing is less common in the literature but some works appear in [ 12, 13, 14, 15 ]. While the former employs a VDF to compute rewards (thus, a macroscopic approach), the last two use a microscopic simulator, where vehicles realize the actual driving movements. As such, rewards (e.g., travel times) are given by the simulator itself and correspond to a more realistic reward (as opposed to the estimation provide by a VDF).

This said, next we briefly review the concept of transportation network and the demand that uses it. Formally, a trafic network can be modeled as a graph = ( , ) , where is the set of vertices or nodes that represent the intersections, and , the set of links that describe the segments connecting a pair of vertices. In the network, trips are distributed among a set of origin-destination (OD) pairs; each corresponding to a certain demand for trips. These then translates into flows on the various links.

In all methods it is assumed that agents (drivers) are rational, thus each selects the route or link with the least perceived individual cost, in order to travel from its origin to its destination. Several factors may influence this decision, such as travel time, distance, monetary cost (e.g., toll), fuel consumption, battery, emission of pollutants, etc.

Traditionally, multi-objective trafic assignment is modeled by using a linear combination of the various objectives, as proposed, for instance, by Dial in [ 8 ] for a bi-objective assignment. Such a linear combination has some drawbacks. Some eficient routes are missed, as discussed in [ 3 ]. The key point is that algorithms that use a linear combination only identify supported solutions at extreme points, while there may be other eficient solutions that are not considered in that group but could be preferred by some users.

Hence, it is necessary to have alternative solutions. One issue that arises is that, in a multiobjective scenario, there is a set of eficient solutions instead of just one. Diferent solutions have been proposed in the transportation and optimization communities; see for instance [ 8, 3, 2, 1 ]. We remark that all these methods rely on a set of (pre-computed) shortest paths and they are centralized. Thus, they are not useful in a multiagent environment where each agent should learn by its own experience using RL, nor they fit a state-based RL task, where agents learn at each intersection.

When employing RL to the problem of how agents should travel from A to B, there are two possible approaches. The first one relates closely to the TAP in the sense that shortest routes are computed for agents in each OD pair, although it is not centralized as it is the case with the aforementioned methods. Examples of this approach are [ 10, 11 ] among others.

The second variant relax the assumption that each agent selects one of the shortest routes (and then travel on it, without deviating from the selected route). Rather, agents make choices en route, i.e., they decide which link to follow once they find themselves in each vertex. This is the approach followed in [16, 14] and also in the present paper.

2.3. Multiobjective Approach to Q-learning (PQL)

In this section we discuss how to use QL when agents need to deal with more than one objective. Besides the discussion ahead, we refer readers to other approaches; see [17, 18].

In order to extend the aformentioned QL algorithm, [19] proposed Pareto Q-learning (PQL), which integrates the Pareto dominance relation into a MORL approach. PQL considers both the immediate reward vector and the set of expected future discounted reward vectors in order to compute the so-called Q-sets, which are composed of vectors. In PQL, the set of expected future discounted reward vectors relies on a function, called (from non-dominated), that finds those vectors that correspond to the possible future states, and which are not Pareto-dominated.

Eq. 2 shows how Q-sets are calculated. (,̄ ) denotes the immediate reward vector and (, ) is the set of non-dominated vectors in the next state , which is reached by performing action at time step . (,̄ ) is added to each element of (, ) . When is selected at , both terms are updated. (,̄ ) is updated according to Eq. 3, where ⃗ is the new reward vector and (, ) is the number of times action was selected in . (, ) is updated as shown in Eq. 4, using the non-dominated vectors in the ̃ of every action ′ in the next state ′. ̃ (, ) = (,̄ ) ⊕

(, ) (,̄ ) =

(,̄ ) + (, ) = (∪ ′ ̃ ( ′, ′)) ⃗− (,̄ ) (, ) (2) (3) (4)

PQL learns the entire Pareto front, finding multiple Pareto optimal solutions, provided that each state-action pair is suficiently sampled. This algorithm is not biased by the Pareto front shape (algorithms that find a single policy and use scalarization cannot sample the entire Pareto front if it is non-convex) or a weight vector (it guides the exploration to specific parts of the search space).

Note that PQL assumes that the environment is deterministic, hence it is not directly useful for environments in which the presence of multiple agents learning simultaneously cause non-stationarity, as for instance the route choice domain we use ahead.

Our contribution (called aPQL, detailed ahead) is an adaptation of PQL to deal with multiple objectives and with multiple agents (thus, a non-stationary environment).

3. Methodology 3.1. Markov Decision Process Definition

Before presenting the details of the proposed method, some considerations to the Markov decision process (MDP) underlying the learning task should be addressed. In an MDP, we define a set of states , a set of actions , a reward function ∶ × → ℝ , and a probabilistic state transition (, , ′) → [ 0, 1 ], with ∈ being the state the agent is currently in, ∈ the action the agent takes and ′ ∈ being the state it might end up, taking action in state .

In the routing learning task, each intersection within a trafic network is a state , and the actions are the links an agent might take when in (i.e., links that leave the intersection). In short, agents build their routes by making decisions about which link to follow, when finding themselves in a given intersection.

Regarding the environment, the transition model is implemented by a microscopic trafic simulator itself, which moves vehicles along the network. As for the rewards, since we deal with a multiobjective scenario, the rewards returned by the simulator are assembled in a vector; this way, the reward function definition is ∶ × → ℝ , where is the number of objectives to be optimized. As detailed in Section 3.3, in the present paper we deal with agents aiming at optimizing two objectives: travel time and carbon monoxide emission. However, the proposed method is general and can be used if more objectives are taken into account.

3.2. Time Steps and Episodes

In a traditional RL environment, at each time step, each agent finds itself in state , chooses action , receives a reward and transitions to a state ′. In the learning task at hand, time steps are controlled by the microscopic simulator and correspond to an unit of time such as one second. This has important consequences. First, not all time steps count as decision-making steps; these only happen when an agent is at an intersection. The rest of the time steps are used (e.g., see plots ahead) just as clock units. Second, diferent agents find themselves either in decision-making states, or are moving along a link (thus, not updating their value functions), or have reached their destinations. In the latter case, an episode is finished for that particular agent. Since travel times and destinations are diferent for most of the agents, episodes are not synchronous. Therefore, in general, we do not refer to episodes; rather, we use the clock units or time steps of the simulator to depict time. To control the simulation time, we introduce a parameter called maximum number of steps, .

3.3. Reward Calculation

In this section we detailed how we deal with multiple objectives in the reward function.

In the exploration phase, due to uncoordinated action selection by the agents, congestion may occur, which impacts the reward of the agents, while also inserting a lot of noise in the learning process, which then leads to slow convergence. In order to speed up this process, and avoid agents wandering around the network performing experimentation, two hyper-parameters were introduced. The first one is a bonus that is added to the reward when an agent reaches its destination. The second is a penalty that is subtracted to the reward when an agent is in a dead-end node, i.e., a node from which it cannot reaches its destination at all.

Given that the reward accounts for more than one objective (and thus is represented by a vector of size , where is the number of objectives to optimize), both the bonus and the penalty are added (subtracted in the case of ) to the corresponding scalar value, as stated in Eq. 5 (for the bonus) and Eq. 6 (for the penalty).

⃗= ⃗+ ⋅ ⃗ 1 ⃗= ⃗− ⋅ 1⃗ (5) (6)

A further hyper-parameter of our model resembles a kind of toll on emissions of pollutants. It is designed based on the assumption that certain links of the network should be spared from high emission levels; this may be the case of links close to parks, hospitals, residential areas, etc. This parameter is the penalty ; it acts similarly to the aforementioned penalty , with the diference that is subtracted only to the element of the reward vector that refers to the emission. Let ̂ be an -th basis vector, where is the position of the emission element of the reward vector. Then the penalty is used to update the reward as in Eq. 7. Algorithm 1 summarizes how the reward is calculated.

⃗= ⃗− ⋅ ̂ (7) Algorithm 1: Compute Reward

Data: , , ,

Result: ⃗ 1 ⃗← −[. 2 ⃗← ⃗+ ⋅ 3 ⃗← ⃗− ⋅ 1⃗; 4 ⃗← ⃗− ⋅ ̂ ;

1⃗; _ _ _, . _ _] ;

3.4. Action Selection: Adapted Pareto Q-Learning (aPQL)

Originally, in PQL (i.e., as proposed in [19]), every pair state-action (, ) is associated with a set of non-dominated points that represents the Pareto front. Q-sets are computed as in Eq. 2 and are used to compute a hypervolume of the Pareto front. This is then used to determine the best action. In our case, this was not efective given that the Q-sets tend to contain few points only, thus computations based on hypervolume tend to be inefective. Therefore, we had to modify the action selection strategy. Instead of using the hypervolume of the Pareto front to determine the best action, we let each agent select randomly (with uniform probability, at each decision point, and independently of one another) which objective is to be optimized at that given step. This seems to be a fair substitution for the hypervolume and avoids biasing towards one objective. Once a particular objective is randomly drawn, then an agent uses the -greedy strategy to select an action, given the Q-set. Note that the approach continues to be multiobjective, as all objectives have their values updated after a given action was selected.

Given this adaptation in the action selection part of the approach proposed in [19], henceforth we refer to ours as aPQL (adapted PQL). The remainder of aPQL follows basically the same procedures underlying PQL.

3.5. The Whole Picture

Algorithm 2 shows how our method works for each agent. The required parameters are: (maximum number of time steps); , and (bonus, the destination, and the emission penalty discussed in Section 3.3); the discount factor as well as the ( -greedy exploration strategy). Left4 Left3 Left2 Left1 Left0

Top0 Top1 Top2 Top3 Top4 A4 A3 A2 A1 A0

B4 B3 B2 B1 B0

C4 C3 C2 C1 C0

D4 D3 D2 D1 D0

E4 Right4 E3 Right3 E2 Right2 E1 Right1

E0 Right0 Bottom0 Bottom1 Bottom2 Bottom3 Bottom4

OD-Pairs Demand

Left0|Right4 / Right4|Left0 Left1|Right3 / Right3|Left1 Left3|Right1 / Right1|Left3 Left4|Right0 / Right0|Left4 in a destination or in a dead-end node. Recall that a dead-end node is one from which it is not possible to reach a destination at all, an information provided by SUMO. In both cases, Algorithm 1 is used to compute the reward, as stated in Section 3.3: in the case of a destination node (line 12), a non-zero bonus is given; in the case of a dead-end (line 15), the reward includes a non-zero penalty.

The agent can then update its non-dominated policies for the last action using the Q-sets of possible next states, and the average rewards (line 17). After these updates, the agent (re)starts a new trip (an episode for this particular agent), thus it is reinserted at its origin node.

The else statement in lines 19 - 29 deals with the agent that reached a node that is not its destination (or a dead-end). In this case, the reward computation in line 20 neither a bonus nor a penalty apply. Thus both and are set to zero.

Both methods in lines 17 and 22, as well as the Q-set computation in line 21 are performed as proposed in the original PQL algorithm.

4. Experiments and Results 4.1. Network and Demand

To test the approach, we used a 5 × 5 grid network depicted in Figure 1. The demand corresponding to the OD (origin-destination) matrix is shown in in Table 1. For each OD pair, the number of trips refers to both directions. For example, for the first line of the table, half of the vehicles travel from Bottom0 to Top4 and half from Top4 to Bottom0.

Each link shown in Figure 1 is two-way. The ticker links (in blue, connecting nodes A2 to E2) are the ones in which we want to have less emission. Thus, each agent using those links receive a penalty (as discussed in Section 3.3) once it travels through them.

4.2. Methods and Parameters

The method proposed was compared against three others: dynamic user assignment (DUA), which is an iterative method for computing the user equilibrium provided by SUMO; the standard QL optimizing only the agents’ travel time (QLTT); and the standard QL optimizing only the CO emission (QLCO).

All parameters discussed in Section 3 were set as follows: maximum simulation steps = 50, 000; since we use normalized (ranging from [−1, 0]) values for the reward, all values regarding reward computation were set as = 1 , = 1 and = 1, guaranteeing that the rewards values received are always within [ −1, 1 ], as are the rewards. For all learning methods, the discount factor = 0.9 and a fixed value = 0.05 were set. Lastly, for the methods based on the standard QL (QLTT and QLCO), a learning rate = 0.5 was used. We remark that other values for these parameters were tried, without significant improvement.

Its is also worth mentioning that, due to the stochastic nature of all methods we used for comparison, each of them were repeated 30 times. In the plots ahead, we show the mean values (and their standard deviations, as shadows).

4.3. Results and Analysis

The comparison among the aforementioned methods was performed using values related to three diferent metrics: travel time (Figure 2a), CO emission (2b) and speed (2c). These values were collected from each link in the network. Plots in Figure 2 show a moving average (over a window of 600 steps) of the corresponding value, over all links.

In order to investigate performances for a part of the network only (those links indicated in blue in Figure 1), Figure 3 also depicts the three metrics, however values there are averaged only over those particular links (not for the entire network, as in Figure 2).

Results plotted in Figure 2 show a clear advantage of the aPQL method: it outperforms the others, showing lower travel time, lower CO emission, and higher speed (already after approximately 20,000 steps). These results seem to indicate that the aPQL method have a better usage of the links in the network. On a related note, one might assume that higher speeds would lead to higher CO emission. However, a more decisive factor is acceleration/deceleration. If the speed is constant, this seems to not afect emissions in a significant negative way.

Another issue is that one also observes a better distribution of vehicle in the network, which may be due to the fact that the aPQL takes the advantage of both the optimization of travel time and CO emission. The penalty imposed in the links A2 to E2 in Figure 1 causes both methods that optimize the objectives independently to behave in a diferent way. Optimizing travel time (QLTT) alone leads agents to a solution that can be seen in Figure 3(b), where no considerable reduction to CO emission is observed when we look at the red curve that represents QLTT. Conversely, optimizing only the CO emission causes the agents not to take the links in which they are penalized. In fact, since they completely stop selecting such links, it is possible to see that there are no data points for travel time and speed in Figure 3 regarding QLCO. One also notices that the average network travel time from QLCO (green curve in Figure 2(a)) is slightly higher than the one observed in QLTT. This happens due to the fact that, by avoiding several links within the network, the agents in QLCO have less available actions, and thus experience an increase in travel time.

By employing an approach that considers both objectives at the same time, aPQL makes the agents reach a compromise that seems to yield a better distribution of vehicles throughout the network: they mostly (but not entirely) avoid the links in which the CO emission is penalized (observed in the orange curves in Figure 3), using other parts of the network, taking advantage of the fact that those links are less occupied. This leads to a considerable increase in performance in relation to the other methods.

Lastly, it is worth noting that the values observed in travel time in the present paper are considerably lower than what was seen in our previous works, when the same grid networks was used ([ 13, 14 ]). This is due to the fact that here travel times are per link, while previously we have shown travel times over entire routes.

5. Conclusion and Future Work

Using RL to support agents making decisions that consider more than one objective poses challenges. We formulate the problem of multiple agents learning to travel from A to B in a trafic network as an RL task in which: (i) agents learning simultaneously result in nonstationarity; (ii) there are potentially two or more objectives to be optimized; (iii) the underlying learning task is modeled as a stochastic game, where states are intermediary locations in the networks (such as intersections), in which decisions are made about which link to take.

To solve this task, we propose an adaptation in an existing algorithm, PQL [19]. This algorithm deals with more than one objective, but it was originally formulated for single-agent learning tasks, in which the environment is deterministic. In route choice neither of these apply. Thus our adaptation addresses them. We have applied the proposed method to a scenario in which hundreds of agents compete for a scarce resource (link usage), and have to learn how to travel from their origins to their destinations, aiming at minimizing their travel times, as well as the CO vehicles emit. Results show that the adapted algorithm is able to find routes that are more eficient then when either only travel time or only emissions are considered as rewards.

Future work is planned in the direction of considering communication between vehicles and infraestructure as in [14], given that this work was the basis of the present one; thus a revisiting of that case (now considering various objectives) is an open direction. In this regard, we also plan to include more objectives, such as other types of emissions.

Acknowledgments

This work is partially supported by FAPESP and MCTI/CGI (grants number 2020/05165-1 and 2021/05093-3). Ana Bazzan is partially supported by CNPq under grant number 304932/2021-3, and by the German Federal Ministry of Education and Research (BMBF), Käte Hamburger Kolleg Cultures des Forschens/ Cultures of Research. [14] G. D. dos. Santos, A. L. C. Bazzan, Sharing diverse information gets driver agents to learn faster: an application in en route trip building, PeerJ Computer Science 7 (2021) e428. URL: https://doi.org/10.7717/peerj-cs.428. doi:10.7717/peerj-cs.428. [15] G. Dytz dos Santos, A. L. C. Bazzan, A. P. Baumgardt, Using car to infrastructure communication to accelerate learning in route choice, Journal of Information and Data Management 12 (2021). URL: https://sol.sbc.org.br/journals/index.php/jidm/article/view/1935. [16] A. L. C. Bazzan, R. Grunitzki, A multiagent reinforcement learning approach to en-route trip building, in: 2016 International Joint Conference on Neural Networks (IJCNN), 2016, pp. 5288–5295. doi:10.1109/IJCNN.2016.7727899. [17] R. Rǎdulescu, P. Mannion, D. Roijers, A. Nowé, Multi-objective multi-agent decision making: a utility-based analysis and survey, Autonomous Agents and Multi-Agent Systems 34 (2020). doi:10.1007/s10458-019-09433-x. [18] C. F. Hayes, R. Radulescu, E. Bargiacchi, J. Källström, M. Macfarlane, M. Reymond, T. Verstraeten, L. M. Zintgraf, R. Dazeley, F. Heintz, E. Howley, A. A. Irissappane, P. Mannion, A. Nowé, G. de Oliveira Ramos, M. Restelli, P. Vamplew, D. M. Roijers, A practical guide to multi-objective reinforcement learning and planning, CoRR abs/2103.09568 (2021). URL: https://arxiv.org/abs/2103.09568. arXiv:2103.09568. [19] K. Van Mofaert, A. Nowé, Multi-objective reinforcement learning using sets of pareto dominating policies, J. Mach. Learn. Res. 15 (2014) 3483–3512.

[1]

Raith ,

J. Y.

Wang ,

Ehrgott ,

S. A.

Mitchell , Solving multi-objective trafic assignment , Annals of Operations Research 222 ( 2014 ) 483 - 516 .

[2]

J. Y.

Wang ,

Ehrgott , Modelling route choice behaviour in a tolled road network with a time surplus maximisation bi-objective user equilibrium model , Transportation Research Part B: Methodological 57 ( 2013 ) 342 - 360 . doi: 10 .1016/j.trb. 2013 . 05 .011.

[3]

J. Y. T.

Wang ,

Raith ,

Ehrgott , Tolling analysis with bi-objective trafic assignment , in: M. Ehrgott , B.

Naujoks , T. J.

Stewart , J. Wallenius (Eds.), Multiple Criteria Decision Making for Sustainable Energy and

Transportation

Systems , 2010 , pp. 117 - 129 .

[4]

C. A.

Huanca-Anquise , Multi-objective reinforcement learning methods for action selection: dealing with multiple objectives and non-stationarity, Master's thesis , Instituto de Informática, UFRGS, Porto

Alegre

, Brazil, 2021 . URL: http://hdl.handle.net/10183/231836.

[5]

Auer ,

Cesa-Bianchi ,

Fischer , Finite-time analysis of the multiarmed bandit problem , Machine Learning 47 ( 2002 ) 235 - 256 . doi: 10 .1023/A: 1013689704352 .

[6]

Watkins , Learning from Delayed Rewards, Ph.D. thesis , University of Cambridge, 1989 .

[7]

J. d. D.

Ortúzar ,

L. G.

Willumsen , Modelling transport, 4 ed., John Wiley & Sons, Chichester, UK, 2011 .

[8]

R. B.

Dial , A model and algorithm for multicriteria route-mode choice , Transportation Research Part B: Methodological 13 ( 1979 ) 311 - 316 . doi:https : / / doi.org / 10.1016 / 0191 - 2615 ( 79 ) 90024 - 9 .

[9]

A. L. C.

Bazzan ,

Klügl , Introduction to Intelligent Systems in Trafic and Transportation , volume 7 of Synthesis Lectures on Artificial Intelligence and Machine Learning , Morgan and Claypool, 2013 . doi: 10 .2200/S00553ED1V01Y201312AIM025.

[10]

G. de. O.

Ramos , B. C. da Silva ,

A. L. C.

Bazzan , Learning to minimise regret in route choice , in: S. Das , E.

Durfee , K.

Larson , M. Winikof (Eds.), Proc. of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2017 ), IFAAMAS, São

Paulo

, 2017 , pp. 846 - 855 . URL: http://ifaamas.org/Proceedings/aamas2017/pdfs/p846.pdf.

[11] T. B. F. de Oliveira , A. L. C. Bazzan , B. C. da Silva , R. Grunitzki , Comparing multi-armed bandit algorithms and Q-learning for multiagent action selection: a case study in route choice , in: 2018 International Joint Conference on Neural Networks , IJCNN , IEEE, Rio de Janeiro, 2018 , pp. 1 - 8 . URL: https://doi.org/10.1109/IJCNN. 2018 . 8489655 .

[12]

Grunitzki ,

A. L. C.

Bazzan , Combining car-to-infrastructure communication and multiagent reinforcement learning in route choice , in: A. L. C. Bazzan , F.

Klügl , S.

Ossowski , G. Vizzari (Eds.), Proceedings of the Ninth Workshop on Agents in Trafic and Transportation (ATT-2016) , volume 1678 of CEUR Workshop Proceedings , CEUR-WS.org, New York, 2016 . URL: http://ceur-ws. org/ Vol- 1678 /paper12.pdf.

[13]

G. D. dos.

Santos ,

A. L. C.

Bazzan , Accelerating learning of route choices with C2I: A preliminary investigation , in: Proc. of the VIII Symposium on Knowledge Discovery, Mining and Learning , SBC, 2020 , pp. 41 - 48 . doi: 10 .5753/kdmile. 2020 . 11957 .