=Paper=
{{Paper
|id=Vol-3173/paper11
|storemode=property
|title=A Multiobjective Reinforcement Learning Approach to Trip Building
|pdfUrl=https://ceur-ws.org/Vol-3173/11.pdf
|volume=Vol-3173
|authors=Guilherme Santos,Ana L. C. Bazzan
|dblpUrl=https://dblp.org/rec/conf/ijcai/SantosB22
}}
==A Multiobjective Reinforcement Learning Approach to Trip Building==
Guilherme Dytz dos Santos¹, Ana L. C. Bazzan¹,∗
¹ Computer Science, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brazil

ATT’22: Workshop Agents in Traffic and Transportation, July 25, 2022, Vienna, Austria
gdsantos@inf.ufrgs.br (G. D. dos Santos); bazzan@inf.ufrgs.br (A. L. C. Bazzan)
http://www.inf.ufrgs.br/~bazzan (A. L. C. Bazzan)
ORCID: 0000-0002-2803-9607 (A. L. C. Bazzan)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Using reinforcement learning to support agents in making decisions that consider more than one objective poses challenges. We formulate the problem of multiple agents learning to travel from A to B in a traffic network as a reinforcement learning task that takes into account: non-stationarity, more than one objective, and a stochastic-game-based model. To solve this task, we propose an adaptation of an existing algorithm (PQL) that deals with more than one objective. We have applied the proposed method to a scenario in which hundreds of agents have to learn how to travel from their origins to their destinations, aiming at minimizing their travel times as well as the carbon monoxide their vehicles emit. Results show that the adapted algorithm is able to find efficient routes for the agents.

Keywords: reinforcement learning (RL), multiobjective RL, routing, route choice

1. Introduction

Decision-making using multiobjective reinforcement learning (RL) is becoming popular in multiagent systems. The reason is that many tasks are more naturally described by means of multiple, possibly conflicting objectives. Although this poses more challenges to RL, especially when more than a few agents interact, such a formulation is useful in real-world problems such as making decisions regarding trips in traffic networks. In this domain, multiple drivers must learn to reach their destinations, starting at their origin locations. Depending on the formulation of the RL task, agents construct their routes by making decisions about which link to follow once they find themselves at given locations.

Usually, for such a learning task, a single objective is considered, namely minimizing travel time. However, there are frequently other objectives to be considered. For instance, there are works that optimize two objectives (toll and travel time), such as [1, 2, 3]. However, these are centralized and do not involve RL. Route choice using travel time and toll by means of RL is addressed by [4]; however, that work deals with agents selecting among k pre-computed routes that take the agents from their origins to their destinations. This means that the RL task involves only one state (stateless RL), namely the origin node, where an agent makes a decision (selects one of the k routes). Then the agent follows the selected route without making further decisions during the trip. For this kind of problem, the work described in [4] extends a bandit algorithm, UCB [5], in order to account for multiple objectives and for multiple learning agents.
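To make the stateless (bandit-style) formulation above concrete, the following is a minimal sketch of a single agent choosing among k pre-computed routes with the standard UCB1 rule of [5]. It is only an illustration of that setting, not the multiobjective, multiagent extension proposed in [4]; all class and variable names are ours.

```python
import math
import random

class UCB1RouteChooser:
    """Stateless route choice: pick one of k pre-computed routes with UCB1.
    Rewards are negated travel times, so higher is better."""

    def __init__(self, k):
        self.k = k
        self.counts = [0] * k     # times each route was chosen
        self.means = [0.0] * k    # running mean reward per route
        self.t = 0                # total number of choices so far

    def choose(self):
        self.t += 1
        # Try every route once before applying the UCB1 bound.
        for r in range(self.k):
            if self.counts[r] == 0:
                return r
        # UCB1: mean reward plus an exploration bonus.
        return max(
            range(self.k),
            key=lambda r: self.means[r] + math.sqrt(2 * math.log(self.t) / self.counts[r]),
        )

    def update(self, route, reward):
        self.counts[route] += 1
        self.means[route] += (reward - self.means[route]) / self.counts[route]


# Usage: one agent choosing among k = 3 routes with noisy travel times.
chooser = UCB1RouteChooser(k=3)
true_times = [10.0, 8.0, 12.0]
for _ in range(200):
    r = chooser.choose()
    travel_time = random.gauss(true_times[r], 1.0)
    chooser.update(r, -travel_time)  # shorter trips yield higher reward
print("preferred route:", chooser.counts.index(max(chooser.counts)))
```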
In contrast to the aforementioned works, in the present paper each agent or driver not only performs its optimization process locally (in a decentralized way), by means of multiobjective RL (MORL), but also builds its trip by making decisions at each intersection (intersections function as states). Hence, the underlying learning task is state-based. Another characteristic of the problem we deal with is that it involves many agents competing for resources. This also poses challenges to RL methods, even if only one objective is considered. Specifically, the fact that there are many agents learning simultaneously causes the environment to be non-stationary. In short, a realistic, decentralized RL-based approach to routing involves the following issues: (i) agents learning simultaneously results in non-stationarity; (ii) there are potentially two or more objectives to be optimized; (iii) the underlying learning task is modeled as a stochastic game, where states are the intersections of the network, at which decisions about which link to take are made.

The remainder of this paper is organized as follows. The next section discusses the background concepts and gives an overview of related work. Section 3 details the proposed method. Its evaluation in a proof-of-concept scenario is discussed in Section 4. Concluding remarks and future directions are given in Section 5.

2. Background and Related Work

In this section, we briefly introduce some relevant concepts on RL, as well as on traffic assignment, route choice, and routing (including multiobjective variants).

2.1. Reinforcement Learning

In RL, an agent learns how to act in an environment by receiving feedback (a reward) that measures how its action has affected the environment. The agent does not know a priori how its actions affect the environment; hence it has to learn this by trial and error, which characterizes an exploration phase. Such a phase may be very noisy in multiagent scenarios, given that all agents are learning simultaneously and hence have to adapt to each other’s exploration processes. However, agents should not only explore; in order to maximize the rewards of their actions, they also have to exploit the gained knowledge. Thus, there must be an exploration-exploitation strategy to be followed by the agents. One of these strategies is 𝜀-greedy, where an action is chosen randomly (exploration) with probability 𝜀 or, with probability 1−𝜀, the best known action is chosen, i.e., the one with the highest value so far (exploitation).

In the exploitation phase, at each interaction, it is assumed that an agent has sensors to determine its current state and can then take an action. The reward received from the environment is used to update its policy, i.e., a mapping from states to actions. This policy can be generated or computed in several ways. For the sake of the present paper, we concentrate on a model-free, off-policy algorithm called Q-learning (QL) [6] and its extensions. QL estimates the so-called Q-values using a table to store the experienced values of performing a given action in a given state. The value of state 𝑠𝑡 and action 𝑎𝑡 at time 𝑡 is updated based on Eq. 1, where 𝛼 ∈ [0, 1] is the learning rate, 𝛾 ∈ [0, 1] is the discount factor, 𝑠𝑡+1 is the next state, and 𝑟𝑡 is the reward received when the agent moves from 𝑠𝑡 to 𝑠𝑡+1 after selecting action 𝑎𝑡 in state 𝑠𝑡.

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big) \quad (1)$$
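A minimal tabular sketch of the QL update (Eq. 1) combined with 𝜀-greedy action selection. The environment interface (reset/step/actions) and all names are illustrative assumptions, not the simulator used later in the paper.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.05):
    """Tabular Q-learning with epsilon-greedy exploration (Eq. 1).
    `env` is assumed to expose reset() -> state,
    step(action) -> (next_state, reward, done), and actions(state)."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated value

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            actions = env.actions(s)
            if random.random() < epsilon:                  # explore
                a = random.choice(actions)
            else:                                          # exploit
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Eq. 1: move Q(s, a) towards r + gamma * max_a' Q(s', a')
            best_next = max((Q[(s_next, x)] for x in env.actions(s_next)), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```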
RL can thus be employed to model how drivers learn to travel from A to B in a transportation network. The next section introduces this learning task.

2.2. How to Travel from A to B?

In this section, we first discuss the differences among the various terms related to this problem. Then, we introduce the concepts related to the definition of an urban network (topology), as well as the demand side. Finally, we discuss the need for multiobjective approaches.

Given a road network and the demand for its use (number of trips per unit of time), the task of finding out how each unit of such demand realizes its travel needs can be solved in different ways, although all seek to compute the so-called user equilibrium, i.e., the condition in which no user is better off by changing its route. We recall that, in the literature, demand can be expressed as trip, traveller, user, agent, or vehicle, depending on the community. In the transportation community this problem is normally known as the traffic assignment problem (TAP), which is solved in a centralized fashion, mostly via a macroscopic approach using a volume-delay function (VDF), which estimates travel time as a function of the volume of trips; see Chapter 10 of [7] for details. In the computer science community, the task is frequently solved by means of RL in a decentralized way, in the sense that agents learn how to reach their destinations by performing experiments (exploration) until they learn their respective ways (which is equivalent to the condition under user equilibrium).

This task admits two variants. In the first, an agent knows a set of precomputed routes (or is able to compute them), and its task is to choose a route among them (again by experimenting with several selections of actions or routes). Once such a choice is made, the agent travels the route without changing its mind during the trip. The second variant drops the assumption that agents are given a set of routes. Instead, agents make choices during the trip (e.g., at each intersection, or at each important decision point), thus building the actual route while traveling. In terms of RL, these two variants have a key difference as well. While the latter is a classical RL task with a set of states (e.g., the decision points) and a set of actions, the former belongs to the stateless RL category, where agents basically need to select an action at the origin node (the sole state). Moreover, that learning task resembles the setting of bandit algorithms such as UCB [5]. Henceforth, we refer to the former as route choice (as the agent simply chooses a route and remains on it), while the latter is referred to here as routing, to convey the idea that the trip is built during the trip itself.

Regarding approaches already proposed to tackle this problem, there are various approaches in the literature to solve the TAP; here we refer only to those that address more than one objective, such as [8, 3, 2, 1]. Others can be found in [9]. Regarding route choice, the reader is referred to [10, 11]; in particular, multiobjective route choice is tackled by [4]. All these works use a macroscopic approach in the sense that VDFs are used to provide a reward for each agent based on the usage of a given route. Routing is less common in the literature, but some works appear in [12, 13, 14, 15].
While the former employ a VDF to compute rewards (thus, a macroscopic approach), the last two use a microscopic simulator, in which vehicles perform the actual driving movements. As such, rewards (e.g., travel times) are given by the simulator itself and correspond to a more realistic reward (as opposed to the estimation provided by a VDF).

This said, we next briefly review the concept of a transportation network and the demand that uses it. Formally, a traffic network can be modeled as a graph 𝐺 = (𝑉, 𝐿), where 𝑉 is the set of vertices or nodes that represent the intersections, and 𝐿 is the set of links that describe the segments connecting a pair of vertices. In the network, trips are distributed among a set of origin-destination (OD) pairs, each corresponding to a certain demand for trips. This demand then translates into flows on the various links. In all methods it is assumed that agents (drivers) are rational; thus each selects the route or link with the least perceived individual cost in order to travel from its origin to its destination. Several factors may influence this decision, such as travel time, distance, monetary cost (e.g., toll), fuel consumption, battery usage, emission of pollutants, etc.

Traditionally, multiobjective traffic assignment is modeled by using a linear combination of the various objectives, as proposed, for instance, by Dial in [8] for a bi-objective assignment. Such a linear combination has some drawbacks: some efficient routes are missed, as discussed in [3]. The key point is that algorithms that use a linear combination only identify supported solutions at extreme points, while there may be other efficient solutions that are not in that group but could be preferred by some users. Hence, it is necessary to have alternative solutions. One issue that arises is that, in a multiobjective scenario, there is a set of efficient solutions instead of just one. Different solutions have been proposed in the transportation and optimization communities; see for instance [8, 3, 2, 1]. We remark that all these methods rely on a set of (pre-computed) 𝑘 shortest paths and that they are centralized. Thus, they are not useful in a multiagent environment where each agent should learn from its own experience using RL, nor do they fit a state-based RL task, where agents learn at each intersection.

When employing RL for the problem of how agents should travel from A to B, there are two possible approaches. The first one relates closely to the TAP in the sense that 𝑘 shortest routes are computed for agents in each OD pair, although it is not centralized, as is the case with the aforementioned methods. Examples of this approach are [10, 11], among others. The second variant relaxes the assumption that each agent selects one of the 𝑘 shortest routes (and then travels on it, without deviating from the selected route). Rather, agents make choices en route, i.e., they decide which link to follow once they find themselves at each vertex. This is the approach followed in [16, 14] and also in the present paper.

2.3. Multiobjective Approach to Q-learning (PQL)

In this section we discuss how to use QL when agents need to deal with more than one objective. Besides the discussion ahead, we refer readers to other approaches; see [17, 18]. In order to extend the aforementioned QL algorithm, [19] proposed Pareto Q-learning (PQL), which integrates the Pareto dominance relation into a MORL approach.
PQL considers both the immediate reward vector and the set of expected future discounted reward vectors in order to compute the so-called Q-sets, which are composed of vectors. In PQL, the set of expected future discounted reward vectors relies on a function called $ND$ (from non-dominated), which finds those vectors that correspond to the possible future states and are not Pareto-dominated. Eq. 2 shows how Q-sets are calculated. $\bar{R}(s, a)$ denotes the (average) immediate reward vector and $ND_t(s, a)$ is the set of non-dominated vectors of the next state reached by performing action $a$ in $s$ at time step $t$. $\bar{R}(s, a)$ is added to each element of $\gamma \, ND_t(s, a)$. When $a$ is selected at $s$, both terms are updated. $\bar{R}(s, a)$ is updated according to Eq. 3, where $\vec{R}$ is the new reward vector and $N(s, a)$ is the number of times action $a$ was selected in $s$. $ND_t(s, a)$ is updated as shown in Eq. 4, using the non-dominated vectors in the $\tilde{Q}_{set}$ of every action $a'$ in the next state $s'$.

$$\tilde{Q}_{set}(s, a) = \bar{R}(s, a) \oplus \gamma \, ND_t(s, a) \quad (2)$$

$$\bar{R}(s, a) \leftarrow \bar{R}(s, a) + \frac{\vec{R} - \bar{R}(s, a)}{N(s, a)} \quad (3)$$

$$ND_t(s, a) = ND\Big(\bigcup_{a'} \tilde{Q}_{set}(s', a')\Big) \quad (4)$$

PQL learns the entire Pareto front, finding multiple Pareto-optimal solutions, provided that each state-action pair is sufficiently sampled. The algorithm is not biased by the shape of the Pareto front (algorithms that find a single policy and use scalarization cannot sample the entire Pareto front if it is non-convex) nor by a weight vector (which would guide exploration to specific parts of the search space). Note that PQL assumes that the environment is deterministic; hence it is not directly useful for environments in which the presence of multiple agents learning simultaneously causes non-stationarity, as for instance the route choice domain we use ahead. Our contribution (called aPQL, detailed ahead) is an adaptation of PQL to deal with multiple objectives and with multiple agents (thus, a non-stationary environment).

3. Methodology

3.1. Markov Decision Process Definition

Before presenting the details of the proposed method, some considerations regarding the Markov decision process (MDP) underlying the learning task should be addressed. In an MDP, we define a set of states 𝑆, a set of actions 𝐴, a reward function 𝑅 ∶ 𝑆 × 𝐴 → ℝ, and a probabilistic state transition function 𝑇(𝑠, 𝑎, 𝑠′) → [0, 1], with 𝑠 ∈ 𝑆 being the state the agent is currently in, 𝑎 ∈ 𝐴 the action the agent takes, and 𝑠′ ∈ 𝑆 the state it might end up in after taking action 𝑎 in state 𝑠.

In the routing learning task, each intersection within a traffic network is a state 𝑠, and the actions are the links an agent might take when in 𝑠 (i.e., links that leave the intersection). In short, agents build their routes by making decisions about which link to follow when finding themselves at a given intersection. Regarding the environment, the transition model is implemented by the microscopic traffic simulator itself, which moves vehicles along the network. As for the rewards, since we deal with a multiobjective scenario, the rewards returned by the simulator are assembled in a vector; this way, the reward function definition is 𝑅 ∶ 𝑆 × 𝐴 → ℝⁿ, where 𝑛 is the number of objectives to be optimized. As detailed in Section 3.3, in the present paper we deal with agents aiming at optimizing two objectives: travel time and carbon monoxide emission. However, the proposed method is general and can be used if more objectives are taken into account.
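To make the $ND$ operator of Eq. 4 and the vector-valued rewards of Section 3.1 concrete, here is a minimal sketch of Pareto-dominance filtering over reward vectors such as (negated travel time, negated CO emission), together with the ⊕ operation of Eq. 2. It is an illustration under our own naming, not the authors' implementation.

```python
def dominates(u, v):
    """True if reward vector u Pareto-dominates v (higher is better in every
    component, e.g. negated travel time and negated CO emission)."""
    return all(ui >= vi for ui, vi in zip(u, v)) and any(ui > vi for ui, vi in zip(u, v))

def nd(vectors):
    """ND operator of Eq. 4: keep only the non-dominated reward vectors."""
    vectors = list(vectors)
    return [v for v in vectors
            if not any(dominates(u, v) for u in vectors if u is not v)]

def q_set(avg_reward, nd_next, gamma=0.9):
    """Eq. 2: add the average immediate reward vector to each discounted
    non-dominated vector of the next state (the element-wise 'oplus')."""
    if not nd_next:                      # no future estimate yet
        return [tuple(avg_reward)]
    return [tuple(r + gamma * f for r, f in zip(avg_reward, v)) for v in nd_next]

# Example: three candidate future vectors, one of which is dominated.
future = [(-10.0, -2.0), (-12.0, -3.0), (-9.0, -4.0)]
print(nd(future))                        # (-12.0, -3.0) is filtered out
print(q_set((-1.0, -0.1), nd(future)))
```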
3.2. Time Steps and Episodes

In a traditional RL environment, at each time step, each agent finds itself in a state 𝑠, chooses an action 𝑎, receives a reward, and transitions to a state 𝑠′. In the learning task at hand, time steps are controlled by the microscopic simulator and correspond to a unit of time such as one second. This has important consequences. First, not all time steps count as decision-making steps; these only happen when an agent is at an intersection. The remaining time steps are used (e.g., in the plots ahead) just as clock units. Second, different agents either find themselves in decision-making states, or are moving along a link (thus not updating their value functions), or have reached their destinations. In the latter case, an episode is finished for that particular agent. Since travel times and destinations differ for most of the agents, episodes are not synchronous. Therefore, in general, we do not refer to episodes; rather, we use the clock units or time steps of the simulator to depict time. To control the simulation time, we introduce a parameter called the maximum number of steps, 𝑀.

3.3. Reward Calculation

In this section we detail how we deal with multiple objectives in the reward function. In the exploration phase, due to uncoordinated action selection by the agents, congestion may occur, which impacts the reward of the agents while also inserting a lot of noise in the learning process, which in turn leads to slow convergence. In order to speed up this process, and to avoid agents wandering around the network while experimenting, two hyper-parameters were introduced. The first one is a bonus 𝑏 that is added to the reward when an agent reaches its destination. The second is a penalty 𝑝𝑑 that is subtracted from the reward when an agent is at a dead-end node, i.e., a node from which it cannot reach its destination at all. Given that the reward accounts for more than one objective (and thus is represented by a vector of size 𝑛, where 𝑛 is the number of objectives to optimize), the bonus 𝑏 and the penalty 𝑝𝑑 are applied to every component of the reward vector (added in the case of 𝑏, subtracted in the case of 𝑝𝑑), as stated in Eq. 5 (for the bonus) and Eq. 6 (for the penalty).

$$\vec{r} = \vec{r} + b \cdot \vec{1} \quad (5)$$

$$\vec{r} = \vec{r} - p_d \cdot \vec{1} \quad (6)$$

A further hyper-parameter of our model resembles a kind of toll on emissions of pollutants. It is designed based on the assumption that certain links of the network should be spared from high emission levels; this may be the case of links close to parks, hospitals, residential areas, etc. This parameter is the penalty 𝑝𝑒; it acts similarly to the aforementioned penalty 𝑝𝑑, with the difference that 𝑝𝑒 is subtracted only from the element of the reward vector that refers to the emission. Let $\hat{e}_i$ be the 𝑖-th basis vector, where 𝑖 is the position of the emission element in the reward vector. Then the penalty 𝑝𝑒 is used to update the reward as in Eq. 7. Algorithm 1 summarizes how the reward is calculated.

$$\vec{r} = \vec{r} - p_e \cdot \hat{e}_i \quad (7)$$

Algorithm 1: Compute Reward
Data: 𝑏, 𝑝𝑑, 𝑝𝑒, agent
Result: 𝑟⃗
1  𝑟⃗ ← −[agent.last_edge_travel_time, agent.last_edge_emission];
2  𝑟⃗ ← 𝑟⃗ + 𝑏 ⋅ 1⃗;
3  𝑟⃗ ← 𝑟⃗ − 𝑝𝑑 ⋅ 1⃗;
4  𝑟⃗ ← 𝑟⃗ − 𝑝𝑒 ⋅ ê𝑖;
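A short Python transcription of Algorithm 1 (Eqs. 5 to 7). The agent object, its last-edge attributes, and the emission index are assumed names used only for illustration; the caller passes 0 for whichever bonus or penalty does not apply, as in Algorithm 2 below.

```python
import numpy as np

EMISSION_IDX = 1  # position of the CO component in the reward vector (assumed)

def compute_reward(b, p_d, p_e, agent):
    """Algorithm 1: vector reward from the last traversed edge.
    Both components are costs, hence the leading minus sign."""
    r = -np.array([agent.last_edge_travel_time, agent.last_edge_emission])
    r += b                       # Eq. 5: bonus on every component (b * 1)
    r -= p_d                     # Eq. 6: dead-end penalty on every component
    r[EMISSION_IDX] -= p_e       # Eq. 7: emission toll on the CO component only
    return r
```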
3.4. Action Selection: Adapted Pareto Q-Learning (aPQL)

Originally, in PQL (i.e., as proposed in [19]), every state-action pair (𝑠, 𝑎) is associated with a set of non-dominated points that represents the Pareto front. Q-sets are computed as in Eq. 2 and are used to compute a hypervolume of the Pareto front, which is then used to determine the best action.

In our case, this was not effective, given that the Q-sets tend to contain only a few points; thus computations based on the hypervolume tend to be ineffective. Therefore, we had to modify the action selection strategy. Instead of using the hypervolume of the Pareto front to determine the best action, we let each agent randomly select (with uniform probability, at each decision point, and independently of one another) which objective is to be optimized at that given step. This seems to be a fair substitute for the hypervolume and avoids biasing towards one objective. Once a particular objective is randomly drawn, the agent uses the 𝜀-greedy strategy to select an action, given the Q-set. Note that the approach continues to be multiobjective, as all objectives have their values updated after a given action is selected. Given this adaptation in the action selection part of the approach proposed in [19], we henceforth refer to ours as aPQL (adapted PQL). The remainder of aPQL follows basically the same procedures underlying PQL.

3.5. The Whole Picture

Algorithm 2 shows how our method works for each agent. The required parameters are: 𝑀 (maximum number of time steps); 𝑏, 𝑝𝑑 and 𝑝𝑒 (the bonus, the destination penalty, and the emission penalty discussed in Section 3.3); the discount factor 𝛾; and 𝜀 (for the 𝜀-greedy exploration strategy).

Algorithm 2: aPQL Learning Procedure for Each Agent
Data: 𝑀, 𝑏, 𝑝𝑑, 𝑝𝑒, 𝛾, 𝜀
1   step ← 0;
2   agent.start();
3   while step < 𝑀 do
4     if agent has reached a node then
5       if agent crossed an emission-toll link then
6         𝑝𝑒′ ← 𝑝𝑒;
7       else
8         𝑝𝑒′ ← 0;
9       end
10      if agent reached an ending node then
11        case reached its destination do
12          𝑟⃗ ← compute_reward(𝑏, 0, 𝑝𝑒′, agent);
13        end
14        case reached a dead-end do
15          𝑟⃗ ← compute_reward(0, 𝑝𝑑, 𝑝𝑒′, agent);
16        end
17        agent.update_non_dominated_and_avg_rewards(𝑟⃗, 𝛾);
18        agent.restart();
19      else
20        𝑟⃗ ← compute_reward(0, 0, 𝑝𝑒′, agent);
21        Q_set ← compute_Q_set(agent.current_state());
22        agent.update_non_dominated_and_avg_rewards(𝑟⃗, 𝛾);
23        if random() > 𝜀 then
24          chosen_obj ← agent.random_objective();
25          action ← arg max_a Q_set[a][chosen_obj];
26        else
27          action ← agent.random_action();
28        end
29      end
30    end
31    step ← step + 1;
32  end

The procedure is as follows. The agent and the counter that accounts for the simulation time step are both initialized. The simulation then runs until this counter reaches the value 𝑀. The if statement in lines 4-30 checks whether the agent is at a node/intersection. If it is not, nothing happens (the agent is traveling along a link); otherwise, depending on whether or not the agent has crossed a link on which an emission penalty should be imposed (lines 5-9), the variable 𝑝𝑒′ is set. Next, the agent needs to know if it is at an ending node (if statement in lines 10-18), i.e., at a destination or at a dead-end node. Recall that a dead-end node is one from which it is not possible to reach a destination at all; this information is provided by SUMO. In both cases, Algorithm 1 is used to compute the reward, as stated in Section 3.3: in the case of a destination node (line 12), a non-zero bonus is given; in the case of a dead-end (line 15), the reward includes a non-zero penalty. The agent can then update its non-dominated policies for the last action using the Q-sets of the possible next states, as well as the average rewards (line 17). After these updates, the agent (re)starts a new trip (an episode for this particular agent), i.e., it is reinserted at its origin node. The else statement in lines 19-29 deals with an agent that reached a node that is neither its destination nor a dead-end. In this case, neither a bonus nor a penalty applies to the reward computation in line 20; thus both 𝑏 and 𝑝𝑑 are set to zero. The methods in lines 17 and 22, as well as the Q-set computation in line 21, are performed as proposed in the original PQL algorithm.
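A sketch of the modified action-selection step (lines 23-27 of Algorithm 2): one objective is drawn uniformly at random and 𝜀-greedy is applied to that objective. Summarizing each action by the best value of the chosen component among its non-dominated vectors is an assumption of this sketch, as are the function and variable names.

```python
import random

def select_action(q_set, n_objectives, epsilon=0.05):
    """aPQL action selection (our reading of lines 23-27 of Algorithm 2).
    `q_set` maps each candidate action (outgoing link) to its set of
    non-dominated reward vectors."""
    actions = list(q_set.keys())
    if random.random() > epsilon:
        obj = random.randrange(n_objectives)     # objective drawn for this step
        return max(actions,
                   key=lambda a: max(v[obj] for v in q_set[a]))
    return random.choice(actions)                # exploration

# Usage with two objectives (negated travel time, negated CO emission).
q_set = {"link_A": [(-9.0, -2.5)], "link_B": [(-11.0, -1.5), (-10.5, -1.8)]}
print(select_action(q_set, n_objectives=2))
```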
4. Experiments and Results

4.1. Network and Demand

To test the approach, we used the 5 × 5 grid network depicted in Figure 1. The demand corresponding to the OD (origin-destination) matrix is shown in Table 1. For each OD pair, the number of trips refers to both directions. For example, for the first line of the table, half of the vehicles travel from Bottom0 to Top4 and half from Top4 to Bottom0. Each link shown in Figure 1 is two-way. The thicker links (in blue, connecting nodes A2 to E2) are the ones on which we want less emission. Thus, each agent using those links receives a penalty 𝑝𝑒 (as discussed in Section 3.3) once it travels through them.

Figure 1: 5x5 grid network (as in [14]).

Table 1: Demand (number of trips) per OD pair
Bottom0|Top4 / Top4|Bottom0: 102
Bottom1|Top3 / Top3|Bottom1: 86
Bottom3|Top1 / Top1|Bottom3: 86
Bottom4|Top0 / Top0|Bottom4: 102
Left0|Right4 / Right4|Left0: 102
Left1|Right3 / Right3|Left1: 86
Left3|Right1 / Right1|Left3: 86
Left4|Right0 / Right0|Left4: 102

4.2. Methods and Parameters

The proposed method was compared against three others: dynamic user assignment (DUA), an iterative method for computing the user equilibrium provided by SUMO; standard QL optimizing only the agents’ travel time (QLTT); and standard QL optimizing only the CO emission (QLCO).

All parameters discussed in Section 3 were set as follows: maximum simulation steps 𝑀 = 50,000; since we use normalized reward values (in the range [−1, 0]), all values regarding the reward computation were set to 𝑏 = 1, 𝑝𝑑 = 1 and 𝑝𝑒 = 1, guaranteeing that the reward values received always lie within [−1, 1]. For all learning methods, the discount factor 𝛾 = 0.9 and a fixed 𝜀 = 0.05 were used. Lastly, for the methods based on standard QL (QLTT and QLCO), a learning rate 𝛼 = 0.5 was used. We remark that other values for these parameters were tried, without significant improvement. It is also worth mentioning that, due to the stochastic nature of all methods used for comparison, each of them was repeated 30 times. In the plots ahead, we show the mean values (and their standard deviations, as shadows).

4.3. Results and Analysis

The comparison among the aforementioned methods was performed using three metrics: travel time (Figure 2a), CO emission (Figure 2b), and speed (Figure 2c). These values were collected from each link in the network. The plots in Figure 2 show a moving average (over a window of 600 steps) of the corresponding value, over all links. In order to investigate performance on a part of the network only (the links indicated in blue in Figure 1), Figure 3 also depicts the three metrics; however, the values there are averaged only over those particular links (not over the entire network, as in Figure 2).
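The aggregation used in the plots can be reproduced with a short sketch: a 600-step moving average per run, then the mean and standard deviation over the 30 repetitions. Array shapes and names are illustrative, not the authors' plotting code.

```python
import numpy as np

def moving_average(series, window=600):
    """Moving average of a per-step metric over a sliding window."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")

def aggregate_runs(runs, window=600):
    """`runs` has shape (n_runs, n_steps) and holds one metric (e.g. travel
    time per step, averaged over links) for each repetition. Returns the
    mean curve and the standard deviation (plotted as a shadow)."""
    smoothed = np.array([moving_average(r, window) for r in runs])
    return smoothed.mean(axis=0), smoothed.std(axis=0)

# Example with synthetic data: 30 repetitions of a 50,000-step simulation.
rng = np.random.default_rng(0)
runs = rng.normal(loc=100.0, scale=5.0, size=(30, 50_000))
mean_curve, std_curve = aggregate_runs(runs)
```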
Results plotted in Figure 2 show a clear advantage of the aPQL method: it outperforms the others, showing lower travel time, lower CO emission, and higher speed (already after approximately 20,000 steps). These results seem to indicate that the aPQL method makes better use of the links in the network. On a related note, one might assume that higher speeds would lead to higher CO emission. However, a more decisive factor is acceleration/deceleration: if the speed is constant, higher speed does not seem to affect emissions in a significantly negative way. One also observes a better distribution of vehicles in the network, which may be due to the fact that aPQL takes advantage of optimizing both travel time and CO emission.

Figure 2: Full network: travel time, CO emission and speed.

The penalty imposed on the links A2 to E2 in Figure 1 causes the two methods that optimize the objectives independently to behave in different ways. Optimizing travel time alone (QLTT) leads agents to the solution seen in Figure 3(b), where no considerable reduction in CO emission is observed in the red curve that represents QLTT. Conversely, optimizing only the CO emission causes the agents not to take the links on which they are penalized. In fact, since they completely stop selecting such links, there are no data points for travel time and speed in Figure 3 for QLCO. One also notices that the average network travel time of QLCO (green curve in Figure 2(a)) is slightly higher than that of QLTT. This happens because, by avoiding several links within the network, the agents in QLCO have fewer available actions and thus experience an increase in travel time.

By employing an approach that considers both objectives at the same time, aPQL makes the agents reach a compromise that seems to yield a better distribution of vehicles throughout the network: they mostly (but not entirely) avoid the links on which the CO emission is penalized (observed in the orange curves in Figure 3), using other parts of the network and taking advantage of the fact that those links are less occupied. This leads to a considerable increase in performance in relation to the other methods.

Figure 3: Horizontal links: travel time, CO emission and speed.

Lastly, it is worth noting that the travel time values observed in the present paper are considerably lower than those seen in our previous works using the same grid network ([13, 14]). This is because here travel times are reported per link, while previously we reported travel times over entire routes.

5. Conclusion and Future Work

Using RL to support agents in making decisions that consider more than one objective poses challenges. We formulate the problem of multiple agents learning to travel from A to B in a traffic network as an RL task in which: (i) agents learning simultaneously results in non-stationarity; (ii) there are potentially two or more objectives to be optimized; (iii) the underlying learning task is modeled as a stochastic game, where states are intermediary locations in the network (such as intersections) at which decisions are made about which link to take. To solve this task, we propose an adaptation of an existing algorithm, PQL [19]. This algorithm deals with more than one objective, but it was originally formulated for single-agent learning tasks in which the environment is deterministic. In our route choice setting, neither of these applies.
Thus, our adaptation addresses them. We have applied the proposed method to a scenario in which hundreds of agents compete for a scarce resource (link usage) and have to learn how to travel from their origins to their destinations, aiming at minimizing their travel times as well as the CO their vehicles emit. Results show that the adapted algorithm is able to find routes that are more efficient than when either only travel time or only emissions are considered as rewards.

Future work is planned in the direction of considering communication between vehicles and infrastructure as in [14], given that that work was the basis of the present one; revisiting that case (now considering various objectives) is thus an open direction. In this regard, we also plan to include more objectives, such as other types of emissions.

Acknowledgments

This work is partially supported by FAPESP and MCTI/CGI (grants 2020/05165-1 and 2021/05093-3). Ana Bazzan is partially supported by CNPq under grant 304932/2021-3, and by the German Federal Ministry of Education and Research (BMBF), Käte Hamburger Kolleg Kulturen des Forschens / Cultures of Research.

References

[1] A. Raith, J. Y. Wang, M. Ehrgott, S. A. Mitchell, Solving multi-objective traffic assignment, Annals of Operations Research 222 (2014) 483–516.
[2] J. Y. Wang, M. Ehrgott, Modelling route choice behaviour in a tolled road network with a time surplus maximisation bi-objective user equilibrium model, Transportation Research Part B: Methodological 57 (2013) 342–360. doi:10.1016/j.trb.2013.05.011.
[3] J. Y. T. Wang, A. Raith, M. Ehrgott, Tolling analysis with bi-objective traffic assignment, in: M. Ehrgott, B. Naujoks, T. J. Stewart, J. Wallenius (Eds.), Multiple Criteria Decision Making for Sustainable Energy and Transportation Systems, 2010, pp. 117–129.
[4] C. A. Huanca-Anquise, Multi-objective reinforcement learning methods for action selection: dealing with multiple objectives and non-stationarity, Master's thesis, Instituto de Informática, UFRGS, Porto Alegre, Brazil, 2021. URL: http://hdl.handle.net/10183/231836.
[5] P. Auer, N. Cesa-Bianchi, P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning 47 (2002) 235–256. doi:10.1023/A:1013689704352.
[6] C. Watkins, Learning from Delayed Rewards, Ph.D. thesis, University of Cambridge, 1989.
[7] J. de D. Ortúzar, L. G. Willumsen, Modelling Transport, 4th ed., John Wiley & Sons, Chichester, UK, 2011.
[8] R. B. Dial, A model and algorithm for multicriteria route-mode choice, Transportation Research Part B: Methodological 13 (1979) 311–316. doi:10.1016/0191-2615(79)90024-9.
[9] A. L. C. Bazzan, F. Klügl, Introduction to Intelligent Systems in Traffic and Transportation, volume 7 of Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan and Claypool, 2013. doi:10.2200/S00553ED1V01Y201312AIM025.
[10] G. de O. Ramos, B. C. da Silva, A. L. C. Bazzan, Learning to minimise regret in route choice, in: S. Das, E. Durfee, K. Larson, M. Winikoff (Eds.), Proc. of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2017), IFAAMAS, São Paulo, 2017, pp. 846–855. URL: http://ifaamas.org/Proceedings/aamas2017/pdfs/p846.pdf.
[11] T. B. F. de Oliveira, A. L. C. Bazzan, B. C. da Silva, R. Grunitzki, Comparing multi-armed bandit algorithms and Q-learning for multiagent action selection: a case study in route choice, in: 2018 International Joint Conference on Neural Networks (IJCNN), IEEE, Rio de Janeiro, 2018, pp. 1–8. URL: https://doi.org/10.1109/IJCNN.2018.8489655.
[12] R. Grunitzki, A. L. C. Bazzan, Combining car-to-infrastructure communication and multi-agent reinforcement learning in route choice, in: A. L. C. Bazzan, F. Klügl, S. Ossowski, G. Vizzari (Eds.), Proceedings of the Ninth Workshop on Agents in Traffic and Transportation (ATT-2016), volume 1678 of CEUR Workshop Proceedings, CEUR-WS.org, New York, 2016. URL: http://ceur-ws.org/Vol-1678/paper12.pdf.
[13] G. D. dos Santos, A. L. C. Bazzan, Accelerating learning of route choices with C2I: A preliminary investigation, in: Proc. of the VIII Symposium on Knowledge Discovery, Mining and Learning, SBC, 2020, pp. 41–48. doi:10.5753/kdmile.2020.11957.
[14] G. D. dos Santos, A. L. C. Bazzan, Sharing diverse information gets driver agents to learn faster: an application in en route trip building, PeerJ Computer Science 7 (2021) e428. doi:10.7717/peerj-cs.428.
[15] G. Dytz dos Santos, A. L. C. Bazzan, A. P. Baumgardt, Using car to infrastructure communication to accelerate learning in route choice, Journal of Information and Data Management 12 (2021). URL: https://sol.sbc.org.br/journals/index.php/jidm/article/view/1935.
[16] A. L. C. Bazzan, R. Grunitzki, A multiagent reinforcement learning approach to en-route trip building, in: 2016 International Joint Conference on Neural Networks (IJCNN), 2016, pp. 5288–5295. doi:10.1109/IJCNN.2016.7727899.
[17] R. Rădulescu, P. Mannion, D. Roijers, A. Nowé, Multi-objective multi-agent decision making: a utility-based analysis and survey, Autonomous Agents and Multi-Agent Systems 34 (2020). doi:10.1007/s10458-019-09433-x.
[18] C. F. Hayes, R. Rădulescu, E. Bargiacchi, J. Källström, M. Macfarlane, M. Reymond, T. Verstraeten, L. M. Zintgraf, R. Dazeley, F. Heintz, E. Howley, A. A. Irissappane, P. Mannion, A. Nowé, G. de Oliveira Ramos, M. Restelli, P. Vamplew, D. M. Roijers, A practical guide to multi-objective reinforcement learning and planning, CoRR abs/2103.09568 (2021). URL: https://arxiv.org/abs/2103.09568.
[19] K. Van Moffaert, A. Nowé, Multi-objective reinforcement learning using sets of Pareto dominating policies, J. Mach. Learn. Res. 15 (2014) 3483–3512.