<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop Agents in Trafic and Transportation, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>to Trip Building</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guilherme Dytz dos Santos</string-name>
          <email>gdsantos@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ana L. C . Bazzan</string-name>
          <email>bazzan@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science, Universidade Federal do Rio Grande do Sul (UFRGS)</institution>
          ,
          <addr-line>Porto Alegre, RS</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>25</volume>
      <issue>2022</issue>
      <abstract>
        <p>Using reinforcement learning to support agents in making decisions that consider more than one objective poses challenges. We formulate the problem of multiple agents learn to travel from A to B in a trafic network as a reinforcement learning task in which we take into account: non-stationarity, more than one objective, and a a stochastic game based model. To solve this task, we propose an adaptation of an existing algorithm (PQL), which deals with more than one objective. We have applied the proposed method to a scenario in which hundreds of agents have to learn how to travel from their origins to their destinations, aiming at minimizing their travel times, as well as the carbon monoxide vehicles emit. Results show that the adapted algorithm is able to find eficient routes for the agents.</p>
      </abstract>
      <kwd-group>
        <kwd>reinforcement learning (RL)</kwd>
        <kwd>multiobjective RL</kwd>
        <kwd>routing</kwd>
        <kwd>route choice</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        http://www.inf.ufrgs.br/~bazzan (A. L. C. . Bazzan)
kind of problem, the work described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] extends a Bandit algorithm like UCB [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in order to
account for multiple objectives and for multiple learning agents.
      </p>
      <p>In contrast to the aforementioned works, in the present paper, each agent or driver not only
performs its optimization process locally (in a decentralized way), by means of multi-objective
RL (MORL), but also it builds its trip by making decisions at each intersection (that function as
states). Hence, the underlying learning task is state-based. Other characteristics of the problem
we deal with is that it involves many agents competing for resources. This also poses challenges
to RL methods, even if only one objective is considered. Specifically, the fact that there are
many agents learning simultaneously causes the environment to be non-stationary.</p>
      <p>In short, a realistic, decentralized RL-based approach to routing involves the following issues:
(i) agents learning simultaneously result in non-stationarity; (ii) there are potentially two or
more objectives to be optimized; (iii) the underlying learning task is modeled as a stochastic
game, where states are the intersections of the networks, where decisions about which link to
take are made.</p>
      <p>The remainder of this paper is organized as follows. The next section discusses the background
concepts and gives an overview on the related work. Section 3 details the proposed method. Its
evaluation in a proof-of-concept scenario is discussed in Section 4. Concluding remarks and
future directions are given in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>In this section, we briefly introduce some relevant concepts on RL, as well as on trafic
assignment, route choice, and routing (including multi-objective variants).</p>
      <sec id="sec-2-1">
        <title>2.1. Reinforcement Learning</title>
        <p>In RL, an agent learns how to act in an environment, by receiving a feedback (reward) that
measures how its action has afected the environment. The agent does not a priori know how its
actions afect the environment, hence it has to learn this by trial and error, which characterizes
an exploration phase. Such phase may be very noisy in multiagent scenarios, given that all
agents are learning simultaneously and hence have to adapt to others’ exploration processes.
However, agents should not only explore; in order to maximize the rewards of their actions,
they also have to exploit the gained knowledge. Thus, there must be an exploration-exploitation
strategy that is to be followed by the agents. One of these strategies is  -greedy, where an action
is randomly chosen (exploration) with a probability  , or , with probability 1- , the best known
action is chosen, i.e., the one with the highest value so far (exploitation).</p>
        <p>
          In the exploitation phase, at each interaction, it is assumed that an agent has sensors to
determine its current state and can then make an action. The reward perceived or received
from the environment is used to update its policy, i.e., a mapping from states to actions. This
policy can be generated or computed in several ways. For the sake of the present paper, we
concentrate on a model-free, of-policy algorithm called Q-learning or QL [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and its extensions.
QL estimates the so-called Q-values using a table to store the experienced values of performing
a given action when in a given state. The value of a state   and action   at time  is updated
based on Eq. 1, where  ∈ [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ] is the learning rate,  ∈ [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ] is the discount factor,  +1 is the
next state and   is the reward received when the agent moves from   to  +1 after selecting
action   in state   .
        </p>
        <p>(  ,   ) ← (  ,   ) + (  +  max(( +1 , )) − (
 ,   ))

(1)</p>
        <p>RL can be employed to compute how drivers learn how to go from A to B in a transportation
network. The next section thus introduces this learning task.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. How to Travel from A to B?</title>
        <p>In this section, we first discuss the diferences among the various terms that are related to
this problem. Then, we introduce the concepts related to the definition of an urban network
(topology), as well as the demand side. Finally, we discuss the need for multiobjective approaches.</p>
        <p>Given a road network and the demand for its use (number of trips per unit of time), the task
of finding out how each unit of such demand realizes its travel needs can be solved in diferent
ways, although all seek to compute the so-called user equilibrium, i.e., the condition in which
no user is better of by changing its route. We recall that, in the literature, demand can be
expressed as trip, traveller, user, agent, or vehicle, depending on the community.</p>
        <p>
          In the transportation community this problem is normally known as trafic assignment
problem (TAP), which is solved in a centralized fashion, mostly via a macroscopic approach
using a volume-delay function (VDF), which estimates the travel time as a function of the
volume of trips; see Chapter 10 of [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for details. In the computer science community, that
task is frequently solved by means of RL in a decentralized way, in the sense that agents learn
how to reach their destinations, by performing experiments (exploration) until they learns their
respective ways (which is equivalent to the condition under the user equilibrium). This tasks
admits two variants: in the first, an agent knows a set of precomputed routes (or is able to
compute them), and its task is to choose a route among them (again by means of experimenting
with several selections of actions or routes). Once such choice is done, an agent travels the route,
without changing its mind during the trip. The second variant departs from the assumption
that agents are given a set of routes. Instead, agents make choices during the trip (e.g., at
each intersection, or at each important point of decision), thus building the actual route while
traveling. In terms of RL, these two variants have a key diference as well. While the latter is a
classical RL task with a set of states (e.g., each decision points), and a set of actions, the former
belongs to the stateless RL front, where agents basically need to select an action at the origin
node (the sole state). Moreover, this learning task resembles Bandit algorithms such as UCB [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
Henceforth, we refer to the former as route choice (as the agent simply chooses a route and
remain on it), while the latter is referred here as routing to convey the idea that the trip is build
during the actual routing task.
        </p>
        <p>
          Regarding approaches that were already proposed to tackle this problem, in the literature,
there are various approaches to solve the TAP; here we refer only to those that address more
than one objective such as [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref8">8, 3, 2, 1</xref>
          ]. Others can be found in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Regarding route choice, the
reader is referred to [
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ]; in particular, multiobjective route choice is tackled by [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. All
these works use a macroscopic approach in the sense that VDF’s are used to provide a reward
for each agent based on the usage of a given route. Routing is less common in the literature but
some works appear in [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13, 14, 15</xref>
          ]. While the former employs a VDF to compute rewards
(thus, a macroscopic approach), the last two use a microscopic simulator, where vehicles realize
the actual driving movements. As such, rewards (e.g., travel times) are given by the simulator
itself and correspond to a more realistic reward (as opposed to the estimation provide by a VDF).
        </p>
        <p>This said, next we briefly review the concept of transportation network and the demand that
uses it. Formally, a trafic network can be modeled as a graph  = ( , ) , where  is the set
of vertices or nodes that represent the intersections, and  , the set of links that describe the
segments connecting a pair of vertices. In the network, trips are distributed among a set of
origin-destination (OD) pairs; each corresponding to a certain demand for trips. These then
translates into flows on the various links.</p>
        <p>In all methods it is assumed that agents (drivers) are rational, thus each selects the route or
link with the least perceived individual cost, in order to travel from its origin to its destination.
Several factors may influence this decision, such as travel time, distance, monetary cost (e.g.,
toll), fuel consumption, battery, emission of pollutants, etc.</p>
        <p>
          Traditionally, multi-objective trafic assignment is modeled by using a linear combination of
the various objectives, as proposed, for instance, by Dial in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] for a bi-objective assignment.
Such a linear combination has some drawbacks. Some eficient routes are missed, as discussed
in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The key point is that algorithms that use a linear combination only identify supported
solutions at extreme points, while there may be other eficient solutions that are not considered
in that group but could be preferred by some users.
        </p>
        <p>
          Hence, it is necessary to have alternative solutions. One issue that arises is that, in a
multiobjective scenario, there is a set of eficient solutions instead of just one. Diferent solutions have
been proposed in the transportation and optimization communities; see for instance [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref8">8, 3, 2, 1</xref>
          ].
We remark that all these methods rely on a set of (pre-computed)  shortest paths and they are
centralized. Thus, they are not useful in a multiagent environment where each agent should
learn by its own experience using RL, nor they fit a state-based RL task, where agents learn at
each intersection.
        </p>
        <p>
          When employing RL to the problem of how agents should travel from A to B, there are two
possible approaches. The first one relates closely to the TAP in the sense that  shortest routes
are computed for agents in each OD pair, although it is not centralized as it is the case with the
aforementioned methods. Examples of this approach are [
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ] among others.
        </p>
        <p>The second variant relax the assumption that each agent selects one of the  shortest routes
(and then travel on it, without deviating from the selected route). Rather, agents make choices
en route, i.e., they decide which link to follow once they find themselves in each vertex. This is
the approach followed in [16, 14] and also in the present paper.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Multiobjective Approach to Q-learning (PQL)</title>
        <p>In this section we discuss how to use QL when agents need to deal with more than one objective.
Besides the discussion ahead, we refer readers to other approaches; see [17, 18].</p>
        <p>In order to extend the aformentioned QL algorithm, [19] proposed Pareto Q-learning (PQL),
which integrates the Pareto dominance relation into a MORL approach. PQL considers both the
immediate reward vector and the set of expected future discounted reward vectors in order to
compute the so-called Q-sets, which are composed of vectors. In PQL, the set of expected future
discounted reward vectors relies on a function, called   (from non-dominated), that finds
those vectors that correspond to the possible future states, and which are not Pareto-dominated.</p>
        <p>Eq. 2 shows how Q-sets are calculated. (,̄ ) denotes the immediate reward vector and
   (, ) is the set of non-dominated vectors in the next state  , which is reached by performing
action  at time step  . (,̄ ) is added to each element of     (, ) . When  is selected at  ,
both terms are updated. (,̄ ) is updated according to Eq. 3, where  ⃗ is the new reward vector
and  (, ) is the number of times action  was selected in  .    (, ) is updated as shown in
Eq. 4, using the non-dominated vectors in the  ̃ of every action  ′ in the next state  ′.
 ̃ (, ) = (,̄ ) ⊕</p>
        <p>(, )
(,̄ ) =</p>
        <p>(,̄ ) +
   (, ) =  (∪
 ′ ̃ ( ′,  ′))
 ⃗− (,̄ )
 (, )
(2)
(3)
(4)</p>
        <p>PQL learns the entire Pareto front, finding multiple Pareto optimal solutions, provided that
each state-action pair is suficiently sampled. This algorithm is not biased by the Pareto front
shape (algorithms that find a single policy and use scalarization cannot sample the entire Pareto
front if it is non-convex) or a weight vector (it guides the exploration to specific parts of the
search space).</p>
        <p>Note that PQL assumes that the environment is deterministic, hence it is not directly useful
for environments in which the presence of multiple agents learning simultaneously cause
non-stationarity, as for instance the route choice domain we use ahead.</p>
        <p>Our contribution (called aPQL, detailed ahead) is an adaptation of PQL to deal with multiple
objectives and with multiple agents (thus, a non-stationary environment).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Markov Decision Process Definition</title>
        <p>
          Before presenting the details of the proposed method, some considerations to the Markov
decision process (MDP) underlying the learning task should be addressed. In an MDP, we define
a set of states  , a set of actions  , a reward function  ∶  ×  → ℝ , and a probabilistic state
transition  (, ,  ′) → [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ], with  ∈  being the state the agent is currently in,  ∈  the action
the agent takes and  ′ ∈  being the state it might end up, taking action  in state  .
        </p>
        <p>In the routing learning task, each intersection within a trafic network is a state  , and the
actions are the links an agent might take when in  (i.e., links that leave the intersection). In
short, agents build their routes by making decisions about which link to follow, when finding
themselves in a given intersection.</p>
        <p>Regarding the environment, the transition model is implemented by a microscopic trafic
simulator itself, which moves vehicles along the network. As for the rewards, since we deal
with a multiobjective scenario, the rewards returned by the simulator are assembled in a vector;
this way, the reward function definition is  ∶  ×  → ℝ  , where  is the number of objectives
to be optimized. As detailed in Section 3.3, in the present paper we deal with agents aiming at
optimizing two objectives: travel time and carbon monoxide emission. However, the proposed
method is general and can be used if more objectives are taken into account.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Time Steps and Episodes</title>
        <p>In a traditional RL environment, at each time step, each agent finds itself in state  , chooses
action  , receives a reward and transitions to a state  ′. In the learning task at hand, time steps
are controlled by the microscopic simulator and correspond to an unit of time such as one
second. This has important consequences. First, not all time steps count as decision-making
steps; these only happen when an agent is at an intersection. The rest of the time steps are
used (e.g., see plots ahead) just as clock units. Second, diferent agents find themselves either in
decision-making states, or are moving along a link (thus, not updating their value functions),
or have reached their destinations. In the latter case, an episode is finished for that particular
agent. Since travel times and destinations are diferent for most of the agents, episodes are not
synchronous. Therefore, in general, we do not refer to episodes; rather, we use the clock units
or time steps of the simulator to depict time. To control the simulation time, we introduce a
parameter called maximum number of steps,  .</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Reward Calculation</title>
        <p>In this section we detailed how we deal with multiple objectives in the reward function.</p>
        <p>In the exploration phase, due to uncoordinated action selection by the agents, congestion may
occur, which impacts the reward of the agents, while also inserting a lot of noise in the learning
process, which then leads to slow convergence. In order to speed up this process, and avoid
agents wandering around the network performing experimentation, two hyper-parameters
were introduced. The first one is a bonus  that is added to the reward when an agent reaches
its destination. The second is a penalty   that is subtracted to the reward when an agent is in a
dead-end node, i.e., a node from which it cannot reaches its destination at all.</p>
        <p>Given that the reward accounts for more than one objective (and thus is represented by a
vector of size  , where  is the number of objectives to optimize), both the bonus  and the
penalty   are added (subtracted in the case of   ) to the corresponding scalar value, as stated
in Eq. 5 (for the bonus) and Eq. 6 (for the penalty).</p>
        <p>⃗= ⃗+  ⋅
⃗
1
⃗= ⃗−   ⋅ 1⃗
(5)
(6)</p>
        <p>A further hyper-parameter of our model resembles a kind of toll on emissions of pollutants.
It is designed based on the assumption that certain links of the network should be spared from
high emission levels; this may be the case of links close to parks, hospitals, residential areas,
etc. This parameter is the penalty   ; it acts similarly to the aforementioned penalty   , with
the diference that   is subtracted only to the element of the reward vector that refers to the
emission. Let  ̂ be an  -th basis vector, where  is the position of the emission element of the
reward vector. Then the penalty   is used to update the reward as in Eq. 7. Algorithm 1
summarizes how the reward is calculated.</p>
        <p>⃗= ⃗−   ⋅  ̂
(7)
Algorithm 1: Compute Reward</p>
        <p>Data:  ,   ,   ,</p>
        <p>Result: ⃗
1 ⃗← −[.
2 ⃗← ⃗+  ⋅
3 ⃗← ⃗−   ⋅ 1⃗;
4 ⃗← ⃗−   ⋅  ̂ ;</p>
        <p>1⃗;
_
_  
_,
.
_
_]
;</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Action Selection: Adapted Pareto Q-Learning (aPQL)</title>
        <p>Originally, in PQL (i.e., as proposed in [19]), every pair state-action (, ) is associated with a
set of non-dominated points that represents the Pareto front. Q-sets are computed as in Eq. 2
and are used to compute a hypervolume of the Pareto front. This is then used to determine
the best action. In our case, this was not efective given that the Q-sets tend to contain few
points only, thus computations based on hypervolume tend to be inefective. Therefore, we had
to modify the action selection strategy. Instead of using the hypervolume of the Pareto front
to determine the best action, we let each agent select randomly (with uniform probability, at
each decision point, and independently of one another) which objective is to be optimized at
that given step. This seems to be a fair substitution for the hypervolume and avoids biasing
towards one objective. Once a particular objective is randomly drawn, then an agent uses the
 -greedy strategy to select an action, given the Q-set. Note that the approach continues to be
multiobjective, as all objectives have their values updated after a given action was selected.</p>
        <p>Given this adaptation in the action selection part of the approach proposed in [19], henceforth
we refer to ours as aPQL (adapted PQL). The remainder of aPQL follows basically the same
procedures underlying PQL.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. The Whole Picture</title>
        <p>Algorithm 2 shows how our method works for each agent. The required parameters are: 
(maximum number of time steps);  ,   and   (bonus, the destination, and the emission penalty
discussed in Section 3.3); the discount factor  as well as the  ( -greedy exploration strategy).
Left4
Left3
Left2
Left1
Left0</p>
        <p>Top0 Top1 Top2 Top3 Top4
A4
A3
A2
A1
A0</p>
        <p>B4
B3
B2
B1
B0</p>
        <p>C4
C3
C2
C1
C0</p>
        <p>D4
D3
D2
D1
D0</p>
        <p>E4 Right4
E3 Right3
E2 Right2
E1 Right1</p>
        <p>E0 Right0
Bottom0 Bottom1 Bottom2 Bottom3 Bottom4</p>
        <sec id="sec-3-5-1">
          <title>OD-Pairs</title>
        </sec>
        <sec id="sec-3-5-2">
          <title>Demand</title>
          <p>Bottom0|Top4 / Top4|Bottom0
Bottom1|Top3 / Top3|Bottom1
Bottom3|Top1 / Top1|Bottom3
Bottom4|Top0 / Top0|Bottom4</p>
          <p>Left0|Right4 / Right4|Left0
Left1|Right3 / Right3|Left1
Left3|Right1 / Right1|Left3
Left4|Right0 / Right0|Left4
in a destination or in a dead-end node. Recall that a dead-end node is one from which it is
not possible to reach a destination at all, an information provided by SUMO. In both cases,
Algorithm 1 is used to compute the reward, as stated in Section 3.3: in the case of a destination
node (line 12), a non-zero bonus is given; in the case of a dead-end (line 15), the reward includes
a non-zero penalty.</p>
          <p>The agent can then update its non-dominated policies for the last action using the Q-sets of
possible next states, and the average rewards (line 17). After these updates, the agent (re)starts
a new trip (an episode for this particular agent), thus it is reinserted at its origin node.</p>
          <p>The else statement in lines 19 - 29 deals with the agent that reached a node that is not its
destination (or a dead-end). In this case, the reward computation in line 20 neither a bonus nor
a penalty apply. Thus both  and   are set to zero.</p>
          <p>Both methods in lines 17 and 22, as well as the Q-set computation in line 21 are performed as
proposed in the original PQL algorithm.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <sec id="sec-4-1">
        <title>4.1. Network and Demand</title>
        <p>To test the approach, we used a 5 × 5 grid network depicted in Figure 1. The demand
corresponding to the OD (origin-destination) matrix is shown in in Table 1. For each OD pair, the
number of trips refers to both directions. For example, for the first line of the table, half of the
vehicles travel from Bottom0 to Top4 and half from Top4 to Bottom0.</p>
        <p>Each link shown in Figure 1 is two-way. The ticker links (in blue, connecting nodes A2 to
E2) are the ones in which we want to have less emission. Thus, each agent using those links
receive a penalty   (as discussed in Section 3.3) once it travels through them.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Methods and Parameters</title>
        <p>The method proposed was compared against three others: dynamic user assignment (DUA),
which is an iterative method for computing the user equilibrium provided by SUMO; the standard
QL optimizing only the agents’ travel time (QLTT); and the standard QL optimizing only the
CO emission (QLCO).</p>
        <p>
          All parameters discussed in Section 3 were set as follows: maximum simulation steps  =
50, 000; since we use normalized (ranging from [−1, 0]) values for the reward, all values regarding
reward computation were set as  = 1 ,   = 1 and   = 1, guaranteeing that the rewards values
received are always within [
          <xref ref-type="bibr" rid="ref1">−1, 1</xref>
          ], as are the rewards. For all learning methods, the discount
factor  = 0.9 and a fixed value  = 0.05 were set. Lastly, for the methods based on the standard
QL (QLTT and QLCO), a learning rate  = 0.5 was used. We remark that other values for these
parameters were tried, without significant improvement.
        </p>
        <p>Its is also worth mentioning that, due to the stochastic nature of all methods we used for
comparison, each of them were repeated 30 times. In the plots ahead, we show the mean values
(and their standard deviations, as shadows).</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results and Analysis</title>
        <p>The comparison among the aforementioned methods was performed using values related to
three diferent metrics: travel time (Figure 2a), CO emission (2b) and speed (2c). These values
were collected from each link in the network. Plots in Figure 2 show a moving average (over a
window of 600 steps) of the corresponding value, over all links.</p>
        <p>In order to investigate performances for a part of the network only (those links indicated in
blue in Figure 1), Figure 3 also depicts the three metrics, however values there are averaged
only over those particular links (not for the entire network, as in Figure 2).</p>
        <p>Results plotted in Figure 2 show a clear advantage of the aPQL method: it outperforms
the others, showing lower travel time, lower CO emission, and higher speed (already after
approximately 20,000 steps). These results seem to indicate that the aPQL method have a better
usage of the links in the network. On a related note, one might assume that higher speeds
would lead to higher CO emission. However, a more decisive factor is acceleration/deceleration.
If the speed is constant, this seems to not afect emissions in a significant negative way.</p>
        <p>Another issue is that one also observes a better distribution of vehicle in the network, which
may be due to the fact that the aPQL takes the advantage of both the optimization of travel time
and CO emission. The penalty imposed in the links A2 to E2 in Figure 1 causes both methods
that optimize the objectives independently to behave in a diferent way. Optimizing travel time
(QLTT) alone leads agents to a solution that can be seen in Figure 3(b), where no considerable
reduction to CO emission is observed when we look at the red curve that represents QLTT.
Conversely, optimizing only the CO emission causes the agents not to take the links in which
they are penalized. In fact, since they completely stop selecting such links, it is possible to see
that there are no data points for travel time and speed in Figure 3 regarding QLCO. One also
notices that the average network travel time from QLCO (green curve in Figure 2(a)) is slightly
higher than the one observed in QLTT. This happens due to the fact that, by avoiding several
links within the network, the agents in QLCO have less available actions, and thus experience
an increase in travel time.</p>
        <p>By employing an approach that considers both objectives at the same time, aPQL makes the
agents reach a compromise that seems to yield a better distribution of vehicles throughout the
network: they mostly (but not entirely) avoid the links in which the CO emission is penalized
(observed in the orange curves in Figure 3), using other parts of the network, taking advantage of
the fact that those links are less occupied. This leads to a considerable increase in performance
in relation to the other methods.</p>
        <p>
          Lastly, it is worth noting that the values observed in travel time in the present paper are
considerably lower than what was seen in our previous works, when the same grid networks
was used ([
          <xref ref-type="bibr" rid="ref13">13, 14</xref>
          ]). This is due to the fact that here travel times are per link, while previously
we have shown travel times over entire routes.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>Using RL to support agents making decisions that consider more than one objective poses
challenges. We formulate the problem of multiple agents learning to travel from A to B in
a trafic network as an RL task in which: (i) agents learning simultaneously result in
nonstationarity; (ii) there are potentially two or more objectives to be optimized; (iii) the underlying
learning task is modeled as a stochastic game, where states are intermediary locations in the
networks (such as intersections), in which decisions are made about which link to take.</p>
      <p>To solve this task, we propose an adaptation in an existing algorithm, PQL [19]. This algorithm
deals with more than one objective, but it was originally formulated for single-agent learning
tasks, in which the environment is deterministic. In route choice neither of these apply. Thus
our adaptation addresses them. We have applied the proposed method to a scenario in which
hundreds of agents compete for a scarce resource (link usage), and have to learn how to travel
from their origins to their destinations, aiming at minimizing their travel times, as well as the
CO vehicles emit. Results show that the adapted algorithm is able to find routes that are more
eficient then when either only travel time or only emissions are considered as rewards.</p>
      <p>Future work is planned in the direction of considering communication between vehicles and
infraestructure as in [14], given that this work was the basis of the present one; thus a revisiting
of that case (now considering various objectives) is an open direction. In this regard, we also
plan to include more objectives, such as other types of emissions.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is partially supported by FAPESP and MCTI/CGI (grants number 2020/05165-1 and
2021/05093-3). Ana Bazzan is partially supported by CNPq under grant number 304932/2021-3,
and by the German Federal Ministry of Education and Research (BMBF), Käte Hamburger Kolleg
Cultures des Forschens/ Cultures of Research.
[14] G. D. dos. Santos, A. L. C. Bazzan, Sharing diverse information gets driver agents to learn
faster: an application in en route trip building, PeerJ Computer Science 7 (2021) e428. URL:
https://doi.org/10.7717/peerj-cs.428. doi:10.7717/peerj-cs.428.
[15] G. Dytz dos Santos, A. L. C. Bazzan, A. P. Baumgardt, Using car to infrastructure
communication to accelerate learning in route choice, Journal of Information and Data Management
12 (2021). URL: https://sol.sbc.org.br/journals/index.php/jidm/article/view/1935.
[16] A. L. C. Bazzan, R. Grunitzki, A multiagent reinforcement learning approach to en-route
trip building, in: 2016 International Joint Conference on Neural Networks (IJCNN), 2016,
pp. 5288–5295. doi:10.1109/IJCNN.2016.7727899.
[17] R. Rǎdulescu, P. Mannion, D. Roijers, A. Nowé, Multi-objective multi-agent decision
making: a utility-based analysis and survey, Autonomous Agents and Multi-Agent Systems
34 (2020). doi:10.1007/s10458-019-09433-x.
[18] C. F. Hayes, R. Radulescu, E. Bargiacchi, J. Källström, M. Macfarlane, M. Reymond, T.
Verstraeten, L. M. Zintgraf, R. Dazeley, F. Heintz, E. Howley, A. A. Irissappane, P. Mannion,
A. Nowé, G. de Oliveira Ramos, M. Restelli, P. Vamplew, D. M. Roijers, A practical guide to
multi-objective reinforcement learning and planning, CoRR abs/2103.09568 (2021). URL:
https://arxiv.org/abs/2103.09568. arXiv:2103.09568.
[19] K. Van Mofaert, A. Nowé, Multi-objective reinforcement learning using sets of pareto
dominating policies, J. Mach. Learn. Res. 15 (2014) 3483–3512.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Raith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrgott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <article-title>Solving multi-objective trafic assignment</article-title>
          ,
          <source>Annals of Operations Research</source>
          <volume>222</volume>
          (
          <year>2014</year>
          )
          <fpage>483</fpage>
          -
          <lpage>516</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrgott</surname>
          </string-name>
          ,
          <article-title>Modelling route choice behaviour in a tolled road network with a time surplus maximisation bi-objective user equilibrium model</article-title>
          ,
          <source>Transportation Research Part B: Methodological</source>
          <volume>57</volume>
          (
          <year>2013</year>
          )
          <fpage>342</fpage>
          -
          <lpage>360</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.trb.
          <year>2013</year>
          .
          <volume>05</volume>
          .011.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. Y. T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrgott</surname>
          </string-name>
          ,
          <article-title>Tolling analysis with bi-objective trafic assignment</article-title>
          , in: M.
          <string-name>
            <surname>Ehrgott</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Naujoks</surname>
            ,
            <given-names>T. J.</given-names>
          </string-name>
          <string-name>
            <surname>Stewart</surname>
          </string-name>
          , J. Wallenius (Eds.),
          <article-title>Multiple Criteria Decision Making for Sustainable Energy</article-title>
          and
          <string-name>
            <given-names>Transportation</given-names>
            <surname>Systems</surname>
          </string-name>
          ,
          <year>2010</year>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Huanca-Anquise</surname>
          </string-name>
          ,
          <article-title>Multi-objective reinforcement learning methods for action selection: dealing with multiple objectives and non-stationarity, Master's thesis</article-title>
          , Instituto de Informática, UFRGS,
          <string-name>
            <surname>Porto</surname>
            <given-names>Alegre</given-names>
          </string-name>
          , Brazil,
          <year>2021</year>
          . URL: http://hdl.handle.net/10183/231836.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cesa-Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <article-title>Finite-time analysis of the multiarmed bandit problem</article-title>
          ,
          <source>Machine Learning</source>
          <volume>47</volume>
          (
          <year>2002</year>
          )
          <fpage>235</fpage>
          -
          <lpage>256</lpage>
          . doi:
          <volume>10</volume>
          .1023/A:
          <fpage>1013689704352</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Watkins</surname>
          </string-name>
          , Learning from Delayed Rewards,
          <source>Ph.D. thesis</source>
          , University of Cambridge,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. d. D.</given-names>
            <surname>Ortúzar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. G.</given-names>
            <surname>Willumsen</surname>
          </string-name>
          , Modelling transport, 4 ed., John Wiley &amp; Sons, Chichester, UK,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Dial</surname>
          </string-name>
          ,
          <article-title>A model and algorithm for multicriteria route-mode choice</article-title>
          ,
          <source>Transportation Research Part B: Methodological</source>
          <volume>13</volume>
          (
          <year>1979</year>
          )
          <fpage>311</fpage>
          -
          <lpage>316</lpage>
          . doi:https : / / doi.org / 10.1016 /
          <fpage>0191</fpage>
          -
          <lpage>2615</lpage>
          (
          <issue>79</issue>
          )
          <fpage>90024</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klügl</surname>
          </string-name>
          , Introduction to Intelligent
          <source>Systems in Trafic and Transportation</source>
          , volume
          <volume>7</volume>
          <source>of Synthesis Lectures on Artificial Intelligence and Machine Learning</source>
          , Morgan and Claypool,
          <year>2013</year>
          . doi:
          <volume>10</volume>
          .2200/S00553ED1V01Y201312AIM025.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G. de. O.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            da
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <article-title>Learning to minimise regret in route choice</article-title>
          , in: S.
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Durfee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Larson</surname>
          </string-name>
          , M. Winikof (Eds.),
          <source>Proc. of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS</source>
          <year>2017</year>
          ), IFAAMAS,
          <string-name>
            <surname>São</surname>
            <given-names>Paulo</given-names>
          </string-name>
          ,
          <year>2017</year>
          , pp.
          <fpage>846</fpage>
          -
          <lpage>855</lpage>
          . URL: http://ifaamas.org/Proceedings/aamas2017/pdfs/p846.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>T. B. F. de Oliveira</surname>
            ,
            <given-names>A. L. C.</given-names>
          </string-name>
          <string-name>
            <surname>Bazzan</surname>
            ,
            <given-names>B. C.</given-names>
          </string-name>
          da
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Grunitzki</surname>
          </string-name>
          ,
          <article-title>Comparing multi-armed bandit algorithms and Q-learning for multiagent action selection: a case study in route choice</article-title>
          , in: 2018
          <source>International Joint Conference on Neural Networks</source>
          ,
          <string-name>
            <surname>IJCNN</surname>
          </string-name>
          , IEEE, Rio de Janeiro,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . URL: https://doi.org/10.1109/IJCNN.
          <year>2018</year>
          .
          <volume>8489655</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Grunitzki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <article-title>Combining car-to-infrastructure communication and multiagent reinforcement learning in route choice</article-title>
          , in: A. L.
          <string-name>
            <surname>C. Bazzan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Klügl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Ossowski</surname>
          </string-name>
          , G. Vizzari (Eds.),
          <source>Proceedings of the Ninth Workshop on Agents in Trafic and Transportation (ATT-2016)</source>
          , volume
          <volume>1678</volume>
          <source>of CEUR Workshop Proceedings</source>
          , CEUR-WS.org, New York,
          <year>2016</year>
          . URL: http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>1678</volume>
          /paper12.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G. D. dos.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <article-title>Accelerating learning of route choices with C2I: A preliminary investigation</article-title>
          ,
          <source>in: Proc. of the VIII Symposium on Knowledge Discovery, Mining and Learning</source>
          , SBC,
          <year>2020</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          . doi:
          <volume>10</volume>
          .5753/kdmile.
          <year>2020</year>
          .
          <volume>11957</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>