<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>A Distributed Approach to Autonomous Intersection Management via Multi-Agent Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Cederle</string-name>
          <email>matteo.cederle@phd.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Fabris</string-name>
          <email>marco.fabris.1@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gian Antonio Susto</string-name>
          <email>gianantonio.susto@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Engineering, University of Padova</institution>
          ,
          <addr-line>via Gradenigo 6/B, 35131 Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Autonomous intersection management (AIM) poses significant challenges due to the intricate nature of real-world traffic scenarios and the need for a highly expensive centralised server in charge of simultaneously controlling all the vehicles. This study addresses such issues by proposing a novel distributed approach to AIM utilizing multi-agent reinforcement learning (MARL). We show that by leveraging the 3D surround view technology for advanced assistance systems, autonomous vehicles can accurately navigate intersection scenarios without needing any centralised controller. The contributions of this paper thus include a MARL-based algorithm for the autonomous management of a 4-way intersection and also the introduction of a new strategy called prioritised scenario replay for improved training efficacy. We validate our approach as an innovative alternative to conventional centralised AIM techniques, ensuring the full reproducibility of our results. Specifically, experiments conducted in virtual environments using the SMARTS platform highlight its superiority over benchmarks across various metrics.</p>
      </abstract>
      <kwd-group>
        <kwd>Autonomous Intersection Management</kwd>
        <kwd>Connected Autonomous Vehicles</kwd>
        <kwd>DQN</kwd>
        <kwd>Multi-Agent Reinforcement Learning</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Smart Mobility</kwd>
        <kwd>Traffic Scenarios</kwd>
      </kwd-group>
      <conference>
        <conf-name>ATT'24: Workshop Agents in Traffic and Transportation</conf-name>
      </conference>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        A vast literature exists on AIM. The research in this field spans multiple fronts, each leveraging distinct methodologies to address the challenges of optimizing traffic flow and ensuring safety in dynamic urban environments. By employing reinforcement learning (RL), AIM systems can effectively learn and adapt intersection control strategies in response to changing traffic conditions [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. These systems typically comprise priority assignment models, intersection control model learning, and safe brake control mechanisms. Experimental simulations demonstrate the superiority of RL-inspired AIM approaches over traditional methods, showcasing enhanced efficiency and safety. Graph neural networks (GNNs) have also garnered attention for their potential in AIM [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]. Leveraging RL algorithms, GNNs optimize traffic flow at intersections by jointly planning for multiple vehicles. These models encode scene representations efficiently, providing individual outputs for all involved vehicles. Game theory then serves as a foundational framework for MARL approaches in AIM. Indeed, game-theoretic models facilitate safe and adaptive decision-making for CAVs at intersections [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ]. By considering the diverse behaviors of interacting vehicles, these algorithms ensure flexibility and adaptability, thus enhancing autonomous vehicle performance in challenging scenarios. Finally, recurrent neural networks (RNNs) integrated in the MARL framework represent an interesting approach in AIM research to learn complex traffic dynamics and optimize vehicle speed control [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Despite the advancements in AIM techniques, their implementation still faces important challenges. One of the main obstacles is the need for an expensive centralised server, positioned in the proximity of the intersection, in charge of simultaneously controlling all the vehicles. Moreover, the vehicles should continuously send their local information to this centralised controller, which gathers and elaborates the data coming from all the road users before sending back to each vehicle a velocity or acceleration command. Given the complexity and high demands of this technological framework, the integration of AIM devices into existing transportation infrastructures still requires many years of extensive research and testing. In this direction, we devise an alternative distributed approach based on the 3D surround view technology for advanced assistance systems [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. As shown in the sequel, such a method allows the reconstruction of a 360° scene centered around each CAV, which is useful to recover the information required by each agent involved in the proposed MARL-based technique. This, in turn, allows AIM to be effectively carried out in a decentralised fashion, exploiting sensors that are currently available on the market, without the need for the centralised infrastructure described above. More precisely, the contributions of this paper are multiple.
        • As mentioned above, we offer a new distributed strategy that represents a competitive and realistic alternative to the classical centralised AIM techniques.
        • Relying on self-play [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and drawing inspiration from prioritised experience replay [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] to improve training efficacy, we develop a MARL-based algorithm capable of tackling and solving a 4-way intersection by means of the SMARTS platform [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
        • Our strategy outperforms a number of well-established benchmarks, which typically leverage traffic light regulation, in terms of travel time, waiting time and average speed.
        • Last but not least, we guarantee full reproducibility¹ of the code that is used for the generation of the virtual experiments shown in this manuscript.
      </p>
      <p>¹The Python code of our work can be found at https://github.com/mcederle99/MAD4QN-PS. The authors want to stress that reproducibility represents a crucial issue within this field of research.</p>
      <p>The remainder of this paper unfolds as follows. The preliminaries for this study are given in Section 2, whereas Section 3 provides the core of our contribution, namely the Multi-Agent Decentralised Dueling Double Deep Q-Networks algorithm with Prioritised Scenario Replay (MAD4QN-PS). This innovative method is then tested and validated through several virtual experiments, as illustrated in Section 4. Finally, Section 5 draws the conclusions for the present investigation, proposing future developments.</p>
      <p>Notation: The sets of natural and positive (zero included) real numbers are denoted by $\mathbb{N}$ and $\mathbb{R}_0^+$, respectively. Given a random variable (r.v.) $X$, its probability mass function is denoted by $\mathbb{P}[X = x]$, whereas $\mathbb{P}[X = x \mid Y = y]$ indicates the probability mass function of $X$ conditioned on the observation of a r.v. $Y$. Moreover, the expected value of a r.v. $X$ is denoted by $\mathbb{E}[X]$.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Theoretical background</title>
      <sec id="sec-3-1">
        <title>2.1. Basic notions of reinforcement learning</title>
        <p>
          RL is a machine learning paradigm in which an agent learns to solve a task by iteratively interacting with its environment. Solving the task means maximising the cumulative rewards obtained over time. A generic RL problem is formalised by the concept of Markov decision process (MDP) [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], which is a tuple composed of five elements: $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$. $\mathcal{S}$ and $\mathcal{A}$ are two generic sets, representing the state and action space respectively. $\mathcal{P}(s, a, s') = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$ is the state transition probability function, in charge of updating the environment to a new state $s' \in \mathcal{S}$ at each step, based on the previous state $s \in \mathcal{S}$ and the action $a \in \mathcal{A}$ performed by the agent. Moreover, the reward function $\mathcal{R}(s, a, s') : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is used to measure the quality of each transition, while $\gamma \in [0, 1)$ denotes a discount factor, used to compute the cumulative reward at time $t$, i.e. the return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. The agent decides which action to take at each iteration exploiting its policy, a function that maps any state to the probability of selecting each possible action:
          $$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s], \quad \forall a \in \mathcal{A}. \tag{1}$$
        </p>
        <p>Solving an RL problem means finding an optimal policy $\pi^*$. One criterion that is usually adopted to find $\pi^*$ consists in the maximization of the state-action value function $Q^\pi(s, a)$, i.e. the expected return starting from state $s \in \mathcal{S}$, taking action $a \in \mathcal{A}$, and thereafter following policy $\pi$:
          $$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]. \tag{2}$$
        Consequently, given the optimal state-action value function, the optimal policy is defined as $\pi^*(s) = \arg\max_a Q^*(s, a)$. There is therefore an inherent relation between $\pi^*$ and the optimal state-action value function.</p>
        <p>Finally, two other important quantities which will be used in the remainder of this article are the state value function and the advantage function. The former is defined as the expected return starting from state $s \in \mathcal{S}$ and then following policy $\pi$:
          $$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]. \tag{3}$$
        The latter is instead used to give a relative measure of importance to each action for a particular state, and it is defined starting from $Q^\pi(s, a)$ and $V^\pi(s)$:
          $$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s). \tag{4}$$
        </p>
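        <p>To make these quantities concrete, the following minimal Python sketch (not part of the original paper) estimates the quantities of Eqs. (2)-(4) from sampled reward sequences; the episodes and the uniform policy are illustrative assumptions.</p>
        <code language="python">
# Monte-Carlo style estimates of G_t, V(s) and A(s, a); purely illustrative.

def discounted_return(rewards, gamma=0.99):
    """G_t = sum_k gamma^k * R_{t+k+1}, accumulated backwards over an episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g

# Two hypothetical episodes starting from the same state s,
# one for each available action.
g_a1 = discounted_return([0.0, 0.0, 1.0])  # estimate of Q(s, a1)
g_a2 = discounted_return([0.0, 1.0])       # estimate of Q(s, a2)
v_s = 0.5 * (g_a1 + g_a2)                  # V(s) under a uniform policy
print("A(s, a1) =", g_a1 - v_s, "A(s, a2) =", g_a2 - v_s)
        </code>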
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Q-Learning and Deep Q-Networks</title>
        <p>
          To compute the optimal state-action value function we could theoretically exploit the recursive Bellman optimality equation [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]:
          $$Q^*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \mid S_t = s, A_t = a \right]; \tag{5}$$
          however, due to the curse of dimensionality and the need for perfect statistical information to compute the closed-form solution, it is necessary to resort to iterative learning strategies even to solve simple RL problems. The most common algorithm used in the literature is Q-Learning [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], where the state-action value function is represented by a table, which is iteratively updated at each step through an approximation of (5):
          $$Q_{t+1}(s_t, a_t) \leftarrow Q_t(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \right), \tag{6}$$
          where $\alpha &gt; 0$ is called the step-size parameter. The policy derived from the state-action value function is usually the $\varepsilon$-greedy policy, suitable to balance the trade-off between exploration and exploitation [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]:
          $$\pi(a \mid s) = \begin{cases} \arg\max_a Q(s, a) &amp; \text{with probability } 1 - \varepsilon, \\ \text{random action } a \in \mathcal{A} &amp; \text{with probability } \varepsilon. \end{cases} \tag{7}$$
          Tabular Q-Learning works well for simple tasks, but the problem rapidly becomes intractable when the state space becomes very large or even continuous. For this reason, state-of-the-art RL algorithms employ function approximators, such as neural networks (NNs), to solve realistic and complex problems. One of the first yet most used deep RL algorithms is Deep Q-Networks (DQN) [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], which approximates the state-action value function through a NN, $Q(s, a; \theta)$. A replay memory is used to store the transition tuples $(s, a, r, s')$. Finally, the parameters $\theta$ of the Q-Network are optimised by sampling batches $\mathcal{B}$ of transitions from the replay memory and minimizing a mean squared error loss derived from (6):
          $$\mathcal{L}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left[ \left( r_i + \gamma \max_{a'} Q(s'_i, a'; \bar{\theta}) - Q(s_i, a_i; \theta) \right)^2 \right], \tag{8}$$
          where $\bar{\theta}$ represents the parameters of a target network, which are periodically duplicated from $\theta$ and kept unchanged for a predefined number of iterations.
        </p>
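        <p>As an illustration of update (8), the following PyTorch sketch performs one DQN step with a frozen target network; the network sizes and variable names are our own assumptions, not taken from the paper.</p>
        <code language="python">
# Minimal single-step DQN update implementing the loss of Eq. (8).
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())  # copy theta into theta_bar
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(s, a, r, s_next, done):
    # Q(s_i, a_i; theta) for the actions actually taken in the batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target uses the frozen parameters theta_bar
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
        </code>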
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Multi-agent reinforcement learning</title>
        <p>
          MARL expands upon traditional RL by incorporating multiple agents, each making decisions in
an environment where their actions influence both the immediate rewards and the observations
of other agents. In its most general definition, a MARL problem is formalised as a partially
observable stochastic game (POSG), in which each agent has its own action space and reward
function. Moreover, the partial observability derives from the fact that the agents do not perceive
the global state, but just local observations, which carry incomplete information about the
environment [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ].
        </p>
        <p>
          MARL algorithms can be categorised depending on the type of information available to the agents during training and execution: in centralised training and centralised execution (CTCE), the learning of the policies as well as the policies themselves use some type of structure that is centrally shared between the agents. On the other hand, in decentralised training and decentralised execution (DTDE), the agents are fully independent and do not rely on centrally shared mechanisms. Finally, the centralised training and decentralised execution (CTDE) paradigm lies in between the first two, exploiting centralised training to learn the policies, while the execution of the policies themselves is designed to be decentralised [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Multi-Agent Decentralised Dueling Double Deep Q-Networks with Prioritised Scenario Replay</title>
      <p>
        In this section we present our novel method based on MARL, called Multi-Agent Decentralised
Dueling Double Deep Q-Networks with Prioritised Scenario Replay (MAD4QN-PS). We begin
by detailing how the system is modelled, and then we describe the original learning procedure
that we implement in order to train agents through self-play [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Finally, we shall introduce
the prioritised scenario replay pipeline that is implemented to speed up training.
      </p>
      <sec id="sec-4-1">
        <title>3.1. System modelling and design</title>
        <p>The environment in which the agents live consists of a 4-way 1-lane intersection, with three different turning intentions available to each vehicle.</p>
        <p>Recalling Section 2.3, we formalize the problem as a POSG, which can be seen as a multi-agent
extension to MDPs. For this reason, we shall define for each agent the observation space, the
action space and the reward function.</p>
        <p>The information retrieved by each vehicle at every time step consists of a local RGB bird's-eye view image with the vehicle at the center. As already discussed in Section 1, this type of data is already recoverable from sensors with modern technology, thus making such a configuration extremely interesting from an application point of view. Moreover, the final observation passed to the agent is represented by a stack of $k \in \mathbb{N}$ consecutive frames, thus allowing the algorithm to capture temporal dependencies and understand how the environment is changing over time.</p>
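        <p>A minimal sketch of such a frame stack is given below; the buffer semantics follow the description above, while the array shapes are assumptions for illustration ($k = 3$ and 48 × 48 RGB frames are the values used in Section 4).</p>
        <code language="python">
# Illustrative k-frame observation stack for the bird's-eye-view images.
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the last k frames and returns them stacked along the channels."""
    def __init__(self, k=3):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # At episode start the buffer is filled with copies of the first frame.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, new_frame):
        self.frames.append(new_frame)  # the oldest frame is dropped
        return self.observation()

    def observation(self):
        # e.g. three 48x48x3 RGB frames become a 48x48x9 observation
        return np.concatenate(list(self.frames), axis=-1)
        </code>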
        <p>
          The action space of each agent is instead discrete, and it contains $m \in \mathbb{N}$ velocity commands. This choice has been made because the purpose of our algorithm is not to learn the basic skills required for driving, such as keeping the lane and following a trajectory, but rather to choose how to behave in traffic conditions and when to interact with the other vehicles present in the environment. Moreover, a similar high-level perspective has also been implemented in other works related to the centralised AIM paradigm [
          <xref ref-type="bibr" rid="ref15 ref16 ref19">15, 16, 19</xref>
          ].
        </p>
        <p>Finally, for what concerns the reward function, we need to take into consideration the fact that each agent is trying to solve a multi-objective problem. Indeed, the main goal of each vehicle is crossing the intersection and reaching the end of the scenario. In the meantime, a vehicle is also required not to collide with the others, travelling as smoothly as possible. In order to fulfill all these objectives we design a reward signal composed of different terms:
          $$r_t = \begin{cases} +d &amp; \text{if } d &gt; 0, \\ -\eta &amp; \text{if the vehicle is not moving}, \\ -10 \cdot \eta &amp; \text{if a collision occurs}, \\ +10 \cdot \eta &amp; \text{if the scenario is completed}, \end{cases} \tag{9}$$
          where $d \in \mathbb{R}_0^+$ is the distance travelled in meters from the previous time step and $\eta \in \mathbb{N}$ is a hyperparameter used to weight the importance of the last three components of the reward function with respect to the first one.</p>
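        <p>A possible Python rendering of signal (9) is sketched below; the ordering of the checks is our own assumption on how the mutually exclusive cases are resolved.</p>
        <code language="python">
# Illustrative implementation of the reward signal (9), with eta = 1 as in Section 4.

def reward(d, moving, collided, completed, eta=1.0):
    """d: distance travelled [m] since the previous time step."""
    if collided:
        return -10.0 * eta     # strong penalty for collisions
    if completed:
        return +10.0 * eta     # bonus for reaching the end of the scenario
    if not moving:
        return -eta            # discourage unnecessary stops
    return max(d, 0.0)         # progress term: +d whenever the vehicle advanced
        </code>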
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Learning Strategy</title>
        <p>The starting point for our learning strategy is the Deep Q-Networks algorithm, already presented in Section 2.2. This algorithm is then slightly modified by considering the Double DQN scheme and also the Dueling architecture, which are briefly introduced in the sequel.</p>
        <p>
          The idea of Double DQN [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] originates from the fact that Q-Learning, and consequently also DQN, are known to overestimate state-action values under certain conditions. This is due to the max operation (see (6) and (8)) performed to compute the temporal difference target. To mitigate this effect, the idea is to decouple the action selection and evaluation steps by using two different networks. We thus exploit the online network in the action selection step, while we keep using the target network for evaluation. This leads to the following modification of the loss function:
          $$\mathcal{L}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left[ \left( r_i + \gamma\, Q(s'_i, \arg\max_{a'} Q(s'_i, a'; \theta); \bar{\theta}) - Q(s_i, a_i; \theta) \right)^2 \right]. \tag{10}$$
        </p>
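        <p>The decoupling of selection and evaluation in (10) amounts to a few lines; the sketch below is illustrative and reuses the q_net/target_net naming introduced earlier.</p>
        <code language="python">
# Double DQN target: the online network selects a', the target network evaluates it.
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # selection (theta)
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluation (theta_bar)
        return r + gamma * (1 - done) * q_eval
        </code>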
        <p>
          Dueling DQN [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] instead introduces a modification in the NN architecture. Instead of having a unique final layer that outputs the Q-value for each possible action, we split it in two, with the first layer in charge of estimating the state value function (3) and the second layer used for evaluating the advantage function (4). These two quantities are then combined in the following way to produce an estimate of the state-action value function:
          $$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right), \tag{11}$$
          where $\beta$ and $\alpha$ are the network parameters of the final layers, specific to the state value function and the advantage function respectively, while subtracting the term $\frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha)$ is needed for stability reasons.
        </p>
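        <p>Equation (11) corresponds to a two-stream output layer; the following PyTorch module is a minimal sketch, where the feature dimension is an assumption rather than the paper's architecture.</p>
        <code language="python">
# Dueling head implementing the aggregation of Eq. (11).
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, in_features=128, n_actions=2):
        super().__init__()
        self.value = nn.Linear(in_features, 1)              # V(s; theta, beta)
        self.advantage = nn.Linear(in_features, n_actions)  # A(s, a; theta, alpha)

    def forward(self, features):
        v = self.value(features)      # shape (B, 1)
        a = self.advantage(features)  # shape (B, n_actions)
        # Subtract the mean advantage for identifiability, as in Eq. (11).
        return v + a - a.mean(dim=1, keepdim=True)
        </code>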
        <p>The final algorithm used for training is therefore a multi-agent version of Dueling Double DQN, known as D3QN, with a linearly-annealed $\varepsilon$-greedy policy for all the agents. In order to allow for decentralised execution while developing at the same time a smart training strategy, we consider an intermediate approach between the DTDE and the CTDE paradigms. In particular, we initialize and train three different D3QN agents, one for each turning intention, i.e. left, straight and right. In this way each vehicle can select which model to use at the beginning of its path, according just to its own turning intention.</p>
        <p>This approach is extremely sample-efficient, because we keep the number of network parameters constant, regardless of the number of vehicles considered. Moreover, these shared parameters are optimised through the experiences generated by all the vehicles, leading to a more diverse set of trajectories for training. Indeed, each of the three models has its own replay buffer, which contains transitions shared from all the vehicles with the corresponding turning intention. The crucial insight that makes our strategy effective is the fact that the observations gathered from each vehicle are invariant with respect to the road on which the vehicle itself is positioned. This parameter and experience sharing approach renders the training procedure of the algorithm somewhat centralised, because trajectories coming from different vehicles are used to train the three D3QN agents. However, we remark that, once the models have been trained, the execution phase is completely decentralised, since each vehicle locally stores the three different models. Then, at the beginning of the scenario, each CAV selects the model to use based only on its own turning intention.</p>
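        <p>Conceptually, the decentralised execution phase reduces to a lookup, as in the hypothetical sketch below; the models dictionary and the greedy_action method are placeholders, not the released API.</p>
        <code language="python">
# Each CAV stores the three trained models and selects one from its own
# turning intention; no centralised controller is involved at execution time.
INTENTIONS = ("left", "straight", "right")

class CAV:
    def __init__(self, models, intention):
        assert intention in INTENTIONS
        # The model is selected once, at the beginning of the path.
        self.model = models[intention]

    def act(self, observation):
        # Greedy action from the local observation only.
        return self.model.greedy_action(observation)
        </code>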
      </sec>
      <sec id="sec-4-3">
        <title>3.3. The prioritised scenario replay strategy</title>
        <p>The agents are trained for a fixed number of iterations $T \in \mathbb{N}$, keeping the intersection busy in order to obtain meaningful transitions to learn from. In particular, at each episode we consider the most complicated situation, in which there are four vehicles simultaneously crossing the intersection, one for each road and with a random turning intention.</p>
        <p>
          Every $E \in \mathbb{N}$ time steps we pause training and run an evaluation phase. During this period, the agents use a greedy policy to face all the possible scenarios described above. When the evaluation is completed, we use the inverse of the returns from all the scenarios to build a probability distribution, and in the following training window we sample the different scenarios according to such a distribution. In this way we allow the agents to learn more from the most complicated situations. We name this original training strategy prioritised scenario replay because of its conceptual similarity with the prioritised experience replay scheme [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], common in many RL algorithms. Algorithm 1 illustrates the proposed learning strategy in detail.
        </p>
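        <p>The construction of the sampling distribution can be sketched as follows; shifting the returns to keep the priorities positive is our own assumption, since returns may be negative.</p>
        <code language="python">
# Illustrative prioritised scenario replay: scenarios with lower evaluation
# returns are sampled more often in the next training window.
import numpy as np

def scenario_distribution(eval_returns, eps=1e-6):
    """eval_returns[i]: return obtained on scenario i during evaluation."""
    shifted = np.asarray(eval_returns) - np.min(eval_returns) + eps
    priorities = 1.0 / shifted            # inverse returns as priorities
    return priorities / priorities.sum()  # normalise to a distribution

rng = np.random.default_rng(0)
probs = scenario_distribution([12.0, 3.0, 8.5])  # the second scenario is hardest
next_scenario = rng.choice(len(probs), p=probs)
        </code>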
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments on virtual environments</title>
      <p>
        In order to train and evaluate our algorithm we need a suitable simulation environment. For this project we have chosen the platform SMARTS [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], explicitly designed for MARL experiments in autonomous driving. SMARTS relies on the external provider SUMO (Simulation of Urban MObility) [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], which is a widely used microscopic traffic simulator, available under an open source license. For our setup, we have used SMARTS as a bridge between SUMO and the MARL framework, since it follows the standard Gymnasium APIs [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], widely used in the RL community.
      </p>
      <p>
        To develop our code, Python 3.8 was employed along with version 1.4 of the deep learning library PyTorch [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]. Moreover, an NVIDIA TITAN Xp GPU was used to run our experiments. As already mentioned in Section 3.1, we have built a 4-way 1-lane intersection scenario, with three different turning intentions available to the vehicles coming from each of the four ways.
      </p>
      <preformat>
Algorithm 1 MAD4QN-PS
 1: Initialize three state-action value networks Q_j with random parameters θ_j, j = 1, 2, 3
 2: Initialize three target state-action value networks Q̄_j with parameters θ̄_j = θ_j, j = 1, 2, 3
 3: Initialize three replay buffers D_j, j = 1, 2, 3
 4: Set up the initial ε, the decay factor ε_dec, the evaluation period E, the target update period C and the discount factor γ
 5: Uniformly initialize the scenarios probability distribution
 6: max_episode_steps ← M
 7: t ← 0
 8: while t &lt; T do
 9:     episode_terminated ← False
10:     Randomly reset the environment
11:     episode_steps ← 0
12:     while not episode_terminated do
13:         N ← number of vehicles currently present in the simulation
14:         Assign each vehicle to one of the three agents, based on its turning intention
15:         for all vehicles i in 1, ..., N do
16:             Collect the observation o_i for vehicle i
17:             With probability ε select a random action a_i
18:             Otherwise a_i ← arg max_a Q_j(o_i, a; θ_j), where j depends on the turning intention of i
19:         end for
20:         Apply actions a_i and collect observations o'_i and rewards r_i for i in 1, ..., N
21:         Store each transition in the corresponding replay buffer D_j, j = 1, 2, 3
22:         for all agents j = 1, 2, 3 do
23:             Sample a random batch of transitions B_j from replay buffer D_j
24:             Update parameters θ_j by minimising the loss function:
                L(θ_j) = (1/|B_j|) Σ_{i∈B_j} [(r_i + γ Q̄_j(o'_i, arg max_{a'} Q_j(o'_i, a'; θ_j); θ̄_j) − Q_j(o_i, a_i; θ_j))²]
25:         end for
26:         ε ← ε − ε_dec
27:         t ← t + 1
28:         if t % C == 0 then
29:             Update the target network parameters θ̄_j = θ_j for each agent j = 1, 2, 3
30:         end if
31:         if t % E == 0 then
32:             Run the evaluation phase and update the scenarios probability distribution as described in Section 3.3
33:         end if
34:         episode_steps ← episode_steps + 1
35:         if a collision occurred or N == 0 or episode_steps == max_episode_steps then
36:             episode_terminated ← True
37:         end if
38:     end while
39: end while
      </preformat>
      <p>The simulation step has been fixed to 100 ms. Regarding the observation of each vehicle, we have stacked $k = 3$ consecutive frames, each consisting of an RGB image of dimensions 48 × 48 pixels. The action space instead contains $m = 2$ possible velocity commands², namely 0 and 15 m/s. The chosen velocity references are then fed at each iteration to a speed controller, in charge of driving the vehicle until the subsequent time step. For what concerns the reward function (9), we have fixed its hyperparameter to $\eta = 1$. Regarding the architecture of the NN used to approximate the state-action value function, we have considered a convolutional neural network (CNN), whose structural details are summarised in Table 1. Finally, the training hyperparameters of Algorithm 1 are reported in Table 2.</p>
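      <p>Since Table 1 is not reproduced here, the module below only illustrates the input/output interface described above (a stack of $k = 3$ RGB frames of 48 × 48 pixels in, Q-values for $m = 2$ velocity commands out); the layer widths are assumptions, not the paper's configuration.</p>
      <code language="python">
# Illustrative dueling CNN with the stated input/output dimensions.
import torch
import torch.nn as nn

class D3QNNet(nn.Module):
    def __init__(self, in_channels=9, n_actions=2):  # 3 frames x 3 RGB channels
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy pass
            n_flat = self.features(torch.zeros(1, in_channels, 48, 48)).shape[1]
        self.value = nn.Linear(n_flat, 1)
        self.advantage = nn.Linear(n_flat, n_actions)

    def forward(self, x):
        h = self.features(x)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)  # aggregation of Eq. (11)
      </code>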
        <sec id="sec-5-2-1">
          <title>4.1. Baselines</title>
        <p>
          In order to assess the quality of our algorithm, we benchmark it against the following baselines³:
          • Random policy (RP) for all the vehicles, which helps confirm whether our algorithm is effectively learning meaningful patterns, as it demonstrates its ability to outperform random actions, which lack any deliberate learning process.
          • Three symmetric (N/S &amp; W/E) fixed-time traffic lights (FTTL), considering two cycle lengths already analyzed in [
          <xref ref-type="bibr" rid="ref16 ref19">16, 19</xref>
          ], and also the optimal cycle length as a function of the traffic flow, computed according to Webster's formula [37]. The final flow rate of vehicles for evaluation has been set to 600 veh/hour, as will be discussed in Section 4.2.
        </p>
        <p>²Such a choice for the action space has been made because we observed that considering more velocity commands only introduced more complexity in the system, without increasing the performance of the algorithm.</p>
        <p>³The baseline simulations with traffic lights have been performed exploiting Flow [36], another platform used to interface with SUMO, which easily allows for the definition and control of traffic lights.</p>
        <p>
          • Two symmetric (N/S &amp; W/E) actuated traffic lights (ATL) [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], with different cycle lengths, which operate by transitioning to the next phase once they identify a pre-specified time gap between consecutive vehicles. In this way the allocation of green time across phases is optimised and the cycle duration is adjusted in accordance with changing traffic dynamics. The parameters of the five traffic light configurations are reported in Table 3.
          As a final note, we emphasize that we have not considered any RL-based centralised AIM approach as a baseline, because the purpose of our method is to propose a more realistic and feasible alternative to them, which is nevertheless able to outperform classical intersection control methods, such as traffic lights.
        </p>
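        <p>For reference, the optimal cycle length used in the FTTL baseline follows Webster's formula as commonly stated; the sketch below uses illustrative phase parameters, not the values of our configurations.</p>
        <code language="python">
# Webster's optimal fixed-time cycle length [37]:
# C_opt = (1.5 * L + 5) / (1 - Y), with L the total lost time per cycle [s]
# and Y the sum of the critical flow ratios over the phases.

def webster_optimal_cycle(lost_time_s, critical_flow_ratios):
    y_total = sum(critical_flow_ratios)
    assert y_total &lt; 1.0, "oversaturated intersection: formula not applicable"
    return (1.5 * lost_time_s + 5.0) / (1.0 - y_total)

# Two symmetric phases (N/S and W/E) with illustrative flow ratios:
print(webster_optimal_cycle(lost_time_s=10.0, critical_flow_ratios=[0.2, 0.2]))
        </code>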
        </sec>
        <sec id="sec-5-2-2">
          <title>4.2. Results</title>
        <p>To assess the quality of MAD4QN-PS we consider four metrics, namely the travel time, the waiting time, the average speed and the collision rate. We remark that we have accounted for vehicle-centered metrics because of the decentralised nature of our algorithm. However, it is evident that by optimizing the performance of each single road user we also implicitly improve the quality of the whole intersection. The robustness of our method is ensured by performing training ten times across different seeds, and then considering all the different trained models while evaluating our strategy. In particular, each model has been tested by running a cycle of the evaluation phase presented in Section 3.3, considering 600 veh/hour as the flow rate of vehicles coming through the intersection. Then, the results obtained from the different models have been averaged out. Moreover, to ensure a fair comparison and analysis of the results, the same evaluation setup has been adopted for all the baselines introduced above. Finally, it is worth noticing that the inference time of the networks at evaluation phase is at most 1 ms, thus allowing for real-time control, given that the simulation step has been fixed to 100 ms, as discussed at the beginning of this section.</p>
        <p>Starting from Figure 1a, we can observe the average travel time and waiting time of a generic vehicle for all the methods. The former is defined as the overall time that the vehicle spends inside the environment, while the latter is defined as the fraction of the travel time in which the vehicle is moving with velocity less than or equal to 0.1 m/s, i.e. when it is stopped or almost stopped. We clearly see that our method strongly outperforms all the traffic light configurations. This is mainly due to the fact that, when using traffic lights, a fraction of the vehicles is forced to stop as soon as the corresponding light becomes red. Conversely, the trained MAD4QN-PS agents are able to smoothly handle the interaction among multiple vehicles, allowing them to avoid stopping unless strictly necessary. Figure 1b instead displays the average speed of each vehicle. The results shown in this histogram are clearly related to those in Figure 1a; indeed, also in this case we can see that our method outperforms traffic light control schemes. This occurs since the vehicles almost never stop, thus keeping a smoother velocity profile throughout the duration of the simulation.</p>
          <p>
            Lastly, we are left with the analysis of the random policy baseline, as we need to look at all three plots to fully understand its behaviour. If we just looked at Figures 1a and 1b we could argue that the random policy performance is similar to that of MAD4QN-PS. This hypothesis is however disproved by Figure 1c, where the average collision rate for each vehicle is illustrated. The extremely high collision percentage obtained by the random policy explains why, on average, each vehicle spends only a short time in the environment while travelling at high velocity. Indeed, the simulation is terminated as soon as a vehicle crashes. MAD4QN-PS, instead, achieves an extremely low collision rate. In addition, the fact that such a collision rate is non-zero is expected and also observed in other works exploiting RL-based techniques [
            <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
            ], given that our algorithm has to implicitly learn collision avoidance through the reward signal. In practice, the remaining failures are not problematic, because we can integrate rule-based sanity checks into the pipeline in order to be 100% collision-free. Additionally, we note that two out of the ten models trained with different seeds achieve exactly 0% collision rate, meaning that if we select one of those models for deployment we are able to attain collision-free performance. This is interesting since, from an applicability perspective, only the best trained model would be used in practice. As a final note, we have not plotted the collision rate of the traffic light methods for better visualization, since the latter quantity is trivially zero for all the configurations.
          </p>
          <p>A short video showing the performance of MAD4QN-PS can be found at this link.</p>
        </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions and future directions</title>
      <p>In this study, we consider a distributed approach to face the AIM paradigm. In particular, we propose a novel algorithm which exploits MARL through self-play and an original learning strategy, named prioritised scenario replay, to train three different intersection crossing agents. The derived models are stored inside CAVs, which are then able to complete their paths by choosing the model corresponding to their own turning intention while relying just on local observations. Our algorithm represents a feasible and realistic alternative to the centralised AIM concept, which is still expected to require years of technological advancement to be implementable in a real-world scenario. In addition, simulation experiments demonstrate the superior performance of our method w.r.t. classic intersection control schemes, such as static and actuated traffic lights, in terms of travel time, waiting time and average speed for each vehicle.</p>
      <p>In future works, we aim to explore different directions for advancement. In particular, one of the main objectives is to also consider human-driven vehicles inside the environment and extend our approach to this field of research (see, e.g., the initial effort made in [38]). In this case, the most challenging issue is indeed represented by the synchronization of traffic lights accounting for the presence of human-driven vehicles. Moreover, given the decentralised nature of the proposed method, we expect to render our algorithm more robust without dramatically changing it. Conversely, a significant redesign would be necessary for a centralised AIM approach. Furthermore, we envisage testing more complicated scenarios, both in terms of dimension and layout, again to improve the robustness of our algorithm. Finally, we intend to implement our algorithm in a scaled real-world scenario with miniature vehicles [39], to practically demonstrate the applicability of our method.</p>
      <p>[Figure 1: comparison among FTTL1, FTTL2, FTTLOPT, ATL1, ATL2, RP and MAD4QN-PS: (a) average travel and waiting time per vehicle; (b) average speed per vehicle [m/s]; (c) average collision rate [%].]</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This study was carried out within the MOST (the Italian National Center for Sustainable
Mobility), Spoke 8: Mobility as a Service and Innovative Services, and received funding from
NextGenerationEU (Italian PNRR – CN00000023 - D.D. 1033 17/06/2022 - CUP C93C22002750006).</p>
      <p>[34] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch (2017).
[35] G. Hinton, N. Srivastava, K. Swersky, Lecture 6a: Overview of mini-batch gradient descent, Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture (2012).
[36] C. Wu, A. R. Kreidieh, K. Parvate, E. Vinitsky, A. M. Bayen, Flow: A modular learning framework for mixed autonomy traffic, IEEE Transactions on Robotics 38 (2021) 1270-1286.
[37] F. V. Webster, Traffic signal settings, Technical Report, 1958.
[38] Z. Yan, C. Wu, Reinforcement learning for mixed autonomy intersections, in: 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), IEEE, 2021, pp. 2089-2094.
[39] L. Paull, J. Tani, H. Ahn, J. Alonso-Mora, L. Carlone, M. Cap, Y. F. Chen, C. Choi, J. Dusek, Y. Fang, et al., Duckietown: an open, inexpensive and flexible platform for autonomy education and research, in: 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 1497-1504.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kopelias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Demiridi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vogiatzis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skabardonis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zafiropoulou</surname>
          </string-name>
          ,
          <article-title>Connected &amp; autonomous vehicles - environmental impacts - a review</article-title>
          ,
          <source>Science of the Total Environment</source>
          <volume>712</volume>
          (
          <year>2020</year>
          )
          <fpage>135237</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Savithramma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Ashwini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sumathi</surname>
          </string-name>
          ,
          <article-title>Smart mobility implementation in smart cities: A comprehensive review on state-of-art technologies</article-title>
          ,
          <source>in: 2022 4th International Conference on Smart Systems and Inventive Technology (ICSSIT)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>10</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Papadoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Quddus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Imprialou</surname>
          </string-name>
          ,
          <article-title>Evaluating the safety impact of connected and autonomous vehicles on motorways</article-title>
          ,
          <source>Accident Analysis &amp; Prevention</source>
          <volume>124</volume>
          (
          <year>2019</year>
          )
          <fpage>12</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Szénási</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kertész</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Felde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nádai</surname>
          </string-name>
          ,
          <article-title>Statistical accident analysis supporting the control of autonomous vehicles</article-title>
          ,
          <source>Journal of Computational Methods in Sciences and Engineering</source>
          <volume>21</volume>
          (
          <year>2021</year>
          )
          <fpage>85</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Brewer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kameswaran</surname>
          </string-name>
          ,
          <article-title>Understanding the power of control in autonomous vehicles for people with vision impairment</article-title>
          ,
          <source>in: Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>185</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Dicianno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sivakanthan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Sundaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satpute</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kulich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Powers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Deepak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cooper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Cooper</surname>
          </string-name>
          ,
          <article-title>Systematic review: Automated vehicles and services for people with disabilities</article-title>
          ,
          <source>Neuroscience Letters</source>
          <volume>761</volume>
          (
          <year>2021</year>
          )
          <fpage>136103</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hyldmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prorok</surname>
          </string-name>
          ,
          <article-title>A fleet of miniature cars for experiments in cooperative driving</article-title>
          ,
          <source>in: 2019 International Conference on Robotics and Automation (ICRA)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3238</fpage>
          -
          <lpage>3244</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Talebpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Mahmassani</surname>
          </string-name>
          ,
          <article-title>Influence of connected and autonomous vehicles on traffic flow stability and throughput</article-title>
          ,
          <source>Transportation Research Part C: Emerging Technologies</source>
          <volume>71</volume>
          <year>2016</year>
          )
          <fpage>143</fpage>
          -
          <lpage>163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Conlon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Greenhouse gas emission impact of autonomous vehicle introduction in an urban network</article-title>
          ,
          <source>Transportation Research Record</source>
          <volume>2673</volume>
          (
          <year>2019</year>
          )
          <fpage>142</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Taiebat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Safford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A review on energy, environmental, and sustainability implications of connected and automated vehicles</article-title>
          ,
          <source>Environmental science &amp; technology 52</source>
          (
          <year>2018</year>
          )
          <fpage>11449</fpage>
          -
          <lpage>11465</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brosig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Plinge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Eskofier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mutschler</surname>
          </string-name>
          ,
          <article-title>An introduction to multiagent reinforcement learning and review of its application to autonomous mobility</article-title>
          ,
          <source>in: 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1342</fpage>
          -
          <lpage>1349</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Dresner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <article-title>Multiagent traffic management: A reservation-based intersection control mechanism</article-title>
          ,
          <source>in: Autonomous Agents and Multiagent Systems, International Joint Conference on</source>
          , volume
          <volume>3</volume>
          , Citeseer
          ,
          <year>2004</year>
          , pp.
          <fpage>530</fpage>
          -
          <lpage>537</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Karthikeyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-A.</given-names>
            <surname>Hsiung</surname>
          </string-name>
          ,
          <article-title>Autonomous intersection management by using reinforcement learning</article-title>
          ,
          <source>Algorithms</source>
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>326</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ayeelyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-A.</given-names>
            <surname>Hsiung</surname>
          </string-name>
          ,
          <article-title>Advantage actor-critic for autonomous intersection management</article-title>
          ,
          <source>Vehicles</source>
          <volume>4</volume>
          (
          <year>2022</year>
          )
          <fpage>1391</fpage>
          -
          <lpage>1412</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klimke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Völz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Buchholz</surname>
          </string-name>
          ,
          <article-title>Cooperative behavior planning for automated driving using graph neural networks</article-title>
          ,
          <source>in: 2022 IEEE Intelligent Vehicles Symposium (IV)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>167</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klimke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gerigk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Völz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Buchholz</surname>
          </string-name>
          ,
          <article-title>An enhanced graph representation for machine learning based automatic intersection management</article-title>
          ,
          <source>in: 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>523</fpage>
          -
          <lpage>530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Safe and adaptive decision algorithm of automated vehicle for unsignalized intersection driving</article-title>
          ,
          <source>Journal of the Brazilian Society of Mechanical Sciences and Engineering</source>
          <volume>45</volume>
          (
          <year>2023</year>
          )
          <fpage>537</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kolmanovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Tseng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Filev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Girard</surname>
          </string-name>
          ,
          <article-title>Potential game-based decision-making for autonomous driving</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>8014</fpage>
          -
          <lpage>8027</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.-P.</given-names>
            <surname>Antonio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Maria-Dolores</surname>
          </string-name>
          ,
          <article-title>Multi-agent deep reinforcement learning to manage connected autonomous vehicles at tomorrow's intersections</article-title>
          ,
          <source>IEEE Transactions on Vehicular Technology</source>
          <volume>71</volume>
          (
          <year>2022</year>
          )
          <fpage>7033</fpage>
          -
          <lpage>7043</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>3-D surround view for advanced driver assistance systems</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>19</volume>
          (
          <year>2017</year>
          )
          <fpage>320</fpage>
          -
          <lpage>328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <article-title>Multiagent learning in the presence of memory-bounded agents</article-title>
          ,
          <source>Autonomous Agents and Multi-Agent Systems</source>
          <volume>28</volume>
          (
          <year>2014</year>
          )
          <fpage>182</fpage>
          -
          <lpage>213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schaul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <article-title>Prioritized experience replay</article-title>
          ,
          <source>arXiv preprint arXiv:1511.05952</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Villella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fadakar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>SMARTS: An open-source scalable multi-agent RL training school for autonomous driving</article-title>
          ,
          <source>in: Conference on Robot Learning</source>
          , PMLR,
          <year>2021</year>
          , pp.
          <fpage>264</fpage>
          -
          <lpage>285</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: An introduction</article-title>
          , MIT Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bellman</surname>
          </string-name>
          ,
          <article-title>A Markovian decision process</article-title>
          ,
          <source>Journal of Mathematics and Mechanics</source>
          (
          <year>1957</year>
          )
          <fpage>679</fpage>
          -
          <lpage>684</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C. J. C. H.</given-names>
            <surname>Watkins</surname>
          </string-name>
          ,
          <article-title>Learning from delayed rewards</article-title>
          , Ph.D. thesis, King's College, Cambridge
          (
          <year>1989</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          , et al.,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>518</volume>
          (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Shapley</surname>
          </string-name>
          ,
          <article-title>Stochastic games</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>39</volume>
          (
          <year>1953</year>
          )
          <fpage>1095</fpage>
          -
          <lpage>1100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Albrecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Christianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <article-title>Multi-agent reinforcement learning: Foundations and modern approaches</article-title>
          , MIT Press, Cambridge, MA, USA (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>H.</given-names>
            <surname>Van Hasselt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning with double q-learning</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>30</volume>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schaul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hasselt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanctot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Freitas</surname>
          </string-name>
          ,
          <article-title>Dueling network architectures for deep reinforcement learning</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1995</fpage>
          -
          <lpage>2003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Behrisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bieker-Walz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Erdmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-P.</given-names>
            <surname>Flötteröd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hilbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lücken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rummel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wießner</surname>
          </string-name>
          ,
          <article-title>Microscopic traffic simulation using SUMO</article-title>
          ,
          <source>in: The 21st IEEE International Conference on Intelligent Transportation Systems</source>
          ,
          <year>2018</year>
          . URL: https://elib.dlr.de/124092/.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>M.</given-names>
            <surname>Towers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Terry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kwiatkowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. U.</given-names>
            <surname>Balis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>de Cola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Deleu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goulão</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kallinteris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>KG</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krimmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perez-Vicente</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pierré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulhof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Tai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T. J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. G.</given-names>
            <surname>Younis</surname>
          </string-name>
          ,
          <article-title>Gymnasium</article-title>
          ,
          <year>2023</year>
          . URL: https://zenodo.org/record/8127025. doi:10.5281/zenodo.8127026.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <article-title>Automatic differentiation in PyTorch</article-title>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>