<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Workshop Agents in Traffic and Transportation, October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Growing Neural Gas in Multi-Agent Reinforcement Learning Adaptive Traffic Signal Control</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mladen Miletić</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivana Dusparić</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edouard Ivanjko</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Trinity College Dublin, School of Computer Science and Statistics, College Green</institution>
          ,
          <addr-line>Dublin 2</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>University of Zagreb, Faculty of Transport and Traffic Sciences</institution>
          ,
          <addr-line>Vukelićeva Street 4, 10000, Zagreb</addr-line>
          ,
          <country country="HR">Croatia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>19</volume>
      <issue>2024</issue>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
<p>In recent years, there has been a significant increase in research interest in applying Reinforcement Learning (RL) to Adaptive Traffic Signal Control (ATSC). Urban traffic networks present a suitable environment for Multi-Agent (MA) ATSC systems, as each intersection can be managed by a single RL agent. However, the non-stationarity of the ATSC environment in Multi-Agent Reinforcement Learning (MARL) poses a challenge, since the actions of one agent can directly affect the performance of its neighboring agents. To address this issue, this paper presents and compares several MARL ATSC approaches utilizing Growing Neural Gas (GNG) for state identification, implemented using a microscopic traffic simulator with a synthetic traffic model of nine intersections. This paper explores the effectiveness of various MARL ATSC approaches, including fully independent agents and those augmented with reward- and state-sharing mechanisms. The results demonstrate that fully independent agents can enhance global traffic performance by optimizing local decisions. Furthermore, when agents share rewards and states, they achieve additional improvements in both local and global traffic conditions by fostering cooperative behavior and mitigating the impact of non-stationarity. In addition, this paper identifies the approach of centralized state identification with GNG, coupled with decentralized agent execution, as the most effective ATSC strategy. This configuration leverages the strengths of centralized data processing for accurate state representation while maintaining the flexibility and scalability of decentralized agent operation. Overall, the findings highlight the potential of GNG-based state identification in enhancing the performance of MARL ATSC systems.</p>
      </abstract>
      <kwd-group>
<kwd>Growing Neural Gas</kwd>
        <kwd>Reinforcement Learning</kwd>
<kwd>Adaptive Traffic Signal Control</kwd>
        <kwd>Multi-Agent Systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
The rising world population and increasing urbanization trends have a direct impact on traffic in
cities. It is estimated that further population growth will be mostly concentrated in cities, further
increasing mobility demands [
        <xref ref-type="bibr" rid="ref1">1</xref>
]. The recent COVID-19 pandemic caused a short-term shift in
mobility demands and in the modal split of daily commutes, with more people being able to work from
home and use online shopping services. However, initial analyses show that this was only temporary,
as people are slowly returning to pre-pandemic behaviour [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. Combined with the growing population,
the urban traffic system will be unable to handle the increase in mobility demand. Traditional solutions
such as building additional infrastructure have proven ineffective in the long term, as increased
capacity attracts additional demand in an effect called Braess’s paradox [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In addition, most older
cities are not able to build any additional infrastructure due to a lack of available building space. A
possible solution is seen in the introduction of Intelligent Transportation Systems (ITS) to improve the
operational capacity of the existing transportation infrastructure [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
In most urban areas, road traffic is primarily controlled by Traffic Signal Control (TSC) at intersections.
While the primary task of TSC is to allow safe passage of vehicles entering the intersection, it has a
large effect on the traffic flow, since opposing flows need to be temporarily stopped while waiting for
their signal phase. For this reason, intersections controlled by TSC are the primary bottlenecks in
urban traffic networks. This negative effect is most noticeable with improperly adjusted controllers,
leading to even more congestion. TSC can operate with one of three main control strategies [
        <xref ref-type="bibr" rid="ref5">5</xref>
]: (I)
Fixed Time Signal Control (FTSC); (II) Traffic Actuated Signal Control (TASC); and (III) Adaptive Traffic
Signal Control (ATSC).
      </p>
      <p>
The most commonly used TSC strategy is FTSC, which uses pre-determined signal programs adjusted
using historical traffic data. FTSC systems have low initial costs but are difficult to update and are not
able to respond to changes in demand. TASC systems are an extension of FTSC that allows the signal
program to change upon vehicle detection. TASC systems perform well in low-demand scenarios, but
in high-demand scenarios their operation is similar to FTSC. The most advanced type of TSC is
ATSC, which uses real-time traffic data to adapt signal programs to satisfy its operational objective.
ATSC systems can operate on a network level with multiple intersections in the control loop. Many
commercial ATSC systems are available, such as SCOOT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], SCATS [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], UTOPIA [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and ImFlow [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Such ATSC systems can improve the traffic flow significantly but still require manual adjustment and
fine-tuning to achieve good results. To overcome this problem, current state-of-the-art proposed ATSC
approaches are based on Reinforcement Learning (RL) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
RL is a subset of Machine Learning algorithms and techniques that focuses on learning by direct
interaction of the controller, usually called an agent, with its environment [
        <xref ref-type="bibr" rid="ref10">10</xref>
]. The agent learns its
control policy by interacting with the environment and receiving feedback on how successful its
actions were. By applying the concept of RL to ATSC, the traffic signal controller becomes an agent
and its environment is the traffic network in which it operates. This approach removes the need for
manual adjustment of ATSC parameters, as the agent can now learn the control policy on its own.
Another benefit of RL is that it can continuously learn even after deployment, meaning that if the traffic
behavior changes, the ATSC agent will be able to adapt its control policy. A common problem in RL
is the identification of the environment state in a high-dimensional state space, especially for tabular
RL algorithms such as Q-Learning. To overcome this problem, the dimensionality reduction technique
of Growing Neural Gas (GNG) can be used for state identification. This approach was successfully
implemented in a single-intersection traffic environment, but no GNG-based controllers have been
implemented on multiple intersections [
        <xref ref-type="bibr" rid="ref11">11</xref>
]. To allow scaling of ATSC to multiple intersections, the RL
control is expanded to Multi-Agent Reinforcement Learning (MARL) systems by allowing each agent
to control a single intersection. The use of MARL systems brings forth new challenges in ATSC, since
each agent operates independently but its selected actions influence neighboring agents [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For
this reason, MARL agents are required to learn how to optimize local actions and how their actions
influence neighboring agents.
      </p>
<p>This paper focuses on the evaluation of three different families of MARL ATSC approaches that use
GNG for state identification to reduce the state-action complexity and help with the problem of
non-stationarity. Following this introduction, the rest of the paper is organized as follows. Section 2 provides
insight into previous and related research on MARL-based ATSC systems, highlighting the contributions
of this paper. Section 3 explains the relevant background and details on the use of RL and GNG in traffic
environments. Section 4 expands the currently known MARL-based controllers by introducing GNG as
a state identifier in RL. In Section 5, the simulation environment and the tested scenarios are explained in
detail. Section 6 provides an overview and discussion of the obtained results, highlighting the most
important observations. The paper ends with a conclusion section which includes a commentary on
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
The field of RL application in ATSC systems has attracted many researchers, since traffic control problems
provide a good environment for RL integration [
        <xref ref-type="bibr" rid="ref13">13</xref>
]. Older works focus on modeling ATSC controllers
as a Markov Decision Process (MDP) by specifying the observable state space, the actions an agent can take,
and the reward for measuring the performance of the selected operational objective. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
], the authors
test how different reward definitions affect the performance of RL-ATSC. They tested rewards defined
by queue lengths, cumulative delay, and throughput. An edge was given to cumulative delay, with a
comment on the difficulties of its measurement in real-world scenarios. The authors also conclude that
the performance of the reward function depends on the total traffic volume. Early forms of
MARL-ATSC approaches are shown in [
        <xref ref-type="bibr" rid="ref12 ref15">15, 12</xref>
        ]. The former used independent agents, while the latter
attempted to establish cooperation between agents to achieve even better performance. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], authors
combine the idea of ATSC with the concept of Connected Vehicles (CVs) in a cooperative Multi-Agent
(MA) environment.
      </p>
      <p>
        It is evident from reviews [
        <xref ref-type="bibr" rid="ref13 ref17 ref5">5, 13, 17</xref>
        ] on RL-ATSC that a more modern approach is to solve the control
problem by using Deep Reinforcement Learning (DRL). Papers [
        <xref ref-type="bibr" rid="ref18">18, 19</xref>
] both show significant traffic
performance benefits from using DRL-ATSC. It is also shown that traffic networks with a high number
of intersections can be controlled with DRL-ATSC with noticeable performance benefits. The former
also introduces two approaches to improve MA convergence stability by improving agent observability
and introducing a discount factor that reduces the impact of states and rewards from other agents.
      </p>
      <p>
        Considering the review of related works above, the most common problems identified in RL-ATSC
applications are: (I) State-Action complexity; (II) Reward definition; and (III) Non-stationarity of the
environment when applied in a MARL configuration. The problem of high state-action complexity usually
arises because the state space is continuous. To apply RL, the state space representation needs to be
created beforehand, which can be problematic if there is no available data before learning. Ideally, the
state representation would be built during the learning process, as new, previously unknown states might
be encountered. For this reason, methods such as Growing Neural Gas (GNG) can be used to construct
the state space representation online. The growing characteristics of the GNG allow the network to adapt to
newly encountered spaces without degrading previous knowledge [20, 21]. Previous works from the
authors of this paper attempted to solve the problem of high state-action complexity in traffic control
by introducing Self Organizing Maps (SOM) [22] and GNG [
        <xref ref-type="bibr" rid="ref11">11, 23</xref>
] for state identification to offer an
alternative to DRL approaches while maintaining a good convergence rate. Both approaches were
successfully implemented on a single intersection, with both papers stating that future work should
include scaling to multiple-intersection networks. This paper addresses the research gap identified in
previous works by scaling the GNG-RL controller to multiple intersections in a MARL configuration.
Hence, the main contributions of this paper are as follows:
• Scaling of the GNG-based RL-ATSC controller to a GNG-based MARL-ATSC controller;
• Performance comparison of 15 GNG-based MARL-ATSC approaches.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Relevant background</title>
<p>In this section, the relevant background on RL and GNG for the purpose of traffic state identification is
presented.</p>
      <sec id="sec-3-1">
        <title>3.1. Reinforcement Learning</title>
<p>When applying RL algorithms for ATSC, in the literature, the ATSC controller is usually modeled as an
MDP tuple ⟨S, A, T, R, γ⟩. Here S is the set of environment states; A is the set of actions available to
the controller; T is the transition probability of reaching state s′ from state s after an action a ∈ A is
performed; R is the reward received from the environment after an action is performed; and γ ∈ [0, 1)
is the discount factor which determines the impact of future rewards [24]. Alternatively, ATSC can be
modeled as a Partially Observable Markov Decision Process (POMDP) with a tuple ⟨S, A, T, R, Ω, O, γ⟩,
where the MDP is extended with Ω, which is a set of observations, and O, which is the conditional
probability of those observations [25]. The POMDP model is more appropriate for ATSC, since the
actual traffic state is unknown to the controller and is instead represented by a vector of observations
or measurements from the environment. However, the MDP model is usually used instead of POMDP
to maintain simplicity in algorithm applications.</p>
        <p>
Once the ATSC controller is defined as an MDP, the learning task is to find the optimal control policy
to maximize the cumulative discounted reward. A common algorithm used to achieve this is Q-Learning,
based on Bellman’s optimality equation [
          <xref ref-type="bibr" rid="ref10">10</xref>
]. The algorithm reaches the optimal control policy using
the Q-value update rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α (R_{t+1} + γ max_{a′ ∈ A} Q(s_{t+1}, a′) − Q(s_t, a_t)),    (1)

where Q(s_t, a_t) is the expected return of taking action a_t ∈ A in state s_t ∈ S in time step t, and α is the
learning rate coefficient. After multiple iterations of each state-action combination, an optimal policy is
formed by selecting actions with the maximum Q-value for a given state. The main benefit of using the
Q-Learning algorithm for ATSC is that it is a model-free approach that does not use the state transition
probabilities T, which are difficult to model in traffic control.
        </p>
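        <p>To make the update rule concrete, a minimal tabular Q-Learning sketch in Python is given below. The class name and the dictionary-based Q-table are illustrative assumptions for this sketch, not the exact controller implementation used in the paper.</p>
        <preformat>
from collections import defaultdict

class QLearningAgent:
    """Minimal tabular Q-Learning agent implementing update rule (1)."""

    def __init__(self, actions, alpha=0.1, gamma=0.8):
        self.actions = actions       # available actions A
        self.alpha = alpha           # learning rate coefficient
        self.gamma = gamma           # discount factor
        self.q = defaultdict(float)  # Q-table: (state, action) -> expected return

    def update(self, s, a, reward, s_next):
        # Add the alpha-weighted temporal-difference error to Q(s, a),
        # exactly as in Equation (1).
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(s, a)]
        self.q[(s, a)] += self.alpha * td_error

    def greedy_action(self, s):
        # Exploitation: the action with the maximum Q-value in state s.
        return max(self.actions, key=lambda a: self.q[(s, a)])
        </preformat>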
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Growing Neural Gas</title>
<p>The GNG [26] is a dimensionality reduction technique created as an extension of SOM, which can
be considered a special type of neural network that uses competitive learning. The SOM or GNG
network consists of a set of connected neurons, sometimes called nodes. Each neuron is defined by a
weight vector, which represents the neuron’s position in the input space. During training, an
input signal is received, and the neuron closest to the input becomes known as the Best Matching
Unit (BMU). The BMU and the neurons connected to it then adjust their weight vectors to more closely
match the received input using the equation:

w_n(t + 1) = w_n(t) + Θ(n, BMU, t) · ε · [x(t) − w_n(t)],    (2)

where x(t) is the input signal vector in time step t; w_n(t) is the weight vector of neuron n;
Θ(n, BMU, t) is the neighborhood function, which scales the weight update using the distance between
neuron n and the BMU; and ε is the learning rate of the network. The neighborhood function Θ is
usually modeled with a Gaussian or a Ricker Wavelet function [27]. After training, the network becomes
a map of connected neurons with low dimensionality that preserves the topology of the original input
data samples.</p>
<p>Unlike SOM, which has a fixed arrangement of neurons, the neurons and their connections in GNG
can be added or removed during the learning process [28]. The primary benefit of the growing structure
of GNG is that there is no need to specify the number of neurons in advance, although a limit on the total
number can be imposed. A new neuron will be added to the GNG if the input signal has a large Euclidean
distance from the current BMU. If the distance of the input signal is not as large but still significant, a
new neuron will be added with an edge forming to the current BMU. If the input signal is close to the
current BMU, the existing BMU will be used instead of adding new neurons, and an edge will be formed
between the BMU and the neuron which is second closest to the input signal. By specifying the neuron
addition and connection distances as hyperparameters of GNG, it is possible to set the desired level of
detail the created topological map will be able to accommodate. As the network grows, some neurons
might become redundant or inert and will then be removed from the network to keep it computationally
efficient.</p>
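        <p>A compact sketch of this growth logic is given below, assuming NumPy and using the two distance thresholds as hypothetical constructor parameters; a full GNG additionally tracks accumulated error and edge ages, which are omitted here for brevity.</p>
        <preformat>
import numpy as np

class SimpleGNG:
    """Sketch of GNG neuron insertion driven by distance thresholds."""

    def __init__(self, dim, add_dist=20.0, connect_dist=10.0, max_neurons=150):
        self.weights = [np.zeros(dim)]    # neuron weight vectors
        self.edges = set()                # undirected edges between neurons
        self.add_dist = add_dist          # distance above which a neuron is added
        self.connect_dist = connect_dist  # distance above which a connected neuron is added
        self.max_neurons = max_neurons

    def present(self, x, lr=0.05):
        dists = [np.linalg.norm(x - w) for w in self.weights]
        bmu = int(np.argmin(dists))
        can_grow = self.max_neurons > len(self.weights)
        if can_grow and dists[bmu] > self.add_dist:
            self.weights.append(x.copy())  # far input: add a new neuron at the input
        elif can_grow and dists[bmu] > self.connect_dist:
            self.weights.append(x.copy())  # closer input: add a neuron linked to the BMU
            self.edges.add((bmu, len(self.weights) - 1))
        else:
            # Close input: move the BMU toward the input and connect the BMU
            # to the second-closest neuron.
            self.weights[bmu] += lr * (x - self.weights[bmu])
            if len(dists) > 1:
                second = int(np.argsort(dists)[1])
                self.edges.add(tuple(sorted((bmu, second))))
        return bmu
        </preformat>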
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Trafic state identification</title>
<p>As discussed above, the observation space Ω for ATSC is defined by the available traffic observation variables.
As the number of variables increases, the ATSC agent can have better insight into the actual environment
state s ∈ S. However, the downside of adding more variables is the curse of dimensionality, which
can hinder the learning performance. In addition, most traffic observation variables are continuous,
creating an n-dimensional continuous observation space, where n is the number of variables. To
successfully apply tabular RL techniques, such as Q-Learning, in continuous space, it is necessary
to perform generalization from states that were previously experienced to the ones that have not
been [
          <xref ref-type="bibr" rid="ref10">10</xref>
]. A simple approach would be to split the state space into a number of finite and discrete
segments, and then perform generalization on those segments. However, it is difficult to efficiently define
the state segments, especially if the distribution of states is unknown beforehand. To overcome this
problem, the generalization can be done using GNG [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
<p>To use GNG for state space generalization, the length of the weight vector in each neuron must be
the same as the number of traffic observation variables. Each visited state is sent to the GNG, which
maps the received state to the BMU of the trained GNG. This essentially means that the weights of
the neurons in GNG become origin points from which Voronoi polytopes are constructed. Each constructed
polytope is a subspace of Ω that generalizes to one discrete state. The number of generalized states
is equal to the number of neurons in the network. The entire process of using GNG for traffic state
identification is shown in Figure 1. The growing properties of GNG allow for dynamic scaling of the
number of identified states as the neuron map is constructed. A major benefit is that the GNG state
map can be constructed simultaneously as the RL algorithm updates the control policy. Initially, the
GNG will consist of a small number of neurons and a highly generalized state space, but this generalization
will be refined as new neurons are added in later stages of learning.</p>
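        <p>State identification then reduces to a BMU lookup over the GNG neurons, as in the sketch below; it reuses the hypothetical SimpleGNG class from the sketch in Section 3.2 and only illustrates the mapping, not the authors’ exact code.</p>
        <preformat>
import numpy as np

def identify_state(gng, observation):
    """Map a continuous traffic observation vector to a discrete state ID.

    Each GNG neuron is the origin of a Voronoi polytope; the index of the
    closest neuron (the BMU) serves as the discrete state for Q-Learning,
    so the number of states equals the number of neurons.
    """
    dists = [np.linalg.norm(observation - w) for w in gng.weights]
    return int(np.argmin(dists))

# Example: queue lengths on four intersection approaches mapped to a state ID.
# state_id = identify_state(gng, np.array([12.0, 3.5, 0.0, 7.2]))
        </preformat>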
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Multi-Agent Reinforcement Learning Adaptive Trafic Signal</title>
    </sec>
    <sec id="sec-5">
      <title>Control</title>
      <p>When extending ATSC to multiple intersections, the control can be configured in a local or global
manner. Localized control refers to each agent making decisions based solely on data gathered locally
from the intersection it controls. Both the state space and the reward from the environment are received
locally, allowing the agent to adjust its control policy in a way that maximizes local traffic
performance. This approach can also be considered a non-cooperative game in game theory, with
each agent acting as a player. Localized control allows for good scalability and is simple to implement.
It is also known to perform well in traffic networks with low demand, or in networks consisting of
intersections located very far from one another. However, a significant problem of localized control
is that it can lead to sub-optimal network-wide performance, due to the actions of an agent on one
intersection impacting the state and reward of its neighbors without any regard for the broader traffic
situation. This problem can be partially overcome by introducing cooperation between agents. The
simplest way to achieve cooperation is to modify the reward function to also include global performance
metrics. However, convergence in such an approach would be problematic, as the agent would not
be able to properly predict the received reward by only observing the local state. A further cooperative
augmentation is the sharing of state variables between neighboring agents. This gives the agent a
small level of insight into the broader traffic situation at the expense of slightly increased state complexity,
while remaining scalable to a high number of intersections.</p>
      <p>The global control concept uses a single agent or a coordinated group of agents to perform TSC using data
gathered from the entire traffic network. This approach allows the agents to build their control policy
for the maximization of global traffic performance, even when local intersection performance would be
lower. Global control can in theory achieve optimal performance, but it suffers from poor scalability, as
the state-action complexity increases exponentially as additional intersections are added to the control
loop. In addition, it is computationally expensive and would require a robust communication system to
be implemented.</p>
      <p>While global control would be ideal, it is usually not feasible. For this reason, the proposal of this
paper is to balance between global and local control by using individual agents that receive local rewards
but have insight into the state of the entire network. This allows the agents to build control policies
that are finely tuned to the entire network’s state but still use local reward optimization. This approach
can further be augmented by modifying the reward function of each agent to include a performance
measure of neighboring agents, giving them the incentive to select actions that will not have a negative
impact on the neighboring intersections’ performance. In this type of control, the state space remains
large and difficult to separate into discrete segments. To overcome this problem, the GNG presented in
Section 3 can be used to reduce the dimensionality of the problem and identify the global state of the
network, keeping the state-action complexity low even when the number of intersections in the traffic
network increases. An additional benefit of using GNG as a state identifier is that it will be able to
operate even if some sensors go offline, by using the value of zero for state variables with offline sensors.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Experimental Setup</title>
      <p>In this section, all analyzed TSC approaches are presented with their hyperparameters, together with an
explanation of the simulation environment used to implement each TSC approach.</p>
      <sec id="sec-6-1">
        <title>5.1. Simulation environment</title>
        <p>To evaluate the performance of the various TSC approaches, a simulation environment consisting of a
synthetic traffic model in the microscopic traffic simulator PTV VISSIM is used. PTV VISSIM is a
state-of-the-art microscopic traffic simulator capable of multi-modal traffic simulation [29]. A major benefit of
PTV VISSIM is its COM interface, which allows for external control of objects in VISSIM. This
interface can be used to develop external traffic signal controllers, while VISSIM handles the simulation
of vehicle behavior. The traffic model used for the analysis consists of 9 connected intersections arranged
in a 3 by 3 grid, as shown in Figure 2. There is a total of 12 vehicle inputs, labeled 1 to 12, with varying
traffic demand lasting for a total of 16 simulation hours. Traffic demands are synthetically generated to
be variable; the changes in traffic demand are shown in Figure 3. Each intersection is operated by
a TSC in an FTSC regime with a cycle length of 60 seconds and an equal distribution of green times
for both phases. This cycle length was selected to allow adequate traffic flow through the intersection
according to the defined traffic demand. This FTSC regime is used as a baseline TSC for comparison
with the GNG-based ATSC approaches.</p>
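        <p>For illustration, a minimal, hedged sketch of external control through the VISSIM COM interface from Python is given below. The network path, object keys, and result-attribute strings are assumptions and should be verified against the COM documentation of the VISSIM version in use.</p>
        <preformat>
import win32com.client

# Attach to VISSIM via COM and load the network (path is illustrative).
vissim = win32com.client.Dispatch("Vissim.Vissim")
vissim.LoadNet(r"C:\models\grid3x3.inpx")

def read_queue(counter_key):
    # Average queue length from a queue counter; the exact attribute
    # string depends on the VISSIM version and evaluation configuration.
    qc = vissim.Net.QueueCounters.ItemByKey(counter_key)
    return qc.AttValue("QLen(Current, Last)")

def set_signal_state(sc_key, sg_key, state):
    # Force a signal group state (e.g. "GREEN" or "RED") from the
    # external controller while VISSIM simulates vehicle behavior.
    sc = vissim.Net.SignalControllers.ItemByKey(sc_key)
    sc.SGs.ItemByKey(sg_key).SetAttValue("SigState", state)

# Advance the simulation step by step while the external agents decide.
for _ in range(300):
    vissim.Simulation.RunSingleStep()
        </preformat>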
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Analyzed approaches and hyperparameters</title>
        <p>A total of three families of ATSC approaches are presented for analysis and comparison to the FTSC
baseline: Independent agents; State Sharing (SS); and Centralized State with
Decentralized Agents (CSDA). In addition, each approach is augmented with four levels of varying
Reward Sharing (RS).</p>
        <sec id="sec-6-2-1">
          <title>5.2.1. Independent agents</title>
          <p>The approach with Independent agents uses an RL agent at each intersection in the network. Each
agent is only capable of observing its immediate surroundings, or more precisely the queue lengths on
each intersection approach. The average values of the queue lengths are calculated for the previous time
step by the VISSIM simulator. Those queue values are then given as input to the GNG of the agent.
Each agent uses its own GNG, with the learning rate decayed over time as

ε_GNG(t) = 0.95 · 0.9^(t−1) + 0.05,    (3)

and the neighborhood function modeled as a Gaussian,

Θ(d) = exp(−d²/2),    (4)

where d is the distance between a neuron and the BMU. A new neuron will be added if the Euclidean
distance of the input signal from the current BMU is higher than 20; if the distance is higher than 10, a
neuron will be added and connected to the previously identified BMU. The maximum number of
neurons in the network is limited to 150.</p>
          <p>With GNG used as a state identifier, the RL component of the agent will use Q-Learning with
hyperparameters α = 0.1 and γ = 0.8. The action selection will use the ε-greedy policy, with ε
determined by

ε(t) = 0.95 · 0.9^(t−1) + 0.05,    (5)

to balance between exploration and exploitation of knowledge. The reward function will be defined as
the reduction in the lost time L of vehicles passing through the intersection,

R(t + 1) = L(t) − L(t + 1),    (6)

where lost time is a measure of how much time a vehicle lost because its speed was lower than desired
due to its interaction with the environment. Each agent can choose from a set of five actions
a ∈ {−10, −5, 0, 5, 10}, each representing a time change of the green split in the default signal program.
The agent will take an action every Δt = 300 simulation seconds. This value was selected according to
previous research as it allows enough time for the new signal program to have an impact. An additional
benefit is that the Δt value is a multiple of the cycle length, allowing for easier analysis.</p>
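          <p>A sketch of one agent’s decision step under this configuration follows, reusing the QLearningAgent sketch from Section 3.1; treating t as the episode index in the decay schedule is an assumption.</p>
          <preformat>
import random

GREEN_SPLIT_ACTIONS = [-10, -5, 0, 5, 10]  # green-split changes in seconds

def epsilon(t):
    # Exploration rate decayed according to Equation (5).
    return 0.95 * 0.9 ** (t - 1) + 0.05

def select_action(agent, state, t):
    """Epsilon-greedy selection over the five green-split actions."""
    if epsilon(t) > random.random():
        return random.choice(GREEN_SPLIT_ACTIONS)  # explore
    return agent.greedy_action(state)              # exploit learned Q-values

def local_reward(lost_time_prev, lost_time_now):
    # Reward (6): reduction in vehicle lost time at the intersection.
    return lost_time_prev - lost_time_now
          </preformat>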
        </sec>
        <sec id="sec-6-2-2">
          <title>5.2.2. Reward sharing</title>
          <p>Since the actions of one agent can affect the performance of neighboring agents, the first level of
cooperation is introduced through RS with a reward cooperation coefficient, denoted c here. The main
principle is that each agent’s reward is expanded by a partial reward from its neighbors, as shown in
the equation:

R(t + 1) = L(t) − L(t + 1) + c · (1/N) · Σ_{j=1}^{N} (L_j(t) − L_j(t + 1)),    (7)

where N is the number of neighboring intersections, and L_j is the lost time of vehicles at neighboring
intersection j. Four different values of c ∈ {0.25, 0.5, 0.75, 1.00} will be analyzed in combination
with each TSC approach.</p>
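          <p>A small sketch of the shared reward in Equation (7) is given below; the function name and argument layout are illustrative.</p>
          <preformat>
def shared_reward(lost_prev, lost_now, neigh_prev, neigh_now, c):
    """Reward (7): local lost-time reduction plus the cooperation-scaled
    mean lost-time reduction of the N neighboring intersections."""
    local = lost_prev - lost_now
    n = len(neigh_prev)
    neighbor_term = sum(p - q for p, q in zip(neigh_prev, neigh_now)) / n
    return local + c * neighbor_term

# Example: c = 0.25 corresponds to the RS0.25 configuration.
# r = shared_reward(120.0, 95.0, [80.0, 60.0], [70.0, 65.0], c=0.25)
          </preformat>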
        </sec>
        <sec id="sec-6-2-3">
          <title>5.2.3. State sharing</title>
          <p>The GNG of the independent agents uses only the queue lengths at their local intersections to determine the
current state. In some traffic situations, an agent could be unaware of an increase in upcoming traffic
demand from neighboring intersections. This makes it difficult for the agent to predict the future
state and to select actions that would help it prepare for the upcoming traffic in advance. By
incorporating simple SS, each queue length value on the intersection has added to it the queue
length from the upstream intersection, if it exists. The complexity of the GNG in this approach does not
change, but the state identification includes some data from neighboring intersections. To implement
this approach in realistic scenarios, it would be imperative to know the desired direction of all vehicles
in the queue to properly scale the number of vehicles that will move downstream.</p>
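          <p>A hedged sketch of this augmentation follows; the upstream mapping and variable names are illustrative assumptions.</p>
          <preformat>
def augment_with_upstream(local_queues, upstream_queues):
    """Simple state sharing: add the upstream intersection's queue length
    to each local approach queue where an upstream neighbor exists.

    local_queues    -- queue length per approach at this intersection
    upstream_queues -- queue length of the upstream approach feeding each
                       local approach, or None where there is no neighbor
    """
    return [
        q + (up if up is not None else 0.0)
        for q, up in zip(local_queues, upstream_queues)
    ]

# Example: a border intersection with upstream neighbors on two approaches.
# state_vec = augment_with_upstream([12.0, 3.0, 8.0, 0.0], [5.0, None, 2.0, None])
          </preformat>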
        </sec>
        <sec id="sec-6-2-4">
          <title>5.2.4. Centralized state with decentralized agents</title>
          <p>The primary reason for the use of MA systems in TSC for large networks is the problem of scalability,
since the state-action space required for a single agent would grow exponentially with the increasing
number of state-defining variables. The GNG enables a high level of dimensionality reduction and is
capable of identifying the state of the entire network. The state-action complexity still remains high,
but it can be reduced by the use of decentralized agents at each intersection. In this CSDA approach, each
agent receives the same state ID from the centralized GNG but builds its control policy by receiving
local rewards. The GNG in this approach has 36 total inputs, but all its hyperparameters and functions
are kept the same as in the Independent agents approach.</p>
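          <p>The sketch below illustrates the CSDA wiring under the assumptions above: one shared GNG identifies the network-wide state from all 36 queue inputs, while each of the nine agents keeps its own Q-table and local reward. It reuses the SimpleGNG, QLearningAgent, and select_action sketches from earlier sections; the class name is hypothetical.</p>
          <preformat>
import numpy as np

class CSDAController:
    """Centralized state identification with decentralized agents (sketch)."""

    def __init__(self, gng, agents):
        self.gng = gng        # one GNG over all 36 queue-length inputs
        self.agents = agents  # dict: intersection id -> QLearningAgent

    def step(self, network_queues, t):
        # Every agent shares the same discrete state ID from the central GNG.
        state = self.gng.present(np.asarray(network_queues, dtype=float))
        actions = {i: select_action(agent, state, t)
                   for i, agent in self.agents.items()}
        return state, actions

    def learn(self, state, actions, local_rewards, next_state):
        # Each agent updates its own Q-table using only its local reward.
        for i, agent in self.agents.items():
            agent.update(state, actions[i], local_rewards[i], next_state)
          </preformat>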
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Results and Discussion</title>
      <p>In this section, the obtained results for each TSC approach are presented and discussed.</p>
      <sec id="sec-7-1">
        <title>6.1. Obtained results</title>
        <p>The following network-level Measures of Effectiveness (MoEs) were used to evaluate the performance
of each TSC approach: the Total Travel Time (TTT) of all vehicles in the network; the Total
Number of Stops (TNS) of all vehicles; and the average lost time per vehicle. In addition, an analysis
of the total intersection-level lost time was performed to evaluate the local effects of each proposed
approach, since the reward definition in each analyzed TSC approach was modeled to reduce the lost
time. For each MoE, the mean value over the last 50 episodes was calculated and used for the comparison.
The obtained results for TTT are shown in Figure 4 as a moving average with a window of 5 to
allow for easier inspection. The calculated mean results of the intersection-level lost time for each
intersection and approach are shown in Figure 5. Detailed numerical results for the network-level MoEs
are shown in Table 1, with the calculated improvement compared to the baseline scenario. In addition,
the standard deviation σ over the last 50 episodes is calculated to evaluate the convergence stability.
The best-performing approach is bolded in the table for easier identification.</p>
        <sec id="sec-7-1-1">
          <title>TSC approach</title>
          <p>Baseline
Independent
RS0.25
RS0.50
RS0.75
RS1.00</p>
          <p>SS
SS + RS0.25
SS + RS0.50
SS + RS0.75
SS + RS1.00</p>
          <p>CSDA
CSDA + RS0.25
CSDA + RS0.50
CSDA + RS0.75
Baseline
Independent
SS
CSDA</p>
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>6.2. Discussion</title>
        <p>From the results presented in Table 1, it is evident that each analyzed TSC approach is capable of
improving the traffic network performance when compared to the baseline controller which uses FTSC.
This behavior is expected, as any form of ATSC will be able to somewhat accommodate changing
traffic demand. The Independent agents approach reduced TTT by 9.56% on average by significantly
improving the performance of the central east-west corridor with intersections 4, 5, and 6, as can be
seen from the intersection-level lost time in Figure 5. The performance of intersections 1 and 7 decreased
somewhat because of the interaction with the agent on intersection 4. This behavior is common in MA
systems and is referred to as the problem of non-stationarity. The agents on intersections 1 and 7 had
no insight into the state of the central corridor and could not prepare actions to accommodate the
sudden increase in traffic demand from the right- and left-turn movements on intersection 4 resulting
from its increased throughput.</p>
        <p>By expanding the Independent agents’ control with varying levels of reward sharing from neighboring
intersections, a slight improvement in performance can be observed when the reward cooperation
coefficient was set to 0.25. For higher values of the cooperation coefficient, the performance slightly
decreased. The use of reward sharing in TSC gives an incentive to the agents not to choose actions
that will have a negative effect on neighboring intersections. Agents will instead tolerate lower local
rewards if the neighboring agents perform well. In the RS0.25 scenario, the performance of intersection
6 was significantly increased, but the performance of the neighboring intersections 3, 5, and 9 decreased,
since they chose actions that would help the agent on intersection 6. With higher values of the reward
cooperation coefficient, it becomes difficult for the agents to associate the obtained reward with the
current state-action pair, since the state of the neighboring agents is not known. The trend of decreasing
performance with higher values of the reward cooperation coefficient is observable, but since the
standard deviation of the results is high, it is difficult to evaluate the impact of the reward cooperation
coefficient properly.</p>
        <p>By using the simple state-sharing augmentation, the performance slightly increases. The agents can
now prepare to meet the increasing traffic demand from neighboring intersections. When this approach
is combined with reward sharing, a pattern of decreasing performance with a rising cooperation
coefficient is observed, similar to the scenarios without state sharing. Again, the best-performing approach
was the one with a cooperation coefficient of 0.25. This increase in performance can be attributed to a better
mapping between state-action pairs and rewards, since a part of the neighboring agents’ states is now shared.</p>
        <p>The final group of tested approaches, with a centralized state and decentralized agents,
outperformed all previous approaches by a great margin. With each agent having insight into the entire
network, it was easier for each agent to select actions that would provide the best local benefit. The
problem of non-stationarity is still present, as the agents could not predict the actions of other agents
and their effect on the network. By introducing reward sharing, the agents could now select actions that
would benefit neighboring intersections if required. For the CSDA approaches, the best-performing one was
the approach with a cooperation coefficient of 0.75. This result shows that cooperation is easier when
agents have more detailed information about the state of their neighbors. The GNG used for the CSDA
approaches handled the increased dimensionality of the input space without any loss of performance,
but the scalability of GNG to larger networks remains an open question, since it is expected that there
are diminishing returns from including state variables from intersections that are further apart.</p>
        <p>Inspection of the results in Figure 4 shows how convergence is impacted by the chosen TSC approach.
The Independent and state-sharing approaches seem to converge at around episodes 100 to 150, while
the convergence of the CSDA approaches seems slower and shows a tendency to continue converging
even after the tested 300 episodes. Since all approaches use the same hyperparameters, this decrease in
convergence speed can only be explained by slow adaptation of the GNG neuron weights in later stages
of learning. The total volume of the state space for the CSDA approaches is much higher than the state
space for the other approaches. Thus, the limit of 150 neurons in the GNG might hinder performance,
since there would be more situations that require the addition of a neuron. This is confirmed by the
large number of isolated neurons in the GNG of CSDA. However, by increasing the number of neurons,
the total number of states would also increase, which can also negatively impact the convergence.</p>
        <p>The results for the average lost time seem to correlate with the results for TTT, which is expected,
since vehicles spending less time in the traffic network are moving closer to their desired speed. The
results for TNS also show a correlation with TTT. The environmental impacts were not directly analyzed,
as they are highly dependent on the emission model of the vehicles, but considering the reduction in
the total number of stops and the travel time, a noticeable reduction in vehicle emissions can be
expected.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>7. Conclusion</title>
      <p>This paper analyzed the performance of three different MARL ATSC approaches based on using
the GNG for state identification. The performance was compared to the baseline FTSC approach to
evaluate the full impact of each approach. Each approach improved the total performance of the traffic
network, but the approach that uses a centralized GNG for state identification of the entire network with
decentralized agents performed the best. This result was further improved by expanding the reward
function with reward sharing from neighboring agents. The approach that used only simple sharing of
state variables had only a minor positive impact on the total performance. A future direction for the
study is to test additional state-sharing options, such as including the state variables from downstream
intersections, which could prove beneficial when used with reward sharing.</p>
      <p>While the CSDA approach performs the best, the problem of further scaling to more intersections
remains an open question, since adding more intersections would negatively impact the learning
convergence of the agents. A possible research direction to overcome this convergence problem could
be the introduction of knowledge sharing between neurons that share connections or that have a small
distance between each other when only a local subspace is considered.</p>
      <p>Future work should also include the implementation in realistic traffic environments with
heterogeneous intersections. In such realistic scenarios, it would also be possible to perform an
environmental analysis by including detailed vehicle information for the agents and modifying the
reward function to minimize vehicle emissions. The addition of an offset-controlling agent could also
help to improve traffic performance by enabling green wave control on corridors in the network that
have a dominant traffic flow in one direction.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>The author of this paper, Mladen Miletić, received a mobility grant from the Croatian Science Foundation
under the project MOBDOK-2023-4074 that enabled the research presented in this paper. This work has
been partly supported by the Croatian Science Foundation under the project IP-2020-02-5042 (DLASIUT),
and by the Science Foundation Ireland under SFI Frontiers for the Future Grant No. 21/FFP-A/8957
(Clearway project). The authors would like to thank the company PTV for supporting the research in this
paper. This research has also been carried out within the activities of the Centre of Research Excellence
for Data Science and Cooperative Systems supported by the Ministry of Science and Education of the
Republic of Croatia.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>[1] A. A. Ceder, Urban mobility and public transport: future perspectives and review, International Journal of Urban Sciences 25 (2021) 455–479. doi:10.1080/12265934.2020.1799846.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] H. Brůhová Foltýnová, J. Brůha, Expected long-term impacts of the COVID-19 pandemic on travel behaviour and online activities: Evidence from a Czech panel survey, Travel Behaviour and Society 34 (2024) 100685. doi:10.1016/j.tbs.2023.100685.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] D. Braess, Über ein Paradoxon aus der Verkehrsplanung, Unternehmensforschung 12 (1968) 258–268. doi:10.1007/BF01918335. [In German]</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] Z. A. Cheng, M.-S. Pang, P. A. Pavlou, Mitigating traffic congestion: The role of intelligent transportation systems, Information Systems Research 31 (2020) 653–674. doi:10.1287/isre.2019.0894.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] M. Miletić, E. Ivanjko, M. Gregurić, K. Kušić, A review of reinforcement learning applications in adaptive traffic signal control, IET Intelligent Transport Systems 16 (2022) 1269–1285. doi:10.1049/itr2.12208.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] R. Bretherton, SCOOT Urban Traffic Control System: Philosophy and Evaluation, IFAC Proceedings Volumes 23 (1990) 237–239. doi:10.1016/S1474-6670(17)52676-2. IFAC/IFIP/IFORS Symposium on Control, Computers, Communications in Transportation, Paris, France, 19–21 September.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] P. Lowrie, The Sydney co-ordinated adaptive traffic system: Principles, methodology, algorithms, IEE Conference Publication 207 (1982) 67–70.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] D. Pavleski, D. Koltovska-Nechoska, E. Ivanjko, Evaluation of adaptive traffic control system UTOPIA using microscopic simulation, in: 2017 International Symposium ELMAR, 2017, pp. 17–20. doi:10.23919/ELMAR.2017.8124425.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] J. Wahlstedt, Evaluation of the two self-optimising traffic signal systems Utopia/Spot and ImFlow, and comparison with existing signal control in Stockholm, Sweden, in: 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), 2013, pp. 1541–1546.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[11] M. Miletić, E. Ivanjko, S. Mandžuka, D. K. Nečoska, Combining neural gas and reinforcement learning for adaptive traffic signal control, in: 2021 International Symposium ELMAR, 2021, pp. 179–182. doi:10.1109/ELMAR52657.2021.9550948.</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[12] S. El-Tantawy, B. Abdulhai, Multi-Agent Reinforcement Learning for Integrated Network of Adaptive Traffic Signal Controllers (MARLIN-ATSC), in: 2012 15th International IEEE Conference on Intelligent Transportation Systems, 2012, pp. 319–326. doi:10.1109/ITSC.2012.6338707.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] M. Noaeen, A. Naik, L. Goodman, J. Crebo, T. Abrar, Z. S. H. Abad, A. L. Bazzan, B. Far, Reinforcement learning in urban network traffic signal control: A systematic literature review, Expert Systems with Applications 199 (2022) 116830. doi:10.1016/j.eswa.2022.116830.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] S. Touhbi, M. A. Babram, T. Nguyen-Huu, N. Marilleau, M. L. Hbid, C. Cambier, S. Stinckwich, Adaptive Traffic Signal Control: Exploring Reward Definition for Reinforcement Learning, Procedia Computer Science 109 (2017) 513–520. doi:10.1016/j.procs.2017.05.327.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] I. Arel, C. Liu, T. Urbanik, A. Kohls, Reinforcement learning-based multi-agent system for network traffic signal control, IET Intelligent Transport Systems 4 (2010) 128–135. doi:10.1049/iet-its.2009.0070.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] W. Liu, G. Qin, Y. He, F. Jiang, Distributed Cooperative Reinforcement Learning-Based Traffic Signal Control That Integrates V2X Networks' Dynamic Clustering, IEEE Transactions on Vehicular Technology 66 (2017) 8667–8681. doi:10.1109/TVT.2017.2702388.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] M. Gregurić, M. Vujić, C. Alexopoulos, M. Miletić, Application of deep reinforcement learning in traffic signal control: An overview and impact of open traffic data, Applied Sciences 10 (2020). doi:10.3390/app10114011.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] T. Chu, J. Wang, L. Codecà, Z. Li, Multi-agent deep reinforcement learning for large-scale traffic signal control, IEEE Transactions on Intelligent Transportation Systems 21 (2020) 1086–1095. doi:10.1109/TITS.2019.2901791.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] C. Chen, H. Wei, N. Xu, G. Zheng, M. Yang, Y. Xiong, K. Xu, Z. Li, Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control, Proceedings of the AAAI Conference on Artificial Intelligence 34 (2020) 3414–3421. doi:10.1609/aaai.v34i04.5744.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] M. Guériau, F. Armetta, S. Hassas, R. Billot, N.-E. El Faouzi, A constructivist approach for a self-adaptive decision-making system: Application to road traffic control, in: 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), 2016, pp. 670–677. doi:10.1109/ICTAI.2016.0107.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] M. Guériau, N. Cardozo, I. Dusparic, Constructivist approach to state space adaptation in reinforcement learning, in: 2019 IEEE 13th International Conference on Self-Adaptive and Self-Organizing Systems (SASO), 2019, pp. 52–61. doi:10.1109/SASO.2019.00016.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] M. Miletić, K. Kušić, M. Gregurić, E. Ivanjko, State complexity reduction in reinforcement learning based adaptive traffic signal control, in: 2020 International Symposium ELMAR, 2020, pp. 61–66. doi:10.1109/ELMAR49956.2020.9219024.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] M. Miletić, D. Čakija, F. Vrbanić, E. Ivanjko, Impact of connected vehicles on learning based adaptive traffic control systems, in: 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2022, pp. 3311–3316. doi:10.1109/SMC53654.2022.9945071.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] M. van Otterlo, M. Wiering, Reinforcement Learning and Markov Decision Processes, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 3–42. doi:10.1007/978-3-642-27645-3_1.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] M. Igl, L. Zintgraf, T. A. Le, F. Wood, S. Whiteson, Deep variational reinforcement learning for POMDPs, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 2117–2126.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] B. Fritzke, A growing neural gas network learns topologies, in: G. Tesauro, D. Touretzky, T. Leen (Eds.), Advances in Neural Information Processing Systems, volume 7, MIT Press, 1994. URL: https://proceedings.neurips.cc/paper_files/paper/1994/file/d56b9fc4b0f1be8871f5e1c40c0067e7-Paper.pdf.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] H. Hikawa, Y. Maeda, Improved learning performance of hardware self-organizing map using a novel neighborhood function, IEEE Transactions on Neural Networks and Learning Systems 26 (2015) 2861–2873. doi:10.1109/TNNLS.2015.2398932.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] Y. Prudent, A. Ennaji, An incremental growing neural gas learns topologies, in: Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, 2005, pp. 1211–1216. doi:10.1109/IJCNN.2005.1556026.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] E. Joelianto, H. Sutarto, Simulation of Traffic Control Using Vissim-COM Interface, Internetworking Indonesia Journal 11 (2019) 55.</mixed-citation></ref>
    </ref-list>
  </back>
</article>