<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Deep convolutional Q-learning for traffic lights optimization in Smart Cities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riccardo Cappi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastiano Monti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Tosi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli studi dell'Insubria</institution>
          ,
<addr-line>Varese</addr-line>
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padova</institution>
          ,
<addr-line>Padova</addr-line>
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Autonomous traffic control is an important and active field of research that could potentially lead to remarkable improvements in congestion management, with consequent reductions in delays and air pollution. In this paper, we propose a deep reinforcement learning model to achieve autonomous traffic light control at an intersection in a simulated environment. The model consists of a Convolutional Neural Network (CNN) that takes as input an image-like representation of the traffic state and is trained, using the Deep Q-Learning (DQL) algorithm, to maximize a reward function based on both the decrease in queue length and the decrease in maximum waiting times. We show that this approach reduces the average waiting time and the average queue length when compared to several baselines: a multi-layer perceptron architecture with a simpler state space representation and four non-parametric models, which implement the most waiting first heuristic, the longest queue first heuristic, an actuated traffic control scheme, and a simple static configuration of the traffic lights, respectively. These results suggest the applicability of the designed approach to real traffic light control systems in future smart cities.</p>
      </abstract>
      <kwd-group>
<kwd>Reinforcement Learning</kwd>
        <kwd>Deep Q-learning</kwd>
<kwd>Traffic lights</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Smart Cities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
The advancement of smart technologies during the past 10 years, such as IoT devices, big data analytics,
and artificial intelligence methods, has led to the emergence of Smart Cities. One of the key components
of the smart urban environment is the optimization of urban vehicle transportation, which directly impacts
traffic congestion, costs, and emissions [
        <xref ref-type="bibr" rid="ref1">1</xref>
]. Two types of solutions are possible to address this challenge.
The least efficient one, in terms of costs and durability, consists in the expansion of road infrastructures,
while the most functional one involves increasing the efficiency of already existing infrastructures, such
as traffic light signals at intersections [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. The latter can be implemented through several algorithms,
such as static traffic light phases or vehicle-actuated signal control. However, the most promising
techniques for adaptive signal control seem to be based on Reinforcement Learning (RL) [
        <xref ref-type="bibr" rid="ref3">3</xref>
]. This
paper aims at implementing an RL-based agent able to dynamically control the traffic light phases of an
intersection in order to minimize jam lengths and vehicles’ waiting times. In particular, we implemented
a Convolutional Neural Network (CNN), trained using the Deep Q-Learning (DQL) algorithm, which
takes as input an image-like representation of the traffic state. We employed a state space definition that
combines discrete traffic state encoding (DTSE) [
        <xref ref-type="bibr" rid="ref2">2</xref>
] with vehicles’ waiting times in order to consider
both space and time information. We also defined a reward function according to the best-performing
approaches proposed in the literature, which involves both the variation in queue length and waiting times.
We evaluated the performance of our model by comparing it with that of different baselines, such as a
multi-layer perceptron architecture with a simpler state space representation and four heuristic-based
models. We show that our approach performs better than the baselines in reducing the average queue
length and the average waiting time at the considered intersection.
      </p>
<p>The next sections are organized as follows: Section 2 summarizes the most common algorithms
and methodologies present in the literature in the field of adaptive traffic lights control. Section 3 briefly
describes the reinforcement learning paradigm. Section 4 defines the components of the operating
environment in which the agent works, such as the performance measures and the employed simulation
software. Section 5 provides details regarding the state space, action space and reward function, as well
as describing the learning algorithm and network architecture. Section 6 details the experimental setup
and the obtained results, while Section 7 summarizes the conducted research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
A lot of research has been done using reinforcement learning to build adaptive traffic signal control
systems. These works mainly differ in the state representation of the environment, the action space of
the agent, and the reward function. Authors in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
] defined the state representation on the basis of the
queue length of different incoming roads, while in [
        <xref ref-type="bibr" rid="ref6">6</xref>
] the traffic state is estimated by considering both
queue length and the maximum time a vehicle has waited on each lane at the intersection. However,
authors in [
        <xref ref-type="bibr" rid="ref2">2</xref>
] pointed out that these abstract representations of the traffic state may omit relevant
information and lead to suboptimal solutions. For this reason, other works employed an image-like
representation by defining a Boolean-valued matrix whose cells can contain a value of one, indicating the
presence of a vehicle, or zero, indicating its absence [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
], this matrix is further combined with
another one that indicates vehicles’ speed at the intersection. In this paper, instead, we aim at developing a
model able to automatically learn high-level state representations without providing too many
handcrafted features as input. To this purpose, we implemented a convolutional neural network that takes as
input an image-like representation of the traffic state, exploiting the idea mentioned above. However,
we propose a state definition that takes into consideration both the positions and the waiting times
of vehicles, and additionally uses a stack of consecutive simulation frames to make the model able to
implicitly estimate vehicles’ velocity and travel direction, following the idea proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
An important aspect of reinforcement learning for traffic lights control is how the action space is
defined. Previous works proposed two different possibilities: (1) authors in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] proposed a system in
which all the phases cyclically change in a fixed sequence to guide vehicles through the intersection. In
that system, the agent’s action is to select the phase duration in the next cycle. (2) On the other hand,
most of the previous research defined the action space as the set of possible signal phase configurations
(i.e., all the allowed green/red light configurations at the intersection) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref9">9</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this scenario, the
agent’s action consists in selecting which lanes get a green light by choosing one of the allowed
green/red light settings. Since the agent does not optimize the duration of each phase, green/red light
timings can only be a multiple of a fixed-length interval. We chose to use the second action space
definition, as it seems to be the most popular.
      </p>
      <p>
Another key component is the reward function. Many reward definitions have been proposed in
the literature, such as the change in cumulative vehicle delay [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ][
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and change in number of queued vehicles
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, authors in [
        <xref ref-type="bibr" rid="ref6">6</xref>
] suggest defining a reward function that is based both on the decrease in
queue length and on the decrease in vehicles’ waiting times. This approach is also proposed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
where the results show that if the reward is exclusively based on queue length metrics, the model could
let some cars wait for an indefinite period of time. Therefore, in order to avoid situations of this kind,
we decided to design our reward function following the latter approach.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
<p>In a reinforcement learning setting, an agent interacts with the environment to get rewards from its
actions. Usually, a reinforcement learning model faces an unknown Markov decision process. It consists
of the set of all the states $S$, the action set $A$, the transition function $\delta$, and the reward function $r$. At
each discrete time $t$:
• the agent observes state $s_t \in S$;
• it chooses action $a_t \in A$ (among the possible actions in state $s_t$) and executes it;
• it receives an immediate reward $r_t = r(s_t, a_t)$, which can be positive, negative or neutral;
• the state changes to $s_{t+1} = \delta(s_t, a_t)$.</p>
<p>Assuming that $r_t$ and $s_{t+1}$ only depend on the current state and action, the agent’s goal is to learn an action
policy $\pi : S \to A$ that maximizes the expected sum of (discounted) rewards obtained if policy $\pi$ is
followed. For each possible policy $\pi$ the agent might adopt, we can define an evaluation function $V^{\pi}$ over
states; the optimal policy is then
$\pi^{*} = \arg\max_{\pi} V^{\pi}(s) \quad (\forall s)$.
(2)
In the Q-learning framework, a numeric value $Q(s, a) \in \mathbb{R}$, called Q-value, is associated to each
state-action pair. The value of $Q$ is the reward received immediately upon executing action $a$ from state
$s$, plus the value (discounted by $\gamma$) of following the optimal policy thereafter:</p>
<p>$Q(s, a) = r(s, a) + \gamma V^{*}(\delta(s, a))$,
where $\delta(s, a)$ denotes the state resulting from applying action $a$ to state $s$. Then, we can reformulate
(2) as:</p>
<p>$\pi^{*}(s) = \arg\max_{a} Q(s, a)$.</p>
<p>The Q-values are estimated in the Q-learning algorithm by iterative Bellman updates:
$Q_{n}(s, a) = Q_{n-1}(s, a) + \alpha \big( r + \gamma \max_{a'} Q_{n-1}(s', a') - Q_{n-1}(s, a) \big)$.</p>
<p>In this way, if the agent learns the $Q$ function instead of $V^{*}$, it will be able to select optimal actions
even if it has no knowledge of $r$ and $\delta$.</p>
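      <p>To make the update rule concrete, the following is a minimal tabular Q-learning sketch in Python (the environment interface is a hypothetical stand-in, not the paper’s code):</p>
      <preformat>
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1  # learning rate, discount, exploration

def q_learning(env, actions, episodes=500):
    # 'env' is assumed to expose reset() -> state and step(a) -> (state, reward, done).
    Q = defaultdict(float)  # Q[(state, action)] estimates the discounted return
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() &lt; EPSILON:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Bellman update: Q_n = Q_{n-1} + alpha * (target - Q_{n-1})
            target = r if done else r + GAMMA * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s = s_next
    return Q
      </preformat>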
    </sec>
    <sec id="sec-4">
      <title>4. Operating environment</title>
      <p>In this section, we define the operating environment in which the agent works.</p>
      <p>
Simulation environment: since it is difficult to retrieve real traffic data and perform real-world
experimentation, we relied on SUMO [
        <xref ref-type="bibr" rid="ref12">12</xref>
], an open source traffic simulator that makes it possible to
model real-world traffic behavior. This software, through an API called TraCI, provides complete control
over the simulation environment elements, such as vehicles’ speed and position, the traffic flow’s intensity
on each lane, traffic light phases, the shape of the intersection, etc.
      </p>
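      <p>As a brief illustration, the following sketch (assuming hypothetical configuration, lane, and traffic light IDs) shows how TraCI can drive a SUMO simulation and query the quantities used in this work:</p>
      <preformat>
import traci

# Start SUMO with a (hypothetical) intersection configuration file.
traci.start(["sumo", "-c", "intersection.sumocfg"])

for step in range(500):
    traci.simulationStep()  # advance the simulation by one step
    for lane_id in ["east_0", "east_1"]:  # hypothetical lane IDs
        queue = traci.lane.getLastStepHaltingNumber(lane_id)  # queued vehicles
        wait = traci.lane.getWaitingTime(lane_id)             # waiting time [s]
    traci.trafficlight.setPhase("tls_0", 0)  # select a green/red configuration

traci.close()
      </preformat>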
<p>Performance measures: the performance of the agent is assessed with respect to two common traffic
metrics: queue length and vehicles’ waiting times. The goal is to find a model able to dynamically
control the traffic lights of an intersection in order to minimize these two metrics.</p>
<p>Although dynamic traffic light control is an extremely complex task in the real world, SUMO allows
one to operate in a more controlled environment. Specifically, the agent works in a fully-observable
environment, since the software gives access to its complete state at each point in time. For this
paper, we also defined a deterministic environment by setting a non-stochastic traffic flow generation.
This makes the analysis simpler, but it is also one of the biggest limitations of this work. Clearly, the
environment is also sequential and single-agent.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Methods</title>
<p>In order to build a reinforcement learning model for traffic lights control, we need to define the traffic
state representation, the action space, and the reward function.</p>
      <sec id="sec-5-1">
        <title>5.1. State space</title>
<p>We propose a state representation that takes into consideration both vehicles’ positions and waiting
times. The idea is to map each lane approaching the intersection into a Boolean-valued vector, where
each cell can contain a 1, indicating the presence of a vehicle at that position, or a 0, indicating its
absence. Each cell of the vector corresponds to 1 meter of the lane. The matrix of vehicles’ positions is
then obtained by stacking all the lane vectors. Given an intersection with $L$ lanes, where the longest
lane is $M$ meters long, this intermediate state representation $s'$ consists of an $(L \times M)$ matrix. Note that
zero-padding is added to the vectors of lanes shorter than the longest lane in order to have
all equally-sized vectors.</p>
<p>Then, the $s'$ representation is enriched by using a stack of consecutive simulation frames to make the
model able to implicitly estimate vehicles’ velocity and travel direction. In particular, $s'$ is computed for
the last $k$ ($k = 2$ in our setting) simulation steps, yielding a new $(k \times L \times M)$ matrix, denoted as $s''$.</p>
<p>The $s''$ representation built so far consists of a Boolean-valued matrix that contains the information
about vehicles’ positions over the last $k$ simulation steps. However, it does not take into consideration the
waiting times. This information is embodied in the representation by computing another state matrix,
whose cells contain the normalized values of the vehicles’ waiting times at the last simulation step.
Then, the final state representation $s$ is a $((k + 1) \times L \times M)$ matrix.</p>
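        <p>The following numpy sketch illustrates this construction; the two inputs are assumed to be extracted from the simulator (the helper code that produces them is not shown):</p>
        <preformat>
import numpy as np

K = 2    # number of stacked position frames
L = 6    # number of incoming lanes (3 roads x 2 lanes each)
M = 309  # length of the longest lane, in meters

def build_state(position_frames, waiting_times):
    # position_frames: list of K (L x M) Boolean matrices, one per recent step.
    # waiting_times: (L x M) matrix of normalized waiting times, last step only.
    s2 = np.stack(position_frames)                 # (K, L, M) history, i.e. s''
    s = np.concatenate([s2, waiting_times[None]])  # append waiting-time channel
    assert s.shape == (K + 1, L, M)                # final state representation s
    return s.astype(np.float32)
        </preformat>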
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Action space</title>
<p>To handle traffic at the intersection, the agent selects which lanes get a green light according to a set
of three possible green/red light configurations. On each of the three incoming roads there is a traffic
light that manages the traffic on the corresponding lanes. The combination of the individual phases of
these traffic lights forms the set of the possible green/red light configurations. In Figure 1, all the three
possible signal phases that can occur at the considered intersection are shown. Green and red lines
represent the routes that vehicles can travel during the simulation. Vehicles on green paths are allowed
to pass, while vehicles on red paths must stop.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Reward function</title>
<p>The proposed definition of the reward function takes into account both the variation in queue length
and the waiting times. In particular, the reward $r_t$ is given by the following formula:</p>
        <p>$r_t = (q_t - q_{t+1}) - \beta\, w_{t+1}$,
where $q_t$ represents the sum of the jam lengths (in meters) observed over the lanes at time $t$, and $w_{t+1}$
represents the sum of the maximum waiting times (in seconds) observed over the lanes at time $t + 1$.
$\beta$ is a hyper-parameter that determines how much to penalize the agent for letting vehicles wait too
long (in our setting $\beta = 0.4$). The agent receives a positive reward if the last action performed, $a_t$,
leads to a state $s_{t+1}$ with a shorter total queue length and/or low maximum waiting times.</p>
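        <p>A minimal sketch of this reward computation (the per-lane measurements are assumed to come from the simulator; the function and argument names are illustrative):</p>
        <preformat>
BETA = 0.4  # waiting-time penalty weight

def reward(jam_lengths_t, jam_lengths_t1, max_waits_t1):
    # jam_lengths_t, jam_lengths_t1: per-lane jam lengths [m] at t and t + 1.
    # max_waits_t1: per-lane maximum waiting times [s] at t + 1.
    # Implements r_t = (q_t - q_{t+1}) - beta * w_{t+1}.
    q_t, q_t1 = sum(jam_lengths_t), sum(jam_lengths_t1)
    w_t1 = sum(max_waits_t1)
    return (q_t - q_t1) - BETA * w_t1
        </preformat>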
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Network architecture</title>
<p>The proposed architecture is a convolutional neural network that takes as input the state matrix
described in Section 5.1 and returns as output an approximation of the optimal Q-values. The model
is composed of two convolutional layers followed by two fully connected layers. In particular, the
first convolutional layer consists of 16 (2 × 10)-filters with stride (2 × 1), followed by a LeakyReLU
activation function. The second layer has 32 (1 × 4)-filters with stride (1 × 2), followed by a LeakyReLU
activation function and a max pooling layer of size (1 × 2). The first fully-connected layer has 256
nodes followed by a LeakyReLU activation function, while the output layer has 3 linear output neurons
(one for each possible green/red light configuration). In Figure 2, a summary of the CNN architecture is
shown. We designed the convolutional kernels so that, ideally, they compute high-level representations
of each road separately. Then, the joint information among the different roads is merged by the network
in the last two fully connected layers.</p>
<p>[Figure 2. CNN architecture: Input → Conv 1 → Conv 2 → Max Pooling → Flatten → Dense → Output]</p>
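        <p>A PyTorch sketch of this architecture, reconstructed from the description above (not the authors’ code; the flattened size is resolved lazily because it depends on the input dimensions):</p>
        <preformat>
import torch
import torch.nn as nn

class TrafficCNN(nn.Module):
    # CNN of Section 5.4: two conv layers, max pooling, two dense layers.
    # Input: (batch, k + 1, L, M) state tensor; output: 3 Q-values.
    def __init__(self, in_channels=3, n_actions=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=(2, 10), stride=(2, 1)),
            nn.LeakyReLU(),
            nn.Conv2d(16, 32, kernel_size=(1, 4), stride=(1, 2)),
            nn.LeakyReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.LazyLinear(256),         # flattened size depends on (L, M)
            nn.LeakyReLU(),
            nn.Linear(256, n_actions),  # one linear output per configuration
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example: one state with k + 1 = 3 channels, 6 lanes, 309 cells per lane.
q_values = TrafficCNN()(torch.zeros(1, 3, 6, 309))
        </preformat>
        <p>Note that a (2 × 10) kernel with vertical stride 2 makes each first-layer filter span exactly the two lanes of one road, which is consistent with the design goal of processing roads separately before the dense layers merge them.</p>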
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Learning algorithm</title>
<p>The proposed model was trained using the Deep Q-Learning (DQL) algorithm, shown in Algorithm 1,
which combines Q-Learning with Deep Neural Networks (DNN). The employed
hyperparameter values, which are typical of those found in the literature, are shown in Table 1.</p>
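        <p>A condensed Python sketch of this training loop (experience replay plus a softly-updated target network) is given below; the environment interface and the hyper-parameter values are hypothetical stand-ins for the actual setup of Table 1:</p>
        <preformat>
import copy, random
from collections import deque

import torch
import torch.nn.functional as F

GAMMA, TAU, BATCH = 0.95, 0.001, 64  # assumed values

def train(policy_net, env, steps=10_000, capacity=50_000):
    target_net = copy.deepcopy(policy_net)
    optimizer = torch.optim.Adam(policy_net.parameters())
    memory = deque(maxlen=capacity)  # replay memory D
    s = env.reset()
    for _ in range(steps):
        # Sample an action from the softmax over the predicted Q-values.
        probs = torch.softmax(policy_net(s.unsqueeze(0)).squeeze(0), dim=0)
        a = torch.multinomial(probs, 1).item()
        s_next, r, done = env.step(a)
        memory.append((s, a, r, s_next, done))
        s = env.reset() if done else s_next
        if len(memory) &lt; BATCH:
            continue
        ss, aa, rr, ss1, dd = map(list, zip(*random.sample(memory, BATCH)))
        ss, ss1 = torch.stack(ss), torch.stack(ss1)
        rr = torch.tensor(rr, dtype=torch.float32)
        dd = torch.tensor(dd, dtype=torch.float32)
        # y = r for terminal states, else r + gamma * max_a' Q_target(s', a')
        with torch.no_grad():
            y = rr + GAMMA * (1 - dd) * target_net(ss1).max(dim=1).values
        q = policy_net(ss).gather(1, torch.tensor(aa).unsqueeze(1)).squeeze(1)
        loss = F.smooth_l1_loss(q, y)  # Huber loss
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        # Soft target update: theta_minus = tau*theta + (1 - tau)*theta_minus
        for p, pt in zip(policy_net.parameters(), target_net.parameters()):
            pt.data.mul_(1 - TAU).add_(TAU * p.data)
        </preformat>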
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experiments</title>
      <sec id="sec-6-1">
        <title>6.1. Simulation setup</title>
<p>The considered intersection (Figure 1) is composed of three incoming roads, each with two lanes. In
order to simulate real-life scenarios, the intersection was designed similarly to a real one located in
Como (IT) at the following coordinates: (45.802155, 9.084961). The two main roads’ lengths are
309 and 211 meters respectively, while the minor road’s length is 103 meters. The maximum speed is
13.9 m/s, which is equal to 50 km/h, on each road. On each lane, vehicles can travel following different
routes through the intersection. Due to the difficulty of finding a dataset of the traffic flows of Italian
roads, we set the traffic flow rate to 450 vehicles per hour on each route. A scheme of the routes that
vehicles can travel is shown in Figure 1. We can observe that the east incoming road has 4 different
routes; therefore, the traffic on that road will be higher than on the others. The minimum green/red-light
phase duration is fixed at 10 simulation steps (10 seconds in the simulation environment), while the
yellow-light phase duration between two neighboring phases is fixed at 5 seconds. These two fixed
lengths determine how many simulation steps SUMO can run before letting the model take a new action.
With this configuration, the green-light phase is guaranteed to be of at least 10 seconds. For simplicity,
we chose to generate only one vehicle type. In particular, each vehicle’s length is 5 meters. After 500
simulation steps, the system stops generating vehicles and the simulation ends. The proposed model
was trained for 45 epochs, where each epoch is composed of 5 complete SUMO simulations.</p>
        <preformat>
Algorithm 1 Deep Q-Learning with Experience Replay
 1: procedure DQL for traffic lights control
 2:   Initialize replay memory D to capacity N
 3:   Initialize policy network Q with random weights θ
 4:   Initialize target network Q̂ with random weights θ⁻ = θ
 5:   Create simulation environment E
 6:   epoch ← 0; episode ← 0
 7:   while epoch &lt; MAX_EPOCHS do
 8:     E.reset(); episode ← episode + 1
 9:     for t = 1, T do
10:       Select action aₜ by sampling from softmax(Q(sₜ; θ))
11:       sₜ₊₁, rₜ ← E.step(aₜ)
12:       Store transition ⟨sₜ, aₜ, rₜ, sₜ₊₁⟩ in D
13:       Sample a mini-batch of transitions ⟨sⱼ, aⱼ, rⱼ, sⱼ₊₁⟩ uniformly from D
14:       if sⱼ₊₁ is terminal then yⱼ ← rⱼ
15:       else yⱼ ← rⱼ + γ max_a′ Q̂(sⱼ₊₁, a′; θ⁻)
16:       L ← huber₁(yⱼ, Q(sⱼ, aⱼ; θ))
17:       Optimize θ, using ADAM, according to L
18:       θ⁻ ← τ θ + (1 − τ) θ⁻
19:     if episode mod 5 = 0 then
20:       Evaluate
21:       epoch ← epoch + 1
        </preformat>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Results</title>
        <p>
As we said before, the proposed model was assessed with respect to two common traffic metrics: queue
length and vehicles’ waiting times. We compared the performance of the proposed model with that of
the following baselines:
• A Multi-Layer Perceptron (MLP) network with one fully-connected hidden layer of 80 nodes,
followed by a ReLU activation function, and 3 linear output neurons. The input of the MLP
consists of a vector containing the information about the current phase, the queue length (in
meters) on each lane, and the maximum time (in seconds) a vehicle has waited on each lane at
the intersection, following the approach proposed in [
          <xref ref-type="bibr" rid="ref6">6</xref>
]. The MLP was trained with the same
hyper-parameters and optimization method used for the CNN.
• Two traffic control systems provided by default by SUMO: (1) the first one is a simple Static
configuration of traffic light signals, in which all the phases cyclically change in a fixed sequence
and each green/red-light phase has a fixed duration of 25 seconds, while the yellow-light duration
is still 5 seconds. (2) The second system is the default implementation of the gap-based Actuated
traffic control scheme, which dynamically adjusts traffic light phases’ durations whenever a
continuous stream of traffic is detected.
• Two models that implement the most waiting first (MWF) heuristic and the longest queue first (LQF)
heuristic. The first model sets a green light to the lanes in which vehicles waited the most, up to
the current simulation step. The second model, instead, sets a green light to the lanes in which the
longest queues were observed. For both models, the green/red-light duration and the yellow-light
duration are the same as for the CNN model.
Table 2 shows the performance of the tested models. It is clear that the proposed agent performs
better than every baseline, providing a lower average waiting time (computed by averaging the maximum
waiting times observed on each lane during the simulation) and a shorter average queue length. We can
also see that the non-parametric methods, i.e., the MWF, LQF, Actuated and Static heuristics, perform
dramatically worse than the RL-based agents. Therefore, we continue the analysis by exploring the
differences between the two neural network models.
        </p>
<p>In Figure 3, a comparison between the average rewards obtained by the CNN and the MLP on each
epoch is shown (red line and blue line, respectively). The learning process seems to be more stable
for the CNN-based agent, which performs better than the baseline. However, we can observe a rapid
increase in the rewards obtained by the MLP agent at the end of the training. This suggests that, even
if the CNN model provides better results in this experiment, the MLP does not perform dramatically
worse. The same result can be deduced by looking at the average queue lengths and average waiting
times obtained by the two architectures over the epochs, shown in Figure 4. For this reason, in order to
assess whether the CNN-based agent concretely brings significant improvements with respect to the
MLP-based one, we compared both models by training them under different traffic conditions. Figure 5
shows the box plots of the average rewards obtained by training both models in 4 different simulation
setups, featuring increasing traffic intensities. Each setup is equivalent to the one presented in Section
6.1, with 350, 450, 550 and 700 vehicles per hour, respectively. The results show that, under low traffic
conditions, the two models perform very similarly. However, the CNN-based agent scales better than
the baseline with increasing traffic intensity, showing that the proposed model is more robust and can
deal with more complex scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
<p>Smart cities and the planet urgently ask for environmental emissions to be reduced while improving the
quality of life for citizens. To this end, Artificial Intelligence can provide researchers with instruments
and tools to help this virtuous process. In this paper, a new CNN-based approach has been designed
and tested to improve queue length and vehicle waiting times for traffic light control systems. The
proposed approach has been extensively experimented against five baseline models. The results show
that CNN models perform better than the baselines. This opens the possibility of testing our approach in
real-life conditions and in future Smart Cities that will exploit intelligent traffic light control systems.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tosi</surname>
          </string-name>
          ,
          <article-title>Cell phone big data to compute mobility scenarios for future smart cities</article-title>
          ,
          <source>International Journal of Data Science and Analytics</source>
          <volume>4</volume>
          (
          <year>2017</year>
          )
          <fpage>265</fpage>
          -
          <lpage>284</lpage>
          . URL: https://doi.org/10.1007/s41060-017-0061-2. doi:10.1007/s41060-017-0061-2.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Genders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razavi</surname>
          </string-name>
          ,
<article-title>Using a deep reinforcement learning agent for traffic signal control</article-title>
          ,
          <source>arXiv preprint arXiv:1611.01142</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
<string-name>
            <given-names>N.</given-names>
            <surname>Sadeh</surname>
          </string-name>
          ,
          <article-title>The real deal: A review of challenges and opportunities in moving reinforcement learning-based traffic signal control systems towards reality</article-title>
          ,
          <source>arXiv preprint arXiv:2206.11996</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Chin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bolong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
<string-name>
            <given-names>K. T. K.</given-names>
            <surname>Teo</surname>
          </string-name>
          ,
          <article-title>Exploring Q-learning optimization in traffic signal timing plan management</article-title>
          ,
          <source>in: 2011 Third International Conference on Computational Intelligence, Communication Systems and Networks</source>
          , IEEE,
          <year>2011</year>
          , pp.
          <fpage>269</fpage>
          -
          <lpage>274</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Maiti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chilukuri</surname>
          </string-name>
          ,
<article-title>Traffic signal control for an isolated intersection using reinforcement learning</article-title>
          ,
          <source>in: 2021 International Conference on COMmunication Systems &amp; NETworkS (COMSNETS)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>629</fpage>
          -
          <lpage>633</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Natafgi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Osman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Haidar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hamandi</surname>
          </string-name>
          ,
          <article-title>Smart traffic light system using machine learning</article-title>
          ,
          <source>in: 2018 IEEE International Multidisciplinary Conference on Engineering Technology (IMCET)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Van der Pol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Oliehoek</surname>
          </string-name>
          ,
<article-title>Coordinated deep reinforcement learners for traffic light control</article-title>
          ,
          <source>Proceedings of Learning, Inference and Control of Multi-Agent Systems (at NIPS 2016)</source>
          <volume>8</volume>
          (
          <year>2016</year>
          )
          <fpage>21</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
<article-title>Deep reinforcement learning for traffic light control in vehicular networks</article-title>
          ,
          <source>arXiv preprint arXiv:1803.11115</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schukat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Howley</surname>
          </string-name>
          ,
          <article-title>Traffic light control using deep policy-gradient and value-function-based reinforcement learning</article-title>
          ,
          <source>IET Intelligent Transport Systems</source>
          <volume>11</volume>
          (
          <year>2017</year>
          )
          <fpage>417</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Arel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Urbanik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Kohls</surname>
          </string-name>
          ,
<article-title>Reinforcement learning-based multi-agent system for network traffic signal control</article-title>
          ,
          <source>IET Intelligent Transport Systems</source>
          <volume>4</volume>
          (
          <year>2010</year>
          )
          <fpage>128</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Koohy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gerding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Manla</surname>
          </string-name>
          ,
          <article-title>Reward function design in multi-agent reinforcement learning for traffic signal control</article-title>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Behrisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bieker-Walz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Erdmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-P.</given-names>
            <surname>Flötteröd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hilbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lücken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rummel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wießner</surname>
          </string-name>
          ,
          <article-title>Microscopic traffic simulation using SUMO</article-title>
          ,
          <source>in: The 21st IEEE International Conference on Intelligent Transportation Systems</source>
          , IEEE,
          <year>2018</year>
          . URL: https://elib.dlr.de/124092/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>