<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Deep convolutional Q-learning for traffic lights optimization in Smart Cities</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Riccardo</forename><surname>Cappi</surname></persName>
							<email>riccardo.cappi@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padova</orgName>
								<address>
									<settlement>Padova</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sebastiano</forename><surname>Monti</surname></persName>
							<email>sebastiano.monti@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Padova</orgName>
								<address>
									<settlement>Padova</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Davide</forename><surname>Tosi</surname></persName>
							<email>davide.tosi@uninsubria.it</email>
							<affiliation key="aff1">
								<orgName type="institution">Università degli studi dell&apos;Insubria</orgName>
								<address>
									<settlement>Varese</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">Workshop Agents in Traffic and Transportation</orgName>
								<address>
									<addrLine>October 19</addrLine>
									<postCode>2024</postCode>
									<settlement>Santiago de Compostela</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Deep convolutional Q-learning for traffic lights optimization in Smart Cities</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">29F87ACF71D3EADFDCB434E4D8461B17</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Reinforcement Learning</term>
					<term>Deep Q-learning</term>
					<term>Traffic lights</term>
					<term>Convolutional Neural Networks</term>
					<term>Smart Cities</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Autonomous traffic control is an important and active field of research that could potentially lead to remarkable improvements in congestion management and consequent delay and air pollution reductions. In this paper, we propose a deep reinforcement learning model to achieve autonomous traffic lights control at an intersection in a simulated environment. The model consists of a Convolutional Neural Network (CNN) that takes as input an image-like representation of the traffic state and is trained, using the Deep Q-Learning algorithm (DQL), to maximize a reward function based on the decrease in both queue length and maximum waiting times. We show that this approach reduces average waiting time and average queue length when compared to several baselines, such as a multi-layer perceptron architecture with a simpler state space representation and four non-parametric models, which implement the most waiting first heuristic, the longest queue first heuristic, an actuated traffic control scheme, and a simple static configuration of the traffic lights, respectively. These results suggest the applicability of the designed approach to real traffic light control systems in future smart cities.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The advancement of smart technologies during the past 10 years, such as IoT devices, big data analytics, and artificial intelligence methods, has led to the emergence of Smart Cities. One of the key components of the smart urban environment is the optimization of urban vehicle transportation, directly impacting traffic congestion, costs, and emissions <ref type="bibr" target="#b0">[1]</ref>. Two types of solutions are possible to address this challenge. The least efficient one, in terms of costs and durability, consists of expanding road infrastructure, while the most functional one involves increasing the efficiency of already existing infrastructures, such as traffic light signals at intersections <ref type="bibr" target="#b1">[2]</ref>. The latter can be implemented through several algorithms, such as static traffic light phases or vehicle-actuated signal control. However, the most promising techniques for adaptive signal control seem to be based on Reinforcement Learning (RL) <ref type="bibr" target="#b2">[3]</ref>. This paper aims at implementing an RL-based agent able to dynamically control the traffic light phases of an intersection in order to minimize jam lengths and vehicles' waiting times. In particular, we implemented a Convolutional Neural Network (CNN), trained using the Deep Q-Learning (DQL) algorithm, which takes as input an image-like representation of the traffic state. We employed a state space definition that combines discrete traffic state encoding (DTSE) <ref type="bibr" target="#b1">[2]</ref> with vehicles' waiting times in order to consider both space and time information. We also defined a reward function according to the best-performing approaches proposed in the literature, which involves both the variation in queue length and waiting times. 
We evaluated the performance of our model by comparing it with that of different baselines, such as a multi-layer perceptron architecture with a simpler state space representation and four heuristic-based models. We show that our approach performs better than the baselines in reducing the average queue length and the average waiting time at the considered intersection.</p><p>The next sections are organized as follows: Section 2 summarizes the most common algorithms and methodologies present in the literature in the field of adaptive traffic lights control. Section 3 briefly describes the reinforcement learning paradigm. Section 4 defines the components of the operating environment in which the agent works, such as the performance measures and the employed simulation software. Section 5 provides details regarding the state space, action space and reward function, as well as describing the learning algorithm and network architecture. Section 6 details the experimental setup and the obtained results, while Section 7 summarizes the conducted research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>Considerable research has been done on using reinforcement learning to build adaptive traffic signal control systems. These works mainly differ in the state representation of the environment, the action space of the agent and the reward function. Authors in <ref type="bibr" target="#b3">[4]</ref> <ref type="bibr" target="#b4">[5]</ref> defined the state representation on the basis of queue length of different incoming roads, while in <ref type="bibr" target="#b5">[6]</ref> the traffic state is estimated by considering both queue length and the maximum time a vehicle has waited on each lane at the intersection. However, authors in <ref type="bibr" target="#b1">[2]</ref> pointed out that these abstract representations of the traffic state may omit relevant information and lead to suboptimal solutions. For this reason, other works employed an image-like representation by defining a Boolean-valued matrix whose cells can contain a value of one, indicating the presence of a vehicle, or zero, indicating its absence <ref type="bibr" target="#b6">[7]</ref>. In <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b7">[8]</ref>, this matrix is further combined with another that indicates vehicles' speed at the intersection. In this paper, instead, we aim at developing a model able to automatically learn high-level state representations without requiring too many handcrafted input features. To this purpose, we implemented a convolutional neural network that takes as input an image-like representation of the traffic state, exploiting the idea mentioned above. 
However, we propose a state definition that takes into consideration both the position and the waiting times of vehicles, and additionally uses a stack of consecutive simulation frames to make the model able to implicitly estimate vehicles' velocity and travel direction, following the idea proposed in <ref type="bibr" target="#b8">[9]</ref>.</p><p>An important aspect of reinforcement learning for traffic lights control is how the action space is defined. Previous works proposed two different possibilities: (1) authors in <ref type="bibr" target="#b7">[8]</ref> proposed a system in which all the phases cyclically change in a fixed sequence to guide vehicles through the intersection. In that system, the agent's action is to select the phase duration in the next cycle. (2) On the other hand, most of the previous research defined the action space as the set of possible signal phase configurations (i.e., all the allowed green/red light configurations at the intersection) <ref type="bibr" target="#b6">[7]</ref><ref type="bibr" target="#b8">[9]</ref> <ref type="bibr" target="#b1">[2]</ref>. In this scenario, the agent's action consists of selecting which lanes get a green light by choosing one of the allowed green/red light settings. Since the agent does not optimize the duration of each phase, green/red light timings can only be a multiple of a fixed-length interval. We chose to use the second action space definition, as it seems to be the most popular.</p><p>Another key component is the reward function. Many reward definitions have been proposed in the literature, such as change in cumulative vehicle delay <ref type="bibr" target="#b9">[10]</ref> <ref type="bibr" target="#b8">[9]</ref> and change in number of queued vehicles <ref type="bibr" target="#b4">[5]</ref>. However, authors in <ref type="bibr" target="#b5">[6]</ref> suggest defining a reward function that is based both on the decrease in queue length and on the decrease in vehicles' waiting times. 
This approach is also proposed in <ref type="bibr" target="#b10">[11]</ref>, where the results show that if the reward is exclusively based on queue length metrics, the model could leave some cars waiting for an indefinite period of time. Therefore, in order to avoid situations of this kind, we decided to design our reward function following the latter approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Background</head><p>In a reinforcement learning setting, an agent interacts with the environment to get rewards from its actions. Usually, a reinforcement learning model faces an unknown Markov decision process. It consists of the set of all the states 𝑆, the action set 𝐴, the transition function 𝛿, and the reward function 𝑅. At each discrete time 𝑡:</p><p>• the agent observes state 𝑠 𝑡 ∈ 𝑆;</p><p>• it chooses action 𝑎 𝑡 ∈ 𝐴 (among the possible actions in state 𝑠 𝑡 ) and executes it; • it receives an immediate reward 𝑟 𝑡 = 𝑅(𝑠 𝑡 , 𝑎 𝑡 ), that can be positive, negative or neutral; • the state changes to 𝑠 𝑡+1 = 𝛿(𝑠 𝑡 , 𝑎 𝑡 ).</p><p>Assuming that 𝑟 𝑡 and 𝑠 𝑡+1 only depend on current state and action, the agent's goal is to learn an action policy 𝜋 : 𝑆 → 𝐴 that maximizes the expected sum of (discounted) rewards obtained if policy 𝜋 is followed. For each possible policy 𝜋 the agent might adopt, we can define an evaluation function over states:</p><formula xml:id="formula_0">𝑉 𝜋 (𝑠) = ∞ ∑︁ 𝑖=0 𝛾 𝑖 𝑟 𝑡+𝑖<label>(1)</label></formula><p>where 𝑟 𝑡 , 𝑟 𝑡+1 , ... are generated executing policy 𝜋 starting at state 𝑠. Then, the choice of the best actions to play becomes an optimization problem. Indeed, it comes down to finding the optimal policy 𝜋 * that maximizes (1) for all states 𝑠:</p><formula xml:id="formula_1">𝜋 * = 𝑎𝑟𝑔𝑚𝑎𝑥 𝜋 𝑉 𝜋 (𝑠), (∀𝑠).<label>(2)</label></formula><p>In the Q-learning framework, a numeric value 𝑄(𝑠, 𝑎) ∈ R, called Q-value, is associated to each state-action pair. The value of 𝑄 is the reward received immediately upon executing action 𝑎 from state 𝑠, plus the value (discounted by 𝛾) of following the optimal policy thereafter:</p><formula xml:id="formula_2">𝑄(𝑠, 𝑎) = 𝑅(𝑠, 𝑎) + 𝛾𝑉 𝜋 * (𝛿(𝑠, 𝑎))</formula><p>where 𝛿(𝑠, 𝑎) denotes the state resulting from applying action 𝑎 to state 𝑠. 
Then, we can reformulate (2) as:</p><formula xml:id="formula_3">𝜋 * (𝑠) = 𝑎𝑟𝑔𝑚𝑎𝑥 𝑎 𝑄(𝑠, 𝑎).</formula><p>The Q-values are estimated in the Q-learning algorithm by iterative Bellman updates:</p><formula xml:id="formula_4">𝑄 𝑡 (𝑠, 𝑎) = 𝑄 𝑡−1 (𝑠, 𝑎) + 𝛼(𝑟 + 𝛾𝑚𝑎𝑥 𝑎 ′ 𝑄 𝑡−1 (𝑠 ′ , 𝑎 ′ ) − 𝑄 𝑡−1 (𝑠, 𝑎)).</formula><p>In this way, if the agent learns the 𝑄 function instead of 𝑉 𝜋 * , it will be able to select optimal actions even if it has no knowledge of 𝑅 and 𝛿.</p></div>
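As an illustration of the iterative Bellman update above, a minimal tabular sketch follows; the dictionary-based `Q` table and the function name are illustrative assumptions, not part of the paper:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step:
    Q_t(s,a) = Q_{t-1}(s,a) + alpha * (r + gamma * max_a' Q_{t-1}(s',a') - Q_{t-1}(s,a)).
    Q is a dict mapping (state, action) pairs to values; unseen pairs default to 0."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q[(s, a)]
```

Note how the agent needs neither 𝑅 nor 𝛿 explicitly: the observed transition (𝑠, 𝑎, 𝑟, 𝑠′) is enough to update the estimate.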
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Operating environment</head><p>In this section, we define the operating environment in which the agent works.</p><p>Simulation environment: since it is difficult to retrieve real traffic data and perform real-world experimentation, we relied on SUMO <ref type="bibr" target="#b11">[12]</ref>, an open source traffic simulator that makes it possible to model real-world traffic behavior. This software, through an API called TraCI, provides complete control over the simulation environment elements, such as vehicles' speed and position, traffic flow's intensity on each lane, traffic light phases, the shape of the intersection, etc.</p><p>Performance measures: the performance of the agent is assessed with respect to two common traffic metrics: queue length and vehicles' waiting times. The goal is to find a model able to dynamically control the traffic lights of an intersection in order to minimize these two metrics.</p><p>Although dynamic traffic light control is an extremely complex task in the real world, SUMO allows one to operate in a more controlled environment. Specifically, the agent works in a fully-observable environment, since the software gives access to its complete state at each point in time. For this paper, we also defined a deterministic environment by setting a non-stochastic traffic flow generation. This makes the analysis simpler, but it is also one of the biggest limitations of this work. Clearly, the environment is also sequential and single-agent.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Methods</head><p>In order to build a reinforcement learning model for traffic lights control, we need to define the traffic state representation, the action space and the reward function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">State space</head><p>We propose a state representation that takes into consideration both vehicles' positions and waiting times. The idea is to map each lane approaching the intersection into a Boolean-valued vector, where each cell can contain a 1, indicating the presence of a vehicle at that position, or a 0, indicating its absence. Each cell of the vector corresponds to 1 meter of the lane. The matrix of vehicles' positions is then obtained by stacking all the lane vectors. Given an intersection with 𝑙 lanes, where the longest lane is 𝑚 meters, this intermediate state representation 𝑠 ′ consists of a (𝑙 × 𝑚) matrix. Note that zero-padding is applied to lane vectors shorter than the longest lane so that all vectors are equally sized.</p><p>Then, the 𝑠 ′ representation is enriched by using a stack of consecutive simulation frames to make the model able to implicitly estimate vehicles' velocity and travel direction. In particular, 𝑠 ′ is computed for the last 𝑝 (𝑝 = 2 in our setting) simulation steps, yielding a new (𝑝 × 𝑙 × 𝑚) matrix, denoted as 𝑠 ′′ .</p><p>The 𝑠 ′′ representation built so far consists of a Boolean-valued matrix that contains the information about vehicles' positions of the last 𝑝 simulation steps. However, it does not take into consideration the waiting times. This information is embodied in the representation by computing another state matrix, whose cells contain the normalized values of the vehicles' waiting times of the last simulation step. Then, the final state representation 𝑠 is a ((𝑝 + 1) × 𝑙 × 𝑚) matrix.</p></div>
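The construction described above can be sketched with NumPy; the per-lane input format, the waiting-time normalization constant `max_wait`, and the function name are all illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def build_state(frames, waits, n_lanes, max_len, max_wait=120.0):
    """frames: list of p dicts {lane: [vehicle positions in meters]}, last p steps.
    waits: dict {lane: [(position, waiting_time_s), ...]} for the last step.
    Returns a ((p + 1) x n_lanes x max_len) tensor: p binary position planes
    plus one plane of normalized waiting times; lanes are zero-padded."""
    p = len(frames)
    state = np.zeros((p + 1, n_lanes, max_len), dtype=np.float32)
    for k, frame in enumerate(frames):          # binary position planes
        for lane, positions in frame.items():
            for pos in positions:
                state[k, lane, min(int(pos), max_len - 1)] = 1.0
    for lane, cells in waits.items():           # normalized waiting-time plane
        for pos, w in cells:
            state[p, lane, min(int(pos), max_len - 1)] = min(w / max_wait, 1.0)
    return state
```

With 𝑝 = 2 as in the paper, the result is a 3-plane tensor: two position snapshots plus the waiting-time plane.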
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Action space</head><p>To handle traffic at the intersection, the agent selects which lanes get a green light according to a set of three possible green/red light configurations. On each of the three incoming roads there is a traffic light that manages the traffic on the corresponding lanes. The combination of the individual phases of these traffic lights forms the set of the possible green/red light configurations. In Figure <ref type="figure" target="#fig_0">1</ref>, all three possible phase configurations are shown.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Reward function</head><p>The proposed definition of the reward function takes into account both the variation in queue length and waiting times. In particular, the reward 𝑟 𝑡 is given by the following formula:</p><formula xml:id="formula_5">𝑟 𝑡 = (𝐽 𝑡 − 𝐽 𝑡+1 ) − 𝛼𝑊 𝑡+1</formula><p>where 𝐽 𝑡 represents the sum of the jam lengths (in meters) observed over the lanes at time 𝑡, and 𝑊 𝑡+1 represents the sum of the maximum waiting times (in seconds) observed over the lanes at time 𝑡 + 1. 𝛼 is a hyper-parameter that determines how much to penalize the agent for letting vehicles wait too long (in our setting 𝛼 = 0.4). The agent receives a positive reward if the last action performed, 𝑎 𝑡 , leads to a state 𝑠 𝑡+1 with shorter total queue length and/or lower maximum waiting times.</p></div>
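As a sketch (the function name and the per-lane input lists are illustrative; the actual values would come from the simulator), the reward reduces to:

```python
def reward(jam_t, jam_next, max_waits_next, alpha=0.4):
    """r_t = (J_t - J_{t+1}) - alpha * W_{t+1}:
    jam_t / jam_next are per-lane jam lengths (m) at times t and t+1,
    max_waits_next are per-lane maximum waiting times (s) at time t+1."""
    return (sum(jam_t) - sum(jam_next)) - alpha * sum(max_waits_next)
```

Shrinking queues raise the reward, while any accumulated waiting time subtracts from it, so the agent cannot profit from starving a lane.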
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Network architecture</head><p>The proposed architecture is a convolutional neural network that takes as input the state matrix mentioned in Section 5.1 and returns as output an approximation of the optimal Q-values. The model is composed of two convolutional layers and two fully connected layers at the end. In particular, the first convolutional layer consists of 16 (2 × 10)-filters with stride (2 × 1) followed by a LeakyReLU activation function. The second layer has 32 (1 × 4)-filters with stride (1 × 2) followed by a LeakyReLU activation function and a max pooling layer of size (1 × 2). The first fully-connected layer has 256 nodes followed by a LeakyReLU activation function, while the output layer has 3 linear output neurons (one for each possible green/red light configuration). In Figure <ref type="figure" target="#fig_1">2</ref>, a summary of the CNN architecture is shown. We designed the convolutional kernels so that, ideally, they compute high-level representations of each road separately. Then, the joint information among the different roads is merged by the network in the last two fully connected layers. The agent was trained with the hyper-parameter values reported in Table <ref type="table" target="#tab_1">1</ref>, which are typically found in the literature.</p></div>
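The layer dimensions can be sanity-checked with the standard convolution output formula; zero padding and a pool stride equal to the pool window are assumptions here, since neither is stated in the paper. For the 3 × 6 × 309 state matrix of Table 1 this traces the flattened size fed to the first fully-connected layer:

```python
def conv_out(size, kernel, stride, padding=0):
    """Standard convolution output size: (size + 2*padding - kernel) // stride + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_size(lanes=6, meters=309):
    h = conv_out(lanes, 2, 2)    # conv1: 16 filters, kernel (2, 10), stride (2, 1)
    w = conv_out(meters, 10, 1)
    h = conv_out(h, 1, 1)        # conv2: 32 filters, kernel (1, 4), stride (1, 2)
    w = conv_out(w, 4, 2)
    w = w // 2                   # max pooling (1, 2)
    return 32 * h * w            # channels x height x width into the 256-node layer
```

The (2 × 10) kernel with vertical stride 2 spans pairs of lanes, which is what lets the network process each two-lane road before the fully-connected layers merge roads.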
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Simulation setup</head><p>The considered intersection (Figure <ref type="figure" target="#fig_0">1</ref>) is composed of three incoming roads, each with two lanes. In order to simulate real-life scenarios, the intersection was designed similarly to a real one located in Como (IT) at the following coordinates: (45.802155, 9.084961). The two main roads' lengths are 309𝑚 and 211𝑚 respectively, while the minor road's length is 103𝑚. The maximum speed on each road is 13.9𝑚/𝑠, which is equal to 50𝑘𝑚/ℎ. On each lane, vehicles can travel following different routes through the intersection. Due to the difficulty of finding a dataset of the traffic flows of Italian roads, we set the traffic flow rate to 450 vehicles per hour on each route. A scheme of the routes that vehicles can travel is shown in Figure <ref type="figure" target="#fig_0">1</ref>. We can observe that the east incoming road has 4 different routes; therefore, the traffic on that road will be higher than on the others. The minimum green/red-light phase duration is fixed at 10 simulation steps (10 seconds in the simulation environment), while the yellow-light phase duration between two neighboring phases is fixed at 5 seconds. These two fixed lengths determine how many simulation steps SUMO can run before letting the model take a new action. With this configuration, the green-light phase is guaranteed to last at least 10 seconds. For simplicity, we chose to generate only one vehicle type, with a length of 5 meters. 
After 500 simulation steps, the system stops generating vehicles and the simulation ends. The proposed model was trained for 45 epochs, where each epoch is composed of 5 complete SUMO simulations.</p></div>
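The experience-replay memory and soft target-network update of Algorithm 1 (with the capacity, mini-batch size, and 𝜏 values listed in Table 1) can be sketched as follows; the class and function names are illustrative, and network parameters are represented as plain lists of floats for simplicity:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of <s, a, r, s', done> transitions, sampled uniformly."""
    def __init__(self, capacity=5000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), batch_size)

def soft_update(theta, theta_target, tau=0.001):
    """Target-network update from Algorithm 1: theta^- <- tau*theta + (1-tau)*theta^-."""
    return [tau * p + (1 - tau) * q for p, q in zip(theta, theta_target)]
```

Uniform sampling breaks the temporal correlation between consecutive SUMO steps, and the small 𝜏 keeps the bootstrapped target 𝑄̂ changing slowly, which stabilizes training.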
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Results</head><p>As stated above, the proposed model was assessed with respect to two common traffic metrics: queue length and vehicles' waiting times. We compared the performance of the proposed model with that of the following baselines:</p><p>• A Multi-Layer Perceptron (MLP) network with one fully-connected hidden layer of 80 nodes, followed by a ReLU activation function, and 3 linear output neurons. The input of the MLP consists of a vector containing the information about the current phase, the queue length (in meters) on each lane, and the maximum time (in seconds) a vehicle has waited on each lane at the intersection, following the approach proposed in <ref type="bibr" target="#b5">[6]</ref>. The MLP was trained with the same hyper-parameters and optimization method used for the CNN. • Two traffic control systems provided by default by SUMO: (1) the first one is a simple Static configuration of traffic light signals, in which all the phases cyclically change in a fixed sequence and each green/red-light phase has a fixed duration of 25 seconds, while the yellow-light duration is still 5 seconds. (2) The second system is the default implementation of the gap-based Actuated traffic control scheme, which dynamically adjusts traffic light phases' durations whenever a continuous stream of traffic is detected. • Two models that implement the most waiting first (MWF) heuristic and the longest queue first (LQF) heuristic. The first model sets a green light to lanes in which vehicles waited the most, up to the current simulation step. The second model, instead, sets a green light to lanes in which the longest queues were observed. For both models, the green/red-light duration and the yellow-light duration are the same as the CNN model. Table <ref type="table" target="#tab_2">2</ref> shows the performance of the tested models. 
It is clear that the proposed agent performs better than every baseline, providing a lower average waiting time and a lower average queue length. We can also see that non-parametric methods such as the MWF, LQF, Actuated and Static heuristics perform dramatically worse than the RL-based agents. Therefore, we continue the analysis by exploring the differences between the two neural network models.</p><p>In Figure <ref type="figure" target="#fig_3">3</ref>, a comparison between the average rewards obtained by the CNN and the MLP on each epoch is shown (red line and blue line, respectively). The learning process seems to be more stable for the CNN-based agent, which performs better than the baseline. However, we can observe a rapid increase in the rewards obtained by the MLP agent at the end of the training. This suggests that, even if the CNN model provides better results in this experiment, the MLP does not perform dramatically worse. The same result can be deduced by looking at the average queue lengths and average waiting times obtained by the two architectures over the epochs, shown in Figure <ref type="figure" target="#fig_4">4</ref>. For this reason, in order to further compare the two architectures, Figure <ref type="figure" target="#fig_6">5</ref> shows the box plots of the average rewards obtained by training both models in 4 different simulation setups, featuring increasing traffic intensities. Each setup is equivalent to the one presented in Section 6.1, with 350, 450, 550 and 700 vehicles per hour, respectively. The results show that, under low traffic conditions, the two models perform very similarly. However, the CNN-based agent scales better than the baseline with increasing traffic intensity, showing that the proposed model is more robust and can deal with more complex scenarios.</p></div>
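The two heuristic baselines described above admit a one-line sketch each; the phase indexing and per-phase aggregation are illustrative assumptions:

```python
def most_waiting_first(max_wait_per_phase):
    """MWF: green to the phase whose lanes hold the longest-waiting vehicles."""
    return max(range(len(max_wait_per_phase)), key=max_wait_per_phase.__getitem__)

def longest_queue_first(queue_len_per_phase):
    """LQF: green to the phase whose lanes hold the longest total queue."""
    return max(range(len(queue_len_per_phase)), key=queue_len_per_phase.__getitem__)
```

Both are greedy in a single metric, which is consistent with Table 2: each performs poorly on exactly the quantity it ignores (LQF on waiting times, MWF on queue lengths).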
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions</head><p>Smart cities and the planet urgently ask for environmental emissions to be reduced while improving the quality of life for citizens. To this end, Artificial Intelligence can provide researchers with instruments and tools to help this virtuous process. In this paper, a new CNN-based approach has been designed and tested to improve queue length and vehicle waiting times for traffic light control systems. The proposed approach has been extensively experimented against five baseline models. The results show that CNN models perform better than baselines. This opens the possibility of testing our approach in real-life conditions and in future Smart Cities that will exploit intelligent traffic light control systems.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Possible phase configurations that can occur at the intersection</figDesc><graphic coords="4,212.19,405.29,168.40,239.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: CNN architecture summary</figDesc><graphic coords="5,159.92,413.59,272.95,129.90" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Algorithm 1</head><label>1</label><figDesc>Deep Q-Learning with Experience Replay. 1: procedure DQL for traffic lights control; 2: initialize replay memory 𝐷 to capacity 𝐿; 3: initialize policy network 𝑄 with random weights 𝜃; 4: initialize target network 𝑄^ with random weights 𝜃 − = 𝜃</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Average rewards obtained by the CNN and the MLP over the epochs.</figDesc><graphic coords="8,180.96,65.61,230.87,180.18" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Average queue length (meters) and average waiting time (seconds) obtained by CNN and MLP on each epoch</figDesc><graphic coords="8,84.12,345.76,202.32,163.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Average rewards obtained by training CNN and MLP agents considering different traffic flow rates</figDesc><graphic coords="9,80.61,65.61,213.12,163.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>𝑠 𝑡+1 , 𝑟 𝑡 ← 𝑒𝑛𝑣.step(𝑎 𝑡 ) 14: Store transition ⟨𝑠 𝑡 , 𝑎 𝑡 , 𝑟 𝑡 , 𝑠 𝑡+1 ⟩ in 𝐷 15: Sample a mini-batch of transitions ⟨𝑠 𝑗 , 𝑎 𝑗 , 𝑟 𝑗 , 𝑠 𝑗+1 ⟩ uniformly from 𝐷</figDesc><table><row><cell>13:</cell><cell></cell></row><row><cell>16:</cell><cell>if 𝑠 𝑗+1 is terminal then</cell></row><row><cell>17:</cell><cell>𝑦 𝑗 ← 𝑟 𝑗</cell></row><row><cell>18:</cell><cell>else</cell></row><row><cell>19:</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell>Agent's hyper-parameters</cell><cell></cell></row><row><cell>Hyper-parameter</cell><cell>Value</cell></row><row><cell>Optimizer</cell><cell>ADAM</cell></row><row><cell cols="2">Replay memory size 5000</cell></row><row><cell>Learning rate</cell><cell>0.001</cell></row><row><cell>Mini-Batch size</cell><cell>32</cell></row><row><cell>Discount factor 𝛾</cell><cell>0.9</cell></row><row><cell>State matrix size</cell><cell>3 × 6 × 309</cell></row><row><cell>Epochs</cell><cell>45</cell></row><row><cell>𝜏</cell><cell>0.001</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Performance comparison of the analyzed models. CNN and MLP's values are obtained by testing the models that got the highest average reward during the training phase.</figDesc><table><row><cell>Model</cell><cell cols="4">Max queue length Max waiting time Avg. queue length Avg. waiting time</cell></row><row><cell></cell><cell>[𝑚]</cell><cell>[𝑠]</cell><cell>[𝑚]</cell><cell>[𝑠]</cell></row><row><cell>CNN</cell><cell>87.54</cell><cell>70</cell><cell>15.78</cell><cell>13.49</cell></row><row><cell>MLP</cell><cell>124.33</cell><cell>159</cell><cell>20.71</cell><cell>19.41</cell></row><row><cell>MWF</cell><cell>181.21</cell><cell>157</cell><cell>38.76</cell><cell>36.67</cell></row><row><cell>LQF</cell><cell>140.01</cell><cell>241</cell><cell>36.65</cell><cell>41.17</cell></row><row><cell>Static</cell><cell>199.43</cell><cell>214</cell><cell>31.15</cell><cell>30.94</cell></row><row><cell>Actuated</cell><cell>178.21</cell><cell>117</cell><cell>27.67</cell><cell>26.14</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Cell phone big data to compute mobility scenarios for future smart cities</title>
		<author>
			<persName><forename type="first">D</forename><surname>Tosi</surname></persName>
		</author>
		<idno type="DOI">10.1007/s41060-017-0061-2</idno>
		<ptr target="https://doi.org/10.1007/s41060-017-0061-2" />
	</analytic>
	<monogr>
		<title level="j">International Journal of Data Science and Analytics</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="265" to="284" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Genders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Razavi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1611.01142</idno>
		<title level="m">Using a deep reinforcement learning agent for traffic signal control</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sadeh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2206.11996</idno>
		<title level="m">The real deal: A review of challenges and opportunities in moving reinforcement learning-based traffic signal control systems towards reality</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Exploring Q-learning optimization in traffic signal timing plan management</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Chin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bolong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">T K</forename><surname>Teo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2011 third international conference on computational intelligence, communication systems and networks</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="269" to="274" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Traffic signal control for an isolated intersection using reinforcement learning</title>
		<author>
			<persName><forename type="first">N</forename><surname>Maiti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Chilukuri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2021 International Conference on COMmunication Systems &amp; NETworkS (COMSNETS)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="629" to="633" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Smart traffic light system using machine learning</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Natafgi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Osman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Haidar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hamandi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2018 IEEE International Multidisciplinary Conference on Engineering Technology (IMCET)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Coordinated deep reinforcement learners for traffic light control</title>
		<author>
			<persName><forename type="first">E</forename><surname>Van Der Pol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Oliehoek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Learning, Inference and Control of Multi-Agent Systems (NIPS)</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="21" to="38" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Han</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.11115</idno>
		<title level="m">Deep reinforcement learning for traffic light control in vehicular networks</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Traffic light control using deep policy-gradient and value-function-based reinforcement learning</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Mousavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schukat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Howley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IET Intelligent Transport Systems</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="417" to="423" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Reinforcement learning-based multi-agent system for network traffic signal control</title>
		<author>
			<persName><forename type="first">I</forename><surname>Arel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Urbanik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Kohls</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IET Intelligent Transport Systems</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="128" to="135" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Reward function design in multi-agent reinforcement learning for traffic signal control</title>
		<author>
			<persName><forename type="first">B</forename><surname>Koohy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gerding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Manla</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Microscopic traffic simulation using SUMO</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Lopez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Behrisch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bieker-Walz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Erdmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-P</forename><surname>Flötteröd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hilbrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lücken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rummel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wagner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Wießner</surname></persName>
		</author>
		<ptr target="https://elib.dlr.de/124092/" />
	</analytic>
	<monogr>
		<title level="m">The 21st IEEE International Conference on Intelligent Transportation Systems</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
