<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Reward Function Design in Multi-Agent Reinforcement Learning for Traffic Signal Control</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Behrad</forename><surname>Koohy</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Southampton</orgName>
								<address>
									<addrLine>University Road, Highfield</addrLine>
									<postCode>SO17 1BJ</postCode>
									<settlement>Southampton</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sebastian</forename><surname>Stein</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Southampton</orgName>
								<address>
									<addrLine>University Road, Highfield</addrLine>
									<postCode>SO17 1BJ</postCode>
									<settlement>Southampton</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Enrico</forename><surname>Gerding</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Southampton</orgName>
								<address>
									<addrLine>University Road, Highfield</addrLine>
									<postCode>SO17 1BJ</postCode>
									<settlement>Southampton</settlement>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Ghaithaa</forename><surname>Manla</surname></persName>
							<email>manla.ghaithaa@yunextraffic.com</email>
							<affiliation key="aff1">
								<orgName type="department">Yunex Traffic</orgName>
								<address>
									<addrLine>Sopers Lane</addrLine>
									<postCode>BH17 7ER</postCode>
									<settlement>Poole</settlement>
									<region>Dorset</region>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Reward Function Design in Multi-Agent Reinforcement Learning for Traffic Signal Control</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">9A9551CDA8E74044899D2A22F0A26094</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:08+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Traffic Signal Control</term>
					<term>Intelligent Traffic Management</term>
					<term>Reinforcement Learning</term>
					<term>Problem of Non-Stationarity</term>
					<term>Multi-Agent Reinforcement Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In recent years, there has been increased interest in Reinforcement Learning (RL) for Traffic Signal Control (TSC), with implementations of RL touted as a potential successor to the current commercial solutions in place. Commercial systems, such as Microprocessor Optimised Vehicle Actuation (MOVA) and Split, Cycle, and Offset Optimisation Technique (SCOOT), can adapt to the changing traffic state, but do not learn the specific traffic characteristics of an intersection, and leave much to be desired when performance is compared to the potential benefits of using RL for TSC. Furthermore, distributed RL can provide the unique benefits of scalability and decentralisation for road infrastructure. However, using RL for TSC introduces the problem of non-stationarity where the changing policies of RL agents, tasked with optimal control of traffic signals, directly impacts the observed state of the system and therefore the policies of other agents. This non-stationarity can be mitigated through careful consideration and selection of an appropriate reward function. However, existing literature does not consider the impact of the reward function on the performance of agents in a non-stationary environment such as TSC. In this paper, we select 12 reward functions from the literature, and empirically evaluate them compared to a baseline of a commercial solution in a multi-agent setting. Furthermore, we are particularly interested in the performance of agents when used in a real-world scenario, and so we use demand calibrated data from Ingolstadt, Germany to compare the average waiting time and trip duration of vehicles. We find that reward functions which often perform well in a single intersection setting may not outperform commercial solutions in a multi-agent setting due to their impact on the demand profile of other agents. 
Furthermore, the reward functions which include the waiting time of vehicles produce the most predictable demand profile, in turn leading to greater throughput than the alternative proposed solutions.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Reinforcement Learning (RL) for Traffic Signal Control (TSC) is an area which has been investigated in detail as a potential improvement on the current adaptive systems in use. Current commercially available systems do not use RL, and require manual setup of signal timings for each intersection, something which can be time-consuming to do and can have a negative impact on traffic flow if not configured correctly. In the UK, MOVA <ref type="bibr" target="#b0">[1]</ref> (Microprocessor Optimised Vehicle Actuation) and SCOOT <ref type="bibr" target="#b1">[2]</ref> (Split, Cycle, and Offset Optimisation Technique) are the most widely implemented commercial systems, with the latter being used mainly for regions of up to 30 traffic signal junctions. While adaptive (extending green signals when traffic demand is high in a given direction), these algorithms do not use RL to learn the specific characteristics of a traffic signal. The design of these algorithms was completed in the 1980s, and the iterative improvements made since then have not taken advantage of the vast amount of information available now from roadside sensors. In addition to this, modern approaches to the TSC problem can employ more advanced data sources such as traffic cameras, and information from connected and autonomous vehicles, allowing for a more accurate picture of the traffic flow through a road network. 
Furthermore, this allows for prioritisation of certain types of traffic, where appropriate, such as allowing heavy goods vehicles (HGVs) to pass through lights and avoid deceleration (followed by acceleration), or clearing the road network in a certain direction to allow for easier passage of emergency vehicles attending to an emergency.</p><p>RL based approaches for TSC, whilst not exposed to the decades of development which current approaches in use have had, have still been shown to outperform well-calibrated systems in simulations <ref type="bibr" target="#b2">[3]</ref>. Current state-of-the-art approaches make use of some innovative methods such as junction pressure <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>, convolutional neural networks <ref type="bibr" target="#b5">[6]</ref> and graph attention networks <ref type="bibr" target="#b6">[7]</ref>.</p><p>Introducing independent RL agents at each intersection within a road network has a number of benefits. Firstly, it allows for easier scalability when compared to a centralised system as changes to the road network such as the addition of new roads or traffic signals can be tolerated by introducing new agents, rather than re-training or modifying a central system. Secondly, the state and action space of a centralised agent increases exponentially when more traffic signals are introduced, leading to the curse of dimensionality <ref type="bibr" target="#b7">[8]</ref>. Independent RL agents deployed to each intersection suffer from neither of these problems, and each agent can learn the specific characteristics of the intersection under their control. However, a problem emerges when we consider the simultaneous learning process which is used to train the independent agents. 
As an agent updates its policy based on its observations, it alters the demand profile seen at connected intersections, and so the optimal policies of the agents there may change as a result. We refer to this as the problem of non-stationarity.</p><p>In this paper, we evaluate reward functions from the literature and review them in the context of a real-world multi-agent scenario, using calibrated data from Ingolstadt, Germany, to test them, including an implementation of a commonly used commercial solution, MOVA, as the baseline. We highlight the impact of reward functions on the ability of the agent to learn, and how solutions to the problem of non-stationarity may not be feasible in the real world when applied in the TSC context. To evaluate the performance of different reward functions, we compare the waiting time and trip duration of vehicles.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background and Related Work</head><p>The non-stationary problem is one which has been observed in many multi-agent RL contexts <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>. We define the non-stationary problem as when independent agents are in an environment where they take actions to optimise their policy, aided by a reward function, but the actions of these agents impact the surrounding agents. The changing environment can be referred to as non-stationary. When thought of in the context of the TSC problem, we encounter a changing environment when agents change their policies to one which they believe is more optimal. This change can impact the demand characteristics which other agents see, and may result in their own policies no longer being optimal. Furthermore, in addition to the reason of changing road networks, as well as the curse of dimensionality, it is also not feasible to have a centralised system to learn and control traffic timings as computational complexity exponentially grows in the numbers of lanes and junctions <ref type="bibr" target="#b11">[12]</ref>.</p><p>A potential solution to this problem is to employ an actor-critic (AC) algorithm <ref type="bibr" target="#b12">[13]</ref> for each agent, with a common critic. In the context of TSC, multi-agent AC and the derivatives have been implemented and tested, with Feudal AC <ref type="bibr" target="#b13">[14]</ref> evaluated by Ault et al. <ref type="bibr" target="#b5">[6]</ref>, and their investigation found that they perform similarly to Deep Q-Learning algorithms but take significantly longer to converge on the solution. An alternative approach to the problem of non-stationarity is to introduce a form of communication between agents. Foerster et al. 
<ref type="bibr" target="#b14">[15]</ref> introduce a Deep Distributed Recurrent Q-Network, where agents share hidden layers and are tasked with developing a communication protocol to expedite the solving of communicationbased coordination tasks. Sukhbaatar et al. <ref type="bibr" target="#b15">[16]</ref> introduced the architecture of CommNet, which incorporates a communication message, the average of the previous hidden layers from all other agents into the input of each layer of the agent. However, for both AC approaches and communication between agents, the issues around scalability remain, and may require the critic or communicative agent to be retrained when changes are made to the road network.</p><p>In work by Cabrejas-Egea et al. <ref type="bibr" target="#b16">[17]</ref>, an assessment of 15 common reward functions, aggregated into 5 groups (queue-length based rewards, waiting time based rewards, delay based rewards, average speed based rewards and throughput based rewards), is performed and it is found that average speed maximisation reduces the average vehicle waiting time. However, this was performed in a single agent scenario, with one junction. Whilst maximising speed may perform best in isolated junctions, it is unknown how nearby junctions will be affected. Wei et al. <ref type="bibr" target="#b17">[18]</ref> provides more details on alternative approaches in RL for TSC, including the state and reward functions employed and the dataset used to verify results.</p><p>Moreover, it is suggested that there is a significant gap between the performance of agents in synthetic benchmarks and calibrated data from the real world. Ault et al. 
<ref type="bibr" target="#b5">[6]</ref> compared implementations of MPLight <ref type="bibr" target="#b18">[19]</ref>, FMA2C <ref type="bibr" target="#b13">[14]</ref> and DQN based approaches <ref type="bibr" target="#b19">[20]</ref> (among others) and concluded that whilst synthetic benchmarks can prove challenging for RL agents, there is a difference in performance between them and calibrated data. There is a gap in the literature to explore whether this continues into reward functions as well. Specifically, we are interested in the performance of reward functions in realistic traffic scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Problem Formulation</head><p>The TSC problem can be formulated as a Partially Observable Markov Decision Process (POMDP) <ref type="bibr" target="#b20">[21]</ref> &lt; 𝑆, 𝐴, ℙ, 𝑅, Ω, 𝑂, 𝛾 &gt;, defined as 𝑆, the set of states, 𝐴, the set of possible actions, ℙ(𝑠 𝑡 , 𝑎, 𝑠 𝑡+1 ) ∶ 𝑆 × 𝐴 × 𝑆 → [0, 1], the state transition function, 𝑅(𝑠, 𝑎, 𝑠 ′ ) ∶ 𝑆 × 𝐴 → ℝ which describes the likelihood of transitioning from 𝑠 to 𝑠 ′ when action 𝑎 is taken, Ω, the set of observations, 𝑂, the set of conditional observation probabilities 𝑂 ∶ 𝑆 × 𝐴 × Ω → [0, 1], and 𝛾, the discount factor. We define this problem as a POMDP rather than a standard Markov Decision Process due to the limitations in sensor capability and knowledge of the global state. Therefore, 𝑆 can be defined as the state of the system, contrasted to Ω, the observations of the system state from the sensors at an intersection.</p><p>The choice of reward function 𝑅 is important to the performance of our agents. In the TSC problem, the high-level aim is to maximise throughput of vehicular traffic across all traffic signals. Part of increasing throughput is to reduce vehicle waiting time and increase average speed as these two factors directly contribute to how quickly vehicles reach their destination. However, for the same reasons that it is not feasible to use a centralised single agent to control all the intersections, it is not feasible to incorporate the total throughput of all agents as a reward function. Furthermore, the problem of non-stationarity is still prevalent as the reward that agents see will now be explicitly and directly impacted by the policies of other agents.</p><p>In the context of TSC, we define a phase 𝜑 as a group of non-conflicting green lights at a signalised intersection, and a signalised intersection as having a finite set of phases Φ such that 𝜑 ∈ Φ. 
In each intersection, we construct the state space (𝑆) as a combination of the current phase the intersection has selected and the observation of the current traffic state. In addition to this, we can define the action space (𝐴) for an agent as Φ. If the selected phase is a change to the current phase, a mandatory yellow phase must be inserted, and the selected phase must also be held for at least the minimum time limit <ref type="bibr" target="#b21">[22]</ref>. Each intersection includes an emulated traffic signal controller, and if an agent selects a different phase, or an action which does not fulfil the mandatory requirements, the traffic controller enforces the legal safety requirements. The reward function 𝑅 differs between implementations, and how to choose it is the focus of our paper.</p><p>It should also be mentioned that by describing the TSC problem as a POMDP, we are assuming that the TSC problem fulfils the Markov property, that is, that the process of TSC is memory-less (the next state depends only on the action taken from the current state). Formally, given a state history 𝑆 𝐻 :</p><formula xml:id="formula_0">𝑆 𝐻 = 𝑆 0 , 𝑆 1 , ..., 𝑆 𝑡<label>(1)</label></formula><p>then, if following the Markov property:</p><formula xml:id="formula_1">ℙ(𝑆 𝑡+1 |𝑆 𝐻 ) = ℙ(𝑆 𝑡+1 |𝑆 0 , 𝑆 1 , ..., 𝑆 𝑡 ) = ℙ(𝑆 𝑡+1 |𝑆 𝑡 )<label>(2)</label></formula><p>When applied to the context of TSC, it may seem like this assumption does not hold true, as traffic has known periodic cycles of greater and lesser demand. Son et al. <ref type="bibr" target="#b22">[23]</ref> showed that fluctuations in traffic flow (and seasonality) can be modelled using a Fourier Transform, and used to make predictions about future traffic flow. 
However, this is only possible when the states of traffic signals are viewed over a period of days to weeks; in the TSC problem, the temporal horizon is very small (seconds to minutes), at a timescale too short to make assumptions regarding traffic seasonality. With the assumption that TSC does fulfil the Markov property, and taking into consideration the computational complexity of solving POMDPs, we model the problem as a regular MDP.</p><p>When RL is applied to this MDP, the aim of the agents is to learn a policy 𝜋 to maximise the future discounted reward defined by:</p><formula xml:id="formula_2">∞ ∑ 𝑡=0 𝛾 𝑡 𝑅(𝑠 𝑡 , 𝑎 𝑡 )<label>(3)</label></formula><p>where 𝛾 ∈ [0, 1]. Q-learning, an off-policy, model-free, value-based RL algorithm, is an effective and powerful tool for solving MDPs and has been shown to find an optimal policy (one which maximises expected total discounted reward) in any finite MDP <ref type="bibr" target="#b23">[24]</ref>. This approach aims to learn the optimal action-value (Q) function 𝑄 * (𝑠, 𝑎) for a state 𝑠 and action 𝑎 when the optimal policy 𝜋 * is followed:</p><formula xml:id="formula_3">𝑄 * (𝑠, 𝑎) = 𝔼[𝑟|𝑠, 𝑎] + 𝛾 ∑ 𝑠 ′ ℙ(𝑠 ′ |𝑠, 𝑎) max 𝑎 ′ 𝑄 * (𝑠 ′ , 𝑎 ′ )<label>(4)</label></formula><p>Q-learning, in this form, is a table-based algorithm which recursively approximates 𝑄 * (𝑠, 𝑎) through iterative Bellman updates with a learning rate 𝛼 and a temporal difference target 𝑦 𝑡 for the Q-function:</p><formula xml:id="formula_4">𝑄(𝑠 𝑡 , 𝑎 𝑡 ) ← 𝑄(𝑠 𝑡 , 𝑎 𝑡 ) + 𝛼(𝑦 𝑡 − 𝑄(𝑠 𝑡 , 𝑎 𝑡 )) 𝑦 𝑡 = 𝑅 𝑡 + 𝛾 max 𝑎 𝑡+1 𝑄(𝑠 𝑡+1 , 𝑎 𝑡+1 )<label>(5)</label></formula><p>A major improvement to Q-learning performance came from using a convolutional neural network as the Q-value estimator, combined with a novel experience replay mechanism and an iterative periodic update process, which allowed the Deep Q-Network (DQN) agent to converge on an optimal policy when tested on the Atari 2600 benchmark <ref type="bibr" target="#b24">[25]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Reward Functions</head><p>The following functions are experimentally reviewed. We review reward functions from the literature <ref type="bibr" target="#b0">(1,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b10">11)</ref> and propose some functions here (2, 3, 5, 7, 9, 12), inspired by the previously proposed algorithms. We define 𝑉 𝑡 as the set of vehicles in incoming lanes and 𝑚 𝑣 as as the speed of vehicle 𝑣. Furthermore, we define 𝜏 𝑣 as the waiting time of vehicles. Similar to the definition of upstream traffic 𝑉 𝑡 , we define pressure as 𝑃 𝑥 where 𝑥 ∈ {𝑢𝑝, 𝑑𝑜𝑤𝑛}, and {𝑢𝑝, 𝑑𝑜𝑤𝑛} representing the upstream and downstream traffic flows respectively.</p><p>1. Average Speed: Used in <ref type="bibr" target="#b25">[26]</ref>, we aim for the agent to maximise the flow of vehicles by reducing the amount of time stopped or at low speeds. This is the optimal solution proposed by <ref type="bibr" target="#b16">[17]</ref>.</p><formula xml:id="formula_5">𝑟 𝑡 = 1 |𝑉 𝑡 | ∑ 𝑣∈𝑉 𝑡 𝑚 𝑣<label>(6)</label></formula><p>2. Average Speed Normalised: By normalising the average speed with the maximum observed speed in a lane 𝑚 𝑚𝑎𝑥 (defined as max 𝑉 𝑡 (𝑚 𝑣 )), we aim to reduce any problems caused by different speed limits in the approaches to the junction.</p><formula xml:id="formula_6">𝑟 𝑡 = 1 |𝑉 𝑡 | ∑ 𝑣∈𝑉 𝑡 𝑚 𝑣 𝑚 𝑚𝑎𝑥<label>(7)</label></formula><p>3. Maximum Wait Time: This approach prioritises the vehicles which have been waiting the longest.</p><formula xml:id="formula_7">𝑟 𝑡 = − max {𝑣∈𝑉 𝑡 } 𝜏 𝑡<label>(8)</label></formula><p>4. Aggregate Wait Time: As suggested by <ref type="bibr" target="#b26">[27]</ref> ,the reward is the negative sum of the wait time of all the queuing cars.</p><formula xml:id="formula_8">𝑟 𝑡 = − ∑ 𝑣∈𝑉 𝑡 𝜏 𝑡<label>(9)</label></formula><p>5. 
Aggregate Wait Time Normalised: Similar to Aggregate Wait Time, but we use the maximum waiting time to normalise the value. This is so the agent is not forced into acting in a first in, first out manner, which may happen with just using Aggregate Wait Time.</p><formula xml:id="formula_9">𝑟 𝑡 = − ∑ 𝑣∈𝑉 𝑡 𝜏 𝑣 𝜏 𝑚𝑎𝑥<label>(10)</label></formula><p>6. Pressure: Used in <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b27">28,</ref><ref type="bibr" target="#b28">29,</ref><ref type="bibr" target="#b18">19]</ref>, pressure is a very common reward function, and is defined as the difference in vehicle density between the upstream lanes 𝑃 𝑢𝑝 and downstream lanes 𝑃 𝑑𝑜𝑤𝑛 . This approach has shown promise in simulations which use synthetic or grid based city layouts, and has been shown to synchronise the green phases of the main roads <ref type="bibr" target="#b29">[30]</ref>.</p><formula xml:id="formula_10">𝑟 𝑡 = −𝑃 𝑖 = −(𝑃 𝑢𝑝 − 𝑃 𝑑𝑜𝑤𝑛 )<label>(11)</label></formula><p>7. Pressure Squared: Following on from pressure, we implemented pressure squared to test whether more heavily penalising actions which lead to increased pressure is an effective approach to the reward function.</p><formula xml:id="formula_11">𝑟 𝑡 = −(𝑃 𝑖 ) 2<label>(12)</label></formula><p>8. Queue: This reward function is trivial to calculate and implement in the real world, and is used in some VA implementations. In addition, it is one of the most common reward functions used in implementations, as seen in <ref type="bibr" target="#b17">[18]</ref>.</p><formula xml:id="formula_12">𝑟 𝑡 = −|𝑉 𝑡 |<label>(13)</label></formula><p>9. Queue Squared: This reward function further penalises the actions which lead to larger queues. This was included due to the multi-agent scenario, as reducing the number of queuing cars could increase the predictability of the traffic flow outbound from an intersection.</p><formula xml:id="formula_13">𝑟 𝑡 = −(|𝑉 𝑡 |) 2<label>(14)</label></formula><p>10. 
Maximum Wait Aggregated Queue (MWAQ): In this reward function, we multiply the maximum waiting time by the length of the queue to approximate the worst-case aggregate time waited by all the cars. This approach is a modification of the approach used by Ma et al. <ref type="bibr" target="#b13">[14]</ref>.</p><formula xml:id="formula_14">𝑟 𝑡 = −(max {𝑣∈𝑉 𝑡 } 𝜏 𝑣 * ∑ 𝑛∈𝑁 𝑞 𝑡 )<label>(15)</label></formula><p>11. Neighbourhood Adjusted Maximum Wait (NAMW): In this approach, we include basic information (the number of vehicles) from neighbouring intersections, as demonstrated in <ref type="bibr" target="#b30">[31]</ref>. This may pose some implementation problems in real-world use due to the changing nature of traffic networks. However, this information is collectable via the most common type of sensor used on UK roads, induction loop sensors, which are low-cost and effective. In addition, it is possible to retrofit these sensors into existing infrastructure <ref type="bibr" target="#b31">[32]</ref>.</p><formula xml:id="formula_15">𝑟 𝑡 = −(max {𝑣∈𝑉 𝑡 } 𝜏 𝑣 + 𝛾 max {𝑣∈𝑉 𝑡𝑛 } 𝜏 𝑣 )<label>(16)</label></formula><p>where 𝑉 𝑡𝑛 is the set of vehicles at neighbouring intersections. In the definition of NAMW, we include an additional discount factor 𝛾. This value is applied to the information from the neighbouring intersections, to ensure that the component of this function which has the greatest impact on the overall value is the component from the agent in question.</p><p>12. MOVA (referred to as VA in our results): As a benchmark, we used an implementation of one of the most commonly deployed TSC algorithms in the UK, MOVA <ref type="bibr" target="#b0">[1]</ref>.</p></div>
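As an illustration, several of the reward functions above can be written as short Python functions over per-vehicle measurements; the list-based inputs (speeds, waiting times, lane counts) are simplified stand-ins for the sensor observations used in our experiments, not the paper's actual implementation.

```python
def avg_speed(speeds):
    """Reward 1 (Equation 6): mean speed of vehicles on incoming lanes."""
    return sum(speeds) / len(speeds) if speeds else 0.0

def aggregate_wait(waits):
    """Reward 4 (Equation 9): negative sum of vehicle waiting times."""
    return -sum(waits)

def pressure(n_up, n_down):
    """Reward 6 (Equation 11): negative difference between upstream and
    downstream vehicle counts."""
    return -(n_up - n_down)

def mwaq(waits, queue_length):
    """Reward 10 (Equation 15): negative product of the maximum wait and the
    queue length, approximating the worst-case aggregate wait."""
    return -(max(waits) * queue_length) if waits else 0.0
```

Note that all rewards are negated costs except average speed, so a larger (less negative) value is always better for the agent.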
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experimental Setup</head><p>We used the RESCO benchmarking environment as introduced by Ault et al. <ref type="bibr" target="#b5">[6]</ref>, based on the Simulation of Urban MObility (SUMO) simulator. Included in RESCO is the Ingolstadt environment <ref type="bibr" target="#b32">[33]</ref>, a demand-calibrated scenario for SUMO. The traffic network and traffic demand were set up as described in the Ingolstadt scenario <ref type="bibr" target="#b32">[33]</ref>.</p><p>We chose to use Deep Q-Learning for all of our agents as Deep Q-Learning is commonly used within the literature <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b33">34]</ref>. Furthermore, it was found by Genders et al. that the agent is not sensitive to the state representation <ref type="bibr" target="#b34">[35]</ref>, and so in our experiments we chose to use the state representation provided by <ref type="bibr" target="#b19">[20]</ref>. This state definition at an intersection includes number of vehicles in each incoming lane, the speed of the incoming vehicles, the queue length and the total waiting time of the vehicles at that intersection. The DQN used was implemented in PyTorch, and included a convolutional layer, followed by two fully connected layers of 32 neurons. The parameters for the DQN were set as in <ref type="bibr" target="#b19">[20]</ref>.</p><p>Each reward function was repeated with 𝑛 = 20, with the total waiting time calculated for each run, and the average of this cumulative waiting time was used to evaluate the functions. Moreover, in our initial experiments, we found that the traffic scenario did not include enough vehicles to saturate the road network, and definitively test the reward functions. In order to resolve this, we chose to modify the traffic scale option within SUMO. This option, which is set to 1 by default, proportionally increases the traffic by that percentage. 
We set it to 1.5, meaning that each car in the network had a 50% chance of being duplicated. We chose this over generating random data as it maintains the flow of traffic seen in the Ingolstadt dataset. The scale of 1.5 was a compromise, owing to a quirk in how SUMO processes uncompleted journeys at the end of the simulation. If cars do not arrive at their destination by the simulation end time, they are not included in the output data, leading to misleading results, as worse-performing agents can appear to outperform those which (despite long delays) achieve a greater throughput of vehicles.</p></div>
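The demand-scaling step can be illustrated with a short sketch. This is a minimal stand-in that mimics the effect of SUMO's traffic scale option for factors between 1 and 2; it is not SUMO's actual implementation, and the vehicle-id suffix is a hypothetical naming choice.

```python
import random

def scale_demand(vehicle_ids, scale=1.5, seed=42):
    """Duplicate each vehicle with probability (scale - 1), for 1 <= scale <= 2.

    With scale = 1.5, each car has a 50% chance of being duplicated,
    as in our experiments."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    scaled = []
    for vid in vehicle_ids:
        scaled.append(vid)
        if rng.random() < scale - 1.0:
            scaled.append(vid + ".dup")  # hypothetical id for the copied vehicle
    return scaled
```

Because existing trips are duplicated rather than generated at random, the spatial and temporal pattern of demand in the calibrated scenario is preserved while the load on the network increases.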
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Results</head><p>Figures 1 and 2 contain box plots of our results for traffic scale factors of 1 and 1.5, respectively. Tables <ref type="table" target="#tab_1">1 and 2</ref> contains the tabular waiting time results for the traffic scale factor of 1 and 1.5 Our initial run with the default traffic flow found that pressure (pressure and pressure squared) and average speed (average speed and average speed normalised) based methods were the only methods to not outperform the MOVA/VA benchmark when the traffic scale factor was set to 1 and the traffic did not saturate (or near-saturate) the network. Whilst no conclusions can be made between the RL algorithms in this scenario, 7 of the reward functions used outperformed the benchmark, showing that there is significant potential in the use of RL for TSC. Furthermore, we note that the average speed and ASN reward functions performed significantly worse than in <ref type="bibr" target="#b16">[17]</ref> when used in a multi-agent scenario. We speculate that this is caused by the problem of non-stationarity as agents could struggle to differentiate what is causing their low reward results when a signal upstream is essentially controlling the flow of traffic into that junction. Furthermore, junctions which see few vehicles passing through are likely to be impacted more by the changing policy of upstream intersections, a factor which could penalise the average speed functions moreso.</p><p>In addition, the pressure based reward functions did not outperform the benchmark either. We hypothesise that this is in part caused by the structure of the road layout, and the type of road layout which was used to develop these algorithms. Whilst these algorithms may perform well in arterial road layouts <ref type="bibr" target="#b3">[4]</ref> and grid based layouts, they may struggle when faced with other road networks. 
Such grid and arterial layouts are rare in Europe, which is where our data originates.</p><p>Once the traffic scale was increased to 1.5, a greater divergence is seen between the functions used. We again see the average speed and pressure based functions being outperformed by the baseline. It is also important to note the reduced variance of the VA baseline. In real-world deployments, this may be important as it increases the predictability of the algorithm.</p><p>Whilst MWAQ was only slightly better than the other algorithms at the higher traffic scale, the reduced variance across runs means that in almost every run, it outperformed the other algorithms. We believe that this improvement is due to the fact that when only queue metrics are used, priority will almost certainly be given to the direction which has the greatest flow of traffic, leaving some cars travelling perpendicular to that flow to wait for a significant amount of time. When the maximum wait time is included, the agent is less likely to prioritise the main flow of traffic as often.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>In this paper, we discuss the non-stationary problem and how it may impact the use of RL for the TSC problem. We evaluate 11 different reward functions, including some of the more commonly used examples, and compare them to a benchmark of a real world function through simulations on a calibrated dataset from Ingolstadt. We believe that one potential avenue of further work is to conduct these experiments on a larger scenario, which would allow for further validation of the optimal reward function for use in the TSC problem. There may also be benefits to an ensemble approach to the TSC, where multiple agents with different reward functions are used to come to a conclusion on the optimal decision. Moreover, an approach which could be explored is to employ pretraining on new agents, training each agent on an individual intersection with the same number of lanes as the one they will control before being implemented in the network with other agents. Whilst this may increase the time required to train an agent, it may allow all agents to converge on a solution sooner.</p><p>Additionally, there could be a focus on the environmental impacts of using one reward function over another. For example, HGVs emit significantly more emissions when they accelerate compared to private vehicles. Therefore, an algorithm which does not differentiate between these types of vehicles will not prioritise this (or the impact on traffic once the HGV slows down, and the corresponding environmental impact of this), and therefore cause more damage to the environment.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :Figure 2 :</head><label>12</label><figDesc>Figure 1: Average time spent waiting for each vehicle with traffic scale of 1</figDesc><graphic coords="8,151.80,84.19,291.69,185.77" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Results from Ingolstadt dataset with a traffic scale factor of 1. All measurements are in seconds, and sorted by the mean.</figDesc><table><row><cell>Waiting Time</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Results from Ingolstadt dataset with a traffic scale factor of 1.5. All measurements are in seconds, and sorted by the mean.</figDesc><table /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>Behrad Koohy is supported by an ICASE studentship funded by the Engineering and Physical Sciences Research Council (EPSRC) and Yunex Traffic. Enrico Gerding and Sebastian Stein are funded by the EPSRC AutoTrust platform grant (EP/R029563/1). Sebastian Stein is also supported by an EPSRC Turing AI Acceleration Fellowship on Citizen-Centric AI Systems (EP/V022067/1).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Vincent</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Peirce</surname></persName>
		</author>
		<title level="m">MOVA: Traffic responsive, self-optimising signal control for isolated intersections</title>
				<imprint>
			<date type="published" when="1988">1988</date>
		</imprint>
	</monogr>
	<note type="report_type">TRRL Research Report</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The SCOOT on-line traffic signal optimisation technique</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">B</forename><surname>Hunt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">I</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">D</forename><surname>Bretherton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Royle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Traffic engineering and control</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<date type="published" when="1982">1982</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Reinforcement learning for traffic signal control: comparison with commercial systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Cabrejas-Egea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Walton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transportation research procedia</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="638" to="645" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Presslight: Learning max pressure control to coordinate traffic signals in arterial network</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Gayah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</title>
				<meeting>the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1290" to="1298" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Max pressure control of a network of signalized intersections</title>
		<author>
			<persName><forename type="first">P</forename><surname>Varaiya</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transportation Research Part C: Emerging Technologies</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="177" to="195" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Reinforcement learning benchmarks for traffic signal control</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sharon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Colight: Learning network-level cooperation for traffic signal control</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM International Conference on Information and Knowledge Management</title>
				<meeting>the 28th ACM International Conference on Information and Knowledge Management</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1913" to="1922" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Dynamic programming</title>
		<author>
			<persName><forename type="first">R</forename><surname>Bellman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">153</biblScope>
			<biblScope unit="page" from="34" to="37" />
			<date type="published" when="1966">1966</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Reinforcement learning for non-stationary markov decision processes: The blessing of (more) optimism</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">C</forename><surname>Cheung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Simchi-Levi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1843" to="1854" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Continual reinforcement learning in 3D non-stationary environments</title>
		<author>
			<persName><forename type="first">V</forename><surname>Lomonaco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Desai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Culurciello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Maltoni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="248" to="249" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Choosing search heuristics by non-stationary reinforcement learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Nareyek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Metaheuristics: Computer decision-making</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="523" to="544" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Reinforcement learning with function approximation for traffic signal control</title>
		<author>
			<persName><forename type="first">L</forename><surname>Prashanth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhatnagar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Intelligent Transportation Systems</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="412" to="421" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A convergent actor-critic-based FRL algorithm with application to power management of wireless transmitters</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">R</forename><surname>Berenji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vengerov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Fuzzy Systems</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="478" to="485" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Feudal multi-agent deep reinforcement learning for traffic signal control</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS &apos;20, International Foundation for Autonomous Agents and Multiagent Systems</title>
				<meeting>the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS &apos;20, International Foundation for Autonomous Agents and Multiagent Systems<address><addrLine>Richland, SC</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="816" to="824" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">N</forename><surname>Foerster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">M</forename><surname>Assael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Freitas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Whiteson</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1602.02672</idno>
		<title level="m">Learning to communicate to solve riddles with deep distributed recurrent q-networks</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Learning multiagent communication with backpropagation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sukhbaatar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Assessment of reward functions for reinforcement learning traffic signal control under real-world limitations</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Egea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Howell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Knutins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Connaughton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="965" to="972" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A survey on traffic signal control methods</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Gayah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AAMAS</title>
		<imprint>
			<biblScope unit="volume">2019</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control</title>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>AAAI</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Ault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Hanna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sharon</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.11023</idno>
		<title level="m">Learning an interpretable traffic signal control policy</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A Markovian decision process</title>
		<author>
			<persName><forename type="first">R</forename><surname>Bellman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of mathematics and mechanics</title>
		<imprint>
			<biblScope unit="page" from="679" to="684" />
			<date type="published" when="1957">1957</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Traffic signs manual</title>
		<ptr target="https://www.gov.uk/government/publications/traffic-signs-manual" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
		<respStmt>
			<orgName>Department for Transport</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">A fast vehicular traffic flow prediction scheme based on Fourier and wavelet analysis</title>
		<author>
			<persName><forename type="first">P</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Aljeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Boukerche</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Global Communications Conference (GLOBECOM), IEEE</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Q-learning</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Watkins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dayan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine learning</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="279" to="292" />
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">The arcade learning environment: An evaluation platform for general agents</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Bellemare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Naddaf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Veness</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bowling</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page" from="253" to="279" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Coordinated deep reinforcement learners for traffic light control</title>
		<author>
			<persName><forename type="first">E</forename><surname>Van Der Pol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Oliehoek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Coordinated Deep Reinforcement Learners for Traffic Light Control</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Multi-agent deep reinforcement learning for large-scale traffic signal control</title>
		<author>
			<persName><forename type="first">T</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Codecà</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Intelligent Transportation Systems</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="1086" to="1095" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lü</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.02336</idno>
		<title level="m">Efficient pressure: Improving efficiency for signalized intersections</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Rouphail</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tarko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<title level="m">Traffic flow at signalized intersections</title>
				<imprint>
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Roess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Prassas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Mcshane</surname></persName>
		</author>
		<title level="m">Traffic engineering</title>
				<imprint>
			<publisher>Prentice Hall</publisher>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
	<note>4th ed.. Includes bibliographical references and index</note>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Traffic light control in non-stationary environments based on multi-agent Q-learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Abdoos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Mozayani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Bazzan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">14th International IEEE conference on intelligent transportation systems (ITSC), IEEE</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1580" to="1585" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Road traffic data: Collection methods and applications</title>
		<author>
			<persName><forename type="first">G</forename><surname>Leduc</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Working Papers on Energy, Transport and Climate Change</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1" to="55" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C</forename><surname>Lobo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Neumeier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Fernandez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Facchi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2011.11995</idno>
		<title level="m">InTAS: The Ingolstadt traffic scenario for SUMO</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Computational intelligence in urban traffic signal control: A survey</title>
		<author>
			<persName><forename type="first">D</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="485" to="494" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Evaluating reinforcement learning state representations for adaptive traffic signal control</title>
		<author>
			<persName><forename type="first">W</forename><surname>Genders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Razavi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Procedia computer science</title>
		<imprint>
			<biblScope unit="volume">130</biblScope>
			<biblScope unit="page" from="26" to="33" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
