<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Reward Function Design in Multi-Agent Reinforcement Learning for Traffic Signal Control</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Behrad Koohy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Stein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Gerding</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ghaithaa Manla</string-name>
<email>manla.ghaithaa@yunextraffic.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>ATT'22: Workshop Agents in Traffic and Transportation</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Southampton</institution>
          ,
          <addr-line>University Road, Highfield, Southampton, SO17 1BJ</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
<institution>Yunex Traffic</institution>
          ,
          <addr-line>Sopers Lane, Poole, Dorset, BH17 7ER</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>In recent years, there has been increased interest in Reinforcement Learning (RL) for Traffic Signal Control (TSC), with implementations of RL touted as a potential successor to the current commercial solutions in place. Commercial systems, such as Microprocessor Optimised Vehicle Actuation (MOVA) and Split, Cycle, and Offset Optimisation Technique (SCOOT), can adapt to the changing traffic state, but do not learn the specific traffic characteristics of an intersection, and leave much to be desired when their performance is compared to the potential benefits of using RL for TSC. Furthermore, distributed RL can provide the unique benefits of scalability and decentralisation for road infrastructure. However, using RL for TSC introduces the problem of non-stationarity, where the changing policies of RL agents, tasked with optimal control of traffic signals, directly impact the observed state of the system and therefore the policies of other agents. This non-stationarity can be mitigated through careful consideration and selection of an appropriate reward function. However, the existing literature does not consider the impact of the reward function on the performance of agents in a non-stationary environment such as TSC. In this paper, we select 12 reward functions from the literature and empirically evaluate them against a baseline of a commercial solution in a multi-agent setting. Furthermore, we are particularly interested in the performance of agents when used in a real-world scenario, and so we use demand-calibrated data from Ingolstadt, Germany, to compare the average waiting time and trip duration of vehicles. We find that reward functions which often perform well in a single-intersection setting may not outperform commercial solutions in a multi-agent setting due to their impact on the demand profile of other agents.
Furthermore, the reward functions which include the waiting time of vehicles produce the most predictable demand profile, in turn leading to greater throughput than the alternatively proposed solutions.</p>
      </abstract>
      <kwd-group>
        <kwd>Traffic Signal Control</kwd>
        <kwd>Intelligent Traffic Management</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Problem of Non-Stationarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
Reinforcement Learning (RL) for Traffic Signal Control (TSC) is an area which has been
investigated in detail as a potential improvement on the current adaptive systems in use. Current
commercially available systems do not use RL, and require manual setup of signal timings
for each intersection, something which can be time-consuming and can have a negative
impact on traffic flow if not configured correctly. In the UK, MOVA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] (Microprocessor Optimised Vehicle Actuation) and SCOOT (Split, Cycle, and Offset Optimisation Technique)
are the most widely implemented commercial systems, with the latter being used mainly for
regions of up to 30 traffic signal junctions. While adaptive (extending green signals when
traffic demand is high in a given direction), these algorithms do not use RL to learn the specific
characteristics of a traffic signal. The design of these algorithms was completed in the 1980s,
and the iterative improvements made since then have not taken advantage of the vast amount of
information now available from roadside sensors. In addition, modern approaches to the
TSC problem can employ more advanced data sources, such as traffic cameras and information
from connected and autonomous vehicles, allowing for a more accurate picture of the traffic
flow through a road network. Furthermore, this allows for prioritisation of certain types of
traffic, where appropriate, such as allowing heavy goods vehicles (HGVs) to pass through lights
and avoid deceleration (followed by acceleration), or clearing the road network in a certain
direction to allow for easier passage of emergency vehicles attending to an emergency.</p>
      <p>
RL-based approaches for TSC, whilst not benefiting from the decades of development behind the
current approaches in use, have still been shown to outperform well-calibrated systems in
simulations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Current state-of-the-art approaches make use of some innovative methods such
as junction pressure [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], convolutional neural networks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and graph attention networks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
Introducing independent RL agents at each intersection within a road network has a number
of benefits. Firstly, it allows for easier scalability when compared to a centralised system, as
changes to the road network, such as the addition of new roads or traffic signals, can be tolerated
by introducing new agents rather than re-training or modifying a central system. Secondly, the
state and action space of a centralised agent increases exponentially when more traffic signals
are introduced, leading to the curse of dimensionality [
        <xref ref-type="bibr" rid="ref8">8</xref>
]. Independent RL agents deployed at
each intersection suffer from neither of these problems, and each agent can learn the specific
characteristics of the intersection under its control. However, a problem emerges when we
consider the simultaneous learning process which is used to train the independent agents. As
an agent updates its policy to be optimal given its observations, the optimal policy for
the agents at connected intersections may change based on the impact on the
demand profile of their intersections. We refer to this as the problem of non-stationarity.
      </p>
<p>In this paper, we evaluate reward functions from the literature and review them in the context
of a real-world multi-agent scenario, using calibrated data from Ingolstadt, Germany, to test
them, including an implementation of a commonly used commercial solution, MOVA, as the
baseline. We highlight the impact of reward functions on the ability of the agent to learn, and
how solutions to the problem of non-stationarity may not be feasible when applied in the
real-world TSC context. To evaluate the performance of different reward functions, we compare
the waiting time and trip duration of vehicles.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
The non-stationarity problem has been observed in many multi-agent RL contexts
[
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
]. We define the non-stationarity problem as arising when independent agents, in an
environment where they take actions to optimise their policies, aided by a reward function,
impact the surrounding agents through those actions. The changing environment can
be referred to as non-stationary. In the context of the TSC problem, we
encounter a changing environment when agents change their policies to ones which they believe
are more optimal. This change can impact the demand characteristics which other agents see,
and may result in their own policies no longer being optimal. Furthermore, in addition to
changing road networks and the curse of dimensionality, it is also not feasible
to have a centralised system learn and control traffic timings, as computational complexity
grows exponentially in the number of lanes and junctions [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        A potential solution to this problem is to employ an actor-critic (AC) algorithm [
        <xref ref-type="bibr" rid="ref13">13</xref>
] for
each agent, with a common critic. In the context of TSC, multi-agent AC and its derivatives
have been implemented and tested, with Feudal AC [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] evaluated by Ault et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
], and
their investigation found that these perform similarly to Deep Q-Learning algorithms but take
significantly longer to converge on a solution. An alternative approach to the problem of
non-stationarity is to introduce a form of communication between agents. Foerster et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
]
introduce a Deep Distributed Recurrent Q-Network, where agents share hidden layers and are
tasked with developing a communication protocol to expedite the solving of
communication-based coordination tasks. Sukhbaatar et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
] introduced the CommNet architecture, which
incorporates a communication message, the average of the previous hidden layers from all
other agents, into the input of each layer of the agent. However, for both AC approaches and
communication between agents, the issues around scalability remain, and these may require the critic
or communicative agent to be retrained when changes are made to the road network.
      </p>
      <p>
        In work by Cabrejas-Egea et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
], an assessment of 15 common reward functions,
aggregated into 5 groups (queue-length-based, waiting-time-based, delay-based,
average-speed-based and throughput-based rewards), is performed, and it is
found that average speed maximisation reduces the average vehicle waiting time. However,
this was performed in a single-agent scenario with one junction. Whilst maximising speed may
perform best at isolated junctions, it is unknown how nearby junctions will be affected. Wei et
al. [18] provide more details on alternative approaches in RL for TSC, including the state and
reward functions employed and the datasets used to verify results.
      </p>
      <p>
        Moreover, it is suggested that there is a significant gap between the performance of agents
in synthetic benchmarks and calibrated data from the real world. Ault et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] compared
implementations of MPLight [19], FMA2C [
        <xref ref-type="bibr" rid="ref14">14</xref>
] and DQN-based approaches [20] (among others)
and concluded that whilst synthetic benchmarks can prove challenging for RL agents, there is a
difference in performance between them and calibrated data. There is a gap in the literature in
exploring whether this extends to reward functions as well. Specifically, we are interested in
the performance of reward functions in realistic traffic scenarios.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Formulation</title>
      <p>
The TSC problem can be formulated as a Partially Observable Markov Decision Process
(POMDP) [21] ⟨S, A, ℙ, R, Ω, O, γ⟩, defined as: S, the set of states; A, the set of possible
actions; ℙ(sₜ, aₜ, sₜ₊₁) : S × A × S → [0, 1], the state transition function, which describes the
likelihood of transitioning from s to s′ when action a is taken; R(s, a, s′) : S × A → ℝ, the
reward function; Ω, the set of observations; O, the set of conditional observation probabilities,
O : S × A × Ω → [0, 1]; and γ, the
discount factor. We define this problem as a POMDP rather than a standard Markov Decision
Process due to the limitations in sensor capability and knowledge of the global state. Therefore,
S can be defined as the state of the system, contrasted with Ω, the observations of the system state
from the sensors at an intersection.
      </p>
<p>The choice of reward function R is important to the performance of our agents. In the TSC
problem, the high-level aim is to maximise the throughput of vehicular traffic across all traffic
signals. Part of increasing throughput is to reduce vehicle waiting time and increase average
speed, as these two factors directly contribute to how quickly vehicles reach their destinations.
However, for the same reasons that it is not feasible to use a centralised single agent to control
all the intersections, it is not feasible to incorporate the total throughput of all agents as a
reward function. Furthermore, the problem of non-stationarity is still prevalent, as the reward
that agents see will now be explicitly and directly impacted by the policies of other agents.</p>
<p>In the context of TSC, we define a phase p as a group of non-conflicting green lights at a
signalised intersection, and a signalised intersection as having a finite set of phases Φ such that
p ∈ Φ. For each intersection, we construct the state space S as a combination of the current
phase the intersection has selected and the observation of the current traffic state. In addition,
we can define the action space A for an agent as Φ. If the selected phase is a change from the
current phase, a mandatory yellow phase must be interjected, and the selected phase must
also be held for longer than the minimum limit [22]. Each intersection includes an emulated
traffic signal controller, and if an agent selects a different phase or an action which does not
fulfil the mandatory requirements, the traffic controller enforces the legal safety requirements.
The reward function R differs between implementations, and how to choose it is the focus of
our paper.</p>
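The mandatory-yellow and minimum-green enforcement described above can be sketched as a small wrapper around the agent's requested phase. This is an illustrative sketch under our own assumptions: the class name, phase labels and timing constants are invented, not the paper's emulated controller.

```python
# Hypothetical sketch: a controller that honours an agent's requested phase
# change only after the minimum green time, and inserts the mandatory
# yellow interphase before switching. Timings below are assumed values.

YELLOW_TIME = 3   # seconds of yellow between conflicting phases (assumed)
MIN_GREEN = 5     # minimum green duration before a change is legal (assumed)

class SafeSignalController:
    def __init__(self, phases):
        self.phases = phases          # finite phase set, p in the set of phases
        self.current = phases[0]
        self.pending = phases[0]
        self.time_in_phase = 0
        self.in_yellow = 0            # remaining yellow seconds, 0 if none

    def step(self, requested_phase):
        """Advance one second, applying the agent's request only when legal."""
        self.time_in_phase += 1
        if self.in_yellow > 0:        # finish the yellow interphase first
            self.in_yellow -= 1
            if self.in_yellow == 0:
                self.current = self.pending
                self.time_in_phase = 0
                return self.current
            return "yellow"
        if (requested_phase != self.current
                and self.time_in_phase >= MIN_GREEN):
            self.pending = requested_phase   # defer change until yellow elapses
            self.in_yellow = YELLOW_TIME
            return "yellow"
        return self.current           # premature requests are ignored

ctrl = SafeSignalController(["NS", "EW"])
# The agent requests "EW" every second; the change is only granted after
# MIN_GREEN seconds of "NS", via YELLOW_TIME seconds of yellow.
print([ctrl.step("EW") for _ in range(8)])
```

With the assumed timings, the displayed sequence holds "NS" for the minimum green, shows three seconds of yellow, and only then switches to "EW".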
<p>It should also be mentioned that by describing the TSC problem as a POMDP, we are
assuming that the TSC problem fulfils the Markov property, that is, that the process of TSC is
memoryless (the next state depends only on the action taken from the current
state). Formally, given a state history Hₜ:

Hₜ = s₀, s₁, ..., sₜ

then, if following the Markov property</p>
      <p>ℙ(sₜ₊₁ | Hₜ) = ℙ(sₜ₊₁ | s₀, s₁, ..., sₜ) = ℙ(sₜ₊₁ | sₜ)</p>
<p>When applied to the context of TSC, it may seem like this assumption does not hold,
as traffic has known periodic cycles of greater and lesser demand. Sun et al. [23] showed that
fluctuations in traffic flow (and seasonality) can be modelled using a Fourier transform and
used to make predictions about future traffic demand. However, this is only possible when
the states of traffic signals are viewed over a period of days to weeks, and in the TSC problem,
this temporal horizon is very small (seconds to minutes), at a resolution below that required
to make assumptions regarding traffic seasonality. With the assumption that TSC
does fulfil the Markov property, and taking into consideration the computational complexity of
solving POMDPs, we model the problem as a regular MDP.</p>
<p>
        When RL is applied to this MDP, the aim of the agents is to learn a policy π which maximises the
future discounted reward, defined by:

∑_{t=0}^{∞} γᵗ R(sₜ, aₜ) (1)

where γ ∈ [0, 1].
      </p>
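As a concrete illustration of Eq. (1), a minimal sketch of the discounted return follows; the reward trajectory is invented for the example.

```python
# Minimal sketch of the discounted return: each reward R(s_t, a_t) is
# weighted by gamma**t. The reward values below are illustrative.

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A gamma close to 1 makes the agent value delayed reward more strongly.
print(discounted_return([0.0, 0.0, 10.0], gamma=0.9))   # 0.9**2 * 10 = 8.1
print(discounted_return([1.0, 1.0, 1.0], gamma=1.0))    # 3.0
```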
<p>Q-learning, an off-policy, model-free, value-based RL algorithm, is an effective and powerful
tool for solving MDPs, and has been shown to find an optimal policy (one which maximises
the expected total discounted reward) in any finite MDP [24]. This approach aims to learn the
optimal action-value (Q) function Q*(s, a), for a given state s and action a, when the optimal
policy π* is followed.</p>
<p>Q*(s, a) = 𝔼[r | s, a] + γ ∑_{s′} ℙ(s′ | s, a) max_{a′} Q*(s′, a′) (4)

Q-learning, in this format, takes the form of a table-based algorithm which recursively
approximates Q*(s, a) through iterative Bellman updates with a learning rate of α and a temporal
difference target of Gₜ for the Q-function:</p>
    </sec>
    <sec id="sec-4">
      <title>4. Reward Functions</title>
<p>Q(sₜ, aₜ) ← Q(sₜ, aₜ) + α(Gₜ − Q(sₜ, aₜ)) (2)

Gₜ = rₜ + γ max_{aₜ₊₁} Q(sₜ₊₁, aₜ₊₁) (3)

A major improvement to Q-learning performance came from using a convolutional neural
network as the Q-value estimator, combined with a novel experience replay mechanism and an
iterative periodic update process, which allowed the Deep Q-Network (DQN) agent to converge
on an optimal policy when tested on the Atari 2600 benchmark [25].</p>
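The tabular Bellman update above can be sketched in a few lines. The toy two-state environment, hyperparameters and exploration scheme below are our own illustrative assumptions, not the paper's setup.

```python
# Sketch of one tabular Q-learning update: move Q(s, a) towards the
# temporal difference target G = r + gamma * max_a' Q(s', a').

import random
from collections import defaultdict

random.seed(0)

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply a single Bellman update at learning rate alpha."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Toy chain: from state 0, "right" earns a reward and reaches state 1.
Q = defaultdict(float)
actions = ["left", "right"]
for _ in range(500):
    a = random.choice(actions)            # pure exploration for brevity
    r, s_next = (1.0, 1) if a == "right" else (0.0, 0)
    q_learning_update(Q, 0, a, r, s_next, actions)

assert Q[(0, "right")] > Q[(0, "left")]   # greedy policy chooses "right"
```

After enough updates, Q(0, "right") approaches the one-step reward of 1, while Q(0, "left") is bounded below it by the discount factor.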
      <p>
The following functions are experimentally reviewed. We review reward functions from the
literature (1, 3, 4, 6, 8, 11) and propose some functions here (2, 3, 5, 7, 9, 12), inspired by the
previously proposed algorithms. We define V as the set of vehicles in the incoming lanes and sᵥ
as the speed of vehicle v. Furthermore, we define wᵥ as the waiting time of vehicle v. We denote
the number of vehicles in the upstream and downstream lanes as N_up and N_down respectively,
and define pressure as ρ = N_up − N_down.
1. Average Speed: Used in [26], we aim for the agent to maximise the flow of vehicles
by reducing the amount of time stopped or at low speeds. This is the optimal solution
proposed by [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>r = (1/|V|) ∑_{v∈V} sᵥ (5)</p>
      <p>2. Average Speed Normalised: By normalising the average speed with the maximum
observed speed in a lane l (defined as s_max(l) = max_{v∈l} sᵥ), we aim to reduce any problems
caused by different speed limits on the approaches to the junction.</p>
      <p>r = (1/|V|) ∑_{v∈V} sᵥ / s_max(l) (6)</p>
      <p>3. Maximum Wait Time: This approach prioritises the vehicles which have been waiting
the longest.</p>
      <p>r = −max_{v∈V} wᵥ (7)</p>
      <p>4. Aggregate Wait Time: As suggested by [27], the reward is the negative sum of the wait
times of all the queuing cars.</p>
      <p>r = −∑_{v∈V} wᵥ (8)</p>
      <p>5. Aggregate Wait Time Normalised: Similar to Aggregate Wait Time, but we use the
maximum waiting time to normalise the value. This is so the agent is not forced into
acting in a first-in, first-out manner, which may happen when just using Aggregate Wait
Time.</p>
      <p>r = −(∑_{v∈V} wᵥ) / max_{v∈V} wᵥ (9)</p>
      <p>
6. Pressure: Used in [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5, 28, 29, 19</xref>
        ], pressure is a very common reward function, and is
defined as the difference of vehicle density between the upstream and downstream lanes. This
approach has been promising in simulations which use synthetic or grid-based city layouts, and
has been shown to synchronise the green phases of the main roads [30].
      </p>
      <p>r = −ρ = −(N_up − N_down) (10)</p>
      <p>7. Pressure Squared: Following on from pressure, we implemented pressure squared to
test whether penalising actions which lead to increased pressure is an effective approach to the
reward function.</p>
      <p>r = −ρ² (11)</p>
      <p>8. Queue: This reward function is trivial to calculate and implement in the real world, and
is used in some VA implementations. In addition, it is one of the most common reward
functions used in implementations, as seen in [18].</p>
      <p>r = −|Q|, where Q is the set of queuing vehicles (12)</p>
      <p>
9. Queue Squared: This reward function further penalises the actions which lead to a larger
queue. This was included due to the multi-agent scenario, as reducing the number
of queuing cars could increase the predictability of the traffic flow outbound from an
intersection.
      </p>
      <p>r = −|Q|² (13)</p>
      <p>10. Maximum Wait Aggregated Queue (MWAQ): In this reward function, we use the value
of the maximum waiting time multiplied by the length of the queue to approximate
the worst-case aggregate time waited for all the cars. This approach is a modification of
the approach used by Ma et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].</p>
      <p>r = −(max_{v∈V} wᵥ × |Q|) (14)</p>
      <p>11. Neighbourhood Adjusted Maximum Wait (NAMW): In this approach, we include
basic information (the number of vehicles) from a neighbouring intersection, as demonstrated
in [31]. This may pose some implementational problems in real-world use due to the
changing nature of traffic networks. However, this information is collectable via the most
common type of sensor used on UK roads, induction loop sensors, which are low-cost and
effective. In addition, it is possible to retrofit these sensors into existing infrastructure
[32].</p>
<p>r = −(max_{v∈V} wᵥ + δ max_{v∈V_N} wᵥ) (15)</p>
      <p>
        where V_N is the set of vehicles at the neighbouring intersections.
In the definition of NAMW, we include an additional discount factor δ. This value is applied
to the information from the neighbouring intersections, to ensure that the component of
this function which has the greatest impact on the overall value is the component from
the agent in question.
12. MOVA (referred to as VA in our results): As a benchmark, we used an implementation
of one of the most commonly found TSC algorithms in the UK, MOVA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
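Several of the reward functions above reduce to one-liners over per-vehicle measurements. The sketch below is illustrative: the function names and the toy numbers are ours, with item numbers referring to the list above.

```python
# Illustrative implementations of a few reward functions from the list above,
# computed from per-vehicle measurements at one intersection.

def avg_speed(speeds):
    """Item 1, Average Speed: mean speed of vehicles on incoming lanes."""
    return sum(speeds) / len(speeds)

def max_wait(waits):
    """Item 3, Maximum Wait Time: prioritise the longest-waiting vehicle."""
    return -max(waits)

def aggregate_wait(waits):
    """Item 4, Aggregate Wait Time: negative sum of all waiting times."""
    return -sum(waits)

def pressure(n_upstream, n_downstream):
    """Item 6, Pressure: negative upstream/downstream count difference."""
    return -(n_upstream - n_downstream)

def mwaq(waits, queue_len):
    """Item 10, MWAQ: worst-case aggregate wait, max wait times queue length."""
    return -(max(waits) * queue_len)

# Three queuing vehicles that have waited 10, 25 and 5 seconds:
waits = [10, 25, 5]
print(max_wait(waits))          # -25
print(aggregate_wait(waits))    # -40
print(mwaq(waits, len(waits)))  # -75
```

Note how MWAQ upper-bounds the aggregate wait (-75 versus -40 here), which is what makes it sensitive to a single long-waiting vehicle.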
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <p>
        We used the RESCO benchmarking environment as introduced by Ault et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], based on
the Simulation of Urban MObility (SUMO) simulator. Included in RESCO is the Ingolstadt
environment [33], a demand-calibrated scenario for SUMO. The traffic network and traffic
demand were set up as described in the Ingolstadt scenario [33].
      </p>
<p>We chose to use Deep Q-Learning for all of our agents, as Deep Q-Learning is commonly
used within the literature [18, 34]. Furthermore, it was found by Genders et al. that the agent is
not sensitive to the state representation [35], and so in our experiments we chose to use the
state representation provided by [20]. This state definition at an intersection includes the number
of vehicles in each incoming lane, the speed of the incoming vehicles, the queue length and
the total waiting time of the vehicles at that intersection. The DQN used was implemented
in PyTorch, and included a convolutional layer followed by two fully connected layers of 32
neurons. The parameters for the DQN were set as in [20].</p>
<p>Each reward function was evaluated over n = 20 runs, with the total waiting time calculated for
each run, and the average of this cumulative waiting time was used to evaluate the functions.
Moreover, in our initial experiments, we found that the traffic scenario did not include enough
vehicles to saturate the road network and definitively test the reward functions. To
resolve this, we chose to modify the traffic scale option within SUMO. This option, which is
set to 1 by default, proportionally increases the traffic by that factor. We set it to 1.5,
meaning that each car in the network had a 50% chance of being duplicated. We chose this
instead of generating random data as it maintains the flow of traffic which is seen in
the Ingolstadt dataset. We chose a scale of 1.5 as a compromise, due to a quirk in how
SUMO processes uncompleted journeys at the end of the simulation. If cars do not arrive at
their destination by the simulation end time, they are not included in the output data, leading
to misleading information, as worse-performing agents appear to outperform those which can
(despite long delays) allow a greater throughput of vehicles.</p>
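The duplication effect of the traffic scale option described above can be mimicked in a few lines. This toy reproduction is our own sketch of the behaviour; SUMO applies the scaling internally.

```python
# Sketch of proportional demand scaling: at scale 1.5, each vehicle in the
# demand has a 50% chance of being duplicated.

import random

def scale_demand(vehicles, scale, rng=random):
    """Duplicate each vehicle with probability (scale - 1), for scale in [1, 2],
    mimicking the proportional increase applied by the traffic scale option."""
    # rng.random() is uniform on [0, 1), so the condition below holds with
    # probability scale - 1 (e.g. 0.5 when scale = 1.5).
    extra = [v for v in vehicles if rng.random() >= 2.0 - scale]
    return vehicles + extra

rng = random.Random(42)                   # fixed seed for reproducibility
base = list(range(10_000))
scaled = scale_demand(base, 1.5, rng)
print(round(len(scaled) / len(base), 2))  # close to 1.5
```

Because duplicates are drawn from the existing demand, the spatial and temporal flow pattern of the original scenario is preserved, which is the motivation given above for scaling rather than generating random traffic.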
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
<p>Figures 1 and 2 contain box plots of our results for traffic scale factors of 1 and 1.5, respectively.
Tables 1 and 2 contain the tabular waiting time results for the traffic scale factors of 1 and 1.5,
respectively.</p>
      <p>
Our initial run with the default traffic flow found that the pressure-based (pressure and pressure
squared) and average-speed-based (average speed and average speed normalised) methods were the only
methods not to outperform the MOVA/VA benchmark when the traffic scale factor was set to 1
and the traffic did not saturate (or near-saturate) the network. Whilst no conclusions can be
drawn between the RL algorithms in this scenario, 7 of the reward functions used outperformed
the benchmark, showing that there is significant potential in the use of RL for TSC. Furthermore,
we note that the average speed and ASN reward functions performed significantly worse than
in [
        <xref ref-type="bibr" rid="ref17">17</xref>
] when used in a multi-agent scenario. We speculate that this is caused by the problem
of non-stationarity, as agents could struggle to differentiate what is causing their low reward
when a signal upstream is essentially controlling the flow of traffic into their junction.
Furthermore, junctions which see few vehicles passing through are likely to be impacted more
by the changing policies of upstream intersections, a factor which could penalise the average
speed functions all the more.
      </p>
      <p>
In addition, the pressure-based reward functions did not outperform the benchmark either.
We hypothesise that this is in part caused by the structure of the road layout, compared to the
type of road layout which was used to develop these algorithms. Whilst these algorithms may
perform well in arterial road layouts [
        <xref ref-type="bibr" rid="ref4">4</xref>
] and grid-based layouts, they may struggle when faced with
other road networks. It is important to note that these kinds of road layout are rare in Europe,
which is where the data originates.
      </p>
<p>Once the traffic scale was increased to 1.5, a greater divergence is seen between the functions
used. The trend of average speed and pressure being outperformed by the baseline continues. It is
also important to note the reduced variance of the VA baseline. In real-world deployments, this
may be important, as it increases the predictability of the algorithm.</p>
<p>Whilst MWAQ was only slightly better than the other algorithms at the higher traffic
scale, the reduced variance across the runs means that in almost every run, it outperformed the
other algorithms. We believe that this improvement is due to the fact that when only queue
metrics are used, priority will almost certainly be given to the direction which has the greatest
flow of traffic, leaving some cars to wait a significant amount of time if they are travelling
perpendicular to that flow. When the maximum wait time is included, the agent is less likely to
prioritise the dominant flow of traffic as often.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
<p>In this paper, we discuss the non-stationarity problem and how it may impact the use of RL for
the TSC problem. We evaluate 11 different reward functions, including some of the more commonly
used examples, and compare them to a benchmark of a real-world algorithm through simulations
on a calibrated dataset from Ingolstadt. We believe that one potential avenue of further work is
to conduct these experiments on a larger scenario, which would allow for further validation
of the optimal reward function for use in the TSC problem. There may also be benefits to an
ensemble approach to TSC, where multiple agents with different reward functions are used
to come to a conclusion on the optimal decision.</p>
      <p>Moreover, an approach which could be explored is to employ pretraining on new agents,
training each agent on an individual intersection with the same number of lanes as the one
they will control before being implemented in the network with other agents. Whilst this may
increase the time required to train an agent, it may allow all agents to converge on a solution
sooner.</p>
<p>Additionally, there could be a focus on the environmental impacts of using one reward function
over another. For example, HGVs emit significantly more emissions when they accelerate
compared to private vehicles. An algorithm which does not differentiate between
these types of vehicles will not prioritise them (or account for the impact on traffic once an HGV
slows down, and the corresponding environmental impact of this), and will therefore cause more
damage to the environment.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
<p>Behrad Koohy is supported by an ICASE studentship funded by the Engineering and Physical
Sciences Research Council (EPSRC) and Yunex Traffic. Enrico Gerding and Sebastian Stein
are funded by the EPSRC AutoTrust platform grant (EP/R029563/1). Sebastian Stein is also
supported by an EPSRC Turing AI Acceleration Fellowship on Citizen-Centric AI Systems
(EP/V022067/1).</p>
      <p>[17] A. Cabrejas-Egea, et al., Assessment of reward functions for reinforcement learning
traffic signal control under real-world limitations, in: 2020 IEEE International Conference on
Systems, Man, and Cybernetics (SMC), IEEE, 2020, pp. 965–972.
[18] H. Wei, G. Zheng, V. Gayah, Z. Li, A survey on traffic signal control methods, AAMAS
2019 (2019).
[19] C. Chen, H. Wei, N. Xu, G. Zheng, M. Yang, Y. Xiong, K. Xu, Z. Li, Toward a thousand
lights: Decentralized deep reinforcement learning for large-scale traffic signal control, in:
AAAI, 2020.
[20] J. Ault, J. P. Hanna, G. Sharon, Learning an interpretable traffic signal control policy, arXiv
preprint arXiv:1912.11023 (2019).
[21] R. Bellman, A Markovian decision process, Journal of Mathematics and Mechanics (1957)
679–684.
[22] Department for Transport, Traffic signs manual, 2020. URL: https://www.gov.uk/
government/publications/traffic-signs-manual.
[23] P. Sun, N. AlJeri, A. Boukerche, A fast vehicular traffic flow prediction scheme based
on Fourier and wavelet analysis, in: 2018 IEEE Global Communications Conference
(GLOBECOM), IEEE, 2018, pp. 1–6.
[24] C. J. Watkins, P. Dayan, Q-learning, Machine Learning 8 (1992) 279–292.
[25] M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling, The arcade learning environment: An
evaluation platform for general agents, Journal of Artificial Intelligence Research 47 (2013)
253–279.
[26] E. van der Pol, F. A. Oliehoek, Coordinated deep reinforcement learners for traffic light
control, in: NIPS'16 Workshop on Learning, Inference and Control of Multi-Agent Systems,
2016.
[27] T. Chu, J. Wang, L. Codecà, Z. Li, Multi-agent deep reinforcement learning for large-scale
traffic signal control, IEEE Transactions on Intelligent Transportation Systems 21 (2019)
1086–1095.
[28] Q. Wu, L. Zhang, J. Shen, L. Lü, B. Du, J. Wu, Efficient pressure: Improving efficiency for
signalized intersections, arXiv preprint arXiv:2112.02336 (2021).
[29] N. Rouphail, A. Tarko, J. Li, Traffic flow at signalized intersections (1992).
[30] R. Roess, E. Prassas, W. McShane, Traffic Engineering, 4th ed., Prentice Hall, 2011.
[31] M. Abdoos, N. Mozayani, A. L. Bazzan, Traffic light control in non-stationary environments
based on multi agent Q-learning, in: 2011 14th International IEEE Conference on Intelligent
Transportation Systems (ITSC), IEEE, 2011, pp. 1580–1585.
[32] G. Leduc, et al., Road traffic data: Collection methods and applications, Working Papers
on Energy, Transport and Climate Change 1 (2008) 1–55.
[33] S. C. Lobo, S. Neumeier, E. M. Fernandez, C. Facchi, InTAS – the Ingolstadt traffic scenario
for SUMO, arXiv preprint arXiv:2011.11995 (2020).
[34] D. Zhao, Y. Dai, Z. Zhang, Computational intelligence in urban traffic signal control: A
survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and
Reviews) 42 (2011) 485–494.
[35] W. Genders, S. Razavi, Evaluating reinforcement learning state representations for adaptive
traffic signal control, Procedia Computer Science 130 (2018) 26–33.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Peirce</surname>
          </string-name>
          , MOVA:
          <article-title>Traffic responsive, self-optimising signal control for isolated intersections</article-title>
          ,
          <source>TRRL Research Report</source>
          (
          <year>1988</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Hunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Bretherton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Royle</surname>
          </string-name>
          ,
          <article-title>The SCOOT on-line traffic signal optimisation technique</article-title>
          ,
          <source>Traffic Engineering and Control</source>
          <volume>23</volume>
          (
          <year>1982</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cabrejas-Egea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Walton</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for trafic signal control: comparison with commercial systems</article-title>
          ,
          <source>Transportation research procedia</source>
          <volume>58</volume>
          (
          <year>2021</year>
          )
          <fpage>638</fpage>
          -
          <lpage>645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gayah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Presslight: Learning max pressure control to coordinate traffic signals in arterial network</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1290</fpage>
          -
          <lpage>1298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Varaiya</surname>
          </string-name>
          ,
          <article-title>Max pressure control of a network of signalized intersections</article-title>
          ,
          <source>Transportation Research Part C: Emerging Technologies</source>
          <volume>36</volume>
          (
          <year>2013</year>
          )
          <fpage>177</fpage>
          -
          <lpage>195</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sharon</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning benchmarks for traffic signal control</article-title>
          ,
          <source>in: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Colight: Learning network-level cooperation for traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1913</fpage>
          -
          <lpage>1922</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bellman</surname>
          </string-name>
          ,
          <article-title>Dynamic programming</article-title>
          ,
          <source>Science</source>
          <volume>153</volume>
          (
          <year>1966</year>
          )
          <fpage>34</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Simchi-Levi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for non-stationary Markov decision processes: The blessing of (more) optimism</article-title>
          , in: International Conference on Machine Learning, PMLR,
          <year>2020</year>
          , pp.
          <fpage>1843</fpage>
          -
          <lpage>1854</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lomonaco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Desai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Culurciello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maltoni</surname>
          </string-name>
          ,
          <article-title>Continual reinforcement learning in 3d non-stationary environments</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nareyek</surname>
          </string-name>
          ,
          <article-title>Choosing search heuristics by non-stationary reinforcement learning</article-title>
          , in: Metaheuristics: Computer decision-making, Springer,
          <year>2003</year>
          , pp.
          <fpage>523</fpage>
          -
          <lpage>544</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Prashanth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhatnagar</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning with function approximation for traffic signal control</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>12</volume>
          (
          <year>2010</year>
          )
          <fpage>412</fpage>
          -
          <lpage>421</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Berenji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vengerov</surname>
          </string-name>
          ,
          <article-title>A convergent actor-critic-based FRL algorithm with application to power management of wireless transmitters</article-title>
          ,
          <source>IEEE Transactions on Fuzzy Systems</source>
          <volume>11</volume>
          (
          <year>2003</year>
          )
          <fpage>478</fpage>
          -
          <lpage>485</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Feudal multi-agent deep reinforcement learning for traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems</source>
          , AAMAS '20, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC,
          <year>2020</year>
          , pp.
          <fpage>816</fpage>
          -
          <lpage>824</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Foerster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. M.</given-names>
            <surname>Assael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>de Freitas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whiteson</surname>
          </string-name>
          ,
          <article-title>Learning to communicate to solve riddles with deep distributed recurrent q-networks</article-title>
          ,
          <source>arXiv preprint arXiv:1602.02672</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sukhbaatar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          , et al.,
          <article-title>Learning multiagent communication with backpropagation</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>29</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Egea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Howell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Knutins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Connaughton</surname>
          </string-name>
          ,
          <article-title>Assessment of reward functions for reinforcement learning traffic signal control under real-world limitations</article-title>
          , in: 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE,
          <year>2020</year>
          , pp.
          <fpage>965</fpage>
          -
          <lpage>972</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] H. Wei, G. Zheng, V. Gayah, Z. Li, A survey on traffic signal control methods, AAMAS 2019 (2019).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] C. Chen, H. Wei, N. Xu, G. Zheng, M. Yang, Y. Xiong, K. Xu, Z. Li, Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control, in: AAAI, 2020.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] J. Ault, J. P. Hanna, G. Sharon, Learning an interpretable traffic signal control policy, arXiv preprint arXiv:1912.11023 (2019).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] R. Bellman, A Markovian decision process, Journal of mathematics and mechanics (1957) 679–684.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Department for Transport, Traffic signs manual, 2020. URL: https://www.gov.uk/government/publications/traffic-signs-manual.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] P. Sun, N. AlJeri, A. Boukerche, A fast vehicular traffic flow prediction scheme based on Fourier and wavelet analysis, in: 2018 IEEE Global Communications Conference (GLOBECOM), IEEE, 2018, pp. 1–6.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] C. J. Watkins, P. Dayan, Q-learning, Machine learning 8 (1992) 279–292.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling, The arcade learning environment: An evaluation platform for general agents, Journal of Artificial Intelligence Research 47 (2013) 253–279.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] E. van der Pol, F. A. Oliehoek, Coordinated deep reinforcement learners for traffic light control, 2016.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] T. Chu, J. Wang, L. Codecà, Z. Li, Multi-agent deep reinforcement learning for large-scale traffic signal control, IEEE Transactions on Intelligent Transportation Systems 21 (2019) 1086–1095.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] Q. Wu, L. Zhang, J. Shen, L. Lü, B. Du, J. Wu, Efficient pressure: Improving efficiency for signalized intersections, arXiv preprint arXiv:2112.02336 (2021).</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] N. Rouphail, A. Tarko, J. Li, Traffic flow at signalized intersections (1992).</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] R. Roess, E. Prassas, W. McShane, Traffic engineering, 4th ed., Prentice Hall, 2011.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] M. Abdoos, N. Mozayani, A. L. Bazzan, Traffic light control in non-stationary environments based on multi-agent Q-learning, in: 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), IEEE, 2011, pp. 1580–1585.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] G. Leduc, et al., Road traffic data: Collection methods and applications, Working Papers on Energy, Transport and Climate Change 1 (2008) 1–55.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] S. C. Lobo, S. Neumeier, E. M. Fernandez, C. Facchi, InTAS – the Ingolstadt traffic scenario for SUMO, arXiv preprint arXiv:2011.11995 (2020).</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] D. Zhao, Y. Dai, Z. Zhang, Computational intelligence in urban traffic signal control: A survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (2011) 485–494.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] W. Genders, S. Razavi, Evaluating reinforcement learning state representations for adaptive traffic signal control, Procedia Computer Science 130 (2018) 26–33.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>