The Real Deal: A Review of Challenges and Opportunities in Moving Reinforcement Learning-Based Traffic Signal Control Systems Towards Reality

Rex Chen, Fei Fang and Norman Sadeh
Institute of Software Research, School of Computer Science, Carnegie Mellon University

Abstract
Traffic signal control (TSC) is a high-stakes domain that is growing in importance as traffic volume grows globally. An increasing number of works are applying reinforcement learning (RL) to TSC; RL can draw on an abundance of traffic data to improve signalling efficiency. However, RL-based signal controllers have never been deployed. In this work, we provide the first review of challenges that must be addressed before RL can be deployed for TSC. We focus on four challenges involving (1) uncertainty in detection, (2) reliability of communications, (3) compliance and interpretability, and (4) heterogeneous road users. We show that the literature on RL-based TSC has made some progress towards addressing each challenge. However, more work should take a systems thinking approach that considers the impacts of other pipeline components on RL.

Keywords
Traffic signal control, Reinforcement learning, Intelligent transportation system, System deployment, Review

ATT '22: Workshop on Agents in Traffic and Transportation, July 25, 2022, Vienna, Austria
rexc@cmu.edu (R. Chen); feifang@cmu.edu (F. Fang); sadeh@cs.cmu.edu (N. Sadeh)

1. Introduction
As the traffic volume of metropolitan areas continues to grow worldwide, gridlock is becoming an increasingly prevalent concern. According to the 2021 Urban Mobility Report [1], gridlock led to over 4 billion hours in travel delay and $100+ billion in congestion costs across the United States in 2020. This not only impacts commercial productivity but also has environmental consequences. One important mechanism for alleviating gridlock is improving the timing of traffic signals [2]. Historically, most jurisdictions have used fixed timing plans based on traffic models, which assume fixed values of factors such as lane volumes and arrival rates [3]. To minimize implementation burden, traditional traffic signal control (TSC) either uses one fixed plan throughout the entire day, or rotates through several plans depending on the time of day. However, fixed plans cannot respond in real time to changes in traffic demand [3, 4].

Large traffic volumes also offer an abundance of data that can be used for real-time optimization of signal timing plans. Many deployed systems combine logic-triggered state changes with data-driven searches over sets of schedules [3]. However, an increasing number of approaches traverse larger search spaces using optimization and scheduling algorithms [5]. Among these approaches, reinforcement learning (RL) has yielded significant improvements over fixed and actuated TSC algorithms in simulations [6]. RL allows systems to learn from the consequences of their decisions, which enables them to achieve continuous self-improvement. Deployments of RL algorithms have achieved success in a variety of complex domains involving human interaction, such as card games [7], real-time strategy games [8], and other applications in transportation such as dispatching for ride-hailing services [9].
However, to our knowledge, RL-based TSC algorithms have never been deployed. This is in spite of the fact that papers introducing novel algorithms in this area commonly list real-world deployment as a goal for future work [10]. We believe that this discrepancy has arisen due to a focus on methodological contributions, instead of on a holistic systems thinking approach based on the data-to-deployment pipeline [11]. If RL-based signal controllers are to achieve success in deployment, domain experts in TSC and in RL must have a shared view of the problem. We take a step towards bridging the gap between research and deployment by providing the first review of challenges that may arise from end-to-end deployments of RL-based TSC, which we intend to serve as a common basis for collaboration between researchers in TSC and RL.

We begin by describing our review methodology in Section 1.1. Then, we provide a high-level review of the fields of TSC and RL in Section 2. Next, we explore four engineering challenges. For each of these challenges, we will provide a review of (1) how these challenges are significant concerns for the state of the art in RL-based TSC; (2) what practical considerations relevant to these challenges have arisen in deployments of non-RL TSC systems; and (3) what progress has been made in the RL-based TSC literature towards solving these challenges.

• Uncertainty in detection (Section 3). Typically, RL-based TSC algorithms learn based on metrics such as queue length or travel time. These require accurate vehicle detection technologies, which may not always be available in the field. Strategies to deal with detector uncertainty and failure are a prerequisite of deployment.
• Reliability of communications (Section 4). Some decentralization is necessary for RL-based TSC. Coordination between intersections is important for optimizing network-level metrics, yet most work in RL-based TSC has not considered the practicalities of dealing with failure and latency in inter-intersection communications.
• Compliance and interpretability (Section 5). Jurisdictions will not have confidence in RL-based signal controllers without assurances about compliance with standards (e.g., minimum green time) and safety requirements. The interpretability of models is important for ensuring that signalling plans can be audited and adjusted by stakeholders.
• Heterogeneous road users (Section 6). Most simulations for RL-based TSC assume that all cars are the same size and have the same free-flow speed. However, cars share the road with pedestrians, buses, emergency vehicles, and other road users. Algorithms must detect and respond to the needs of different road users in a safe, equitable manner.

Finally, we end with concluding thoughts and suggestions for future work in Section 7.

1.1. Methodology
To obtain an overview of the domain of RL-based TSC, we conducted a targeted search on Google Scholar with the keywords “traffic signal”/“traffic light”, “reinforcement learning”, and “review”/“survey”. We identified the four challenges addressed in the following sections through these reviews. From here, we conducted snowball sampling based on their citations to locate papers in the RL literature that discuss these challenges. For RL papers, we focused on those published after 2015, since this field has rapidly evolved over the past several years.
We also performed additional targeted Google Scholar searches to find literature which describes non-RL deployments of TSC, by searching the keywords “traffic signal”/“traffic light” and “adaptive” in conjunction with the following keywords:

• For Section 3: “uncertainty”, “noise”, “sensing error”, “accuracy”.
• For Section 4: “coordination”, “communication”, “closed loop”, “message”, “NTCIP”.
• For Section 5: “compliance”, “safety”, “accountability”, “interpretability”/“explainability”.
• For Section 6: “pedestrian”/“leading pedestrian interval”, “cyclist”, “transit”, “emergency vehicle”, “priority”, “preempt”.

2. Related work

2.1. Traffic signal control
Traffic signal control (TSC) aims to allocate green time at an intersection to traffic moving in different directions. Every approach (roadway entering the intersection) is split into lanes for forward, left-turn, and (possibly) right-turn movements (which may be assumed to always be permissible) [12, 13]. For efficiency, pairs of compatible movements are often arranged into phases and signalled simultaneously [10, 14, 15]. The task is to find some division of green time between phases for each intersection in a road network that maximizes metrics such as the throughput of the network. We refer the reader to [16] for details of the problem formulation. Different approaches to dividing green time include choosing phase durations or phase sequences, or fixing a phase sequence within a cycle and choosing the length of the cycle or the proportions of each phase within the cycle [10, 12, 15].

Three main types of algorithmic approaches exist. In fixed-time control, which has historically been a popular strategy [3], a small number of fixed plans are optimized based on past traffic data under the assumption of uniform demand. In actuated control, detector inputs (such as vehicle presence data from loop detectors) are used in conjunction with a fixed set of logical rules. Finally, adaptive control uses more complex prediction and optimization algorithms to control signalling plans [12, 16].

2.2. Reinforcement learning
One emerging approach to adaptive control has been reinforcement learning (RL). RL is a sequential decision-making paradigm wherein agents learn how to act through trial-and-error interactions with an environment. The goal of RL is to learn policies, which describe how agents should act given the state of the environment. Early work in reinforcement learning during the 1980s and 1990s, which included the seminal Q-learning algorithm [17], relied on tabular enumeration of environment states and agent actions. RL remained relatively difficult to scale until the emergence of methods based on function approximation in the 2010s, specifically the use of neural networks for deep RL [18]. Since then, the popularity and complexity of RL have experienced explosive growth. Deep RL has also found novel applications in practical domains such as robotics, natural language processing, finance, and healthcare [19]. Transportation has been one of the most significant applications of deep RL, with tasks including autonomous driving [20], vehicle dispatching [9] and routing [21], and traffic signal control (see Section 2.3). We refer the reader to [22] for an in-depth review of the history of reinforcement learning.
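To make the contrast with deep RL concrete, the following is a minimal sketch of tabular Q-learning applied to phase selection, in the spirit of [17]. The state discretization, environment interface, and hyperparameter values are illustrative assumptions, not a prescription from the cited works.

```python
from collections import defaultdict
import random

# Minimal tabular Q-learning for phase selection (cf. [17]).
# States are assumed to be hashable discretizations of detector data,
# e.g. tuples of binned queue lengths; actions are phase indices.
ALPHA, GAMMA, EPSILON, N_PHASES = 0.1, 0.95, 0.1, 4
Q = defaultdict(lambda: [0.0] * N_PHASES)  # tabular enumeration of values

def select_phase(state):
    """Epsilon-greedy selection over the enumerated phases."""
    if random.random() < EPSILON:
        return random.randrange(N_PHASES)
    return max(range(N_PHASES), key=lambda a: Q[state][a])

def update(state, action, reward, next_state):
    """One-step Q-learning backup after observing a transition."""
    target = reward + GAMMA * max(Q[next_state])
    Q[state][action] += ALPHA * (target - Q[state][action])
```

The table grows with every distinct state encountered, which is precisely why such methods scale poorly to the high-dimensional state representations used in deep RL.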
The body of work that we review in this paper can be seen as a parallel to work in RL for robotics that attempts to close the gap between simulations and reality. RL methods, especially deep RL methods, require an abundance of data to learn from environmental interactions. Due to the cost of real-world data collection, simulators are often employed instead to generate large quantities of interactions. However, simulators can never perfectly emulate reality. This problem, which is referred to as the reality gap [23], has been addressed by the sim-to-real literature. Some sim-to-real methods employ randomization in sensors and controllers to learn robust policies (domain randomization); some explicitly model the reality gap and try to unify the feature spaces of the source and target environments (domain adaptation); some train policies to generalize across different tasks (meta-RL); some attempt to learn from demonstrations of behaviour in target environments (imitation learning); and others attempt to improve simulators. We refer the reader to [24, 25] for surveys of these methods. In this work, we draw parallels between some of these methods and developments in RL-based TSC. At the same time, however, TSC involves unique challenges that are usually not present in robotics. Environments in robotics where sim-to-real methods have been applied (see [24]) are usually highly controlled, with well-defined objectives (e.g., [26]) and minimal interaction with other agents. In contrast, TSC may be affected by varying environmental conditions and large numbers of road users.

2.3. Related reviews
Various reviews of applications of RL in TSC have been published. While each of the following reviews captures distinct aspects of the field that are highly relevant to our work, none of them have focused on the key issue of practical engineering challenges that present barriers to deployment, and — crucially — how to solve them instead of leaving them as open problems.

[27], [28], and [29] provided brief syntheses of early RL-based TSC methods in reviews of applications of AI in transportation. [30] and [31] were the first to take a systematic approach to reviewing RL-based TSC algorithms; the former performed the first experimental comparison of RL algorithms with a synthetic network, while the latter addressed data sources such as models of road networks and vehicle arrivals. Both reviewed state, action, and reward formulations. These reviews considered traditional algorithms in RL such as Q-learning and SARSA. With the increasing popularity of deep learning to address challenges of scalability in RL, [4, 32] (the latter a follow-up to [31]) both reviewed deep RL methods for TSC and provided recommendations for designing novel deep RL-based TSC algorithms. [4] focused on choosing state, action, and reward representations, with some discussion of data processing, but did not consider downstream challenges in deployment. [32] provided a broad overview of various algorithm and architecture designs with less of a focus on practicalities.

Both [15, 33] reviewed alternative state, action, and reward formulations among deep RL-based TSC algorithms, as well as options for inter-agent coordination and simulation-based evaluation. They outlined, but did not investigate, challenges to deployment. [15] further compared deep RL-based algorithms to traditional actuated and adaptive methods. Likewise, as part of a wider review on deep RL for intelligent traffic systems, [34] reviewed problem formulations and the history of algorithmic developments for RL-based TSC.
Finally, [10] performed a highly systematic overview of the past 26 years of research in this domain, which provides quantitative support for some of the patterns that we identify.

3. Uncertainty in detection

3.1. Significance of challenges
Inputs to RL-based TSC algorithms describe states using abstracted features, including vehicles’ queue lengths, positions, and speeds [10]. Many works take for granted that these state features are readily available [35]. As reported by [10], 67% of surveyed papers did not envision any specific data sources. Even in papers where potential data sources were specified, it is unclear how robust the methods would be to detector noise or failure. For instance, among algorithms that use vehicle positions as state features, [36, 37, 38, 39] all used the simulator SUMO to obtain noiseless images of single-intersection toy networks; [40] extended this approach with a 3D simulator for images from the perspectives of traffic cameras; and [41] used simulated traffic in SUMO based on flow rates from traffic camera footage. Each of these methods provides a sanitized representation that may not necessarily be representative of real-world conditions. Furthermore, the loss of information to noise may cause state aliasing [42], which hinders the generalizability of learned policies to different demand scenarios [43].

3.2. Lessons from deployments
Instruments for traffic sensing fall into two types: intrusive detectors (installed into the road surface) and non-intrusive detectors (mounted above the road surface) [44, 45]. Among intrusive detectors, loop detectors are relatively inexpensive, accurate, and robust to weather and time of day, but they are also highly vulnerable to wear and tear [46]. When they fail, loop detectors are increasingly being replaced by non-intrusive detectors such as video-based and radar detection systems [44], which can be flexibly reconfigured to detect different road segments and vehicle types. However, the accuracy of these systems degrades in inclement weather, and video detectors are also inaccurate at night and on high-speed roads [45, 47]. RL-based signal controllers must be designed with these limitations in mind; learning ensembles of models [48] to capture the strengths of different detectors may improve robustness. Although data about speed and position from connected vehicles can be useful, penetration remains low, so such data must be integrated with traditional detector data. [49] showed in simulations that connected vehicle data could improve adaptive control even with limited penetration.

Furthermore, agencies may configure their detectors differently. To account for uncertainty in vehicle stopping positions, for instance, the size of the detection zone behind the stop bar may vary [50]; detectors may also report data at different frequencies [51]. Thus, verifying the mapping from real detector data to abstract state representations is an important task for RL-based TSC. Agencies often address problems in detection by modifying their detection setup [44] or by configuring parameters such as passage time (i.e., the amount of time that a phase is extended upon actuation) [45]. [5] explicitly addressed error in queue length detection for their adaptive controller SURTRAC. To mitigate underestimation, they used heuristics based on differences in vehicle counts reported by advance and stop bar detectors [52]. They considered overestimation acceptable, as it provides the algorithm with buffer time; similarly, [53] found that moderate queue length overestimation significantly improves the performance of adaptive control.
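As a concrete illustration, the following is a minimal sketch of a conservation-style queue estimate from paired advance and stop bar detector counts. The interface is hypothetical and deliberately simplified relative to the heuristics of [52], but it shows why clamping the estimate at zero biases residual error towards overestimation, the direction that these deployments found acceptable.

```python
def estimate_queues(advance_counts, stopbar_counts, init_queue=0):
    """Rolling queue-length estimate from per-interval detector counts.

    advance_counts: vehicles crossing the advance detector per interval
    stopbar_counts: vehicles departing past the stop bar per interval
    Clamping at zero discards miscounts that would drive the estimate
    negative, so residual error tends towards overestimation.
    """
    queue, estimates = init_queue, []
    for arrived, departed in zip(advance_counts, stopbar_counts):
        queue = max(0, queue + arrived - departed)
        estimates.append(queue)
    return estimates
```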
3.3. Progress toward solutions
Two lines of work within RL-based TSC have the potential to address detection uncertainty. First, various authors have investigated the effects of reducing the dimensionality of the state space. In particular, [3] showed that complex image representations of intersection state achieve inferior performance compared to a simple representation containing only vehicle counts and phases. [54] reached similar conclusions with a state representation based on queue length. Both papers also provided optimality results that connected these formulations to traditional methods in TSC. Meanwhile, [43, 55] investigated the effects of switching to coarser state representations with a single algorithm. [55] found that occupancy and speed data (e.g., from loop detectors) yielded near-identical performance to high-fidelity position data (e.g., from cameras). However, the experiments of [43] suggested that coarser state discretizations harm generalization across sudden shifts in traffic flow. Regardless, simpler state representations could facilitate identification and debugging of issues caused by detection uncertainty.

Second, other work has attempted to imbue RL-based TSC algorithms with robustness to detection uncertainty. Several methods are analogous to domain randomization in the sim-to-real literature [26, 56]. The approach of [57] is closest to the sim-to-real literature: they randomize weather and lighting conditions in their traffic simulator and train policies based on the resulting images. [58] applied Dropout to neural network units to prevent overfitting and thus learn robust policies; they evaluated their algorithm with a simulation of probabilistic detector failure. As is done in adversarial machine learning, [59] injected Gaussian noise into queue length observations, and validated their approach with simulations where trucks cause vehicle count overestimation. Meanwhile, to handle miscalibrated measurements, [35] combined next-state prediction with imitation learning from a real traffic controller (SCOOT), [60] used autoencoders to denoise input data, and [61] evaluated the effects of lane-blocking incidents and detector noise on performance. Finally, in a growing body of work that uses connected vehicle data for RL, [62] was the first to explicitly address partial observability by adding the phase duration into the state space to learn its indirect impact on delay.

Overall, these methods are helpful approaches for improving the robustness of RL-based TSC to detection uncertainty. However, they should be designed and tuned to address the challenges of specific deployments, leveraging past knowledge to identify and address potential causes of detector noise or failure. It may also help to model partial observability as part of the problem.
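To illustrate what training-time randomization of detector inputs might look like, here is a minimal environment wrapper in the style of the noise injection and failure simulation described above [58, 59]. It assumes a gym-style environment with the classic 4-tuple step API whose states are nonnegative detector counts; the parameter values and interface are illustrative assumptions, not the cited authors' implementations.

```python
import numpy as np

class NoisyDetectorWrapper:
    """Training-time perturbation of detector-derived state features.

    Analogous to domain randomization: Gaussian noise models counting
    error, and random dropout models probabilistic detector failure.
    """
    def __init__(self, env, noise_std=0.1, failure_prob=0.02, rng=None):
        self.env = env
        self.noise_std = noise_std
        self.failure_prob = failure_prob
        self.rng = rng or np.random.default_rng()

    def _perturb(self, state):
        state = np.asarray(state, dtype=float)
        noisy = state + self.rng.normal(0.0, self.noise_std, state.shape)
        failed = self.rng.random(state.shape) < self.failure_prob
        noisy[failed] = 0.0  # a failed detector reports nothing at all
        return np.clip(noisy, 0.0, None)  # counts cannot be negative

    def reset(self):
        return self._perturb(self.env.reset())

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        return self._perturb(state), reward, done, info
```

A policy trained behind such a wrapper has at least seen the kinds of corrupted observations that failing detectors would produce in the field.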
4. Reliability of communications

4.1. Significance of challenges
Some level of controller decentralization is often applied in RL-based TSC, because the computational cost of RL may be prohibitive when the state and action space dimensionalities are high. At the same time, to ensure that controllers take the traffic conditions of other intersections into account for signalling decisions, a growing number of works have implemented mechanisms for inter-intersection coordination [33]. Typical approaches involve sharing states [63, 64, 65, 66, 67, 68], actions [69], or hidden state representations from neural networks [70, 71] between controllers for neighbouring intersections. While much of this work has focused on designing neural network architectures to leverage shared information (such as graph neural networks [66, 67, 70, 71]), less attention has been devoted to the mechanisms by which information must be exchanged in the first place. If there are inconsistencies in the availability of communication infrastructure and detectors between intersections (see also Section 3), it is unclear how they may affect the performance of RL-based TSC.

4.2. Lessons from deployments
In practice, signal controllers are commonly deployed as part of closed-loop systems, where control is distributed over three levels. At the top level, traffic management centres (TMCs) make policy-based signalling decisions, often involving dialogue with other stakeholders. These decisions are used to configure field master controllers (FMCs), which are installed on-site and coordinate multiple local intersection controllers (LICs) [72]. Each FMC aggregates traffic conditions reported by connected LICs to make signalling decisions over a small region; FMCs also synchronize the clocks of LICs to ensure that they are coordinated [12, 14]. As 90% of TSC systems in the United States are closed-loop [73], upgrades to adaptive control have largely been implemented within this hierarchical organization [51]. LICs may make some limited decisions based on local traffic conditions, but coordination is still largely delegated to FMCs even in adaptive control [72]. Transitioning to adaptive control has also required agencies to update to Type 2070 or ATC controllers [12], but some controllers in road networks may retain relatively outdated hardware [14]. RL-based signal controllers will likely be deployed into such ecosystems, where control is distributed hierarchically and different intersections have different capabilities for control and/or detection. Thus, algorithms based on techniques for domain adaptation from the sim-to-real literature may be helpful.

In modern TSC systems, messages are sent between controllers and TMCs using multiple communication media [12]. For wired connections, fibre optic cables are increasingly replacing traditional copper wires or coaxial cables. Wireless communication systems implemented using radio or Wi-Fi are also becoming increasingly common [44]. Thus, communication bandwidth is not likely to be a concern, except in jurisdictions where fibre optic infrastructure is not readily available. However, a major issue reported by agencies in [44] was connection reliability: poor signal strength often results in data loss or latency. In terms of data formatting, the NTCIP 1202 standard includes standard object definitions for actuated signal controllers, which have also been used for adaptive systems [73]. Communications for RL would need to fit into this standard, at least until it is updated (as has already been done for connected vehicles) [74]. In SURTRAC, [5] encoded data for communication between neighbouring intersections using JSON messages with standard types.
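As an illustration of the kind of message passing involved, the following sketches a timestamped neighbour-to-neighbour message in the spirit of SURTRAC's JSON encoding [5]. The field names and freshness check are assumptions for illustration, and do not follow the NTCIP 1202 object definitions or SURTRAC's actual schema.

```python
import json
import time

# Hypothetical neighbour-to-neighbour message with standard JSON types.
# A timestamp lets receivers discard stale data when latency or data
# loss occurs (clock synchronization is assumed, cf. the role of FMCs).
message = {
    "sender": "intersection_12",
    "timestamp": time.time(),
    "phase": 2,
    "phase_elapsed_s": 14.5,
    "outflows": {"north": 3, "south": 5, "east": 1, "west": 2},
}
payload = json.dumps(message)

def is_fresh(raw_payload, max_age_s=2.0):
    """Reject messages older than max_age_s rather than acting on them."""
    return time.time() - json.loads(raw_payload)["timestamp"] <= max_age_s
```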
4.3. Progress toward solutions
One line of work in RL-based TSC has sought to learn more compact representations of information. Although bandwidth is not a concern, reducing message dimensionality could still mitigate the impact of communication failures. Several algorithms directly exchange state values of learned policies instead of learning from exchanged state representations. In [75, 76], state values are directly exchanged between neighbours and weighted; [37, 77, 78] leveraged the max-plus algorithm for coordination graphs, which is known to converge to near-optimality even for cyclic graphs [79]. Meanwhile, [80] designed an architecture to exchange information from the previous time step to ensure robustness to latency, and showed that it asymptotically reduces communication by 50% relative to neighbour-based approaches. [81] demonstrated that cumulative rewards can be estimated based only on vehicle counts on inbound approaches.

Some work has also focused on designing RL-based TSC algorithms for hierarchically distributed frameworks of communication and control, which could improve RL's robustness, scalability, and applicability for deployment in closed-loop systems. [82] implemented a two-level architecture where LICs can either act independently or receive joint actions from FMCs based on predictions of the regional traffic state. [63] introduced a feudal RL algorithm, in which “manager” controllers do not directly control the actions of “worker” controllers, but instead set goals that influence their rewards. [83] trained multiple sub-policies that minimize various proxy metrics such as queue length and waiting time, and a high-level controller that adaptively delegates control to sub-policies to minimize the longer-term metric of travel time. However, all of these architectures remain conceptual, and further work is needed to deploy them.

5. Compliance and interpretability

5.1. Significance of challenges
At the heart of the fact that RL-based TSC algorithms have not been deployed are the potential regulatory and safety risks that are introduced by RL [15, 34]. The issue of trust and safety for RL is by no means exclusive to the domain of TSC [84, 85, 86], but in this case the stakes are high because controllers must interact with a large number of human users and mistakes may have fatal consequences. For RL-based signal controllers to be trusted, we need to assess — both prospectively and retrospectively — whether their decisions comply with standards and reasonable expectations [87]. However, the proliferation of deep RL algorithms based on complicated state representations runs counter to this goal, as assessment of compliance is not possible if we cannot understand or at least verify their decisions. At the same time, issues of interpretability and safety have rarely been discussed in the literature on RL-based TSC [10] and are more often mentioned as desiderata for future work in reviews [10, 15, 34].

5.2. Lessons from deployments
In the real world, regulatory frameworks for traffic signalling are often scattershot. In the United States, the federal Manual on Uniform Traffic Control Devices [88] includes standards about the necessity, meaning, and placement of different traffic signals. Many of these standards involve the control of individual movement signals, which would be abstracted away from RL through phase-based action space definitions.
However, factors such as yellow change and red clearance intervals are left to “engineering judgement”. States may impose further requirements on signal timing plans based on regional transportation policies [14]. In a review of signal timing policies for 15 states, [89] found recommendations for factors such as minimum green, yellow change, and red clearance intervals, as well as when to serve turn movements. Such recommendations should be incorporated into the design of the RL action space, as was done by [5], who treated safety constraints as inputs to SURTRAC. Yet these recommendations can also be arbitrary and dependent on data (e.g., vehicle and pedestrian clearing times [89]), and algorithmic approaches to stakeholder preference learning [90] may help to find better values.

One common strategy to ensure the safety of signal timing plans is to review common types and causes of crashes in historical data [89]. Naturally, this is a reactive approach that requires crashes to happen in the first place, and crash reports may also be biased by severity or by environmental conditions [14]. Accident modification factors (AMFs) are a popular method of quantitative analysis; they statistically estimate the effectiveness of particular changes to signal timing plans based on their expected reductions in crash rate [91, 92, 93]. We are unaware of any work in RL that estimates or uses AMFs, but they may be a valuable pathway to interpretability. The Highway Safety Manual also provides standard crash risk assessment models, but these models often require extensive tuning to local conditions [94, 95, 96].

5.3. Progress toward solutions
Some work has enhanced the interpretability of RL-based TSC through algorithm design. [97] focused on learning surrogate policies that are regulatable, i.e. monotonic in state variables, which allows parameters to be viewed as weights. [98] learned human-auditable decision tree surrogates using VIPER, an algorithm that identifies critical states where suboptimality harms future rewards. Closer to the literature on interpretability for machine learning, [99] used SHAP values to analyze how induction loop detections contribute to choices of phases for a controller in a simulated roundabout; they found that advance detectors have higher SHAP values, as they are more indicative of congestion. Similarly, [57] used Grad-CAM to generate heatmaps for image-based inputs. Instead of directly interfacing with the simulator, [100] used logical rules based on signal controllers to post-process RL policy outputs to ensure compliance.

Further work has applied heuristic modifications to RL algorithms to enforce safety. [101] prevented their system from taking actions when pedestrians are detected in crosswalks, and enforced minimum green times for pedestrians. [102] drew on their models of rear-end conflict rates (based on various observable intersection state features [103]) to design a reward formulation that minimizes such conflicts. Similarly, [104] used a binary logistic crash risk model to define crash penalties while also minimizing waiting time. Using a state formulation based on individual signals, [105] regularized the red light duration of signalling plans to mitigate unsafe behaviour caused by driver frustration with extended red lights. [106] included yellow change intervals in their action space and added a penalty for emergency braking by vehicles.
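A common denominator of these heuristic safeguards is that the learned action is filtered through hard timing rules before it reaches the signal. The sketch below shows one way to mask actions that would violate a minimum green time and to force a yellow change interval on every switch; the constants and interface are illustrative assumptions, not drawn from any cited system or standard.

```python
MIN_GREEN_S = 7.0   # illustrative minimum green time, not a standard value
YELLOW_S = 3.5      # illustrative yellow change interval

def enforce_timing(requested_phase, current_phase, elapsed_green_s):
    """Filter a learned action through hard signal-timing rules.

    Returns (phase_to_serve, serve_yellow_first): when the second value
    is True, the controller serves YELLOW_S of yellow change interval
    before starting the new green.
    """
    if requested_phase == current_phase:
        return current_phase, False
    if elapsed_green_s < MIN_GREEN_S:
        # Switching now would violate minimum green: override the agent
        # and hold the current phase instead.
        return current_phase, False
    return requested_phase, True
```

Because the mask sits outside the learned policy, compliance holds regardless of what the network outputs, which is easier to audit than a penalty term in the reward.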
While we have reviewed many promising methods that have been developed for the interpretability and safety of RL-based TSC, more work is still needed to determine which of these methods correspond well to stakeholder requirements. Furthermore, there is a substantial literature on safe reinforcement learning using constrained optimization [107, 108, 109], which has hitherto not been applied to TSC; it is likely that such work can provide more rigorous theoretical guarantees about algorithm behaviour. We also believe that, to deal with safety failures ethically, work is needed on algorithmic accountability for RL-based signal controllers.

6. Heterogeneous road users

6.1. Significance of challenges
Traditional models of traffic flow used for TSC assume, simplistically, that all vehicles are identical [110, 111]. In reality, the assumption of identical or even unimodal traffic is often unrealistic, because many types of vehicles and road users — each with different needs and behavioural patterns — interact with each other on roads. RL algorithms can still implicitly encode these assumptions through simplistic state spaces, since common state variables such as queue length and vehicle position [15] do not account for inter-vehicle variation. Although such state formulations can be helpful for deriving optimality results based on traditional models in TSC [3, 54], it is unclear how these assumptions may impact the performance and safety of RL-based signal controllers in practice, especially because road users such as pedestrians and cyclists may behave non-intuitively. Dedicated simulators developed for RL-based TSC likewise abstract away inter-vehicle variation [112]. [10] found that, among 160 papers on RL-based TSC, only three accounted for non-private vehicle types, and only one accounted for pedestrians.

6.2. Lessons from deployments
In practice, agencies make a variety of adjustments to signalling plans to accommodate classes of road users other than regular passenger vehicles, including pedestrians, cyclists, transit vehicles, and emergency vehicles [14]. In this section, we focus on current practice in the field for pedestrians and transit/emergency vehicles. When balancing the needs of different road user classes in RL-based signal controllers, stakeholders' requirements should be taken into account; in the US, for instance, agencies' opinions differ on whether preemption for trains should take priority over pedestrians [89].

For pedestrians, the simplest option is for the pedestrian signal to be activated in the direction of the through movement, as is implicitly assumed by many works in RL and made explicit in some (e.g., [113]). However, doing so may cause pedestrians to impede the flow of left-turning and right-turning traffic, which creates safety hazards. In practice, leading pedestrian intervals (LPIs) mitigate this risk by allowing pedestrians to start crossing before cars are permitted to make turns [14]. Alternative phase sequence designs add lagging pedestrian intervals (after turning phases) or phases exclusively for pedestrians. [114] developed a benefit-cost model to assess the safety-delay tradeoffs of LPIs at individual intersections. Beyond safety, additional work has tried to minimize the delay of pedestrians so that they are treated equitably compared to drivers, as codified by regulations in Germany, the UK, and China [115].
For the deployed SURTRAC system, [116] adaptively set pedestrian walk intervals based on predicted phase lengths to avoid cutting them short, while [117] considered using vehicular volumes and pedestrian actuation frequencies to switch between controller modes. We are unaware of any work in RL that has explicitly included LPIs as part of the action space formulation.

As for handling transit and emergency vehicles, typical strategies include the prioritization and preemption of signals. Prioritization handles requests made by vehicles through vehicle-to-infrastructure (V2I) communications, and may or may not result in adjustments to signalling plans. Meanwhile, preemption (often used for firetrucks or trains) deterministically replaces the signal plan with a predefined routine that favours the preempting vehicle. Typically, signal controllers need multiple cycles after preemption to recover from the interruption [14]. The adaptive SCATS controller natively implements both prioritization and preemption; compared to prior practice, [118] found that SCATS' performance improvements were robust to prioritization, and [119] found that it could reduce recovery time from preemption. These results suggest the potential of implementing prioritization and preemption with RL-based methods; in particular, explicit modelling of recovery from preemption may further improve recovery times.

In addition to interactions at intersections, RL-based signal controllers should also consider the effects of transit and emergency vehicles on traffic between intersections. For instance, when buses are stopped on roads, they may block other traffic from passing. As initial steps towards implementing bus prioritization in the SURTRAC system, [120] delayed the allocation of green time at intersections located downstream from stopped buses, and [121] predicted bus dwelling times at stops by leveraging V2I communications.

6.3. Progress toward solutions
One paper in RL-based TSC was cited by [10] as explicitly modelling pedestrians: [101] defined the reward using the weighted average of the local intersection's vehicular queue length, neighbouring intersections' vehicular queue lengths, and the local intersection's pedestrian queue length. Beyond this paper, several other works have explicitly considered pedestrians as part of the problem formulation. [122] likewise addressed joint vehicle-pedestrian control at intersections, but made no assumptions about pedestrian detector capabilities. [123] used deep RL to control a signalized crosswalk across a road (with the actions being to set the pedestrian signal to green or red), and found that it outperformed actuation under moderate levels of pedestrian demand in simulations. [61] analyzed the performance of RL-based TSC in the presence of jaywalking pedestrians that cause vehicles to slow.
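For concreteness, the following is a minimal sketch of a pedestrian-inclusive reward in the spirit of [101]: a negative weighted combination of the local vehicular queue, the neighbouring intersections' vehicular queues, and the local pedestrian queue. The weight values are illustrative assumptions, not the values used in the cited work.

```python
# Illustrative weights; in practice these encode a policy decision about
# how pedestrian delay trades off against vehicular delay.
W_LOCAL_VEH, W_NEIGHBOUR_VEH, W_LOCAL_PED = 0.5, 0.2, 0.3

def pedestrian_aware_reward(local_veh_queue, neighbour_veh_queues,
                            local_ped_queue):
    """Negative weighted queue length; larger queues mean lower reward."""
    neighbour_avg = (sum(neighbour_veh_queues) / len(neighbour_veh_queues)
                     if neighbour_veh_queues else 0.0)
    return -(W_LOCAL_VEH * local_veh_queue
             + W_NEIGHBOUR_VEH * neighbour_avg
             + W_LOCAL_PED * local_ped_queue)
```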
Several works in RL-based TSC have also considered prioritization and preemption. For prioritization, [57] upweighted buses and emergency vehicles in their throughput-based reward formulation; [124] used a state representation based on the cell transmission traffic model and modelled priority as a binary variable; [125] adopted an implicit approach based on minimizing delay per person instead of per vehicle; [126] and [127] both considered prioritization for trams, with the former's rewards being based on tram schedule adherence and the latter using model predictive control to model driver behaviour; and [128] adaptively altered vehicles' priorities depending on queue length, waiting time, and emergency vehicle presence. For preemption, [129] learned TSC policies for emergency vehicle routing with rewards that encourage low vehicle density, and [130] used RL to learn policies for notifying connected vehicles to clear out lanes for emergency vehicles to pass.

Lastly, [100] included demand data from the field for multiple types of road users — including pedestrians, cyclists, motorcyclists, trucks, and buses — in their benchmark simulation for RL-based TSC, LemgoRL, which is based on a real road network; they also included pedestrian waiting times in rewards and enforced minimum pedestrian green times. There is a need to connect high-fidelity simulations such as LemgoRL to the various approaches for handling different road user classes that we outlined above, so as to ensure their ecological validity.

7. Conclusion
We have reviewed four barriers to the deployment of RL-based controllers for TSC. Each of these barriers has been insufficiently addressed by the majority of new work in RL-based TSC, which has focused on algorithmic contributions. However, TSC algorithms do not exist in a vacuum — they must be trained based on data from detectors, interface with signals through controllers, and control the movements of a variety of road users. Challenges both intrinsic to RL algorithms and in other pipeline components may cascade into failures with significant implications for the efficiency and safety of transportation infrastructure. Based on our literature review, we have suggested ways in which further work in RL-based TSC could address these challenges.

Echoing the recommendations of [11], we emphasize the importance of RL practitioners engaging in consultation with agency stakeholders and experts in TSC. This can break down information silos that would otherwise prevent the recognition of issues during requirements engineering and integration (cf. [131]); we could not have identified these challenges ourselves without engaging with the literature on traditional TSC. Additionally, as we discussed, the practicalities of these challenges — including the availability and configuration of detectors, signalling constraints, and the priorities of different road users — will often vary depending on the statuses of road networks and their responsible agencies. While benchmark simulations based on synthetic networks facilitate evaluation, we advocate for the creation of more simulations like [100] that incorporate realistic domain constraints. RL algorithms that are trained using such benchmarks would likely have better generalizability and robustness in deployments.

More generally, we uncovered a diversity of work that addresses each challenge, which previous reviews of TSC have not comprehensively surveyed. This suggests that RL-based TSC is closer to deployment than might be suggested by a review of state-of-the-art methods.
If future developments focus on combining algorithmic improvements with both real-world considerations and reproducibility techniques to facilitate collaboration [132], we believe that the integration of RL to improve real-world transportation infrastructure is within reach.

Acknowledgments
The authors thank Christian Kästner, Eunsuk Kang, Stephanie Milani, Peide Huang, Ryan Shi, and Steven Jecmen for useful information and suggestions that they provided to support the drafting of this review.

References
[1] D. Schrank, L. Albert, B. Eisele, T. Lomax, 2021 Urban Mobility Report, Technical Report, Texas A&M Transportation Institute, 2021.
[2] S. Chin, O. Franzese, D. Greene, H. Hwang, R. Gibson, Temporary losses of highway capacity and impacts on performance: Phase 2, Technical Report ORNL/TM-2004/209, Oak Ridge National Laboratory, 2004.
[3] G. Zheng, X. Zang, N. Xu, H. Wei, Z. Yu, V. Gayah, K. Xu, Z. Li, Diagnosing reinforcement learning for traffic signal control, arXiv preprint (2019). arXiv:1905.04716.
[4] M. Gregurić, M. Vujić, C. Alexopoulos, M. Miletić, Application of deep reinforcement learning in traffic signal control: An overview and impact of open traffic data, Applied Sciences 10 (2020) 4011.
[5] S. Smith, G. Barlow, X.-F. Xie, Z. Rubinstein, Smart urban signal networks: Initial application of the SURTRAC adaptive traffic signal control system, in: Proceedings of the 23rd International Conference on Automated Planning and Scheduling, ICAPS '13, 2013, pp. 434–442.
[6] C. Chen, H. Wei, N. Xu, G. Zheng, M. Yang, Y. Xiong, K. Xu, Z. Li, Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control, in: Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI '20, 2020, pp. 3414–3421.
[7] N. Brown, T. Sandholm, Superhuman AI for heads-up no-limit poker: Libratus beats top professionals, Science 359 (2017) 418–424.
[8] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, D. Silver, Grandmaster level in StarCraft II using multi-agent reinforcement learning, Nature 575 (2019) 350–354.
[9] Z. T. Qin, X. Tang, Y. Jiao, F. Zhang, Z. Xu, H. Zhu, J. Ye, Ride-hailing order dispatching at DiDi via reinforcement learning, INFORMS Journal on Applied Analytics 50 (2020) 272–286.
[10] M. Noaeen, A. Naik, L. Goodman, J. Crebo, T. Abrar, Z. S. H. Abad, A. L. Bazzan, B. Far, Reinforcement learning in urban network traffic signal control: A systematic literature review, Expert Systems with Applications 199 (2022) 116830.
[11] A. Perrault, F. Fang, A. Sinha, M. Tambe, AI for social impact: Learning and planning in the data-to-deployment pipeline, arXiv preprint (2019). arXiv:2001.00088.
[12] R. L. Gordon, W. Tighe, Traffic Control Systems Handbook, Federal Highway Administration, 2005.
[13] G. Zheng, Y. Xiong, X. Zang, J. Feng, H. Wei, H. Zhang, Y. Li, K. Xu, Z. Li, Learning phase competition for traffic signal control, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, 2019, pp. 1963–1972.
[14] P. Koonce, L. Rodegerdts, K. Lee, S. Quayle, S. Beaird, C. Braud, J. Bonneson, P. Tarnoff, T. Urbanik, Traffic Signal Timing Manual, Federal Highway Administration, 2008.
[15] H. Wei, G. Zheng, V. Gayah, Z. Li, A survey on traffic signal control methods, arXiv preprint (2019). arXiv:1904.08117.
[16] M. Eom, B.-I. Kim, The traffic signal control problem for intersections: a review, European Transport Research Review 12 (2020) 50.
[17] C. J. C. H. Watkins, P. Dayan, Q-learning, Machine Learning 8 (1992) 279–292.
[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533.
[19] Y. Li, Deep reinforcement learning, arXiv preprint (2018). arXiv:1810.06339.
[20] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. A. Sallab, S. Yogamani, P. Pérez, Deep reinforcement learning for autonomous driving: A survey, IEEE Transactions on Intelligent Transportation Systems (2021).
[21] M. Nazari, A. Oroojlooy, M. Takáč, L. V. Snyder, Reinforcement learning for solving the vehicle routing problem, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, 2018, pp. 9861–9871.
[22] R. S. Sutton, A. G. Barto, Early history of reinforcement learning, in: Reinforcement Learning: An Introduction, The MIT Press, 2018, pp. 11–17.
[23] J.-B. Mouret, K. Chatzilygeroudis, 20 years of reality gap: a few thoughts about simulators in evolutionary robotics, in: Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO '17, 2017, pp. 1121–1124.
[24] W. Zhao, J. P. Queralta, T. Westerlund, Sim-to-real transfer in deep reinforcement learning for robotics: a survey, in: Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence, SSCI '20, 2020, pp. 737–744.
[25] K. Dimitropoulos, I. Hatzilygeroudis, K. Chatzilygeroudis, A brief survey of Sim2Real methods for robot learning, in: Proceedings of the 2022 International Conference on Robotics in Alpe-Adria Danube Region, RAAD '22, 2022, pp. 133–140.
[26] M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, W. Zaremba, Learning dexterous in-hand manipulation, The International Journal of Robotics Research 39 (2020) 3–20.
[27] B. Abdulhai, L. Kattan, Reinforcement learning: Introduction to theory and potential for transport applications, Canadian Journal of Civil Engineering 30 (2003) 981–991.
[28] A. L. C. Bazzan, Opportunities for multiagent systems and multiagent reinforcement learning in traffic control, Autonomous Agents and Multi-Agent Systems 18 (2009) 342–375.
[29] A. L. C. Bazzan, F. Klügl, A review on agent-based technology for traffic and transportation, The Knowledge Engineering Review 29 (2013) 375–403.
[30] P. Mannion, J. Duggan, E. Howley, An experimental review of reinforcement learning algorithms for adaptive traffic signal control, in: Autonomic Road Transport Support Systems, Springer, 2016, pp. 47–66.
[31] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, P. Komisarczuk, A survey on reinforcement learning models and algorithms for traffic signal control, ACM Computing Surveys 50 (2017) 34.
[32] F. Rasheed, K.-L. A. Yau, R. M. Noor, C. Wu, Y.-C. Low, Deep reinforcement learning for traffic signal control: A review, IEEE Access 8 (2020) 208016–208044.
[33] H. Wei, G. Zheng, V. Gayah, Z. Li, Recent advances in reinforcement learning for traffic signal control: A survey of models and evaluation, ACM SIGKDD Explorations Newsletter 22 (2021) 12–18.
[34] A. Haydari, Y. Yilmaz, Deep reinforcement learning for intelligent transportation systems: A survey, IEEE Transactions on Intelligent Transportation Systems 23 (2022) 11–32.
[35] H. Wang, Y. Yuan, X. T. Yang, T. Zhao, Y. Liu, Deep Q learning-based traffic signal control algorithms: Model development and evaluation with field data, Journal of Intelligent Transportation Systems (2022).
[36] W. Genders, S. Razavi, Using a deep reinforcement learning agent for traffic signal control, arXiv preprint (2016). arXiv:1611.01142.
[37] E. van der Pol, F. A. Oliehoek, Coordinated deep reinforcement learners for traffic light control, in: Proceedings of the 30th Conference on Neural Information Processing Systems, NIPS '16, 2016, pp. 1–8.
[38] S. S. Mousavi, M. Schukat, E. Howley, Traffic light control using deep policy-gradient and value-function based reinforcement learning, IET Intelligent Transport Systems 11 (2017) 417–423.
[39] X. Liang, X. Du, G. Wang, Z. Han, A deep reinforcement learning network for traffic light cycle control, IEEE Transactions on Vehicular Technology 68 (2019) 1243–1253.
[40] D. Garg, M. Chli, G. Vogiatzis, Deep reinforcement learning for autonomous traffic light control, in: Proceedings of the 2018 3rd International Conference on Intelligent Transportation Engineering, ICITE '18, 2018, pp. 214–218.
[41] H. Wei, G. Zheng, H. Yao, Z. Li, IntelliLight: A reinforcement learning approach for intelligent traffic light control, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, 2018, pp. 2496–2505.
[42] M. T. J. Spaan, N. Vlassis, A point-based POMDP algorithm for robot planning, in: Proceedings of the 2004 IEEE International Conference on Robotics and Automation, ICRA '04, 2004, pp. 2399–2404.
[43] L. N. Alegre, A. L. Bazzan, B. C. da Silva, Quantifying the impact of non-stationarity in reinforcement learning-based traffic signal control, PeerJ Computer Science 7 (2021) e575.
[44] D. Sun, L. Dodoo, A. Rubio, H. K. Penumala, M. Pratt, S. Sunkari, Synthesis study of Texas signal control systems: technical report, Technical Report FHWA/TX-13/0-6670-1, Texas A&M Transportation Institute, 2012.
[45] S. Sunkari, A. Bibeka, N. Chaudhary, K. Balke, Impact of Traffic Signal Controller Settings on the Use of Advanced Detection Devices, Technical Report FHWA/TX-18/0-6934-R1, Texas A&M Transportation Institute, 2019.
[46] D. Gibson, M. K. P. Mills, D. R. Jr., Staying in the loop: The search for improved reliability of traffic sensing systems through smart test instruments, Public Roads 62 (1998).
[47] A. Rhodes, D. M. Bullock, J. R. Sturdevant, Z. T. Clark, Evaluation of Stop Bar Video Detection Accuracy at Signalized Intersections, Technical Report FHWA/IN/JTRP-2005/28, Joint Transportation Research Program, Indiana Department of Transportation and Purdue University, 2005.
[48] K. Lee, M. Laskin, A. Srinivas, P. Abbeel, SUNRISE: A simple unified framework for ensemble learning in deep reinforcement learning, in: Proceedings of the 38th International Conference on Machine Learning, ICML '21, 2021, pp. 6131–6141.
[49] S. M. A. B. A. Islam, M. Tajalli, R. Mohebifard, A. Hajbabaie, Effects of connectivity and traffic observability on an adaptive traffic signal control system, Transportation Research Record 2675 (2021) 800–814.
[50] A. M. T. Emtenan, C. M. Day, Impact of detector configuration on performance measurement and signal operations, Transportation Research Record 2674 (2020) 300–313.
[51] F. Luyanda, D. Gettman, L. Head, S. Shelby, D. Bullock, P. Mirchandani, ACS-Lite algorithmic architecture: Applying adaptive control system technology to closed-loop traffic signal control systems, Transportation Research Record 1856 (2003) 175–184.
[52] X.-F. Xie, G. J. Barlow, S. F. Smith, Z. B. Rubinstein, Accounting for Real-World Uncertainty in Real-Time Adaptive Traffic Control, Technical Report ATCSTR12, Carnegie Mellon University, 2012.
[53] C. Cai, B. Hengst, G. Ye, E. Huang, Y. Wang, C. Aydos, G. Geers, On the performance of adaptive traffic signal control, in: Proceedings of the Second International Workshop on Computational Transportation Science, ICWTS '09, 2009, pp. 37–42.
[54] H. Wei, C. Chen, G. Zheng, K. Wu, V. Gayah, K. Xu, Z. Li, PressLight: Learning max pressure control to coordinate traffic signals in arterial network, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, 2019, pp. 1290–1298.
[55] W. Genders, S. Razavi, Evaluating reinforcement learning state representations for adaptive traffic signal control, in: Proceedings of the 9th International Conference on Ambient Systems, Networks and Technologies, ANT '18, 2018, pp. 26–33.
[56] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, P. Abbeel, Domain randomization for transferring deep neural networks from simulation to the real world, in: Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS '17, 2017, pp. 23–30.
[57] D. Garg, M. Chli, G. Vogiatzis, Fully-autonomous, vision-based traffic signal control: from simulation to reality, in: Proceedings of the 21st International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '22, 2022, pp. 454–462.
[58] F. Rodrigues, C. L. Azevedo, Towards robust deep reinforcement learning for traffic signal control: Demand surges, incidents and sensor failures, in: Proceedings of the 2019 International Conference on Intelligent Transportation Systems, ITSC '19, 2019, pp. 3559–3566.
[59] K. L. Tan, A. Sharma, S. Sarkar, Robust deep reinforcement learning for traffic signal control, Journal of Big Data Analytics in Transportation 2 (2020) 263–274.
[60] C. Li, F. Yan, Y. Zhou, J. Wu, X. Wang, A regional traffic signal control strategy with deep reinforcement learning, in: Proceedings of the 37th Chinese Control Conference, CCC '18, 2018, pp. 7690–7695.
[61] M. Aslani, S. Seipel, M. S. Mesgari, M. Wiering, Traffic signal optimization through discrete and continuous reinforcement learning with robustness analysis in downtown Tehran, Advanced Engineering Informatics 38 (2018) 639–655.
[62] W. Li, M. Zhao, Y. Fu, K. Ruan, X. Di, CVLight: Decentralized learning for adaptive traffic signal control with connected vehicles, arXiv preprint (2021). arXiv:2104.10340.
[63] J. Ma, F. Wu, Feudal multi-agent deep reinforcement learning for traffic signal control, in: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '20, 2020, pp. 816–824.
[64] T. Chu, J. Wang, L. Codecà, Z. Li, Multi-agent deep reinforcement learning for large-scale traffic signal control, IEEE Transactions on Intelligent Transportation Systems 21 (2020) 1086–1095.
[65] M. Xu, J. Wu, L. Huang, R. Zhou, T. Wang, D. Hu, Network-wide traffic signal control based on the discovery of critical nodes and deep reinforcement learning, Journal of Intelligent Transportation Systems 24 (2020) 1–10.
[66] M. Wang, L. Wu, J. Li, L. He, Traffic signal control with reinforcement learning based on region-aware cooperative strategy, IEEE Transactions on Intelligent Transportation Systems (2021).
[67] Z. Zeng, GraphLight: Graph-based reinforcement learning for traffic signal control, in: Proceedings of the 6th International Conference on Computer and Communication Systems, ICCCS '21, 2021, pp. 645–650.
[68] P. Zhou, T. Braud, A. Alhilal, P. Hui, J. Kangasharju, ERL: Edge based reinforcement learning for optimized urban traffic light control, in: Proceedings of the 3rd International Workshop on Smart Edge Computing and Networking, SmartEdge '19, 2019, pp. 849–854.
[69] H. Ge, Y. Song, C. Wu, J. Ren, G. Tan, Cooperative deep Q-learning with Q-value transfer for multi-intersection signal control, IEEE Access 7 (2019) 40797–40809.
[70] H. Wei, N. Xu, H. Zhang, G. Zheng, X. Zang, C. Chen, W. Zhang, Y. Zhu, K. Xu, Z. Li, CoLight: Learning network-level cooperation for traffic signal control, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, 2019, pp. 1913–1922.
[71] T. Nishi, K. Otaki, K. Hayakawa, T. Yoshimura, Traffic signal control based on reinforcement learning with graph convolutional neural nets, in: Proceedings of the 2018 International Conference on Intelligent Transportation Systems, ITSC '18, 2018, pp. 877–883.
[72] H. Chen, J. J. Lu, Comparison of current practical adaptive traffic control systems, in: Proceedings of the 10th International Conference of Chinese Transportation Professionals, ICCTP '10, 2010, pp. 1611–1619.
[73] D. Gettman, S. G. Shelby, L. Head, D. M. Bullock, N. Soyke, Data-driven algorithms for real-time adaptive tuning of offsets in coordinated traffic signal systems, Transportation Research Record 2035 (2007) 1–9.
[74] Z. Huang, E. Leslie, A. Balse, Infrastructure Connectivity Certification Test Procedures for Infrastructure-Based Connected Automated Vehicle Components: Test Procedures, Signal Phase and Timing — NTCIP 1202 v03, Technical Report FHWA-JPO-20-802, Leidos, 2019.
[75] Y. Wang, S. Geng, Q. Li, Intelligent transportation control based on proactive complex event processing, in: Proceedings of the 3rd International Conference on Mechanics and Mechatronics Research, ICMMR '16, 2016, pp. 1–5.
[76] W. Liu, G. Qin, Y. He, F. Jiang, Distributed cooperative reinforcement learning-based traffic signal control that integrates V2X networks' dynamic clustering, IEEE Transactions on Vehicular Technology 66 (2017) 8667–8681.
[77] S. Yang, B. Yang, H.-S. Wong, Z. Kang, Cooperative traffic signal control using multi-step return and off-policy asynchronous advantage actor-critic graph algorithm, Knowledge-Based Systems 183 (2019) 104855.
[78] T. Chu, J. Wang, Traffic signal control by distributed reinforcement learning with min-sum communication, in: Proceedings of the 2017 American Control Conference, ACC '17, 2017, pp. 5095–5100.
Vlassis, Using the max-plus algorithm for multiagent decision making in coordination graphs, in: Proceedings of the Fourth Robot Soccer World Cup, RoboCup ’05, 2005, pp. 1–12. [80] D. Xie, Z. Wang, C. Chen, D. Dong, IEDQN: Information exchange DQN with a centralized coordinator for traffic signal control, in: Proceedings of the 2020 International Joint Conference on Neural Networks, IJCNN ’20, 2020, pp. 1–8. [81] Q. Jiang, M. Qin, S. Shi, W. Sun, B. Zheng, Multi-agent reinforcement learning for traffic signal control through universal communication method, arXiv preprint (2022). arXiv:2204.12190. [82] M. Abdoos, A. L. Bazzan, Hierarchical traffic signal optimization using reinforcement learning and traffic prediction with long-short term memory, Expert Systems with Applications 171 (2021) 114580. [83] B. Xu, Y. Wang, Z. Wang, H. Jia, Z. Lu, Hierarchically and cooperatively learning traffic signal control, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence, AAAI ’21, 2021, pp. 1–9. [84] L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, A. P. Schoellig, Safe learning in robotics: From learning-based control to safe reinforcement learning, Annual Review of Control, Robotics, and Autonomous Systems 5 (2022). [85] J. García, F. Fernández, A comprehensive survey on safe reinforcement learning, Journal of Machine Learning Research 16 (2015) 1437–1480. [86] C. Yu, J. Liu, S. Nemati, G. Yin, Reinforcement learning in healthcare: A survey, ACM Computing Surveys 55 (2023) 1–36. [87] F. R. Ward, I. Habli, An assurance case pattern for the interpretability of machine learning in safety-critical systems, in: Proceedings of the 2020 International Conference on Computer Safety, Reliability, and Security, SAFECOMP ’20, 2020, pp. 395–407. [88] DOT, Manual on Uniform Traffic Signal Control Devices, revision 2 ed., US Department of Transportation, 2012. [89] J. Bonneson, M. Pratt, K. Zimmerman, Development of a Traffic Signal Operations Hand- book, Technical Report FHWA/TX-09/0-5629-1, Texas A&M Transportation Institute, 2009. [90] M. K. Lee, D. Kusbit, A. Kahng, J. T. Kim, X. Yuan, A. Chan, D. See, R. Noothigattu, S. Lee, A. Psomas, A. D. Procaccia, WeBuildAI: Participatory framework for algorithmic governance, Proceedings of the ACM on Human-Computer Interaction 3 (2019) 1–35. [91] D. Lord, J. A. Bonneson, Role and application of accident modification factors within highway design process, Transportation Research Record 1961 (2006) 65–73. [92] L. Wu, D. Lord, Y. Zou, Validation of crash modification factors derived from cross- sectional studies with regression models, Transportation Research Record 2514 (2015) Rex Chen et al. CEUR Workshop Proceedings – 88–96. [93] J. Ma, M. D. Fontaine, F. Zhou, J. Hu, Estimation of crash modification factors for an adaptive traffic-signal control system, Journal of Transportation Engineering 142 (2016) 04016061. [94] X. Sun, Y. Li, D. Magri, H. H. Shirazi, Application of Highway Safety Manual draft chapter: Louisiana experience, Transportation Research Record 1950 (2006) 55–64. [95] C. Sun, H. Brown, P. Edara, B. Carlos, K. Nam, Calibration of the Highway Safety Manual for Missouri, Technical Report 25-1121-0003-177, Mid-America Transportation Center, 2013. [96] F. Xie, K. Gladhill, K. K. Dixon, C. M. Monsere, Calibration of Highway Safety Manual predictive models for Oregon state highways, Transportation Research Record 2241 (2011) 19–28. [97] J. Ault, J. P. Hanna, G. 
Sharon, Learning an interpretable traffic signal control policy, in: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’20, 2020, pp. 88–96. [98] V. Jayawardana, A. Landler, C. Wu, Mixed autonomous supervision in traffic signal con- trol, in: Proceedings of the 2021 International Conference on Intelligent Transportation Systems, ITSC ’21, 2021, pp. 1767–1773. [99] S. G. Rizzo, G. Vantini, S. Chawla, Reinforcement learning with explainability for traffic signal control, in: Proceedings of the 2019 International Conference on Intelligent Transportation Systems, ITSC ’19, 2019, pp. 3567–3572. [100] A. Müller, V. Rangras, G. Schnittker, M. Waldmann, M. Friesen, T. Ferfers, L. Schrecken- berg, F. Hufen, J. Jasperneite, M. Wiering, Towards real-world deployment of reinforce- ment learning for traffic signal control, in: Proceedings of the 20th IEEE International Conference on Machine Learning and Applications, ICMLA ’21, 2021, pp. 507–514. [101] Y. Liu, L. Liu, W.-P. Chen, Intelligent traffic light control using distributed multi-agent Q learning, in: Proceedings of the 2017 International Conference on Intelligent Transporta- tion Systems, ITSC ’17, 2017, pp. 1–8. [102] M. Essa, T. Sayed, Self-learning adaptive traffic signal control for real-time safety opti- mization, Accident Analysis & Prevention 146 (2020) 105713. [103] M. Essa, T. Sayed, Traffic conflict models to evaluate the safety of signalized intersections at the cycle level, Transportation Research Part C: Emerging Technologies 89 (2018) 289–302. [104] Y. Gong, M. Abdel-Aty, J. Yuan, Q. Cai, Multi-objective reinforcement learning approach for improving safety at intersections with adaptive traffic signal control, Accident Analysis & Prevention 144 (2020) 105655. [105] L. Liao, J. Liu, X. Wu, F. Zou, J. Pan, Q. Sun, S. E. Li, M. Zhang, Time difference penalized traffic signal timing by LSTM Q-network to balance safety and capacity at intersections, IEEE Access 8 (2020) 80086–80096. [106] B. Yu, J. Guo, Q. Zhao, J. Li, W. Rao, Smarter and safer traffic signal controlling via deep reinforcement learning, in: Proceedings of the 29th ACM International Conference on Information and Knowledge Management, CIKM ’20, 2020, pp. 3345–3348. [107] S. Bohez, A. Abdolmaleki, M. Neunert, J. Buchli, N. Heess, R. Hadsell, Value constrained model-free continuous control, arXiv preprint (2019). arXiv:1902.04623. Rex Chen et al. CEUR Workshop Proceedings – [108] D. Ding, K. Zhang, T. Basar, M. Jovanovic, Natural policy gradient primal-dual method for constrained Markov decision processes, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NeurIPS ’20, 2020, pp. 8378–8390. [109] Z. Liu, Z. Cen, V. Isenbaev, W. Liu, Z. S. Wu, B. Li, D. Zhao, Constrained variational policy optimization for safe reinforcement learning, in: Proceedings of the 39th International Conference on Machine Learning, ICML ’22, 2022, pp. 1–9. [110] D. Branston, H. van Zuylen, Comparison of queue-length models at signalized intersec- tions, Transportation Research 12 (1978) 47–53. [111] F. Viloria, K. Courage, D. Avery, Comparison of queue-length models at signalized intersections, Transportation Research Record 1710 (2000) 222–230. [112] H. Zhang, S. Feng, C. Liu, Y. Ding, Y. Zhu, Z. Zhou, W. Zhang, Y. Yu, H. Jin, Z. Li, CityFlow: A multi-agent reinforcement learning environment for large scale city traffic scenario, in: Proceedings of the 2019 World Wide Web Conference, WWW ’19, 2019, pp. 
3620–3624. [113] M. Guo, P. Wang, C.-Y. Chan, S. Askary, A reinforcement learning approach for intelligent traffic signal control at urban intersections, in: Proceedings of the 2019 International Conference on Intelligent Transportation Systems, ITSC ’19, 2019, pp. 4242–4247. [114] A. Sharma, E. Smaglik, S. Kothuri, O. Smith, P. Koonce, T. Huang, Leading pedestrian intervals: Treating the decision to implement as a marginal benefit–cost problem, Trans- portation Research Record 2620 (2017) 96–104. [115] K. Tang, M. Boltze, Z. Tian, H. Nakamura, Initial comparative analysis of international practice in road traffic signal control, in: Global Practices on Road Traffic Signal Control, Elsevier, 2019, pp. 285–310. [116] S. Smith, Surtrac for the People: Upgrading the Surtrac Pittsburgh Deployment to in- corporate Pedestrian Friendly Extensions and Remote Monitoring Advances, Technical Report 01730614, Mobility21, 2020. [117] S. Kothuri, A. Kading, E. Smaglik, C. Sobie, Improving Walkability Through Control Strategies at Signalized Intersections, Technical Report NITC-RR-782, National Institute for Transportation and Communities, 2017. [118] C. Slavin, W. Feng, M. Figliozzi, P. Koonce, Statistical study of the impact of adaptive traffic signal control on traffic and transit performance, Transportation Research Record 2356 (2016) 117–126. [119] J. Peters, P. O’Brien, J. Pachman, Memorandum: Farmington Road Adaptive Traffic Control Benefits Analysis, Technical Report, DKS Associates, 2011. [120] A. Mahendran, S. Smith, M. Hebert, X.-F. Xie, Bus Detection for Adaptive Traffic Signal Control, Technical Report, Carnegie Mellon University, 2014. [121] S. Smith, I. Isukapati, E. Bronstein, C. Igoe, Integrating transit signal priority with adaptive signal control in a connected vehicle environment: Phase 1 Final Report, Technical Report 01675986, Mobility21, 2018. [122] B. Yin, M. Menendez, A reinforcement learning method for traffic signal control at an iso- lated intersection with pedestrian flows, in: Proceedings of the 19th COTA International Conference of Transportation Professionals, CICTP ’19, 2019, pp. 3123–3135. [123] Y. Zhang, J. Fricker, Investigating smart traffic signal controllers at signalized crosswalks: A reinforcement learning approach, in: Proceedings of the 7th International Conference on Models and Technologies for Intelligent Transportation Systems, MT-ITS ’21, 2021, Rex Chen et al. CEUR Workshop Proceedings – pp. 1–6. [124] P. Chanloha, J. Chinrungrueng, W. Usaha, C. Aswakul, Cell transmission model-based multiagent Q-learning for network-scale signal control with transit priority, The Com- puter Journal 57 (2014) 451–468. [125] S. M. A. Shabestray, B. Abdulhai, Multimodal iNtelligent Deep (MiND) traffic signal con- troller, in: Proceedings of the 2019 International Conference on Intelligent Transportation Systems, ITSC ’19, 2019, pp. 4532–4539. [126] L. Zhang, S. Jiang, Z. Wang, Schedule-driven signal priority control for modern trams using reinforcement learning, in: Proceedings of the 17th COTA International Conference of Transportation Professionals, CICTP ’17, 2017, pp. 2122–2132. [127] G. Guo, Y. Wang, An integrated MPC and deep reinforcement learning approach to trams-priority active signal control, Control Engineering Practice 110 (2021) 104758. [128] N. Kumar, S. S. Rahman, N. Dhakad, An integrated MPC and deep reinforcement learn- ing approach to trams-priority active signal control, IEEE Transactions on Intelligent Transportation Systems 22 (2021) 4919–4928. [129] H. 
Su, Y. D. Zhong, B. Dey, A. Chakraborty, EMVLight: A decentralized reinforcement learning framework for efficient passage of emergency vehicles, in: Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI ’22, 2022, pp. 1–11. [130] H. Su, K. Shi, J. Chow, L. Jin, Dynamic queue-jump lane for emergency vehicles under partially connected settings: A multi-agent deep reinforcement learning approach, arXiv preprint (2021). arXiv:2003.01025. [131] N. Nahar, S. Zhou, G. Lewis, C. Kästner, Collaboration challenges in building ML-enabled systems: Communication, documentation, engineering, and process, in: Proceedings of the 44th International Conference on Software Engineering, ICSE ’22, 2022, pp. 1–22. [132] J. Pineau, P. Vincent-Lamarre, K. Sinha, V. Larivière, A. Beygelzimer, F. d’Alché-Buc, E. Fox, H. Larochelle, Improving reproducibility in machine learning research (a report from the NeurIPS 2019 Reproducibility Program), Journal of Machine Learning Research 22 (2021) 1–20.